Configuring Amazon S3 as a Spark Data Source
by Brian Uri!, 2016-03-05
This recipe provides the steps needed to securely expose data in Amazon Simple Storage Service (S3) for consumption by a Spark application.
The resultant configuration works with the s3a protocol.
- You need an
and a general understanding of how AWS billing works.
- You need an EC2 instance with the AWS command line tools installed, so you can test the connection. The instance created in either Tutorial #2: Installing Spark on Amazon EC2 or
Managing a Spark Cluster with the spark-ec2 Script is sufficient.
- This recipe is independent of any specific version of Spark or Hadoop. All work is done through Amazon Web Services (AWS).
⇖ Introducing Amazon S3
Amazon S3 is a key-value object store that can be used as a data source to your Spark cluster. You can store unlimited data in S3
although there is a 5 TB maximum on individual files. Data is organized into S3 buckets with various options for access control and versioning.
The monthly cost is based upon the number of API calls your application makes and the amount of space
your data takes up ($0.023 per GB per month, as of March 2020). Transfer of data between S3 and an EC2 instance is free.
There are no S3 libraries in the core Apache Spark project. Spark uses libraries from Hadoop to connect to S3, and the integration between Spark, Hadoop, and the AWS services
can feel a little finicky. Your success in getting things working is very dependent on specific versions of the various libraries.
Because of this, the Spark side is covered in a separate recipe (Configuring Spark to Use Amazon S3) and this recipe focuses solely on the S3 side.
- By using S3 as a data source, you lose the ability to position your data as closely as possible to your cluster (data locality). A common pattern
to work around this is to load data from S3 into a local HDFS store in the cluster, and then operate upon it from there.
- You can mimic the behavior of subdirectories on a filesystem by using object keys resembling file paths, such as s3a://myBucket/2016/03/03/output.log,
but S3 is not a true hierarchical filesystem. You cannot assume that a wildcard in a key will be processed as it would on a real filesystem. In particular, if you use
Spark Streaming with S3, nested directories are not supported. If you want to grab all files in all subdirectories, you'll need to do some extra coding on your side to resolve
the subdirectories first so you can send explicit path requests to S3.
- S3 provides Read-After-Write consistency for new objects and Eventual consistency for object updates and deletions. If your data changes, you cannot guarantee that every worker
node requesting the data will see the authoritative newest version right away -- all requesters will eventually see the changes.
Creating a Bucket
If you don't already have an S3 bucket created, you can create one for the purposes of this recipe.
- Login to your AWS Management Console and select the
S3 service. The S3 dashboard provides a filesystem-like view of the object
store (with forward slashes treated like directory separators) for ease of understanding and browsing.
- Select Create Bucket. Set the Bucket name to a unique value
like sparkour-data. Bucket names must be unique across all of S3, so it's a good idea to assign a unique hyphenated
prefix to your bucket names.
- Set the Region to the same region as your Spark cluster. In my case, I selected US East (N. Virginia).
Finally, select Create. You should see the new bucket in the list.
- Select the bucket name in the list to browse inside of it. Select Upload, then Add files
to upload a simple text file. Once you have picked a file from your local machine, select Upload. We try to
download this file later on to test our connection.
- Exit this dashboard and return to the list of AWS service offerings by selecting the "AWS" icon in the upper left corner.
⇖ Configuring Access Control
The way you secure your bucket depends on the communication protocol you intend to use in your Spark application. We skip over the Access Control steps required
for two older protocols:
- The s3 protocol is supported in Hadoop, but does not work with Apache Spark unless you are using the AWS version of Spark in Elastic MapReduce (EMR).
- The s3n protocol is Hadoop's older protocol for connecting to S3. This deprecated protocol has major limitations, including a brittle security approach
that requires the use of AWS secret API keys to run.
We focus on the s3a protocol, which is the most modern protocol available. Implemented directly on top of AWS APIs, s3a is scalable, handles files up to 5 TB in size, and supports authentication with
Identity and Access Management (IAM) Roles. With IAM Roles, you assign an IAM Role to your worker nodes and then attach policies granting access to your S3 bucket. No secret keys are
involved, and the risk of accidentally disseminating keys or committing them in version control is reduced.
s3a support was introduced in Hadoop 2.6.0, but several important issues were corrected in Hadoop 2.7.0 and Hadoop 2.8.0.
You should consider 2.7.2 to be the minimum required version of Hadoop to use this protocol.
Configuring Your Bucket for s3a
To support the role-based style of authentication, we create a policy that can be attached to an IAM Role on your worker nodes.
- From the AWS Management Console and select the
Identity & Access Management service.
- Navigate to Policies in the left side menu, and then select
Create policy at the top of the page.
This starts a wizard workflow to create a new policy.
- On Step 1. Create policy, select the JSON tab. (You can use one of the other wizard options
if the sample policy below is insufficient for your needs).
- In the editing area, paste in the following policy, altering the sparkour-data to match your bucket name. Then, select Review policy
- On Step 2. Review Policy, set the Name to a memorable value. The example policy grants read/write permissions to the bucket, so we call it
sparkour-data-S3-RW. Set the Description to Grant read/write access to the sparkour-data bucket.
- Select Create Policy. You return to the Policies dashboard and should be able to find your policy in the list
(You may need to filter out the Amazon-managed policies).
- Now, let's attach the policy to an IAM Role. If you have created an EC2 instance using one of the recipes listed in the Prerequisites of this recipe, it should
have an IAM Role assigned to it that we can use.
- Navigate to Roles in the left side menu, and then select
the name of the Role in the table (selecting the checkbox to the left of the name is insufficient).
- On the Summary page that appears, select Attach policies in the Permissions tab.
Select the policy you just created. You may need to filter the table if it's lost among the many Amazon-managed policies.
Select Attach policy. You return the Summary page and should see the policy attached.
- To test the result of this policy, SSH into an EC2 instance and try to download the file you placed in your bucket earlier.
- If this copy worked correctly, the permissions are set up properly, and you are now ready to configure Spark to work with s3a.
⇖ Next Steps
This recipe configures just the AWS side of the S3 equation. Once your bucket is set up correctly and can be accessed from the EC2 instance, the recipe,
Configuring Spark to Use Amazon S3, will help you configure Spark itself.
Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically?
If so, please help me improve Sparkour by opening a ticket on the Issues page.
You can also discuss this recipe with others in the Sparkour community on the Discussion page.