Configuring Amazon S3 as a Spark Data Source

by Brian Uri!, 2016-03-05

Synopsis

This recipe provides the steps needed to securely expose data in Amazon Simple Storage Service (S3) for consumption by a Spark application. The resultant configuration works with the s3a protocol.

Prerequisites

You need an AWS account and a general understanding of how AWS billing works.
You need an EC2 instance with the AWS command line tools installed, so you can test the connection. The instance created in either Tutorial #2: Installing Spark on Amazon EC2 or Managing a Spark Cluster with the spark-ec2 Script is sufficient.

Target Versions

This recipe is independent of any specific version of Spark or Hadoop. All work is done through Amazon Web Services (AWS).

⇖ Introducing Amazon S3

Amazon S3 is a key-value object store that can be used as a data source to your Spark cluster. You can store unlimited data in S3 although there is a 5 TB maximum on individual files. Data is organized into S3 buckets with various options for access control and versioning. The monthly cost is based upon the number of API calls your application makes and the amount of space your data takes up ($0.023 per GB per month, as of March 2020). Transfer of data between S3 and an EC2 instance is free.

There are no S3 libraries in the core Apache Spark project. Spark uses libraries from Hadoop to connect to S3, and the integration between Spark, Hadoop, and the AWS services can feel a little finicky. Your success in getting things working is very dependent on specific versions of the various libraries. Because of this, the Spark side is covered in a separate recipe (Configuring Spark to Use Amazon S3) and this recipe focuses solely on the S3 side.

Important Limitations

By using S3 as a data source, you lose the ability to position your data as closely as possible to your cluster (data locality). A common pattern to work around this is to load data from S3 into a local HDFS store in the cluster, and then operate upon it from there.
You can mimic the behavior of subdirectories on a filesystem by using object keys resembling file paths, such as s3a://myBucket/2016/03/03/output.log, but S3 is not a true hierarchical filesystem. You cannot assume that a wildcard in a key will be processed as it would on a real filesystem. In particular, if you use Spark Streaming with S3, nested directories are not supported. If you want to grab all files in all subdirectories, you'll need to do some extra coding on your side to resolve the subdirectories first so you can send explicit path requests to S3.
S3 provides Read-After-Write consistency for new objects and Eventual consistency for object updates and deletions. If your data changes, you cannot guarantee that every worker node requesting the data will see the authoritative newest version right away -- all requesters will eventually see the changes.

Creating a Bucket

If you don't already have an S3 bucket created, you can create one for the purposes of this recipe.

Login to your AWS Management Console and select the S3 service. The S3 dashboard provides a filesystem-like view of the object store (with forward slashes treated like directory separators) for ease of understanding and browsing.
Select Create Bucket. Set the Bucket name to a unique value like sparkour-data. Bucket names must be unique across all of S3, so it's a good idea to assign a unique hyphenated prefix to your bucket names.
Set the Region to the same region as your Spark cluster. In my case, I selected US East (N. Virginia). Finally, select Create. You should see the new bucket in the list.
Select the bucket name in the list to browse inside of it. Select Upload, then Add files to upload a simple text file. Once you have picked a file from your local machine, select Upload. We try to download this file later on to test our connection.
Exit this dashboard and return to the list of AWS service offerings by selecting the "AWS" icon in the upper left corner.

⇖ Configuring Access Control

The way you secure your bucket depends on the communication protocol you intend to use in your Spark application. We skip over the Access Control steps required for two older protocols:

The s3 protocol is supported in Hadoop, but does not work with Apache Spark unless you are using the AWS version of Spark in Elastic MapReduce (EMR).
The s3n protocol is Hadoop's older protocol for connecting to S3. This deprecated protocol has major limitations, including a brittle security approach that requires the use of AWS secret API keys to run.

We focus on the s3a protocol, which is the most modern protocol available. Implemented directly on top of AWS APIs, s3a is scalable, handles files up to 5 TB in size, and supports authentication with Identity and Access Management (IAM) Roles. With IAM Roles, you assign an IAM Role to your worker nodes and then attach policies granting access to your S3 bucket. No secret keys are involved, and the risk of accidentally disseminating keys or committing them in version control is reduced.

s3a support was introduced in Hadoop 2.6.0, but several important issues were corrected in Hadoop 2.7.0 and Hadoop 2.8.0. You should consider 2.7.2 to be the minimum required version of Hadoop to use this protocol.

Configuring Your Bucket for s3a

To support the role-based style of authentication, we create a policy that can be attached to an IAM Role on your worker nodes.

From the AWS Management Console and select the Identity & Access Management service.
Navigate to Policies in the left side menu, and then select Create policy at the top of the page. This starts a wizard workflow to create a new policy.
On Step 1. Create policy, select the JSON tab. (You can use one of the other wizard options if the sample policy below is insufficient for your needs).
In the editing area, paste in the following policy, altering the sparkour-data to match your bucket name. Then, select Review policy

On Step 2. Review Policy, set the Name to a memorable value. The example policy grants read/write permissions to the bucket, so we call it sparkour-data-S3-RW. Set the Description to Grant read/write access to the sparkour-data bucket.
Select Create Policy. You return to the Policies dashboard and should be able to find your policy in the list (You may need to filter out the Amazon-managed policies).
Now, let's attach the policy to an IAM Role. If you have created an EC2 instance using one of the recipes listed in the Prerequisites of this recipe, it should have an IAM Role assigned to it that we can use.
Navigate to Roles in the left side menu, and then select the name of the Role in the table (selecting the checkbox to the left of the name is insufficient).
On the Summary page that appears, select Attach policies in the Permissions tab. Select the policy you just created. You may need to filter the table if it's lost among the many Amazon-managed policies. Select Attach policy. You return the Summary page and should see the policy attached.
To test the result of this policy, SSH into an EC2 instance and try to download the file you placed in your bucket earlier.

If this copy worked correctly, the permissions are set up properly, and you are now ready to configure Spark to work with s3a.

⇖ Next Steps

This recipe configures just the AWS side of the S3 equation. Once your bucket is set up correctly and can be accessed from the EC2 instance, the recipe, Configuring Spark to Use Amazon S3, will help you configure Spark itself.

Reference Links

Change Log

2019-10-20: Updated to focus solely on s3a, now that s3n is fully deprecated (SPARKOUR-34).

Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.

back to Recipes

Apache, Spark, and Apache Spark are trademarks of the Apache Software Foundation (ASF).
Sparkour is © 2016 - 2024 by It is an independent project that is not endorsed or supported by Accenture Federal Services or the ASF.

visitors since February 2016