Sparkour

Configuring Amazon S3 as a Spark Data Source

by Brian Uri!, 2016-03-05

Synopsis

This recipe provides the steps needed to securely expose data in Amazon Simple Storage Service (S3) for consumption by a Spark application. The resultant configuration works with both supported S3 protocols in Spark: the classic s3n protocol and the newer, but still maturing, s3a protocol.

Prerequisites

  1. You need an AWS account and a general understanding of how AWS billing works.
  2. You need an EC2 instance with the AWS command line tools installed, so you can test the connection. The instance created in either Tutorial #2: Installing Spark on Amazon EC2 or Managing a Spark Cluster with the spark-ec2 Script is sufficient.

Target Versions

  1. This recipe is independent of any specific version of Spark or Hadoop. All work is done through Amazon Web Services (AWS).

Introducing Amazon S3

Amazon S3 is a key-value object store that can be used as a data source to your Spark cluster. You can store unlimited data in S3 although there is a 5 TB maximum on individual files. Data is organized into S3 buckets with various options for access control and versioning. The monthly cost is based upon the number of API calls your application makes and the amount of space your data takes up ($0.03 per GB per month, as of May 2017). Transfer of data between S3 and an EC2 instance is free.

There are no S3 libraries in the core Apache Spark project. Spark uses libraries from Hadoop to connect to S3, and the integration between Spark, Hadoop, and the AWS services is very much a work in progress. Your success in getting things working is very dependent on specific versions of the various libraries, the protocol you use, and possibly even the weather forecast. Because of this, the Spark side is covered in a separate recipe (Configuring Spark to Use Amazon S3) and this recipe focuses solely on the S3 side.

Important Limitations

  • By using S3 as a data source, you lose the ability to position your data as closely as possible to your cluster (data locality). A common pattern to work around this is to load data from S3 into a local HDFS store in the cluster, and then operate upon it from there.
  • You can mimic the behavior of subdirectories on a filesystem by using object keys resembling file paths, such as s3n://myBucket/2016/03/03/output.log, but S3 is not a true hierarchical filesystem. You cannot assume that a wildcard in a key will be processed as it would on a real filesystem. In particular, if you use Spark Streaming with S3, nested directories are not supported. If you want to grab all files in all subdirectories, you'll need to do some extra coding on your side to resolve the subdirectories first so you can send explicit path requests to S3 (one way to enumerate the keys under a prefix is sketched after this list).
  • S3 provides Read-After-Write consistency for new objects and Eventual consistency for object updates and deletions. If your data changes, you cannot guarantee that every worker node requesting the data will see the authoritative newest version right away -- all requesters will eventually see the changes.
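
For example, one way to resolve the subdirectories yourself is to enumerate the keys under a prefix with the AWS CLI and then pass explicit paths to Spark. This is only a sketch; myBucket and the 2016/ prefix are placeholder values.

# List every key under the 2016/ prefix, recursing through the "subdirectories".
aws s3 ls s3://myBucket/2016/ --recursive

# Alternatively, ask the S3 API directly for the keys sharing a prefix.
aws s3api list-objects --bucket myBucket --prefix 2016/ --query 'Contents[].Key'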

Creating a Bucket

If you don't already have an S3 bucket created, you can create one for the purposes of this recipe. (If you prefer the command line, a rough AWS CLI equivalent of these steps is sketched after the list.)

  1. Log in to your AWS Management Console and select the S3 service. The S3 dashboard provides a filesystem-like view of the object store (with forward slashes treated like directory separators) for ease of understanding and browsing.
  2. Select Create Bucket. Set the Bucket Name to a unique value like sparkour-data. Bucket names must be unique across all of S3, so it's a good idea to assign a unique hyphenated prefix to your bucket names.
  3. Set the Region to US Standard, a US region spanning facilities on both coasts (it corresponds to us-east-1). If your cluster runs in a different AWS Region, create the bucket in that Region instead so your data stays close to your cluster. Finally, select Create. You should see the new bucket in the list.
  4. Select the bucket name in the list to browse inside of it. Select Upload, then Add Files to upload a simple text file. Once you have picked a file from your local machine, select Start Upload. We will download this file later on to test our connection.
  5. Exit this dashboard and return to the list of AWS service offerings by selecting the "cube" icon in the upper left corner.
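
If you prefer to work from the command line, the AWS CLI on your EC2 instance can approximate these steps. This is a rough sketch that reuses the sparkour-data bucket name from above and a hypothetical local file named test_file.txt; substitute your own values.

# Create the bucket. US Standard corresponds to the us-east-1 region;
# pick the region where your cluster runs.
aws s3 mb s3://sparkour-data --region us-east-1

# Upload a simple text file to download again later as a connection test.
aws s3 cp test_file.txt s3://sparkour-data/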

Configuring Access Control

Access control can be handled as a policy attached to the bucket itself, a policy attached to the entity interacting with the bucket, or some combination of both (with the final decision based on the principle of least-privilege). The way you secure your bucket depends on the communication protocol you intend to use in your Spark application.

  1. The s3 protocol is supported in Hadoop, but does not work with Apache Spark unless you are using the AWS version of Spark in Elastic MapReduce (EMR). We can safely ignore this protocol for now.
  2. The s3n protocol is Hadoop's older protocol for connecting to S3. Implemented with a third-party library (JetS3t), it provides rudimentary support for files up to 5 GB in size and authenticates with AWS secret access keys. This "shared secret" approach is brittle and no longer the preferred best practice within AWS. It also conflates the concepts of users and roles, because the worker node communicating with S3 presents itself as the person tied to the access keys.
  3. The s3a protocol is the successor to s3n, but is not yet mature. Implemented directly on top of the AWS APIs, it is faster, handles files up to 5 TB in size, and supports authentication with Identity and Access Management (IAM) Roles. With IAM Roles, you assign an IAM Role to your worker nodes and then attach policies granting access to your S3 bucket. No secret keys are involved, and the risk of accidentally disseminating keys or committing them to version control is reduced. s3a support was introduced in Hadoop 2.6.0, but several important issues were corrected in versions 2.7.0 and 2.8.0. You should consider 2.8.0 to be the minimum required version of Hadoop to use this protocol.

As you can see, the protocol you select is a trade-off between maturity, security, and performance. This decision drives your approach for bucket access control.
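
As a quick illustration of why IAM Roles avoid shared secrets: an EC2 instance with a Role attached automatically receives short-lived credentials from the instance metadata service, so no keys need to be stored on disk or copied around. The commands below are only a sketch to run from such an instance; the Role name in the output depends on your own setup.

# Ask the instance metadata service which IAM Role (if any) is attached to this instance.
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/

# Fetch the temporary credentials for that Role (replace my-role-name with the name returned above).
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role-name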

Configuring Your Bucket for s3n

To support the "shared secret" style of authentication, we add permissions directly to the bucket for the owner of the access keys used in the application.

  1. From the AWS Management Console, select the S3 service.
  2. Select the "magnifying glass" icon next to your bucket name in the list to bring up the Properties tab. Expand the Permissions divider, which is collapsed by default.
  3. As the owner of the bucket, you should see your own permissions. Select Add bucket policy and a policy editing window appears. Paste in a bucket policy that grants a specific user read/write access to anything in the bucket. You need to modify the aws-account-id, username, and bucket-name to match your environment. If you don't know the user's Principal string, you can look up the details in the Identity & Access Management dashboard (select the user and it appears on their Summary page).
  4. An example of a populated policy is sketched after these steps.
  5. Select Save. If there are no syntax errors, the policy is attached to the bucket.
  6. To test the result of this policy, SSH into an EC2 instance and try to download the file you placed in your bucket earlier (an example command appears in the sketch after these steps).
  7. If this copy worked correctly, the permissions are set up properly, and you are now ready to configure Spark to work with s3n.
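
Here is a sketch of the pieces referenced above: a populated bucket policy and a quick download test. The aws-account-id, username, bucket name, and test_file.txt values are placeholders for your own environment, and the policy is an illustrative reconstruction rather than the only valid form. The download test assumes the access keys for that user are configured on the EC2 instance (for example, with aws configure).

# Write a bucket policy granting one IAM user read/write access to everything in the bucket.
# Replace aws-account-id, username, and sparkour-data with your own values.
cat > bucket-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::aws-account-id:user/username" },
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::sparkour-data",
        "arn:aws:s3:::sparkour-data/*"
      ]
    }
  ]
}
EOF

# Paste the file's contents into the console's policy editor, or attach it from the CLI instead.
aws s3api put-bucket-policy --bucket sparkour-data --policy file://bucket-policy.json

# Test from the EC2 instance: download the file uploaded earlier.
aws s3 cp s3://sparkour-data/test_file.txt .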

Configuring Your Bucket for s3a

To support the role-based style of authentication, we create a policy that can be attached to an IAM Role on your worker nodes.

  1. From the AWS Management Console, select the Identity & Access Management service.
  2. Navigate to Policies in the left side menu, and then select Create Policy at the top of the page. This starts a wizard workflow to create a new policy.
  3. On Step 1. Create Policy, select Create Your Own Policy. (You can use one of the other wizard options if the sample policy below is insufficient for your needs).
  4. Step 2 is skipped based on your previous selection. On Step 3. Review Policy, set the Policy Name to a memorable value. The example policy grants read/write permissions to the bucket, so we call it sparkour-data-S3-RW. Set the Description to Grant read/write access to the sparkour-data bucket.
  5. In the Policy Document editing area, paste in a policy granting read/write access to the bucket, altering sparkour-data to match your bucket name (an example is sketched after these steps).
  6. Select Validate Policy to check for syntax errors, and then select Create Policy. You return to the Policies dashboard and should be able to find your policy in the list (You may need to filter out the Amazon-managed policies).
  7. Now, let's attach the policy to an IAM Role. If you have created an EC2 instance using one of the recipes listed in the Prerequisites of this recipe, it should have an IAM Role assigned to it that we can use.
  8. Navigate to Roles in the left side menu, and then select the name of the Role in the table (selecting the checkbox to the left of the name is insufficient).
  9. On the Summary page that appears, select Attach Policy in the Permissions tab. Select the policy you just created. You may need to filter the table if it's lost among the many Amazon-managed policies. Select Attach Policy. You return to the Summary page and should see the policy attached.
  10. To test the result of this policy, SSH into an EC2 instance and try to download the file you placed in your bucket earlier (an example command appears in the sketch after these steps).
  11. If this copy worked correctly, the permissions are set up properly, and you are now ready to configure Spark to work with s3a.
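
For reference, here is a sketch of the same setup done entirely with the AWS CLI. The policy document is an illustrative read/write policy, and sparkour-data, sparkour-data-S3-RW, my-spark-role, aws-account-id, and test_file.txt are placeholders for your own values.

# Write a policy granting read/write access to the sparkour-data bucket.
cat > sparkour-data-rw.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::sparkour-data"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::sparkour-data/*"]
    }
  ]
}
EOF

# Create the managed policy and note the policy ARN in the command's output.
aws iam create-policy --policy-name sparkour-data-S3-RW \
    --policy-document file://sparkour-data-rw.json

# Attach the policy to the IAM Role assigned to your EC2 instances.
aws iam attach-role-policy --role-name my-spark-role \
    --policy-arn arn:aws:iam::aws-account-id:policy/sparkour-data-S3-RW

# Test from an EC2 instance with that Role: download the file uploaded earlier.
aws s3 cp s3://sparkour-data/test_file.txt .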

Configuring Your Bucket for Both Protocols

It is safe to apply both the bucket access policy and the IAM Role access policy if you plan on using both protocols. The access control decision is based on the union of the two policies.

Next Steps

This recipe configures just the AWS side of the S3 equation. Once your bucket is set up correctly and can be accessed from the EC2 instance, the recipe, Configuring Spark to Use Amazon S3, will help you configure Spark itself.

Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.

Apache, Spark, and Apache Spark are trademarks of the Apache Software Foundation (ASF).
Sparkour is © 2016 - 2017 by Brian Uri!. It is an independent project that is not endorsed or supported by Novetta or the ASF.