Sparkour

Configuring Spark to Use Amazon S3

by Brian Uri!, 2016-03-06

Synopsis

This recipe provides the steps needed to securely connect an Apache Spark cluster running on Amazon Elastic Compute Cloud (EC2) to data stored in Amazon Simple Storage Service (S3). It contains instructions for both the classic s3n protocol and the newer, but still maturing, s3a protocol. Coordinating the versions of the various required libraries is the most difficult part -- writing application code for S3 is very straightforward.

Prerequisites

  1. You need a working Spark cluster, as described in Managing a Spark Cluster with the spark-ec2 Script.
    • For the s3n protocol, the cluster must be configured with the --copy-aws-credentials parameter.
    • For the s3a protocol, the cluster must be configured with an Identity & Access Management (IAM) Role via --instance-profile-name. (A sample launch command showing both parameters is sketched after this list.)
  2. You need an access-controlled S3 bucket available for Spark consumption, as described in Configuring Amazon S3 as a Spark Data Source.
    • For the s3n protocol, the bucket must have a bucket policy granting access to the owner of the access keys.
    • For the s3a protocol, the IAM Role of the cluster instances must have a policy granting access to the bucket.
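
For reference, here is a rough sketch of what such a spark-ec2 launch command might look like. The key pair, identity file, IAM Role name, and cluster name are all placeholders, and you would normally pass only the parameter that matches your chosen protocol.

    # Hypothetical launch command showing where the two parameters fit.
    ./spark-ec2 --key-pair=my-key-pair --identity-file=/path/to/my-key-pair.pem \
        --copy-aws-credentials --instance-profile-name=sparkour-s3-role \
        launch my-spark-cluster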

Target Versions

  1. Spark depends on Apache Hadoop and Amazon Web Services (AWS) for libraries that communicate with Amazon S3. As such, any version of Spark should work with this recipe.
  2. Apache Hadoop started supporting the s3n protocol in version 0.18.0, so any recent version should suffice. The s3a protocol was introduced in version 2.6.0, but is still maturing. Several important issues were corrected in 2.7.0 and 2.8.0. You should consider 2.8.0 to be the minimum recommended version.

S3 Support in Spark

There are no S3 libraries in the core Apache Spark project. Spark uses libraries from Hadoop to connect to S3, and the integration between Spark, Hadoop, and the AWS services is very much a work in progress. Hadoop offers 3 protocols for working with Amazon S3's REST API, and the protocol you select for your application is a trade-off between maturity, security, and performance.

  1. The s3 protocol is supported in Hadoop, but does not work with Apache Spark unless you are using the AWS version of Spark in Elastic MapReduce (EMR). We can safely ignore this protocol for now.
  2. The s3n protocol is Hadoop's older protocol for connecting to S3. Implemented with a third-party library (JetS3t), it provides rudimentary support for files up to 5 GB in size and uses AWS secret API keys to run. This "shared secret" approach is brittle, and no longer the preferred best practice within AWS. It also conflates the concepts of users and roles, as the worker node communicating with S3 presents itself as the person tied to the access keys.
  3. The s3a protocol is the successor to s3n but is not yet mature. Implemented directly on top of the AWS APIs, it is faster, handles files up to 5 TB in size, and supports authentication with Identity and Access Management (IAM) Roles. With IAM Roles, you assign a Role to your worker nodes and then attach policies granting access to your S3 bucket. If you want to use this protocol with an older version of Hadoop, you might find success by downloading specific JAR files from a 2.7.x distribution, as described in this recipe. For example, the Spark cluster created with the spark-ec2 script only supports Hadoop 2.4 right now, so if you built your cluster with that script, additional JAR files are necessary.

S3 can be incorporated into your Spark application wherever a string-based file path is accepted in the code. An example of each protocol (using the bucket name, sparkour-data) is shown below.
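
For instance, this minimal PySpark sketch references the same (hypothetical) file through each protocol; the file name is a placeholder for whatever you have stored in the bucket.

    # The same object in the sparkour-data bucket, addressed through each protocol.
    data_s3n = sc.textFile("s3n://sparkour-data/random_numbers.txt")
    data_s3a = sc.textFile("s3a://sparkour-data/random_numbers.txt")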

Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source.

Advanced Configuration

The Hadoop libraries expose additional configuration properties for more fine-grained control of S3. To maximize your security, you should not use any of the Authentication properties that require you to write secret keys to a properties file.
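
As a rough illustration, the snippet below sets two of the s3a tuning properties from PySpark by reaching into the underlying Hadoop configuration. The property names come from the hadoop-aws module and the values are arbitrary examples; consult the configuration reference for the options relevant to your workload. The same properties can also be set without code by prefixing them with spark.hadoop. in spark-defaults.conf or on the --conf parameter.

    # Sketch: adjusting s3a behavior through the Hadoop configuration object.
    # Property names are from the hadoop-aws module; values are examples only.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.connection.maximum", "100")       # max simultaneous connections to S3
    hadoop_conf.set("fs.s3a.connection.ssl.enabled", "true")  # talk to S3 over HTTPS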

Testing the Protocols

The simplest way to confirm that your Spark cluster is handling S3 protocols correctly is to point a Spark interactive shell at the cluster and run a simple chain of operators. You can either start up an interactive shell on your development environment or SSH into the master node of the cluster. You should have already tested your authentication credentials, as described in Configuring Amazon S3 as a Spark Data Source, so you can focus any troubleshooting efforts solely on the Spark and Hadoop side of the equation.

  1. Start the shell, pointing it at your cluster. Your secret keys or IAM Role should already be configured within the cluster. (There is no interactive shell available for Java; you would run the equivalent test as a standalone application with the spark-submit script.)
  2. Once the shell has started, pull a file from your S3 bucket and run a simple action on it. Remember that transformations are lazy, so simply calling textFile() on a file path does not actually do anything until a subsequent action. Use either s3n or s3a as the path prefix, depending on which protocol your cluster is configured for. The low-level Spark Core API containing textFile() is not available in R, so an R test would create a DataFrame instead; upload a simple JSON dataset to your S3 bucket for that variant. A sketch of this test appears after this list.
  3. After the code executes, check the S3 bucket via the AWS Management Console. You should see the newly saved file in the bucket. If the code ran successfully, you are ready to use S3 in your real application.
  4. If the code fails, it will likely fail for one of the reasons described below.
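
Here is a minimal PySpark sketch of this test. The file names and output folders are placeholders for whatever data you uploaded to your bucket, and you can swap s3a for s3n if that is the protocol your cluster is configured for.

    # Start the interactive Python shell against your cluster first, for example:
    #   $SPARK_HOME/bin/pyspark --master spark://ip-of-your-master:7077

    # Read a text file from the bucket, run a simple transformation and action,
    # then save the result back to the bucket. All paths are placeholders.
    numbers = sc.textFile("s3a://sparkour-data/random_numbers.txt")
    distinct_numbers = numbers.distinct()
    print(distinct_numbers.count())
    distinct_numbers.saveAsTextFile("s3a://sparkour-data/distinct-numbers")

    # DataFrame variant (the approach an R test would take): read a simple JSON
    # dataset and write a copy back to the bucket.
    people = spark.read.json("s3a://sparkour-data/people.json")
    people.printSchema()
    people.write.mode("overwrite").json("s3a://sparkour-data/people-copy")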

No FileSystem for scheme: s3n

This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternatively, you can use --jars if you have already downloaded the dependencies manually. These parameters also work with the spark-submit script, which is how you would run this test from Java, since there is no interactive shell available for Java. A sample launch command is sketched below.
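
The exact Maven coordinates depend on your Hadoop build; as a rough example, something like the following pulls the hadoop-aws module (and its transitive dependencies) into the interactive Python shell. The master URL and the 2.7.x version number are placeholders.

    # Hypothetical launch command; match the hadoop-aws version to your Hadoop build.
    $SPARK_HOME/bin/pyspark --master spark://ip-of-your-master:7077 \
        --packages org.apache.hadoop:hadoop-aws:2.7.3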

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you should download a recent Hadoop 2.7.x distribution and unzip it in your development environment to get the necessary JAR files. This workaround requires a specific, older version of the AWS SDK JAR (1.7.4) that might not be available in the EC2 Maven Repository, so the --packages parameter is not a good solution here. Instead, use --jars to point to the manually downloaded dependencies. This parameter also works with the spark-submit script, which is how you would run this test from Java, since there is no interactive shell available for Java. A sample launch command is sketched below.
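
As a rough example, the command below points the interactive Python shell at the two JARs copied out of a Hadoop 2.7.x distribution (typically found under share/hadoop/tools/lib). The master URL, paths, and the 2.7.x version number are placeholders.

    # Hypothetical launch command using manually downloaded JARs.
    $SPARK_HOME/bin/pyspark --master spark://ip-of-your-master:7077 \
        --jars /path/to/hadoop-aws-2.7.3.jar,/path/to/aws-java-sdk-1.7.4.jar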

AWS Error Message: One or more objects could not be deleted

This message occurs when your IAM Role does not have the proper permissions to delete objects in the S3 bucket. When you write to S3, several temporary files are saved during the task. These files are deleted once the write operation is complete, so your EC2 instance must have the s3:Delete* permission added to its IAM Role policy, as shown in Configuring Amazon S3 as a Spark Data Source.

Reference Links

  1. Amazon S3 in the Hadoop Wiki
  2. Available Configuration Options for Hadoop-AWS
  3. Configuring an S3 VPC Endpoint for a Spark Cluster

Change Log

  • 2016-09-20: Updated for Spark 2.0.0. Code may not be backwards compatible with Spark 1.6.x (SPARKOUR-18).

Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.

Apache, Spark, and Apache Spark are trademarks of the Apache Software Foundation (ASF).
Sparkour is © 2016 - 2017 by Brian Uri!. It is an independent project that is not endorsed or supported by Novetta or the ASF.