Sparkour

Configuring Spark to Use Amazon S3

by Brian Uri!, 2016-03-06

Synopsis

This recipe provides the steps needed to securely connect an Apache Spark cluster running on Amazon Elastic Compute Cloud (EC2) to data stored in Amazon Simple Storage Service (S3), using the s3a protocol. Coordinating the versions of the various required libraries is the most difficult part -- writing application code for S3 is very straightforward.

Prerequisites

  1. You need a working Spark cluster, as described in Managing a Spark Cluster with the spark-ec2 Script. The cluster must be configured with an Identity & Access Management (IAM) Role via --instance-profile-name.
  2. You need an access-controlled S3 bucket available for Spark consumption, as described in Configuring Amazon S3 as a Spark Data Source. The IAM Role of the cluster instances must have a policy granting access to the bucket.

Target Versions

  1. Spark depends on Apache Hadoop and Amazon Web Services (AWS) for libraries that communicate with Amazon S3. As such, any version of Spark should work with this recipe.
  2. Apache Hadoop started supporting the s3a protocol in version 2.6.0, but several important issues were corrected in Hadoop 2.7.0 and Hadoop 2.8.0. You should consider 2.7.2 to be the minimum recommended version.

Section Links

S3 Support in Spark

There are no S3 libraries in the core Apache Spark project. Spark uses libraries from Hadoop to connect to S3, and the integration between Spark, Hadoop, and the AWS services can feel a little finicky. We skip over two older protocols for this recipe:

  1. The s3 protocol is supported in Hadoop, but does not work with Apache Spark unless you are using the AWS version of Spark in Elastic MapReduce (EMR).
  2. The s3n protocol is Hadoop's older protocol for connecting to S3. This deprecated protocol has major limitations, including a brittle security approach that requires the use of AWS secret API keys to run.

We focus on the s3a protocol, which is the most modern protocol available. Implemented directly on top of AWS APIs, s3a is scalable, handles files up to 5 TB in size, and supports authentication with Identity and Access Management (IAM) Roles. With IAM Roles, you assign an IAM Role to your worker nodes and then attach policies granting access to your S3 bucket. No secret keys are involved, and the risk of accidentally disseminating keys or committing them in version control is reduced.

S3 can be incorporated into your Spark application wherever a string-based file path is accepted in the code. An example (using the bucket name, sparkour-data) is shown below.

Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source.

Advanced Configuration

The Hadoop libraries expose additional configuration properties for more fine-grained control of S3. To maximize your security, you should not use any of the Authentication properties that require you to write secret keys to a properties file.

Testing the s3a Protocol

The simplest way to confirm that your Spark cluster is handling S3 protocols correctly is to point a Spark interactive shell at the cluster and run a simple chain of operators. You can either start up an interactive shell on your development environment or SSH into the master node of the cluster. You should have already tested your authentication credentials, as described in Configuring Amazon S3 as a Spark Data Source, so you can focus any troubleshooting efforts solely on the Spark and Hadoop side of the equation.

  1. Start the shell. Your Spark cluster's EC2 instances should already be configured with an IAM Role.
  2. There is no interactive shell available for Java. Here is how you would run an application with the spark-submit script.

  3. Once the shell has started, pull a file from your S3 bucket and run a simple action on it. Remember that transformations are lazy, so simply calling textFile() on a file path does not actually do anything until a subsequent action.
  4. There is no interactive shell available for Java. Here is how you would run this test inside an application.

    The low-level Spark Core API containing textFile() is not available in R, so we try to create a DataFrame instead. You should upload a simple JSON dataset to your S3 bucket for use in this test.

  5. After the code executes, check the S3 bucket via the AWS Management Console. You should see the newly saved file in the bucket. If the code ran successfully, you are ready to use S3 in your real application.
  6. If the code fails, it will likely fail for one of the reasons described below.

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

This message appears when you're using the s3a protocol and dependencies are missing from your Apache Spark distribution. If you're using a Spark distribution that was "Pre-built for Apache Hadoop 2.7 and later", you can automatically load the dependencies from the EC2 Maven Repository with the --packages parameter. This parameter also works on the spark-submit script.

There is no interactive shell available for Java. Here is how you would run an application with the spark-submit script.

The solution gets trickier if you want to take advantage of the s3a improvements in Hadoop 2.8.x and higher. Spark's "Pre-built for Apache Hadoop 2.7 and later" distributions contain dependencies that conflict with the libraries needed in modern Hadoop versions, so using the --packages parameter will lead to error messages such as:

To integrate modern Hadoop versions, you need to download a "Pre-built with user-provided Apache Hadoop" distribution of Spark and add Hadoop to your classpath, as shown in Using Spark's "Hadoop Free" Build from the official documentation.

No FileSystem for scheme: s3n

(s3n is no longer relevant now that the s3a protocol is mature, but I'm including this because many people search for the error message on Google and arrive here). This message appears when you're using the s3n protocol and dependencies are missing from your Apache Spark distribution. See the S3AFileSystem error above for ways to correct this.

AWS Error Message: One or more objects could not be deleted

This message occurs when your IAM Role does not have the proper permissions to delete objects in the S3 bucket. When you write to S3, several temporary files are saved during the task. These files are deleted once the write operation is complete, so your EC2 instance must have the s3:Delete* permission added to its IAM Role policy, as shown in Configuring Amazon S3 as a Spark Data Source.

Reference Links

  1. Amazon S3 in the Hadoop Wiki
  2. Available Configuration Options for Hadoop-AWS
  3. Using Spark's "Hadoop Free" Build
  4. Configuring an S3 VPC Endpoint for a Spark Cluster

Change Log

  • 2016-09-20: Updated for Spark 2.0.0. Code may not be backwards compatible with Spark 1.6.x (SPARKOUR-18).
  • 2019-10-20: Updated to focus solely on s3a, now that s3n is fully deprecated (SPARKOUR-34).

Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.

Apache, Spark, and Apache Spark are trademarks of the Apache Software Foundation (ASF).
Sparkour is © 2016 - 2020 by It is an independent project that is not endorsed or supported by Novetta or the ASF.
visitors since February 2016