This recipe provides the steps needed to securely connect an Apache Spark cluster running on Amazon Elastic Compute Cloud (EC2) to data stored in Amazon Simple Storage Service (S3), using the s3a protocol. Coordinating the versions of the various required libraries is the most difficult part -- writing application code for S3 is very straightforward.
There are no S3 libraries in the core Apache Spark project. Spark uses libraries from Hadoop to connect to S3, and the integration between Spark, Hadoop, and the AWS services can feel a little finicky. We skip over two older protocols for this recipe:
We focus on the s3a protocol, which is the most modern protocol available. Implemented directly on top of AWS APIs, s3a is scalable, handles files up to 5 TB in size, and supports authentication with Identity and Access Management (IAM) Roles. With IAM Roles, you assign an IAM Role to your worker nodes and then attach policies granting access to your S3 bucket. No secret keys are involved, and the risk of accidentally disseminating keys or committing them in version control is reduced.
S3 can be incorporated into your Spark application wherever a string-based file path is accepted in the code. An example (using the bucket name, sparkour-data) is shown below.
Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source.
The Hadoop libraries expose additional configuration properties for more fine-grained control of S3. To maximize your security, you should not use any of the Authentication properties that require you to write secret keys to a properties file.
The simplest way to confirm that your Spark cluster is handling S3 protocols correctly is to point a Spark interactive shell at the cluster and run a simple chain of operators. You can either start up an interactive shell on your development environment or SSH into the master node of the cluster. You should have already tested your authentication credentials, as described in Configuring Amazon S3 as a Spark Data Source, so you can focus any troubleshooting efforts solely on the Spark and Hadoop side of the equation.
There is no interactive shell available for Java. Here is how you would run an application with the spark-submit script.
There is no interactive shell available for Java. Here is how you would run this test inside an application.
The low-level Spark Core API containing textFile() is not available in R, so we try to create a DataFrame instead. You should upload a simple JSON dataset to your S3 bucket for use in this test.
This message appears when you're using the s3a protocol and dependencies are missing from your Apache Spark distribution. If you're using a Spark distribution that was "Pre-built for Apache Hadoop 2.7 and later", you can automatically load the dependencies from the EC2 Maven Repository with the --packages parameter. This parameter also works on the spark-submit script.
There is no interactive shell available for Java. Here is how you would run an application with the spark-submit script.
The solution gets trickier if you want to take advantage of the s3a improvements in Hadoop 2.8.x and higher. Spark's "Pre-built for Apache Hadoop 2.7 and later" distributions contain dependencies that conflict with the libraries needed in modern Hadoop versions, so using the --packages parameter will lead to error messages such as:
To integrate modern Hadoop versions, you need to download a "Pre-built with user-provided Apache Hadoop" distribution of Spark and add Hadoop to your classpath, as shown in Using Spark's "Hadoop Free" Build from the official documentation.
(s3n is no longer relevant now that the s3a protocol is mature, but I'm including this because many people search for the error message on Google and arrive here). This message appears when you're using the s3n protocol and dependencies are missing from your Apache Spark distribution. See the S3AFileSystem error above for ways to correct this.
This message occurs when your IAM Role does not have the proper permissions to delete objects in the S3 bucket. When you write to S3, several temporary files are saved during the task. These files are deleted once the write operation is complete, so your EC2 instance must have the s3:Delete* permission added to its IAM Role policy, as shown in Configuring Amazon S3 as a Spark Data Source.
Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.