This tutorial takes you through the common steps involved in creating a Spark application and submitting it to a Spark cluster for execution. We write and submit a simple application and then review the examples bundled with Apache Spark.
Spark imposes no special restrictions on where you can do your development. The Sparkour recipes will continue to use the EC2 instance created in a previous tutorial as a development environment, so that each recipe can start from the same baseline configuration. However, you probably already have a development environment tuned just the way you like it, so you can use it instead if you prefer. You'll just need to get your build dependencies in order.
Instructions for installing each programming language on your EC2 instance are shown below. If your personal development environment (outside of EC2) does not already have the right components installed, you should be able to review these instructions and adapt them to your unique environment.
If you intend to write any Spark applications with Java, you should consider updating to Java 8 or higher. This version of Java introduced Lambda Expressions, which reduce the pain of writing repetitive boilerplate code while making the resulting code more similar to Python or Scala code. Sparkour Java examples employ Lambda Expressions heavily, and Java 7 support may go away in Spark 2.x. The Amazon Linux AMI comes with Java 7, but it's easy to switch versions:
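A minimal sequence for switching to OpenJDK 8 on the Amazon Linux AMI might look like the following (package names can vary with the AMI release, so adjust as needed):

```
# Install OpenJDK 8 (the -devel package adds the compiler) and remove the
# preinstalled Java 7 packages so that version 8 becomes the default.
sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
sudo yum remove -y java-1.7.0-openjdk

# Confirm that the 1.8.0 runtime is now active.
java -version
```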
Because the Amazon Linux image requires Python 2 as a core system dependency, you should install Python 3 in parallel without removing Python 2.
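As a sketch, assuming the Python 3 package in your Amazon Linux repository is named python35 (the exact name depends on the AMI release):

```
# Install Python 3 alongside the system Python 2; do not remove Python 2.
sudo yum install -y python35

# Confirm that both interpreters are available.
python --version
python3 --version   # or python3.5 --version, depending on how the package names the binary
```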
To complete your installation, set the PYSPARK_PYTHON environment variable so that it takes effect whenever you log in to the EC2 instance. This variable determines which version of Python is used by Spark.
You need to reload the environment variables (or log out and log in again) so they take effect.
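A minimal sketch, assuming a Bash shell, a ~/.bash_profile login script, and a Python 3 interpreter installed at /usr/bin/python3 (adjust the path to match your installation):

```
# Tell Spark which Python interpreter to use, and make it stick across logins.
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.bash_profile

# Reload the profile in the current shell instead of logging out and back in.
source ~/.bash_profile
echo $PYSPARK_PYTHON
```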
The R environment is available in the Amazon Linux package repository.
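Installation is a single package command; the package name below is the usual one, so adjust it if your repository differs:

```
# Install R from the Amazon Linux package repository and verify it.
sudo yum install -y R
R --version
```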
Scala is not in the Amazon Linux package repository, and must be downloaded separately. You should use the same version of Scala that was used to build your copy of Apache Spark.
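As a sketch, using Scala 2.11.8 as an illustrative version (substitute the version that matches your Spark build, and check the download URL against the official Scala site):

```
# Download and unpack a Scala release, then move it to a predictable location.
wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
tar -xzf scala-2.11.8.tgz
sudo mv scala-2.11.8 /opt/scala
```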
To complete your installation, set the SCALA_HOME environment variable so that it takes effect whenever you log in to the EC2 instance.
You need to reload the environment variables (or log out and log in again) so they take effect.
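A minimal sketch, assuming Bash, a ~/.bash_profile login script, and the /opt/scala location used in the download step above:

```
# Make SCALA_HOME available at every login and put the Scala binaries on the PATH.
echo "export SCALA_HOME=/opt/scala" >> ~/.bash_profile
echo "export PATH=\$PATH:\$SCALA_HOME/bin" >> ~/.bash_profile

# Reload the profile in the current shell and verify.
source ~/.bash_profile
scala -version
```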
These instructions do not cover Java and Scala build tools (such as Maven and SBT), which simplify the compiling and bundling steps of the development lifecycle. Build tools are covered in Building Spark Applications with Maven and Building Spark Applications with SBT.
When you submit an application to a Spark cluster, the cluster manager distributes the application code to each worker so it can be executed locally. This means that all dependencies need to be included (except for Spark and Hadoop dependencies, which the workers already have copies of). There are multiple approaches for bundling dependencies, depending on your programming language:
For Java and Scala applications, the Spark documentation recommends creating an assembly JAR (or "uber" JAR) containing your application and all of its dependencies. Alternatively, you can use the --packages parameter with a comma-separated list of Maven dependency IDs, and Spark will download and distribute the libraries from a Maven repository. Finally, you can specify a comma-separated list of third-party JAR files with the --jars parameter of the spark-submit script (examples of these parameters appear below). Avoiding the assembly JAR is simpler, but there is a performance trade-off if many separate files must be copied out to the workers.
For Python applications, you can use the --py-files parameter of the spark-submit script to pass in .zip, .egg, or .py dependencies.
For R applications, you can load additional libraries when you initialize the SparkR package in your R script.
In the case of the --py-files or --jars parameters, you can also use URL schemes such as ftp: or hdfs: to reference files stored elsewhere, or local: to specify files that are already stored on the workers. See the documentation on Advanced Dependency Management for more details.
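The sketches below show the shape of each parameter; the Maven coordinates, class names, and file names are placeholders for your own dependencies:

```
# Download a dependency from a Maven repository and distribute it to the workers.
spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 my_app.py

# Ship specific third-party JAR files alongside a Java or Scala application.
spark-submit --class com.example.MyApp --jars libs/dep1.jar,libs/dep2.jar my-app.jar

# Ship Python dependencies as .zip, .egg, or .py files.
spark-submit --py-files deps.zip,helper.py my_app.py
```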
The simple application in this tutorial has no dependencies besides Spark itself. We cover the creation of an assembly JAR in Building Spark Applications with Maven and Building Spark Applications with SBT.
Here are examples of how you would submit an application in each of the supported languages. For JAR-based languages (Java and Scala), we pass in an application JAR file and a class name. For interpreted languages, we just need the top-level module or script name. Everything after the expected spark-submit arguments is passed into the application itself.
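The commands below are sketches; the class name, file names, and master URL are placeholders you would replace with your own:

```
# Java or Scala: pass the application JAR and the fully qualified main class.
spark-submit --class com.example.SubmittingApplications \
    --master spark://your-master-hostname:7077 my-app.jar applicationArg1

# Python: pass the top-level script.
spark-submit --master spark://your-master-hostname:7077 my_app.py applicationArg1

# R: pass the top-level script.
spark-submit --master spark://your-master-hostname:7077 my_app.R applicationArg1
```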
The spark-submit script accepts a --deploy-mode parameter which dictates how the driver is set up. Recall from the previous recipe that the driver contains your application and relies upon the completed tasks of workers to successfully execute your code. In client mode (the default), the driver runs on the machine where you invoke spark-submit; in cluster mode, the driver is launched on one of the machines inside the cluster. It's best if the driver is physically co-located near the workers to reduce any network latency.
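The same application can be launched either way just by changing the flag; the master URL and names below are placeholders:

```
# Client mode (the default): the driver runs inside the spark-submit process
# on the machine where you type the command.
spark-submit --master spark://your-master-hostname:7077 --deploy-mode client \
    --class com.example.MyApp my-app.jar

# Cluster mode: the driver is launched on a machine inside the cluster,
# physically closer to the workers.
spark-submit --master spark://your-master-hostname:7077 --deploy-mode cluster \
    --class com.example.MyApp my-app.jar
```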
You can visualize these modes in the image below.
If you find yourself often reusing the same configuration parameters, you can create a conf/spark-defaults.conf file in the Spark home directory of your development environment. Properties explicitly set within a Spark application (on the SparkConf object) have the highest priority, followed by properties passed into the spark-submit script, and finally the defaults file.
Here is an example of setting the master URL in a defaults file. More detail on the available properties can be found in the official documentation.
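A minimal conf/spark-defaults.conf might contain nothing more than the line below; the master URL is a placeholder for your own cluster, and whitespace separates each property name from its value:

```
spark.master    spark://your-master-hostname:7077
```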
Now that we have successfully submitted and executed an application, let's take a look at one of the examples included in the Spark distribution.
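The SparkPi example, which estimates the value of pi, ships with every Spark distribution in both compiled and script form; the commands below assume SPARK_HOME points at your Spark installation:

```
# Run the bundled SparkPi example (Scala version) through the helper script.
$SPARK_HOME/bin/run-example SparkPi 10

# Or submit the Python version of the same example directly.
$SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10
```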
You have now been exposed to the application lifecycle for applications that use Spark for data processing. It's time to stop poking around the edges of Spark and actually use the Spark APIs to do something useful with data. In the next tutorial, Tutorial #5: Working with Spark Resilient Distributed Datasets (RDDs), we dive into the Spark Core API and work with some common transformations and actions. If you are done playing with Spark for now, make sure that you stop your EC2 instance so you don't incur unexpected charges.
Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.