This tutorial describes the tools available to manage Spark in a clustered configuration, including the official Spark scripts and the web User Interfaces (UIs). We compare the different cluster modes available, and experiment with Local and Standalone mode on our EC2 instance. We also learn how to connect the Spark interactive shells to different Spark clusters and conclude with a description of the spark-ec2 script.
A key feature of Spark is its ability to process data in a parallel, distributed fashion. Breaking down a processing job into smaller tasks that can be executed in parallel by many workers is critical when dealing with massive volumes of data, but it introduces new complexity in the form of task scheduling, monitoring, and failure recovery. Spark abstracts this complexity away, leaving you free to focus on your data processing. Key clustering concepts are shown in the image below.
Spark comes with a cluster manager implementation referred to as the Standalone cluster manager. However, resource management is not a unique Spark concept, and you can swap in an alternative implementation, such as Apache Mesos or Hadoop YARN, instead.
We focus on Standalone Mode here, and explore these alternatives in later recipes.
A server with Spark installed on it can be used in four separate ways: as a development environment where you write and compile your application, as a launch environment from which you submit an application to the cluster, as a master that manages cluster resources, or as a slave (worker) that executes tasks.
Sometimes, these roles overlap on the same server. For example, you probably use the same server as both a development environment and a launch environment. When you're doing rapid development with a test cluster, you might reuse the master node as a development environment to reduce the feedback loop between compiling and running your code. Later in this tutorial, we reuse the same server to create a cluster where a master and a worker are running on the same EC2 instance -- this demonstrates clustering concepts without forcing us to pay for an additional instance.
When you document or describe your architecture for others, you should be clear about the logical roles. For example, "Submitting my application from my local server A to the Spark cluster on server B" is much clearer than "Running my application on my Spark server" and guaranteed to elicit better responses when you inevitably pose a question on Stack Overflow.
Juggling the added complexity of a Spark cluster may not be a priority when you're just getting started with Spark. If you are developing a new application and just want to execute code against small test datasets, you can use Spark's Local mode. In this mode, your development environment is the only thing you need. An ephemeral Spark engine is started when you run your application, executes your application just like a real cluster would, and then goes away when your application stops. There are no explicit masters or slaves to complicate your mental model.
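For instance, here is a sketch of launching an application in Local mode with spark-submit, assuming Spark is installed under /opt/spark as it is on our EC2 instance; my_test_app.py is a placeholder name, not a file from this tutorial:

    # Run an application in Local mode; no cluster is involved, and the
    # ephemeral Spark engine disappears when the application finishes.
    /opt/spark/bin/spark-submit --master "local[*]" my_test_app.py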
By default, the Spark interactive shells run in Local mode unless you explicitly specify a cluster. A Spark engine starts up when you start the shell, executes the code that you type interactively, and goes away when you exit the shell. To mimic the idea of adding more workers to a Spark cluster, you can specify the number of parallel threads used in Local mode:
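As a sketch, again assuming Spark is installed under /opt/spark (the thread count of 4 here is arbitrary):

    # Scala shell in Local mode with 4 worker threads
    /opt/spark/bin/spark-shell --master local[4]

    # Python shell in Local mode with one worker thread per available core
    /opt/spark/bin/pyspark --master "local[*]"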
There is no interactive shell available for Java. You should use one of the other languages to smoke test your Spark cluster.
"Running in Standalone Mode" is just another way of saying that the Spark Standalone cluster manager is being used in the cluster. The Standalone cluster manager is a simple implementation that accepts jobs in First In, First Out (FIFO) order: applications that submit jobs first are given priority access to workers over applications that submit jobs later on. Starting a Spark cluster in Standalone Mode on our EC2 instance is easy to do with the official Spark scripts.
Normally, a Spark cluster would consist of separate servers for the cluster manager (master) and each worker (slave). Each server would have a copy of Spark installed, and a third server would be your development environment where you write and submit your application. In the interests of limiting the time you spend configuring EC2 instances and reducing the risk of unexpected charges, this tutorial creates a minimal cluster using the same EC2 instance as the development environment, master, and slave, as shown in the image below.
In a real world situation, you would launch as many worker EC2 instances as you need (and you would probably forgo these manual steps in favour of automation with the spark-ec2 script).
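Sketching the manual approach on our single instance (assuming Spark is installed at /opt/spark and the master listens on its default port, 7077; the exact arguments to start-slave.sh vary slightly between Spark versions):

    # Start a master on this instance
    /opt/spark/sbin/start-master.sh

    # Start a slave (worker) on the same instance, registering it with the master
    /opt/spark/sbin/start-slave.sh spark://$(hostname):7077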
There are several environment variables and properties that can be configured to control the behavior of the master and slaves. For example, we could explicitly set the hostnames used in the URLs to our EC2 instance's public IP, to guarantee that the URLs would be accessible from outside of the Amazon Virtual Private Cloud (VPC) containing our instance. Refer to the official Spark documentation on Spark Standalone Mode for a complete list.
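As a sketch, such settings typically go in conf/spark-env.sh on each node; here, 203.0.113.10 is a placeholder for the instance's public IP:

    # /opt/spark/conf/spark-env.sh
    # Hostname advertised by the master and workers in their web UIs (placeholder value)
    export SPARK_PUBLIC_DNS=203.0.113.10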
Each member of the Spark cluster has a web-based UI that allows you to monitor running applications and executors in real time.
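As a rough guide, with Spark's default ports (all of them configurable) and EC2_PUBLIC_IP standing in for our instance's public IP, the UIs live at addresses like these:

    http://EC2_PUBLIC_IP:8080    # master UI: workers, running and completed applications
    http://EC2_PUBLIC_IP:8081    # worker UI: that worker's executors and logs
    http://EC2_PUBLIC_IP:4040    # driver UI: jobs, stages, and storage for a running application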
Now that you have a Spark cluster running, you can connect to it with the interactive shells. We are going to run our interactive shells from our EC2 instance, meaning that this instance is playing the roles of development environment, master, and slave at the same time.
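A minimal sketch, assuming the master was started on this instance with its default port, 7077 (EC2_HOSTNAME is a placeholder for the hostname shown in the master's spark:// URL):

    # Connect the Scala shell to the Standalone cluster
    /opt/spark/bin/spark-shell --master spark://EC2_HOSTNAME:7077

    # Connect the Python shell to the same cluster
    /opt/spark/bin/pyspark --master spark://EC2_HOSTNAME:7077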
There is no interactive shell available for Java. To smoke test your Spark cluster from Java, you would package the equivalent code inside an application and submit it to the cluster, which is the focus of the next tutorial.
When using a version of Spark built with Scala 2.10, the command to quit the Scala shell is simply "exit".
Our minimal cluster employed the following official Spark scripts:
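These are the per-node start and stop scripts under /opt/spark/sbin (shown here as a sketch; the master URL argument to start-slave.sh varies slightly between Spark versions):

    /opt/spark/sbin/start-master.sh                          # start a master on the current instance
    /opt/spark/sbin/start-slave.sh spark://$(hostname):7077  # start a slave, given the master's URL
    /opt/spark/sbin/stop-slave.sh                            # stop the slave on the current instance
    /opt/spark/sbin/stop-master.sh                           # stop the master on the current instance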
In our contrived example, it was straightforward to run all of the scripts from the same overloaded instance. In a real cluster, you would need to log in to each slave instance to start and stop that slave. As you can imagine, that would be quite tedious with many slaves, and there is a better way.
If you create a text file called conf/slaves in the master node's Spark home directory (/opt/spark here), and put the hostname of each slave instance in the file (1 slave per line), you can do all of your starting and stopping from the master instance with these scripts:
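A sketch of the pieces involved, with ip-172-31-0-10 standing in for our instance's private hostname; note that these scripts reach each slave over SSH, so passwordless SSH from the master to every slave is typically required:

    # /opt/spark/conf/slaves -- one slave hostname per line
    ip-172-31-0-10

    # Start or stop every slave listed in conf/slaves from the master instance
    /opt/spark/sbin/start-slaves.sh
    /opt/spark/sbin/stop-slaves.sh

    # Or start and stop the master and all of the slaves together
    /opt/spark/sbin/start-all.sh
    /opt/spark/sbin/stop-all.sh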
Although managing the cluster from the master instance is better than logging into each worker, you still have the overhead of configuring and launching each instance. As you start to work with clusters of non-trivial size, you should take a look at the spark-ec2 script. This script automates all facets of cluster management, allowing you to configure, launch, start, stop, and terminate instances from the command line. It ensures that the masters and slaves running on the instances are cleanly started and stopped when the cluster is started or stopped. We explore the power of this script in the recipe, Managing a Spark Cluster with the spark-ec2 Script.
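For a taste of what that looks like, here is a sketch of launching and later tearing down a small cluster with spark-ec2; the key pair name, identity file, and cluster name are placeholders, and the full walkthrough lives in that recipe:

    # Launch a cluster named my-cluster with two slaves (placeholder names)
    ./spark-ec2 --key-pair=my-keypair --identity-file=my-key.pem --slaves=2 launch my-cluster

    # Stop the cluster's instances without destroying them, then start them again later
    ./spark-ec2 stop my-cluster
    ./spark-ec2 --key-pair=my-keypair --identity-file=my-key.pem start my-cluster

    # Terminate the cluster permanently
    ./spark-ec2 destroy my-cluster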
You have now seen the different ways that Spark can be deployed, and are almost ready to dive into doing functional, productive things with Spark. In the next tutorial, Tutorial #4: Writing and Submitting a Spark Application, we focus on the steps needed to create a Spark application and submit it to a cluster for execution. If you are done playing with Spark for now, make sure that you stop your EC2 instance so you don't incur unexpected charges.
Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.