Sparkour

Building Spark Applications with SBT

by Brian Uri!, 2016-03-17

Synopsis

This recipe covers the use of SBT (Simple Build Tool or, sometimes, Scala Build Tool) to build and bundle Spark applications written in Java or Scala. It focuses very narrowly on a subset of commands relevant to Spark applications, including managing library dependencies, packaging, and creating an assembly JAR file with the sbt-assembly plugin.

Prerequisites

  1. You need a development environment with Java, Scala, and Apache Spark installed, as covered in Tutorial #4: Writing and Submitting a Spark Application.

Target Versions

  1. This recipe is independent of any specific version of Spark or Hadoop.
  2. This recipe uses Java 8 and Scala 2.11.11. You are welcome to use different versions, but you may need to change the version numbers in the instructions. Make sure to use the same version of Scala as the one used to build your distribution of Spark. Pre-built distributions of Spark 1.x use Scala 2.10, while pre-built distributions of Spark 2.0.x use Scala 2.11.
  3. SBT continues to mature, sometimes in ways that break backwards compatibility. You should consider using a minimum of SBT 0.13.6 and sbt-assembly 0.12.0.

Introducing SBT

SBT is a Scala-based build tool that works with both Java and Scala source code. It adopts many of the conventions used by Apache Maven. Although it has faced criticism for its arcane syntax and for being "yet another build tool", SBT has become the de facto build tool for Scala applications. SBT manages library dependencies internally with Apache Ivy, but you do not need to interact directly with Ivy to use this feature. You are most likely to benefit from adopting SBT if you're writing a pure Scala Spark application or you have a mixed codebase of both Java and Scala code.

This recipe focuses very narrowly on aspects of SBT relevant to Spark development and intentionally glosses over the more complex configurations and commands. Refer to the SBT Reference Manual for more advanced usage.

Downloading the Source Code

  1. Download and unzip the example source code for this recipe. This ZIP archive contains source code in Java and Scala. Here's how you would do this on an EC2 instance running Amazon Linux:
  2. The example source code for each language is in a subdirectory of src/main with that language's name.
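The download step above might look like the following. The exact URL and archive name are illustrative; use the link provided on the recipe page:

```shell
# Download and unzip the example source code (URL shown is illustrative).
wget https://sparkour.urizone.net/recipes/building-sbt/building-sbt.zip
unzip building-sbt.zip
cd building-sbt
```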

Installing SBT

  1. SBT can be downloaded and manually installed from its website. You can also use the yum utility to install it on an Amazon EC2 instance:
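A yum-based install might look like this. The repository URL is an assumption based on the SBT project's published RPM instructions at the time of writing; verify the current instructions on the SBT website:

```shell
# Register the SBT RPM repository (URL may have changed; check the SBT website).
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo

# Install SBT with yum.
sudo yum install -y sbt
```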

Project Organization

SBT follows a Maven-like convention for its directory structure. The downloaded Sparkour example is set up as a "single build" project and contains the following important paths and files:

  • build.properties: This file controls the version of SBT. Different projects can use different versions of the tool in the same development environment.
  • build.sbt: This file contains important properties about the project.
  • lib/: This directory contains any unmanaged library dependencies that you have downloaded locally.
  • project/assembly.sbt: This file contains configuration for the sbt-assembly plugin, which allows you to create a Spark assembly JAR.
  • src/main/java: This directory is where SBT expects to find Java source code.
  • src/main/scala: This directory is where SBT expects to find Scala source code.
  • target: This directory is where SBT places compiled classes and JAR files.

The build.properties file always contains a single property identifying the SBT version. In our example, that version is 0.13.15.
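The entire contents of our build.properties file:

```properties
sbt.version=0.13.15
```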

The build.sbt file is a Scala-based file containing properties about the project. Earlier versions of SBT required the file to be double-spaced, but this restriction has been removed in newer releases. The syntax of the libraryDependencies setting is covered in the next section.
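A minimal build.sbt might look like the following sketch. The project name and version shown here are illustrative:

```scala
// Basic project properties; name and version are illustrative.
name := "building-sbt"
version := "1.0"

// Must match the Scala version used to build your Spark distribution.
scalaVersion := "2.11.11"
```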

The project/assembly.sbt file includes the sbt-assembly plugin in the build. Additional configuration for the plugin would be added here, although our example uses basic defaults.
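Including the plugin takes a single line in project/assembly.sbt. The version shown is illustrative; use a recent release of sbt-assembly:

```scala
// Adds the sbt-assembly plugin to the build (version is illustrative).
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```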

Finally, we have two very simple Spark applications (in Java and Scala) that we use to demonstrate SBT. Each application has a dependency on the Apache Commons CSV Java library, so we can demonstrate how SBT handles dependencies.

Because SBT is specific to Java and Scala applications, no examples are provided for Python and R.

Building and Submitting an Application

Managed Library Dependencies

Under the hood, SBT uses Apache Ivy to download dependencies from the Maven2 repository. You define your dependencies in your build.sbt file with the format groupID % artifactID % revision, which may look familiar to developers who have used Maven. In our example, we have three dependencies: Commons CSV, Spark Core, and Spark SQL:
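In build.sbt, those three dependencies might be declared as follows. The revision numbers shown are illustrative; match the Spark artifacts to your own distribution:

```scala
// Managed dependencies (versions are illustrative).
libraryDependencies ++= Seq(
  "org.apache.commons" % "commons-csv" % "1.2",
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql" % "2.0.0"
)
```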

The double percent operator (%%) is a convenience operator that inserts the Scala compiler version into the artifact ID. Dependencies only available in Java should always be written with the single percent operator (%). If you don't know the groupID or artifactID of your dependency, you can probably find them on that dependency's website or in the Maven Central Repository.

  1. Let's build our example source code with SBT.
  2. The package command compiles the source code in /src/main/ and creates a JAR file of just the project code without any dependencies. In our case, we have both Java and Scala applications in the directory, so we end up with a convenient JAR file containing both applications.
  3. There are many more configuration options available in SBT that you may want to learn if you are serious about adopting SBT across your codebase. Refer to the SBT Reference Manual to learn more. For example, you might use dependencyClasspath to separate compile, test, and runtime dependencies or add resolvers to identify alternate repositories for downloading dependencies.
  4. We can now run these applications using the familiar spark-submit script. We use the --packages parameter to include Commons CSV as a runtime dependency. Remember that spark-submit uses Maven syntax, not SBT syntax (colons as separators instead of percent signs), and that we don't need to include Spark itself as a dependency since it is implied by default. You can add other Maven IDs with a comma-separated list.
  5. Because SBT is specific to Java and Scala applications, no examples are provided for Python and R.
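The build-and-submit steps above might look like this on the command line. The application class name and JAR path are illustrative; substitute the names from your own project:

```shell
# Compile the source code and create a JAR of the project code (no dependencies bundled).
sbt package

# Submit the application, pulling in Commons CSV at runtime with Maven-style coordinates.
# Class name and JAR path are illustrative.
spark-submit \
  --class buildwithsbt.JBuildWithSbt \
  --packages org.apache.commons:commons-csv:1.2 \
  target/scala-2.11/building-sbt_2.11-1.0.jar
```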

Unmanaged Library Dependencies

An alternative to letting SBT handle your dependencies is to download them locally yourself. Here's how you would alter our example project (which uses managed dependencies) to use this approach:

  1. Update the build.sbt file to remove the Commons CSV dependency.
  2. Download Commons CSV to the local lib/ directory. SBT implicitly uses anything in this directory at compile time.
  3. Build the code as we did before. This results in the same JAR file as the previous approach.
  4. We can now run the applications using spark-submit with the --jars parameter to include Commons CSV as a runtime dependency. You can add other JARs with a comma-separated list.
  5. Because SBT is specific to Java and Scala applications, no examples are provided for Python and R.

  6. As a best practice, you should make sure that dependencies are not both managed and unmanaged. If you specify a managed dependency and also have a local copy in lib/, you may waste several enjoyable hours troubleshooting if the versions ever get out of sync. You should also review Spark's own assembly JAR, which is implicitly in the classpath when you run spark-submit. If a library you need is already a core dependency of Spark, including your own copy may lead to version conflicts.
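The unmanaged workflow above might look like this. The download URL, JAR version, class name, and JAR path are all illustrative:

```shell
# Download Commons CSV into lib/, where SBT picks it up automatically (URL is illustrative).
wget -P lib/ https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.2/commons-csv-1.2.jar

# Rebuild the project code.
sbt package

# Submit with the local JAR on the runtime classpath (class name and paths are illustrative).
spark-submit \
  --class buildwithsbt.JBuildWithSbt \
  --jars lib/commons-csv-1.2.jar \
  target/scala-2.11/building-sbt_2.11-1.0.jar
```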

Creating an Assembly JAR

As the number of library dependencies increases, the network overhead of sending all of those files to each node in the Spark cluster increases as well. The official Spark documentation recommends creating a special JAR file containing both the application and all of its dependencies called an assembly JAR (or "uber" JAR) to reduce network churn. The assembly JAR contains a combined and flattened set of class and resource files -- it is not just a JAR file containing other JAR files.

  1. We use the sbt-assembly plugin to generate an assembly JAR. This plugin is already in our example project, as seen in the project/assembly.sbt file:
  2. Update the build.sbt file to mark the Spark dependency as provided. This prevents it from being included in the assembly JAR. You can also restore the Commons CSV dependency if you want, although our local copy in the lib/ directory will still get picked up automatically at compile time.
  3. Next, run the assembly command. This command creates an assembly JAR containing both your applications and the Commons CSV classes.
  4. There are many more configuration options available, such as using a MergeStrategy to resolve potential duplicates and dependency conflicts. Refer to the sbt-assembly documentation to learn more.
  5. You can confirm the contents of the assembly JAR with the less command:
  6. You can now submit your application for execution using the assembly JAR. Because the dependencies are bundled inside, there is no need to use --jars or --packages.
  7. Because SBT is specific to Java and Scala applications, no examples are provided for Python and R.
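The assembly steps above might look like this. The JAR name and class name are illustrative; sbt-assembly prints the actual output path when it finishes:

```shell
# Build the assembly JAR (applications plus bundled dependencies, minus "provided" Spark).
sbt assembly

# Inspect the flattened contents of the JAR (name is illustrative).
less target/scala-2.11/building-sbt-assembly-1.0.jar

# Submit without --jars or --packages; the dependencies are already inside.
spark-submit \
  --class buildwithsbt.JBuildWithSbt \
  target/scala-2.11/building-sbt-assembly-1.0.jar
```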

Reference Links

  1. SBT Reference Manual
  2. sbt-assembly plugin for SBT
  3. Building Spark Applications with Maven

Change Log

  • 2016-09-20: Updated for Spark 2.0.0. Code may not be backwards compatible with Spark 1.6.x (SPARKOUR-18).

Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.

Apache, Spark, and Apache Spark are trademarks of the Apache Software Foundation (ASF).
Sparkour is © 2016 - 2017 by Brian Uri!. It is an independent project that is not endorsed or supported by Novetta or the ASF.