Sparkour

Building Spark Applications with Maven

by Brian Uri!, 2016-03-29

Synopsis

This recipe covers the use of Apache Maven to build and bundle Spark applications written in Java or Scala. It focuses very narrowly on a subset of commands relevant to Spark applications, including managing library dependencies, packaging, and creating an assembly JAR file.

Prerequisites

  1. You need a development environment with Java and Apache Spark installed, as covered in Tutorial #4: Writing and Submitting a Spark Application.

Target Versions

  1. This recipe is independent of any specific version of Spark or Hadoop.
  2. This recipe uses Java 8 and Scala 2.11.11. You are welcome to use different versions, but you may need to change the version numbers in the instructions. Make sure to use the same version of Scala as the one used to build your distribution of Spark. Pre-built distributions of Spark 1.x use Scala 2.10, while pre-built distributions of Spark 2.0.x use Scala 2.11.
  3. You should use at least Maven 3.2.5 to maximize the availability and compatibility of plugins, such as maven-compiler-plugin, addjars-maven-plugin, maven-shade-plugin, and scala-maven-plugin.

Introducing Maven

Apache Maven is a Java-based build tool that works with both Java and Scala source code. It employs a "convention over configuration" philosophy that attempts to make useful assumptions about project structure and common build tasks in order to reduce the amount of explicit configuration by a developer. Although it has faced some criticism for its use of verbose XML in configuration files and for being difficult to customize, it has eclipsed Apache Ant as the standard build tool for Java applications. Maven also manages library dependencies and benefits from broadly adopted public dependency repositories. You are most likely to benefit from adopting Maven if you're writing a pure Java Spark application or you have a mixed codebase of mostly Java code.

This recipe focuses very narrowly on aspects of Maven relevant to Spark development and intentionally glosses over the more complex configurations and commands. Refer to the Maven Documentation for more advanced usage.

Downloading the Source Code

  1. Download and unzip the example source code for this recipe. This ZIP archive contains source code in Java and Scala. Here's how you would do this on an EC2 instance running Amazon Linux:
  2. The example source code for each language is in a subdirectory of src/main with that language's name.
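The download step might look like this on an EC2 instance running Amazon Linux (the URL and archive name are assumptions based on this recipe's title; use the actual link from the recipe page):

```shell
# Download and unzip the example source code for this recipe.
# The URL below is illustrative; substitute the real download link.
wget https://sparkour.urizone.net/assets/building-maven.zip
unzip building-maven.zip
cd building-maven
```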

Installing Maven

  1. Maven can be downloaded and manually installed from its website. Here's how you would do so on an Amazon EC2 instance:
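A manual installation on Amazon Linux might look like the following (the Maven version is illustrative; any release meeting the minimum in Target Versions should work):

```shell
# Download a Maven binary distribution from the Apache archive and unpack it.
wget https://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar -xzf apache-maven-3.3.9-bin.tar.gz
sudo mv apache-maven-3.3.9 /opt/maven

# Put mvn on the PATH for future shells, then verify the install.
echo 'export PATH=/opt/maven/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
mvn --version
```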

Project Organization

Maven defines a standard convention for the directory structure of a project. The downloaded Sparkour example contains the following important paths and files:

  • pom-java.xml: A Maven file for Java projects containing managed library dependencies.
  • pom-java-local.xml: A Maven file for Java projects containing unmanaged library dependencies that have been manually downloaded to the local filesystem.
  • pom-scala.xml: A Maven file for Scala projects containing managed library dependencies.
  • pom-scala-local.xml: A Maven file for Scala projects containing unmanaged library dependencies that have been manually downloaded to the local filesystem.
  • lib/: This directory contains any unmanaged library dependencies that you have downloaded locally.
  • src/main/java: This directory is where Maven expects to find Java source code.
  • src/main/scala: This directory is where Maven expects to find Scala source code.
  • target: This directory is where Maven places compiled classes and JAR files.

The Project Object Model (POM) file is the primary Maven artifact. In this case, there are 4 POM files, each of which builds the example in a different way. We can designate a specific file with the --file parameter. In the absence of this parameter, Maven will default to a file named pom.xml.

Each example POM file has some configuration in common:

  • The combination of a groupId, artifactId, and version uniquely identifies this project build if it is ever published to a Maven repository.
  • A collection of dependencies identifies library dependencies that Maven needs to gather from a Maven repository to compile, package, or run the project. All of our example POMs identify Apache Spark as a dependency.
  • The _2.11 suffix in the artifactId specifies a build of Spark that was compiled with Scala 2.11. Adding a scope of provided signifies that Spark is needed to compile the project, but does not need to be available at runtime or included in an assembly JAR file. (Recall that Spark will already be installed on a Spark cluster executing your application, so there is no need to provide a new copy at runtime).
  • A collection of plugins identifies Maven plugins that perform tasks such as compiling code.
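Put together, the shared configuration described above looks roughly like this in each POM (the coordinates match the JAR names produced later in this recipe; the Spark version number is illustrative):

```xml
<!-- Coordinates that uniquely identify this project build. -->
<groupId>net.example</groupId>
<artifactId>building-maven</artifactId>
<version>1.0-SNAPSHOT</version>

<dependencies>
    <!-- The _2.11 suffix selects a build of Spark compiled with Scala 2.11. -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.0</version>
        <!-- Needed to compile, but the Spark cluster provides it at runtime. -->
        <scope>provided</scope>
    </dependency>
</dependencies>
```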

Finally, we have two very simple Spark applications (in Java and Scala) that we use to demonstrate Maven. Each application has a dependency on the Apache Commons CSV Java library, so we can demonstrate how Maven handles dependencies.

Because Maven is specific to Java and Scala applications, no examples are provided for Python and R.

Building and Submitting an Application

Managed Library Dependencies

A key feature of Maven is the ability to download library dependencies when needed, without requiring them to be a local part of your project. In addition to Apache Spark, we also need to add the Scala library (for the Scala example only) and Commons CSV (for both Java and Scala):
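In the example POMs, these additions look roughly like the following (version numbers are illustrative and should match your environment):

```xml
<!-- Scala base classes: needed for the Scala example only. -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.11</version>
    <scope>provided</scope>
</dependency>

<!-- Commons CSV: used by both the Java and Scala examples. -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.4</version>
</dependency>
```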


The provided scope prevents the Scala base classes from being bundled into your application package (assuming that Scala is already provided in the target runtime environment).

If you don't know the groupId or artifactId of your dependency, you can probably find them on that dependency's website or in the Maven Central Repository.

  1. Let's build our example source code with Maven, using the package command against the POM file for your language.
  2. The package command compiles the source code in src/main/ and creates a JAR file of just the project code without any dependencies. The result is saved at target/original-building-maven-1.0-SNAPSHOT.jar.
  3. There are many more configuration options available in Maven that you may want to learn if you are serious about adopting Maven across your codebase. Refer to the Maven Documentation to learn more. For example, you might use repositories to identify alternate repositories for downloading dependencies.
  4. We can now run these applications using the familiar spark-submit script. We use the --packages parameter to include Commons CSV as a runtime dependency. Remember that we don't need to include Spark itself as a dependency, since it is implied by default. You can add other Maven IDs with a comma-separated list.
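The build and submit steps above might look like this for the Java example (the main class name is a placeholder; use the class defined in the example source code):

```shell
# Build the Java example with its managed-dependency POM.
mvn --file pom-java.xml package

# Submit the plain project JAR, pulling Commons CSV from a Maven
# repository at runtime via --packages.
spark-submit \
    --packages org.apache.commons:commons-csv:1.4 \
    --class buildingmaven.JBuildingMaven \
    target/original-building-maven-1.0-SNAPSHOT.jar
```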

Unmanaged Library Dependencies

An alternative to letting Maven handle your dependencies is to download them locally yourself. We can use the addjars-maven-plugin plugin to identify a local directory containing these unmanaged libraries.
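The plugin registration in the -local POM files looks roughly like this (the version number is illustrative):

```xml
<plugin>
    <groupId>com.googlecode.addjars-maven-plugin</groupId>
    <artifactId>addjars-maven-plugin</artifactId>
    <version>1.0.5</version>
    <executions>
        <execution>
            <goals>
                <goal>add-jars</goal>
            </goals>
            <configuration>
                <!-- Unmanaged JARs are picked up from this directory. -->
                <resources>
                    <resource>
                        <directory>${basedir}/lib</directory>
                    </resource>
                </resources>
            </configuration>
        </execution>
    </executions>
</plugin>
```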

  1. The plugin is defined in the pom-java-local.xml and pom-scala-local.xml POM files. This plugin works with both Java and Scala code. Our examples specify the lib/ directory for storing extra libraries.
  2. Next, manually download the Commons CSV library into the lib/ directory.
  3. Build the code using the -local version of the POM file. This results in the same JAR file as the previous approach.
  4. We can now run the applications using spark-submit with the --jars parameter to include Commons CSV as a runtime dependency. You can add other JARs with a comma-separated list.
  5. As a best practice, you should make sure that dependencies are not both managed and unmanaged. If you specify a managed dependency and also have a local copy in lib/, you may end up doing some late-night troubleshooting if the versions ever get out of sync. You should also review Spark's own assembly JAR, which is implicitly on the classpath when you run spark-submit. If a library you need is already a core dependency of Spark, including your own copy may lead to version conflicts.
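The unmanaged workflow above might look like this for the Java example (the main class name is a placeholder; use the class defined in the example source code):

```shell
# Download Commons CSV into the unmanaged lib/ directory.
wget -P lib/ https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.4/commons-csv-1.4.jar

# Build with the unmanaged-dependency POM.
mvn --file pom-java-local.xml package

# Submit, shipping the local JAR to the cluster with --jars.
spark-submit \
    --jars lib/commons-csv-1.4.jar \
    --class buildingmaven.JBuildingMaven \
    target/original-building-maven-1.0-SNAPSHOT.jar
```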

Creating an Assembly JAR

As the number of library dependencies increases, the network overhead of sending all of those files to each node in the Spark cluster increases as well. The official Spark documentation recommends creating a special JAR file containing both the application and all of its dependencies called an assembly JAR (or "uber" JAR) to reduce network churn. The assembly JAR contains a combined and flattened set of class and resource files -- it is not just a JAR file containing other JAR files.
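A minimal maven-shade-plugin registration, of the kind used in the example POMs, looks roughly like this (the version number is illustrative):

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.3</version>
    <executions>
        <!-- Run the shade goal as part of the package phase. -->
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```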

  1. We use the maven-shade-plugin plugin to generate an assembly JAR. This plugin is already registered in all of our example POM files.
  2. The assembly JAR file is automatically created whenever you run the mvn package command, and the JAR file is saved at target/building-maven-1.0-SNAPSHOT.jar.
  3. You can confirm the contents of the assembly JAR with the less command.
  4. You can now submit your application for execution using the assembly JAR. Because the dependencies are bundled inside, there is no need to use --jars or --packages.
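The build, inspect, and submit steps for the assembly JAR might look like this (the main class name is a placeholder; use the class defined in the example source code):

```shell
# Rebuild; the shade plugin produces the assembly JAR during the package phase.
mvn --file pom-java.xml package

# Inspect the flattened contents of the assembly JAR.
less target/building-maven-1.0-SNAPSHOT.jar

# Submit with no --jars or --packages: dependencies are bundled inside.
spark-submit \
    --class buildingmaven.JBuildingMaven \
    target/building-maven-1.0-SNAPSHOT.jar
```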

Reference Links

  1. Maven Documentation
  2. maven-compiler-plugin Plugin
  3. scala-maven-plugin Plugin
  4. maven-shade-plugin Plugin
  5. addjars-maven-plugin Plugin
  6. Building Spark Applications with SBT

Change Log

  • 2016-09-20: Updated for Spark 2.0.0. Code may not be backwards compatible with Spark 1.6.x (SPARKOUR-18).

Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.

Apache, Spark, and Apache Spark are trademarks of the Apache Software Foundation (ASF).
Sparkour is © 2016-2017 by Brian Uri!. It is an independent project that is not endorsed or supported by Novetta or the ASF.