Aggregating Results with Spark Accumulators

by Brian Uri!, 2016-05-07

Synopsis

This recipe explains how to use accumulators to aggregate results in a Spark application. Accumulators provide a safe way for multiple Spark workers to contribute information to a shared variable, which can then be read by the application driver.

Prerequisites

You need a development environment with your primary programming language and Apache Spark installed, as covered in Tutorial #4: Writing and Submitting a Spark Application.

Target Versions

The example code used in this recipe is written for Spark 2.x or higher to take advantage of the latest Accumulator API changes. You may need to make modifications to use it on an older version of Spark.
The SparkR API does not yet support accumulators at all. You can track the progress of this work in the SPARK-6815 ticket.

Introducing Accumulators

Accumulators are a built-in feature of Spark that allows multiple workers to write to a shared variable. When a job is submitted, Spark computes a closure consisting of all of the variables and methods required for a single executor to perform its operations, serializes that closure, and sends it to each worker node. Without accumulators, each task works on its own local copy of the variables in your application, and changes made to those copies are never sent back to the driver. This leads to unexpected results if you are trying to aggregate data from all of the workers, such as counting the number of failed records processed across the cluster: on a cluster, the driver's copy of the counter would typically still hold its initial value.
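The copy problem described above can be pictured without a cluster. The following is a minimal plain-Python sketch, not Spark code: it assumes that `pickle` is a fair stand-in for closure serialization, and the names `run_task_on_worker`, `driver_state`, and `partitions` are hypothetical.

```python
import pickle


def run_task_on_worker(serialized_state, records):
    """Simulate a worker: rebuild a private copy of the state and mutate it."""
    state = pickle.loads(serialized_state)  # the worker's own copy
    for record in records:
        if record < 0:  # assumption: negative numbers stand for failed records
            state["failed"] += 1
    return state["failed"]  # this count only exists on the simulated worker


driver_state = {"failed": 0}
partitions = [[1, -2, 3], [-4, -5, 6]]  # hypothetical data, one list per task

# The driver ships the same serialized state to every simulated worker.
shipped = pickle.dumps(driver_state)
worker_counts = [run_task_on_worker(shipped, part) for part in partitions]

print(worker_counts)           # each worker counted only its own partition
print(driver_state["failed"])  # the driver's copy was never updated
```

Each simulated worker sees only its own partition, and nothing flows back to `driver_state`. An accumulator fills that gap by giving the workers a write-only channel whose contributions are merged on the driver.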

Out of the box, Spark provides accumulators that aggregate numeric data, suitable for counting and summing. You can also create custom accumulators for other data types. You should consider using accumulators under the following conditions:

You need to collect some simple data across all worker nodes as a side effect of normal Spark operations, such as statistics about the work being performed or errors encountered.
The operation used to aggregate this data is both associative and commutative. In a distributed processing pipeline, the order and grouping of the data contributed by each worker cannot be guaranteed.
You do not need to read the data until all tasks have completed. Although any worker can write to an accumulator, only the application driver can see its value. Because of this, accumulators are not good candidates for monitoring task progress or live statistics.
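The associative and commutative requirement in the conditions above can be checked with a small experiment. This is a plain-Python sketch rather than Spark code: the helper `merged_results` and the sample data are hypothetical, and it varies only the arrival order of the per-task results, not their grouping.

```python
import operator
from functools import reduce
from itertools import permutations


def merged_results(partials, merge, zero):
    """Merge per-task partial results in every possible arrival order."""
    return {reduce(merge, order, zero) for order in permutations(partials)}


partial_counts = [3, 5, 2]       # hypothetical per-task error counts
partial_names = ["a", "b", "c"]  # hypothetical per-task strings

# Addition and max are order-independent: one possible outcome each.
sum_outcomes = merged_results(partial_counts, operator.add, 0)
max_outcomes = merged_results(partial_counts, max, 0)

# String concatenation is not commutative: the outcome depends on arrival order.
concat_outcomes = merged_results(partial_names, operator.add, "")

print(sum_outcomes, max_outcomes, len(concat_outcomes))
```

An operation that yields a single outcome here is a reasonable candidate for an accumulator. One that yields several, like concatenation, would produce a result that changes from run to run.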

Accumulators can be used in Spark transformations or actions, and obey the execution rules of the enclosing operation. Remember that transformations are "lazy" and not executed until your processing pipeline has reached an action. Because of this, an accumulator employed inside a transformation is not updated until a subsequent action is triggered, and reading it before then returns its initial value.
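Python's built-in `map` is lazy in a similar way, which makes it a convenient analogy for this behavior. The sketch below is plain Python, not Spark code: `Counter` is a hypothetical stand-in for an accumulator, `map` plays the role of a transformation, and `list` plays the role of an action.

```python
class Counter:
    """Hypothetical stand-in for an accumulator; not a Spark class."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


blank_lines = Counter()


def strip_and_count(line):
    """Strip whitespace and count blank lines as a side effect."""
    stripped = line.strip()
    if not stripped:
        blank_lines.add(1)
    return stripped


lines = ["alpha", "", "beta", "   "]

pipeline = map(strip_and_count, lines)   # lazy, like a transformation
value_before_action = blank_lines.value  # nothing has run yet

result = list(pipeline)                  # forces evaluation, like an action
value_after_action = blank_lines.value

print(value_before_action, value_after_action)
```

The counter stays at its initial value until something consumes the pipeline. In a Spark application the same pattern holds for an accumulator updated inside a transformation such as a map.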

You should limit your use of accumulators to Spark actions for several reasons. For one, Spark guarantees that each task's update to an accumulator inside an action is applied exactly once, even if the task is restarted, but no such guarantee covers accumulators in transformations. If a task fails for a hardware reason and is then re-executed, you might get duplicate values (or no value at all) written to an accumulator inside a transformation. Spark can also employ speculative execution when it is enabled (launching a duplicate copy of a slow-running task on another executor and keeping whichever copy finishes first), which could introduce duplicate accumulator data outside of an action.
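The difference between the two behaviors can be pictured with a simulated retry. This is a plain-Python sketch of the idea only and makes no claim about how Spark's scheduler is implemented: the functions `process` and `run_with_retry`, the failure position, and the sample records are all hypothetical.

```python
def process(records, fail_at=None):
    """Yield one update per record; raise midway to simulate a lost attempt."""
    for index, _record in enumerate(records):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated hardware failure")
        yield 1


def run_with_retry(records, apply_updates_immediately):
    """Run one task twice: the first attempt fails, the retry succeeds."""
    total = 0
    for fail_at in (2, None):
        local = 0  # updates buffered by this attempt
        try:
            for update in process(records, fail_at):
                if apply_updates_immediately:
                    total += update  # partial work from a failed attempt leaks in
                else:
                    local += update
        except RuntimeError:
            continue  # the failed attempt's buffered updates are discarded
        if not apply_updates_immediately:
            total += local  # merged once, only after the attempt succeeds
    return total


records = ["r1", "r2", "r3", "r4"]
leaky_total = run_with_retry(records, apply_updates_immediately=True)
safe_total = run_with_retry(records, apply_updates_immediately=False)

print(leaky_total, safe_total)
```

The leaky variant counts the two records handled before the simulated failure plus all four from the retry, while the buffered variant reports each record once. The second variant corresponds to the exactly-once behavior described for actions.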