Sparkour

Understanding the SparkSession in Spark 2.0

by Brian Uri!, 2016-09-24

Synopsis

This recipe introduces the new SparkSession class from Spark 2.0, which provides a unified entry point for all of the various Context classes previously found in Spark 1.x. There is no hands-on work involved.

Prerequisites

  1. There are no prerequisites for this recipe.

Target Versions

  1. The SparkSession class was introduced in Spark 2.0.0.

Introducing SparkSession

The SparkSession class is a new feature of Spark 2.0 that reduces the number of configuration and helper classes you need to instantiate before writing Spark applications. SparkSession provides a single entry point for many operations that were previously scattered across multiple classes, and also provides accessor methods to these older classes for maximum compatibility.

In interactive environments, such as the Spark Shell or interactive notebooks, a SparkSession is already created for you in a variable named spark. For consistency, you should use this name when you create one in your own application. You create a new SparkSession through a Builder pattern, which uses a "fluent interface" style of coding to build a new object by chaining methods together. Spark properties can be passed in through the builder's config() method.
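A minimal PySpark sketch of the builder pattern described above. The application name, the property values, and the helper name create_session are assumptions chosen for illustration; only appName(), config(), and getOrCreate() come from the SparkSession builder API.

```python
# Hypothetical property values, chosen only for illustration.
SPARK_PROPERTIES = {
    "spark.executor.memory": "2g",
    "spark.sql.shuffle.partitions": "8",
}


def create_session(app_name, properties):
    """Build (or reuse) a SparkSession using the fluent builder API."""
    # Imported here so this module loads even where PySpark is absent.
    from pyspark.sql import SparkSession

    # Each builder method returns the builder, so calls can be chained.
    builder = SparkSession.builder.appName(app_name)
    for key, value in properties.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In an application, you would typically assign the result to a variable named spark, for example spark = create_session("my_app", SPARK_PROPERTIES).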

The SparkR library doesn't use the "fluent interface" style. Simply pass parameters into the sparkR.session() function.

At the end of your application, calling stop() on the SparkSession implicitly stops any nested Context classes.
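One way to make sure the session is always stopped is to wrap the work in a try/finally block. The sketch below assumes an already created session. The helper name run_and_stop and its job parameter are hypothetical; the only Spark call it relies on is stop().

```python
def run_and_stop(spark, job):
    """Run job(spark), then stop the session even if the job raises."""
    try:
        return job(spark)
    finally:
        # Stopping the session also stops the SparkContext it wraps.
        spark.stop()
```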

In R, you call stop() on the sparkR object, rather than the session.

Updating Spark 1.x Applications

The developers of Spark 2.0 maintained backwards compatibility with Spark 1.x when they introduced SparkSession, so all of your existing code should still work in Spark 2.0. When you are ready to modernize your code, you should understand the relationships between the older classes and SparkSession.

SparkConf

Previously, this class was required to initialize configuration properties used by the SparkContext, as well as to set runtime properties while an application was running. Now, all initialization occurs through the SparkSession builder class. Runtime properties are set through the session's conf accessor, which returns a RuntimeConfig object, so you do not need to create a SparkConf manually.
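As a rough PySpark illustration of setting runtime properties through the conf accessor, assuming a session already exists. The helper name set_runtime_properties is hypothetical; spark.conf.set() and spark.conf.get() are the accessor calls being shown.

```python
def set_runtime_properties(spark, properties):
    """Set each property on a running session and return the values read back."""
    for key, value in properties.items():
        spark.conf.set(key, value)
    # Read the values back through the same accessor to confirm them.
    return {key: spark.conf.get(key) for key in properties}
```

For example, set_runtime_properties(spark, {"spark.sql.shuffle.partitions": "8"}) changes the shuffle partition count for later queries. Some properties are fixed once the application starts and cannot be changed this way.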

In R, you change properties by reinitializing the entire session.

SparkContext and JavaSparkContext

You will continue to use these classes (via the sparkContext accessor) to perform operations that require the Spark Core API, such as working with accumulators, broadcast variables, or low-level RDDs. However, you do not need to manually create them.
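A short PySpark sketch of reaching the Spark Core API through the sparkContext accessor, assuming a session already exists. The function name doubled_sum and the multiplier of 2 are made up for illustration; broadcast(), accumulator(), parallelize(), and foreach() are standard SparkContext and RDD calls.

```python
def doubled_sum(spark, numbers):
    """Sum twice each number using a broadcast variable and an accumulator."""
    sc = spark.sparkContext             # no manual SparkContext creation needed
    factor = sc.broadcast(2)            # read-only value shared with executors
    total = sc.accumulator(0)           # counter that executors can add to
    # foreach is an action, so the accumulator is updated when it runs.
    sc.parallelize(numbers).foreach(lambda n: total.add(n * factor.value))
    return total.value
```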

The low-level Spark Core API is not exposed in SparkR.

SQLContext

The SQLContext is completely superseded by SparkSession. Most Dataset and DataFrame operations are directly available in SparkSession. Operations related to table and database metadata are now encapsulated in a Catalog (via the catalog accessor).
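To illustrate the catalog accessor, here is a small PySpark sketch that assumes a session already exists. The function name describe_catalog is hypothetical; currentDatabase() and listTables() are methods of the Catalog object.

```python
def describe_catalog(spark):
    """Return the current database name and the names of its tables."""
    database = spark.catalog.currentDatabase()
    # listTables returns one metadata entry per table, each with a name field.
    tables = [table.name for table in spark.catalog.listTables(database)]
    return database, tables
```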

SparkR does not use a Catalog.

HiveContext

The HiveContext is completely superseded by SparkSession. You need to enable Hive support when you create your SparkSession and include the necessary Hive library dependencies in your classpath.
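In PySpark, enabling Hive support might look like the sketch below. The helper name create_hive_session is an assumption for illustration, and the code presumes the Hive dependencies are already on the classpath; enableHiveSupport() is the builder method that turns the feature on.

```python
def create_hive_session(app_name):
    """Build (or reuse) a SparkSession with Hive support enabled."""
    # Imported here so this module loads even where PySpark is absent.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )
```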


Reference Links

  1. SparkSession in the Java API Documentation
  2. SparkSession in the Python API Documentation
  3. SparkSession in the R API Documentation
  4. SparkSession in the Scala API Documentation

Change Log

  • This recipe hasn't had any substantive updates since it was first published.
