This recipe introduces the new SparkSession class from Spark 2.0, which provides a unified entry point for all of the various Context classes previously found in Spark 1.x. There is no hands-on work involved.
The SparkSession class is a new feature of Spark 2.0 that reduces the number of configuration and helper classes you need to instantiate before writing Spark applications. SparkSession provides a single entry point to perform many operations that were previously scattered across multiple classes, and also provides accessor methods to these older classes for maximum compatibility.
In interactive environments, such as the Spark Shell or interactive notebooks, a SparkSession is already created for you in a variable named spark. For consistency, you should use this name when you create one in your own application. You can create a new SparkSession through a Builder pattern, which uses a "fluent interface" style of coding to build a new object by chaining methods together. Spark properties can be passed in, as shown in these examples:
The SparkR library doesn't use the "fluent interface" style. Simply pass parameters into the function.
At the end of your application, calling stop() on the SparkSession implicitly stops any nested Context classes.
In R, you call the sparkR.session.stop() function rather than calling a method on the session object.
The developers of Spark 2.0 maintained backwards compatibility with Spark 1.x when they introduced SparkSession, so all of your existing code should still work in Spark 2.0. When you are ready to modernize your code, you should understand the relationships between the older classes and SparkSession.
Previously, this class was required to initialize configuration properties used by the SparkContext, as well as set runtime properties while an application was running. Now, all initialization occurs through the SparkSession builder class. You still use this class (via the conf accessor) to set runtime properties, but do not need to manually create it.
In R, you change properties by reinitializing the entire session.
You will continue to use these classes (via the sparkContext accessor) to perform operations that require the Spark Core API, such as working with accumulators, broadcast variables, or low-level RDDs. However, you do not need to manually create them.
The SQLContext is completely superseded by SparkSession. Most Dataset and DataFrame operations are directly available in SparkSession. Operations related to table and database metadata are now encapsulated in a Catalog (via the catalog accessor).
The HiveContext is completely superseded by SparkSession. You need to enable Hive support when you create your SparkSession and include the necessary Hive library dependencies in your classpath.
Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.