Sparkour

Installing and Configuring Apache Zeppelin

by Brian Uri!, 2016-09-22

Synopsis

This recipe explains how to install Apache Zeppelin and configure it to work with Spark. Interactive notebooks such as Zeppelin make it easier for analysts (who may not be software developers) to harness the power of Spark through iterative exploration and built-in visualizations.

Prerequisites

  1. You need a web-accessible server where you can install Zeppelin and a web browser to visit the user interface (UI). If you are installing on Amazon EC2 (such as the instance used in Tutorial #2: Installing Spark on Amazon EC2), the instance should have at least 1 GB of free memory available.

Target Versions

  1. This recipe uses Apache Zeppelin 0.7.1, which supports Spark 2.1.x. If you are targeting Spark 1.x, you should use Zeppelin 0.6.0.

Introducing Notebooks

While Apache Spark offers a robust, high performance engine for distributed data processing, it does not necessarily fit directly into the day-to-day workflow of data scientists and analysts. Data scientists may need to explore data in an ad hoc way, iteratively defining their data processing algorithms over successive executions and then rendering the output graphically so it is easily understandable. The algorithms and visualizations might then be reused and refined collaboratively or applied to alternate datasets.

Interactive Notebooks (such as Zeppelin, Jupyter, and Beaker) bridge the gap between the data scientist and the underlying engine, providing a web-based UI for real-time exploration of data and useful charts and graphs to quickly visualize output. This expands Spark's potential audience beyond software developers (who are already comfortable with programming languages and submitting Spark jobs for batch processing).

Notebooks connect to underlying data sources and engines through Interpreters. Zeppelin, in particular, ships with 20 built-in Interpreters for sources as varied as Elasticsearch, HDFS, JDBC, and Spark. This reduces the level of effort needed to use different engines at different phases in a data processing pipeline, and allows new engines to be supported in the future.

Installing Apache Zeppelin

  1. Visit the Apache Zeppelin - Download page to find the download link for the binary distribution you need. Because of Scala and Spark version differences, you should download Zeppelin 0.7.1 to use with Spark 2.1.x, 0.6.1 to use with Spark 2.0.x, or 0.6.0 to use with Spark 1.x. While it's theoretically possible to get newer versions of Zeppelin to work with older versions of Spark, you may end up spending more time than desired troubleshooting arcane version errors.
  2. On a web-accessible server, download and unpack the binary distribution.
  3. To complete your installation, add Zeppelin's bin directory to your PATH environment variable.
  4. Reload the environment variables (or log out and log in again) so the change takes effect.
  5. By default, Zeppelin will run on port 8080. If you have other software already using this port, you'll need to assign a different one. For example, the Spark Master UI also uses port 8080, so installing Zeppelin on the same server as a master node will cause conflicts.
  6. Finally, start up Zeppelin. Log files will be in /opt/zeppelin/logs if you need to troubleshoot anything.
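
The steps above can be sketched as a short shell script. This is a minimal sketch under stated assumptions, not the recipe's own commands: the function names are hypothetical, the download URL assumes the layout of the Apache archive and the `-bin-all` package, the install location `/opt/zeppelin` is inferred from the log path in step 6, and `~/.bash_profile` is assumed to be the login profile. The port is set through `ZEPPELIN_PORT` in `conf/zeppelin-env.sh`.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-6. Function names, the archive URL layout, and the
# profile file are illustrative assumptions, not part of the recipe.

# Step 1: pick the Zeppelin release that pairs with a given Spark version.
zeppelin_version_for_spark() {
    case "$1" in
        2.1.*) echo "0.7.1" ;;
        2.0.*) echo "0.6.1" ;;
        1.*)   echo "0.6.0" ;;
        *)     echo "unsupported Spark version: $1" >&2; return 1 ;;
    esac
}

# Build the download URL (assumed Apache archive layout).
zeppelin_download_url() {
    echo "https://archive.apache.org/dist/zeppelin/zeppelin-$1/zeppelin-$1-bin-all.tgz"
}

# Usage: install_zeppelin <spark-version> [port]
install_zeppelin() {
    local spark_version="$1" port="${2:-8080}" version archive
    version="$(zeppelin_version_for_spark "$spark_version")" || return 1
    archive="zeppelin-${version}-bin-all.tgz"

    # Step 2: download and unpack the binary distribution under /opt.
    curl -fLO "$(zeppelin_download_url "$version")" || return 1
    sudo tar -xzf "$archive" -C /opt || return 1
    sudo mv "/opt/zeppelin-${version}-bin-all" /opt/zeppelin || return 1
    sudo chown -R "$(id -un)" /opt/zeppelin

    # Steps 3-4: put Zeppelin's scripts on the PATH, now and for future logins.
    echo 'export PATH=$PATH:/opt/zeppelin/bin' >> "$HOME/.bash_profile"
    export PATH="$PATH:/opt/zeppelin/bin"

    # Step 5: set the port (pass something other than 8080 if it is taken).
    cp /opt/zeppelin/conf/zeppelin-env.sh.template /opt/zeppelin/conf/zeppelin-env.sh
    echo "export ZEPPELIN_PORT=${port}" >> /opt/zeppelin/conf/zeppelin-env.sh

    # Step 6: start the daemon; logs are written to /opt/zeppelin/logs.
    zeppelin-daemon.sh start
}
```

For example, `install_zeppelin 2.1.1 8180` would fetch Zeppelin 0.7.1 and start it on port 8180 (an arbitrary alternative), which avoids the conflict with a Spark Master UI on 8080. Running `zeppelin-daemon.sh stop` shuts the server down again.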

Testing Your Installation

  1. From a web browser, visit the hostname and port of the running Zeppelin server. An example of this URL is http://127.0.0.1:8080/. You should see the Zeppelin home page shown below. If you cannot connect to Zeppelin on an Amazon EC2 instance, make sure that your Security Groups allow traffic from the computer where your web browser is installed.
  2. Select the Basic Features (Spark) note (called Zeppelin Tutorial in Zeppelin 0.6.1). On your first visit, you will be taken to a Settings page listing all of the installed Interpreters, as shown below. Scroll down this list and select Save.
  3. The Basic Features note will appear onscreen, with multiple panes, as shown below. The top pane shows a welcome message, the next pane down provides some Scala code to generate sample data, and the three smaller panes provide some SQL code to query the data and generate visualizations. These panes are called Paragraphs. You can refer to the Zeppelin Documentation for a more descriptive walkthrough of Zeppelin features and the Tutorial paragraphs.
  4. Select the ▷ icon in the upper right corner of the Load data into table paragraph. This will load sample data out of Amazon S3 so it can be explored.
  5. Select the ▷ icon in one of the three visualization paragraphs. This will execute a SQL query against the sample data and render the results as a chart. You can select different chart types (e.g., table, bar, pie) and the results will be re-rendered without re-executing the query. You can also edit the SQL on the fly to run a different query.
  6. If running each Paragraph works without errors, Zeppelin has been installed successfully.
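
If the browser cannot reach the server in step 1, it can help to check from the command line whether Zeppelin is answering at all. The sketch below assumes `curl` is installed; the function names are hypothetical. It only confirms that the home page returns HTTP 200 and does not replace running the paragraphs in the note.

```shell
# Build the base URL of a Zeppelin server.
# Defaults match the example URL in step 1 (127.0.0.1, port 8080).
zeppelin_url() {
    echo "http://${1:-127.0.0.1}:${2:-8080}/"
}

# Return success if the server answers the home page with HTTP 200.
# Usage: zeppelin_is_up [host] [port]
zeppelin_is_up() {
    local status
    status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$(zeppelin_url "$@")")"
    [ "$status" = "200" ]
}
```

Running `zeppelin_is_up && echo "Zeppelin is responding"` on the server itself and then from your workstation helps separate a Zeppelin startup problem from a Security Group or firewall problem.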

Configuring Zeppelin to Use Spark

By default, Zeppelin's Spark Interpreter points at a local Spark cluster bundled with the Zeppelin distribution. Pointing it at an existing Spark cluster instead takes only a few steps.

  1. From any Zeppelin note page, click on the ⚙ icon at the top of the page to get to the Interpreter Binding page. Then, follow the link to the Interpreter page.
  2. Scroll down the list to the Spark Interpreter. As shown in the image below, the default Interpreter has its master property set to local[*]. This means that running a paragraph will run Spark locally on the Zeppelin server, using as many worker threads as there are cores available.
  3. Select the edit button and change the master property to a valid Spark cluster URL, such as spark://ip-172-31-24-101:7077. Make sure your network and AWS Security Groups allow traffic on this port between the Zeppelin server and the Spark cluster. Scroll down the page and select the Save button. You will be prompted to confirm the action.
  4. Return to the Basic Features (Spark) note (called Zeppelin Tutorial in Zeppelin 0.6.1) by way of the Notebook dropdown menu at the top of the page. Rerun the various paragraphs and Zeppelin should use your external Spark cluster. You can also visit the Master UI for your Spark cluster and see Zeppelin as a running application.

The configuration page for the Spark Interpreter also allows you to specify library dependencies from the local filesystem or from a Maven repository, as described in Dependency Management in the Zeppelin documentation.
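
Before changing the master property, you may want to confirm that the Zeppelin server can actually open a connection to the master port. The following is a minimal sketch, assuming a netcat (`nc`) build that supports the `-z` and `-w` options; the function names are hypothetical.

```shell
# Split a standalone master URL such as spark://host:7077 into "host port".
parse_spark_master() {
    case "$1" in
        spark://*:*) ;;
        *) echo "not a spark://host:port URL: $1" >&2; return 1 ;;
    esac
    local rest="${1#spark://}"
    echo "${rest%:*} ${rest##*:}"
}

# Return success if a TCP connection to the master port can be opened.
# Usage: spark_master_reachable spark://host:7077
spark_master_reachable() {
    local target
    target="$(parse_spark_master "$1")" || return 1
    # Re-split "host port" into positional parameters $1 and $2.
    set -- $target
    nc -z -w 5 "$1" "$2"
}
```

Run from the Zeppelin server, `spark_master_reachable spark://ip-172-31-24-101:7077` failing would suggest a network or Security Group problem rather than a Zeppelin misconfiguration.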

java.lang.NoSuchMethodError: scala.runtime.VolatileByteRef.create(B)Lscala/runtime/VolatileByteRef;

This error appears in the Zeppelin logs when there is a mismatch between Scala versions. Make sure that the Zeppelin distribution you have installed was built with the same version of Scala as your Spark distribution, as described in the "Installing Apache Zeppelin" section.
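
One way to compare the two Scala versions is to read them from jar file names. This is a hedged sketch: Spark 2.x ships a `scala-library` jar in its `jars` directory, while the location and `_2.xx` naming of Zeppelin's Spark interpreter jar (`interpreter/spark/zeppelin-spark_*.jar`) is an assumption about the binary distribution and may differ between releases. The function names are hypothetical.

```shell
# Scala binary version (e.g. 2.11) of a Spark 2.x install, read from the
# scala-library jar in its jars/ directory.
spark_scala_version() {
    ls "$1"/jars/scala-library-*.jar 2>/dev/null | head -n 1 |
        sed -n 's/.*scala-library-\([0-9]*\.[0-9]*\)\..*/\1/p'
}

# Scala binary version of Zeppelin's Spark interpreter, read from the
# _2.xx suffix of the interpreter jar (ASSUMED name and location).
zeppelin_scala_version() {
    ls "$1"/interpreter/spark/zeppelin-spark_*.jar 2>/dev/null | head -n 1 |
        sed -n 's/.*zeppelin-spark_\([0-9]*\.[0-9]*\)-.*/\1/p'
}

# Print both versions and return success only if they are known and equal.
# Usage: scala_versions_match <spark-home> <zeppelin-home>
scala_versions_match() {
    local spark zeppelin
    spark="$(spark_scala_version "$1")"
    zeppelin="$(zeppelin_scala_version "$2")"
    echo "Spark: Scala ${spark:-unknown}, Zeppelin: Scala ${zeppelin:-unknown}"
    [ -n "$spark" ] && [ "$spark" = "$zeppelin" ]
}
```

For example, `scala_versions_match "$SPARK_HOME" /opt/zeppelin` prints both versions. If either shows as unknown, the jar layout differs from what this sketch assumes and the versions need to be checked by hand.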

Reference Links

  1. Zeppelin Documentation
  2. Interpreter Installation in the Zeppelin Documentation
  3. Dependency Management in the Zeppelin Documentation

Change Log

  • 2017-06-09: Updated for Zeppelin 0.7.1 (SPARKOUR-25).

Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically? If so, please help me improve Sparkour by opening a ticket on the Issues page. You can also discuss this recipe with others in the Sparkour community on the Discussion page.

Apache, Spark, and Apache Spark are trademarks of the Apache Software Foundation (ASF).
Sparkour is © 2016 - 2017. It is an independent project that is not endorsed or supported by Novetta or the ASF.