Installing and Configuring Apache Zeppelin
by Brian Uri!, 2016-09-22
Synopsis
This recipe explains how to install Apache Zeppelin and configure it to work with Spark. Interactive notebooks
such as Zeppelin make it easier for analysts (who may not be software developers) to harness the power of Spark
through iterative exploration and built-in visualizations.
Prerequisites
- You need a web-accessible server where you can install Zeppelin and a web browser to visit the user interface
(UI). If you are installing on Amazon EC2 (such as the instance used in Tutorial #2: Installing Spark on Amazon EC2),
it should have at least 1 GB of free memory available.
Target Versions
- This recipe uses Apache Zeppelin 0.8.0, which supports Spark 2.2.x. If you are targeting Spark 1.x, you should use Zeppelin
0.6.0.
⇖ Introducing Notebooks
While Apache Spark offers a robust, high-performance engine for distributed data processing, it does not necessarily
fit directly into the day-to-day workflow of data scientists and analysts. Data scientists may need
to explore data in an ad hoc way, iteratively defining their data processing algorithms over successive executions and
then rendering the output graphically so it is easily understandable. The algorithms and visualizations might then be
reused and refined collaboratively or applied to alternate datasets.
Interactive Notebooks (such as Zeppelin, Jupyter, and Beaker) bridge the gap between the
data scientist and the underlying engine, providing a web-based UI for real-time exploration of data and useful
charts and graphs to quickly visualize output. This expands Spark's potential audience beyond software developers
(who are already comfortable with programming languages and submitting Spark jobs for batch processing).
Notebooks connect to underlying data sources and engines through Interpreters. Zeppelin,
in particular, ships with 20 built-in Interpreters for sources as varied as Elasticsearch, HDFS, JDBC, and Spark. This reduces
the level of effort needed to use different engines at different phases in a data processing pipeline, and allows
new engines to be supported in the future.
⇖ Installing Apache Zeppelin
- Visit the Apache Zeppelin - Download page to find the
download link for the binary distribution you need. Because of Scala and Spark version differences, you should
download Zeppelin 0.8.0 to use with Spark 2.2.x or 0.6.0 to use with Spark 1.x.
While it's theoretically possible to get newer versions of Zeppelin to work with
older versions of Spark, you may end up spending more time than desired troubleshooting arcane version errors.
- On a web-accessible server, download and unpack the binary distribution.
- To complete your installation, add the Zeppelin bin directory to your PATH
environment variable.
- Reload the environment variables (or log out and log in again) so
the change takes effect.
- By default, Zeppelin runs on port 8080. If you have other software already using this port,
you'll need to assign a different one. For example, the Spark Master UI also uses port 8080,
so installing Zeppelin on the same server as a master node will cause conflicts.
- Finally, start up Zeppelin. Log files are found in /opt/zeppelin/logs if you
need to troubleshoot anything.
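The installation steps above can be sketched as a small set of shell functions. This is a minimal sketch, not the definitive procedure: the archive URL, the /opt install prefix, the use of ~/.bash_profile, and the replacement port 8081 are all assumptions to adapt to your server. The function names are hypothetical helpers introduced here for illustration.

```shell
# Sketch only: the version, archive URL, install prefix, and port below are
# assumptions to adapt, not values mandated by this recipe.
ZEPPELIN_VERSION="0.8.0"   # use 0.6.0 when targeting Spark 1.x
ZEPPELIN_DIST="zeppelin-${ZEPPELIN_VERSION}-bin-all"
ZEPPELIN_URL="https://archive.apache.org/dist/zeppelin/zeppelin-${ZEPPELIN_VERSION}/${ZEPPELIN_DIST}.tgz"
ZEPPELIN_HOME="/opt/zeppelin"   # consistent with the /opt/zeppelin/logs path
ZEPPELIN_PORT="8081"            # hypothetical alternative to the default 8080

# Download the binary distribution and unpack it under /opt.
download_and_unpack() {
    wget "$ZEPPELIN_URL" -O "/tmp/${ZEPPELIN_DIST}.tgz"
    sudo tar -xzf "/tmp/${ZEPPELIN_DIST}.tgz" -C /opt
    sudo chown -R "$(id -un)" "/opt/${ZEPPELIN_DIST}"
    # A version-independent symlink keeps later paths short.
    sudo ln -sfn "/opt/${ZEPPELIN_DIST}" "$ZEPPELIN_HOME"
}

# Put the Zeppelin scripts on the PATH for future logins and for this shell.
add_to_path() {
    echo "export PATH=\$PATH:${ZEPPELIN_HOME}/bin" >> "$HOME/.bash_profile"
    export PATH="$PATH:${ZEPPELIN_HOME}/bin"
}

# Override the default port 8080 (only needed if something else uses it).
set_port() {
    cp "${ZEPPELIN_HOME}/conf/zeppelin-env.sh.template" "${ZEPPELIN_HOME}/conf/zeppelin-env.sh"
    echo "export ZEPPELIN_PORT=${ZEPPELIN_PORT}" >> "${ZEPPELIN_HOME}/conf/zeppelin-env.sh"
}

# Start the Zeppelin daemon; logs are written to ${ZEPPELIN_HOME}/logs.
start_zeppelin() {
    "${ZEPPELIN_HOME}/bin/zeppelin-daemon.sh" start
}
```

Running download_and_unpack, add_to_path, set_port (if needed), and start_zeppelin in that order mirrors the list above. Keeping each step in its own function makes it easy to rerun a single step after a failure.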
Testing Your Installation
- From a web browser, visit the hostname and port of the running Zeppelin server. An example of this URL is
http://127.0.0.1:8080/. You should see the Zeppelin home page shown below. If you
cannot connect to Zeppelin on an Amazon EC2 instance, make sure that your Security Groups allow traffic from
the computer where your web browser is installed.
- Expand the Zeppelin Tutorial and select the Basic Features (Spark) note.
On your first visit, you are taken to a Settings page listing all of the installed Interpreters, as shown below.
Scroll down this list and select Save.
- The Basic Features note appears onscreen, with multiple panes, as shown below. The top pane shows a welcome message,
the next pane down provides some Scala code to generate sample data, and the three smaller panes provide some SQL code to query the data and
generate visualizations. These panes are called Paragraphs. You can refer to the
Zeppelin Documentation
for a more descriptive walkthrough of Zeppelin features and the Tutorial paragraphs.
- Select the ▷ icon in the upper right corner of the Load data into table paragraph. This
loads sample data from Amazon S3 so it can be explored.
- Select the ▷ icon in one of the three visualization paragraphs. This executes a SQL query against the sample
data and renders the results as some form of chart. You can select different chart types (e.g., table, bar, pie) and
the results are re-rendered without re-executing the query. You can also edit the SQL query on the fly to run
a different query.
- If running each Paragraph works without errors, Zeppelin has been installed successfully.
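If the home page does not load in the browser, a quick command-line check can help separate a server problem from a network or Security Group problem. The sketch below assumes curl is installed and that your Zeppelin version exposes the /api/version REST endpoint. The host and port are the example values from this recipe, and check_zeppelin is a hypothetical helper name.

```shell
# Example host and port from this recipe; substitute your own server's values.
ZEPPELIN_HOST="127.0.0.1"
ZEPPELIN_PORT="8080"
ZEPPELIN_BASE_URL="http://${ZEPPELIN_HOST}:${ZEPPELIN_PORT}"

# Succeeds (exit status 0) if the server answers within 5 seconds.
# Assumes the /api/version endpoint is available in your Zeppelin version.
check_zeppelin() {
    curl --silent --fail --max-time 5 "${ZEPPELIN_BASE_URL}/api/version"
}
```

Run check_zeppelin on the server itself first, then from the machine where your browser is installed. If it succeeds locally but fails remotely, the likely cause is network filtering rather than Zeppelin.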
⇖ Configuring Zeppelin to Use Spark
By default, Zeppelin's Spark Interpreter points at a local Spark cluster bundled with the Zeppelin distribution.
Pointing it at an existing Spark cluster instead takes only a few steps.
- From any Zeppelin note page, click on the ⚙ icon at the top of the page to get to the
Interpreter Binding page. Then, follow the link to the Interpreter
page.
- Scroll down the list to the Spark Interpreter. As shown in the image below, the default Interpreter has its
master property set to local[*]. This means that running a paragraph
runs Spark locally, with as many worker threads as there are cores available.
- Select the edit button and change the master property
to a valid Spark cluster URL, such as spark://ip-172-31-24-101:7077. Make sure your network
and AWS Security Groups allow traffic on this port between the Zeppelin server and the Spark cluster.
Scroll down the page and select the Save button. You will be prompted to confirm the action.
- Return to the Zeppelin Tutorial note by way of the Notebook dropdown menu at the top of the page.
Rerun the various paragraphs and Zeppelin should use your external Spark cluster. You can also visit
the Master UI for your Spark cluster and see Zeppelin as a running application.
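Before rerunning the paragraphs, it may be worth confirming from the Zeppelin server that the Spark master port is reachable. The sketch below splits the example master URL from this recipe into host and port and probes it. It assumes a netcat build that supports the -z and -w options, and check_master is a hypothetical helper name.

```shell
# Example master URL from this recipe; substitute your own cluster's URL.
SPARK_MASTER_URL="spark://ip-172-31-24-101:7077"

# Strip the scheme, then split the remainder into host and port.
hostport="${SPARK_MASTER_URL#spark://}"
SPARK_MASTER_HOST="${hostport%:*}"
SPARK_MASTER_PORT="${hostport##*:}"

# Succeeds (exit status 0) if a TCP connection can be opened within 5 seconds.
# Assumes a netcat variant that supports -z (scan only) and -w (timeout).
check_master() {
    nc -z -w 5 "$SPARK_MASTER_HOST" "$SPARK_MASTER_PORT"
}
```

A failing check_master usually points at a firewall or Security Group rule, or at a master that is not listening on the expected address, rather than at the Zeppelin configuration.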
The configuration page for the Spark Interpreter also allows you to specify library dependencies from the local filesystem
or from a Maven repository, as described in Dependency Management
in the Zeppelin documentation.
java.lang.NoSuchMethodError: scala.runtime.VolatileByteRef.create(B)Lscala/runtime/VolatileByteRef;
This error appears in the Zeppelin logs when there is a mismatch between Scala versions. Make sure that the Zeppelin distribution you have
installed was built with the same version of Scala as your Spark distribution, as described in the "Installing Apache Zeppelin" section.
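One way to find the Scala version on the Spark side is to read it from the banner printed by spark-submit --version. The sketch below assumes spark-submit is on your PATH and that the banner contains a line of the form "Using Scala version 2.11.8, ...". Both function names are hypothetical helpers.

```shell
# Extract the Scala binary version (such as 2.11) from banner text on stdin.
scala_binary_version() {
    sed -n 's/.*Using Scala version \([0-9]*\.[0-9]*\).*/\1/p'
}

# The banner may be written to stderr, so merge it into stdout before parsing.
spark_scala_version() {
    spark-submit --version 2>&1 | scala_binary_version
}
```

Compare the result of spark_scala_version with the Scala version your Zeppelin distribution was built against. If the two differ, installing the matching Zeppelin release is usually simpler than troubleshooting the mismatch.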
Spot any inconsistencies or errors? See things that could be explained better or code that could be written more idiomatically?
If so, please help me improve Sparkour by opening a ticket on the Issues page.
You can also discuss this recipe with others in the Sparkour community on the Discussion page.