Jupyter Notebooks

This tutorial will go over the basics of using Jupyter notebooks on Hopsworks.

Jupyter Notebooks Basics

Open the Jupyter Service

Jupyter is provided as a micro-service on Hopsworks and can be found in the main UI inside a project.

Figure: Open the Jupyter service on Hopsworks

Start a Jupyter notebook server

When you start a Jupyter notebook server you can specify Spark properties for the notebooks you will create on the server. Hopsworks provides Machine-Learning-as-a-Service with TensorFlow and Spark, supporting distributed training, parallel experiments, hyperparameter tuning, and model serving (HopsML). If you are doing machine learning on Hops, you probably want to select one of the notebook server types “Experiment”, “Parallel Experiment” or “Distributed training”, as shown in the figure below. See HopsML for more information on the machine learning pipeline. For general-purpose notebooks, select the type “Spark (Static)” or “Spark (Dynamic)”.

Figure: Start a Jupyter notebook server

Jupyter + Spark on Hopsworks

As a user, you will just interact with the Jupyter notebooks, but below you can find a detailed explanation of the technology behind the scenes.

When using Jupyter on Hopsworks, a library called sparkmagic is used to interact with the Hops cluster. When you create a Jupyter notebook on Hopsworks, you first select a kernel. A kernel is simply a program that executes the code in your Jupyter cells; you can think of the notebook as the frontend and the kernel as the REPL backend that runs your code.

Sparkmagic works with a remote REST server for Spark, called Livy, running inside the Hops cluster. Livy is the interface that Jupyter-on-Hopsworks uses to interact with the Hops cluster. When you run Jupyter cells using the pyspark kernel, the kernel automatically sends the commands to Livy in the background, which executes them on the cluster. Thus, the work that happens in the background when you run a Jupyter cell is as follows (a sketch of this REST interaction is shown after the list):

  • The code in the cell first goes to the kernel.
  • Next, the kernel sends the code as an HTTP REST request to Livy.
  • When receiving the REST request, Livy executes the code on the Spark driver in the cluster.
  • If the code is regular python/scala/R code, it runs inside a python/scala/R interpreter on the Spark driver.
  • If the code includes a spark command, using the spark session, a spark job is launched on the cluster from the Spark driver.
  • When the python/scala/R or spark execution is finished, the results are sent back from Livy to the pyspark kernel/sparkmagic.
  • Finally, the pyspark kernel displays the result in the Jupyter notebook.
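
For the curious, the following is a minimal sketch of the kind of REST calls sparkmagic makes against Livy on your behalf. The Livy URL, session handling and payloads are illustrative assumptions; on Hopsworks this is all managed for you and you never need to talk to Livy directly:

    import json
    import requests

    LIVY_URL = "http://livy.example.com:8998"   # hypothetical Livy endpoint
    HEADERS = {"Content-Type": "application/json"}

    # 1. Start an interactive pyspark session (allocates a Spark driver in the cluster).
    session = requests.post(LIVY_URL + "/sessions",
                            data=json.dumps({"kind": "pyspark"}),
                            headers=HEADERS).json()

    # 2. Submit the code from a notebook cell as a statement against that session.
    stmt = requests.post("%s/sessions/%d/statements" % (LIVY_URL, session["id"]),
                         data=json.dumps({"code": "1 + 1"}),
                         headers=HEADERS).json()

    # 3. Poll for the result, which the kernel then renders in the notebook cell.
    result = requests.get("%s/sessions/%d/statements/%d"
                          % (LIVY_URL, session["id"], stmt["id"])).json()
    print(result["state"], result.get("output"))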

The three Jupyter kernels we support on Hopsworks are:

  • Spark, a kernel for executing scala code and interacting with the cluster through spark-scala
  • PySpark, a kernel for executing python code and interacting with the cluster through pyspark
  • SparkR, a kernel for executing R code and interacting with the cluster through spark-R

All notebooks make use of Spark, since that is the standard way to allocate resources and run jobs in the cluster.

In the rest of this tutorial we will focus on the pyspark kernel.

Pyspark notebooks

Create a pyspark notebook

After you have started the Jupyter notebook server, you can create a pyspark notebook from the Jupyter dashboard:

Figure: Create a pyspark notebook

When you execute the first cell in a pyspark notebook, the spark session is automatically created, connected to the Hops cluster.

Figure: SparkSession creation with the pyspark kernel

The notebook will look just like any python notebook, with the difference that the python interpreter is actually running on a Spark driver in the cluster. You can execute regular python code:

Figure: Executing python code on the spark driver in the cluster
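
For example, a cell with plain python code like the one below runs in the python interpreter on the Spark driver, not on your own machine:

    import socket

    # Ordinary python: a list comprehension and some printing, executed on the Spark driver.
    squares = [x * x for x in range(10)]
    print(sum(squares))
    print("Running on host: " + socket.gethostname())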

Since you are executing on the Spark driver, you can also launch jobs on the Spark executors in the cluster; the spark session is available as the variable spark in the notebook:

Figure: Starting a spark job from Jupyter
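
As a small illustration, a cell like the following uses the pre-created spark variable to run a job on the executors (the data and numbers are arbitrary):

    # `spark` is created for you by sparkmagic/Livy; there is no need to build your own SparkSession.
    df = spark.range(1000)                                            # distributed DataFrame with ids 0..999
    total = df.selectExpr("sum(id) as total").collect()[0]["total"]   # triggers a Spark job on the executors
    print(total)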

When you execute a cell in Jupyter that starts a Spark job, you can go back to the Hopsworks Jupyter UI and you will see a link to the SparkUI for the job that has been created.

Figure: Opening the SparkUI in Hopsworks

Figure: The SparkUI in Hopsworks

In addition to a regular python interpreter and the spark cluster, you also have access to magic commands provided by sparkmagic. You can view a list of all commands by executing a cell with %%help:

Figure: Printing a list of all sparkmagic commands
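
That is, run a cell whose only content is the help magic:

    %%help

This prints a table describing the available magics, including the %%sql, %%spark and %%local magics used later in this tutorial.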

Plotting with Pyspark Kernel

So far throughout this tutorial, the Jupyter notebook has behaved more or less identically to how it would if you started the notebook server locally on your machine using a python kernel, without access to a Hadoop cluster. However, there is one main difference from a user standpoint when using pyspark notebooks instead of regular python notebooks, and that is plotting.

Since the code in a pyspark notebook is executed remotely, in the spark cluster, regular python plotting will not work. What you can do, however, is use sparkmagic to download your remote spark dataframe as a local pandas dataframe and plot it using matplotlib, seaborn, or sparkmagic's built-in visualization. To do this we use the magics %%sql, %%spark, and %%local. The steps for plotting from a pyspark notebook are illustrated below. Using this approach, you can have large-scale cluster computation and plotting in the same notebook.

Step 1: Create a remote Spark Dataframe:

Figure: Creating a spark dataframe
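
A minimal example of such a cell is sketched below; the data, the column names and the temporary view name my_table are made up for illustration:

    # Runs on the Spark driver: build a small Spark DataFrame and register it for %%sql.
    from pyspark.sql import Row

    rows = [Row(name="row_%d" % i, value=float(i)) for i in range(100)]
    df = spark.createDataFrame(rows)
    df.createOrReplaceTempView("my_table")   # makes the dataframe queryable from the %%sql magic
    df.show(5)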

Step 2: Download the Spark Dataframe to a local Pandas Dataframe using %%sql or %%spark:

Note: you should not try to download large spark dataframes for plotting. When you plot a dataframe, the entire dataframe must fit into memory, so add the flag --maxrows x to limit the dataframe size when you download it to the local Jupyter server for plotting.

Using %%sql:

Figure: Downloading the spark dataframe to a pandas dataframe using %%sql
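
For example, a cell like the following (querying the hypothetical my_table view registered in step 1) stores the result as a local pandas dataframe named df_local, limited to 100 rows:

    %%sql -o df_local --maxrows 100
    SELECT name, value FROM my_table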

Using %%spark:

Figure: Downloading the spark dataframe to a pandas dataframe using %%spark
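
Equivalently with %%spark, where -o names the remote spark dataframe to download as a local pandas dataframe; the variable names below are again assumptions carried over from the earlier sketch, and the flags mirror the %%sql example and the note above:

    %%spark -o df_local --maxrows 100
    df_local = df.select("name", "value")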

Step 3: Plot the pandas dataframe using Python plotting libraries:

When you download a dataframe from spark to pandas with sparkmagic, it gives you a default visualization of the data using autovizwidget, as you saw in the screenshots above. However, sometimes you want custom plots, using matplotlib or seaborn. To do this, use the sparkmagic %%local magic to access the local pandas dataframe, and then you can plot as usual. Just make sure that you have your plotting libraries (e.g. matplotlib or seaborn) installed on the Jupyter machine; contact a system administrator if they are not already installed.

Figure: Import plotting libraries locally on the Jupyter server

Figure: Plot a local pandas dataframe using seaborn and the magic %%local

Figure: Plot a local pandas dataframe using matplotlib and the magic %%local
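
Putting it together, a %%local cell along the following lines plots the downloaded pandas dataframe on the Jupyter server; the dataframe name df_local and the column names are the assumptions carried over from the earlier sketches:

    %%local
    %matplotlib inline
    import matplotlib.pyplot as plt

    # df_local is the local pandas dataframe downloaded with -o in step 2.
    plt.hist(df_local["value"], bins=20)
    plt.xlabel("value")
    plt.ylabel("count")
    plt.show()

A seaborn plot works the same way: import seaborn inside the %%local cell and pass it the same local pandas dataframe.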

Want to Learn More?

We have provided a large number of example notebooks, available here. Go to Hopsworks and try them out! You can do this either by taking one of the built-in tours on Hopsworks, or by uploading one of the example notebooks to your project and running it through the Jupyter service. You can also have a look at HopsML, which enables large-scale distributed deep learning on Hops.