This tutorial requires a basic installation of Open Data Hub with Spark and JupyterHub, as detailed in the quick installation. All screenshots and instructions are from OpenShift 4.1. For this tutorial, we used try.openshift.com on AWS. The tutorials have also been tested on CodeReady Containers with 16 GB of RAM.
The source for the following notebook is available in GitLab with comments for easy viewing.
Exploring JupyterHub and Spark
JupyterHub and Spark are installed by default with Open Data Hub. You can create Jupyter Notebooks and connect to Spark. This is a simple
- Find the route to JupyterHub. Within your Open Data Hub project, click on Networking -> Routes.
- For the route named `jupyterhub`, click on the location to bring up JupyterHub (typically
- Sign in using your OpenShift credentials.
- Spawn a new server with Spark functionality. (e.g.
- Create a new Python 3 notebook.
- Copy the following code to test a basic Spark connection.
```python
from pyspark.sql import SparkSession, SQLContext
import os
import socket

# Add the necessary Hadoop and AWS jars to access Ceph from Spark
# Can be omitted if s3 storage access is not required
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4 pyspark-shell'

# create a spark session
spark = SparkSession.builder.master('local').getOrCreate()

# test your spark connection
spark.range(5, numPartitions=5).rdd.map(lambda x: socket.gethostname()).distinct().collect()
```
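- Run the notebook. If successful, the output is a list of the distinct hostnames that ran the job. With `master('local')` this is a single entry: the hostname of your notebook server.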
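The `PYSPARK_SUBMIT_ARGS` value in the test cell is a single hand-written string, which is easy to mistype. As a minimal sketch, the same value can be assembled from a list of Maven coordinates. The helper name `build_submit_args` and the constant `S3_PACKAGES` are illustrative and not part of Open Data Hub or PySpark; the coordinates are copied from the cell above.

```python
def build_submit_args(packages):
    """Return a PYSPARK_SUBMIT_ARGS value that requests the given Maven packages."""
    if not packages:
        # No extra jars requested: keep only the trailing token the tutorial's value ends with.
        return "pyspark-shell"
    return "--packages " + ",".join(packages) + " pyspark-shell"


# Coordinates copied from the tutorial cell.
S3_PACKAGES = [
    "org.apache.hadoop:hadoop-aws:2.7.3",
    "com.amazonaws:aws-java-sdk:1.7.4",
]

submit_args = build_submit_args(S3_PACKAGES)
print(submit_args)

# In the notebook, set it before creating the SparkSession:
# os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
```

If S3 access is not required, passing an empty list gives the value without any `--packages` option.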
Let’s add on to the notebook from the previous section and access data on an Object Store (such as Ceph or AWS S3) using the S3 API. For instructions on installing Ceph, refer to the Advanced Installation documentation.
- Click on the `+` button and insert a new cell below of type
- To access S3 directly, we’ll use the boto3 library. We’ll download a sample data file and then upload it to our S3 storage. In the new cell, paste the following code and edit the `s3_` variables with your own credentials.
```python
# Edit this section using your own credentials
s3_region = 'region-1'  # fill in for AWS, blank for Ceph
s3_endpoint_url = 'https://s3.storage.server'
s3_access_key_id = 'AccessKeyId-ChangeMe'
s3_secret_access_key = 'SecretAccessKey-ChangeMe'
s3_bucket = 'MyBucket'

# for easy download
!pip install wget

import wget
import boto3

# configure boto S3 connection
s3 = boto3.client('s3',
                  s3_region,
                  endpoint_url=s3_endpoint_url,
                  aws_access_key_id=s3_access_key_id,
                  aws_secret_access_key=s3_secret_access_key)

# download the sample data file
url = "https://gitlab.com/opendatahub/opendatahub.io/raw/master/assets/files/tutorials/basic/sample_data.csv"
file = wget.download(url=url, out='sample_data.csv')

# upload the file to storage
s3.upload_file(file, s3_bucket, "sample_data.csv")
```
- Run the cell. After it completes, check your S3 bucket. You should see the
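The bucket can also be checked from the notebook rather than from a storage console. The sketch below assumes the `s3` client from the cell above and uses boto3's `list_objects_v2` call. The helper `bucket_has_key` and the `FakeS3Client` class are hypothetical names introduced here; the fake client only exists so the example runs without a real object store.

```python
def bucket_has_key(s3_client, bucket, key):
    """Return True if `key` is listed in `bucket`."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key)
    # "Contents" is absent from the response when nothing matches the prefix.
    return any(obj["Key"] == key for obj in response.get("Contents", []))


class FakeS3Client:
    """Hypothetical stand-in for a boto3 client so the sketch runs offline."""

    def __init__(self, keys):
        self._keys = list(keys)

    def list_objects_v2(self, Bucket, Prefix=""):
        matches = [{"Key": k} for k in self._keys if k.startswith(Prefix)]
        return {"Contents": matches} if matches else {}


fake = FakeS3Client(["sample_data.csv"])
print(bucket_has_key(fake, "MyBucket", "sample_data.csv"))  # True
print(bucket_has_key(fake, "MyBucket", "missing.csv"))      # False
```

In the notebook, the equivalent call would be `bucket_has_key(s3, s3_bucket, "sample_data.csv")`.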
Spark + Object Storage
Now, let’s access that same data file from Spark so you can analyze data.
- Click on the `+` button and insert a new cell of type
- Paste the following code to read the data from Spark and print some data.
```python
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key_id)
hadoopConf.set("fs.s3a.secret.key", s3_secret_access_key)
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "true")  # false if not https

data = spark.read.csv('s3a://' + s3_bucket + '/sample_data.csv', sep=",", header=True)
df = data.toPandas()
df.head()
```
- Run the cell. The data from the CSV file should be displayed as a pandas DataFrame.
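The cell above hard-codes `fs.s3a.connection.ssl.enabled` and notes that it must be `false` for a non-https endpoint. One possible way to avoid editing that line by hand is to derive it from the endpoint URL. This is only a sketch: `s3a_settings` and `s3a_uri` are hypothetical helpers, and the endpoint `http://ceph.example.local:8080` is a made-up value for illustration.

```python
from urllib.parse import urlparse


def s3a_settings(endpoint_url, access_key, secret_key):
    """Build the fs.s3a.* options used in the tutorial cell as a plain dict."""
    use_ssl = urlparse(endpoint_url).scheme == "https"
    return {
        "fs.s3a.endpoint": endpoint_url,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",
        # Hadoop configuration values are strings, hence "true"/"false".
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }


def s3a_uri(bucket, key):
    """Return the s3a:// URI that Spark reads from."""
    return "s3a://" + bucket + "/" + key


# Made-up plain-http endpoint, e.g. a Ceph gateway inside the cluster.
settings = s3a_settings("http://ceph.example.local:8080", "key", "secret")
print(settings["fs.s3a.connection.ssl.enabled"])
print(s3a_uri("MyBucket", "sample_data.csv"))
```

In the notebook, the dict could then be applied with a loop such as `for name, value in settings.items(): hadoopConf.set(name, value)`, and the URI passed to `spark.read.csv`.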
That’s it! You have a working Jupyter notebook with access to storage and Spark.