Open Data Hub 0.5.0 Release Guide

What is included?

Open Data Hub 0.5.0 includes many new tools that are essential to a comprehensive AI/ML end-to-end platform. Open Data Hub is a meta-operator that can be installed on OpenShift Container Platform 3.11 and 4.x.

The following is a list of tools added to Open Data Hub in this release:

Technology | Version | Category
JupyterHub CUDA GPU Images and Notebooks | 3.0.7 | Support for building CUDA GPU Images and GPU Notebook
Apache Superset | 0.34.0 | Data Exploration and Visualization Tool
Data Catalog (Hue, Spark Thrift Server, Hive Metastore) | Hue 4.4.1 & Spark Thrift Server 2.4 & Hive Metastore 1.2.1 | Deployment of Hue, Spark Thrift Server and Hive Metastore to simplify querying data lakes using Spark SQL language
Argo | 2.4.2 | Container native workflow engine

You can review the release notes for components added in the previous v0.4.0 release here

AICoE-JupyterHub CUDA GPU Images and Notebooks

AICoE-JupyterHub now has support for accessing NVIDIA GPUs from Jupyter notebooks. In this release, we added:

  • Documentation on how to enable GPU nodes in your OpenShift cluster
  • Support for building CUDA GPU images and notebooks as part of the component deployment process

You can test these new features by following the Data Engineering and Machine Learning workshop. The tf-training-serving contains a demonstration of how you can create OpenShift Jobs to access a cluster GPU.
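
As a rough illustration of what a GPU-enabled Job can look like, the following minimal Python sketch builds a Kubernetes batch/v1 Job manifest that requests one GPU through the nvidia.com/gpu resource limit. It is not taken from the workshop: the helper name gpu_job_manifest, the Job name, and the image are hypothetical placeholders, and it assumes the NVIDIA device plugin is already advertising GPUs on your nodes.

```python
import json


def gpu_job_manifest(name, image, command, gpus=1):
    """Build a Kubernetes batch/v1 Job manifest that requests NVIDIA GPUs.

    Hypothetical helper for illustration only; it is not part of Open Data Hub.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry the pod if it fails.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": list(command),
                            # GPUs are requested as an extended resource limit.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = gpu_job_manifest(
        name="gpu-smoke-test",                     # hypothetical Job name
        image="example.com/cuda-notebook:latest",  # hypothetical image
        command=["nvidia-smi"],
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and created with oc create -f, since oc accepts JSON manifests as well as YAML.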

Apache Superset

Apache Superset is a data exploration and visualization tool. Instructions for deploying and creating an example database & chart are available in Deploy Superset Setup.

Data Catalog (Tech Preview)

The Data Catalog is a set of components with which you can run Data Exploration on your Data Lake. These components are:

  • Hive Metastore to store metadata about the Hive tables
  • Spark SQL Thrift server to expose an ODBC/JDBC endpoint to interact with the Hive tables
  • Hue to view S3 object store, connect to Spark SQL Thrift server and run queries, as well as create dashboards.

For more information on the Data Catalog, please review the Data Catalog announcement and tutorial. The Data Catalog is currently designated as “Tech Preview” as we enable support for additional features available in Hue.

Argo

Argo has been updated to version 2.4.2. Argo is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. It is useful for defining workflows using containers, running compute-intensive jobs, and orchestrating DAG container pipelines natively on Kubernetes.
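
To make the idea of a DAG container pipeline more concrete, here is a minimal Python sketch that assembles an Argo Workflow manifest in which each task runs a container and declares the tasks it depends on. The helper name dag_workflow, the task names, and the container image are assumptions chosen for illustration; the manifest layout follows the argoproj.io/v1alpha1 Workflow format.

```python
import json


def dag_workflow(name, image, dependencies):
    """Build an Argo Workflow manifest with a single DAG template.

    `dependencies` maps each task name to the list of task names it waits for.
    Hypothetical helper for illustration only.
    """
    tasks = []
    for task, deps in dependencies.items():
        entry = {
            "name": task,
            "template": "echo",
            "arguments": {"parameters": [{"name": "message", "value": task}]},
        }
        if deps:
            # Tasks without dependencies start immediately and run in parallel.
            entry["dependencies"] = list(deps)
        tasks.append(entry)
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": name + "-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {"name": "main", "dag": {"tasks": tasks}},
                {
                    "name": "echo",
                    "inputs": {"parameters": [{"name": "message"}]},
                    "container": {
                        "image": image,
                        "command": ["echo", "{{inputs.parameters.message}}"],
                    },
                },
            ],
        },
    }


if __name__ == "__main__":
    # Diamond-shaped pipeline: A runs first, B and C in parallel, then D.
    workflow = dag_workflow(
        name="dag-diamond",
        image="alpine:3.10",  # hypothetical image choice
        dependencies={"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]},
    )
    print(json.dumps(workflow, indent=2))
```

In this sketch, B and C both depend only on A, so Argo would be free to schedule them at the same time, while D waits for both to finish.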

To learn more about deploying Argo in Open Data Hub, please visit link