Info alert:Important Notice
Please note that more information about the previous v2 releases can be found here. You can use "Find a release" search bar to search for a particular release.
Getting started with Open Data Hub
Overview
Open Data Hub is an artificial intelligence (AI) platform that provides tools to rapidly train, serve, and monitor machine learning (ML) models onsite, in the public cloud, or at the edge.
Open Data Hub provides a powerful AI/ML platform for building AI-enabled applications. Data scientists and MLOps engineers can collaborate to move from experiment to production in a consistent environment quickly.
Data science workflow
For the purpose of getting you started with Open Data Hub, Figure 1 illustrates a simplified data science workflow. The real world process of developing ML models is an iterative one.
The simplified data science workflow for predictive AI use cases includes the following tasks:
-
Defining your business problem and setting goals to solve it.
-
Gathering, cleaning, and preparing data. Data often has to be federated from a range of sources, and exploring and understanding data plays a key role in the success of a data science project.
-
Evaluating and selecting ML models for your business use case.
-
Train models for your business use case by tuning model parameters based on your set of training data. In practice, data scientists train a range of models, and compare performance while considering tradeoffs such as time and memory constraints.
-
Integrate models into an application, including deployment and testing. After model training, the next step of the workflow is production. Data scientists are often responsible for putting the model in production and making it accessible so that a developer can integrate the model into an application.
-
Monitor and manage deployed models. Depending on the organization, data scientists, data engineers, or ML engineers must monitor the performance of models in production, tracking prediction and performance metrics.
-
Refine and retrain models. Data scientists can evaluate model performance results and refine models to improve outcome by excluding or including features, changing the training data, and modifying other configuration parameters.
About this guide
This guide assumes you are familiar with data science and ML Ops concepts. It describes the following tasks to get you started with using Open Data Hub:
-
Log in to the Open Data Hub dashboard
-
Create a data science project
-
If you have data stored in Object Storage, configure a connection to more easily access it
-
Create a workbench and choose an IDE, such as JupyterLab or code-server, for your data scientist development work
-
Learn where to get information about the next steps:
-
Developing and training a model
-
Automating the workflow with pipelines
-
Implementing distributed workloads
-
Testing your model
-
Deploying your model
-
Monitoring and managing your model
-
Logging in to Open Data Hub
After you install Open Data Hub, log in to the Open Data Hub dashboard so that you can set up your development and deployment environment.
-
You know the Open Data Hub identity provider and your login credentials.
-
If you are a data scientist, data engineer, or ML engineer, your administrator must provide you with the Open Data Hub instance URL, for example:
https:://odh-dashboard-odh.apps.ocp4.example.com
-
-
You have the latest version of one of the following supported browsers:
-
Google Chrome
-
Mozilla Firefox
-
Safari
-
-
Browse to the Open Data Hub instance URL and click Log in with OpenShift.
-
If you have access to OpenShift Container Platform, you can browse to the OpenShift Container Platform web console and click the Application Launcher () → Open Data Hub.
-
-
Click the name of your identity provider, for example,
GitHub
,Google
, or your company’s single sign-on method. -
Enter your credentials and click Log in (or equivalent for your identity provider).
-
The Open Data Hub dashboard opens on the Home page.
Creating a data science project
To implement a data science workflow, you must create a project. In OpenShift, a project is a Kubernetes namespace with additional annotations, and is the main way that you can manage user access to resources. A project organizes your data science work in one place and also allows you to collaborate with other developers and data scientists in your organization.
Within a project, you can add the following functionality:
-
Connections so that you can access data without having to hardcode information like endpoints or credentials.
-
Workbenches for working with and processing data, and for developing models.
-
Deployed models so that you can test them and then integrate them into intelligent applications. Deploying a model makes it available as a service that you can access by using an API.
-
Pipelines for automating your ML workflow.
-
You have logged in to Open Data Hub.
-
If you are using Open Data Hub groups, you are part of the user group or admin group (for example,
odh-users
orodh-admins
) in OpenShift. -
You have the appropriate roles and permissions to create projects.
-
From the Open Data Hub dashboard, select Data Science Projects.
The Data Science Projects page shows a list of projects that you can access. For each user-requested project in the list, the Name column shows the project display name, the user who requested the project, and the project description.
-
Click Create project.
-
In the Create project dialog, update the Name field to enter a unique display name for your project.
-
Optional: If you want to change the default resource name for your project, click Edit resource name.
The resource name is what your resource is labeled in OpenShift. Valid characters include lowercase letters, numbers, and hyphens (-). The resource name cannot exceed 30 characters, and it must start with a letter and end with a letter or number.
Note: You cannot change the resource name after the project is created. You can edit only the display name and the description.
-
Optional: In the Description field, provide a project description.
-
Click Create.
-
A project details page opens. From this page, you can add connections, create workbenches, configure pipelines, and deploy models.
Creating a workbench and selecting an IDE
A workbench is an isolated area where you can examine and work with ML models. You can also work with data and run programs, for example to prepare and clean data. While a workbench is not required if, for example, you only want to service an existing model, one is needed for most data science workflow tasks, such as writing code to process data or training a model.
When you create a workbench, you specify an image (an IDE, packages, and other dependencies). IDEs include JupyterLab and code-server.
The IDEs are based on a server-client architecture. Each IDE provides a server that runs in a container on the OpenShift cluster, while the user interface (the client) is displayed in your web browser. For example, the Jupyter notebook server runs in a container on the Red Hat OpenShift cluster. The client is the JupyterLab interface that opens in your web browser on your local computer. All of the commands that you enter in JupyterLab are executed by the notebook server. Similarly, other IDEs like code-server or RStudio Server provide a server that runs in a container on the OpenShift cluster, while the user interface is displayed in your web browser. This architecture allows you to interact through your local computer in a browser environment, while all processing occurs on the cluster. The cluster provides the benefits of larger available resources and security because the data being processed never leaves the cluster.
In a workbench, you can also configure connections (to access external data for training models and to save models so that you can deploy them) and cluster storage (for persisting data). Workbenches within the same project can share models and data through object storage with the data science pipelines and model servers.
For data science projects that require data retention, you can add container storage to the workbench you are creating.
Within a project, you can create multiple workbenches. When to create a new workbench depends on considerations, such as the following:
-
The workbench configuration (for example, CPU, RAM, or IDE). If you want to avoid editing the configuration of an existing workbench’s configuration to accommodate a new task, you can create a new workbench instead.
-
Separation of tasks or activities. For example, you might want to use one workbench for your Large Language Models (LLM) experimentation activities, another workbench dedicated to a demo, and another workbench for testing.
About workbench images
A workbench image (sometimes referred to as a notebook image) is optimized with the tools and libraries that you need for model development. You can use the provided workbench images or an Open Data Hub administrator can create custom workbench images adapted to your needs.
To provide a consistent, stable platform for your model development, many provided workbench images contain the same version of Python. Most workbench images available on Open Data Hub are pre-built and ready for you to use immediately after Open Data Hub is installed or upgraded.
The following table lists the workbench images that are installed with Open Data Hub by default.
If the preinstalled packages that are provided in these images are not sufficient for your use case, you have the following options:
-
Install additional libraries after launching a default image. This option is good if you want to add libraries on an ad hoc basis as you develop models. However, it can be challenging to manage the dependencies of installed libraries and your changes are not saved when the workbench restarts.
-
Create a custom image that includes the additional libraries or packages. For more information, see Creating custom workbench images.
Image name | Description |
---|---|
CUDA |
If you are working with compute-intensive data science models that require GPU support, use the Compute Unified Device Architecture (CUDA) workbench image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can optimize your work by using GPU-accelerated libraries and optimization tools. |
Standard Data Science |
Use the Standard Data Science workbench image for models that do not require TensorFlow or PyTorch. This image contains commonly-used libraries to assist you in developing your machine learning models. |
TensorFlow |
TensorFlow is an open source platform for machine learning. With TensorFlow, you can build, train and deploy your machine learning models. TensorFlow contains advanced data visualization features, such as computational graph visualizations. It also allows you to easily monitor and track the progress of your models. |
PyTorch |
PyTorch is an open source machine learning library optimized for deep learning. If you are working with computer vision or natural language processing models, use the Pytorch workbench image. |
Minimal Python |
If you do not require advanced machine learning features, or additional resources for compute-intensive data science work, you can use the Minimal Python image to develop your models. |
TrustyAI |
Use the TrustyAI workbench image to leverage your data science work with model explainability, tracing, and accountability, and runtime monitoring. See the TrustyAI Explainability repository for some example Jupyter notebooks. |
code-server |
With the code-server workbench image, you can customize your workbench environment to meet your needs using a variety of extensions to add new languages, themes, debuggers, and connect to additional services. Enhance the efficiency of your data science work with syntax highlighting, auto-indentation, and bracket matching, as well as an automatic task runner for seamless automation. For more information, see code-server in GitHub. NOTE: Elyra-based pipelines are not available with the code-server workbench image. |
RStudio Server |
Use the RStudio Server workbench image to access the RStudio IDE, an integrated development environment for R, a programming language for statistical computing and graphics.
For more information, see the RStudio Server site. |
CUDA - RStudio Server |
Use the CUDA - RStudio Server workbench image to access the RStudio IDE and NVIDIA CUDA Toolkit. RStudio is an integrated development environment for R, a programming language for statistical computing and graphics. With the NVIDIA CUDA toolkit, you can optimize your work using GPU-accelerated libraries and optimization tools.
For more information, see the RStudio Server site. |
Creating a workbench
When you create a workbench, you specify an image (an IDE, packages, and other dependencies). You can also configure connections, cluster storage, and add container storage.
-
You have logged in to Open Data Hub.
-
If you use Open Data Hub groups, you are part of the user group or admin group (for example,
odh-users
orodh-admins
) in OpenShift. -
You created a project.
-
From the Open Data Hub dashboard, click Data Science Projects.
The Data Science Projects page opens.
-
Click the name of the project that you want to add the workbench to.
A project details page opens.
-
Click the Workbenches tab.
-
Click Create workbench.
The Create workbench page opens.
-
In the Name field, enter a unique name for your workbench.
-
Optional: If you want to change the default resource name for your workbench, click Edit resource name.
The resource name is what your resource is labeled in OpenShift. Valid characters include lowercase letters, numbers, and hyphens (-). The resource name cannot exceed 30 characters, and it must start with a letter and end with a letter or number.
Note: You cannot change the resource name after the workbench is created. You can edit only the display name and the description.
-
Optional: In the Description field, enter a description for your workbench.
-
In the Notebook image section, complete the fields to specify the workbench image to use with your workbench.
From the Image selection list, select a workbench image that suits your use case. A workbench image includes an IDE and Python packages (reusable code). Optionally, click View package information to view a list of packages that are included in the image that you selected.
If the workbench image has multiple versions available, select the workbench image version to use from the Version selection list. To use the latest package versions, Red Hat recommends that you use the most recently added image.
NoteYou can change the workbench image after you create the workbench. -
In the Deployment size section, from the Container size list, select a container size for your server. The container size specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both.
-
Optional: In the Environment variables section, select and specify values for any environment variables.
Setting environment variables during the workbench configuration helps you save time later because you do not need to define them in the body of your notebooks, or with the IDE command line interface.
If you are using S3-compatible storage, add these recommended environment variables:
-
AWS_ACCESS_KEY_ID
specifies your Access Key ID for Amazon Web Services. -
AWS_SECRET_ACCESS_KEY
specifies your Secret access key for the account specified inAWS_ACCESS_KEY_ID
.
Open Data Hub stores the credentials as Kubernetes secrets in a protected namespace if you select Secret when you add the variable.
-
-
In the Cluster storage section, configure the storage for your workbench. Select one of the following options:
-
Create new persistent storage to create storage that is retained after you shut down your workbench. Complete the relevant fields to define the storage:
-
Enter a name for the cluster storage.
-
Enter a description for the cluster storage.
-
Select a storage class for the cluster storage.
NoteYou cannot change the storage class after you add the cluster storage to the workbench. -
Under Persistent storage size, enter a new size in gibibytes or mebibytes.
-
-
Use existing persistent storage to reuse existing storage and select the storage from the Persistent storage list.
-
-
Optional: You can add a connection to your workbench. A connection is a resource that contains the configuration parameters needed to connect to a data source or sink, such as an object storage bucket. You can use storage buckets for storing data, models, and pipeline artifacts. You can also use a connection to specify the location of a model that you want to deploy.
In the Connections section, use an existing connection or create a new connection:
-
Use an existing connection as follows:
-
Click Attach existing connections.
-
From the Connection list, select a connection that you previously defined.
-
-
Create a new connection as follows:
-
Click Create connection. The Add connection dialog appears.
-
From the Connection type drop-down list, select the type of connection. The Connection details section appears.
-
If you selected S3 compatible object storage in the preceding step, configure the connection details:
-
In the Connection name field, enter a unique name for the connection.
-
Optional: In the Description field, enter a description for the connection.
-
In the Access key field, enter the access key ID for the S3-compatible object storage provider.
-
In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.
-
In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.
-
In the Region field, enter the default region of your S3-compatible object storage account.
-
In the Bucket field, enter the name of your S3-compatible object storage bucket.
-
Click Create.
-
-
If you selected URI in the preceding step, configure the connection details:
-
In the Connection name field, enter a unique name for the connection.
-
Optional: In the Description field, enter a description for the connection.
-
In the URI field, enter the Uniform Resource Identifier (URI).
-
Click Create.
-
-
-
-
Click Create workbench.
-
The workbench that you created appears on the Workbenches tab for the project.
-
Any cluster storage that you associated with the workbench during the creation process appears on the Cluster storage tab for the project.
-
The Status column on the Workbenches tab displays a status of Starting when the workbench server is starting, and Running when the workbench has successfully started.
-
Optional: Click the Open link to open the IDE in a new window.
Next steps
The following product documentation provides more information on how to develop, test, and deploy data science solutions with Open Data Hub.
- Develop and train a model in your workbench IDE
-
Working in your data science IDE
Learn how to access your workbench IDE (JupyterLab, code-server, or RStudio Server).
For the JupyterLab IDE, learn about the following tasks:
-
Creating and importing notebooks
-
Using Git to collaborate on notebooks
-
Viewing and installing Python packages
-
Troubleshooting common problems
-
- Automate your ML workflow with pipelines
-
Working with data science pipelines
Enhance your data science projects on Open Data Hub by building portable machine learning (ML) workflows with data science pipelines, by using Docker containers. Use pipelines for continuous retraining and updating of a model based on newly received data.
- Deploy and test a model
-
Deploy your ML models on your OpenShift cluster to test and then integrate them into intelligent applications. When you deploy a model, it is available as a service that you can access by using API calls. You can return predictions based on data inputs that you provide through API calls.
- Monitor and manage models
-
The Open Data Hub service includes model deployment options for hosting the model on Red Hat OpenShift Dedicated or Red Hat Openshift Service on AWS for integration into an external application.
- Add accelerators to optimize performance
-
If you work with large data sets, you can use accelerators, such as NVIDIA GPUs and Intel Gaudi AI accelerators, to optimize the performance of your data science models in Open Data Hub. With accelerators, you can scale your work, reduce latency, and increase productivity.
- Implement distributed workloads for higher performance
-
Working with distributed workloads
Implement distributed workloads to use multiple cluster nodes in parallel for faster, more efficient data processing and model training.
- Explore extensions
-
Working with connected applications
Extend your core Open Data Hub solution with integrated third-party applications. Several leading AI/ML software technology partners, including Starburst, Intel AI Tools, Anaconda, and IBM are also available through Red Hat Marketplace.
Additional resources
On the Resources page of the Open Data Hub dashboard, you can use the category links to filter the resources for various stages of your data science workflow. For example, click the Model serving category to display resources that describe various methods of deploying models. Click All items to show the resources for all categories.
For the selected category, you can apply additional options to filter the available resources. For example, you can filter by type, such as how-to articles, quick starts, or tutorials; these resources provide the answers to common questions.