Working with distributed workloads
- Overview of distributed workloads
- Managing custom training images
- Running distributed workloads
- Monitoring distributed workloads
- Tuning a model by using the Training Operator
- Troubleshooting common problems with distributed workloads for users
- My Ray cluster is in a suspended state
- My Ray cluster is in a failed state
- I see a failed to call webhook error message for the CodeFlare Operator
- I see a failed to call webhook error message for Kueue
- My Ray cluster doesn’t start
- I see a Default Local Queue … not found error message
- I see a local_queue provided does not exist error message
- I cannot create a Ray cluster or submit jobs
- My pod provisioned by Kueue is terminated before my image is pulled
To train complex machine-learning models or process data more quickly, you can use the distributed workloads feature to run your jobs on multiple OpenShift worker nodes in parallel. This approach significantly reduces the task completion time, and enables the use of larger datasets and more complex models.
Overview of distributed workloads
You can use the distributed workloads feature to queue, scale, and manage the resources required to run data science workloads across multiple nodes in an OpenShift cluster simultaneously. Typically, data science workloads include several types of artificial intelligence (AI) workloads, including machine learning (ML) and Python workloads.
Distributed workloads provide the following benefits:
-
You can iterate faster and experiment more frequently because of the reduced processing time.
-
You can use larger datasets, which can lead to more accurate models.
-
You can use complex models that could not be trained on a single node.
-
You can submit distributed workloads at any time, and the system then schedules the distributed workload when the required resources are available.
The distributed workloads infrastructure includes the following components:
- CodeFlare Operator
-
Secures deployed Ray clusters and grants access to their URLs
- CodeFlare SDK
-
Defines and controls the remote distributed compute jobs and infrastructure for any Python-based environment
Note: The CodeFlare SDK is not installed as part of Open Data Hub, but it is contained in some of the notebook images provided by Open Data Hub.
- KubeRay
-
Manages remote Ray clusters on OpenShift for running distributed compute workloads
- Kueue
-
Manages quotas and how distributed workloads consume them, and manages the queueing of distributed workloads with respect to quotas
You can run distributed workloads from data science pipelines, from Jupyter notebooks, or from Microsoft Visual Studio Code files.
Note: Data science pipelines workloads are not managed by the distributed workloads feature, and are not included in the distributed workloads metrics.
Managing custom training images
To run distributed training jobs, you can use one of the base training images that are provided with Open Data Hub, or you can create your own custom training images. You can optionally push your custom training images to the integrated OpenShift image registry, to make your images available to other users.
About base training images
The base training images for distributed workloads are optimized with the tools and libraries that you need to run distributed training jobs. You can use the provided base images, or you can create custom images that are specific to your needs.
The following table lists the training images that are installed with Open Data Hub by default.
Image type | Description |
---|---|
Ray CUDA | If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Ray Compute Unified Device Architecture (CUDA) base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs. |
Ray ROCm | If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs. |
KFTO CUDA | If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Kubeflow Training Operator (KFTO) CUDA base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs. |
KFTO ROCm | If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the KFTO ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs. |
If the preinstalled packages that are provided in these images are not sufficient for your use case, you have the following options:
-
Install additional libraries after launching a default image. This option is good if you want to add libraries on an ad hoc basis as you run training jobs. However, it can be challenging to manage the dependencies of installed libraries.
-
Create a custom image that includes the additional libraries or packages. For more information, see Creating a custom training image.
Creating a custom training image
You can create a custom training image by adding packages to a base training image.
-
You can access the training image that you have chosen to use as the base for your custom image.
-
You have Podman installed in your local environment, and you can access a container registry.
For more information about Podman and container registries, see Building, running, and managing containers.
-
In a terminal window, create a directory for your work, and change to that directory.
-
Set the
IMG
environment variable to the name of your image. In the example commands in this section, my_training_image is the name of the image:
export IMG=my_training_image
-
Create a file named
Dockerfile
with the following content:
-
Use the
FROM
instruction to specify the location of a suitable base training image. The Python version in the training image must be the same as the Python version in the workbench.
-
To create a CUDA-compatible Ray cluster image, specify the location of a CUDA-compatible Ray base image, as shown in the following examples:
CUDA-compatible Ray base image with Python 3.9:
FROM quay.io/modh/ray:2.35.0-py39-cu121
CUDA-compatible Ray base image with Python 3.11:
FROM quay.io/modh/ray:2.35.0-py311-cu121
-
To create a ROCm-compatible Ray cluster image, specify the location of a ROCm-compatible Ray base image, as shown in the following examples:
ROCm-compatible Ray base image with Python 3.9:
FROM quay.io/modh/ray:2.35.0-py39-rocm61
ROCm-compatible Ray base image with Python 3.11:
FROM quay.io/modh/ray:2.35.0-py311-rocm61
-
To create a CUDA-compatible KFTO cluster image, specify the CUDA-compatible KFTO base image location:
CUDA-compatible KFTO base image with Python 3.11:
FROM quay.io/modh/training:py311-cuda121-torch241
-
To create a ROCm-compatible KFTO cluster image, specify the ROCm-compatible KFTO base image location:
ROCm-compatible KFTO base image with Python 3.11:
FROM quay.io/modh/training:py311-rocm61-torch241
-
-
Use the
RUN
instruction to install additional packages. You can also add comments to the Dockerfile by prefixing each comment line with a number sign (#
). The following example shows how to install a specific version of the Python PyTorch package:
# Install PyTorch
RUN python3 -m pip install torch==2.4.0
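For example, a complete Dockerfile that combines these instructions might look like the following minimal sketch, which assumes the Python 3.11 CUDA-compatible Ray base image and the PyTorch installation shown above:

# Base training image; choose one that matches your workbench Python version
FROM quay.io/modh/ray:2.35.0-py311-cu121

# Install additional packages
RUN python3 -m pip install torch==2.4.0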
-
-
Build the image file. Use the
-t
option with the podman build command to create an image tag that specifies the image name and version, to make it easier to reference and manage the image:
podman build -t <image-name>:<version> -f Dockerfile
Example:
podman build -t ${IMG}:0.0.1 -f Dockerfile
The build output indicates when the build process is complete.
-
Display a list of your images:
podman images
If your new image was created successfully, it is included in the list of images.
-
Push the image to your container registry:
podman push ${IMG}:0.0.1
-
Optional: Make your new image available to other users, as described in Pushing an image to the integrated OpenShift image registry.
Pushing an image to the integrated OpenShift image registry
To make an image available to other users in your OpenShift cluster, you can push the image to the integrated OpenShift image registry, a built-in container image registry.
For more information about the integrated OpenShift image registry, see Integrated OpenShift image registry.
-
Your cluster administrator has exposed the integrated image registry, as described in Exposing the registry.
-
You have Podman installed in your local environment.
For more information about Podman and container registries, see Building, running, and managing containers.
-
In a terminal window, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
-
Set the
IMG
environment variable to the name of your image. In the example commands in this section,my_training_image
is the name of the image:
export IMG=my_training_image
-
Log in to the integrated image registry:
podman login -u $(oc whoami) -p $(oc whoami -t) $(oc registry info)
-
Tag the image for the integrated image registry:
podman tag ${IMG} $(oc registry info)/$(oc project -q)/${IMG}
-
Push the image to the integrated image registry:
podman push $(oc registry info)/$(oc project -q)/${IMG}
-
Retrieve the image repository location for the tag that you want:
oc get is ${IMG} -o jsonpath='{.status.tags[?(@.tag=="<TAG>")].items[0].dockerImageReference}'
Any user can now use your image by specifying this retrieved image location value in the
image
parameter of a Ray cluster or training job.
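For example, a notebook that creates a Ray cluster might reference the retrieved image location as follows. This is a minimal sketch: the image value is a placeholder for the location that you retrieved in the previous step, and the cluster name and worker count are example values.

from codeflare_sdk import Cluster, ClusterConfiguration

# Placeholder: paste the image location returned by the `oc get is` command
image_ref = "<retrieved-image-location>"

cluster = Cluster(ClusterConfiguration(
    name="raytest",      # example cluster name
    num_workers=1,       # example worker count
    image=image_ref,     # your custom training image
))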
Running distributed workloads
In Open Data Hub, you can run a distributed workload from a notebook or from a pipeline.
You can run distributed workloads in a disconnected environment if you can access all of the required software from that environment. For example, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.
Running distributed data science workloads from notebooks
To run a distributed workload from a notebook, you must configure a Ray cluster. You must also provide environment-specific information such as cluster authentication details.
The examples in this section refer to the JupyterLab integrated development environment (IDE).
Downloading the demo notebooks from the CodeFlare SDK
The demo notebooks from the CodeFlare SDK provide guidelines on how to use the CodeFlare stack in your own notebooks. Download the demo notebooks so that you can learn how to run the notebooks locally.
-
You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
-
You have Admin access for the data science project.
-
If you created the project, you automatically have Admin access.
-
If you did not create the project, your cluster administrator must give you Admin access.
-
-
You have logged in to Open Data Hub, started your workbench, and logged in to JupyterLab.
-
In the JupyterLab interface, click File > New > Notebook. Specify your preferred Python version, and then click Select.
A new notebook is created in an
.ipynb
file. -
Add the following code to a cell in the new notebook:
Code to download the demo notebooks:
from codeflare_sdk import copy_demo_nbs
copy_demo_nbs()
-
Select the cell, and click Run > Run selected cell.
After a few seconds, the
copy_demo_nbs()
function copies the demo notebooks that are packaged with the currently installed version of the CodeFlare SDK into the demo-notebooks folder.
-
In the left navigation pane, right-click the new notebook and click Delete.
-
Click Delete to confirm.
Locate the downloaded demo notebooks in the JupyterLab interface, as follows:
-
In the left navigation pane, double-click demo-notebooks.
-
Double-click additional-demos and verify that the folder contains several demo notebooks.
-
Click demo-notebooks.
-
Double-click guided-demos and verify that the folder contains several demo notebooks.
You can run these demo notebooks as described in Running the demo notebooks from the CodeFlare SDK.
Running the demo notebooks from the CodeFlare SDK
To run the demo notebooks from the CodeFlare SDK, you must provide environment-specific information.
In the examples in this procedure, you edit the demo notebooks in JupyterLab to provide the required information, and then run the notebooks.
-
You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You can access the following software from your data science cluster:
-
A Ray cluster image that is compatible with your hardware architecture
-
The data sets and models to be used by the workload
-
The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
-
-
You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
-
You have Admin access for the data science project.
-
If you created the project, you automatically have Admin access.
-
If you did not create the project, your cluster administrator must give you Admin access.
-
-
You have logged in to Open Data Hub, started your workbench, and logged in to JupyterLab.
-
You have downloaded the demo notebooks provided by the CodeFlare SDK, as described in Downloading the demo notebooks from the CodeFlare SDK.
-
Check whether your cluster administrator has defined a default local queue for the Ray cluster.
You can use the
codeflare_sdk.list_local_queues()
function to view all local queues in your current namespace, and the resource flavors associated with each local queue (see the sketch at the end of this step).
Alternatively, you can use the OpenShift web console as follows:
-
In the OpenShift web console, select your project from the Project list.
-
Click Search, and from the Resources list, select LocalQueue to show the list of local queues for your project.
If no local queue is listed, contact your cluster administrator.
-
Review the details of each local queue:
-
Click the local queue name.
-
Click the YAML tab, and review the
metadata.annotations
section. If the
kueue.x-k8s.io/default-queue
annotation is set to 'true'
, the queue is configured as the default local queue.
Note: If your cluster administrator does not define a default local queue, you must specify a local queue in each notebook.
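The following minimal sketch shows the SDK-based check. It assumes that you have already authenticated to the OpenShift cluster, as described later in this procedure:

import codeflare_sdk

# List the local queues in the current namespace and their resource flavors
print(codeflare_sdk.list_local_queues())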
-
-
-
In the JupyterLab interface, open the demo-notebooks > guided-demos folder.
-
Open all of the notebooks by double-clicking each notebook file.
Notebook files have the
.ipynb
file name extension. -
In each notebook, ensure that the
import
section imports the required components from the CodeFlare SDK, as follows:
Example import section:
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication
-
In each notebook, update the
TokenAuthentication
section to provide thetoken
andserver
details to authenticate to the OpenShift cluster by using the CodeFlare SDK.
You can find your token and server details as follows:
-
In the Open Data Hub top navigation bar, click the application launcher icon () and then click OpenShift Console to open the OpenShift web console.
-
In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
-
After you have logged in, click Display Token.
-
In the Log in with this token section, find the required values as follows:
-
The
token
value is the text after the--token=
prefix. -
The
server
value is the text after the--server=
prefix.
-
Note: The token and server values are security credentials; treat them with care.
-
Do not save the token and server details in a notebook.
-
Do not store the token and server details in Git.
The token expires after 24 hours.
-
-
Optional: If you want to use custom certificates, update the
TokenAuthentication
section to add theca_cert_path
parameter to specify the location of the custom certificates, as shown in the following example:
Example authentication section:
auth = TokenAuthentication(
    token = "XXXXX",
    server = "XXXXX",
    skip_tls=False,
    ca_cert_path="/path/to/cert"
)
auth.login()
Alternatively, you can set the
CF_SDK_CA_CERT_PATH
environment variable to specify the location of the custom certificates. -
In each notebook, update the cluster configuration section as follows:
-
If the
namespace
value is specified, replace the example value with the name of your project.
If you omit this line, the Ray cluster is created in the current project.
-
If the
image
value is specified, replace the example value with a link to a suitable Ray cluster image. The Python version in the Ray cluster image must be the same as the Python version in the workbench.
If you omit this line, one of the following Ray cluster images is used by default, based on the Python version detected in the workbench:
-
Python 3.9:
quay.io/modh/ray:2.35.0-py39-cu121
-
Python 3.11:
quay.io/modh/ray:2.35.0-py311-cu121
The default Ray images are compatible with NVIDIA GPUs that are supported by CUDA 12.1. The default images are AMD64 images, which might not work on other architectures.
Additional ROCm-compatible Ray cluster images are available. These images are compatible with AMD accelerators that are supported by ROCm 6.1. These images are AMD64 images, which might not work on other architectures.
-
-
If your cluster administrator has not configured a default local queue, specify the local queue for the Ray cluster, as shown in the following example:
Example local queue assignment:
local_queue="your_local_queue_name"
-
Optional: Assign a dictionary of
labels
parameters to the Ray cluster for identification and management purposes, as shown in the following example:
Example labels assignment:
labels = {"exampleLabel1": "exampleLabel1Value", "exampleLabel2": "exampleLabel2Value"}
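Putting these settings together, a cluster configuration cell might look like the following minimal sketch. The project name, image, queue name, and labels are example values that you replace with your own:

from codeflare_sdk import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="raytest",
    namespace="my_project",                       # omit to use the current project
    num_workers=1,
    image="quay.io/modh/ray:2.35.0-py311-cu121",  # must match the workbench Python version
    local_queue="your_local_queue_name",          # omit if a default local queue is configured
    labels={"exampleLabel1": "exampleLabel1Value"},
))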
-
-
In the
2_basic_interactive.ipynb
notebook, ensure that the following Ray cluster authentication code is included after the Ray cluster creation section:
Ray cluster authentication code:
from codeflare_sdk import generate_cert
generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
generate_cert.export_env(cluster.config.name, cluster.config.namespace)
Note: Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in Open Data Hub. You must include the Ray cluster authentication code to enable the Ray client that runs within a notebook to connect to a secure Ray cluster that has mTLS enabled.
-
Run the notebooks in the order indicated by the file-name prefix (
0_
,1_
, and so on).-
In each notebook, run each cell in turn, and review the cell output.
-
If an error is shown, review the output to find information about the problem and the required corrective action. For example, replace any deprecated parameters as instructed. See also Troubleshooting common problems with distributed workloads for users.
-
For more information about the interactive browser controls that you can use to simplify Ray cluster tasks when working within a Jupyter notebook, see Managing Ray clusters from within a Jupyter notebook.
-
-
The notebooks run to completion without errors.
-
In the notebooks, the output from the
cluster.status()
function or cluster.details()
function indicates that the Ray cluster isActive
.
Managing Ray clusters from within a Jupyter notebook
You can use interactive browser controls to simplify Ray cluster tasks when working within a Jupyter notebook.
The interactive browser controls provide an alternative to the equivalent commands, but do not replace them. You can continue to manage the Ray clusters by running commands within the notebook, for ease of use in scripts and pipelines.
Several different interactive browser controls are available:
-
When you run a cell that provides the cluster configuration, the notebook automatically shows the controls for starting or deleting the cluster.
-
You can run the
view_clusters()
command to add controls that provide the following functionality:
-
View a list of the Ray clusters that you can access.
-
View cluster information, such as cluster status and allocated resources, for the selected Ray cluster. You can view this information from within the notebook, without switching to the OpenShift Container Platform console or the Ray dashboard.
-
Open the Ray dashboard directly from the notebook, to view the submitted jobs.
-
Refresh the Ray cluster list and the cluster information for the selected cluster.
You can add these controls to existing notebooks, or manage the Ray clusters from a separate notebook.
-
The 3_widget_example.ipynb
demo notebook shows all of the available interactive browser controls.
In the example in this procedure, you create a new notebook to manage the Ray clusters, similar to the example provided in the 3_widget_example.ipynb
demo notebook.
-
You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You can access the following software from your data science cluster:
-
A Ray cluster image that is compatible with your hardware architecture
-
The data sets and models to be used by the workload
-
The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
-
-
You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
-
You have Admin access for the data science project.
-
If you created the project, you automatically have Admin access.
-
If you did not create the project, your cluster administrator must give you Admin access.
-
-
You have logged in to Open Data Hub, started your workbench, and logged in to JupyterLab.
-
You have downloaded the demo notebooks provided by the CodeFlare SDK, as described in Downloading the demo notebooks from the CodeFlare SDK.
-
Run all of the demo notebooks in the order indicated by the file-name prefix (
0_
,1_
, and so on), as described in Running the demo notebooks from the CodeFlare SDK. -
In each demo notebook, when you run the cluster configuration step, the following interactive controls are automatically shown in the notebook:
-
Cluster Up: You can click this button to start the Ray cluster. This button is equivalent to the
cluster.up()
command. When you click this button, a message indicates whether the cluster was successfully created. -
Cluster Down: You can click this button to delete the Ray cluster. This button is equivalent to the
cluster.down()
command. The cluster is deleted immediately; you are not prompted to confirm the deletion. When you click this button, a message indicates whether the cluster was successfully deleted. -
Wait for Cluster: You can select this option to specify that the notebook should wait for the Ray cluster dashboard to be ready before proceeding to the next step. This option is equivalent to the
cluster.wait_ready()
command.
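These controls map onto the SDK calls that the demo notebooks use directly. A minimal sketch, assuming a cluster object created from a ClusterConfiguration:

cluster.up()          # equivalent to the Cluster Up button
cluster.wait_ready()  # equivalent to the Wait for Cluster option
cluster.down()        # equivalent to the Cluster Down button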
-
-
In the JupyterLab interface, create a new notebook to manage the Ray clusters, as follows:
-
Click File > New > Notebook. Specify your preferred Python version, and then click Select.
A new notebook is created in an
.ipynb
file. -
Add the following code to a cell in the new notebook:
Code to import the required packages:
from codeflare_sdk import TokenAuthentication, view_clusters
The
view_clusters
package provides the interactive browser controls for listing the clusters, showing the cluster details, opening the Ray dashboard, and refreshing the cluster data. -
Add a new cell to the notebook, and add the following code to the new cell:
Code to authenticate:
auth = TokenAuthentication(
    token = "XXXXX",
    server = "XXXXX",
    skip_tls=False
)
auth.login()
For information about how to find the token and server values, see Running the demo notebooks from the CodeFlare SDK.
-
Add a new cell to the notebook, and add the following code to the new cell:
Code to view clusters in the current project:
view_clusters()
When you run the
view_clusters()
command with no arguments specified, you generate a list of all of the Ray clusters in the current project, and display information similar to the output of the cluster.details() function.
If you have access to another project, you can list the Ray clusters in that project by specifying the project name as shown in the following example:
Code to view clusters in another project:
view_clusters("my_second_project")
-
Click File > Save Notebook As, enter
demo-notebooks/guided-demos/manage_ray_clusters.ipynb
, and click Save.
-
-
In the
demo-notebooks/guided-demos/manage_ray_clusters.ipynb
notebook, select each cell in turn, and click Run > Run selected cell. -
When you run the cell with the
view_clusters()
function, the output depends on whether any Ray clusters exist.
If no Ray clusters exist, the following text is shown, where
[project-name]
is the name of the target project:
No clusters found in the [project-name] namespace.
Otherwise, the notebook shows the following information about the existing Ray clusters:
-
Select an existing cluster
Under this heading, a toggle button is shown for each existing cluster. Click a cluster name to select the cluster. The cluster details section is updated to show details about the selected cluster; for example, cluster name, Open Data Hub project name, cluster resource information, and cluster status.
-
Delete cluster
Click this button to delete the selected cluster. This button is equivalent to the Cluster Down button. The cluster is deleted immediately; you are not prompted to confirm the deletion. A message indicates whether the cluster was successfully deleted, and the corresponding button is no longer shown under the Select an existing cluster heading.
-
View Jobs
Click this button to open the Jobs tab in the Ray dashboard for the selected cluster, and view details of the submitted jobs. The corresponding URL is shown in the notebook.
-
Open Ray Dashboard
Click this button to open the Overview tab in the Ray dashboard for the selected cluster. The corresponding URL is shown in the notebook.
-
Refresh Data
Click this button to refresh the list of Ray clusters, and the cluster details for the selected cluster, on demand. The cluster details are automatically refreshed when you select a cluster and when you delete the selected cluster.
-
-
The demo notebooks run to completion without errors.
-
In the
manage_ray_clusters.ipynb
notebook, the output from the view_clusters()
function is correct.
Running distributed data science workloads from data science pipelines
To run a distributed workload from a pipeline, you must first update the pipeline to include a link to your Ray cluster image.
-
You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You can access the following software from your data science cluster:
-
A Ray cluster image that is compatible with your hardware architecture
-
The data sets and models to be used by the workload
-
The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
-
-
You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
-
You have Admin access for the data science project.
-
If you created the project, you automatically have Admin access.
-
If you did not create the project, your cluster administrator must give you Admin access.
-
-
You have access to S3-compatible object storage.
-
You have logged in to Open Data Hub.
-
Create a connection to connect the object storage to your data science project, as described in Adding a connection to your data science project.
-
Configure a pipeline server to use the connection, as described in Configuring a pipeline server.
-
Create the data science pipeline as follows:
-
Install the
kfp
Python package, which is required for all pipelines:
$ pip install kfp
-
Install any other dependencies that are required for your pipeline.
-
Build your data science pipeline in Python code.
For example, create a file named
compile_example.py
with the following content.
Note: If you copy and paste the following code example, remember to remove the callouts, which are not part of the code. The callouts (parenthetical numbers, highlighted in bold font in this document) map the relevant line of code to an explanatory note in the text immediately after the code example.
from kfp import dsl

@dsl.component(
    base_image="registry.redhat.io/ubi8/python-39:latest",
    packages_to_install=['codeflare-sdk']
)
def ray_fn():
    import ray (1)
    from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert (2)

    cluster = Cluster( (3)
        ClusterConfiguration(
            namespace="my_project", (4)
            name="raytest",
            num_workers=1,
            head_cpus="500m",
            min_memory=1,
            max_memory=1,
            worker_extended_resource_requests={"nvidia.com/gpu": 1}, (5)
            image="quay.io/modh/ray:2.35.0-py39-cu121", (6)
            local_queue="local_queue_name", (7)
        )
    )

    print(cluster.status())
    cluster.up() (8)
    cluster.wait_ready() (9)
    print(cluster.status())
    print(cluster.details())

    ray_dashboard_uri = cluster.cluster_dashboard_uri()
    ray_cluster_uri = cluster.cluster_uri()
    print(ray_dashboard_uri, ray_cluster_uri)

    # Enable Ray client to connect to secure Ray cluster that has mTLS enabled
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace) (10)
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)

    ray.init(address=ray_cluster_uri)
    print("Ray cluster is up and running: ", ray.is_initialized())

    @ray.remote
    def train_fn(): (11)
        # complex training function
        return 100

    result = ray.get(train_fn.remote())
    assert 100 == result

    ray.shutdown()
    cluster.down() (12)
    auth.logout()
    return result

@dsl.pipeline( (13)
    name="Ray Simple Example",
    description="Ray Simple Example",
)
def ray_integration():
    ray_fn()

if __name__ == '__main__': (14)
    from kfp.compiler import Compiler
    Compiler().compile(ray_integration, 'compiled-example.yaml')
-
Imports Ray.
-
Imports packages from the CodeFlare SDK to define the cluster functions.
-
Specifies the Ray cluster configuration: replace these example values with the values for your Ray cluster.
-
Optional: Specifies the project where the Ray cluster is created. Replace the example value with the name of your project. If you omit this line, the Ray cluster is created in the current project.
-
Optional: Specifies the requested accelerators for the Ray cluster (in this example, 1 NVIDIA GPU). If no accelerators are required, set the value to 0 or omit the line. Note: To specify the requested accelerators for the Ray cluster, use the
worker_extended_resource_requests
parameter instead of the deprecatednum_gpus
parameter. For more details, see the CodeFlare SDK documentation. -
Specifies the location of the Ray cluster image. If you omit this line, one of the default CUDA-compatible Ray cluster images is used, based on the Python version detected in the workbench. The default Ray images are AMD64 images, which might not work on other architectures. If you are running this code in a disconnected environment, replace the default value with the location for your environment.
-
Specifies the local queue to which the Ray cluster will be submitted. If a default local queue is configured, you can omit this line.
-
Creates a Ray cluster by using the specified image and configuration.
-
Waits until the Ray cluster is ready before proceeding.
-
Enables the Ray client to connect to a secure Ray cluster that has mutual Transport Layer Security (mTLS) enabled. mTLS is enabled by default in the CodeFlare component in Open Data Hub.
-
Replace the example details in this section with the details for your workload.
-
Removes the Ray cluster when your workload is finished.
-
Replace the example name and description with the values for your workload.
-
Compiles the Python code and saves the output in a YAML file.
-
-
Compile the Python file (in this example, the
compile_example.py
file):
$ python compile_example.py
This command creates a YAML file (in this example,
compiled-example.yaml
), which you can import in the next step.
-
-
Import your data science pipeline, as described in Importing a data science pipeline.
-
Schedule the pipeline run, as described in Scheduling a pipeline run.
-
When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing the details of a pipeline run.
The YAML file is created and the pipeline run completes without errors.
You can view the run details, as described in Viewing the details of a pipeline run.
Monitoring distributed workloads
In Open Data Hub, you can view project metrics for distributed workloads, and view the status of all distributed workloads in the selected project. You can use these metrics to monitor the resources used by distributed workloads, assess whether project resources are allocated correctly, track the progress of distributed workloads, and identify corrective action when necessary.
Note: Data science pipelines workloads are not managed by the distributed workloads feature, and are not included in the distributed workloads metrics.
Viewing project metrics for distributed workloads
In Open Data Hub, you can view the following project metrics for distributed workloads:
-
CPU - The number of CPU cores that are currently being used by all distributed workloads in the selected project.
-
Memory - The amount of memory in gibibytes (GiB) that is currently being used by all distributed workloads in the selected project.
You can use these metrics to monitor the resources used by the distributed workloads, and assess whether project resources are allocated correctly.
-
You have installed Open Data Hub.
-
On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.
-
You have logged in to Open Data Hub.
-
If you are using Open Data Hub groups, you are part of the user group or admin group (for example,
odh-users
or odh-admins
) in OpenShift. -
Your data science project contains distributed workloads.
-
In the Open Data Hub left navigation pane, click Distributed Workloads Metrics.
-
From the Project list, select the project that contains the distributed workloads that you want to monitor.
-
Click the Project metrics tab.
-
Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.
You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.
-
In the Requested resources section, review the CPU and Memory graphs to identify the resources requested by distributed workloads as follows:
-
Requested by the selected project
-
Requested by all projects, including the selected project and projects that you cannot access
-
Total shared quota for all projects, as provided by the cluster queue
For each resource type (CPU and Memory), subtract the Requested by all projects value from the Total shared quota value to calculate how much of that resource quota has not been requested and is available for all projects.
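For example, if the Total shared quota for CPU is 16 cores and the Requested by all projects value is 10 cores, then 6 cores of the CPU quota remain unrequested and available to all projects.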
-
-
Scroll down to the Top resource-consuming distributed workloads section to review the following graphs:
-
Top 5 distributed workloads that are consuming the most CPU resources
-
Top 5 distributed workloads that are consuming the most memory
You can also identify how much CPU or memory is used in each case.
-
-
Scroll down to view the Distributed workload resource metrics table, which lists all of the distributed workloads in the selected project, and indicates the current resource usage and the status of each distributed workload.
In each table entry, progress bars indicate how much of the requested CPU and memory is currently being used by this distributed workload. To see numeric values for the actual usage and requested usage for CPU (measured in cores) and memory (measured in GiB), hover the cursor over each progress bar. Compare the actual usage with the requested usage to assess the distributed workload configuration. If necessary, reconfigure the distributed workload to reduce or increase the requested resources.
On the Project metrics tab, the graphs and table provide resource-usage data for the distributed workloads in the selected project.
Viewing the status of distributed workloads
In Open Data Hub, you can view the status of all distributed workloads in the selected project. You can track the progress of the distributed workloads, and identify corrective action when necessary.
-
You have installed Open Data Hub.
-
On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.
-
You have logged in to Open Data Hub.
-
If you are using Open Data Hub groups, you are part of the user group or admin group (for example,
odh-users
or odh-admins
) in OpenShift. -
Your data science project contains distributed workloads.
-
In the Open Data Hub left navigation pane, click Distributed Workloads Metrics.
-
From the Project list, select the project that contains the distributed workloads that you want to monitor.
-
Click the Distributed workload status tab.
-
Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.
You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.
-
In the Status overview section, review a summary of the status of all distributed workloads in the selected project.
The status can be Pending, Inadmissible, Admitted, Running, Evicted, Succeeded, or Failed.
-
Scroll down to view the Distributed workloads table, which lists all of the distributed workloads in the selected project. The table provides the priority, status, creation date, and latest message for each distributed workload.
The latest message provides more information about the current status of the distributed workload. Review the latest message to identify any corrective action needed. For example, a distributed workload might be Inadmissible because the requested resources exceed the available resources. In such cases, you can either reconfigure the distributed workload to reduce the requested resources, or reconfigure the cluster queue for the project to increase the resource quota.
On the Distributed workload status tab, the graph provides a summarized view of the status of all distributed workloads in the selected project, and the table provides more details about the status of each distributed workload.
Viewing Kueue alerts for distributed workloads
In Open Data Hub, you can view Kueue alerts for your cluster. Each alert provides a link to a runbook. The runbook provides instructions on how to resolve the situation that triggered the alert.
-
You have logged in to OpenShift Container Platform with the
cluster-admin
role. -
You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
-
You have logged in to Open Data Hub.
-
Your data science project contains distributed workloads.
-
In the OpenShift Container Platform console, in the Administrator perspective, click Observe → Alerting.
-
Click the Alerting rules tab to view a list of alerting rules for default and user-defined projects.
-
The Severity column indicates whether the alert is informational, a warning, or critical.
-
The Alert state column indicates whether a rule is currently firing.
-
-
Click the name of an alerting rule to see more details, such as the condition that triggers the alert. The following table summarizes the alerting rules for Kueue resources.
Table 2. Alerting rules for Kueue resources

Severity | Name | Alert condition |
---|---|---|
Critical | KueuePodDown | The Kueue pod is not ready for a period of 5 minutes. |
Info | LowClusterQueueResourceUsage | Resource usage in the cluster queue is below 20% of its nominal quota for more than 1 day. Resource usage refers to any resources listed in the cluster queue, such as CPU, memory, and so on. |
Info | ResourceReservationExceedsQuota | Resource reservation is 10 times the available quota in the cluster queue. Resource reservation refers to any resources listed in the cluster queue, such as CPU, memory, and so on. |
Info | PendingWorkloadPods | A pod has been in a Pending state for more than 3 days. |
-
If the Alert state of an alerting rule is set to Firing, complete the following steps:
-
Click Observe → Alerting and then click the Alerts tab.
-
Click each alert for the firing rule, to see more details. Note that a separate alert is fired for each resource type affected by the alerting rule.
-
On the alert details page, in the Runbook section, click the link to open a GitHub page that provides troubleshooting information.
-
Complete the runbook steps to identify the cause of the alert and resolve the situation.
-
After you resolve the cause of the alert, the alerting rule stops firing.
Tuning a model by using the Training Operator
To tune a model by using the Kubeflow Training Operator, you configure and run a training job.
Optionally, you can use Low-Rank Adaptation (LoRA) to efficiently fine-tune large language models, such as Llama 3. The integration optimizes computational requirements and reduces memory footprint, allowing fine-tuning on consumer-grade GPUs. The solution combines PyTorch Fully Sharded Data Parallel (FSDP) and LoRA to enable scalable, cost-effective model training and inference, enhancing the flexibility and performance of AI workloads within OpenShift environments.
Configuring the training job
Before you can use a training job to tune a model, you must configure the training job. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.
-
You have logged in to OpenShift Container Platform.
-
You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You have created a data science project. For information about how to create a project, see Creating a data science project.
-
You have Admin access for the data science project.
-
If you created the project, you automatically have Admin access.
-
If you did not create the project, your cluster administrator must give you Admin access.
-
-
You have access to a model.
-
You have access to data that you can use to train the model.
-
In a terminal window, if you are not already logged in to your OpenShift cluster, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>
-
Configure a training job, as follows:
-
Create a YAML file named
config_trainingjob.yaml
. -
Add the
ConfigMap
object definition as follows:
Example training-job configuration:
kind: ConfigMap
apiVersion: v1
metadata:
  name: training-config
  namespace: kfto
data:
  config.json: |
    {
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/data/input/twitter_complaints.json",
      "output_dir": "/data/output/tuning/bloom-twitter",
      "save_model_dir": "/mnt/output/model",
      "num_train_epochs": 10.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "save_strategy": "no",
      "learning_rate": 1e-05,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "padding_free": ["huggingface"],
      "multipack": [16],
      "use_flash_attn": false
    }
-
Optional: To fine-tune with Low Rank Adaptation (LoRA), update the
config.json
section as follows:-
Set the
peft_method
parameter to"lora"
. -
Add the
lora_r
,lora_alpha
,lora_dropout
,bias
, andtarget_modules
parameters.Example LoRA configuration... "peft_method": "lora", "lora_r": 8, "lora_alpha": 8, "lora_dropout": 0.1, "bias": "none", "target_modules": ["all-linear"] }
-
-
Optional: To fine-tune with Quantized Low Rank Adaptation (QLoRA), update the
config.json
section as follows:-
Set the
use_flash_attn
parameter to"true"
. -
Set the
peft_method
parameter to"lora"
. -
Add the LoRA parameters:
lora_r
,lora_alpha
,lora_dropout
,bias
, andtarget_modules
. -
Add the QLoRA mandatory parameters:
auto_gptq
,torch_dtype
, andfp16
. -
If required, add the QLoRA optional parameters:
fused_lora
andfast_kernels
.Example QLoRA configuration... "use_flash_attn": true, "peft_method": "lora", "lora_r": 8, "lora_alpha": 8, "lora_dropout": 0.1, "bias": "none", "target_modules": ["all-linear"], "auto_gptq": ["triton_v2"], "torch_dtype": float16, "fp16": true, "fused_lora": ["auto_gptq", true], "fast_kernels": [true, true, true] }
-
-
Edit the metadata of the training-job configuration as shown in the following table.
Table 3. Training-job configuration metadata

Parameter | Value |
---|---|
name | Name of the training-job configuration |
namespace | Name of your project |
-
Edit the parameters of the training-job configuration as shown in the following table.
Table 4. Training-job configuration parameters

Parameter | Value |
---|---|
model_name_or_path | Name of the pre-trained model or the path to the model in the training-job container; in this example, the model name is taken from the Hugging Face web page |
training_data_path | Path to the training data that you set in the training_data.yaml ConfigMap |
output_dir | Output directory for the model |
save_model_dir | Directory where the tuned model is saved |
num_train_epochs | Number of epochs for training; in this example, the training job is set to run 10 times |
per_device_train_batch_size | Batch size, the number of data set examples to process together; in this example, the training job processes 4 examples at a time |
per_device_eval_batch_size | Batch size, the number of data set examples to process together per GPU or TPU core or CPU; in this example, the training job processes 4 examples at a time |
gradient_accumulation_steps | Number of gradient accumulation steps |
save_strategy | How often model checkpoints can be saved; the default value is "epoch" (save model checkpoint every epoch), other possible values are "steps" (save model checkpoint for every training step) and "no" (do not save model checkpoints) |
save_total_limit | Number of model checkpoints to save; omit if save_strategy is set to "no" (no model checkpoints saved) |
learning_rate | Learning rate for the training |
weight_decay | Weight decay to apply |
lr_scheduler_type | Optional: Scheduler type to use; the default value is "linear", other possible values are "cosine", "cosine_with_restarts", "polynomial", "constant", and "constant_with_warmup" |
include_tokens_per_second | Optional: Whether or not to compute the number of tokens per second per device for training speed metrics |
response_template | Template formatting for the response |
dataset_text_field | Dataset field for training output, as set in the training_data.yaml config map |
padding_free | Whether to use a technique to process multiple examples in a single batch without adding padding tokens that waste compute resources; if used, this parameter must be set to ["huggingface"] |
multipack | Whether to use a technique for multi-GPU training to balance the number of tokens processed in each device, to minimize waiting time; you can experiment with different values to find the optimum value for your training job |
use_flash_attn | Whether to use flash attention |
peft_method | Tuning method: for full fine-tuning, omit this parameter; for LoRA and QLoRA, set to "lora"; for prompt tuning, set to "pt" |
lora_r | LoRA: Rank of the low-rank decomposition |
lora_alpha | LoRA: Scale the low-rank matrices to control their influence on the model’s adaptations |
lora_dropout | LoRA: Dropout rate applied to the LoRA layers, a regularization technique to prevent overfitting |
bias | LoRA: Whether to adapt bias terms in the model; setting the bias to "none" indicates that no bias terms will be adapted |
target_modules | LoRA: Names of the modules to apply LoRA to; to include all linear layers, set to "all_linear"; optional parameter for some models |
auto_gptq | QLoRA: Sets 4-bit GPTQ-LoRA with AutoGPTQ; when used, this parameter must be set to ["triton_v2"] |
torch_dtype | QLoRA: Tensor datatype; when used, this parameter must be set to float16 |
fp16 | QLoRA: Whether to use half-precision floating-point format; when used, this parameter must be set to true |
fused_lora | QLoRA: Whether to use fused LoRA for more efficient LoRA training; if used, this parameter must be set to ["auto_gptq", true] |
fast_kernels | QLoRA: Whether to use fast cross-entropy, rope, rms loss kernels; if used, this parameter must be set to [true, true, true] |
-
Save your changes in the
config_trainingjob.yaml
file. -
Apply the configuration to create the
training-config
object:
$ oc apply -f config_trainingjob.yaml
-
-
Create the training data.
NoteThe training data in this simple example is for demonstration purposes only, and is not suitable for production use. The usual method for providing training data is to use persistent volumes.
-
Create a YAML file named
training_data.yaml
. -
Add the following
ConfigMap
object definition:kind: ConfigMap apiVersion: v1 metadata: name: twitter-complaints namespace: kfto data: twitter_complaints.json: | [ {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}, {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}, {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}, {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}, {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}, {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}, {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}, {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year ��","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year ��\n\n### Label: no complaint"} ]
-
Replace the example namespace value
kfto
with the name of your project. -
Replace the example training data with your training data.
-
Save your changes in the
training_data.yaml
file. -
Apply the configuration to create the training data:
$ oc apply -f training_data.yaml
-
-
Create a persistent volume claim (PVC), as follows:
-
Create a YAML file named
trainedmodelpvc.yaml
. -
Add the following
PersistentVolumeClaim
object definition:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trained-model
  namespace: kfto
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
-
Replace the example namespace value
kfto
with the name of your project, and update the other parameters to suit your environment. To calculate the storage
value, multiply the model size by the number of epochs, and add a little extra as a buffer. -
Save your changes in the
trainedmodelpvc.yaml
file. -
Apply the configuration to create a Persistent Volume Claim (PVC) for the training job:
$ oc apply -f trainedmodelpvc.yaml
-
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Click ConfigMaps and verify that the
training-config
and twitter-complaints
ConfigMaps are listed. -
Click Search. From the Resources list, select PersistentVolumeClaim and verify that the
trained-model
PVC is listed.
Running the training job
You can run a training job to tune a model. The example training job in this section is based on the IBM and Hugging Face tuning example provided here.
-
You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You have created a data science project. For information about how to create a project, see Creating a data science project.
-
You have Admin access for the data science project.
-
If you created the project, you automatically have Admin access.
-
If you did not create the project, your cluster administrator must give you Admin access.
-
-
You have access to a model.
-
You have access to data that you can use to train the model.
-
You have configured the training job as described in Configuring the training job.
-
In a terminal window, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>
-
Create a PyTorch training job, as follows:
-
Create a YAML file named
pytorchjob.yaml
. -
Add the following
PyTorchJob
object definition:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: kfto-demo
  namespace: kfto
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - env:
                - name: SFT_TRAINER_CONFIG_JSON_PATH
                  value: /etc/config/config.json
              image: 'quay.io/modh/fms-hf-tuning:release'
              imagePullPolicy: IfNotPresent
              name: pytorch
              volumeMounts:
                - mountPath: /etc/config
                  name: config-volume
                - mountPath: /data/input
                  name: dataset-volume
                - mountPath: /data/output
                  name: model-volume
          volumes:
            - configMap:
                items:
                  - key: config.json
                    path: config.json
                name: training-config
              name: config-volume
            - configMap:
                name: twitter-complaints
              name: dataset-volume
            - name: model-volume
              persistentVolumeClaim:
                claimName: trained-model
  runPolicy:
    suspend: false
-
Replace the example namespace value
kfto
with the name of your project, and update the other parameters to suit your environment. -
Edit the parameters of the PyTorch training job, to provide the details for your training job and environment.
-
Save your changes in the
pytorchjob.yaml
file. -
Apply the configuration to run the PyTorch training job:
$ oc apply -f pytorchjob.yaml
-
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Click Workloads → Pods and verify that the <training-job-name>-master-0 pod is listed.
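If you prefer to verify from the command line instead of the console, commands similar to the following show the job and its master pod. The `pytorchjobs` resource name assumes that the Training Operator CRDs are installed, `kfto-demo` is the example job name used in this section, and the project name is a placeholder.

# List PyTorch training jobs and the associated master pod in your project
$ oc get pytorchjobs -n <project_name>
$ oc get pods -n <project_name> | grep master-0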
Monitoring the training job
When you run a training job to tune a model, you can monitor the progress of the job. The example training job in this section is based on the IBM and Hugging Face tuning example provided here.
-
You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
-
You have created a data science project. For information about how to create a project, see Creating a data science project.
-
You have Admin access for the data science project.
-
If you created the project, you automatically have Admin access.
-
If you did not create the project, your cluster administrator must give you Admin access.
-
You have access to a model.
-
You have access to data that you can use to train the model.
-
You are running the training job as described in Running the training job.
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Click Workloads → Pods.
-
Search for the pod that corresponds to the PyTorch job, that is, <training-job-name>-master-0.
For example, if the training job name is `kfto-demo`, the pod name is kfto-demo-master-0.
-
Click the pod name to open the pod details page.
-
Click the Logs tab to monitor the progress of the job and view status updates, as shown in the following example output:
0%| | 0/10 [00:00<?, ?it/s]
10%|█ | 1/10 [01:10<10:32, 70.32s/it]
{'loss': 6.9531, 'grad_norm': 1104.0, 'learning_rate': 9e-06, 'epoch': 1.0}
10%|█ | 1/10 [01:10<10:32, 70.32s/it]
20%|██ | 2/10 [01:40<06:13, 46.71s/it]
30%|███ | 3/10 [02:26<05:25, 46.55s/it]
{'loss': 2.4609, 'grad_norm': 736.0, 'learning_rate': 7e-06, 'epoch': 2.0}
30%|███ | 3/10 [02:26<05:25, 46.55s/it]
40%|████ | 4/10 [03:23<05:02, 50.48s/it]
50%|█████ | 5/10 [03:41<03:13, 38.66s/it]
{'loss': 1.7617, 'grad_norm': 328.0, 'learning_rate': 5e-06, 'epoch': 3.0}
50%|█████ | 5/10 [03:41<03:13, 38.66s/it]
60%|██████ | 6/10 [04:54<03:22, 50.58s/it]
{'loss': 3.1797, 'grad_norm': 1016.0, 'learning_rate': 4.000000000000001e-06, 'epoch': 4.0}
60%|██████ | 6/10 [04:54<03:22, 50.58s/it]
70%|███████ | 7/10 [06:03<02:49, 56.59s/it]
{'loss': 2.9297, 'grad_norm': 984.0, 'learning_rate': 3e-06, 'epoch': 5.0}
70%|███████ | 7/10 [06:03<02:49, 56.59s/it]
80%|████████ | 8/10 [06:38<01:39, 49.57s/it]
90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]
{'loss': 1.4219, 'grad_norm': 684.0, 'learning_rate': 1.0000000000000002e-06, 'epoch': 6.0}
90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]
100%|██████████| 10/10 [08:25<00:00, 52.53s/it]
{'loss': 1.9609, 'grad_norm': 648.0, 'learning_rate': 0.0, 'epoch': 6.67}
100%|██████████| 10/10 [08:25<00:00, 52.53s/it]
{'train_runtime': 508.0444, 'train_samples_per_second': 0.197, 'train_steps_per_second': 0.02, 'train_loss': 2.63125, 'epoch': 6.67}
100%|██████████| 10/10 [08:28<00:00, 52.53s/it]
100%|██████████| 10/10 [08:28<00:00, 50.80s/it]
In the example output, the solid blocks indicate progress bars.
-
The <training-job-name>-master-0 pod is running.
-
The Logs tab provides information about the job progress and job status.
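As an alternative to the console Logs tab, you can stream the same output with the OpenShift CLI. The pod name below assumes the example job name `kfto-demo`; adjust it and the project name for your environment.

# Stream the training logs from the master pod
$ oc logs -f kfto-demo-master-0 -n <project_name>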
Troubleshooting common problems with distributed workloads for users
If you are experiencing errors in Open Data Hub relating to distributed workloads, read this section to understand what could be causing the problem and how to resolve it.
My Ray cluster is in a suspended state
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
The Ray cluster head pod or worker pods remain in a suspended state.
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Check the workload resource:
-
Click Search, and from the Resources list, select Workload.
-
Select the workload resource that is created with the Ray cluster resource, and click the YAML tab.
-
Check the text in the `status.conditions.message` field, which provides the reason for the suspended state, as shown in the following example:

status:
  conditions:
    - lastTransitionTime: '2024-05-29T13:05:09Z'
      message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
-
Check the Ray cluster resource:
-
Click Search, and from the Resources list, select RayCluster.
-
Select the Ray cluster resource, and click the YAML tab.
-
Check the text in the `status.conditions.message` field.
-
Check the cluster queue resource:
-
Click Search, and from the Resources list, select ClusterQueue.
-
Check your cluster queue configuration to ensure that the resources that you requested are within the limits defined for the project.
-
Either reduce your requested resources, or contact your administrator to request more resources.
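If you have CLI access, a quick way to review the same information is to inspect the Workload and ClusterQueue resources directly. The following commands are a sketch; the cluster queue name is a placeholder, and viewing ClusterQueue resources might require additional permissions in your cluster.

# Show why the workload is not admitted (the reason appears in status.conditions)
$ oc get workloads.kueue.x-k8s.io -n <project_name> -o yaml | grep -A 2 message
# Review the quotas defined in the cluster queue
$ oc get clusterqueue <cluster_queue_name> -o yaml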
My Ray cluster is in a failed state
You might have insufficient resources.
The Ray cluster head pod or worker pods are not running.
When a Ray cluster is created, it initially enters a `failed` state.
This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.
If the failed state persists, complete the following steps:
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Click Search, and from the Resources list, select Pod.
-
Click your pod name to open the pod details page.
-
Click the Events tab, and review the pod events to identify the cause of the problem.
-
If you cannot resolve the problem, contact your administrator to request assistance.
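If you have CLI access, the same event information is available with commands similar to the following; the pod name is a placeholder.

# Review recent events in the project, then inspect the failing pod
$ oc get events -n <project_name> --sort-by=.lastTimestamp
$ oc describe pod <ray_head_pod_name> -n <project_name>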
I see a failed to call webhook error message for the CodeFlare Operator
After you run the `cluster.up()` command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
The CodeFlare Operator pod might not be running.
Contact your administrator to request assistance.
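For reference, an administrator with access to the Open Data Hub applications namespace can typically confirm the operator state with a command like the following; the namespace shown is taken from the error message and might differ in your deployment.

# Check whether the CodeFlare Operator pod is running
$ oc get pods -n redhat-ods-applications | grep codeflare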
I see a failed to call webhook error message for Kueue
After you run the `cluster.up()` command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
The Kueue pod might not be running.
Contact your administrator to request assistance.
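Similarly, an administrator can check whether the Kueue pod is running; the namespace shown is taken from the error message and might differ in your deployment.

# Check whether the Kueue controller pod is running
$ oc get pods -n redhat-ods-applications | grep kueue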
My Ray cluster doesn’t start
After you run the `cluster.up()` command, when you run either the `cluster.details()` command or the `cluster.status()` command, the Ray cluster remains in the `Starting` status instead of changing to the `Ready` status.
No pods are created.
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Check the workload resource:
-
Click Search, and from the Resources list, select Workload.
-
Select the workload resource that is created with the Ray cluster resource, and click the YAML tab.
-
Check the text in the `status.conditions.message` field, which provides the reason for remaining in the `Starting` state.
-
Check the Ray cluster resource:
-
Click Search, and from the Resources list, select RayCluster.
-
Select the Ray cluster resource, and click the YAML tab.
-
Check the text in the `status.conditions.message` field.
-
If you cannot resolve the problem, contact your administrator to request assistance.
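If you have CLI access, you can read the same `status.conditions.message` fields without the console. These commands are a sketch and assume the Kueue and KubeRay CRDs that are installed with Open Data Hub.

# Print the status conditions recorded on the Workload and RayCluster resources
$ oc get workloads.kueue.x-k8s.io -n <project_name> -o yaml | grep -A 2 message
$ oc get rayclusters.ray.io -n <project_name> -o yaml | grep -A 2 message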
I see a Default Local Queue … not found error message
After you run the `cluster.up()` command, the following error is shown:
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
No default local queue is defined, and a local queue is not specified in the cluster configuration.
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Click Search, and from the Resources list, select LocalQueue.
-
Resolve the problem in one of the following ways:
-
If a local queue exists, add it to your cluster configuration as follows:
local_queue="<local_queue_name>"
-
If no local queue exists, contact your administrator to request assistance.
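If you have CLI access, you can check whether any local queue in your project carries the default-queue annotation mentioned in the error message. This is a sketch; the project name is a placeholder.

# List local queues and look for the kueue.x-k8s.io/default-queue: "true" annotation
$ oc get localqueues.kueue.x-k8s.io -n <project_name> -o yaml | grep -B 1 default-queue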
I see a local_queue provided does not exist error message
After you run the `cluster.up()` command, the following error is shown:
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Click Search, and from the Resources list, select LocalQueue.
-
Resolve the problem in one of the following ways:
-
If a local queue exists, ensure that you spelled the local queue name correctly in your cluster configuration, and that the `namespace` value in the cluster configuration matches your project name. If you do not specify a `namespace` value in the cluster configuration, the Ray cluster is created in the current project.
-
If no local queue exists, contact your administrator to request assistance.
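If you have CLI access, listing the local queues in your project is a quick way to confirm the exact queue name and namespace before you update the cluster configuration.

# Confirm the spelling and namespace of the local queue referenced in the cluster configuration
$ oc get localqueues.kueue.x-k8s.io -n <project_name>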
I cannot create a Ray cluster or submit jobs
After you run the `cluster.up()` command, an error similar to the following is shown:
RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
The correct OpenShift login credentials are not specified in the `TokenAuthentication` section of your notebook code.
-
Identify the correct OpenShift login credentials as follows:
-
In the OpenShift Container Platform console header, click your username and click Copy login command.
-
In the new tab that opens, log in as the user whose credentials you want to use.
-
Click Display Token.
-
From the Log in with this token section, copy the `token` and `server` values.
-
In your notebook code, specify the copied `token` and `server` values as follows:

auth = TokenAuthentication(
    token = "<token>",
    server = "<server>",
    skip_tls=False
)
auth.login()
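To confirm that the copied credentials grant the required permissions before you rerun the notebook, you can optionally test them from a terminal. This is a sketch that reuses the same placeholder values.

# Log in with the copied token and check that the user can list Ray clusters in the project
$ oc login --token=<token> --server=<server>
$ oc auth can-i list rayclusters.ray.io -n <project_name>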
My pod provisioned by Kueue is terminated before my image is pulled
Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.
-
In the OpenShift Container Platform console, select your project from the Project list.
-
Click Search, and from the Resources list, select Pod.
-
Click the Ray head pod name to open the pod details page.
-
Click the Events tab, and review the pod events to check whether the image pull completed successfully.
If the pod takes more than 5 minutes to pull the image, contact your administrator to request assistance.
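If you have CLI access, you can inspect the same events from a terminal and check how long the image pull is taking; the pod name is a placeholder.

# Show events for the Ray head pod, including image pull progress and failures
$ oc get events -n <project_name> --field-selector involvedObject.name=<ray_head_pod_name>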