
Working with distributed workloads

To train complex machine-learning models or process data more quickly, you can use the distributed workloads feature to run your jobs on multiple OpenShift worker nodes in parallel. This approach significantly reduces the task completion time, and enables the use of larger datasets and more complex models.

Overview of distributed workloads

You can use the distributed workloads feature to queue, scale, and manage the resources required to run data science workloads across multiple nodes in an OpenShift cluster simultaneously. Data science workloads typically include several types of artificial intelligence (AI) workloads, such as machine learning (ML) and Python workloads.

Distributed workloads provide the following benefits:

  • You can iterate faster and experiment more frequently because of the reduced processing time.

  • You can use larger datasets, which can lead to more accurate models.

  • You can use complex models that could not be trained on a single node.

  • You can submit distributed workloads at any time, and the system then schedules the distributed workload when the required resources are available.

The distributed workloads infrastructure includes the following components:

CodeFlare Operator

Secures deployed Ray clusters and grants access to their URLs

CodeFlare SDK

Defines and controls the remote distributed compute jobs and infrastructure for any Python-based environment

Note

The CodeFlare SDK is not installed as part of Open Data Hub, but it is contained in some of the notebook images provided by Open Data Hub.

KubeRay

Manages remote Ray clusters on OpenShift for running distributed compute workloads

Kueue

Manages quotas and how distributed workloads consume them, and manages the queueing of distributed workloads with respect to quotas

You can run distributed workloads from data science pipelines, from Jupyter notebooks, or from Microsoft Visual Studio Code files.

Note

Data science pipelines workloads are not managed by the distributed workloads feature, and are not included in the distributed workloads metrics.

Managing custom training images

To run distributed training jobs, you can use one of the base training images that are provided with Open Data Hub, or you can create your own custom training images. You can optionally push your custom training images to the integrated OpenShift image registry, to make your images available to other users.

About base training images

The base training images for distributed workloads are optimized with the tools and libraries that you need to run distributed training jobs. You can use the provided base images, or you can create custom images that are specific to your needs.

The following table lists the training images that are installed with Open Data Hub by default.

Table 1. Default training base images
Image type Description

Ray CUDA

If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Ray Compute Unified Device Architecture (CUDA) base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs.

Ray ROCm

If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.

KFTO CUDA

If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Kubeflow Training Operator (KFTO) CUDA base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs.

KFTO ROCm

If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the KFTO ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.

If the preinstalled packages that are provided in these images are not sufficient for your use case, you have the following options:

  • Install additional libraries after launching a default image. This option is good if you want to add libraries on an ad hoc basis as you run training jobs. However, it can be challenging to manage the dependencies of installed libraries.

  • Create a custom image that includes the additional libraries or packages. For more information, see Creating a custom training image.

Creating a custom training image

You can create a custom training image by adding packages to a base training image.

Prerequisites
  • You can access the training image that you have chosen to use as the base for your custom image.

  • You have Podman installed in your local environment, and you can access a container registry.

    For more information about Podman and container registries, see Building, running, and managing containers.

Procedure
  1. In a terminal window, create a directory for your work, and change to that directory.

  2. Set the IMG environment variable to the name of your image. In the example commands in this section, my_training_image is the name of the image.

    export IMG=my_training_image
  3. Create a file named Dockerfile with the following content:

    1. Use the FROM instruction to specify the location of a suitable base training image.

      The Python version in the training image must be the same as the Python version in the workbench.

      • To create a CUDA-compatible Ray cluster image, specify the location of a CUDA-compatible Ray base image, as shown in the following examples:

        CUDA-compatible Ray base image with Python 3.9
        FROM quay.io/modh/ray:2.35.0-py39-cu121
        CUDA-compatible Ray base image with Python 3.11
        FROM quay.io/modh/ray:2.35.0-py311-cu121
      • To create a ROCm-compatible Ray cluster image, specify the location of a ROCm-compatible Ray base image, as shown in the following examples:

        ROCm-compatible Ray base image with Python 3.9
        FROM quay.io/modh/ray:2.35.0-py39-rocm61
        ROCm-compatible Ray base image with Python 3.11
        FROM quay.io/modh/ray:2.35.0-py311-rocm61
      • To create a CUDA-compatible KFTO cluster image, specify the CUDA-compatible KFTO base image location:

        CUDA-compatible KFTO base image with Python 3.11
        FROM quay.io/modh/training:py311-cuda121-torch241
      • To create a ROCm-compatible KFTO cluster image, specify the ROCm-compatible KFTO base image location:

        ROCm-compatible KFTO base image with Python 3.11
        FROM quay.io/modh/training:py311-rocm61-torch241
    2. Use the RUN instruction to install additional packages. You can also add comments to the Dockerfile by prefixing each comment line with a number sign (#).

      The following example shows how to install a specific version of the Python PyTorch package:

      # Install PyTorch
      RUN python3 -m pip install torch==2.4.0
  4. Build the image file. Use the -t option with the podman build command to create an image tag that specifies the image name and version, to make it easier to reference and manage the image:

    podman build -t <image-name>:<version> -f Dockerfile

    Example:

    podman build -t ${IMG}:0.0.1 -f Dockerfile

    The build output indicates when the build process is complete.

  5. Display a list of your images:

    podman images

    If your new image was created successfully, it is included in the list of images.

  6. Push the image to your container registry:

    podman push ${IMG}:0.0.1
  7. Optional: Make your new image available to other users, as described in Pushing an image to the integrated OpenShift image registry.

Pushing an image to the integrated OpenShift image registry

To make an image available to other users in your OpenShift cluster, you can push the image to the integrated OpenShift image registry, a built-in container image registry.

For more information about the integrated OpenShift image registry, see Integrated OpenShift image registry.

Prerequisites
  • You have created a custom training image, as described in Creating a custom training image.

  • You have Podman installed in your local environment.

  • You can log in to the OpenShift CLI as a user with permission to push images to the integrated OpenShift image registry.

Procedure
  1. In a terminal window, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Set the IMG environment variable to the name of your image. In the example commands in this section, my_training_image is the name of the image.

    export IMG=my_training_image
  3. Log in to the integrated image registry:

    podman login -u $(oc whoami) -p $(oc whoami -t) $(oc registry info)
  4. Tag the image for the integrated image registry:

    podman tag ${IMG} $(oc registry info)/$(oc project -q)/${IMG}
  5. Push the image to the integrated image registry:

    podman push $(oc registry info)/$(oc project -q)/${IMG}
  6. Retrieve the image repository location for the tag that you want:

    oc get is ${IMG} -o jsonpath='{.status.tags[?(@.tag=="<TAG>")].items[0].dockerImageReference}'

    Any user can now use your image by specifying this retrieved image location value in the image parameter of a Ray cluster or training job.
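
    For example, a user can reference the pushed image in the CodeFlare SDK. The following Python sketch is an illustration only; the image value is a placeholder for the location that you retrieved in this step, and the other values are examples.

    Example cluster configuration that references the pushed image
    from codeflare_sdk import Cluster, ClusterConfiguration

    cluster = Cluster(ClusterConfiguration(
        name="raytest",
        num_workers=1,
        # Placeholder: replace with the image location returned by the oc get is command
        image="<retrieved_image_location>",
    ))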

Running distributed workloads

In Open Data Hub, you can run a distributed workload from a notebook or from a pipeline.

You can run distributed workloads in a disconnected environment if you can access all of the required software from that environment. For example, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.

Running distributed data science workloads from notebooks

To run a distributed workload from a notebook, you must configure a Ray cluster. You must also provide environment-specific information such as cluster authentication details.

The examples in this section refer to the JupyterLab integrated development environment (IDE).

Downloading the demo notebooks from the CodeFlare SDK

The demo notebooks from the CodeFlare SDK provide guidelines on how to use the CodeFlare stack in your own notebooks. Download the demo notebooks so that you can learn how to run the notebooks locally.

Prerequisites
  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.

  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.

    • If you did not create the project, your cluster administrator must give you Admin access.

  • You have logged in to Open Data Hub, started your workbench, and logged in to JupyterLab.

Procedure
  1. In the JupyterLab interface, click File > New > Notebook. Specify your preferred Python version, and then click Select.

    A new notebook is created in an .ipynb file.

  2. Add the following code to a cell in the new notebook:

    Code to download the demo notebooks
    from codeflare_sdk import copy_demo_nbs
    copy_demo_nbs()
  3. Select the cell, and click Run > Run selected cell.

    After a few seconds, the copy_demo_nbs() function copies the demo notebooks that are packaged with the currently installed version of the CodeFlare SDK into the demo-notebooks folder.

  4. In the left navigation pane, right-click the new notebook and click Delete.

  5. Click Delete to confirm.

Verification

Locate the downloaded demo notebooks in the JupyterLab interface, as follows:

  1. In the left navigation pane, double-click demo-notebooks.

  2. Double-click additional-demos and verify that the folder contains several demo notebooks.

  3. Click demo-notebooks.

  4. Double-click guided-demos and verify that the folder contains several demo notebooks.

You can run these demo notebooks as described in Running the demo notebooks from the CodeFlare SDK.

Running the demo notebooks from the CodeFlare SDK

To run the demo notebooks from the CodeFlare SDK, you must provide environment-specific information.

In the examples in this procedure, you edit the demo notebooks in JupyterLab to provide the required information, and then run the notebooks.

Prerequisites
  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture

    • The data sets and models to be used by the workload

    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server

  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.

  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.

    • If you did not create the project, your cluster administrator must give you Admin access.

  • You have logged in to Open Data Hub, started your workbench, and logged in to JupyterLab.

  • You have downloaded the demo notebooks provided by the CodeFlare SDK, as described in Downloading the demo notebooks from the CodeFlare SDK.

Procedure
  1. Check whether your cluster administrator has defined a default local queue for the Ray cluster.

    You can use the codeflare_sdk.list_local_queues() function to view all local queues in your current namespace, and the resource flavors associated with each local queue.
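
    For example, the following cell is a minimal sketch that prints the local queues that are visible from your workbench; the exact output format depends on your CodeFlare SDK version.

    Example code to list local queues
    from codeflare_sdk import list_local_queues

    # List the local queues in the current namespace and their resource flavors
    print(list_local_queues())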

    Alternatively, you can use the OpenShift web console as follows:

    1. In the OpenShift web console, select your project from the Project list.

    2. Click Search, and from the Resources list, select LocalQueue to show the list of local queues for your project.

      If no local queue is listed, contact your cluster administrator.

    3. Review the details of each local queue:

      1. Click the local queue name.

      2. Click the YAML tab, and review the metadata.annotations section.

        If the kueue.x-k8s.io/default-queue annotation is set to 'true', the queue is configured as the default local queue.

        Note

        If your cluster administrator does not define a default local queue, you must specify a local queue in each notebook.

  2. In the JupyterLab interface, open the demo-notebooks > guided-demos folder.

  3. Open all of the notebooks by double-clicking each notebook file.

    Notebook files have the .ipynb file name extension.

  4. In each notebook, ensure that the import section imports the required components from the CodeFlare SDK, as follows:

    Example import section
    from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication
  5. In each notebook, update the TokenAuthentication section to provide the token and server details to authenticate to the OpenShift cluster by using the CodeFlare SDK.

    You can find your token and server details as follows:

    1. In the Open Data Hub top navigation bar, click the application launcher icon and then click OpenShift Console to open the OpenShift web console.

    2. In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.

    3. After you have logged in, click Display Token.

    4. In the Log in with this token section, find the required values as follows:

      • The token value is the text after the --token= prefix.

      • The server value is the text after the --server= prefix.

    Note

    The token and server values are security credentials; treat them with care.

    • Do not save the token and server details in a notebook.

    • Do not store the token and server details in Git.

    The token expires after 24 hours.

  6. Optional: If you want to use custom certificates, update the TokenAuthentication section to add the ca_cert_path parameter to specify the location of the custom certificates, as shown in the following example:

    Example authentication section
    auth = TokenAuthentication(
        token = "XXXXX",
        server = "XXXXX",
        skip_tls=False,
        ca_cert_path="/path/to/cert"
    )
    auth.login()

    Alternatively, you can set the CF_SDK_CA_CERT_PATH environment variable to specify the location of the custom certificates.
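
    For example, you can set the environment variable in a notebook cell before you create the TokenAuthentication object. The following sketch uses a placeholder certificate path.

    Example code to set the certificate path environment variable
    import os

    # Placeholder path: point the CodeFlare SDK at your custom CA bundle
    os.environ["CF_SDK_CA_CERT_PATH"] = "/path/to/cert"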

  7. In each notebook, update the cluster configuration section as follows:

    1. If the namespace value is specified, replace the example value with the name of your project.

      If you omit this line, the Ray cluster is created in the current project.

    2. If the image value is specified, replace the example value with a link to a suitable Ray cluster image. The Python version in the Ray cluster image must be the same as the Python version in the workbench.

      If you omit this line, one of the following Ray cluster images is used by default, based on the Python version detected in the workbench:

      • Python 3.9: quay.io/modh/ray:2.35.0-py39-cu121

      • Python 3.11: quay.io/modh/ray:2.35.0-py311-cu121

      The default Ray images are compatible with NVIDIA GPUs that are supported by CUDA 12.1. The default images are AMD64 images, which might not work on other architectures.

      Additional ROCm-compatible Ray cluster images are available. These images are compatible with AMD accelerators that are supported by ROCm 6.1. These images are AMD64 images, which might not work on other architectures.

    3. If your cluster administrator has not configured a default local queue, specify the local queue for the Ray cluster, as shown in the following example:

      Example local queue assignment
      local_queue="your_local_queue_name"
    4. Optional: Assign a dictionary of labels parameters to the Ray cluster for identification and management purposes, as shown in the following example:

      Example labels assignment
      labels = {"exampleLabel1": "exampleLabel1Value", "exampleLabel2": "exampleLabel2Value"}
  8. In the 2_basic_interactive.ipynb notebook, ensure that the following Ray cluster authentication code is included after the Ray cluster creation section:

    Ray cluster authentication code
    from codeflare_sdk import generate_cert
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)
    Note

    Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in Open Data Hub. You must include the Ray cluster authentication code to enable the Ray client that runs within a notebook to connect to a secure Ray cluster that has mTLS enabled.

  9. Run the notebooks in the order indicated by the file-name prefix (0_, 1_, and so on).

    1. In each notebook, run each cell in turn, and review the cell output.

    2. If an error is shown, review the output to find information about the problem and the required corrective action. For example, replace any deprecated parameters as instructed. See also Troubleshooting common problems with distributed workloads for users.

    3. For more information about the interactive browser controls that you can use to simplify Ray cluster tasks when working within a Jupyter notebook, see Managing Ray clusters from within a Jupyter notebook.

Verification
  1. The notebooks run to completion without errors.

  2. In the notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
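
    For example, you can run the following cell while the cluster is up. This is a minimal sketch that assumes the cluster object from the demo notebook is still in scope.

    Example status check
    # Prints a summary that includes the cluster state, for example Active
    print(cluster.status())

    # Prints resource details and the Ray dashboard URL
    print(cluster.details())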

Managing Ray clusters from within a Jupyter notebook

You can use interactive browser controls to simplify Ray cluster tasks when working within a Jupyter notebook.

The interactive browser controls provide an alternative to the equivalent commands, but do not replace them. You can continue to manage the Ray clusters by running commands within the notebook, for ease of use in scripts and pipelines.

Several different interactive browser controls are available:

  • When you run a cell that provides the cluster configuration, the notebook automatically shows the controls for starting or deleting the cluster.

  • You can run the view_clusters() command to add controls that provide the following functionality:

    • View a list of the Ray clusters that you can access.

    • View cluster information, such as cluster status and allocated resources, for the selected Ray cluster. You can view this information from within the notebook, without switching to the OpenShift Container Platform console or the Ray dashboard.

    • Open the Ray dashboard directly from the notebook, to view the submitted jobs.

    • Refresh the Ray cluster list and the cluster information for the selected cluster.

    You can add these controls to existing notebooks, or manage the Ray clusters from a separate notebook.

The 3_widget_example.ipynb demo notebook shows all of the available interactive browser controls. In the example in this procedure, you create a new notebook to manage the Ray clusters, similar to the example provided in the 3_widget_example.ipynb demo notebook.

Prerequisites
  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture

    • The data sets and models to be used by the workload

    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server

  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.

  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.

    • If you did not create the project, your cluster administrator must give you Admin access.

  • You have logged in to Open Data Hub, started your workbench, and logged in to JupyterLab.

  • You have downloaded the demo notebooks provided by the CodeFlare SDK, as described in Downloading the demo notebooks from the CodeFlare SDK.

Procedure
  1. Run all of the demo notebooks in the order indicated by the file-name prefix (0_, 1_, and so on), as described in Running the demo notebooks from the CodeFlare SDK.

  2. In each demo notebook, when you run the cluster configuration step, the following interactive controls are automatically shown in the notebook:

    • Cluster Up: You can click this button to start the Ray cluster. This button is equivalent to the cluster.up() command. When you click this button, a message indicates whether the cluster was successfully created.

    • Cluster Down: You can click this button to delete the Ray cluster. This button is equivalent to the cluster.down() command. The cluster is deleted immediately; you are not prompted to confirm the deletion. When you click this button, a message indicates whether the cluster was successfully deleted.

    • Wait for Cluster: You can select this option to specify that the notebook should wait for the Ray cluster dashboard to be ready before proceeding to the next step. This option is equivalent to the cluster.wait_ready() command.
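
    The following sketch shows the equivalent commands that you can run in a notebook cell instead of using these controls; it assumes that the cluster object has already been created from your cluster configuration.

    Equivalent cluster commands
    cluster.up()          # start the Ray cluster (equivalent to Cluster Up)
    cluster.wait_ready()  # wait until the Ray cluster dashboard is ready (equivalent to Wait for Cluster)
    cluster.down()        # delete the Ray cluster (equivalent to Cluster Down)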

  3. In the JupyterLab interface, create a new notebook to manage the Ray clusters, as follows:

    1. Click File > New > Notebook. Specify your preferred Python version, and then click Select.

      A new notebook is created in an .ipynb file.

    2. Add the following code to a cell in the new notebook:

      Code to import the required packages
      from codeflare_sdk import TokenAuthentication, view_clusters

      The view_clusters package provides the interactive browser controls for listing the clusters, showing the cluster details, opening the Ray dashboard, and refreshing the cluster data.

    3. Add a new cell to the notebook, and add the following code to the new cell:

      Code to authenticate
      auth = TokenAuthentication(
          token = "XXXXX",
          server = "XXXXX",
          skip_tls=False
      )
      auth.login()

      For information about how to find the token and server values, see Running the demo notebooks from the CodeFlare SDK.

    4. Add a new cell to the notebook, and add the following code to the new cell:

      Code to view clusters in the current project
      view_clusters()

      When you run the view_clusters() command with no arguments, the notebook lists all of the Ray clusters in the current project and displays information similar to the output of the cluster.details() function.

      If you have access to another project, you can list the Ray clusters in that project by specifying the project name as shown in the following example:

      Code to view clusters in another project
      view_clusters("my_second_project")
    5. Click File > Save Notebook As, enter demo-notebooks/guided-demos/manage_ray_clusters.ipynb, and click Save.

  4. In the demo-notebooks/guided-demos/manage_ray_clusters.ipynb notebook, select each cell in turn, and click Run > Run selected cell.

  5. When you run the cell with the view_clusters() function, the output depends on whether any Ray clusters exist.

    If no Ray clusters exist, the following text is shown, where [project-name] is the name of the target project:

    No clusters found in the [project-name] namespace.

    Otherwise, the notebook shows the following information about the existing Ray clusters:

    • Select an existing cluster

      Under this heading, a toggle button is shown for each existing cluster. Click a cluster name to select the cluster. The cluster details section is updated to show details about the selected cluster; for example, cluster name, Open Data Hub project name, cluster resource information, and cluster status.

    • Delete cluster

      Click this button to delete the selected cluster. This button is equivalent to the Cluster Down button. The cluster is deleted immediately; you are not prompted to confirm the deletion. A message indicates whether the cluster was successfully deleted, and the corresponding button is no longer shown under the Select an existing cluster heading.

    • View Jobs

      Click this button to open the Jobs tab in the Ray dashboard for the selected cluster, and view details of the submitted jobs. The corresponding URL is shown in the notebook.

    • Open Ray Dashboard

      Click this button to open the Overview tab in the Ray dashboard for the selected cluster. The corresponding URL is shown in the notebook.

    • Refresh Data

      Click this button to refresh the list of Ray clusters, and the cluster details for the selected cluster, on demand. The cluster details are automatically refreshed when you select a cluster and when you delete the selected cluster.

Verification
  1. The demo notebooks run to completion without errors.

  2. In the manage_ray_clusters.ipynb notebook, the output from the view_clusters() function is correct.

Running distributed data science workloads from data science pipelines

To run a distributed workload from a pipeline, you must first update the pipeline to include a link to your Ray cluster image.

Prerequisites
  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture

    • The data sets and models to be used by the workload

    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server

  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.

  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.

    • If you did not create the project, your cluster administrator must give you Admin access.

  • You have access to S3-compatible object storage.

  • You have logged in to Open Data Hub.

Procedure
  1. Create a connection to connect the object storage to your data science project, as described in Adding a connection to your data science project.

  2. Configure a pipeline server to use the connection, as described in Configuring a pipeline server.

  3. Create the data science pipeline as follows:

    1. Install the kfp Python package, which is required for all pipelines:

      $ pip install kfp
    2. Install any other dependencies that are required for your pipeline.

    3. Build your data science pipeline in Python code.

      For example, create a file named compile_example.py with the following content.

      Note

      If you copy and paste the following code example, remember to remove the callouts, which are not part of the code. The callouts (parenthetical numbers) map the relevant lines of code to the explanatory notes in the text immediately after the code example.

      from kfp import dsl
      
      
      @dsl.component(
          base_image="registry.redhat.io/ubi8/python-39:latest",
          packages_to_install=['codeflare-sdk']
      )
      
      
      def ray_fn():
         import ray (1)
         from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert (2)
      
      
         cluster = Cluster( (3)
             ClusterConfiguration(
                 namespace="my_project", (4)
                 name="raytest",
                 num_workers=1,
                 head_cpus="500m",
                 min_memory=1,
                 max_memory=1,
                 worker_extended_resource_requests={"nvidia.com/gpu": 1}, (5)
                 image="quay.io/modh/ray:2.35.0-py39-cu121", (6)
                 local_queue="local_queue_name", (7)
             )
         )
      
      
         print(cluster.status())
         cluster.up() (8)
         cluster.wait_ready() (9)
         print(cluster.status())
         print(cluster.details())
      
      
         ray_dashboard_uri = cluster.cluster_dashboard_uri()
         ray_cluster_uri = cluster.cluster_uri()
         print(ray_dashboard_uri, ray_cluster_uri)
      
         # Enable Ray client to connect to secure Ray cluster that has mTLS enabled
         generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace) (10)
         generate_cert.export_env(cluster.config.name, cluster.config.namespace)
      
      
         ray.init(address=ray_cluster_uri)
         print("Ray cluster is up and running: ", ray.is_initialized())
      
      
         @ray.remote
         def train_fn(): (11)
             # complex training function
             return 100
      
      
         result = ray.get(train_fn.remote())
         assert 100 == result
         ray.shutdown()
         cluster.down() (12)
         return result
      
      
      @dsl.pipeline( (13)
         name="Ray Simple Example",
         description="Ray Simple Example",
      )
      
      
      def ray_integration():
         ray_fn()
      
      
      if __name__ == '__main__': (14)
          from kfp.compiler import Compiler
          Compiler().compile(ray_integration, 'compiled-example.yaml')
      1. Imports Ray.

      2. Imports packages from the CodeFlare SDK to define the cluster functions.

      3. Specifies the Ray cluster configuration: replace these example values with the values for your Ray cluster.

      4. Optional: Specifies the project where the Ray cluster is created. Replace the example value with the name of your project. If you omit this line, the Ray cluster is created in the current project.

      5. Optional: Specifies the requested accelerators for the Ray cluster (in this example, 1 NVIDIA GPU). If no accelerators are required, set the value to 0 or omit the line. Note: To specify the requested accelerators for the Ray cluster, use the worker_extended_resource_requests parameter instead of the deprecated num_gpus parameter. For more details, see the CodeFlare SDK documentation.

      6. Specifies the location of the Ray cluster image. If you omit this line, one of the default CUDA-compatible Ray cluster images is used, based on the Python version detected in the workbench. The default Ray images are AMD64 images, which might not work on other architectures. If you are running this code in a disconnected environment, replace the default value with the location for your environment.

      7. Specifies the local queue to which the Ray cluster will be submitted. If a default local queue is configured, you can omit this line.

      8. Creates a Ray cluster by using the specified image and configuration.

      9. Waits until the Ray cluster is ready before proceeding.

      10. Enables the Ray client to connect to a secure Ray cluster that has mutual Transport Layer Security (mTLS) enabled. mTLS is enabled by default in the CodeFlare component in Open Data Hub.

      11. Replace the example details in this section with the details for your workload.

      12. Removes the Ray cluster when your workload is finished.

      13. Replace the example name and description with the values for your workload.

      14. Compiles the Python code and saves the output in a YAML file.

    4. Compile the Python file (in this example, the compile_example.py file):

      $ python compile_example.py

      This command creates a YAML file (in this example, compiled-example.yaml), which you can import in the next step.

  4. Import your data science pipeline, as described in Importing a data science pipeline.

  5. Schedule the pipeline run, as described in Scheduling a pipeline run.

  6. When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing the details of a pipeline run.

Verification

The YAML file is created and the pipeline run completes without errors.

You can view the run details, as described in Viewing the details of a pipeline run.

Monitoring distributed workloads

In Open Data Hub, you can view project metrics for distributed workloads, and view the status of all distributed workloads in the selected project. You can use these metrics to monitor the resources used by distributed workloads, assess whether project resources are allocated correctly, track the progress of distributed workloads, and identify corrective action when necessary.

Note

Data science pipelines workloads are not managed by the distributed workloads feature, and are not included in the distributed workloads metrics.

Viewing project metrics for distributed workloads

In Open Data Hub, you can view the following project metrics for distributed workloads:

  • CPU - The number of CPU cores that are currently being used by all distributed workloads in the selected project.

  • Memory - The amount of memory in gibibytes (GiB) that is currently being used by all distributed workloads in the selected project.

You can use these metrics to monitor the resources used by the distributed workloads, and assess whether project resources are allocated correctly.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • Your data science project contains distributed workloads.

Procedure
  1. In the Open Data Hub left navigation pane, click Distributed Workloads Metrics.

  2. From the Project list, select the project that contains the distributed workloads that you want to monitor.

  3. Click the Project metrics tab.

  4. Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.

    You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

  5. In the Requested resources section, review the CPU and Memory graphs to identify the resources requested by distributed workloads as follows:

    • Requested by the selected project

    • Requested by all projects, including the selected project and projects that you cannot access

    • Total shared quota for all projects, as provided by the cluster queue

    For each resource type (CPU and Memory), subtract the Requested by all projects value from the Total shared quota value to calculate how much of that resource quota has not been requested and is available for all projects.

  6. Scroll down to the Top resource-consuming distributed workloads section to review the following graphs:

    • Top 5 distributed workloads that are consuming the most CPU resources

    • Top 5 distributed workloads that are consuming the most memory

    You can also identify how much CPU or memory is used in each case.

  7. Scroll down to view the Distributed workload resource metrics table, which lists all of the distributed workloads in the selected project, and indicates the current resource usage and the status of each distributed workload.

    In each table entry, progress bars indicate how much of the requested CPU and memory is currently being used by this distributed workload. To see numeric values for the actual usage and requested usage for CPU (measured in cores) and memory (measured in GiB), hover the cursor over each progress bar. Compare the actual usage with the requested usage to assess the distributed workload configuration. If necessary, reconfigure the distributed workload to reduce or increase the requested resources.

Verification

On the Project metrics tab, the graphs and table provide resource-usage data for the distributed workloads in the selected project.

Viewing the status of distributed workloads

In Open Data Hub, you can view the status of all distributed workloads in the selected project. You can track the progress of the distributed workloads, and identify corrective action when necessary.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • Your data science project contains distributed workloads.

Procedure
  1. In the Open Data Hub left navigation pane, click Distributed Workloads Metrics.

  2. From the Project list, select the project that contains the distributed workloads that you want to monitor.

  3. Click the Distributed workload status tab.

  4. Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.

    You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

  5. In the Status overview section, review a summary of the status of all distributed workloads in the selected project.

    The status can be Pending, Inadmissible, Admitted, Running, Evicted, Succeeded, or Failed.

  6. Scroll down to view the Distributed workloads table, which lists all of the distributed workloads in the selected project. The table provides the priority, status, creation date, and latest message for each distributed workload.

    The latest message provides more information about the current status of the distributed workload. Review the latest message to identify any corrective action needed. For example, a distributed workload might be Inadmissible because the requested resources exceed the available resources. In such cases, you can either reconfigure the distributed workload to reduce the requested resources, or reconfigure the cluster queue for the project to increase the resource quota.

Verification

On the Distributed workload status tab, the graph provides a summarized view of the status of all distributed workloads in the selected project, and the table provides more details about the status of each distributed workload.

Viewing Kueue alerts for distributed workloads

In Open Data Hub, you can view Kueue alerts for your cluster. Each alert provides a link to a runbook. The runbook provides instructions on how to resolve the situation that triggered the alert.

Prerequisites
  • You have logged in to OpenShift Container Platform with the cluster-admin role.

  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.

  • You have logged in to Open Data Hub.

  • Your data science project contains distributed workloads.

Procedure
  1. In the OpenShift Container Platform console, in the Administrator perspective, click Observe > Alerting.

  2. Click the Alerting rules tab to view a list of alerting rules for default and user-defined projects.

    • The Severity column indicates whether the alert is informational, a warning, or critical.

    • The Alert state column indicates whether a rule is currently firing.

  3. Click the name of an alerting rule to see more details, such as the condition that triggers the alert. The following table summarizes the alerting rules for Kueue resources.

    Table 2. Alerting rules for Kueue resources
    Severity Name Alert condition

    Critical

    KueuePodDown

    The Kueue pod is not ready for a period of 5 minutes.

    Info

    LowClusterQueueResourceUsage

    Resource usage in the cluster queue is below 20% of its nominal quota for more than 1 day. Resource usage refers to any resource listed in the cluster queue, such as CPU and memory.

    Info

    ResourceReservationExceedsQuota

    Resource reservation is 10 times the available quota in the cluster queue. Resource reservation refers to any resource listed in the cluster queue, such as CPU and memory.

    Info

    PendingWorkloadPods

    A pod has been in a Pending state for more than 3 days.

  4. If the Alert state of an alerting rule is set to Firing, complete the following steps:

    1. Click Observe > Alerting, and then click the Alerts tab.

    2. Click each alert for the firing rule, to see more details. Note that a separate alert is fired for each resource type affected by the alerting rule.

    3. On the alert details page, in the Runbook section, click the link to open a GitHub page that provides troubleshooting information.

    4. Complete the runbook steps to identify the cause of the alert and resolve the situation.

Verification

After you resolve the cause of the alert, the alerting rule stops firing.

Tuning a model by using the Training Operator

To tune a model by using the Kubeflow Training Operator, you configure and run a training job.

Optionally, you can use Low-Rank Adaptation (LoRA) to efficiently fine-tune large language models, such as Llama 3. The integration optimizes computational requirements and reduces memory footprint, allowing fine-tuning on consumer-grade GPUs. The solution combines PyTorch Fully Sharded Data Parallel (FSDP) and LoRA to enable scalable, cost-effective model training and inference, enhancing the flexibility and performance of AI workloads within OpenShift environments.

Configuring the training job

Before you can use a training job to tune a model, you must configure the training job. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.

Prerequisites
  • You have logged in to OpenShift Container Platform.

  • You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You have created a data science project. For information about how to create a project, see Creating a data science project.

  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.

    • If you did not create the project, your cluster administrator must give you Admin access.

  • You have access to a model.

  • You have access to data that you can use to train the model.

Procedure
  1. In a terminal window, if you are not already logged in to your OpenShift cluster, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <username> -p <password>
  2. Configure a training job, as follows:

    1. Create a YAML file named config_trainingjob.yaml.

    2. Add the ConfigMap object definition as follows:

      Example training-job configuration
      kind: ConfigMap
      apiVersion: v1
      metadata:
        name: training-config
        namespace: kfto
      data:
        config.json: |
          {
            "model_name_or_path": "bigscience/bloom-560m",
            "training_data_path": "/data/input/twitter_complaints.json",
            "output_dir": "/data/output/tuning/bloom-twitter",
            "save_model_dir": "/mnt/output/model",
            "num_train_epochs": 10.0,
            "per_device_train_batch_size": 4,
            "per_device_eval_batch_size": 4,
            "gradient_accumulation_steps": 4,
            "save_strategy": "no",
            "learning_rate": 1e-05,
            "weight_decay": 0.0,
            "lr_scheduler_type": "cosine",
            "include_tokens_per_second": true,
            "response_template": "\n### Label:",
            "dataset_text_field": "output",
            "padding_free": ["huggingface"],
            "multipack": [16],
            "use_flash_attn": false
          }
    3. Optional: To fine-tune with Low-Rank Adaptation (LoRA), update the config.json section as follows:

      1. Set the peft_method parameter to "lora".

      2. Add the lora_r, lora_alpha, lora_dropout, bias, and target_modules parameters.

        Example LoRA configuration
              ...
              "peft_method": "lora",
              "lora_r": 8,
              "lora_alpha": 8,
              "lora_dropout": 0.1,
              "bias": "none",
              "target_modules": ["all-linear"]
            }
    4. Optional: To fine-tune with Quantized Low Rank Adaptation (QLoRA), update the config.json section as follows:

      1. Set the use_flash_attn parameter to true.

      2. Set the peft_method parameter to "lora".

      3. Add the LoRA parameters: lora_r, lora_alpha, lora_dropout, bias, and target_modules.

      4. Add the QLoRA mandatory parameters: auto_gptq, torch_dtype, and fp16.

      5. If required, add the QLoRA optional parameters: fused_lora and fast_kernels.

        Example QLoRA configuration
              ...
              "use_flash_attn": true,
              "peft_method": "lora",
              "lora_r": 8,
              "lora_alpha": 8,
              "lora_dropout": 0.1,
              "bias": "none",
              "target_modules": ["all-linear"],
              "auto_gptq": ["triton_v2"],
              "torch_dtype": float16,
              "fp16": true,
              "fused_lora": ["auto_gptq", true],
              "fast_kernels": [true, true, true]
            }
    5. Edit the metadata of the training-job configuration as shown in the following table.

      Table 3. Training-job configuration metadata
      Parameter Value

      name

      Name of the training-job configuration

      namespace

      Name of your project

    6. Edit the parameters of the training-job configuration as shown in the following table.

      Table 4. Training-job configuration parameters
      Parameter Value

      model_name_or_path

      Name of the pre-trained model or the path to the model in the training-job container; in this example, the model name is taken from the Hugging Face web page

      training_data_path

      Path to the training data that you set in the training_data.yaml ConfigMap

      output_dir

      Output directory for the model

      save_model_dir

      Directory where the tuned model is saved

      num_train_epochs

      Number of epochs for training; in this example, the training job is set to run 10 times

      per_device_train_batch_size

      Batch size, the number of data set examples to process together; in this example, the training job processes 4 examples at a time

      per_device_eval_batch_size

      Batch size, the number of data set examples to process together per GPU or TPU core or CPU; in this example, the training job processes 4 examples at a time

      gradient_accumulation_steps

      Number of gradient accumulation steps

      save_strategy

      How often model checkpoints can be saved; the default value is "epoch" (save model checkpoint every epoch), other possible values are "steps" (save model checkpoint for every training step) and "no" (do not save model checkpoints)

      save_total_limit

      Number of model checkpoints to save; omit if save_strategy is set to "no" (no model checkpoints saved)

      learning_rate

      Learning rate for the training

      weight_decay

      Weight decay to apply

      lr_scheduler_type

      Optional: Scheduler type to use; the default value is "linear", other possible values are "cosine", "cosine_with_restarts", "polynomial", "constant", and "constant_with_warmup"

      include_tokens_per_second

      Optional: Whether or not to compute the number of tokens per second per device for training speed metrics

      response_template

      Template formatting for the response

      dataset_text_field

      Dataset field for training output, as set in the training_data.yaml config map

      padding_free

      Whether to use a technique to process multiple examples in a single batch without adding padding tokens that waste compute resources; if used, this parameter must be set to ["huggingface"]

      multipack

      Whether to use a technique for multi-GPU training to balance the number of tokens processed in each device, to minimize waiting time; you can experiment with different values to find the optimum value for your training job

      use_flash_attn

      Whether to use flash attention

      peft_method

      Tuning method: for full fine-tuning, omit this parameter; for LoRA and QLoRA, set to "lora"; for prompt tuning, set to "pt"

      lora_r

      LoRA: Rank of the low-rank decomposition

      lora_alpha

      LoRA: Scale the low-rank matrices to control their influence on the model’s adaptations

      lora_dropout

      LoRA: Dropout rate applied to the LoRA layers, a regularization technique to prevent overfitting

      bias

      LoRA: Whether to adapt bias terms in the model; setting the bias to "none" indicates that no bias terms will be adapted

      target_modules

      LoRA: Names of the modules to apply LoRA to; to include all linear layers, set to "all-linear"; optional parameter for some models

      auto_gptq

      QLoRA: Sets 4-bit GPTQ-LoRA with AutoGPTQ; when used, this parameter must be set to ["triton_v2"]

      torch_dtype

      QLoRA: Tensor datatype; when used, this parameter must be set to float16

      fp16

      QLoRA: Whether to use half-precision floating-point format; when used, this parameter must be set to true

      fused_lora

      QLoRA: Whether to use fused LoRA for more efficient LoRA training; if used, this parameter must be set to ["auto_gptq", true]

      fast_kernels

      QLoRA: Whether to use fast cross-entropy, rope, rms loss kernels; if used, this parameter must be set to [true, true, true]

    7. Save your changes in the config_trainingjob.yaml file.

    8. Apply the configuration to create the training-config object:

      $ oc apply -f config_trainingjob.yaml
  3. Create the training data.

    Note

    The training data in this simple example is for demonstration purposes only, and is not suitable for production use. The usual method for providing training data is to use persistent volumes.

    1. Create a YAML file named training_data.yaml.

    2. Add the following ConfigMap object definition:

      kind: ConfigMap
      apiVersion: v1
      metadata:
        name: twitter-complaints
        namespace: kfto
      data:
        twitter_complaints.json: |
          [
              {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"},
              {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"},
              {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"},
              {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"},
              {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"},
              {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"},
              {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"},
              {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year ��","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year ��\n\n### Label: no complaint"}
          ]
    3. Replace the example namespace value kfto with the name of your project.

    4. Replace the example training data with your training data.

    5. Save your changes in the training_data.yaml file.

    6. Apply the configuration to create the training data:

      $ oc apply -f training_data.yaml
  4. Create a persistent volume claim (PVC), as follows:

    1. Create a YAML file named trainedmodelpvc.yaml.

    2. Add the following PersistentVolumeClaim object definition:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: trained-model
        namespace: kfto
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
    3. Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment. To calculate the storage value, multiply the model size by the number of epochs, and add a buffer; for example, a 10 GB model trained for 4 epochs needs at least 40 Gi, so the 50 Gi requested in this example leaves some headroom.

    4. Save your changes in the trainedmodelpvc.yaml file.

    5. Apply the configuration to create the PVC for the training job:

      $ oc apply -f trainedmodelpvc.yaml
Verification
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Click ConfigMaps and verify that the training-config and twitter-complaints ConfigMaps are listed.

  3. Click Search. From the Resources list, select PersistentVolumeClaim and verify that the trained-model PVC is listed.
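
Alternatively, you can verify the same objects from the OpenShift CLI. The following commands are a minimal sketch; they assume that you are logged in to the cluster and that your project is named kfto, as in the examples:

$ oc get configmap training-config twitter-complaints -n kfto
$ oc get pvc trained-model -n kfto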

Running the training job

You can run a training job to tune a model. The example training job in this section is based on the IBM and Hugging Face tuning example provided here.

Prerequisites
  • You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You have created a data science project. For information about how to create a project, see Creating a data science project.

  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.

    • If you did not create the project, your cluster administrator must give you Admin access.

  • You have access to a model.

  • You have access to data that you can use to train the model.

  • You have configured the training job as described in Configuring the training job.

Procedure
  1. In a terminal window, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <username> -p <password>
  2. Create a PyTorch training job, as follows:

    1. Create a YAML file named pytorchjob.yaml.

    2. Add the following PyTorchJob object definition:

      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      metadata:
        name: kfto-demo
        namespace: kfto
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: Never
            template:
              spec:
                containers:
                  - env:
                      - name: SFT_TRAINER_CONFIG_JSON_PATH
                        value: /etc/config/config.json
                    image: 'quay.io/modh/fms-hf-tuning:release'
                    imagePullPolicy: IfNotPresent
                    name: pytorch
                    volumeMounts:
                      - mountPath: /etc/config
                        name: config-volume
                      - mountPath: /data/input
                        name: dataset-volume
                      - mountPath: /data/output
                        name: model-volume
                volumes:
                  - configMap:
                      items:
                        - key: config.json
                          path: config.json
                      name: training-config
                    name: config-volume
                  - configMap:
                      name: twitter-complaints
                    name: dataset-volume
                  - name: model-volume
                    persistentVolumeClaim:
                      claimName: trained-model
        runPolicy:
          suspend: false
    3. Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment.

    4. Edit the parameters of the PyTorch training job to provide the details for your training job and environment.

    5. Save your changes in the pytorchjob.yaml file.

    6. Apply the configuration to run the PyTorch training job:

      $ oc apply -f pytorchjob.yaml
Verification
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Click Workloads → Pods and verify that the <training-job-name>-master-0 pod is listed.
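
Optionally, you can perform the same check from the OpenShift CLI. The following commands are a sketch that assumes the example job name kfto-demo and the example project kfto; adjust the names to match your environment:

$ oc get pytorchjob kfto-demo -n kfto
$ oc get pods -n kfto | grep kfto-demo-master-0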

Monitoring the training job

When you run a training job to tune a model, you can monitor the progress of the job. The example training job in this section is based on the IBM and Hugging Face tuning example provided here.

Prerequisites
  • You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.

  • You have created a data science project. For information about how to create a project, see Creating a data science project.

  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.

    • If you did not create the project, your cluster administrator must give you Admin access.

  • You have access to a model.

  • You have access to data that you can use to train the model.

  • You are running the training job as described in Running the training job.

Procedure
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Click Workloads → Pods.

  3. Search for the pod that corresponds to the PyTorch job, that is, <training-job-name>-master-0.

    For example, if the training job name is kfto-demo, the pod name is kfto-demo-master-0.

  4. Click the pod name to open the pod details page.

  5. Click the Logs tab to monitor the progress of the job and view status updates, as shown in the following example output:

    0%| | 0/10 [00:00<?, ?it/s] 10%|█ | 1/10 [01:10<10:32, 70.32s/it] {'loss': 6.9531, 'grad_norm': 1104.0, 'learning_rate': 9e-06, 'epoch': 1.0}
    10%|█ | 1/10 [01:10<10:32, 70.32s/it] 20%|██ | 2/10 [01:40<06:13, 46.71s/it] 30%|███ | 3/10 [02:26<05:25, 46.55s/it] {'loss': 2.4609, 'grad_norm': 736.0, 'learning_rate': 7e-06, 'epoch': 2.0}
    30%|███ | 3/10 [02:26<05:25, 46.55s/it] 40%|████ | 4/10 [03:23<05:02, 50.48s/it] 50%|█████ | 5/10 [03:41<03:13, 38.66s/it] {'loss': 1.7617, 'grad_norm': 328.0, 'learning_rate': 5e-06, 'epoch': 3.0}
    50%|█████ | 5/10 [03:41<03:13, 38.66s/it] 60%|██████ | 6/10 [04:54<03:22, 50.58s/it] {'loss': 3.1797, 'grad_norm': 1016.0, 'learning_rate': 4.000000000000001e-06, 'epoch': 4.0}
    60%|██████ | 6/10 [04:54<03:22, 50.58s/it] 70%|███████ | 7/10 [06:03<02:49, 56.59s/it] {'loss': 2.9297, 'grad_norm': 984.0, 'learning_rate': 3e-06, 'epoch': 5.0}
    70%|███████ | 7/10 [06:03<02:49, 56.59s/it] 80%|████████ | 8/10 [06:38<01:39, 49.57s/it] 90%|█████████ | 9/10 [07:22<00:48, 48.03s/it] {'loss': 1.4219, 'grad_norm': 684.0, 'learning_rate': 1.0000000000000002e-06, 'epoch': 6.0}
    90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]100%|██████████| 10/10 [08:25<00:00, 52.53s/it] {'loss': 1.9609, 'grad_norm': 648.0, 'learning_rate': 0.0, 'epoch': 6.67}
    100%|██████████| 10/10 [08:25<00:00, 52.53s/it] {'train_runtime': 508.0444, 'train_samples_per_second': 0.197, 'train_steps_per_second': 0.02, 'train_loss': 2.63125, 'epoch': 6.67}
    100%|██████████| 10/10 [08:28<00:00, 52.53s/it]100%|██████████| 10/10 [08:28<00:00, 50.80s/it]

    In the example output, the solid blocks indicate progress bars.

Verification
  1. The <training-job-name>-master-0 pod is running.

  2. The Logs tab provides information about the job progress and job status.
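
If you prefer to follow the logs from a terminal, a command similar to the following streams the same output that is shown in the Logs tab. It assumes the example pod name kfto-demo-master-0 and the example project kfto:

$ oc logs -f kfto-demo-master-0 -n kfto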

Troubleshooting common problems with distributed workloads for users

If you are experiencing errors in Open Data Hub relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem.

My Ray cluster is in a suspended state

Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

Diagnosis

The Ray cluster head pod or worker pods remain in a suspended state.

Resolution
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Check the workload resource:

    1. Click Search, and from the Resources list, select Workload.

    2. Select the workload resource that is created with the Ray cluster resource, and click the YAML tab.

    3. Check the text in the status.conditions.message field, which provides the reason for the suspended state, as shown in the following example:

      status:
        conditions:
          - lastTransitionTime: '2024-05-29T13:05:09Z'
            message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
  3. Check the Ray cluster resource:

    1. Click Search, and from the Resources list, select RayCluster.

    2. Select the Ray cluster resource, and click the YAML tab.

    3. Check the text in the status.conditions.message field.

  4. Check the cluster queue resource:

    1. Click Search, and from the Resources list, select ClusterQueue.

    2. Check your cluster queue configuration to ensure that the resources that you requested are within the limits defined for the project.

    3. Either reduce your requested resources, or contact your administrator to request more resources.
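
If you have CLI access, commands similar to the following inspect the same resources. They are a sketch only; the names in angle brackets are placeholders, and viewing a ClusterQueue might require additional permissions:

$ oc get workload -n <project_name>
$ oc get workload <workload_name> -n <project_name> -o yaml
$ oc get raycluster <ray_cluster_name> -n <project_name> -o yaml
$ oc get clusterqueue <cluster_queue_name> -o yaml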

My Ray cluster is in a failed state

Problem

You might have insufficient resources.

Diagnosis

The Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.

Resolution

If the failed state persists, complete the following steps:

  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Click Search, and from the Resources list, select Pod.

  3. Click your pod name to open the pod details page.

  4. Click the Events tab, and review the pod events to identify the cause of the problem.

  5. If you cannot resolve the problem, contact your administrator to request assistance.
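
The pod events are also available from the CLI. For example, a command similar to the following lists recent events for a specific pod; the names in angle brackets are placeholders:

$ oc get events -n <project_name> --field-selector involvedObject.name=<pod_name>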

I see a failed to call webhook error message for the CodeFlare Operator

Problem

After you run the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
Diagnosis

The CodeFlare Operator pod might not be running.

Resolution

Contact your administrator to request assistance.
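
Before contacting your administrator, and only if you have permission to view the operator namespace, you can check whether the webhook service has endpoints and whether the operator pod is running. The service name and namespace below are taken from the error message; adjust them if your deployment differs:

$ oc get endpoints codeflare-operator-webhook-service -n redhat-ods-applications
$ oc get pods -n redhat-ods-applications | grep codeflare-operator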

I see a failed to call webhook error message for Kueue

Problem

After you run the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
Diagnosis

The Kueue pod might not be running.

Resolution

Contact your administrator to request assistance.
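
Similarly, if you have permission to view the operator namespace, you can check whether the Kueue webhook service has endpoints; the service name and namespace below are taken from the error message:

$ oc get endpoints kueue-webhook-service -n redhat-ods-applications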

My Ray cluster doesn’t start

Problem

After you run the cluster.up() command, when you run either the cluster.details() command or the cluster.status() command, the Ray cluster remains in the Starting status instead of changing to the Ready status. No pods are created.

Diagnosis
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Check the workload resource:

    1. Click Search, and from the Resources list, select Workload.

    2. Select the workload resource that is created with the Ray cluster resource, and click the YAML tab.

    3. Check the text in the status.conditions.message field, which provides the reason for remaining in the Starting state.

  3. Check the Ray cluster resource:

    1. Click Search, and from the Resources list, select RayCluster.

    2. Select the Ray cluster resource, and click the YAML tab.

    3. Check the text in the status.conditions.message field.
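
The same status information is available from the CLI. For example, a command similar to the following prints the status conditions of the Ray cluster resource; the resource name is a placeholder:

$ oc get raycluster <ray_cluster_name> -n <project_name> -o jsonpath='{.status.conditions}'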

Resolution

If you cannot resolve the problem, contact your administrator to request assistance.

I see a Default Local Queue … not found error message

Problem

After you run the cluster.up() command, the following error is shown:

Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
Diagnosis

No default local queue is defined, and a local queue is not specified in the cluster configuration.

Resolution
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Click Search, and from the Resources list, select LocalQueue.

  3. Resolve the problem in one of the following ways:

    • If a local queue exists, add it to your cluster configuration as follows:

      local_queue="<local_queue_name>"
    • If no local queue exists, contact your administrator to request assistance.
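
From the CLI, you can check whether any local queue in your project carries the default-queue annotation mentioned in the error message. The following commands are a sketch; the queue name is a placeholder, and you should look for the kueue.x-k8s.io/default-queue annotation set to true under metadata.annotations in the output:

$ oc get localqueue -n <project_name>
$ oc get localqueue <local_queue_name> -n <project_name> -o yaml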

I see a local_queue provided does not exist error message

Problem

After you run the cluster.up() command, the following error is shown:

local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
Diagnosis

An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.

Resolution
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Click Search, and from the Resources list, select LocalQueue.

  3. Resolve the problem in one of the following ways:

    • If a local queue exists, ensure that you spelled the local queue name correctly in your cluster configuration, and that the namespace value in the cluster configuration matches your project name. If you do not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.

    • If no local queue exists, contact your administrator to request assistance.
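
From the CLI, you can confirm your current project and list the local queues that exist in it, to check the name and namespace that your cluster configuration should use. These commands are a sketch and assume that you are logged in:

$ oc project
$ oc get localqueue -n <project_name>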

I cannot create a Ray cluster or submit jobs

Problem

After you run the cluster.up() command, an error similar to the following error is shown:

RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
Diagnosis

The correct OpenShift login credentials are not specified in the TokenAuthentication section of your notebook code.

Resolution
  1. Identify the correct OpenShift login credentials as follows:

    1. In the OpenShift Container Platform console header, click your username and click Copy login command.

    2. In the new tab that opens, log in as the user whose credentials you want to use.

    3. Click Display Token.

    4. From the Log in with this token section, copy the token and server values.

  2. In your notebook code, specify the copied token and server values as follows:

    auth = TokenAuthentication(
        token = "<token>",
        server = "<server>",
        skip_tls=False
    )
    auth.login()
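
Optionally, before rerunning your notebook code, you can verify the copied credentials and your permissions from a terminal. The following commands are a sketch; oc auth can-i reports only whether the logged-in user is allowed to list Ray clusters in the project:

$ oc login --token=<token> --server=<server>
$ oc auth can-i list rayclusters.ray.io -n <project_name>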

My pod provisioned by Kueue is terminated before my image is pulled

Problem

Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.

Diagnosis
  1. In the OpenShift Container Platform console, select your project from the Project list.

  2. Click Search, and from the Resources list, select Pod.

  3. Click the Ray head pod name to open the pod details page.

  4. Click the Events tab, and review the pod events to check whether the image pull completed successfully.

Resolution

If the pod takes more than 5 minutes to pull the image, contact your administrator to request assistance.
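
The image pull progress is also visible from the CLI. For example, a command similar to the following describes the Ray head pod so that you can review its events; the pod name is a placeholder:

$ oc describe pod <ray_head_pod_name> -n <project_name>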