
Managing Open Data Hub

As an OpenShift cluster administrator, you can manage the following Open Data Hub resources:

  • Users and groups

  • The dashboard interface, including the visibility of navigation menu options

  • Applications that show in the dashboard

  • Custom deployment resources that are related to the Open Data Hub Operator, for example, CPU and memory limits and requests

  • Accelerators

  • Distributed workloads

Managing users and groups

Users with administrator access to OpenShift Container Platform can add, modify, and remove user permissions for Open Data Hub.

Overview of user types and permissions

Table 1 describes the Open Data Hub user types.

Table 1. User types
User Type Permissions

Users

Machine learning operations (MLOps) engineers and data scientists can access and use individual components of Open Data Hub, such as workbenches and data science pipelines.

Administrators

In addition to the actions permitted to users, administrators can perform these actions:

  • Configure Open Data Hub settings.

  • Access and manage notebook servers.

  • Access and manage data science pipeline applications for any data science project.

By default, all OpenShift users have access to Open Data Hub. In addition, users in the OpenShift administrator group (cluster admins) automatically have administrator access in Open Data Hub.

Optionally, if you want to restrict access to your Open Data Hub deployment to specific users or groups, you can create user groups for users and administrators.

If you decide to restrict access, and you already have groups defined in your configured identity provider, you can add these groups to your Open Data Hub deployment. If you decide to use groups without adding these groups from an identity provider, you must create the groups in OpenShift Container Platform and then add users to them.

There are some operations relevant to Open Data Hub that require the cluster-admin role. Those operations include:

  • Adding users to the Open Data Hub user and administrator groups, if you are using groups.

  • Removing users from the Open Data Hub user and administrator groups, if you are using groups.

  • Managing custom environment and storage configuration for users in OpenShift Container Platform, such as Jupyter notebook resources, ConfigMaps, and persistent volume claims (PVCs).
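If you prefer to manage these groups declaratively, an OpenShift group is a small cluster-scoped resource. The following is a minimal sketch; the group name odh-admins and the user names are examples only:

```yaml
# Example OpenShift group for Open Data Hub administrators.
# The group and user names are illustrative; substitute your own.
apiVersion: user.openshift.io/v1
kind: Group
metadata:
  name: odh-admins
users:
  - admin-user-1
  - admin-user-2
```

Apply the manifest with `oc apply -f`, or create the group imperatively with `oc adm groups new odh-admins` followed by `oc adm groups add-users odh-admins <username>`.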

Important

Although users of Open Data Hub and its components are authenticated through OpenShift, session management is separate from authentication. Logging out of OpenShift Container Platform or Open Data Hub does not end a logged-in Jupyter session running on those platforms. Therefore, when a user's permissions change, that user must log out of all current sessions for the changes to take effect.

Viewing Open Data Hub users

If you have defined specialized user groups for Open Data Hub, you can view the users that belong to these groups.

Prerequisites
  • The Open Data Hub user group, administrator group, or both exist.

  • You have the cluster-admin role in OpenShift Container Platform.

  • You have configured a supported identity provider for OpenShift Container Platform.

Procedure
  1. In the OpenShift Container Platform web console, click User Management → Groups.

  2. Click the name of the group containing the users that you want to view.

    • For administrative users, click the name of your administrator group, for example, odh-admins.

    • For normal users, click the name of your user group, for example, odh-users.

      The Group details page for the group appears.

Verification
  • In the Users section for the relevant group, you can view the users who have permission to access Open Data Hub.

Adding users to Open Data Hub user groups

By default, all OpenShift users have access to Open Data Hub.

Optionally, you can restrict user access to your Open Data Hub instance by defining user groups. You must grant users permission to access Open Data Hub by adding user accounts to the Open Data Hub user group, administrator group, or both. You can either use the default group name, or specify a group name that already exists in your identity provider.

The user group provides the user with access to product components in the Open Data Hub dashboard, such as data science pipelines, and associated services, such as Jupyter. By default, users in the user group have access to data science pipeline applications within data science projects that they created.

The administrator group provides the user with access to developer and administrator functions in the Open Data Hub dashboard, such as data science pipelines, and associated services, such as Jupyter. Users in the administrator group can configure data science pipeline applications in the Open Data Hub dashboard for any data science project.

If you restrict access by using user groups, users that are not in the Open Data Hub user group or administrator group cannot view the dashboard and use associated services, such as Jupyter. They are also unable to access the Cluster settings page.

Important

If you are using LDAP as your identity provider, you need to configure LDAP syncing to OpenShift Container Platform. For more information, see Syncing LDAP groups.
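For reference, a minimal rfc2307-style sync configuration has the following shape; the server URL, base DNs, and attribute names below are placeholders that you must adapt to your directory:

```yaml
# Example LDAPSyncConfig (rfc2307 schema); all values are placeholders.
kind: LDAPSyncConfig
apiVersion: v1
url: ldap://ldap.example.com:389
insecure: false
rfc2307:
  groupsQuery:
    baseDN: "ou=groups,dc=example,dc=com"
    scope: sub
    derefAliases: never
    filter: (objectClass=groupOfNames)
  groupUIDAttribute: dn
  groupNameAttributes: [ cn ]
  groupMembershipAttributes: [ member ]
  usersQuery:
    baseDN: "ou=users,dc=example,dc=com"
    scope: sub
    derefAliases: never
  userUIDAttribute: dn
  userNameAttributes: [ uid ]
```

You then run the sync with `oc adm groups sync --sync-config=<file> --confirm`, typically on a schedule, so that group membership changes in LDAP propagate to OpenShift.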

Follow the steps in this section to add users to your Open Data Hub administrator and user groups.

Note: You can add users in Open Data Hub, but you must manage the user lists in the OpenShift Container Platform web console.

Prerequisites
  • You have configured a supported identity provider for OpenShift Container Platform.

  • You are assigned the cluster-admin role in OpenShift Container Platform.

  • You have defined an administrator group and user group for Open Data Hub.

Procedure
  1. In the OpenShift Container Platform web console, click User Management → Groups.

  2. Click the name of the group you want to add users to.

    • For administrative users, click the administrator group, for example, odh-admins.

    • For normal users, click the user group, for example, odh-users.

      The Group details page for that group appears.

  3. Click Actions → Add Users.

    The Add Users dialog appears.

  4. In the Users field, enter the relevant user name to add to the group.

  5. Click Save.

Verification
  • Click the Details tab for each group and confirm that the Users section contains the user names that you added.

Selecting Open Data Hub administrator and user groups

By default, all users authenticated in OpenShift can access Open Data Hub.

Also by default, users with cluster-admin permissions are Open Data Hub administrators. A cluster admin is a superuser who can perform any action in any project in the OpenShift cluster. When bound to a user with a local binding, the cluster-admin role grants full control over quota and over every action on every resource in that project.

After a cluster admin user defines additional administrator and user groups in OpenShift, you can add those groups to Open Data Hub by selecting them in the Open Data Hub dashboard.

Prerequisites
  • You have administrator privileges in Open Data Hub and can view the Settings navigation option in the Open Data Hub dashboard.

  • You have logged in to Open Data Hub as described in Logging in to Open Data Hub.

  • The groups that you want to select as administrator and user groups for Open Data Hub already exist in OpenShift Container Platform. For more information, see Managing users and groups.

Procedure
  1. From the Open Data Hub dashboard, click Settings → User management.

  2. Select your Open Data Hub admin groups: Under Data science administrator groups, click the text box and select an OpenShift group. Repeat this process to define multiple admin groups.

  3. Select your Open Data Hub user groups: Under Data science user groups, click the text box and select an OpenShift group. Repeat this process to define multiple user groups.

    Important
    The system:authenticated setting allows all users authenticated in OpenShift to access Open Data Hub.
  4. Click Save changes.
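The selections that you make on the User management page are stored in the groupsConfig section of the OdhDashboardConfig custom resource. The following sketch assumes the example group names odh-admins and odh-users; in current releases, multiple groups are expressed as a comma-separated string:

```yaml
spec:
  groupsConfig:
    adminGroups: odh-admins
    allowedGroups: odh-users
```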

Verification
  • Administrator users can successfully log in to Open Data Hub and have access to the Settings navigation menu.

  • Non-administrator users can successfully log in to Open Data Hub. They can also access and use individual components, such as projects and workbenches.

Deleting users

About deleting users and their resources

If you have administrator access to OpenShift Container Platform, you can revoke a user’s access to Jupyter and delete the user’s resources from Open Data Hub.

Deleting a user and the user’s resources involves the following tasks:

  • Before you delete a user from Open Data Hub, it is good practice to back up the data on your persistent volume claims (PVCs).

  • Stop notebook servers owned by the user.

  • Revoke user access to Jupyter.

  • Remove the user from the allowed group in your OpenShift identity provider.

  • After you delete a user, delete their associated configuration files from OpenShift Container Platform.

Stopping notebook servers owned by other users

Administrators can stop notebook servers that are owned by other users to reduce resource consumption on the cluster, or as part of removing a user and their resources from the cluster.

Prerequisites
  • If you are using Open Data Hub groups, you are part of the administrator group (for example, odh-admins). If you are not using groups, you have OpenShift cluster-admin privileges.

  • You have launched the Jupyter application, as described in Starting a Jupyter notebook server.

  • The notebook server that you want to stop is running.

Procedure
  1. On the page that opens when you launch Jupyter, click the Administration tab.

  2. Stop one or more servers.

    • If you want to stop one or more specific servers, perform the following actions:

      1. In the Users section, locate the user that the notebook server belongs to.

      2. To stop the notebook server, perform one of the following actions:

        • Click the action menu (⋮) beside the relevant user and select Stop server.

        • Click View server beside the relevant user and then click Stop notebook server.

          The Stop server dialog box appears.

      3. Click Stop server.

    • If you want to stop all servers, perform the following actions:

      1. Click the Stop all servers button.

      2. Click OK to confirm stopping all servers.

Verification
  • The Stop server link beside each server changes to a Start server link when the notebook server has stopped.

Revoking user access to Jupyter

You can revoke a user’s access to Jupyter by removing the user from the specialized user groups that define access to Open Data Hub. When you remove a user from the specialized user groups, the user is prevented from accessing the Open Data Hub dashboard and from using associated services that consume resources in your cluster.

Important
Follow these steps only if you have implemented specialized user groups to restrict access to Open Data Hub. To completely remove a user from Open Data Hub, you must remove them from the allowed group in your OpenShift identity provider.
Prerequisites
  • You have stopped any notebook servers owned by the user you want to delete.

  • You are using specialized user groups for Open Data Hub, and the user is part of the specialized user group, administrator group, or both.

Procedure
  1. In the OpenShift Container Platform web console, click User Management → Groups.

  2. Click the name of the group that you want to remove the user from.

    • For administrative users, click the name of your administrator group, for example, odh-admins.

    • For non-administrator users, click the name of your user group, for example, odh-users.

    The Group details page for the group appears.

  3. In the Users section on the Details tab, locate the user that you want to remove.

  4. Click the action menu (⋮) beside the user that you want to remove and click Remove user.

Verification
  • Check the Users section on the Details tab and confirm that the user that you removed is not visible.

  • In the rhods-notebooks project, check under Workloads → Pods and ensure that there is no notebook server pod for this user. If you see a pod named jupyter-nb-<username>-* for the user that you have removed, delete that pod to ensure that the deleted user is not consuming resources on the cluster.

  • In the Open Data Hub dashboard, check the list of data science projects. Delete any projects that belong to the user.

Backing up storage data

It is a best practice to back up the data on your persistent volume claims (PVCs) regularly.

Backing up your data is particularly important before you delete a user and before you uninstall Open Data Hub, as all PVCs are deleted when Open Data Hub is uninstalled.

See the documentation for your cluster platform for more information about backing up your PVCs.

Cleaning up after deleting users

After you remove a user’s access to Open Data Hub or Jupyter, you must also delete the configuration files for the user from OpenShift Container Platform. Red Hat recommends that you back up the user’s data before removing their configuration files.

Prerequisites
  • (Optional) If you want to completely remove the user’s access to Open Data Hub, you have removed their credentials from your identity provider.

  • You have revoked the user’s access to Jupyter.

  • If you are using Open Data Hub groups, you are part of the administrator group (for example, odh-admins). If you are not using groups, you have OpenShift cluster-admin privileges.

  • You have logged in to the OpenShift Container Platform web console.

  • You have logged in to Open Data Hub.

Procedure
  1. Delete the user’s persistent volume claim (PVC).

    1. Click Storage → PersistentVolumeClaims.

    2. If it is not already selected, select the rhods-notebooks project from the project list.

    3. Locate the jupyter-nb-<username> PVC.

      Replace <username> with the relevant user name.

    4. Click the action menu (⋮) and select Delete PersistentVolumeClaim from the list.

      The Delete PersistentVolumeClaim dialog appears.

    5. Inspect the dialog and confirm that you are deleting the correct PVC.

    6. Click Delete.

  2. Delete the user’s ConfigMap.

    1. Click Workloads → ConfigMaps.

    2. If it is not already selected, select the rhods-notebooks project from the project list.

    3. Locate the jupyterhub-singleuser-profile-<username> ConfigMap.

      Replace <username> with the relevant user name.

    4. Click the action menu (⋮) and select Delete ConfigMap from the list.

      The Delete ConfigMap dialog appears.

    5. Inspect the dialog and confirm that you are deleting the correct ConfigMap.

    6. Click Delete.

Verification
  • The user can no longer access Jupyter and sees an "Access permission needed" message if they try.

  • The user’s single-user profile, persistent volume claim (PVC), and ConfigMap are not visible in OpenShift Container Platform.

Customizing the dashboard

The Open Data Hub dashboard provides features that are designed to work for most scenarios. These features are configured in the OdhDashboardConfig custom resource (CR) file.

To see a description of the options in the Open Data Hub dashboard configuration file, see Dashboard configuration options.

As an administrator, you can customize the interface of the dashboard, for example to show or hide some of the dashboard navigation menu options. To change the default settings of the dashboard, edit the OdhDashboardConfig custom resource (CR) file as described in Editing the dashboard configuration file.

Editing the dashboard configuration file

As an administrator, you can customize the interface of the dashboard by editing the dashboard configuration file.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. In the Administrator perspective, click Home → API Explorer.

  3. In the search bar, enter OdhDashboardConfig to filter by kind.

  4. Click the OdhDashboardConfig custom resource (CR) to open the resource details page.

  5. Select the redhat-ods-applications project from the Project list.

  6. Click the Instances tab.

  7. Click the odh-dashboard-config instance to open the details page.

  8. Click the YAML tab.

  9. Edit the values of the options that you want to change.

  10. Click Save to apply your changes and then click Reload to synchronize your changes to the cluster.
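For example, to hide the Support menu option, you would set disableSupport to true in the dashboardConfig section. The following sketch shows the relevant part of the CR; the apiVersion shown matches current releases but may differ in yours:

```yaml
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
  name: odh-dashboard-config
  namespace: redhat-ods-applications
spec:
  dashboardConfig:
    disableSupport: true
```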

Verification

Log in to Open Data Hub and verify that your dashboard configurations apply.

Dashboard configuration options

The Open Data Hub dashboard includes a set of core features enabled by default that are designed to work for most scenarios. Administrators can configure the Open Data Hub dashboard from the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform.

Table 2. Dashboard feature configuration options

Feature

Default

Description

dashboardConfig: disableAcceleratorProfiles

false

Shows the Settings → Accelerator profiles option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableBiasMetrics

false

Shows the Model Bias tab on the Model Serving page. To hide this tab, set the value to true.

dashboardConfig: disableBYONImageStream

false

Shows the Settings → Notebook images option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableClusterManager

false

Shows the Settings → Cluster settings option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableConnectionTypes

true

Note: Do not edit the default value; this feature is not available in this version of Open Data Hub.

dashboardConfig: disableCustomServingRuntimes

false

Shows the Settings → Serving runtimes option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableDistributedWorkloads

false

Shows the Distributed Workload Metrics option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableHome

false

Shows the Home option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableInfo

false

On the Applications → Explore page, when a user clicks an application tile, an information panel opens with more details about the application. To disable the information panel for all applications on the Applications → Explore page, set the value to true.

dashboardConfig: disableISVBadges

false

Shows the label on a tile that indicates whether the application is “Red Hat managed”, “Partner managed”, or “Self-managed”. To hide these labels, set the value to true.

dashboardConfig: disableKServe

false

Enables the ability to select KServe as a model-serving platform. To disable this ability, set the value to true.

dashboardConfig: disableKServeAuth

false

Enables the ability to use authentication with KServe. To disable this ability, set the value to true.

dashboardConfig: disableKServeMetrics

false

Enables the ability to view KServe metrics. To disable this ability, set the value to true.

dashboardConfig: disableModelMesh

false

Enables the ability to select ModelMesh as a model-serving platform. To disable this ability, set the value to true.

dashboardConfig: disableModelRegistry

false

Shows the Model Registry option and the Settings → Model registry settings option in the dashboard navigation menu. To hide these menu options, set the value to true.

dashboardConfig: disableModelServing

false

Shows the Model Serving option in the dashboard navigation menu and in the list of components for the data science projects. To hide Model Serving from the dashboard navigation menu and from the list of components for data science projects, set the value to true.

dashboardConfig: disableNIMModelServing

true

Disables the ability to select NVIDIA NIM as a model-serving platform. To enable this ability, set the value to false.

dashboardConfig: disablePerformanceMetrics

false

Shows the Endpoint Performance tab on the Model Serving page. To hide this tab, set the value to true.

dashboardConfig: disablePipelines

false

Shows the Data Science Pipelines option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableProjects

false

Shows the Data Science Projects option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableProjectSharing

false

Allows users to share access to their data science projects with other users. To prevent users from sharing data science projects, set the value to true.

dashboardConfig: disableStorageClasses

false

Shows the Settings → Storage classes option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: disableSupport

false

Shows the Support menu option when a user clicks the Help icon in the dashboard toolbar. To hide this menu option, set the value to true.

dashboardConfig: disableTracking

true

Allows Red Hat to collect data about Open Data Hub usage in your cluster. To enable data collection, set the value to false. You can also set this option in the Open Data Hub dashboard interface from the Settings → Cluster settings navigation menu.

dashboardConfig: disableUserManagement

false

Shows the Settings → User management option in the dashboard navigation menu. To hide this menu option, set the value to true.

dashboardConfig: enablement

true

Enables admin users to add applications to the Open Data Hub dashboard Applications → Enabled page. To disable this ability, set the value to false.

notebookController: enabled

true

Controls the Notebook Controller options, such as whether it is enabled in the dashboard and which parts are visible.

notebookSizes

Allows you to customize names and resources for notebooks. The Kubernetes-style sizes are shown in the drop-down menu that appears when spawning notebooks with the Notebook Controller. Note: These sizes must follow conventions. For example, requests must be smaller than limits.

modelServerSizes

Allows you to customize names and resources for model servers.

groupsConfig

Controls access to dashboard features, such as the spawner for allowed users and the cluster settings UI for admin users.

templateOrder

Specifies the order of custom Serving Runtime templates. When the user creates a new template, it is added to this list.
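As an illustration, notebookSizes entries follow the standard Kubernetes resource-requirements shape. The names and values below are examples only; note that requests must not exceed limits:

```yaml
spec:
  notebookSizes:
    - name: Small
      resources:
        requests:
          cpu: "1"
          memory: 8Gi
        limits:
          cpu: "2"
          memory: 8Gi
    - name: Medium
      resources:
        requests:
          cpu: "3"
          memory: 24Gi
        limits:
          cpu: "6"
          memory: 24Gi
```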

Managing applications that show in the dashboard

Adding an application to the dashboard

If you have installed an application in your OpenShift Container Platform cluster, you can add a tile for that application to the Open Data Hub dashboard (the Applications → Enabled page) to make it accessible for Open Data Hub users.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • The dashboard configuration enablement option is set to true (the default). Note that an admin user can disable this ability as described in Preventing users from adding applications to the dashboard.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. In the Administrator perspective, click Home → API Explorer.

  3. On the API Explorer page, search for the OdhApplication kind.

  4. Click the OdhApplication kind to open the resource details page.

  5. On the OdhApplication details page, select the redhat-ods-applications project from the Project list.

  6. Click the Instances tab.

  7. Click Create OdhApplication.

  8. On the Create OdhApplication page, copy the following code and paste it into the YAML editor.

    apiVersion: dashboard.opendatahub.io/v1
    kind: OdhApplication
    metadata:
      name: examplename
      namespace: redhat-ods-applications
      labels:
        app: odh-dashboard
        app.kubernetes.io/part-of: odh-dashboard
    spec:
      enable:
        validationConfigMap: examplename-enable
      img: >-
        <svg width="24" height="25" viewBox="0 0 24 25" fill="none" xmlns="http://www.w3.org/2000/svg">
        <path d="path data" fill="#ee0000"/>
        </svg>
      getStartedLink: 'https://example.org/docs/quickstart.html'
      route: exampleroutename
      routeNamespace: examplenamespace
      displayName: Example Name
      kfdefApplications: []
      support: third party support
      csvName: ''
      provider: example
      docsLink: 'https://example.org/docs/index.html'
      quickStart: ''
      getStartedMarkDown: >-
        # Example
    
        Enter text for the information panel.
    
      description: >-
        Enter summary text for the tile.
      category: Self-managed | Partner managed | Red Hat managed
  9. Modify the parameters in the code for your application.

    Tip
    To see example YAML files, click Home → API Explorer, select OdhApplication, click the Instances tab, select an instance, and then click the YAML tab.
  10. Click Create. The application details page appears.

  11. Log in to Open Data Hub.

  12. In the left menu, click Applications → Explore.

  13. Locate the new tile for your application and click it.

  14. In the information pane for the application, click Enable.

Verification
  • In the left menu of the Open Data Hub dashboard, click Applications → Enabled and verify that your application is available.

Preventing users from adding applications to the dashboard

By default, admin users are allowed to add applications to the Open Data Hub dashboard Applications → Enabled page.

You can disable the ability for admin users to add applications to the dashboard.

Note: The Jupyter tile is enabled by default. To disable it, see Hiding the default Jupyter application.

Prerequisite
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. Open the dashboard configuration file:

    1. In the Administrator perspective, click Home → API Explorer.

    2. In the search bar, enter OdhDashboardConfig to filter by kind.

    3. Click the OdhDashboardConfig custom resource (CR) to open the resource details page.

    4. Select the redhat-ods-applications project from the Project list.

    5. Click the Instances tab.

    6. Click the odh-dashboard-config instance to open the details page.

    7. Click the YAML tab.

  3. In the spec:dashboardConfig section, set the value of enablement to false to disable the ability for dashboard users to add applications to the dashboard.

  4. Click Save to apply your changes and then click Reload to make sure that your changes are synced to the cluster.
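After step 3, the relevant part of the YAML looks like this:

```yaml
spec:
  dashboardConfig:
    enablement: false
```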

Verification

Open the Open Data Hub dashboard Applications → Enabled page.

Disabling applications connected to Open Data Hub

You can disable applications and components so that they do not appear on the Open Data Hub dashboard when you no longer want to use them, for example, when data scientists no longer use an application or when the application license expires.

Disabling unused applications allows your data scientists to manually remove these application tiles from their Open Data Hub dashboard so that they can focus on the applications that they are most likely to use.

Important

Do not follow this procedure when disabling the following applications:

  • Anaconda Professional Edition. You cannot manually disable Anaconda Professional Edition. It is automatically disabled only when its license expires.

Prerequisites
  • You have logged in to the OpenShift Container Platform web console.

  • You are part of the cluster-admins user group in OpenShift Container Platform.

  • You have installed or configured the service on your OpenShift Container Platform cluster.

  • The application or component that you want to disable is enabled and appears on the Enabled page.

Procedure
  1. In the OpenShift Container Platform web console, switch to the Administrator perspective.

  2. Switch to the odh project.

  3. Click Operators → Installed Operators.

  4. Click on the Operator that you want to uninstall. You can enter a keyword into the Filter by name field to help you find the Operator faster.

  5. Delete any Operator resources or instances by using the tabs in the Operator interface.

    During installation, some Operators require the administrator to create resources or start process instances using tabs in the Operator interface. These must be deleted before the Operator can uninstall correctly.

  6. On the Operator Details page, click the Actions drop-down menu and select Uninstall Operator.

    An Uninstall Operator? dialog box is displayed.

  7. Select Uninstall to uninstall the Operator, Operator deployments, and pods. After this is complete, the Operator stops running and no longer receives updates.

Important

Removing an Operator does not remove any custom resource definitions or managed resources for the Operator. Custom resource definitions and managed resources still exist and must be cleaned up manually. Any applications deployed by your Operator and any configured off-cluster resources continue to run and must be cleaned up manually.

Verification
  • The Operator is uninstalled from its target clusters.

  • The Operator no longer appears on the Installed Operators page.

  • The disabled application is no longer available for your data scientists to use, and is marked as Disabled on the Enabled page of the Open Data Hub dashboard. This action may take a few minutes to occur following the removal of the Operator.

Showing or hiding information about enabled applications

If you have installed another application in your OpenShift Container Platform cluster, you can add a tile for that application to the Open Data Hub dashboard (the ApplicationsEnabled page) to make it accessible for Open Data Hub users.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. In the Administrator perspective, click HomeAPI Explorer.

  3. On the API Explorer page, search for the OdhApplication kind.

  4. Click the OdhApplication kind to open the resource details page.

  5. On the OdhApplication details page, select the redhat-ods-applications project from the Project list.

  6. Click the Instances tab.

  7. Click Create OdhApplication.

  8. On the Create OdhApplication page, copy the following code and paste it into the YAML editor.

    apiVersion: dashboard.opendatahub.io/v1
    kind: OdhApplication
    metadata:
      name: examplename
      namespace: redhat-ods-applications
      labels:
        app: odh-dashboard
        app.kubernetes.io/part-of: odh-dashboard
    spec:
      enable:
        validationConfigMap: examplename-enable
      img: >-
        <svg width="24" height="25" viewBox="0 0 24 25" fill="none" xmlns="http://www.w3.org/2000/svg">
        <path d="path data" fill="#ee0000"/>
        </svg>
      getStartedLink: 'https://example.org/docs/quickstart.html'
      route: exampleroutename
      routeNamespace: examplenamespace
      displayName: Example Name
      kfdefApplications: []
      support: third party support
      csvName: ''
      provider: example
      docsLink: 'https://example.org/docs/index.html'
      quickStart: ''
      getStartedMarkDown: >-
        # Example
    
        Enter text for the information panel.
    
      description: >-
        Enter summary text for the tile.
      category: Self-managed | Partner managed | Red Hat managed
  9. Modify the parameters in the code for your application.

    Tip
    To see example YAML files, click Home → API Explorer, select OdhApplication, click the Instances tab, select an instance, and then click the YAML tab.
  10. Click Create. The application details page appears.

  11. Log in to Open Data Hub.

  12. In the left menu, click Applications → Explore.

  13. Locate the new tile for your application and click it.

  14. In the information pane for the application, click Enable.

Verification
  • In the left menu of the Open Data Hub dashboard, click Applications → Enabled and verify that your application is available.

Hiding the default Jupyter application

The Open Data Hub dashboard includes Jupyter as an enabled application by default.

To hide the Jupyter tile from the list of Enabled applications, edit the dashboard configuration file.

Prerequisite
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. Open the dashboard configuration file:

    1. In the Administrator perspective, click Home → API Explorer.

    2. In the search bar, enter OdhDashboardConfig to filter by kind.

    3. Click the OdhDashboardConfig custom resource (CR) to open the resource details page.

    4. Select the redhat-ods-applications project from the Project list.

    5. Click the Instances tab.

    6. Click the odh-dashboard-config instance to open the details page.

    7. Click the YAML tab.

  3. In the spec:notebookController section, set the value of enabled to false to hide the Jupyter tile from the list of Enabled applications.
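    For example, after the change, the notebookController section of the OdhDashboardConfig YAML looks similar to the following sketch (other fields in the section are unchanged):

```yaml
spec:
  notebookController:
    enabled: false
```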

  4. Click Save to apply your changes and then click Reload to make sure that your changes are synced to the cluster.

Verification

In the Open Data Hub dashboard, select Applications → Enabled. You should not see the Jupyter tile.

Allocating additional resources to Open Data Hub users

As a cluster administrator, you can allocate additional resources to a cluster to support compute-intensive data science work. This support includes increasing the number of nodes in the cluster and changing the cluster’s allocated machine pool.

For more information about allocating additional resources to an OpenShift Container Platform cluster, see Manually scaling a compute machine set.

Customizing component deployment resources

Overview of component resource customization

You can customize deployment resources that are related to the Open Data Hub Operator, for example, CPU and memory limits and requests. For resource customizations to persist without being overwritten by the Operator, the opendatahub.io/managed: true annotation must not be present in the YAML file for the component deployment. This annotation is absent by default.

The following table shows the deployment names for each component in the opendatahub namespace:

Component Deployment names

CodeFlare

codeflare-operator-manager

KServe

  • kserve-controller-manager

  • odh-model-controller

TrustyAI

trustyai-service-operator-controller-manager

Ray

kuberay-operator

Kueue

kueue-controller-manager

Workbenches

  • notebook-controller-deployment

  • odh-notebook-controller-manager

Dashboard

odh-dashboard

Model serving

  • modelmesh-controller

  • odh-model-controller

Model registry

model-registry-operator-controller-manager

Data science pipelines

data-science-pipelines-operator-controller-manager

Training Operator

kubeflow-training-operator

Customizing component resources

You can customize component deployment resources by updating the .spec.template.spec.containers.resources section of the YAML file for the component deployment.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You are part of the administrator group for Open Data Hub in OpenShift Container Platform.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. In the Administrator perspective, click Workloads > Deployments.

  3. From the Project drop-down list, select opendatahub.

  4. In the Name column, click the name of the deployment for the component that you want to customize resources for.

    Note

    For more information about the deployment names for each component, see Overview of component resource customization.

  5. On the Deployment details page that appears, click the YAML tab.

  6. Find the .spec.template.spec.containers.resources section.

  7. Update the value of the resource that you want to customize. For example, to update the memory limit to 500Mi, make the following change:

    containers:
      - resources:
          limits:
            cpu: '2'
            memory: 500Mi
          requests:
            cpu: '1'
            memory: 250Mi
  8. Click Save.

  9. Click Reload.
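If you prefer the command line, you can apply the same customization with an oc patch command. This is a sketch only: it assumes the odh-dashboard deployment with a container named odh-dashboard, so substitute the names and values for your component:

```shell
# Example only: set resource limits on the odh-dashboard deployment.
# Strategic merge patching merges the containers list by container name.
oc -n opendatahub patch deployment odh-dashboard --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"odh-dashboard","resources":{"limits":{"cpu":"2","memory":"500Mi"}}}]}}}}'
```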

Verification
  • Log in to Open Data Hub and verify that your resource changes apply.

Disabling component resource customization

You can disable customization of component deployment resources, and restore default values, by adding the opendatahub.io/managed: true annotation to the YAML file for the component deployment.

Important

Manually removing the opendatahub.io/managed annotation, or setting its value to false, after manually adding it to the YAML file for a component deployment might cause unexpected cluster issues.

To remove the annotation from a deployment, use the steps described in Re-enabling component resource customization.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You are part of the administrator group for Open Data Hub in OpenShift Container Platform.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. In the Administrator perspective, click Workloads > Deployments.

  3. From the Project drop-down list, select opendatahub.

  4. In the Name column, click the name of the deployment for the component to which you want to add the annotation.

    Note

    For more information about the deployment names for each component, see Overview of component resource customization.

  5. On the Deployment details page that appears, click the YAML tab.

  6. Find the metadata.annotations: section.

  7. Add the opendatahub.io/managed: true annotation.

    metadata:
      annotations:
        opendatahub.io/managed: 'true'
  8. Click Save.

  9. Click Reload.

Verification
  • The opendatahub.io/managed: true annotation appears in the YAML file for the component deployment.

Re-enabling component resource customization

You can re-enable customization of component deployment resources after manually disabling it.

Important

Manually removing the opendatahub.io/managed annotation, or setting its value to false, after adding it to the YAML file for a component deployment might cause unexpected cluster issues.

To remove the annotation from a deployment, use the following steps to delete the deployment. The controller pod for the deployment will automatically redeploy with the default settings.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You are part of the administrator group for Open Data Hub in OpenShift Container Platform.

Procedure
  1. Log in to the OpenShift Container Platform console as a cluster administrator.

  2. In the Administrator perspective, click Workloads > Deployments.

  3. From the Project drop-down list, select opendatahub.

  4. In the Name column, click the name of the deployment for the component for which you want to remove the annotation.

  5. Click the Options menu (⋮).

  6. Click Delete Deployment.

Verification
  • The controller pod for the deployment automatically redeploys with the default settings.

Enabling accelerators

Enabling NVIDIA GPUs

Before you can use NVIDIA GPUs in Open Data Hub, you must install the NVIDIA GPU Operator.

Prerequisites
  • You have logged in to your OpenShift Container Platform cluster.

  • You have the cluster-admin role in your OpenShift Container Platform cluster.

Procedure
  1. To enable GPU support on an OpenShift cluster, follow the instructions here: NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

  2. Delete the migration-gpu-status ConfigMap.

    1. In the OpenShift Container Platform web console, switch to the Administrator perspective.

    2. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.

    3. Search for the migration-gpu-status ConfigMap.

    4. Click the action menu (⋮) and select Delete ConfigMap from the list.

      The Delete ConfigMap dialog appears.

    5. Inspect the dialog and confirm that you are deleting the correct ConfigMap.

    6. Click Delete.

  3. Restart the dashboard replicaset.

    1. In the OpenShift Container Platform web console, switch to the Administrator perspective.

    2. Click Workloads → Deployments.

    3. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate deployment.

    4. Search for the rhods-dashboard deployment.

    5. Click the action menu (⋮) and select Restart Rollout from the list.

    6. Wait until the Status column indicates that all pods in the rollout have fully restarted.

Verification
  • The NVIDIA GPU Operator appears on the Operators → Installed Operators page in the OpenShift Container Platform web console.

  • The reset migration-gpu-status instance is present on the Instances tab on the AcceleratorProfile custom resource definition (CRD) details page.

After installing the NVIDIA GPU Operator, create an accelerator profile as described in Working with accelerators.
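An accelerator profile is an AcceleratorProfile custom resource. As a sketch of what such a profile can look like (the name, display name, and toleration here are examples; follow Working with accelerators for the authoritative steps):

```yaml
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia-gpu               # example name
  namespace: redhat-ods-applications
spec:
  displayName: NVIDIA GPU
  enabled: true
  identifier: nvidia.com/gpu     # resource name that the NVIDIA GPU Operator exposes
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```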

Intel Gaudi AI Accelerator integration

To accelerate your high-performance deep learning (DL) models, you can integrate Intel Gaudi AI accelerators in Open Data Hub. This allows your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators from their workbench. Before you can enable Intel Gaudi AI accelerators in Open Data Hub, you must install the necessary dependencies. Also, the version of the Intel Gaudi AI Operator that you install must match the version of the corresponding workbench image in your deployment. However, a workbench image for Intel Gaudi accelerators is not included in Open Data Hub by default. Instead, you must create and configure a custom notebook to enable Intel Gaudi AI support.

You can use Intel Gaudi AI accelerators in an Amazon EC2 DL1 instance on OpenShift. Therefore, your OpenShift platform must support EC2 DL1 instances. Before you can use your Intel Gaudi AI accelerators, you must enable them in your OpenShift environment and configure an accelerator profile for each device. When enabled and fully configured with a custom notebook, Intel Gaudi AI accelerators are available to your data scientists when they create a workbench instance or serve a model.

To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.
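For example, assuming the Habana Labs PCI vendor ID (1da3), a command similar to the following lists the Gaudi devices on a node:

```shell
# List PCI devices from Habana Labs (Intel Gaudi accelerators).
# 1da3 is the Habana Labs vendor ID; run this on the node that hosts the devices.
lspci -d 1da3: -nn
```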

Important

If the lspci utility indicates that Intel Gaudi AI accelerators are present in your deployment, it does not necessarily mean that the devices are ready to use.

Managing distributed workloads

Overview of Kueue resources

Cluster administrators can configure Kueue objects (such as resource flavors, a cluster queue, and local queues) to manage distributed workload resources across multiple nodes in an OpenShift cluster.

Resource flavor

The Kueue ResourceFlavor object describes the resource variations that are available in a cluster.

Resources in a cluster can be homogeneous or heterogeneous:

  • Homogeneous resources are identical across the cluster: same node type, CPUs, memory, accelerators, and so on.

  • Heterogeneous resources have variations across the cluster.

If a cluster has homogeneous resources, or if it is not necessary to manage separate quotas for different flavors of a resource, a cluster administrator can create an empty ResourceFlavor object named default-flavor, without any labels or taints, as follows:

Empty Kueue resource flavor for homogeneous resources
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
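If the cluster has heterogeneous resources, a resource flavor typically carries node labels so that Kueue can tie quota to a particular node type. The flavor name and label in this sketch are examples:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor              # example name
spec:
  nodeLabels:
    instance-type: gpu-node     # example label applied to the matching nodes
```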

For more information about configuring resource flavors, see Resource Flavor in the Kueue documentation.

Cluster queue

The Kueue ClusterQueue object manages a pool of cluster resources such as pods, CPUs, memory, and accelerators. A cluster queue can reference multiple resource flavors.

Cluster administrators can configure a cluster queue to define the resource flavors that the queue manages, and assign a quota for each resource in each resource flavor.

The following example configures a cluster queue to assign a quota of 9 CPUs, 36 GiB memory, 5 pods, and 5 NVIDIA GPUs.

Example cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "pods"
        nominalQuota: 5
      - name: "nvidia.com/gpu"
        nominalQuota: '5'

The cluster queue starts a distributed workload only if the total required resources are within these quota limits. If the sum of the requests for a resource in a distributed workload is greater than the specified quota for that resource in the cluster queue, the cluster queue does not admit the distributed workload.
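The admission rule amounts to summing the requests for each covered resource across the workload's pods and comparing each sum against the nominal quota. The following self-contained Python sketch illustrates the arithmetic only; it is not Kueue's implementation:

```python
# Nominal quotas from the example cluster queue above (memory in bytes).
QUOTA = {"cpu": 9, "memory": 36 * 1024**3, "pods": 5, "nvidia.com/gpu": 5}

def admits(quota, pod_requests):
    """Return True if the summed per-resource requests fit within every quota."""
    totals = {}
    for pod in pod_requests:
        for resource, amount in pod.items():
            totals[resource] = totals.get(resource, 0) + amount
    return all(total <= quota.get(resource, 0) for resource, total in totals.items())

# Two pods that each request 4 CPUs and 16 GiB: 8 CPUs and 32 GiB in total, admitted.
within = admits(QUOTA, [{"cpu": 4, "memory": 16 * 1024**3, "pods": 1}] * 2)
# Three such pods need 12 CPUs, which exceeds the 9-CPU quota, so not admitted.
beyond = admits(QUOTA, [{"cpu": 4, "memory": 16 * 1024**3, "pods": 1}] * 3)
print(within, beyond)
```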

For more information about configuring cluster queues, see Cluster Queue in the Kueue documentation.

Local queue

The Kueue LocalQueue object groups closely related distributed workloads in a project. Cluster administrators can configure local queues to specify the project name and the associated cluster queue. Each local queue then grants access to the resources that its specified cluster queue manages. A cluster administrator can optionally define one local queue in a project as the default local queue for that project.

When configuring a distributed workload, the user specifies the local queue name. If a cluster administrator configured a default local queue, the user can omit the local queue specification from the distributed workload code.
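With the CodeFlare SDK, for example, the local queue is typically passed through the ClusterConfiguration object. The names in this sketch are illustrative, and the local_queue argument can be omitted when a default local queue is defined:

```python
from codeflare_sdk import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="raytest",              # example Ray cluster name
    namespace="team-a",          # project that contains the local queue
    local_queue="team-a-queue",  # omit to use the project's default local queue
))
```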

Kueue allocates the resources for a distributed workload from the cluster queue that is associated with the local queue, if the total requested resources are within the quota limits specified in that cluster queue.

The following example configures a local queue called team-a-queue for the team-a project, and specifies cluster-queue as the associated cluster queue.

Example local queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a
  name: team-a-queue
  annotations:
    kueue.x-k8s.io/default-queue: "true"
spec:
  clusterQueue: cluster-queue

In this example, the kueue.x-k8s.io/default-queue: "true" annotation defines this local queue as the default local queue for the team-a project. If a user submits a distributed workload in the team-a project and that distributed workload does not specify a local queue in the cluster configuration, Kueue automatically routes the distributed workload to the team-a-queue local queue. The distributed workload can then access the resources that the cluster-queue cluster queue manages.

For more information about configuring local queues, see Local Queue in the Kueue documentation.

Configuring quota management for distributed workloads

Configure quotas for distributed workloads on a cluster, so that you can share resources between several data science projects.

Prerequisites
  • You have logged in to OpenShift Container Platform with the cluster-admin role.

  • You have installed the required distributed workloads components as described in Installing the distributed workloads components.

  • You have created a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about how to create a project, see Creating a data science project.

  • You have sufficient resources. In addition to the base Open Data Hub resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.

  • The resources are physically available in the cluster.

    Note

    For more information about Kueue resources, see Overview of Kueue resources.

  • If you want to use graphics processing units (GPUs), you have enabled GPU support. This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

Procedure
  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Create an empty Kueue resource flavor, as follows:

    1. Create a file called default_flavor.yaml and populate it with the following content:

      Empty Kueue resource flavor
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ResourceFlavor
      metadata:
        name: default-flavor
    2. Apply the configuration to create the default-flavor object:

      $ oc apply -f default_flavor.yaml
  3. Create a cluster queue to manage the empty Kueue resource flavor, as follows:

    1. Create a file called cluster_queue.yaml and populate it with the following content:

      Example cluster queue
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ClusterQueue
      metadata:
        name: "cluster-queue"
      spec:
        namespaceSelector: {}  # match all.
        resourceGroups:
        - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
          flavors:
          - name: "default-flavor"
            resources:
            - name: "cpu"
              nominalQuota: 9
            - name: "memory"
              nominalQuota: 36Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 5
    2. Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. The cluster queue will start a distributed workload only if the total required resources are within these quota limits.

      You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the spec.resourceGroups section as follows:

      • Include the resource name in the coveredResources list.

      • Specify the resource name and nominalQuota in the flavors.resources section, even if the nominalQuota value is 0.

    3. Apply the configuration to create the cluster-queue object:

      $ oc apply -f cluster_queue.yaml
  4. Create a local queue that points to your cluster queue, as follows:

    1. Create a file called local_queue.yaml and populate it with the following content:

      Example local queue
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: LocalQueue
      metadata:
        namespace: test
        name: local-queue-test
        annotations:
          kueue.x-k8s.io/default-queue: 'true'
      spec:
        clusterQueue: cluster-queue

      The kueue.x-k8s.io/default-queue: 'true' annotation defines this queue as the default queue. Distributed workloads are submitted to this queue if no local_queue value is specified in the ClusterConfiguration section of the data science pipeline, Jupyter notebook, or Microsoft Visual Studio Code file.

    2. Update the namespace value to specify the same namespace as in the ClusterConfiguration section that creates the Ray cluster.

    3. Optional: Update the name value accordingly.

    4. Apply the configuration to create the local-queue object:

      $ oc apply -f local_queue.yaml

      The cluster queue allocates the resources to run distributed workloads in the local queue.

Verification

Check the status of the local queue in a project, as follows:

$ oc get -n <project-name> localqueues
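The output is similar to the following; the exact column values depend on your configuration and on any submitted workloads:

```
NAME               CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
local-queue-test   cluster-queue   0                   0
```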
Additional resources

Configuring the CodeFlare Operator

If you want to change the default configuration of the CodeFlare Operator for distributed workloads in Open Data Hub, you can edit the associated config map.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.
Procedure
  1. In the OpenShift Container Platform console, click Workloads → ConfigMaps.

  2. From the Project list, select odh.

  3. Search for the codeflare-operator-config config map, and click the config map name to open the ConfigMap details page.

  4. Click the YAML tab to show the config map specifications.

  5. In the data:config.yaml:kuberay section, you can edit the following entries:

    ingressDomain

    This configuration option is an empty string (ingressDomain: "") by default. Do not change this option unless the Ingress Controller is not running on OpenShift. Open Data Hub uses this value to generate the dashboard and client routes for every Ray Cluster, as shown in the following examples:

    Example dashboard and client routes
    ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>
    ray-client-<clustername>-<namespace>.<your.ingress.domain>
    mTLSEnabled

    This configuration option is enabled (mTLSEnabled: true) by default. When this option is enabled, the Ray Cluster pods create certificates that are used for mutual Transport Layer Security (mTLS), a form of mutual authentication, between Ray Cluster nodes. When this option is enabled, Ray clients cannot connect to the Ray head node unless they download the generated certificates from the ca-secret-<cluster_name> secret, generate the necessary certificates for mTLS communication, and then set the required Ray environment variables. Users must then re-initialize the Ray clients to apply the changes. The CodeFlare SDK provides the following functions to simplify the authentication process for Ray clients:

    Example Ray client authentication code
    from codeflare_sdk import generate_cert
    
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)
    
    ray.init(cluster.cluster_uri())
    rayDashboardOAuthEnabled

    This configuration option is enabled (rayDashboardOAuthEnabled: true) by default. When this option is enabled, Open Data Hub places an OpenShift OAuth proxy in front of the Ray Cluster head node. Users must then authenticate by using their OpenShift cluster login credentials when accessing the Ray Dashboard through the browser. If users want to access the Ray Dashboard in another way (for example, by using the Ray JobSubmissionClient class), they must set an authorization header as part of their request, as shown in the following example:

    Example authorization header
    {Authorization: "Bearer <your-openshift-token>"}
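    For example, the Ray JobSubmissionClient class accepts custom request headers. The route URL and token in this sketch are placeholders:

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient(
    "https://ray-dashboard-raytest-team-a.example.com",  # placeholder dashboard route
    headers={"Authorization": "Bearer <your-openshift-token>"},
)
```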
  6. To save your changes, click Save.

  7. To apply your changes, delete the pod:

    1. Click Workloads → Pods.

    2. Find the codeflare-operator-manager-<pod-id> pod.

    3. Click the options menu (⋮) for that pod, and then click Delete Pod. The pod restarts with your changes applied.

Verification

Check the status of the codeflare-operator-manager pod, as follows:

  1. In the OpenShift Container Platform console, click Workloads → Deployments.

  2. Search for the codeflare-operator-manager deployment, and then click the deployment name to open the deployment details page.

  3. Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and then click the Logs tab.

Troubleshooting common problems with distributed workloads for administrators

If your users are experiencing errors in Open Data Hub relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem.

A user’s Ray cluster is in a suspended state

Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

Diagnosis

The user’s Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example:

status:
 conditions:
   - lastTransitionTime: '2024-05-29T13:05:09Z'
     message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
Resolution
  1. Check whether the resource flavor is created, as follows:

    1. In the OpenShift Container Platform console, select the user’s project from the Project list.

    2. Click Home → Search, and from the Resources list, select ResourceFlavor.

    3. If necessary, create the resource flavor.

  2. Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.

  3. If necessary, increase the resource quota.

For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.

A user’s Ray cluster is in a failed state

Problem

The user might have insufficient resources.

Diagnosis

The user’s Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.

Resolution

If the failed state persists, complete the following steps:

  1. In the OpenShift Container Platform console, select the user’s project from the Project list.

  2. Click Workloads → Pods.

  3. Click the user’s pod name to open the pod details page.

  4. Click the Events tab, and review the pod events to identify the cause of the problem.

  5. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.

A user receives a failed to call webhook error message for the CodeFlare Operator

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
Diagnosis

The CodeFlare Operator pod might not be running.

Resolution
  1. In the OpenShift Container Platform console, select the user’s project from the Project list.

  2. Click Workloads → Pods.

  3. Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.

  4. Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:

    INFO	controller-runtime.webhook	  Serving webhook server	{"host": "", "port": 9443}

A user receives a failed to call webhook error message for Kueue

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
Diagnosis

The Kueue pod might not be running.

Resolution
  1. In the OpenShift Container Platform console, select the user’s project from the Project list.

  2. Click Workloads → Pods.

  3. Verify that the Kueue pod is running. If necessary, restart the Kueue pod.

  4. Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:

    {"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}

A user’s Ray cluster does not start

Problem

After the user runs the cluster.up() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.

Diagnosis

Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.

Resolution
  1. In the OpenShift Container Platform console, select the user’s project from the Project list.

  2. Click Workloads → Pods.

  3. Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.

  4. Review the logs for the KubeRay pod to identify errors.

A user receives a Default Local Queue … not found error message

Problem

After the user runs the cluster.up() command, the following error is shown:

Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
Diagnosis

No default local queue is defined, and a local queue is not specified in the cluster configuration.

Resolution
  1. Check whether a local queue exists in the user’s project, as follows:

    1. In the OpenShift Container Platform console, select the user’s project from the Project list.

    2. Click Home → Search, and from the Resources list, select LocalQueue.

    3. If no local queues are found, create a local queue.

    4. Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.

  2. Define a default local queue.

    For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.

A user receives a local_queue provided does not exist error message

Problem

After the user runs the cluster.up() command, the following error is shown:

local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
Diagnosis

An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.

Resolution
  1. In the OpenShift Container Platform console, select the user’s project from the Project list.

  2. Click Home → Search, and from the Resources list, select LocalQueue.

  3. Resolve the problem in one of the following ways:

    • If no local queues are found, create a local queue.

    • If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to check that the local queue name in their cluster configuration is spelled correctly, and that the namespace value in the cluster configuration matches their project name. If the user does not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.

  4. Define a default local queue.

      For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
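The name-and-namespace check described above can be sketched as a small helper. This is a hypothetical example that operates on the JSON output of `oc get localqueues -A -o json`; the queue and project names are placeholders:

```python
def find_queue(queue_list: dict, name: str, namespace: str):
    """Return the LocalQueue matching name and namespace, or None.

    queue_list is the dict parsed from `oc get localqueues -A -o json`.
    """
    for item in queue_list.get("items", []):
        meta = item.get("metadata", {})
        if meta.get("name") == name and meta.get("namespace") == namespace:
            return item
    return None

# Illustrative listing: one queue in the project "demo-project".
queues = {"items": [
    {"metadata": {"name": "team-queue", "namespace": "demo-project"}},
]}

# Correct name and namespace: the queue is found.
print(find_queue(queues, "team-queue", "demo-project") is not None)  # True
# A typo in the name reproduces the error condition.
print(find_queue(queues, "teamqueue", "demo-project") is None)       # True
```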

A user cannot create a Ray cluster or submit jobs

Problem

After the user runs the cluster.up() command, an error similar to the following text is shown:

RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
Diagnosis

The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.

Resolution
  1. Advise the user to identify and specify the correct OpenShift login credentials as follows:

    1. In the OpenShift Container Platform console header, click your username and click Copy login command.

    2. In the new tab that opens, log in as the user whose credentials you want to use.

    3. Click Display Token.

    4. From the Log in with this token section, copy the token and server values.

    5. Specify the copied token and server values in your notebook code as follows:

      from codeflare_sdk import TokenAuthentication

      auth = TokenAuthentication(
          token="<token>",
          server="<server>",
          skip_tls=False,
      )
      auth.login()
  2. Verify that the user has the correct permissions and is part of the rhoai-users group.

The user’s pod provisioned by Kueue is terminated before the user’s image is pulled

Problem

Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.

Diagnosis
  1. In the OpenShift Container Platform console, select the user’s project from the Project list.

  2. Click Workloads → Pods.

  3. Click the user’s pod name to open the pod details page.

  4. Click the Events tab, and review the pod events to check whether the image pull completed successfully.

Resolution

If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:

  • Add an OnFailure restart policy for resources that are managed by Kueue.

  • In the redhat-ods-applications namespace, edit the kueue-manager-config ConfigMap to set a custom timeout for the waitForPodsReady property. For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.
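The edit to the ConfigMap modifies the Kueue controller configuration embedded in its data. The following fragment is a sketch only; the 10m timeout is an example value, and you must merge the waitForPodsReady settings into the existing controller_manager_config.yaml content rather than replace it:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kueue-manager-config
  namespace: redhat-ods-applications
data:
  controller_manager_config.yaml: |
    # ...existing configuration...
    waitForPodsReady:
      enable: true
      timeout: 10m   # example: raise the default 5m timeout to 10m
```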