
Serving models

About model serving

Serving trained models on Open Data Hub means deploying the models on your OpenShift cluster to test and then integrate them into intelligent applications. Deploying a model makes it available as a service that you can access by using an API. This enables you to return predictions based on data inputs that you provide through API calls. This process is known as model inferencing. When you serve a model on Open Data Hub, the inference endpoints that you can access for the deployed model are shown in the dashboard.
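
For example, if a model is served with a runtime that implements the REST v2 inference protocol (also called the Open Inference Protocol), you can query its inference endpoint with an HTTP request similar to the following sketch. The endpoint URL, model name, and input tensor are placeholders; substitute the inference endpoint shown in the dashboard and the input format that your model expects.

  curl -s -X POST \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [0.1, 0.2, 0.3, 0.4]}]}' \
    https://<inference_endpoint>/v2/models/<model_name>/infer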

Open Data Hub provides the following model serving platforms:

Single-model serving platform

For deploying large models such as large language models (LLMs), Open Data Hub includes a single-model serving platform that is based on the KServe component. Because each model is deployed from its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Multi-model serving platform

For deploying small and medium-sized models, Open Data Hub includes a multi-model serving platform that is based on the ModelMesh component. On the multi-model serving platform, you can deploy multiple models on the same model server. Each of the deployed models shares the server resources. This approach can be advantageous on OpenShift clusters that have finite compute resources or pods.

Serving small and medium-sized models

On the multi-model serving platform, multiple models can be deployed from the same model server and share the server resources.

Configuring model servers

Enabling the multi-model serving platform

To use the multi-model serving platform, you must first enable the platform.

Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

  • Your cluster administrator has not edited the Open Data Hub dashboard configuration to disable the ability to select the multi-model serving platform, which uses the ModelMesh component. For more information, see Dashboard configuration options.

Procedure
  1. In the left menu of the Open Data Hub dashboard, click Settings > Cluster settings.

  2. Locate the Model serving platforms section.

  3. Select the Multi-model serving platform checkbox.

  4. Click Save changes.

Adding a custom model-serving runtime for the multi-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. By default, the multi-model serving platform includes the OpenVINO Model Server runtime. You can also add your own custom runtime if the default runtime does not meet your needs, such as supporting a specific model format.

As an administrator, you can use the Open Data Hub dashboard to add and enable a custom model-serving runtime. You can then choose the custom runtime when you create a new model server for the multi-model serving platform.

Note
Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.
Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

  • You are familiar with how to add a model server to your project. When you have added a custom model-serving runtime, you must configure a new model server to use the runtime.

  • You have reviewed the example runtimes in the kserve/modelmesh-serving repository. You can use these examples as starting points. However, each runtime requires some further modification before you can deploy it in Open Data Hub. The required modifications are described in the following procedure.

    Note
    Open Data Hub includes the OpenVINO Model Server runtime by default. You do not need to add this runtime to Open Data Hub.
Procedure
  1. From the Open Data Hub dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, the OpenVINO Model Server runtime), click the action menu (⋮) next to the existing runtime and then click Duplicate.

    • To add a new custom runtime, click Add serving runtime.

  3. In the Select the model serving platforms this runtime supports list, select Multi-model serving platform.

    Note
    The multi-model serving platform supports only the REST protocol. Therefore, you cannot change the default value in the Select the API protocol this runtime supports list.
  4. Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.

      2. In the file browser, select a YAML file on your computer. This file might be one of the example runtimes that you downloaded from the kserve/modelmesh-serving repository.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.

      2. Enter or paste YAML code directly in the embedded editor. The YAML that you paste might be copied from one of the example runtimes in the kserve/modelmesh-serving repository.

  5. Optional: If you are adding one of the example runtimes in the kserve/modelmesh-serving repository, perform the following modifications:

    1. In the YAML editor, locate the kind field for your runtime. Update the value of this field to ServingRuntime.

    2. In the kustomization.yaml file in the kserve/modelmesh-serving repository, take note of the newName and newTag values for the runtime that you want to add. You will specify these values in a later step.

    3. In the YAML editor for your custom runtime, locate the containers.image field.

    4. Update the value of the containers.image field in the format newName:newTag, based on the values that you previously noted in the kustomization.yaml file. The following examples show image values for several runtimes:

      Nvidia Triton Inference Server

      image: nvcr.io/nvidia/tritonserver:23.04-py3

      Seldon Python MLServer

      image: seldonio/mlserver:1.3.2

      TorchServe

      image: pytorch/torchserve:0.7.1-cpu

  6. In the metadata.name field, ensure that the value of the runtime you are adding is unique (that is, the value doesn’t match a runtime that you have already added).

  7. Optional: To configure a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: mlserver-0.x
      annotations:
        openshift.io/display-name: MLServer
    Note
    If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
  8. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime you added is automatically enabled.

  9. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification
  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Additional resources

Adding a tested and verified model-serving runtime for the multi-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes such as the NVIDIA Triton Inference Server to support your needs. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Open Data Hub.

You can use the Open Data Hub dashboard to add and enable the NVIDIA Triton Inference Server runtime and then choose the runtime when you create a new model server for the multi-model serving platform.

Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

  • You are familiar with how to add a model server to your project. After you have added a tested and verified model-serving runtime, you must configure a new model server to use the runtime.

Procedure
  1. From the Open Data Hub dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a tested and verified runtime, click Add serving runtime.

  3. In the Select the model serving platforms this runtime supports list, select Multi-model serving platform.

    Note
    The multi-model serving platform supports only the REST protocol. Therefore, you cannot change the default value in the Select the API protocol this runtime supports list.
  4. Click Start from scratch.

  5. Enter or paste the following YAML code directly in the embedded editor.

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      annotations:
        enable-route: "true"
      name: modelmesh-triton
      labels:
        opendatahub.io/dashboard: "true"
    spec:
      annotations:
        opendatahub.io/modelServingSupport: '["multi"]'
        prometheus.kserve.io/path: /metrics
        prometheus.kserve.io/port: "8002"
      builtInAdapter:
        env:
          - name: CONTAINER_MEM_REQ_BYTES
            value: "268435456"
          - name: USE_EMBEDDED_PULLER
            value: "true"
        memBufferBytes: 134217728
        modelLoadingTimeoutMillis: 90000
        runtimeManagementPort: 8001
        serverType: triton
      containers:
        - args:
            - -c
            - 'mkdir -p /models/_triton_models;  chmod 777
              /models/_triton_models;  exec
              tritonserver "--model-repository=/models/_triton_models" "--model-control-mode=explicit" "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true" "--allow-grpc=true"  '
          command:
            - /bin/sh
          image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
          name: triton
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: "1"
              memory: 2Gi
      grpcDataEndpoint: port:8001
      grpcEndpoint: port:8085
      multiModel: true
      protocolVersions:
        - grpc-v2
        - v2
      supportedModelFormats:
        - autoSelect: true
          name: onnx
          version: "1"
        - autoSelect: true
          name: pytorch
          version: "1"
        - autoSelect: true
          name: tensorflow
          version: "1"
        - autoSelect: true
          name: tensorflow
          version: "2"
        - autoSelect: true
          name: tensorrt
          version: "7"
        - autoSelect: false
          name: xgboost
          version: "1"
        - autoSelect: true
          name: python
          version: "1"
  6. In the metadata.name field, make sure that the value of the runtime that you are adding does not match a runtime that you have already added.

  7. Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: modelmesh-triton
      annotations:
        openshift.io/display-name: Triton ServingRuntime
    Note
    If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
  8. Click Create.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime you added is automatically enabled.

  9. Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification
  • The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Additional resources

Adding a model server for the multi-model serving platform

When you have enabled the multi-model serving platform, you must configure a model server to deploy models. If you require extra computing power for use with large datasets, you can assign accelerators to your model server.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you use Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have created a data science project that you can add a model server to.

  • You have enabled the multi-model serving platform.

  • If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See Adding a custom model-serving runtime.

  • If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

Procedure
  1. In the left menu of the Open Data Hub dashboard, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that you want to configure a model server for.

    A project details page opens.

  3. Click the Models tab.

  4. Perform one of the following actions:

    • If you see a Multi-model serving platform tile, click Add model server on the tile.

    • If you do not see any tiles, click the Add model server button.

    The Add model server dialog opens.

  5. In the Model server name field, enter a unique name for the model server.

  6. From the Serving runtime list, select a model-serving runtime that is installed and enabled in your Open Data Hub deployment.

    Note

    If you are using a custom model-serving runtime with your model server and want to use GPUs, you must ensure that your custom runtime supports GPUs and is appropriately configured to use them.

  7. In the Number of model replicas to deploy field, specify a value.

  8. From the Model server size list, select a value.

  9. Optional: If you selected Custom in the preceding step, configure the following settings in the Model server size section to customize your model server:

    1. In the CPUs requested field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.

    2. In the CPU limit field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.

    3. In the Memory requested field, specify the requested memory for the model server in gibibytes (Gi).

    4. In the Memory limit field, specify the maximum memory limit for the model server in gibibytes (Gi).

  10. Optional: From the Accelerator list, select an accelerator.

    1. If you selected an accelerator in the preceding step, specify the number of accelerators to use.

  11. Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.

  12. Optional: In the Token authorization section, select the Require token authentication checkbox to require token authentication for your model server. To finish configuring token authentication, perform the following actions:

    1. In the Service account name field, enter a service account name for which the token will be generated. The generated token is created and displayed in the Token secret field when the model server is configured.

    2. To add an additional service account, click Add a service account and enter another service account name.

  13. Click Add.

    • The model server that you configured appears on the Models tab for the project, in the Models and model servers list.

  14. Optional: To update the model server, click the action menu (⋮) beside the model server and select Edit model server.

Deleting a model server

When you no longer need a model server to host models, you can remove it from your data science project.

Note
When you remove a model server, you also remove the models that are hosted on that model server. As a result, the models are no longer available to applications.
Prerequisites
  • You have created a data science project and an associated model server.

  • You have notified the users of the applications that access the models that the models will no longer be available.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

Procedure
  1. From the Open Data Hub dashboard, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project from which you want to delete the model server.

    A project details page opens.

  3. Click the Models tab.

  4. Click the action menu (⋮) beside the project whose model server you want to delete and then click Delete model server.

    The Delete model server dialog opens.

  5. Enter the name of the model server in the text field to confirm that you intend to delete it.

  6. Click Delete model server.

Verification
  • The model server that you deleted is no longer displayed on the Models tab for the project.

Working with deployed models

Deploying a model by using the multi-model serving platform

You can deploy trained models on Open Data Hub to enable you to test and implement them into intelligent applications. Deploying a model makes it available as a service that you can access by using an API. This enables you to return predictions based on data inputs.

When you have enabled the multi-model serving platform, you can deploy models on the platform.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users) in OpenShift.

  • You have enabled the multi-model serving platform.

  • You have created a data science project and added a model server.

  • You have access to S3-compatible object storage.

  • For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.

Procedure
  1. In the left menu of the Open Data Hub dashboard, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that you want to deploy a model in.

    A project details page opens.

  3. Click the Models tab.

  4. Click Deploy model.

  5. Configure properties for deploying your model as follows:

    1. In the Model name field, enter a unique name for the model that you are deploying.

    2. From the Model framework list, select a framework for your model.

      Note
      The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.
    3. To specify the location of the model you want to deploy from S3-compatible object storage, perform one of the following sets of actions:

      • To use an existing connection

        1. Select Existing connection.

        2. From the Name list, select a connection that you previously defined.

        3. In the Path field, enter the folder path that contains the model in your specified data source.

      • To use a new connection

        1. To define a new connection that your model can access, select New connection.

        2. In the Name field, enter a unique name for the connection.

        3. In the Access key field, enter the access key ID for the S3-compatible object storage provider.

        4. In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.

        5. In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.

        6. In the Region field, enter the default region of your S3-compatible object storage account.

        7. In the Bucket field, enter the name of your S3-compatible object storage bucket.

        8. In the Path field, enter the folder path in your S3-compatible object storage that contains your data file.

    4. Click Deploy.

Verification
  • Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.
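
You can also check the deployment from the command line. The following command is a sketch that assumes you have the OpenShift CLI installed and access to the project; the project name is a placeholder. Each model deployed from the dashboard is represented by an InferenceService resource that reports a readiness status:

  oc get inferenceservice -n <project_name>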

Additional resources

Viewing a deployed model

To analyze the results of your work, you can view a list of deployed models on Open Data Hub. You can also view the current statuses of deployed models and their endpoints.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

    The Deployed models page opens.

    For each model, the page shows details such as the model name, the project in which the model is deployed, the model-serving runtime that the model uses, and the deployment status.

  2. Optional: For a given model, click the link in the Inference endpoint column to see the inference endpoints for the deployed model.

Verification
  • A list of previously deployed data science models is displayed on the Deployed models page.

Additional resources

Updating the deployment properties of a deployed model

You can update the deployment properties of a model that has been deployed previously. For example, you can change the model’s connection and name.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model on Open Data Hub.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

    The Deployed models page opens.

  2. Click the action menu (⋮) beside the model whose deployment properties you want to update and click Edit.

    The Edit model dialog opens.

  3. Update the deployment properties of the model as follows:

    1. In the Model name field, enter a new, unique name for your model.

    2. From the Model servers list, select a model server for your model.

    3. From the Model framework list, select a framework for your model.

      Note
      The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.
    4. To update how you have specified the location of your model, perform one of the following sets of actions:

      • If you previously specified an existing connection

        1. In the Path field, update the folder path that contains the model in your specified data source.

      • If you previously specified a new connection

        1. In the Name field, enter a new, unique name for the connection.

        2. In the Access key field, update the access key ID for the S3-compatible object storage provider.

        3. In the Secret key field, update the secret access key for the S3-compatible object storage account that you specified.

        4. In the Endpoint field, update the endpoint of your S3-compatible object storage bucket.

        5. In the Region field, update the default region of your S3-compatible object storage account.

        6. In the Bucket field, update the name of your S3-compatible object storage bucket.

        7. In the Path field, update the folder path in your S3-compatible object storage that contains your data file.

    5. Click Redeploy.

Verification
  • The model whose deployment properties you updated is displayed on the Model Serving page of the dashboard.

Deleting a deployed model

You can delete models you have previously deployed. This enables you to remove deployed models that are no longer required.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

    The Deployed models page opens.

  2. Click the action menu (⋮) beside the deployed model that you want to delete and click Delete.

    The Delete deployed model dialog opens.

  3. Enter the name of the deployed model in the text field to confirm that you intend to delete it.

  4. Click Delete deployed model.

Verification
  • The model that you deleted is no longer displayed on the Deployed models page.

Configuring monitoring for the multi-model serving platform

The multi-model serving platform includes model and model server metrics for the ModelMesh component. ModelMesh generates its own set of metrics and does not rely on the underlying model-serving runtimes to provide them. The set of metrics that ModelMesh generates includes metrics for model request rates and timings, model loading and unloading rates, times and sizes, internal queuing delays, capacity and usage, cache state, and least recently-used models. For more information, see ModelMesh metrics.

After you have configured monitoring, you can view metrics for the ModelMesh component.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.

  • You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.

  • You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.

  • You have assigned the monitoring-rules-view role to users that will monitor metrics.

Procedure
  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        prometheus:
          logLevel: debug
          retention: 15d

    The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.

  3. Apply the configuration to create the user-workload-monitoring-config object.

    $ oc apply -f uwm-cm-conf.yaml
  4. Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true

    The cluster-monitoring-config object enables monitoring for user-defined projects.

  5. Apply the configuration to create the cluster-monitoring-config object.

    $ oc apply -f uwm-cm-enable.yaml
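
    To confirm that monitoring for user-defined projects is running, you can check the pods in the openshift-user-workload-monitoring namespace. For example:

    $ oc get pods -n openshift-user-workload-monitoring

    Pods with names such as prometheus-user-workload-0 are expected to be in the Running state.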

Viewing model-serving runtime metrics for the multi-model serving platform

After a cluster administrator has configured monitoring for the multi-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the ModelMesh component.

Procedure
  1. Log in to the OpenShift Container Platform web console.

  2. Switch to the Developer perspective.

  3. In the left menu, click Observe.

  4. As described in Monitoring your project metrics, use the web console to run queries for modelmesh_* metrics.
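
    For example, the following query, entered in the Observe > Metrics query field, returns all series whose names start with the modelmesh_ prefix; you can then narrow the query to the specific metric that interests you:

    {__name__=~"modelmesh_.*"}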

Monitoring model performance

In the multi-model serving platform, you can view performance metrics for all models deployed on a model server and for a specific model that is deployed on the model server.

Viewing performance metrics for all models on a model server

You can monitor the following metrics for all the models that are deployed on a model server:

  • HTTP requests per 5 minutes - The number of HTTP requests that have failed or succeeded for all models on the server.

  • Average response time (ms) - For all models on the server, the average time it takes the model server to respond to requests.

  • CPU utilization (%) - The percentage of the CPU’s capacity that is currently being used by all models on the server.

  • Memory utilization (%) - The percentage of the system’s memory that is currently being used by all models on the server.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the models are performing at a specified time.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed models on the multi-model serving platform.

Procedure
  1. From the Open Data Hub dashboard navigation menu, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that contains the data science models that you want to monitor.

  3. In the project details page, click the Models tab.

  4. In the row for the model server that you are interested in, click the action menu (⋮) and then select View model server metrics.

  5. Optional: On the metrics page for the model server, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.

    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.

  6. Scroll down to view data graphs for HTTP requests per 5 minutes, average response time, CPU utilization, and memory utilization.

Verification

On the metrics page for the model server, the graphs provide data on performance metrics.

Viewing HTTP request metrics for a deployed model

You can view a graph that illustrates the HTTP requests that have failed or succeeded for a specific model that is deployed on the multi-model serving platform.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • The following dashboard configuration options are set to the default values as shown:

    disablePerformanceMetrics: false
    disableKServeMetrics: false

    For more information, see Dashboard configuration options.

  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed models on the multi-model serving platform.

Procedure
  1. From the Open Data Hub dashboard navigation menu, select Model Serving.

  2. On the Deployed models page, select the model that you are interested in.

  3. Optional: On the Endpoint performance tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.

    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.

Verification

The Endpoint performance tab shows a graph of the HTTP metrics for the model.

Serving large models

For deploying large models such as large language models (LLMs), Open Data Hub includes a single-model serving platform that is based on the KServe component. Because each model is deployed from its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require more resources.

About the single-model serving platform

The single-model serving platform consists of the following components:

  • KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. It includes model-serving runtimes that implement the loading of given types of model servers. KServe handles the lifecycle of the deployment object, storage access, and networking setup.

  • Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.

To install the single-model serving platform, you have the following options:

Automated installation

If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Open Data Hub Operator to install KServe and configure its dependencies.

Manual installation

If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Open Data Hub Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually.

When you have installed KServe, you can use the Open Data Hub dashboard to deploy models using preinstalled or custom model-serving runtimes.

Open Data Hub includes preinstalled runtimes for KServe. For more information, see Supported model-serving runtimes.

You can also configure monitoring for the single-model serving platform and use Prometheus to scrape the available metrics.

About KServe deployment modes

By default, you can deploy models on the single-model serving platform with KServe by using Red Hat OpenShift Serverless, which is a cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project. In addition, serverless mode is dependent on the Red Hat OpenShift Serverless Operator.

Alternatively, you can use raw deployment mode, which is not dependent on the Red Hat OpenShift Serverless Operator. With raw deployment mode, you can deploy models with Kubernetes resources, such as Deployment, Service, Ingress, and Horizontal Pod Autoscaler.

Important

Deploying a machine learning model using KServe raw deployment mode is a Limited Availability feature. Limited Availability means that you can install and receive support for the feature only with specific approval from the Red Hat AI Business Unit. Without such approval, the feature is unsupported. In addition, this feature is only supported on Self-Managed deployments of single node OpenShift.

There are both advantages and disadvantages to using each of these deployment modes:

Serverless mode

Advantages:

  • Enables autoscaling based on request volume:

    • Resources scale up automatically when receiving incoming requests.

    • Optimizes resource usage and maintains performance during peak times.

  • Supports scale down to and from zero using Knative:

    • Allows resources to scale down completely when there are no incoming requests.

    • Saves costs by not running idle resources.

Disadvantages:

  • Has customization limitations:

    • Customization is limited to what Knative supports; for example, Knative cannot mount multiple volumes.

  • Dependency on Knative for scaling:

    • Introduces additional complexity in setup and management compared to traditional scaling methods.

Raw deployment mode

Advantages:

  • Enables deployment with Kubernetes resources, such as Deployment, Service, Ingress, and Horizontal Pod Autoscaler:

    • Provides full control over Kubernetes resources, allowing for detailed customization and configuration of deployment settings.

  • Avoids Knative limitations, such as being unable to mount multiple volumes:

    • Beneficial for applications requiring complex configurations or multiple storage mounts.

Disadvantages:

  • Does not support automatic scaling:

    • Does not support automatic scaling down to zero resources when idle.

    • Might result in higher costs during periods of low traffic.

  • Requires manual management of scaling.
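
If you create InferenceService resources yourself rather than deploying through the dashboard, KServe lets you select the deployment mode per model with an annotation. The following is a minimal sketch that assumes the serving.kserve.io/deploymentMode annotation and uses placeholder names for the model, runtime, model format, and storage location:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: example-model                              # placeholder model name
    annotations:
      serving.kserve.io/deploymentMode: RawDeployment
  spec:
    predictor:
      model:
        modelFormat:
          name: onnx                                 # placeholder model format
        runtime: example-runtime                     # placeholder ServingRuntime name
        storageUri: s3://<bucket_name>/<folder_path> # placeholder model location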

Installing KServe

To learn how to perform both automated and manual installation of KServe, see Installation in the caikit-tgis-serving repository.

Deploying models by using the single-model serving platform

On the single-model serving platform, each model is deployed on its own model server. This helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Important

If you want to use the single-model serving platform to deploy a model from S3-compatible storage that uses a self-signed SSL certificate, you must install a certificate authority (CA) bundle on your OpenShift cluster. For more information, see Understanding certificates in Open Data Hub.

Enabling the single-model serving platform

When you have installed KServe, you can use the Open Data Hub dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

  • You have installed KServe.

  • Your cluster administrator has not edited the Open Data Hub dashboard configuration to disable the ability to select the single-model serving platform, which uses the KServe component. For more information, see Dashboard configuration options.

Procedure
  1. Enable the single-model serving platform as follows:

    1. In the left menu, click Settings > Cluster settings.

    2. Locate the Model serving platforms section.

    3. To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.

    4. Click Save changes.

  2. Enable preinstalled runtimes for the single-model serving platform as follows:

    1. In the left menu of the Open Data Hub dashboard, click Settings > Serving runtimes.

      The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.

      For more information about preinstalled runtimes, see Supported runtimes.

    2. Set the runtime that you want to use to Enabled.

      The single-model serving platform is now available for model deployments.

Adding a custom model-serving runtime for the single-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the pre-installed runtimes that are included with Open Data Hub. You can also add your own custom runtimes if the default runtimes do not meet your needs. For example, if the TGIS runtime does not support a model format that is supported by Hugging Face Text Generation Inference (TGI), you can create a custom runtime to add support for the model.

As an administrator, you can use the Open Data Hub interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.
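
The following is a minimal sketch of what a custom ServingRuntime resource for the single-model serving platform might look like. The runtime name, display name, container image, model format, and environment variable are placeholders; runtime-specific parameters typically go in the env section of the container:

  apiVersion: serving.kserve.io/v1alpha1
  kind: ServingRuntime
  metadata:
    name: custom-runtime-example                         # placeholder runtime name
    annotations:
      openshift.io/display-name: Custom runtime example
    labels:
      opendatahub.io/dashboard: "true"
  spec:
    supportedModelFormats:
      - name: custom-format                              # placeholder model format
        autoSelect: true
    containers:
      - name: kserve-container
        image: quay.io/<user_name>/<runtime_image>:<tag>  # your custom runtime image
        env:
          - name: EXAMPLE_PARAMETER                      # placeholder runtime-specific parameter
            value: "example-value"
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "1"
            memory: 2Gi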

Note
Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.
Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

  • You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure
  1. From the Open Data Hub dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, TGIS Standalone ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.

    • To add a new custom runtime, click Add serving runtime.

  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.

  4. In the Select the API protocol this runtime supports list, select REST or gRPC.

  5. Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.

      2. In the file browser, select a YAML file on your computer.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.

      2. Enter or paste YAML code directly in the embedded editor.

    Note
    In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.
  6. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  7. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification
  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Adding a tested and verified model-serving runtime for the single-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes such as the NVIDIA Triton Inference Server to support your needs. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Open Data Hub.

You can use the Open Data Hub dashboard to add and enable the NVIDIA Triton Inference Server runtime for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.

Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

Procedure
  1. From the Open Data Hub dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. Click Add serving runtime.

  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.

  4. In the Select the API protocol this runtime supports list, select REST or gRPC.

  5. Click Start from scratch.

    1. If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-rest
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
            ports:
              - containerPort: 8080
                protocol: TCP
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
    2. If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: triton-kserve-grpc
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        annotations:
          prometheus.kserve.io/path: /metrics
          prometheus.kserve.io/port: "8002"
        containers:
          - args:
              - tritonserver
              - --model-store=/mnt/models
              - --grpc-port=9000
              - --http-port=8080
              - --allow-grpc=true
              - --allow-http=true
            image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
            name: kserve-container
            ports:
              - containerPort: 9000
                name: h2c
                protocol: TCP
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 2Gi
        protocolVersions:
          - v2
          - grpc-v2
        supportedModelFormats:
          - autoSelect: true
            name: tensorrt
            version: "8"
          - autoSelect: true
            name: tensorflow
            version: "1"
          - autoSelect: true
            name: tensorflow
            version: "2"
          - autoSelect: true
            name: onnx
            version: "1"
          - name: pytorch
            version: "1"
          - autoSelect: true
            name: triton
            version: "2"
          - autoSelect: true
            name: xgboost
            version: "1"
          - autoSelect: true
            name: python
            version: "1"
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
  6. In the metadata.name field, make sure that the value of the runtime that you are adding does not match a runtime that you have already added.

  7. Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: kserve-triton
      annotations:
        openshift.io/display-name: Triton ServingRuntime
    Note
    If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
  8. Click Create.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  9. Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification
  • The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Deploying models on the single-model serving platform

When you have enabled the single-model serving platform, you can enable a pre-installed or custom model-serving runtime and start to deploy models on the platform.

Note
Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of Open Data Hub, support might be added in a future version. In the meantime, you can also add your own, custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have installed KServe.

  • You have enabled the single-model serving platform.

  • To enable token authorization and external model routes for deployed models, you have added Authorino as an authorization provider.

  • You have created a data science project.

  • You have access to S3-compatible object storage.

  • For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.

  • To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.

  • To use the vLLM ServingRuntime for KServe runtime or use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

  • To use the vLLM ServingRuntime with Gaudi accelerators support for KServe runtime, you have enabled support for hybrid processing units (HPUs) in Open Data Hub. This includes installing the Intel Gaudi Base Operator and configuring an accelerator profile. For more information, see Setting up Gaudi for OpenShift and Working with accelerators.

  • To deploy RHEL AI models:

    • You have enabled the vLLM ServingRuntime for KServe runtime.

    • You have downloaded the model from the Red Hat container registry and uploaded it to S3-compatible object storage.

Procedure
  1. In the left menu, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that you want to deploy a model in.

    A project details page opens.

  3. Click the Models tab.

  4. Perform one of the following actions:

    • If you see a Single-model serving platform tile, click Deploy model on the tile.

    • If you do not see any tiles, click the Deploy model button.

    The Deploy model dialog opens.

  5. In the Model deployment name field, enter a unique name for the model that you are deploying.

  6. In the Serving runtime field, select an enabled runtime.

  7. From the Model framework (name - version) list, select a value.

  8. In the Number of model server replicas to deploy field, specify a value.

  9. From the Model server size list, select a value.

  10. The following options are only available if you have enabled accelerator support on your cluster and created an accelerator profile:

    1. From the Accelerator list, select an accelerator.

    2. If you selected an accelerator in the preceding step, specify the number of accelerators to use in the Number of accelerators field.

  11. Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.

  12. To require token authorization for inference requests to the deployed model, perform the following actions:

    1. Select Require token authorization.

    2. In the Service account name field, enter the service account name that the token will be generated for.

  13. To specify the location of your model, perform one of the following sets of actions:

    • To use an existing connection

      1. Select Existing connection.

      2. From the Name list, select a connection that you previously defined.

      3. In the Path field, enter the folder path that contains the model in your specified data source.

    • To use a new connection

      1. To define a new connection that your model can access, select New connection.

      2. In the Name field, enter a unique name for the connection.

      3. In the Access key field, enter the access key ID for your S3-compatible object storage provider.

      4. In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.

      5. In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.

      6. In the Region field, enter the default region of your S3-compatible object storage account.

      7. In the Bucket field, enter the name of your S3-compatible object storage bucket.

      8. In the Path field, enter the folder path in your S3-compatible object storage that contains your data file.

  14. Optional: Customize the runtime parameters in the Configuration parameters section:

    1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.

    2. Modify the values in Additional environment variables to define variables in the model’s environment.

      The Configuration parameters section shows predefined serving runtime parameters, if any are available.

      Note
      Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
  15. Click Deploy.

Verification
  • Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.
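
When the status shows as ready, you can send inference requests to the endpoint displayed for the deployment. The following request is a sketch that assumes a runtime exposing the REST v2 inference protocol and a deployment with token authorization enabled; the endpoint, model name, token, and input tensor are placeholders:

  curl -s -X POST \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [0.1, 0.2, 0.3, 0.4]}]}' \
    https://<inference_endpoint>/v2/models/<model_name>/infer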

Customizing the parameters of a deployed model-serving runtime

You might need additional parameters beyond the default ones to deploy specific models or to enhance an existing model deployment. In such cases, you can modify the parameters of an existing runtime to suit your deployment needs.

Note
Customizing the parameters of a runtime only affects the selected model deployment.
Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

  • You have deployed a model on the single-model serving platform.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving in the left menu.

    The Deployed models page opens.

  2. Click the action menu (⋮) next to the name of the model you want to customize and select Edit.

    The Configuration parameters section shows predefined serving runtime parameters, if any are available.

  3. Customize the runtime parameters in the Configuration parameters section:

    1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.

    2. Modify the values in Additional environment variables to define variables in the model’s environment.

      Note
      Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
  4. After you are done customizing the runtime parameters, click Redeploy to save and deploy the model with your changes.

Verification
  • Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.

  • Confirm that the arguments and variables that you set appear in spec.predictor.model.args and spec.predictor.model.env by one of the following methods:

    • Checking the InferenceService YAML from the OpenShift Container Platform Console.

    • Using the following command in the OpenShift Container Platform CLI:

      oc get -o json inferenceservice <inferenceservicename/modelname> -n <projectname>
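
    In the InferenceService, the customized values appear in a form similar to the following sketch; the argument and variable names shown here are placeholders:

      spec:
        predictor:
          model:
            args:
              - --example-argument=value    # placeholder serving runtime argument
            env:
              - name: EXAMPLE_VARIABLE      # placeholder environment variable
                value: example-value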

Customizable model serving runtime parameters

You can modify the parameters of an existing model serving runtime to suit your deployment needs.

For more information about the parameters for each supported serving runtime, see the following resources:

  • NVIDIA Triton Inference Server

    • NVIDIA Triton Inference Server: Model Parameters

  • Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe

    • Caikit NLP: Configuration

    • TGIS: Model configuration

  • Caikit Standalone ServingRuntime for KServe

    • Caikit NLP: Configuration

  • OpenVINO Model Server

    • OpenVINO Model Server Features: Dynamic Input Parameters

  • Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe

    • TGIS: Model configuration

  • vLLM ServingRuntime for KServe

    • vLLM: Engine arguments

    • OpenAI Compatible Server

Using OCI containers for model storage

As an alternative to storing a model in an S3 bucket or URI, you can upload models to OCI containers. Using OCI containers for model storage can help you:

  • Reduce startup times by avoiding downloading the same model multiple times.

  • Reduce disk space usage by reducing the number of models downloaded locally.

  • Improve model performance because model images can be pre-fetched onto cluster nodes.

This guide shows you how to manually deploy a MobileNet v2-7 model in ONNX format, stored in an OCI image, on an OpenVINO model server.

Prerequisites
  • You have a model in the ONNX format.

Creating an OCI image and storing a model in the container image
Procedure
  1. From your local machine, create a temporary directory to store both the downloaded model and support files to create the OCI image:

    cd $(mktemp -d)
  2. Create a models folder inside the temporary directory and download your model:

    mkdir -p models/1
    
    DOWNLOAD_URL=https://github.com/onnx/models/raw/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
    curl -L $DOWNLOAD_URL -O --output-dir models/1/
    Note

    The subdirectory 1 is used because OpenVINO requires numbered subdirectories for model versioning. If you are not using OpenVINO, you do not need to create the 1 subdirectory to use OCI container images.

  3. Create a container build file named Containerfile with the following contents:

    FROM registry.access.redhat.com/ubi9/ubi-micro:latest
    COPY --chown=0:0 models /models
    RUN chmod -R a=rX /models
    
    # nobody user
    USER 65534
    Note
    • In this example, the ubi9/ubi-micro image is used as the base container image. You cannot use empty images that do not provide a shell, such as scratch, because KServe uses the shell to ensure that the model files are accessible to the model server.

    • The copied model files are owned by the root group and given read permissions. Because OpenShift runs containers with a random user ID and the root group ID, granting group ownership ensures that the model server can access the model files.

  4. Use the tree command to confirm that the model files follow this directory structure:

    tree
    
    .
    ├── Containerfile
    └── models
        └── 1
            └── mobilenetv2-7.onnx
  5. Create the OCI container image with Podman, and upload it to a registry. For example, using Quay as the registry:

    podman build --format=oci -t quay.io/<user_name>/<repository_name>:<tag_name> .
    podman push quay.io/<user_name>/<repository_name>:<tag_name>
    Note

    If your repository is private, ensure you are authenticated to the registry before uploading your container image.
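
    For example, you might log in to Quay, which prompts for your credentials, before pushing the image:

    podman login quay.io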

Deploying a model stored in an OCI image from a public repository
Note

By default in KServe, models are exposed outside the cluster and not protected with authorization.

  1. Create a namespace to deploy the model:

    oc new-project oci-model-example
  2. Use the kserve-ovms template from the opendatahub namespace to create a ServingRuntime resource that configures the OpenVINO model server in the new namespace:

    oc process -n opendatahub -o yaml kserve-ovms | oc apply -f -
    1. Verify that the ServingRuntime has been created with the kserve-ovms name:

      oc get servingruntimes
      
      NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
      kserve-ovms              openvino_ir   kserve-container   1m
  3. Create an InferenceService YAML resource with the following values:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sample-isvc-using-oci
    spec:
      predictor:
        model:
          runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
          modelFormat:
            name: onnx
          storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
    Important

    The ServingRuntime and InferenceService configurations do not set any resource limits.
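
    If you need to constrain the model server, a minimal sketch of adding resource requests and limits under spec.predictor.model in the InferenceService is shown below; the values are illustrative and should be sized for your model:

          resources:
            requests:
              cpu: "1"
              memory: 4Gi
            limits:
              cpu: "2"
              memory: 8Gi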

Verification

After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field. Check the status of the deployment with the following command:

oc get inferenceservice

NAME                    URL                                                       READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
sample-isvc-using-oci   https://sample-isvc-using-oci-oci-model-example.example   True           100                              sample-isvc-using-oci-predictor-00001   1m
Deploying a model stored in an OCI image from a private repository

To deploy a model stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
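
A minimal sketch of creating such a pull secret with the oc CLI is shown below; the registry server, credentials, and secret name are placeholders:

oc create secret docker-registry <pull-secret-name> \
  --docker-server=quay.io \
  --docker-username=<user_name> \
  --docker-password=<password> \
  -n oci-model-example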

  1. Follow the steps in the previous section for deploying a model. However, when creating the InferenceService in step 3, specify your pull secret in the spec.predictor.imagePullSecrets field:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sample-isvc-using-private-oci
    spec:
      predictor:
        model:
          runtime: kserve-ovms
          modelFormat:
            name: onnx
          storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
        imagePullSecrets: # Specify image pull secrets to use for fetching container images (including OCI model images)
        - name: <pull-secret-name>
Additional resources

Accessing the inference endpoint for a deployed model

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model by using the single-model serving platform.

  • If you enabled token authorization for your deployed model, you have the associated token value.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

    The inference endpoint for the model is shown in the Inference endpoint field.

  2. Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.

  3. Use the endpoint to make API requests to your deployed model.
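
    For example, a completion request to a model served with the vLLM runtime might look similar to the following; the endpoint URL, model name, and token are placeholders, and other runtimes use the paths listed in Inference endpoints:

    curl -ks https://<inference_endpoint_url>:443/v1/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <token>" \
      -d '{"model": "<model_name>", "prompt": "<text>", "max_tokens": 50}'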

Deploying models using multiple GPU nodes

Deploying models by using multiple graphics processing units (GPUs) across multiple nodes can help when you deploy large models such as large language models (LLMs).

This procedure shows you how to serve models on Open Data Hub across multiple GPU nodes using the vLLM serving framework. Multi-node inferencing uses the vllm-multinode-runtime ServingRuntime as a custom runtime. The vllm-multinode-runtime ServingRuntime uses the same vLLM image as the vllm-runtime and also includes information necessary for multi-GPU inferencing.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You have downloaded and installed the OpenShift Container Platform command-line interface (CLI). See Installing the OpenShift CLI.

  • You have enabled the operators for your GPU type, such as the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information about enabling accelerators, see Working with accelerators.

    • You are using an NVIDIA GPU (nvidia.com/gpu).

    • Specify the GPU type through either the ServingRuntime or the InferenceService. If the GPU type specified in the ServingRuntime differs from the GPU type set in the InferenceService, both GPU types are assigned to the resource, which can cause errors.

  • You have enabled KServe on your cluster.

  • You have only 1 head pod in your setup. Do not adjust the replica count using the min_replicas or max_replicas settings in the InferenceService. Creating additional head pods can cause them to be excluded from the Ray cluster.

  • You have a persistent volume claim (PVC) set up and configured for ReadWriteMany (RWX) access mode.

Procedure
  1. In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift Container Platform CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Select or create a namespace where you would like to deploy the model. In this example, kserve-demo is used and can be created using the following command:

    oc new-project kserve-demo
  3. In the namespace where you want to deploy the model, create a PVC for model storage and specify the name of your storage class. The storage class must provide file storage.

    Note
    If you have already configured a PVC, you can skip this step.
    oc apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: granite-8b-code-base-pvc
    spec:
      accessModes:
        - ReadWriteMany
      volumeMode: Filesystem
      resources:
        requests:
          storage: 50Gi
      storageClassName: <fileStorageClassName>
    EOF
  4. Download the model to the PVC. For example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: download-granite-8b-code
      labels:
        name: download-granite-8b-code
    spec:
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: granite-8b-code-base-pvc
      restartPolicy: Never
      initContainers:
        - name: fix-volume-permissions
          image: quay.io/quay/busybox@sha256:xxxxx
          command: ["sh"]
          args: ["-c", "mkdir -p /mnt/models/granite-8b-code-base && chmod -R 777 /mnt/models"]
          volumeMounts:
            - mountPath: "/mnt/models/"
              name: model-volume
      containers:
        - resources:
            requests:
              memory: 40Gi
          name: download-model
          imagePullPolicy: IfNotPresent
          image: quay.io/modh/kserve-storage-initializer@sha256:xxxxx
          args:
            - 's3://<bucket_name>/granite-8b-code-base/'
            - /mnt/models/granite-8b-code-base
          env:
            - name: AWS_ACCESS_KEY_ID
              value: <id>
            - name: AWS_SECRET_ACCESS_KEY
              value: <secret>
            - name: BUCKET_NAME
              value: <bucket_name>
            - name: S3_USE_HTTPS
              value: "1"
            - name: AWS_ENDPOINT_URL
              value: <AWS endpoint>
            - name: awsAnonymousCredential
              value: 'false'
            - name: AWS_DEFAULT_REGION
              value: <region>
          volumeMounts:
            - mountPath: "/mnt/models/"
              name: model-volume
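
    The Pod manifest shown here is an example; to run the download, you might save it to a file, apply it, and follow the pod logs until the copy completes. The file name is illustrative:

    oc apply -f download-granite-8b-code.yaml -n kserve-demo
    oc logs -f pod/download-granite-8b-code -n kserve-demo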
  5. Create the vllm-multinode-runtime ServingRuntime as a custom runtime:

    oc process vllm-multinode-runtime-template -n opendatahub | oc apply -f -
  6. Deploy the model using the following InferenceService configuration:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        serving.kserve.io/deploymentMode: RawDeployment
        serving.kserve.io/autoscalerClass: external
      name: granite-8b-code-base-pvc
    spec:
      predictor:
        model:
          modelFormat:
            name: vLLM
          runtime: vllm-multinode-runtime
          storageUri: pvc://granite-8b-code-base-pvc/hf/8b_instruction_tuned
        workerSpec: {}

    The following configuration can be added to the InferenceService:

    • workerSpec.tensorParallelSize: Determines how many GPUs are used per node. The GPU count of the specified GPU type in both the head and worker node deployment resources is updated automatically. Ensure that the value of workerSpec.tensorParallelSize is at least 1.

    • workerSpec.pipelineParallelSize: Determines how many nodes are involved in the deployment. This variable represents the total number of nodes, including both the head and worker nodes. Ensure that the value of workerSpec.pipelineParallelSize is at least 2.
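
    For example, to use one GPU per node across two nodes (one head and one worker), you can replace workerSpec: {} in the InferenceService shown above with values similar to the following; the numbers are illustrative:

      workerSpec:
        tensorParallelSize: 1
        pipelineParallelSize: 2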

Verification

To confirm that you have set up your environment to deploy models on multiple GPU nodes, check the GPU resource status, the InferenceService status, and the Ray cluster status, and then send a request to the model.

  • Check the GPU resource status:

    • Retrieve the pod names for the head and worker nodes:

      # Get the pod names for the head and worker nodes
      podName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor --no-headers | cut -d' ' -f1)
      workerPodName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor-worker --no-headers | cut -d' ' -f1)
      
      oc wait --for=condition=ready pod/${podName} --timeout=300s
      # Check the GPU memory size for both the head and worker pods
      echo "### HEAD NODE GPU Memory Size"
      oc exec $podName -- nvidia-smi
      echo "### Worker NODE GPU Memory Size"
      oc exec $workerPodName -- nvidia-smi
      Sample response
      +-----------------------------------------------------------------------------------------+
      | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
      |-----------------------------------------+------------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
      |                                         |                        |               MIG M. |
      |=========================================+========================+======================|
      |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
      |  0%   33C    P0             71W /  300W |19031MiB /  23028MiB <1>|      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
               ...
      +-----------------------------------------------------------------------------------------+
      | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
      |-----------------------------------------+------------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
      |                                         |                        |               MIG M. |
      |=========================================+========================+======================|
      |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
      |  0%   30C    P0             69W /  300W |18959MiB /  23028MiB <2>|      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+

      Confirm that the model loaded properly by checking the values of <1> and <2>. If the model did not load, the value of these fields is 0MiB.

  • Verify the status of your InferenceService by using the following commands:

    oc wait --for=condition=ready pod/${podName} -n kserve-demo --timeout=300s
    export MODEL_NAME=granite-8b-code-base-pvc
    
    oc get inferenceservice $MODEL_NAME -n kserve-demo
    Sample response
    NAME                       URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
    granite-8b-code-base-pvc   http://granite-8b-code-base-pvc.default.example.com
  • Send a request to the model to confirm that the model is available for inference:

    oc wait --for=condition=ready pod/${podName} -n kserve-demo --timeout=300s
    
    oc port-forward $podName 8080:8080 &
    
    curl http://localhost:8080/v1/completions \
           -H "Content-Type: application/json" \
           -d '{
                "model": "'"${MODEL_NAME}"'",
                "prompt": "At what temperature does Nitrogen boil?",
                "max_tokens": 100,
                "temperature": 0
            }'

Configuring monitoring for the single-model serving platform

The single-model serving platform includes metrics for supported runtimes of the KServe component. KServe does not generate its own metrics, and relies on the underlying model-serving runtimes to provide them. The set of available metrics for a deployed model depends on its model-serving runtime.

In addition to runtime metrics for KServe, you can also configure monitoring for OpenShift Service Mesh. The OpenShift Service Mesh metrics help you to understand dependencies and traffic flow between components in the mesh.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You have created OpenShift Service Mesh and Knative Serving instances and installed KServe.

  • You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.

  • You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.

  • You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.

  • You have assigned the monitoring-rules-view role to users that will monitor metrics.
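
    For example, you might grant the role to a user for a particular project with a command similar to the following; the user name and namespace are placeholders:

    oc policy add-role-to-user monitoring-rules-view <username> -n <namespace>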

Procedure
  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        prometheus:
          logLevel: debug
          retention: 15d

    The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.

  3. Apply the configuration to create the user-workload-monitoring-config object.

    $ oc apply -f uwm-cm-conf.yaml
  4. Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true

    The cluster-monitoring-config object enables monitoring for user-defined projects.

  5. Apply the configuration to create the cluster-monitoring-config object.

    $ oc apply -f uwm-cm-enable.yaml
  6. Create ServiceMonitor and PodMonitor objects to monitor metrics in the service mesh control plane as follows:

    1. Create an istiod-monitor.yaml YAML file with the following contents:

      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: istiod-monitor
        namespace: istio-system
      spec:
        targetLabels:
        - app
        selector:
          matchLabels:
            istio: pilot
        endpoints:
        - port: http-monitoring
          interval: 30s
    2. Deploy the ServiceMonitor CR in the specified istio-system namespace.

      $ oc apply -f istiod-monitor.yaml

      You see the following output:

      servicemonitor.monitoring.coreos.com/istiod-monitor created
    3. Create an istio-proxies-monitor.yaml YAML file with the following contents:

      apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        name: istio-proxies-monitor
        namespace: istio-system
      spec:
        selector:
          matchExpressions:
          - key: istio-prometheus-ignore
            operator: DoesNotExist
        podMetricsEndpoints:
        - path: /stats/prometheus
          interval: 30s
    4. Deploy the PodMonitor CR in the specified istio-system namespace.

      $ oc apply -f istio-proxies-monitor.yaml

      You see the following output:

      podmonitor.monitoring.coreos.com/istio-proxies-monitor created

Viewing model-serving runtime metrics for the single-model serving platform

When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.

Procedure
  1. Log in to the OpenShift Container Platform web console.

  2. Switch to the Developer perspective.

  3. In the left menu, click Observe.

  4. As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. Some examples are shown.

    1. The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:

      sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
      Note

      Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.

    2. The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:

      sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
    3. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:

      sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
    4. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:

      sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
Additional resources

Monitoring model performance

In the single-model serving platform, you can view performance metrics for a specific model that is deployed on the platform.

Viewing performance metrics for a deployed model

You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:

  • Number of requests - The number of requests that have failed or succeeded for a specific model.

  • Average response time (ms) - The average time it takes a specific model to respond to requests.

  • CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.

  • Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.

Prerequisites
  • You have installed Open Data Hub.

  • A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.

  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • The following dashboard configuration options are set to the default values as shown:

    disablePerformanceMetrics:false
    disableKServeMetrics:false

    For more information, see Dashboard configuration options.

  • You have deployed a model on the single-model serving platform by using a preinstalled runtime.

    Note

    Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.

Procedure
  1. From the Open Data Hub dashboard navigation menu, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that contains the data science models that you want to monitor.

  3. In the project details page, click the Models tab.

  4. Select the model that you are interested in.

  5. On the Endpoint performance tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.

    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.

  6. Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.

Verification

The Endpoint performance tab shows graphs of metrics for the model.

Optimizing model-serving runtimes

You can optionally enhance the preinstalled model-serving runtimes available in Open Data Hub to leverage additional benefits and capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation.

Enabling speculative decoding and multi-modal inferencing

You can configure the vLLM ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

The following procedure describes customizing the vLLM ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.

Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

  • If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure
  1. Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.

  2. In the Serving runtime field, select the vLLM ServingRuntime for KServe runtime.

  3. To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --speculative-model=[ngram]
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
    --use-v2-block-manager
    1. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.

      Note

      Inferencing throughput varies depending on the model used for speculating with n-grams.

  4. To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --port=8080
    --served-model-name={{.Name}}
    --distributed-executor-backend=mp
    --model=/mnt/models/<path_to_original_model>
    --speculative-model=/mnt/models/<path_to_speculative_model>
    --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
    --use-v2-block-manager
    1. Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.

    2. Replace <NUM_SPECULATIVE_TOKENS> with your own value.

  5. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --trust-remote-code
    Note

    Only use the --trust-remote-code argument with models from trusted sources.

  6. Click Deploy.

Verification
  • If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{"model": "<model_name>",
         "messages": [{"role": "<role>", "content": "<text>"}]
        }'
  • If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

    curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{"model":"<model_name>",
         "messages":
            [{"role":"<role>",
              "content":
                 [{"type":"text", "text":"<text>"
                  },
                  {"type":"image_url", "image_url":"<image_url_link>"
                  }
                 ]
             }
            ]
        }'

Performance tuning on the single-model serving platform

Certain performance issues might require you to tune the parameters of your inference service or model-serving runtime.

Resolving CUDA out-of-memory errors

In certain cases, depending on the model and hardware accelerator used, the TGIS memory auto-tuning algorithm might underestimate the amount of GPU memory needed to process long sequences. This miscalculation can lead to Compute Unified Architecture (CUDA) out-of-memory (OOM) error responses from the model server. In such cases, you must update or add additional parameters in the TGIS model-serving runtime, as described in the following procedure.

Prerequisites
  • You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

Procedure
  1. From the Open Data Hub dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. Based on the runtime that you used to deploy your model, perform one of the following actions:

    • If you used the pre-installed TGIS Standalone ServingRuntime for KServe runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the pre-installed TGIS runtime, see Adding a custom model-serving runtime for the single-model serving platform.

    • If you were already using a custom TGIS runtime, click the action menu (⋮) next to the runtime and select Edit.

      The embedded YAML editor opens and shows the contents of the custom model-serving runtime.

  3. Add or update the BATCH_SAFETY_MARGIN environment variable and set the value to 30. Similarly, add or update the ESTIMATE_MEMORY_BATCH_SIZE environment variable and set the value to 8.

    spec:
      containers:
        - env:
            - name: BATCH_SAFETY_MARGIN
              value: "30"
            - name: ESTIMATE_MEMORY_BATCH_SIZE
              value: "8"
    Note

    The BATCH_SAFETY_MARGIN parameter sets a percentage of free GPU memory to hold back as a safety margin to avoid OOM conditions. The default value of BATCH_SAFETY_MARGIN is 20. The ESTIMATE_MEMORY_BATCH_SIZE parameter sets the batch size used in the memory auto-tuning algorithm. The default value of ESTIMATE_MEMORY_BATCH_SIZE is 16.

  4. Click Update.

    The Serving runtimes page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.

  5. To redeploy the model for the parameter updates to take effect, perform the following actions:

    1. From the Open Data Hub dashboard, click Model Serving > Deployed Models.

    2. Find the model you want to redeploy, click the action menu (⋮) next to the model, and select Delete.

    3. Redeploy the model as described in Deploying models on the single-model serving platform.

Verification
  • You receive successful responses from the model server and no longer see CUDA OOM errors.

Supported model-serving runtimes

Open Data Hub includes several preinstalled model-serving runtimes. You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. You can also add a custom runtime to support a model.

Table 1. Model-serving runtimes
Name Description Exported model format

Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe (1)

A composite runtime for serving models in the Caikit format

Caikit Text Generation

Caikit Standalone ServingRuntime for KServe (2)

A runtime for serving models in the Caikit embeddings format for embeddings tasks

Caikit Embeddings

OpenVINO Model Server

A scalable, high-performance runtime for serving models that are optimized for Intel architectures

PyTorch, TensorFlow, OpenVINO IR, PaddlePaddle, MXNet, Caffe, Kaldi

Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe (3)

A runtime for serving TGI-enabled models

PyTorch Model Formats

vLLM ServingRuntime for KServe

A high-throughput and memory-efficient inference and serving runtime for large language models

Supported models

vLLM ServingRuntime with Gaudi accelerators support for KServe

A high-throughput and memory-efficient inference and serving runtime that supports Intel Gaudi accelerators

Supported models

  1. The composite Caikit-TGIS runtime is based on Caikit and Text Generation Inference Server (TGIS). To use this runtime, you must convert your models to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.

  2. The Caikit Standalone runtime is based on Caikit NLP. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see Tests for text embedding module.

  3. Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model is incompatible in the current version of Open Data Hub, support might be added in a future version. In the meantime, you can also add your own custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

Table 2. Deployment requirements
Name Default protocol Additional protocol Model mesh support Single node OpenShift support Deployment mode

Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe

REST

gRPC

No

Yes

Raw and serverless

Caikit Standalone ServingRuntime for KServe

REST

gRPC

No

Yes

Raw and serverless

OpenVINO Model Server

REST

None

Yes

Yes

Raw and serverless

Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe

gRPC

None

No

Yes

Raw and serverless

vLLM ServingRuntime for KServe

REST

None

No

Yes

Raw and serverless

vLLM ServingRuntime with Gaudi accelerators support for KServe

REST

None

No

Yes

Raw and serverless

Additional resources

Tested and verified model-serving runtimes

Tested and verified runtimes are community versions of model-serving runtimes that have been tested and verified against specific versions of Open Data Hub.

Red Hat tests the current version of a tested and verified runtime each time there is a new version of Open Data Hub. If a new version of a tested and verified runtime is released in the middle of an Open Data Hub release cycle, it will be tested and verified in an upcoming release.

Note

Tested and verified runtimes are not directly supported by Red Hat. You are responsible for ensuring that you are licensed to use any tested and verified runtimes that you add, and for correctly configuring and maintaining them.

Table 3. Model-serving runtimes
Name Description Exported model format

NVIDIA Triton Inference Server

Open-source inference-serving software for fast and scalable AI in applications.

TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more

Table 4. Deployment requirements
Name Default protocol Additional protocol Model mesh support Single node OpenShift support Deployment mode

NVIDIA Triton Inference Server

gRPC

REST

Yes

Yes

Raw and serverless

Additional resources

Inference endpoints

These examples show how to use inference endpoints to query the model.

Note

If you enabled token authorization when deploying the model, add the Authorization header and specify a token value.

Caikit TGIS ServingRuntime for KServe

  • :443/api/v1/task/text-generation

  • :443/api/v1/task/server-streaming-text-generation

Example command
curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' \
https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation \
-H 'Authorization: Bearer <token>'

Caikit Standalone ServingRuntime for KServe

If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.
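
For example, you might list the served models over REST; the endpoint URL and token are placeholders:

curl -ks <inference_endpoint_url>/info/models -H 'Authorization: Bearer <token>'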

REST endpoints
  • /api/v1/task/embedding

  • /api/v1/task/embedding-tasks

  • /api/v1/task/sentence-similarity

  • /api/v1/task/sentence-similarity-tasks

  • /api/v1/task/rerank

  • /api/v1/task/rerank-tasks

  • /info/models

  • /info/version

  • /info/runtime

gRPC endpoints
  • :443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict

  • :443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict

  • :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict

  • :443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict

  • :443 caikit.runtime.Nlp.NlpService/RerankTaskPredict

  • :443 caikit.runtime.Nlp.NlpService/RerankTasksPredict

  • :443 caikit.runtime.info.InfoService/GetModelsInfo

  • :443 caikit.runtime.info.InfoService/GetRuntimeInfo

Note

By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

An example manifest is available in the caikit-tgis-serving GitHub repository.

Example command

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

gRPC

grpcurl -d '{"text": "<text>"}' -H 'mm-model-id: <model_id>' <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

TGIS Standalone ServingRuntime for KServe

  • :443 fmaas.GenerationService/Generate

  • :443 fmaas.GenerationService/GenerateStream

    Note

    To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the Open Data Hub text-generation-inference repository.

Example command
grpcurl -proto text-generation-inference/proto/generation.proto -d \
'{"requests": [{"text":"<text>"}]}' \
-insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate \
-H 'Authorization: Bearer <token>'

OpenVINO Model Server

  • /v2/models/<model-name>/infer

Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
'{ "model_name": "<model_name>",
"inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' \
-H 'Authorization: Bearer <token>'

vLLM ServingRuntime for KServe

  • :443/version

  • :443/docs

  • :443/v1/models

  • :443/v1/chat/completions

  • :443/v1/completions

  • :443/v1/embeddings

  • :443/tokenize

  • :443/detokenize

    Note
    • The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.

    • To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.

    • As of vLLM v0.5.5, you must provide a chat template while querying a model using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the --chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the following example. Replace <CHAT_TEMPLATE> with the path to your template.

      containers:
        - args:
            - --chat-template=<CHAT_TEMPLATE>

      You can use the chat templates that are provided as .jinja files, or the chat templates included with the vLLM image under /apps/data/template. For more information, see Chat templates.

    As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

Example command
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
-H "Content-Type: application/json" \
-H 'Authorization: Bearer <token>' \
-d '{
"messages": [{
"role": "<role>",
"content": "<content>"
}]
}'
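
If you are querying the embeddings endpoint with a supported embeddings model, a request might look similar to the following; the model name and input text are placeholders:

curl -ks https://<inference_endpoint_url>:443/v1/embeddings \
-H "Content-Type: application/json" \
-H 'Authorization: Bearer <token>' \
-d '{"model": "<model_name>", "input": "<text>"}'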

vLLM ServingRuntime with Gaudi accelerators support for KServe

The vLLM ServingRuntime with Gaudi accelerators support for KServe uses the same inference endpoints as the vLLM ServingRuntime for KServe.

NVIDIA Triton Inference Server

REST endpoints
  • v2/models/<model_name>[/versions/<model_version>]/infer

  • v2/models/<model_name>[/versions/<model_version>]

  • v2/health/ready

  • v2/health/live

  • v2/models/<model_name>[/versions/<model_version>]/ready

  • v2

Note

ModelMesh does not support the following REST endpoints:

  • v2/health/live

  • v2/health/ready

  • v2/models/<model_name>[/versions/<model_version>]/ready

Example command
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
'{ "model_name": "<model_name>",
   "inputs":
	[{ "name": "<name_of_model_input>",
           "shape": [<shape>],
           "datatype": "<data_type>",
           "data": [<data>]
         }]}' -H 'Authorization: Bearer <token>'
gRPC endpoints
  • :443 inference.GRPCInferenceService/ModelInfer

  • :443 inference.GRPCInferenceService/ModelReady

  • :443 inference.GRPCInferenceService/ModelMetadata

  • :443 inference.GRPCInferenceService/ServerReady

  • :443 inference.GRPCInferenceService/ServerLive

  • :443 inference.GRPCInferenceService/ServerMetadata

Example command
grpcurl -cacert ./openshift_ca_istio_knative.crt \
        -proto ./grpc_predict_v2.proto \
        -d @ \
        -H "Authorization: Bearer <token>" \
        <inference_endpoint_url>:443 \
        inference.GRPCInferenceService/ModelMetadata

About the NVIDIA NIM model serving platform

You can deploy models using NVIDIA NIM inference services on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high performance AI model inferencing across clouds, data centers and workstations.

Additional resources

Enabling the NVIDIA NIM model serving platform

As an administrator, you can use the Open Data Hub dashboard to enable the NVIDIA NIM model serving platform.

Prerequisites
Procedure
  1. On the Open Data Hub home page, click Explore.

  2. On the Explore page, find the NVIDIA NIM tile.

  3. Click Enable on the application tile.

  4. Enter the NVIDIA AI Enterprise license key and then click Submit.

Verification
  • The NVIDIA NIM application that you enabled appears on the Enabled page.

Deploying models on the NVIDIA NIM model serving platform

When you have enabled the NVIDIA NIM model serving platform, you can start to deploy NVIDIA-optimized models on the platform.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have enabled the NVIDIA NIM model serving platform.

  • You have created a data science project.

Procedure
  1. In the left menu, click Data Science Projects.

    The Data Science Projects page opens.

  2. Click the name of the project that you want to deploy a model in.

    A project details page opens.

  3. Click the Models tab.

  4. Find the NVIDIA NIM model serving platform tile, then click Deploy model.

    The Deploy model dialog opens.

  5. Configure properties for deploying your model as follows:

    1. In the Model deployment name field, enter a unique name for the deployment.

    2. From the NVIDIA NIM list, select the NVIDIA NIM model that you want to deploy.

    3. In the NVIDIA NIM storage size field, specify the size of the cluster storage instance that will be created to store the NVIDIA NIM model.

    4. In the Number of model server replicas to deploy field, specify a value.

    5. From the Model server size list, select a value.

    6. From the Accelerator list, select the NVIDIA GPU accelerator.

      The Number of accelerators field appears.

    7. In the Number of accelerators field, specify the number of accelerators to use. The default value is 1.

    8. Click Deploy.

Verification
  • Confirm that the deployed model is shown on the Models tab for the project, and on the Model Serving page of the dashboard with a checkmark in the Status column.