Serving models

Table of Contents

About model serving
Serving models on the single-model serving platform
Serving models on the multi-model serving platform

About model serving

When you serve a model, you upload a trained model into Open Data Hub for querying, which allows you to integrate your trained models into intelligent applications.

You can upload a model to S3-compatible object storage, a persistent volume claim, or an Open Container Initiative (OCI) image. You can then access and train the model from your project workbench. After training the model, you can serve or deploy the model using a model-serving platform.

Serving or deploying the model makes the model available as a service, or model runtime server, that you can access using an API. You can then access the inference endpoints for the deployed model from the dashboard and see predictions based on data inputs that you provide through API calls. Querying the model through the API is also called model inferencing.

You can serve models on one of the following model-serving platforms:

Single-model serving platform
Multi-model serving platform
NVIDIA NIM model serving platform

The model-serving platform that you choose depends on your business needs:

If you want to deploy each model on its own runtime server, or want to use a serverless deployment, select the single-model serving platform. The single-model serving platform is recommended for production use.
If you want to deploy multiple models with only one runtime server, select the multi-model serving platform. This option is best if you are deploying more than 1,000 small and medium models and want to reduce resource consumption.
If you want to use NVIDIA Inference Microservices (NIMs) to deploy a model, select the NVIDIA NIM model serving platform.

Single-model serving platform

You can deploy each model from a dedicated model server on the single-model serving platform. Deploying models from a dedicated model server can help you deploy, monitor, scale, and maintain models that require increased resources. This model serving platform is ideal for serving large models. The single-model serving platform is based on the KServe component.

The single-model serving platform is helpful for use cases such as:

Large language models (LLMs)
Generative AI

Multi-model serving platform

You can deploy multiple models from the same model server on the multi-model serving platform. Each of the deployed models shares the server resources. Deploying multiple models from the same model server can be advantageous on OpenShift clusters that have finite compute resources or pods. This model serving platform is ideal for serving small and medium models in large quantities. The multi-model serving platform is based on the ModelMesh component.

NVIDIA NIM model serving platform

You can deploy models using NVIDIA Inference Microservices (NIM) on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high performance AI model inferencing across clouds, data centers and workstations.

NVIDIA NIM inference services are helpful for use cases such as:

Using GPU-accelerated containers to run inference on models optimized by NVIDIA
Deploying generative AI for virtual screening, content generation, and avatar creation

Serving models on the single-model serving platform

For deploying large models, such as large language models (LLMs), Open Data Hub includes a single-model serving platform that is based on the KServe component. Because each model is deployed from its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require more resources.

About the single-model serving platform

The single-model serving platform consists of the following components:

KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. It includes model-serving runtimes that implement the loading of given types of model servers. KServe handles the lifecycle of the deployment object, storage access, and networking setup.
Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.

To install the single-model serving platform, you have the following options:

Automated installation: If you have not already created a ServiceMeshControlPlane or KnativeServing resource on your OpenShift cluster, you can configure the Open Data Hub Operator to install KServe and configure its dependencies.
Manual installation: If you have already created a ServiceMeshControlPlane or KnativeServing resource on your OpenShift cluster, you cannot configure the Open Data Hub Operator to install KServe and configure its dependencies. In this situation, you must install KServe manually.
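
The choice between the two installation options can be sketched as a small decision function. This is only an illustration of the rule stated above; the function and parameter names are hypothetical and are not part of Open Data Hub.

```python
def choose_kserve_install(has_service_mesh_control_plane: bool,
                          has_knative_serving: bool) -> str:
    """Return the installation option implied by the rule above.

    If either resource already exists on the cluster, the Operator cannot
    configure the dependencies, so installation must be manual.
    """
    if has_service_mesh_control_plane or has_knative_serving:
        return "manual"
    return "automated"


# Hypothetical cluster states, for illustration only.
fresh_cluster = choose_kserve_install(False, False)
preconfigured_cluster = choose_kserve_install(True, False)
```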

When you have installed KServe, you can use the Open Data Hub dashboard to deploy models using preinstalled or custom model-serving runtimes.

Open Data Hub includes preinstalled runtimes for KServe. For more information, see Supported model-serving runtimes.

You can also configure monitoring for the single-model serving platform and use Prometheus to scrape the available metrics.

Model-serving runtimes

You can serve models on the single-model serving platform by using model-serving runtimes. The configuration of a model-serving runtime is defined by the ServingRuntime and InferenceService custom resource definitions (CRDs).

ServingRuntime

The ServingRuntime CRD creates a serving runtime, an environment for deploying and managing a model. It creates the templates for pods that dynamically load and unload models of various formats and also exposes a service endpoint for inferencing requests.

The following YAML configuration is an example of the vLLM ServingRuntime for KServe model-serving runtime. The configuration includes various flags, environment variables and command-line arguments.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' # (1)
    openshift.io/display-name: vLLM ServingRuntime for KServe # (2)
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
  namespace: <namespace>
spec:
  annotations:
    prometheus.io/path: /metrics # (3)
    prometheus.io/port: "8080" # (4)
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models # (5)
        - --served-model-name={{.Name}} # (6)
      command: # (7)
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21 # (8)
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false # (9)
  supportedModelFormats: # (10)
    - autoSelect: true
      name: vLLM

(1) The recommended accelerator to use with the runtime.
(2) The name with which the serving runtime is displayed.
(3) The endpoint used by Prometheus to scrape metrics for monitoring.
(4) The port used by Prometheus to scrape metrics for monitoring.
(5) The path to where the model files are stored in the runtime container.
(6) Passes the model name that is specified by the {{.Name}} template variable inside the runtime container specification to the runtime environment. The {{.Name}} variable maps to the spec.predictor.name field in the InferenceService metadata object.
(7) The entrypoint command that starts the runtime container.
(8) The runtime container image used by the serving runtime. This image differs depending on the type of accelerator used.
(9) Specifies that the runtime is used for single-model serving.
(10) Specifies the model formats supported by the runtime.
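
To make the role of the {{.Name}} template variable more concrete, the following sketch shows the effect of the substitution on the container arguments from the example above. It is a simplified illustration that uses plain string replacement; it does not reproduce how KServe actually renders templates, and the function name is hypothetical.

```python
def render_runtime_args(args, inference_service_name):
    """Replace the {{.Name}} placeholder in each container argument.

    Simplified stand-in for the template rendering that happens when a
    model is deployed; only the {{.Name}} variable is handled here.
    """
    return [arg.replace("{{.Name}}", inference_service_name) for arg in args]


# Arguments copied from the ServingRuntime example above.
runtime_args = [
    "--port=8080",
    "--model=/mnt/models",
    "--served-model-name={{.Name}}",
]

# "granite" is an example deployment name.
rendered_args = render_runtime_args(runtime_args, "granite")
```

With a deployment named granite, the last argument becomes --served-model-name=granite, which is the model name that clients then use in inference requests.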

InferenceService

The InferenceService CRD creates a server or inference service that processes inference queries, passes them to the model, and then returns the inference output.

The inference service also performs the following actions:

Specifies the location and format of the model.
Specifies the serving runtime used to serve the model.
Enables the passthrough route for gRPC or REST inference.
Defines HTTP or gRPC endpoints for the deployed model.

The following example shows the InferenceService YAML configuration file that is generated when deploying a granite model with the vLLM runtime:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vllm-runtime
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

Additional resources

Serving Runtimes

About KServe deployment modes

You can deploy models in either advanced or standard deployment mode.

Advanced deployment mode uses Knative Serverless. By default, KServe integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh to deploy models on the single-model serving platform. Red Hat OpenShift Serverless is based on the open source Knative project and requires the Red Hat OpenShift Serverless Operator.

Alternatively, you can use standard deployment mode, which uses KServe RawDeployment mode and does not require the Red Hat OpenShift Serverless Operator, Red Hat OpenShift Service Mesh, or Authorino.

If you configure KServe for advanced deployment mode, you can set up your data science project to serve models in both advanced and standard deployment mode. However, if you configure KServe for only standard deployment mode, you can only use standard deployment mode.

There are both advantages and disadvantages to using each of these deployment modes:

Advanced mode

Advantages:

Enables autoscaling based on request volume:
- Resources scale up automatically when receiving incoming requests.
- Optimizes resource usage and maintains performance during peak times.
Supports scaling down to zero and back up by using Knative:
- Allows resources to scale down completely when there are no incoming requests.
- Saves costs by not running idle resources.

Disadvantages:

Has customization limitations:
- Serverless is backed by Knative and implicitly inherits the same design choices, such as when mounting multiple volumes.
Dependency on Knative for scaling:
- Introduces additional complexity in setup and management compared to traditional scaling methods.
Cluster scoped component:
- If the cluster already has Serverless configured, you must manually configure the cluster to make it work with Open Data Hub.

Standard mode

Advantages:

Enables deployment with Kubernetes resources, such as Deployment, Service, Route, and Horizontal Pod Autoscaler, without additional dependencies like Red Hat OpenShift Serverless, Red Hat OpenShift Service Mesh, and Authorino.
- The resulting model deployment has a smaller resource footprint compared to advanced mode.
Enables traditional Deployment/Pod configurations, such as mounting multiple volumes, which is not available using Knative.
- Beneficial for applications requiring complex configurations or multiple storage mounts.

Disadvantages:

Does not support automatic scaling:
- Does not support automatic scaling down to zero resources when idle.
- Might result in higher costs during periods of low traffic.
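
The relationship between the two modes can be sketched in code. The example below assumes that the mode of an individual deployment is expressed through the upstream KServe serving.kserve.io/deploymentMode annotation, with RawDeployment for standard mode and Serverless for advanced mode. The dashboard normally sets this for you, and the helper names are hypothetical.

```python
import copy

# Upstream KServe annotation; assumed here to be how the mode is recorded.
DEPLOYMENT_MODE_ANNOTATION = "serving.kserve.io/deploymentMode"
MODE_VALUES = {"standard": "RawDeployment", "advanced": "Serverless"}


def available_modes(kserve_configured_for):
    """Modes a project can use, given how KServe is configured."""
    if kserve_configured_for == "advanced":
        return ["advanced", "standard"]
    return ["standard"]


def with_deployment_mode(inference_service, mode):
    """Return a copy of an InferenceService dict annotated with the mode."""
    if mode not in MODE_VALUES:
        raise ValueError(f"unknown deployment mode: {mode}")
    result = copy.deepcopy(inference_service)
    annotations = result.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[DEPLOYMENT_MODE_ANNOTATION] = MODE_VALUES[mode]
    return result


# Minimal example manifest, for illustration only.
manifest = {"kind": "InferenceService", "metadata": {"name": "granite"}}
standard_manifest = with_deployment_mode(manifest, "standard")
```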

Installing KServe

To learn how to perform both automated and manual installation of KServe, see Installation in the caikit-tgis-serving repository.

Deploying models by using the single-model serving platform

On the single-model serving platform, each model is deployed on its own model server. This helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Important

If you want to use the single-model serving platform to deploy a model from S3-compatible storage that uses a self-signed SSL certificate, you must install a certificate authority (CA) bundle on your OpenShift cluster. For more information, see Understanding how Open Data Hub handles certificates.

Enabling the single-model serving platform

When you have installed KServe, you can use the Open Data Hub dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have installed KServe.
The spec.dashboardConfig.disableKServe dashboard configuration option is set to false (the default).

For more information about setting dashboard configuration options, see Customizing the dashboard.
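
As an illustration of the dashboard option named above, the following sketch builds the body of a merge patch that sets spec.dashboardConfig.disableKServe to false. The field path comes from this section; how you apply the patch to the dashboard configuration resource is left open, and the helper name is hypothetical.

```python
import json


def build_dashboard_patch(option, value):
    """Build a merge-patch body that sets one spec.dashboardConfig option."""
    return {"spec": {"dashboardConfig": {option: value}}}


# false is the default; setting it explicitly keeps KServe visible in the dashboard.
patch = build_dashboard_patch("disableKServe", False)
patch_json = json.dumps(patch)
```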

Procedure

Enable the single-model serving platform as follows:
1. In the left menu, click Settings → Cluster settings.
2. Locate the Model serving platforms section.
3. To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
4. Select Standard (No additional dependencies) or Advanced (Serverless and Service Mesh) deployment mode.
  
  For more information about these deployment mode options, see About KServe deployment modes.
5. Click Save changes.
Enable preinstalled runtimes for the single-model serving platform as follows:
1. In the left menu of the Open Data Hub dashboard, click Settings → Serving runtimes.
  
  The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.
  
  For more information about preinstalled runtimes, see Supported runtimes.
2. Set the runtime that you want to use to Enabled.
  
  The single-model serving platform is now available for model deployments.

Adding a custom model-serving runtime for the single-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with Open Data Hub. You can also add your own custom runtimes if the default runtimes do not meet your needs.

As an administrator, you can use the Open Data Hub interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.

Note

Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
- Upload a YAML file
  
  Click Upload files.
  
  In the file browser, select a YAML file on your computer.
  
  The embedded YAML editor opens and shows the contents of the file that you uploaded.
- Enter YAML code directly in the editor
  
  Click Start from scratch.
  
  Enter or paste YAML code directly in the embedded editor.
Note
In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.
Click Add.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Adding a tested and verified model-serving runtime for the single-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes such as the NVIDIA Triton Inference Server to support your needs. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Open Data Hub.

You can use the Open Data Hub dashboard to add and enable the NVIDIA Triton Inference Server or the Seldon MLServer runtime for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.
Click Start from scratch.

Follow these steps to add the NVIDIA Triton Inference Server runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi

Follow these steps to add the Seldon MLServer runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

In the metadata.name field, make sure that the value of the runtime you are adding does not match a runtime that you have already added.
Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```
Note
If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
Click Create.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
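
The two naming rules from the procedure (the metadata.name value must not match an existing runtime, and the display name falls back to metadata.name) can be sketched as follows. This is an illustration of those rules only, not the dashboard's actual implementation; the function names are hypothetical.

```python
def display_name(runtime):
    """Return the custom display name, or metadata.name if none is set."""
    metadata = runtime.get("metadata", {})
    annotations = metadata.get("annotations") or {}
    return annotations.get("openshift.io/display-name") or metadata.get("name")


def name_is_available(new_runtime, existing_runtimes):
    """Check that metadata.name does not match an already added runtime."""
    new_name = new_runtime["metadata"]["name"]
    return all(rt["metadata"]["name"] != new_name for rt in existing_runtimes)


# Example runtimes based on the YAML in this section.
triton = {
    "metadata": {
        "name": "kserve-triton",
        "annotations": {"openshift.io/display-name": "Triton ServingRuntime"},
    }
}
mlserver = {"metadata": {"name": "mlserver-kserve-rest"}}
```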

Additional resources

Tested and verified model-serving runtimes

Deploying models on the single-model serving platform

When you have enabled the single-model serving platform, you can enable a preinstalled or custom model-serving runtime and deploy models on the platform.

You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

Prerequisites

You have logged in to Open Data Hub.
You have installed KServe.
You have enabled the single-model serving platform.
(Advanced deployments only) To enable token authentication and external model routes for deployed models, you have added Authorino as an authorization provider.
You have created a data science project.
You have access to S3-compatible object storage.
For the model that you want to deploy, you know the associated URI in your S3-compatible object storage bucket or Open Container Initiative (OCI) container.
To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
To use the vLLM NVIDIA GPU ServingRuntime for KServe runtime or use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
To use the vLLM runtime on IBM Z and IBM Power, use the vLLM CPU ServingRuntime for KServe. For IBM Z and IBM Power, the vLLM runtime is supported only on CPU.
To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in Open Data Hub. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Setting up Gaudi for OpenShift and Working with hardware profiles.
To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphics processing units (GPUs) in Open Data Hub. This includes installing the AMD GPU Operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.
To deploy RHEL AI models:
- You have enabled the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
- You have downloaded the model from the Red Hat container registry and uploaded it to S3-compatible object storage.

Procedure

In the left menu, click Data science projects.

The Data science projects page opens.
Click the name of the project that you want to deploy a model in.

A project details page opens.
Click the Models tab.
Perform one of the following actions:
- If you see a Single-model serving platform tile, click Deploy model on the tile.
- If you do not see any tiles, click the Deploy model button.
The Deploy model dialog opens.
In the Model deployment name field, enter a unique name for the model that you are deploying.
In the Serving runtime field, select an enabled runtime. If project-scoped runtimes exist, the Serving runtime list includes subheadings to distinguish between global runtimes and project-scoped runtimes.
From the Model framework (name - version) list, select a value.
From the Deployment mode list, select Standard or Advanced. For more information about deployment modes, see About KServe deployment modes.
In the Number of model server replicas to deploy field, specify a value.

The following options are only available if you have created a hardware profile:

From the Hardware profile list, select a hardware profile. If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.

Important

By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list appears instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard.

Optional: The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both. To change these default values, click Customize resource requests and limit and enter new minimum (request) and maximum (limit) values.

Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
To require token authentication for inference requests to the deployed model, perform the following actions:
1. Select Require token authentication.
2. In the Service account name field, enter the service account name that the token will be generated for.
3. To add an additional service account, click Add a service account and enter another service account name.

To specify the location of your model, perform one of the following sets of actions:

To use an existing connection

Select Existing connection.

From the Name list, select a connection that you previously defined.

For S3-compatible object storage: In the Path field, enter the folder path that contains the model in your specified data source.

For Open Container Initiative (OCI) connections: In the OCI storage location field, enter the model URI where the model is located.

Note

If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections will be labeled as Recommended.

To use a new connection
1. To define a new connection that your model can access, select New connection.
  1. In the Add connection modal, select a Connection type. The OCI-compliant registry, S3 compatible object storage, and URI options are pre-installed connection types. Additional options might be available if your Open Data Hub administrator added them.
    
    The Add connection form opens with fields specific to the connection type that you selected.
2. Fill in the connection detail fields.

(Optional) Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
  
  The Configuration parameters section shows predefined serving runtime parameters, if any are available.
  
  Note
  Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.

Stopping and starting a deployed model

You can stop a deployed model to perform edits without consuming cluster resources or triggering a redeployment. When you stop a model, all associated objects are terminated, and the model is unavailable for inference requests. When you start the model again, any pending configuration changes are applied.

Prerequisites

You have logged in to Open Data Hub.
You have deployed a model in a data science project.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.
Locate the model that you want to stop or start.
In the Status column for the model, click Stop or Start.

When you stop the model, the status changes to Stopping as the pods are terminated, and then changes to Stopped. When you start the model, the status changes to Starting as new pods are created, and then changes to Running.

Deploying models by using multiple GPU nodes

Deploy models across multiple GPU nodes to handle large models, such as large language models (LLMs).

You can serve models on Open Data Hub across multiple GPU nodes using the vLLM serving framework. Multi-node inferencing uses the vllm-multinode-runtime custom runtime, which uses the same image as the vLLM NVIDIA GPU ServingRuntime for KServe runtime and also includes information necessary for multi-GPU inferencing.

You can deploy the model from a persistent volume claim (PVC) or from an Open Container Initiative (OCI) container image.

Prerequisites

You have cluster administrator privileges for your OpenShift Container Platform cluster.
You have downloaded and installed the OpenShift Container Platform command-line interface (CLI). For more information, see Installing the OpenShift CLI.
You have enabled the operators for your GPU type, such as the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information about enabling accelerators, see Working with accelerators.
- You are using an NVIDIA GPU (nvidia.com/gpu).
- You have specified the GPU type through either the ServingRuntime or InferenceService. If the GPU type specified in the ServingRuntime differs from what is set in the InferenceService, both GPU types are assigned to the resource and can cause errors.
You have enabled KServe on your cluster.
You have only one head pod in your setup. Do not adjust the replica count using the minReplicas or maxReplicas settings in the InferenceService. Creating additional head pods can cause them to be excluded from the Ray cluster.
To deploy from a PVC: You have a persistent volume claim (PVC) set up and configured for ReadWriteMany (RWX) access mode.
To deploy from an OCI container image:
- You have stored a model in an OCI container image.
- If the model is stored in a private OCI repository, you have configured an image pull secret.

Procedure

In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift Container Platform CLI as shown in the following example:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Select or create a namespace for deploying the model. For example, run the following command to create the kserve-demo namespace:
```
oc new-project kserve-demo
```

(Deploying a model from a PVC only) Create a PVC for model storage in the namespace where you want to deploy the model. Create a storage class using Filesystem volumeMode and use this storage class for your PVC. The storage size must be larger than the size of the model files on disk.

Note	If you have already configured a PVC or are deploying a model from an OCI container image, you can skip this step.

For example:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-8b-code-base-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: <model size>
  storageClassName: <storage class>
EOF

Create a pod to download the model to the PVC you created. Update the sample YAML with your bucket name, model path, and credentials:

apiVersion: v1
kind: Pod
metadata:
  name: download-granite-8b-code
  labels:
    name: download-granite-8b-code
spec:
  volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: granite-8b-code-base-pvc
  restartPolicy: Never
  initContainers:
    - name: fix-volume-permissions
      image: quay.io/quay/busybox@sha256:92f3298bf80a1ba949140d77987f5de081f010337880cd771f7e7fc928f8c74d
      command: ["sh"]
      args: ["-c", "mkdir -p /mnt/models/$(MODEL_PATH) && chmod -R 777 /mnt/models"] # (1)
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume
      env:
        - name: MODEL_PATH
          value: <model path> # (2)
  containers:
    - resources:
        requests:
          memory: 40Gi
      name: download-model
      imagePullPolicy: IfNotPresent
      image: quay.io/opendatahub/kserve-storage-initializer:v0.14 # (3)
      args:
        - 's3://$(BUCKET_NAME)/$(MODEL_PATH)/'
        - /mnt/models/$(MODEL_PATH)
      env:
        - name: AWS_ACCESS_KEY_ID
          value: <id> # (4)
        - name: AWS_SECRET_ACCESS_KEY
          value: <secret> # (5)
        - name: BUCKET_NAME
          value: <bucket_name> # (6)
        - name: MODEL_PATH
          value: <model path> # (2)
        - name: S3_USE_HTTPS
          value: "1"
        - name: AWS_ENDPOINT_URL
          value: <AWS endpoint> # (7)
        - name: awsAnonymousCredential
          value: 'false'
        - name: AWS_DEFAULT_REGION
          value: <region> # (8)
        - name: S3_VERIFY_SSL
          value: 'true' # (9)
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume

1. The chmod operation is permitted only if your pod is running as root. Remove `chmod -R 777` from the arguments if you are not running the pod as root.
2. Specify the path to the model.
3. The value for containers.image, located in your download-model container. To access this value, run the following command: oc get configmap inferenceservice-config -n opendatahub -oyaml | grep kserve-storage-initializer:
4. The access key ID to your S3 bucket.
5. The secret access key to your S3 bucket.
6. The name of your S3 bucket.
7. The endpoint to your S3 bucket.
8. The region for your S3 bucket if using an AWS S3 bucket. If using other S3-compatible storage, such as ODF or Minio, you can remove the AWS_DEFAULT_REGION environment variable.
9. If you encounter SSL errors, change S3_VERIFY_SSL to false.

Create the vllm-multinode-runtime custom runtime in your project namespace:
```
oc process vllm-multinode-runtime-template -n opendatahub | oc apply -f -
```
Create a YAML file that defines the InferenceService for the model, as in the following example:
```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: <inference service name>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: <storage_uri_path> # (1)
    workerSpec: {} # (2)
```
1. Specify the path to your model based on your deployment method:
  - For PVC: pvc://<pvc_name>/<model_path>
  - For an OCI container image: oci://<registry_host>/<org_or_username>/<repository_name><tag_or_digest>, where <tag_or_digest> includes its separator (for example, :<tag> or @<digest>)
2. The following configuration can be added to the InferenceService:
  - workerSpec.tensorParallelSize: Determines how many GPUs are used per node. The GPU type count in both the head and worker node deployment resources is updated automatically. Ensure that the value of workerSpec.tensorParallelSize is at least 1.
  - workerSpec.pipelineParallelSize: Determines how many nodes are used to balance the model in deployment. This variable represents the total number of nodes, including both the head and worker nodes. Ensure that the value of workerSpec.pipelineParallelSize is at least 2. Do not modify this value in production environments.
    
    Note
    You may need to specify additional arguments, depending on your environment and model size.
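As a sketch of how these settings fit together, the following shell snippet writes an InferenceService manifest with both workerSpec fields set. The file name multinode-isvc.yaml, the model path, and the parallelism values are assumptions chosen for illustration; the service and PVC names reuse the granite-8b-code-base-pvc example from this procedure.

```shell
# Sketch only: every value below is an assumed example, not a recommendation.
ISVC_NAME="granite-8b-code-base-pvc"   # hypothetical InferenceService name
PVC_NAME="granite-8b-code-base-pvc"    # PVC name from the earlier example
MODEL_PATH="models/granite-8b"         # hypothetical path inside the PVC
TENSOR_PARALLEL_SIZE=1                 # GPUs per node (at least 1)
PIPELINE_PARALLEL_SIZE=2               # total nodes, head included (at least 2)

# Write the manifest; shell variables are expanded inside the heredoc.
cat > multinode-isvc.yaml <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: ${ISVC_NAME}
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: pvc://${PVC_NAME}/${MODEL_PATH}
    workerSpec:
      tensorParallelSize: ${TENSOR_PARALLEL_SIZE}
      pipelineParallelSize: ${PIPELINE_PARALLEL_SIZE}
EOF

# Review the generated file before applying it.
cat multinode-isvc.yaml
```

With these example values, the deployment would span two nodes with one GPU each.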
Deploy the model by applying the InferenceService configuration:
```
oc apply -f <inference-service-file.yaml>
```

Verification

To confirm that your environment is set up to deploy models on multiple GPU nodes, check the GPU resource status, the InferenceService status, and the Ray cluster status, and then send a request to the model.

Check the GPU resource status:

Retrieve the pod names for the head and worker nodes:

# Get pod name
podName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor-worker --no-headers|cut -d' ' -f1)

oc wait --for=condition=ready pod/${podName} --timeout=300s
# Check the GPU memory size for both the head and worker pods:
echo "### HEAD NODE GPU Memory Size"
kubectl exec $podName -- nvidia-smi
echo "### Worker NODE GPU Memory Size"
kubectl exec $workerPodName -- nvidia-smi

Sample response

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   33C    P0             71W /  300W |19031MiB /  23028MiB <1>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
         ...
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0             69W /  300W |18959MiB /  23028MiB <2>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Confirm that the model loaded properly by checking the values of <1> and <2>. If the model did not load, the value of these fields is 0MiB.
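This check can also be scripted. The following sketch is an illustration rather than part of the product: it extracts the used-memory figure from nvidia-smi output and fails if any GPU reports 0MiB. The helper names gpu_used_mib and model_loaded are hypothetical.

```shell
# Print the used GPU memory in MiB, one line per GPU row read from stdin.
gpu_used_mib() {
  grep -oE '[0-9]+MiB */ *[0-9]+MiB' | sed -E 's/^([0-9]+)MiB.*/\1/'
}

# Succeed only if at least one value was read and none of them is 0.
model_loaded() {
  awk '{ rows++; if ($1 + 0 == 0) bad = 1 } END { exit (rows == 0 || bad) }'
}

# Offline demonstration with one row modeled on the sample response.
sample_row='|  0%   33C    P0             71W /  300W |19031MiB /  23028MiB |      0%      Default |'
printf '%s\n' "$sample_row" | gpu_used_mib
```

Against a live deployment, the same helpers could be chained as `kubectl exec $podName -- nvidia-smi | gpu_used_mib | model_loaded`, and repeated for the worker pod.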

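Verify the status of your InferenceService using the following commands, where DEMO_NAMESPACE is set to the namespace in which you deployed the model:

oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export MODEL_NAME=granite-8b-code-base-pvc
oc get inferenceservice $MODEL_NAME -n $DEMO_NAMESPACE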

Sample response

   NAME                 URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
   granite-8b-code-base-pvc   http://granite-8b-code-base-pvc.default.example.com

Send a request to the model to confirm that the model is available for inference:

oc wait --for=condition=ready pod/${podName} -n vllm-multinode --timeout=300s

oc port-forward $podName 8080:8080 &

curl http://localhost:8080/v1/completions \
       -H "Content-Type: application/json" \
       -d "{
            \"model\": \"$MODEL_NAME\",
            \"prompt\": \"At what temperature does Nitrogen boil?\",
            \"max_tokens\": 100,
            \"temperature\": 0
        }"

Setting a timeout for KServe

When deploying large models or using node autoscaling with KServe, the operation may time out before a model is deployed because the default progress-deadline that Knative Serving sets is 10 minutes.

If a pod using Knative Serving takes longer than 10 minutes to deploy, the pod might be automatically marked as failed. This can happen if you are deploying large models that take longer than 10 minutes to pull from S3-compatible object storage or if you are using node autoscaling to reduce the consumption of GPU nodes.

To resolve this issue, you can set a custom progress-deadline in the KServe InferenceService for your application.

Prerequisites

You have namespace edit access for your OpenShift Container Platform cluster.

Procedure

Log in to the OpenShift Container Platform console as a cluster administrator.
Select the project where you have deployed the model.
In the Administrator perspective, click Home → Search.
From the Resources dropdown menu, search for InferenceService.
Under spec.predictor.annotations, set the serving.knative.dev/progress-deadline annotation to the new timeout:
```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-inference-service
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 30m
```
Note

Ensure that you set the progress-deadline at the spec.predictor.annotations level, so that the KServe InferenceService can copy the progress-deadline back to the Knative Service object.
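
The same annotation can also be set from the command line. The following sketch builds a JSON merge patch for the annotation. The helper name progress_deadline_patch is hypothetical, and my-inference-service and <project> stand in for your own resource and namespace.

```shell
# Build a merge patch that sets the progress-deadline annotation on the predictor.
progress_deadline_patch() {
  printf '{"spec":{"predictor":{"annotations":{"serving.knative.dev/progress-deadline":"%s"}}}}\n' "$1"
}

PATCH="$(progress_deadline_patch 30m)"
echo "$PATCH"

# To apply the patch, log in to the cluster and run:
#   oc patch inferenceservice my-inference-service -n <project> --type merge -p "$PATCH"
```

Because the patch targets spec.predictor.annotations, it follows the placement that the note above calls for.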

Customizing the parameters of a deployed model-serving runtime

You might need additional parameters beyond the default ones to deploy specific models or to enhance an existing model deployment. In such cases, you can modify the parameters of an existing runtime to suit your deployment needs.

Note	Customizing the parameters of a runtime only affects the selected model deployment.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have deployed a model on the single-model serving platform.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.

The Model deployments page opens.
Click Stop next to the name of the model you want to customize.
Click the action menu (⋮) and select Edit.

The Configuration parameters section shows predefined serving runtime parameters, if any are available.
Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
  
  Note
  Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
After you are done customizing the runtime parameters, click Redeploy to save.
Click Start to deploy the model with your changes.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
Confirm that the arguments and variables that you set appear in spec.predictor.model.args and spec.predictor.model.env by one of the following methods:
- Checking the InferenceService YAML from the OpenShift Container Platform Console.
- Using the following command in the OpenShift Container Platform CLI:
  oc get -o json inferenceservice <inferenceservicename/modelname> -n <projectname>

Customizable model serving runtime parameters

You can modify the parameters of an existing model serving runtime to suit your deployment needs.

For more information about parameters for each of the supported serving runtimes, see the following table:

Serving runtime	Resource
Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe	Caikit NLP: Configuration; TGIS: Model configuration
Caikit Standalone ServingRuntime for KServe	Caikit NLP: Configuration
NVIDIA Triton Inference Server	NVIDIA Triton Inference Server: Model Parameters
OpenVINO Model Server	OpenVINO Model Server Features: Dynamic Input Parameters
Seldon MLServer	MLServer Documentation: Model Settings
[Deprecated] Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe	TGIS: Model configuration
vLLM NVIDIA GPU ServingRuntime for KServe	vLLM: Engine Arguments; OpenAI-Compatible Server
vLLM AMD GPU ServingRuntime for KServe	vLLM: Engine Arguments; OpenAI-Compatible Server
vLLM Intel Gaudi Accelerator ServingRuntime for KServe	vLLM: Engine Arguments; OpenAI-Compatible Server

Additional resources

Customizing the parameters of a deployed model serving runtime

Using accelerators with vLLM

Open Data Hub includes support for NVIDIA, AMD and Intel Gaudi accelerators. Open Data Hub also includes preinstalled model-serving runtimes that provide accelerator support.

NVIDIA GPUs

You can serve models with NVIDIA graphics processing units (GPUs) by using the vLLM NVIDIA GPU ServingRuntime for KServe runtime. To use the runtime, you must enable GPU support in Open Data Hub. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

Intel Gaudi accelerators

You can serve models with Intel Gaudi accelerators by using the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime. To use the runtime, you must enable hybrid processing unit (HPU) support in Open Data Hub. This includes installing the Intel Gaudi AI accelerator operator and configuring a hardware profile. For more information, see Setting up Gaudi for OpenShift and Working with hardware profiles.

For information about recommended vLLM parameters, environment variables, supported configurations and more, see vLLM with Intel® Gaudi® AI Accelerators.

Note

Warm-up is a model initialization and performance optimization step that is useful for reducing cold-start delays and first-inference latency. Depending on the model size, warm-up can lead to longer model loading times.

Warm-up is highly recommended in production environments to avoid performance limitations. In non-production environments, you can skip warm-up to reduce model loading times and accelerate model development and testing cycles. To skip warm-up, follow the steps described in Customizing the parameters of a deployed model-serving runtime to add the following environment variable in the Configuration parameters section of your model deployment:

`VLLM_SKIP_WARMUP="true"`
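
According to the verification steps for customizing runtime parameters, environment variables that you add appear under spec.predictor.model.env in the InferenceService. As a rough sketch, the entry for this variable would be expected to look like the fragment printed below. The helper name warmup_env_fragment is hypothetical and exists only to print the fragment.

```shell
# Print the env entry as it would be expected to appear in the InferenceService.
warmup_env_fragment() {
  cat <<EOF
spec:
  predictor:
    model:
      env:
        - name: VLLM_SKIP_WARMUP
          value: "$1"
EOF
}

warmup_env_fragment true
```

The value is quoted so that YAML treats it as a string, which is what environment variable values require.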

AMD GPUs

You can serve models with AMD GPUs by using the vLLM AMD GPU ServingRuntime for KServe runtime. To use the runtime, you must enable support for AMD graphics processing units (GPUs) in Open Data Hub. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.

Additional resources

Supported model-serving runtimes

Using OCI containers for model storage

As an alternative to storing a model in an S3 bucket or URI, you can upload models to Open Container Initiative (OCI) containers. Deploying models from OCI containers is also known as modelcars in KServe.

Using OCI containers for model storage can help you:

Reduce startup times by avoiding downloading the same model multiple times.
Reduce disk space usage by reducing the number of models downloaded locally.
Improve model performance by allowing pre-fetched images.

Using OCI containers for model storage involves the following tasks:

Storing a model in an OCI image.
Deploying a model from an OCI image by using either the user interface or the command line interface. To deploy a model by using:
- The user interface, see Deploying models on the single-model serving platform.
- The command line interface, see Deploying a model stored in an OCI image by using the CLI.

Additional resources

Serving models with OCI images

Storing a model in an OCI image

You can store a model in an OCI image. The following procedure uses the example of storing a MobileNet v2-7 model in ONNX format.

Prerequisites

You have a model in the ONNX format. The example in this procedure uses the MobileNet v2-7 model in ONNX format.
You have installed the Podman tool.

Procedure

In a terminal window on your local machine, create a temporary directory for storing both the model and the support files that you need to create the OCI image:
```
cd $(mktemp -d)
```

Create a models folder inside the temporary directory:

mkdir -p models/1

Note	This example command specifies the subdirectory `1` because OpenVINO requires numbered subdirectories for model versioning. If you are not using OpenVINO, you do not need to create the `1` subdirectory to use OCI container images.

Download the model and support files:

DOWNLOAD_URL=https://github.com/onnx/models/raw/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
curl -L $DOWNLOAD_URL -O --output-dir models/1/

Use the tree command to confirm that the model files are located in the directory structure as expected:
```
tree
```
The tree command should return a directory structure similar to the following example:
```
.
└── models
    └── 1
        └── mobilenetv2-7.onnx
```

Create a container build file named Containerfile:

Note

Specify a base image that provides a shell. In the following example, ubi9-micro is the base container image. You cannot specify an empty image that does not provide a shell, such as scratch, because KServe uses the shell to ensure the model files are accessible to the model server.
Change the ownership of the copied model files and grant read permissions to the root group to ensure that the model server can access the files. OpenShift runs containers with a random user ID and the root group ID.

FROM registry.access.redhat.com/ubi9/ubi-micro:latest
COPY --chown=0:0 models /models
RUN chmod -R a=rX /models

# nobody user
USER 65534

Use the podman build and podman push commands to create the OCI container image and upload it to a registry. The following commands use Quay as the registry.

Note

If your repository is private, ensure that you are authenticated to the registry before uploading your container image.

podman build --format=oci -t quay.io/<user_name>/<repository_name>:<tag_name> .
podman push quay.io/<user_name>/<repository_name>:<tag_name>

Deploying a model stored in an OCI image by using the CLI

You can deploy a model that is stored in an OCI image from the command line interface.

The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.

Note	By default in KServe, models are exposed outside the cluster and not protected with authentication.

Prerequisites

You have stored a model in an OCI image as described in Storing a model in an OCI image.
If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
You are logged in to your OpenShift cluster.

Procedure

Create a project to deploy the model:
```
oc new-project oci-model-example
```
Use the Open Data Hub project kserve-ovms template to create a ServingRuntime resource and configure the OpenVINO model server in the new project:
```
oc process -n opendatahub -o yaml kserve-ovms | oc apply -f -
```

Verify that the ServingRuntime named kserve-ovms is created:

oc get servingruntimes

The command should return output similar to the following:

NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
kserve-ovms              openvino_ir   kserve-container   1m

Create an InferenceService YAML resource, depending on whether the model is stored in a private or a public OCI repository:

For a model stored in a public OCI repository, create an InferenceService YAML file with the following values, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it

For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field, as shown in the following example:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-private-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
    imagePullSecrets: # Specify image pull secrets to use for fetching container images, including OCI model images
    - name: <pull-secret-name>

After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.

Verification

Check the status of the deployment:

oc get inferenceservice

The command should return output that includes information, such as the URL of the deployed model and its readiness state.

Accessing the authentication token for a deployed model

If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.

Prerequisites

You have logged in to Open Data Hub.
You have deployed a model by using the single-model serving platform.

Procedure

From the Open Data Hub dashboard, click Data science projects.

The Data science projects page opens.
Click the name of the project that contains your deployed model.

A project details page opens.
Click the Models tab.
In the Models and model servers list, expand the section for your model.

Your authentication token is shown in the Token authentication section, in the Token secret field.
Optional: To copy the authentication token for use in an inference request, click the Copy button next to the token value.

Accessing the inference endpoint for a deployed model

To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.

For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.

Prerequisites

You have logged in to Open Data Hub.
You have deployed a model by using the single-model serving platform.
If you enabled token authentication for your deployed model, you have the associated token value.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.

The inference endpoint for the model is shown in the Inference endpoint field.
Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
Use the endpoint to make API requests to your deployed model.
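
As a sketch of the last two steps, the following shell snippet joins a copied endpoint with a path and shows the shape of a request. The endpoint host is hypothetical, the helper name join_url is hypothetical, and the request assumes that the token is passed as a bearer token in the Authorization header; check Inference endpoints for the paths that your runtime supports.

```shell
# Join an endpoint and a path with exactly one slash between them.
join_url() {
  printf '%s/%s\n' "${1%/}" "${2#/}"
}

INFERENCE_ENDPOINT="https://my-model.example.com/"   # hypothetical endpoint copied from the dashboard
API_PATH="/v1/completions"                           # path used in the vLLM request example in this document
URL="$(join_url "$INFERENCE_ENDPOINT" "$API_PATH")"
echo "$URL"

# With token authentication enabled, a request might look like this
# (TOKEN holds the value of the Token secret field):
#   curl -sS "$URL" \
#     -H "Authorization: Bearer ${TOKEN}" \
#     -H "Content-Type: application/json" \
#     -d '{"model": "<model_name>", "prompt": "Hello", "max_tokens": 10}'
```

Trimming the slashes before joining avoids a doubled slash when the copied endpoint already ends with one.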

Configuring monitoring for the single-model serving platform

The single-model serving platform includes metrics for supported runtimes of the KServe component. KServe does not generate its own metrics, and relies on the underlying model-serving runtimes to provide them. The set of available metrics for a deployed model depends on its model-serving runtime.

In addition to runtime metrics for KServe, you can also configure monitoring for OpenShift Service Mesh. The OpenShift Service Mesh metrics help you to understand dependencies and traffic flow between components in the mesh.

Prerequisites

You have cluster administrator privileges for your OpenShift Container Platform cluster.
You have created OpenShift Service Mesh and Knative Serving instances and installed KServe.
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.
You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.
You have assigned the monitoring-rules-view role to users that will monitor metrics.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: debug
      retention: 15d
```
The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.
Apply the configuration to create the user-workload-monitoring-config object.
```
$ oc apply -f uwm-cm-conf.yaml
```
Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```
The cluster-monitoring-config object enables monitoring for user-defined projects.
Apply the configuration to create the cluster-monitoring-config object.
```
$ oc apply -f uwm-cm-enable.yaml
```

Create ServiceMonitor and PodMonitor objects to monitor metrics in the service mesh control plane as follows:

Create an istiod-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-monitor
  namespace: istio-system
spec:
  targetLabels:
  - app
  selector:
    matchLabels:
      istio: pilot
  endpoints:
  - port: http-monitoring
    interval: 30s

Deploy the ServiceMonitor CR in the specified istio-system namespace.
```
$ oc apply -f istiod-monitor.yaml
```
You see the following output:
```
servicemonitor.monitoring.coreos.com/istiod-monitor created
```

Create an istio-proxies-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-proxies-monitor
  namespace: istio-system
spec:
  selector:
    matchExpressions:
    - key: istio-prometheus-ignore
      operator: DoesNotExist
  podMetricsEndpoints:
  - path: /stats/prometheus
    interval: 30s

Deploy the PodMonitor CR in the specified istio-system namespace.

$ oc apply -f istio-proxies-monitor.yaml

You see the following output:

podmonitor.monitoring.coreos.com/istio-proxies-monitor created

Viewing model-serving runtime metrics for the single-model serving platform

When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.

Procedure

Log in to the OpenShift Container Platform web console.
Switch to the Developer perspective.
In the left menu, click Observe.

As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. Some examples are shown.

The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:

sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))

Note	Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.

The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:
```
sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
```
The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
```
sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
```
The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
```
sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
```
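The `${namespace}`, `${model_name}`, and `${rate_interval}` placeholders in these queries must be replaced with real values before you run them. The following Python sketch shows one way to do that for the vLLM query. It assumes that label values need to be wrapped in double quotes, as PromQL requires for string matchers; the helper names and the sample values `demo` and `granite` are hypothetical.

```python
from string import Template

# The vLLM query from this section, kept as a template.
VLLM_SUCCESS = Template(
    "sum(increase(vllm:request_success_total"
    "{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))"
)


def quote(value):
    """Wrap a label value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_query(template, namespace, model_name, rate_interval="5m"):
    """Fill in the placeholders; the range interval is not quoted."""
    return template.substitute(
        namespace=quote(namespace),
        model_name=quote(model_name),
        rate_interval=rate_interval,
    )


print(render_query(VLLM_SUCCESS, "demo", "granite"))
```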

Monitoring model performance

In the single-model serving platform, you can view performance metrics for a specific model that is deployed on the platform.

Viewing performance metrics for a deployed model

You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:

Number of requests - The number of requests that have failed or succeeded for a specific model.
Average response time (ms) - The average time it takes a specific model to respond to requests.
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.

Prerequisites

You have installed Open Data Hub.
A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.
You have logged in to Open Data Hub.
The following dashboard configuration options are set to the default values as shown:
```
disablePerformanceMetrics: false
disableKServeMetrics: false
```
For more information about setting dashboard configuration options, see Customizing the dashboard.
You have deployed a model on the single-model serving platform by using a preinstalled runtime.

Note

Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.

Procedure

From the Open Data Hub dashboard navigation menu, click Data science projects.

The Data science projects page opens.
Click the name of the project that contains the data science models that you want to monitor.
In the project details page, click the Models tab.
Select the model that you are interested in.
On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.

Verification

The Endpoint performance tab shows graphs of metrics for the model.

Deploying a Grafana metrics dashboard

You can deploy a Grafana metrics dashboard for User Workload Monitoring (UWM) to monitor performance and resource usage metrics for models deployed on the single-model serving platform.

You can create a Kustomize overlay, similar to this example. Use the overlay to deploy preconfigured metrics dashboards for models deployed with OpenVINO Model Server (OVMS) and vLLM.

Prerequisites

You have cluster admin privileges for your OpenShift cluster.
A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.

You have created an overlay to deploy a Grafana instance, similar to this example.

Note	To view GPU metrics, you must enable the NVIDIA GPU monitoring dashboard as described in Enabling the GPU monitoring dashboard. The GPU monitoring dashboard provides a comprehensive view of GPU utilization, memory usage, and other metrics for your GPU nodes.

Procedure

In a terminal window, log in to the OpenShift CLI as a cluster administrator.
If you have not already created the overlay to install the Grafana operator and metrics dashboards, refer to the RHOAI UWM repository to create it.
Install the Grafana instance and metrics dashboards on your OpenShift cluster with the overlay that you created. Replace <overlay-name> with the name of your overlay.
```
oc apply -k overlays/<overlay-name>
```
Retrieve the URL of the Grafana instance. Replace <namespace> with the namespace that contains the Grafana instance.
```
oc get route -n <namespace> grafana-route -o jsonpath='{.spec.host}'
```

Use the URL to access the Grafana instance:

grafana-<namespace>.apps.example-openshift.com

Verification

You can access the preconfigured dashboards available for KServe, vLLM and OVMS on the Grafana instance.

Deploying a vLLM/GPU metrics dashboard on a Grafana instance

Deploy Grafana boards to monitor accelerator and vLLM performance metrics.

Prerequisites

You have deployed a Grafana metrics dashboard, as described in Deploying a Grafana metrics dashboard.
You can access a Grafana instance.
You have installed envsubst, a command-line tool used to substitute environment variables in configuration files. For more information, see the GNU gettext documentation.

Procedure

Define a GrafanaDashboard object in a YAML file, similar to the following examples:
1. To monitor NVIDIA accelerator metrics, see nvidia-vllm-dashboard.yaml.
2. To monitor AMD accelerator metrics, see amd-vllm-dashboard.yaml.
3. To monitor Intel accelerator metrics, see gaudi-vllm-dashboard.yaml.
4. To monitor vLLM metrics, see grafana-vllm-dashboard.yaml.
Create an inputs.env file similar to the following example. Replace the NAMESPACE and MODEL_NAME parameters with your own values:
```
NAMESPACE=<namespace> (1)
MODEL_NAME=<model-name> (2)
```
1. NAMESPACE is the target namespace where the model will be deployed.
2. MODEL_NAME is the model name as defined in your InferenceService. The model name is also used to filter the pod name in the Grafana dashboard.
Replace the NAMESPACE and MODEL_NAME parameters in your YAML file with the values from the inputs.env file by performing the following actions:
1. Export the parameters described in the inputs.env as environment variables:
  export $(cat inputs.env | xargs)
2. Update the following YAML file, replacing the ${NAMESPACE} and ${MODEL_NAME} variables with the values of the exported environment variables, and dashboard_template.yaml with the name of the GrafanaDashboard object YAML file that you created earlier:
  envsubst '${NAMESPACE} ${MODEL_NAME}' < dashboard_template.yaml > dashboard_template-replaced.yaml
Confirm that your YAML file contains updated values.

Deploy the dashboard object:

oc create -f dashboard_template-replaced.yaml
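The substitution step in this procedure names the two variables explicitly so that other `$`-style variables in the dashboard file, such as Grafana's own `$instance`, are left untouched. The following Python sketch illustrates that behavior under simplifying assumptions: it handles only the braced `${NAME}` form, and the function names and sample values are hypothetical. It is not a replacement for envsubst.

```python
import re


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def substitute(template, values, allowed):
    """Replace ${NAME} only for names in `allowed`; leave everything else as is."""
    def replace(match):
        name = match.group(1)
        if name in allowed and name in values:
            return values[name]
        return match.group(0)

    return re.sub(r"\$\{(\w+)\}", replace, template)


env = parse_env("NAMESPACE=demo\nMODEL_NAME=granite\n")
line = 'pod=~"${MODEL_NAME}.*", namespace="${NAMESPACE}", instance=~"$instance"'
print(substitute(line, env, allowed={"NAMESPACE", "MODEL_NAME"}))
```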

Verification

You can see the accelerator and vLLM metrics dashboard on your Grafana instance.

Grafana metrics

You can use Grafana boards to monitor the accelerator and vLLM performance metrics. The datasource, instance and gpu are variables defined inside the board.

Accelerator metrics

Track metrics on your accelerators to ensure the health of the hardware.

NVIDIA GPU utilization

Tracks the percentage of time the GPU is actively processing tasks, indicating GPU workload levels.

Query

DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}

NVIDIA GPU memory utilization

Compares memory usage against free memory, which is critical for identifying memory bottlenecks in GPU-heavy workloads.

Query

DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"}

Sum

sum(DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"})

NVIDIA GPU temperature

Ensures the GPU operates within safe thermal limits to prevent hardware degradation.

Query

DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"}

Avg

avg(DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"})

NVIDIA GPU throttling

GPU throttling occurs when the GPU automatically reduces the clock to avoid damage from overheating.

You can access the following metrics to identify GPU throttling:

GPU temperature: Monitor the GPU temperature. Throttling often occurs when the GPU reaches a certain temperature, for example, 85-90°C.
SM clock speed: Monitor the core clock speed. A significant drop in the clock speed while the GPU is under load indicates throttling.
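The two signals above can be combined into a simple check, sketched here in Python. All thresholds are assumptions chosen for illustration: 85°C is the lower end of the range mentioned above, while the 20% clock drop and the 50% utilization that counts as "under load" are hypothetical values that you would tune for your hardware.

```python
def looks_throttled(temp_c, sm_clock_mhz, baseline_clock_mhz, gpu_util_pct,
                    temp_threshold_c=85.0, clock_drop_fraction=0.2,
                    busy_util_pct=50.0):
    """Return True when a hot GPU under load runs well below its usual clock."""
    under_load = gpu_util_pct >= busy_util_pct
    hot = temp_c >= temp_threshold_c
    # A "significant drop" is modeled as falling below a fraction of the baseline clock.
    clock_dropped = sm_clock_mhz <= baseline_clock_mhz * (1 - clock_drop_fraction)
    return under_load and hot and clock_dropped


# Hot, busy, and running at 1000 MHz instead of the usual 1400 MHz.
print(looks_throttled(88, 1000, 1400, 95))
```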

CPU metrics

You can track metrics on your CPU to ensure the health of the hardware.

CPU utilization

Tracks CPU usage to identify workloads that are CPU-bound.

Query

sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)

CPU-GPU bottlenecks

Combine CPU throttling and GPU utilization metrics to identify resource allocation inefficiencies. The following table outlines the combinations of CPU throttling and GPU utilization, and what they mean for your environment:

CPU throttling	GPU utilization	Meaning
Low	High	System well-balanced. GPU is fully used without CPU constraints.
High	Low	CPU resources are constrained. The CPU is unable to keep up with the GPU’s processing demands, and the GPU may be underused.
High	High	Workload is increasing for both CPU and GPU, and you might need to scale up resources.

Query

sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)
avg_over_time(DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}[5m])
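The table in this section can be expressed as a small lookup, sketched here in Python. Deciding whether a measured value counts as "high" or "low" is left to the caller, because this section does not define thresholds. The low/low combination is not covered by the table and is reported as such.

```python
# Keys are (cpu_throttling_is_high, gpu_utilization_is_high).
MEANINGS = {
    (False, True): "System well-balanced. GPU is fully used without CPU constraints.",
    (True, False): "CPU resources are constrained. The GPU may be underused.",
    (True, True): "Workload is increasing for both CPU and GPU. Consider scaling up.",
}


def classify(cpu_throttling_is_high, gpu_utilization_is_high):
    """Look up the meaning for one combination of the two signals."""
    key = (bool(cpu_throttling_is_high), bool(gpu_utilization_is_high))
    return MEANINGS.get(key, "Not covered by the table.")


print(classify(True, False))
```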

vLLM metrics

You can track metrics related to your vLLM model.

GPU and CPU cache utilization

Tracks the percentage of GPU memory used by the vLLM model, providing insights into memory efficiency.

Query

sum_over_time(vllm:gpu_cache_usage_perc{namespace="${namespace}",pod=~"$model_name.*"}[24h])

Running requests

The number of requests actively being processed. Helps monitor workload concurrency.

vllm:num_requests_running{namespace="$namespace", pod=~"$model_name.*"}

Waiting requests

Tracks requests in the queue, indicating system saturation.

vllm:num_requests_waiting{namespace="$namespace", pod=~"$model_name.*"}

Prefix cache hit rates

High hit rates imply efficient reuse of cached computations, optimizing resource usage.

Queries

vllm:gpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
vllm:cpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

Request total count

Query

vllm:request_success_total{finished_reason="length",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

The request ended because it reached the maximum token limit set for the model inference.

Query

vllm:request_success_total{finished_reason="stop",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}

The request completed naturally based on the model’s output or a stop condition, for example, the end of a sentence or token completion.

End-to-end latency

Measures the overall time to process a request for an optimal user experience.

Histogram queries

histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:e2e_request_latency_seconds_sum{namespace="$namespace", pod=~"$model_name.*",model_name="$model_name"}[5m])
rate(vllm:e2e_request_latency_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
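To make the histogram queries easier to interpret, the following Python sketch shows a simplified version of the linear interpolation that Prometheus documents for `histogram_quantile`, and how the `_sum` and `_count` rates combine into a mean latency. It assumes cumulative buckets sorted by upper bound and ending with the `+Inf` bucket. It omits details of the real implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the open-ended bucket: report the highest finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


def mean_latency(sum_rate, count_rate):
    """Mean latency is the rate of the _sum series divided by the rate of _count."""
    return sum_rate / count_rate if count_rate else math.nan


# Made-up cumulative buckets: (upper bound in seconds, observations <= bound).
buckets = [(0.1, 10), (0.5, 50), (1.0, 90), (math.inf, 100)]
print(histogram_quantile(0.5, buckets), mean_latency(12.0, 4.0))
```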

Time to first token (TTFT) latency

The time taken to generate the first token in a response.

Histogram queries

histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_to_first_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_to_first_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

Time per output token (TPOT) latency

The average time taken to generate each output token.

Histogram queries

histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_per_output_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_per_output_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])

Prompt token throughput and generation throughput

Tracks the rate at which prompt tokens are processed and output tokens are generated.

Queries

rate(vllm:prompt_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
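Both queries apply `rate()` to a counter that only ever increases. The following Python sketch shows the core idea: the per-second increase across a window of samples, with a drop in value treated as a counter reset. It is a simplification, because Prometheus also extrapolates to the window boundaries, and the sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, for example after a pod restart.
        increase += (current - previous) if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# 600 tokens generated over 120 seconds.
print(simple_rate([(0, 100), (60, 400), (120, 700)]))
```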

Total tokens generated

Measures the efficiency of generating response tokens, critical for real-time applications.

Query

sum(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"})

Configuring metrics-based autoscaling

Knative-based autoscaling is not available in standard deployment mode. However, you can enable metrics-based autoscaling for an inference service in standard deployment mode. Metrics-based autoscaling helps you efficiently manage accelerator resources, lower operational costs, and ensure that your inference services meet performance requirements.

To set up autoscaling for your inference service in standard deployments, install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then use various model runtime metrics available in OpenShift Monitoring to trigger autoscaling of your inference service, such as KVCache utilization, Time to First Token (TTFT), and Concurrency.

Prerequisites

You have cluster administrator privileges for your OpenShift Container Platform cluster.

You have installed the CMA operator on your cluster. For more information, see Installing the custom metrics autoscaler.

Note	You must configure the `KedaController` resource after installing the CMA operator. The `odh-controller` automatically creates the `TriggerAuthentication`, `ServiceAccount`, `Role`, `RoleBinding`, and `Secret` resources to allow CMA access to OpenShift Monitoring metrics.

You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see Configuring user workload monitoring.
You have deployed a model on the single-model serving platform in standard deployment mode.

Procedure

Log in to the OpenShift Container Platform console as a cluster administrator.
In the Administrator perspective, click Home → Search.
Select the project where you have deployed your model.
From the Resources dropdown menu, select InferenceService.
Click the InferenceService for your deployed model and then click YAML.

Under spec.predictor, define a metric-based autoscaling policy similar to the following example:

kind: InferenceService
metadata:
  name: my-inference-service
  namespace: my-namespace
  annotations:
    serving.kserve.io/autoscalerClass: keda
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoscaling:
      metrics:
        - type: External
          external:
            metric:
              backend: "prometheus"
              serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
              query: vllm:num_requests_waiting
          authenticationRef:
            name: inference-prometheus-auth
          authModes: bearer
          target:
            type: Value
            value: 2

The example configuration sets up the inference service to autoscale between 1 and 5 replicas based on the number of requests waiting to be processed, as indicated by the vllm:num_requests_waiting metric.

Click Save.
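To get a feel for how a policy like this behaves, the following Python sketch applies the general replica formula from the Kubernetes Horizontal Pod Autoscaler documentation, which KEDA relies on for scaling. It assumes that formula applies unchanged here and ignores tolerance and stabilization windows, so treat the results as rough estimates. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, metric_value, target_value,
                     min_replicas, max_replicas):
    """ceil(current * metric / target), clamped to the configured bounds."""
    raw = math.ceil(current_replicas * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# One replica with 6 waiting requests against a target of 2, bounds 1 to 5.
print(desired_replicas(1, 6, 2, 1, 5))
```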

Verification

Confirm that the KEDA ScaledObject resource is created:
```
oc get scaledobject -n <namespace>
```

Optimizing model-serving runtimes

You can optionally enhance the preinstalled model-serving runtimes available in Open Data Hub to leverage additional benefits and capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation.

You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).
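The n-gram variant of speculative decoding can be pictured with a toy example. The following Python sketch is a conceptual illustration only, not vLLM's implementation: it looks for an earlier occurrence of the most recent tokens and proposes the tokens that followed it as a draft. The two parameters loosely mirror the lookup window and the number of speculative tokens, and the function name is hypothetical.

```python
def propose_ngram_tokens(tokens, max_ngram, num_speculative):
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence."""
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Search backwards, excluding the suffix position itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_speculative]
    return []


# The trailing [1, 2] also appears at the start, followed by [3, 4].
print(propose_ngram_tokens([1, 2, 3, 4, 1, 2], max_ngram=2, num_speculative=2))
```

The proposed tokens are only a guess. In speculative decoding, the target model checks them and keeps only the ones it accepts.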

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

The following procedure describes customizing the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure

Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--speculative-model=[ngram]
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
--use-v2-block-manager
```
1. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.
  
  Note
  
  Inferencing throughput varies depending on the model used for speculating with n-grams.
To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--port=8080
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--model=/mnt/models/<path_to_original_model>
--speculative-model=/mnt/models/<path_to_speculative_model>
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--use-v2-block-manager
```
1. Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
2. Replace <NUM_SPECULATIVE_TOKENS> with your own value.
To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--trust-remote-code
```
Note

Only use the --trust-remote-code argument with models from trusted sources.
Click Deploy.

Verification

If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:
```
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>"
```

If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{"model":"<model_name>",
     "messages":
        [{"role":"<role>",
          "content":
             [{"type":"text", "text":"<text>"
              },
              {"type":"image_url", "image_url":"<image_url_link>"
              }
             ]
         }
        ]
    }'

Performance tuning on the single-model serving platform

Certain performance issues might require you to tune the parameters of your inference service or model-serving runtime.

Resolving CUDA out-of-memory errors

In certain cases, depending on the model and hardware accelerator used, the TGIS memory auto-tuning algorithm might underestimate the amount of GPU memory needed to process long sequences. This miscalculation can lead to Compute Unified Device Architecture (CUDA) out-of-memory (OOM) error responses from the model server. In such cases, you must update or add parameters in the TGIS model-serving runtime, as described in the following procedure.

Note	The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see Open Data Hub release notes.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Based on the runtime that you used to deploy your model, perform one of the following actions:
- If you used the preinstalled Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the preinstalled TGIS runtime, see Adding a custom model-serving runtime for the single-model serving platform.
- If you were already using a custom TGIS runtime, click the action menu (⋮) next to the runtime and select Edit.
  
  The embedded YAML editor opens and shows the contents of the custom model-serving runtime.

Add or update the BATCH_SAFETY_MARGIN environment variable and set the value to 30. Similarly, add or update the ESTIMATE_MEMORY_BATCH_SIZE environment variable and set the value to 8.

spec:
  containers:
  - env:
    - name: BATCH_SAFETY_MARGIN
      value: "30"
    - name: ESTIMATE_MEMORY_BATCH_SIZE
      value: "8"

Note

The BATCH_SAFETY_MARGIN parameter sets a percentage of free GPU memory to hold back as a safety margin to avoid OOM conditions. The default value of BATCH_SAFETY_MARGIN is 20. The ESTIMATE_MEMORY_BATCH_SIZE parameter sets the batch size used in the memory auto-tuning algorithm. The default value of ESTIMATE_MEMORY_BATCH_SIZE is 16.

Click Update.

The Serving runtimes page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
To redeploy the model for the parameter updates to take effect, perform the following actions:
1. From the Open Data Hub dashboard, click Models → Model deployments.
2. Find the model you want to redeploy, click the action menu (⋮) next to the model, and select Delete.
3. Redeploy the model as described in Deploying models on the single-model serving platform.
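The effect of raising BATCH_SAFETY_MARGIN can be estimated with simple arithmetic. The following Python sketch assumes only what the note in this procedure states, that the value is a percentage of free GPU memory to hold back. How TGIS applies the margin internally is not described here, and the 40 GiB figure is made up.

```python
def usable_memory_gib(free_gib, safety_margin_pct):
    """Free GPU memory left for batches after holding back the safety margin."""
    return free_gib * (1 - safety_margin_pct / 100)


free = 40.0  # hypothetical free GPU memory in GiB
for margin in (20, 30):  # default value and the value set in this procedure
    print(f"margin {margin}%: about {usable_memory_gib(free, margin):.1f} GiB usable")
```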

Verification

You receive successful responses from the model server and no longer see CUDA OOM errors.

Supported model-serving runtimes

Open Data Hub includes several preinstalled model-serving runtimes. You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. You can also add a custom runtime to support a model.

See Supported configurations for a list of the supported model-serving runtimes and deployment requirements.

For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

Additional resources

Inference endpoints

Tested and verified model-serving runtimes

Tested and verified runtimes are community versions of model-serving runtimes that have been tested and verified against specific versions of Open Data Hub.

Red Hat tests the current version of a tested and verified runtime each time there is a new version of Open Data Hub. If a new version of a tested and verified runtime is released in the middle of an Open Data Hub release cycle, it will be tested and verified in an upcoming release.

See Supported configurations for a list of tested and verified runtimes in Open Data Hub.

Note	Tested and verified runtimes are not directly supported by Red Hat. You are responsible for ensuring that you are licensed to use any tested and verified runtimes that you add, and for correctly configuring and maintaining them.

Additional resources

Inference endpoints

Inference endpoints

These examples show how to use inference endpoints to query the model.

Note	If you enabled token authentication when deploying the model, add the `Authorization` header and specify a token value.

Caikit TGIS ServingRuntime for KServe

:443/api/v1/task/text-generation
:443/api/v1/task/server-streaming-text-generation

Example command

curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' \
https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation \
-H 'Authorization: Bearer <token>'

Caikit Standalone ServingRuntime for KServe

If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.

REST endpoints

/api/v1/task/embedding
/api/v1/task/embedding-tasks
/api/v1/task/sentence-similarity
/api/v1/task/sentence-similarity-tasks
/api/v1/task/rerank
/api/v1/task/rerank-tasks
/info/models
/info/version
/info/runtime

gRPC endpoints

:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
:443 caikit.runtime.info.InfoService/GetModelsInfo
:443 caikit.runtime.info.InfoService/GetRuntimeInfo

Note	By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.

An example manifest is available in the caikit-tgis-serving GitHub repository.

Example command

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

gRPC

grpcurl -d '{"text": "<text>"}' -H "mm-model-id: <model_id>" -H 'Authorization: Bearer <token>' <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict

TGIS Standalone ServingRuntime for KServe

Important

The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see Open Data Hub release notes.

:443 fmaas.GenerationService/Generate
:443 fmaas.GenerationService/GenerateStream

Note

To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the Open Data Hub text-generation-inference repository.

Example command

grpcurl -proto text-generation-inference/proto/generation.proto -d \
'{"requests": [{"text":"<text>"}]}' \
-H 'Authorization: Bearer <token>' \
-insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

OpenVINO Model Server

/v2/models/<model_name>/infer

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
'{ "model_name": "<model_name>",
"inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' \
-H 'Authorization: Bearer <token>'
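
The request body follows the shape shown in the example command: each input carries a name, a shape, a data type, and the tensor data. The following Python sketch builds that body and checks that the number of values matches the declared shape. The function name `build_infer_payload` is hypothetical, and the sketch assumes that the data is passed as a flat, row-major list.

```python
from math import prod


def build_infer_payload(model_name, input_name, shape, datatype, data):
    """Return the request body for POST /v2/models/<model_name>/infer.

    `data` is a flat list; its length must equal the product of `shape`.
    """
    expected = prod(shape)
    if len(data) != expected:
        raise ValueError(f"shape {list(shape)} needs {expected} values, got {len(data)}")
    return {
        "model_name": model_name,
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),
            }
        ],
    }
```

Serialize the returned dictionary with `json.dumps` before sending it as the request body. For example, an input with shape `[2, 2]` requires exactly four values.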

vLLM NVIDIA GPU ServingRuntime for KServe

:443/version
:443/docs
:443/v1/models
:443/v1/chat/completions
:443/v1/completions
:443/v1/embeddings
:443/tokenize
:443/detokenize

Note

The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
To use the embeddings inference endpoint in vLLM, you must use an embeddings model that vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
As of vLLM v0.5.5, you must provide a chat template when querying a model through the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the --chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.
```
containers:
  - args:
      - --chat-template=<CHAT_TEMPLATE>
```
You can use the chat templates that are available as .jinja files here or with the vLLM image under /app/data/template. For more information, see Chat templates.

As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

Example command

curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
-H "Content-Type: application/json" \
-H 'Authorization: Bearer <token>' \
-d '{
"messages": [{
"role": "<role>",
"content": "<content>"
}]
}'
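
Because the vLLM runtime is compatible with the OpenAI REST API, the request and response bodies can be handled with a few lines of code. The following Python sketch builds the `messages` body used in the example command and reads the reply from a response. The function names are hypothetical, the optional `model` argument is an assumption (the example command omits it), and the response layout assumed here is the OpenAI chat completion format.

```python
import json


def build_chat_payload(messages, model=None):
    """Return the JSON body for POST /v1/chat/completions.

    messages: list of {"role": ..., "content": ...} dictionaries.
    model: optional served model name; left out of the body when None.
    """
    for message in messages:
        if not {"role", "content"} <= set(message):
            raise ValueError(f"message needs 'role' and 'content': {message!r}")
    payload = {"messages": list(messages)}
    if model is not None:
        payload["model"] = model
    return json.dumps(payload)


def extract_reply(response_body):
    """Return the first choice's text from an OpenAI-style chat completion."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

Send the string returned by `build_chat_payload` as the request body, with the same `Content-Type` and `Authorization` headers as in the example command.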

vLLM Intel Gaudi Accelerator ServingRuntime for KServe

See vLLM NVIDIA GPU ServingRuntime for KServe.

vLLM AMD GPU ServingRuntime for KServe

See vLLM NVIDIA GPU ServingRuntime for KServe.

NVIDIA Triton Inference Server

REST endpoints

v2/models/<model_name>[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/<model_version>]/ready
v2

Note	ModelMesh does not support the following REST endpoints: `v2/health/live`, `v2/health/ready`, and `v2/models/<model_name>[/versions/<model_version>]/ready`.
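
In the model-specific paths above, the bracketed `/versions/<model_version>` segment is optional. The following Python sketch shows one way to assemble these paths. The function name `model_endpoint` and its parameters are hypothetical and are not part of any runtime API.

```python
from urllib.parse import quote


def model_endpoint(model_name, action="", version=None):
    """Return a v2 REST path for a model.

    action: "" (model metadata), "infer", or "ready".
    version: optional model version; omitted from the path when None.
    """
    if action not in ("", "infer", "ready"):
        raise ValueError(f"unsupported action: {action!r}")
    path = f"v2/models/{quote(model_name, safe='')}"
    if version is not None:
        path += f"/versions/{quote(str(version), safe='')}"
    if action:
        path += f"/{action}"
    return path
```

For example, `model_endpoint("<model_name>", "infer")` produces the path used in the example command, and passing `version` inserts the optional version segment before the action.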

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
'{ "model_name": "<model_name>",
   "inputs":
        [{ "name": "<name_of_model_input>",
           "shape": [<shape>],
           "datatype": "<data_type>",
           "data": [<data>]
         }]}' -H 'Authorization: Bearer <token>'

gRPC endpoints

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt \
        -proto ./grpc_predict_v2.proto \
        -d @ \
        -H "Authorization: Bearer <token>" \
        <inference_endpoint_url>:443 \
        inference.GRPCInferenceService/ModelMetadata

Seldon MLServer

REST endpoints

v2/models/<model_name>[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/<model_version>]/ready
v2

Example command

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
'{ "model_name": "<model_name>",
   "inputs":
        [{ "name": "<name_of_model_input>",
           "shape": [<shape>],
           "datatype": "<data_type>",
           "data": [<data>]
         }]}' -H 'Authorization: Bearer <token>'

gRPC endpoints

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

Example command

grpcurl -cacert ./openshift_ca_istio_knative.crt \
        -proto ./grpc_predict_v2.proto \
        -d @ \
        -H "Authorization: Bearer <token>" \
        <inference_endpoint_url>:443 \
        inference.GRPCInferenceService/ModelMetadata

About the NVIDIA NIM model serving platform

You can deploy models using NVIDIA NIM inference services on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations.

Additional resources

NVIDIA NIM

Enabling the NVIDIA NIM model serving platform

As an Open Data Hub administrator, you can use the Open Data Hub dashboard to enable the NVIDIA NIM model serving platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have enabled the single-model serving platform. You do not need to enable a preinstalled runtime. For more information about enabling the single-model serving platform, see Enabling the single-model serving platform.
The disableNIMModelServing dashboard configuration option is set to false.

For more information about setting dashboard configuration options, see Customizing the dashboard.
You have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
You have an NVIDIA Cloud Account (NCA) and can access the NVIDIA GPU Cloud (NGC) portal. For more information, see NVIDIA GPU Cloud user guide.
Your NCA account is associated with the NVIDIA AI Enterprise Viewer role.
You have generated a personal API key on the NGC portal. For more information, see Generating a Personal API Key.

Procedure

In the left menu of the Open Data Hub dashboard, click Applications → Explore.
On the Explore page, find the NVIDIA NIM tile.
Click Enable on the application tile.
Enter your personal API key and then click Submit.

Verification

The NVIDIA NIM application that you enabled appears on the Enabled page.

Deploying models on the NVIDIA NIM model serving platform

When you have enabled the NVIDIA NIM model serving platform, you can start to deploy NVIDIA-optimized models on the platform.

Prerequisites

You have logged in to Open Data Hub.
You have enabled the NVIDIA NIM model serving platform.
You have created a data science project.
You have enabled support for graphics processing units (GPUs) in Open Data Hub. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform.

Procedure

In the left menu, click Data science projects.

The Data science projects page opens.
Click the name of the project that you want to deploy a model in.

A project details page opens.
Click the Models tab.
In the Models section, perform one of the following actions:
- On the NVIDIA NIM model serving platform tile, click Select NVIDIA NIM on the tile, and then click Deploy model.
- If you have previously selected the NVIDIA NIM model serving type, the Models page displays NVIDIA model serving enabled on the upper-right corner, along with the Deploy model button. To proceed, click Deploy model.
The Deploy model dialog opens.
Configure properties for deploying your model as follows:
1. In the Model deployment name field, enter a unique name for the deployment.
2. From the NVIDIA NIM list, select the NVIDIA NIM model that you want to deploy. For more information, see Supported Models.
3. In the NVIDIA NIM storage size field, specify the size of the cluster storage instance that will be created to store the NVIDIA NIM model.
4. In the Number of model server replicas to deploy field, specify a value.
5. From the Model server size list, select a value.

From the Hardware profile list, select a hardware profile.

Important

Optional: Click Customize resource requests and limit and update the following values:
1. In the CPUs requests field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
2. In the CPU limits field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
3. In the Memory requests field, specify the requested memory for the model server in gibibytes (Gi).
4. In the Memory limits field, specify the maximum memory limit for the model server in gibibytes (Gi).
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
To require token authentication for inference requests to the deployed model, perform the following actions:
1. Select Require token authentication.
2. In the Service account name field, enter the service account name that the token will be generated for.
3. To add an additional service account, click Add a service account and enter another service account name.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.

Customizing model selection options for the NVIDIA NIM model serving platform

The NVIDIA NIM model serving platform provides access to all available NVIDIA NIM models from the NVIDIA GPU Cloud (NGC). You can deploy a NIM model by selecting it from the NVIDIA NIM list in the Deploy model dialog. To customize the models that appear in the list, you can create a ConfigMap object specifying your preferred models.

Prerequisites

You have cluster administrator privileges for your OpenShift Container Platform cluster.
You have an NVIDIA Cloud Account (NCA) and can access the NVIDIA GPU Cloud (NGC) portal.

You know the IDs of the NVIDIA NIM models that you want to make available for selection on the NVIDIA NIM model serving platform.

Note	You can find the model ID from the NGC Catalog. The ID is usually part of the URL path. You can also find the model ID by using the NGC CLI. For more information, see NGC CLI reference.

You know the name and namespace of your Account custom resource (CR).

Procedure

In a terminal window, log in to the OpenShift Container Platform CLI as a cluster administrator as shown in the following example:
```
oc login <openshift_cluster_url> -u <admin_username> -p <password>
```

Define a ConfigMap object in a YAML file, similar to the one in the following example, containing the model IDs that you want to make available for selection on the NVIDIA NIM model serving platform:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-nim-enabled-models
data:
  models: |-
    [
    "mistral-nemo-12b-instruct",
    "llama3-70b-instruct",
    "phind-codellama-34b-v2-instruct",
    "deepseek-r1",
    "qwen-2.5-72b-instruct"
    ]

Confirm the name and namespace of your Account CR:

oc get account -A

You see output similar to the following example:

NAMESPACE                 NAME              TEMPLATE   CONFIGMAP   SECRET
redhat-ods-applications   odh-nim-account

Deploy the ConfigMap object in the same namespace as your Account CR:
```
oc apply -f <configmap-name> -n <namespace>
```
Replace <configmap-name> with the name of your YAML file, and <namespace> with the namespace of your Account CR.
Add the ConfigMap object that you previously created to the spec.modelListConfig section of your Account CR:
```
oc patch account <account-name> \
  --type='merge' \
  -p '{"spec": {"modelListConfig": {"name": "<configmap-name>"}}}'
```
Replace <account-name> with the name of your Account CR, and <configmap-name> with your ConfigMap object.
Confirm that the ConfigMap object is added to your Account CR:
```
oc get account <account-name> -o yaml
```
You see the ConfigMap object in the spec.modelListConfig section of your Account CR, similar to the following output:
```
spec:
 enabledModelsConfig:
 modelListConfig:
  name: <configmap-name>
```
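
The ConfigMap used in this procedure stores the model IDs as a JSON array inside the `models` key. If you maintain the list of IDs elsewhere, the manifest can be generated rather than edited by hand. The following Python sketch is one way to do that. The function names are hypothetical, and the output is written as JSON, which `oc apply -f` accepts in place of YAML.

```python
import json


def nim_models_configmap(model_ids, name="nvidia-nim-enabled-models"):
    """Return a ConfigMap manifest whose `models` key holds a JSON array string."""
    if not model_ids:
        raise ValueError("at least one model ID is required")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        # The value of `models` is a string that contains a JSON array.
        "data": {"models": json.dumps(list(model_ids), indent=2)},
    }


def to_manifest(configmap):
    """Serialize the manifest as JSON text that can be saved to a file."""
    return json.dumps(configmap, indent=2)
```

Save the text returned by `to_manifest` to a file and apply it in the namespace of your Account CR, as described in the procedure.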

Verification

Follow the steps to deploy a model as described in Deploying models on the NVIDIA NIM model serving platform to deploy a NIM model. You see that the NVIDIA NIM list in the Deploy model dialog displays your preferred list of models instead of all the models available in the NGC catalog.

Enabling NVIDIA NIM metrics for an existing NIM deployment

If you have previously deployed a NIM model in Open Data Hub, and then upgraded to the latest version, you must manually enable NIM metrics for your existing deployment by adding annotations to enable metrics collection and graph generation.

Note	NIM metrics and graphs are automatically enabled for new deployments in the latest version of Open Data Hub.

Enabling graph generation for an existing NIM deployment

The following procedure describes how to enable graph generation for an existing NIM deployment.

Prerequisites

You have cluster administrator privileges for your OpenShift Container Platform cluster.
You have downloaded and installed the OpenShift command-line interface (CLI). For more information, see Installing the OpenShift CLI.
You have an existing NIM deployment in Open Data Hub.

Procedure

In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift CLI.
Confirm the name of the ServingRuntime associated with your NIM deployment:
```
oc get servingruntime -n <namespace>
```
Replace <namespace> with the namespace of the project where your NIM model is deployed.
Check for an existing metadata.annotations section in the ServingRuntime configuration:
```
oc get servingruntime -n <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'
```
Replace <servingruntime-name> with the name of the ServingRuntime from the previous step.

Perform one of the following actions:

If the metadata.annotations section is not present in the configuration, add the section with the required annotations:

oc patch servingruntime -n <namespace> <servingruntime-name> --type json --patch \
 '[{"op": "add", "path": "/metadata/annotations", "value": {"runtimes.opendatahub.io/nvidia-nim": "true"}}]'

You see output similar to the following:

servingruntime.serving.kserve.io/nim-serving-runtime patched

If there is an existing metadata.annotations section, add the required annotations to the section:

oc patch servingruntime -n <namespace> <servingruntime-name> --type json --patch \
 '[{"op": "add", "path": "/metadata/annotations/runtimes.opendatahub.io~1nvidia-nim", "value": "true"}]'

You see output similar to the following:

servingruntime.serving.kserve.io/nim-serving-runtime patched
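
The two patch commands differ because of how JSON Patch paths work. A path is a JSON Pointer, in which `/` separates segments, so a `/` inside an annotation key must be written as `~1` (and `~` as `~0`). In addition, an `add` operation on an existing `/metadata/annotations` object would replace the whole object, so when annotations already exist the patch must target the single key instead. The following Python sketch illustrates both rules. The function names are hypothetical.

```python
def escape_pointer_token(key):
    """Escape a key for use in a JSON Pointer: '~' becomes '~0', then '/' becomes '~1'."""
    return key.replace("~", "~0").replace("/", "~1")


def annotation_patch(existing_annotations, key, value, base="/metadata/annotations"):
    """Build a JSON Patch that adds one annotation.

    existing_annotations: the parsed annotations object, or None if it is absent.
    """
    if existing_annotations is None:
        # No annotations object yet: create it with the new key inside.
        return [{"op": "add", "path": base, "value": {key: value}}]
    # Annotations exist: add only the new key so that the others are preserved.
    return [{"op": "add", "path": f"{base}/{escape_pointer_token(key)}", "value": value}]
```

Serializing the returned list with `json.dumps` gives a string that can be passed to the `--patch` option of `oc patch` with `--type json`.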

Verification

Confirm that the annotation has been added to the ServingRuntime of your existing NIM deployment.
```
oc get servingruntime -n <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'
```
The annotation that you added appears in the output:
```
...
"runtimes.opendatahub.io/nvidia-nim": "true"
```
Note

For metrics to be available for graph generation, you must also enable metrics collection for your deployment. See Enabling metrics collection for an existing NIM deployment.

Enabling metrics collection for an existing NIM deployment

To enable metrics collection for your existing NIM deployment, you must manually add the Prometheus endpoint and port annotations to the InferenceService of your deployment.

The following procedure describes how to add the required Prometheus annotations to the InferenceService of your NIM deployment.

Prerequisites

You have cluster administrator privileges for your OpenShift Container Platform cluster.
You have downloaded and installed the OpenShift command-line interface (CLI). For more information, see Installing the OpenShift CLI.
You have an existing NIM deployment in Open Data Hub.

Procedure

In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift CLI.
Confirm the name of the InferenceService associated with your NIM deployment:
```
oc get inferenceservice -n <namespace>
```
Replace <namespace> with the namespace of the project where your NIM model is deployed.
Check if there is an existing spec.predictor.annotations section in the InferenceService configuration:
```
oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'
```
Replace <inferenceservice-name> with the name of the InferenceService from the previous step.

Perform one of the following actions:

If the spec.predictor.annotations section does not exist in the configuration, add the section and required annotations:

oc patch inferenceservice -n <namespace> <inferenceservice-name> --type json --patch \
 '[{"op": "add", "path": "/spec/predictor/annotations", "value": {"prometheus.io/path": "/metrics", "prometheus.io/port": "8000"}}]'

You see output similar to the following:

inferenceservice.serving.kserve.io/nim-serving-runtime patched

If there is an existing spec.predictor.annotations section, add the Prometheus annotations to the section:

oc patch inferenceservice -n <namespace> <inferenceservice-name> --type json --patch \
 '[{"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1path", "value": "/metrics"},
 {"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1port", "value": "8000"}]'

You see output similar to the following:

inferenceservice.serving.kserve.io/nim-serving-runtime patched

Verification

Confirm that the annotations have been added to the InferenceService.

oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'

You see the annotations that you added in the output:

{
  "prometheus.io/path": "/metrics",
  "prometheus.io/port": "8000"
}
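
If you manage several NIM deployments, the check above can be scripted. The following Python sketch takes the text printed by the verification command and reports any required annotation that is missing or has a different value. The names `REQUIRED_ANNOTATIONS` and `missing_annotations` are hypothetical. The sketch assumes that `jq` prints `null` when the annotations section does not exist.

```python
import json

# The two Prometheus annotations that this procedure adds.
REQUIRED_ANNOTATIONS = {
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "8000",
}


def missing_annotations(annotations_json):
    """Return the required annotations that are absent or have another value."""
    annotations = json.loads(annotations_json) or {}
    return {
        key: value
        for key, value in REQUIRED_ANNOTATIONS.items()
        if annotations.get(key) != value
    }
```

An empty dictionary means that both annotations are present with the expected values.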

Viewing NVIDIA NIM metrics for a NIM model

In Open Data Hub, you can observe the following NVIDIA NIM metrics for a NIM model deployed on the NVIDIA NIM model serving platform:

GPU cache usage over time (ms)
Current running, waiting, and max requests count
Tokens count
Time to first token
Time per output token
Request outcomes

You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.

Prerequisites

You have enabled the NVIDIA NIM model serving platform.
You have deployed a NIM model on the NVIDIA NIM model serving platform.
The disableKServeMetrics Open Data Hub dashboard configuration option is set to its default value of false:
```
disableKServeMetrics: false
```
For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

From the Open Data Hub dashboard navigation menu, click Data science projects.

The Data science projects page opens.
Click the name of the project that contains the NIM model that you want to monitor.
In the project details page, click the Models tab.
Click the NIM model that you want to observe.
On the NIM Metrics tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed to show the latest data. You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for NIM metrics.

Verification

The NIM Metrics tab shows graphs of NIM metrics for the deployed NIM model.

Additional resources

NVIDIA NIM observability

Viewing performance metrics for a NIM model

You can observe the following performance metrics for a NIM model deployed on the NVIDIA NIM model serving platform:

Number of requests - The number of requests that have failed or succeeded for a specific model.
Average response time (ms) - The average time it takes a specific model to respond to requests.
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.

Prerequisites

You have enabled the NVIDIA NIM model serving platform.
You have deployed a NIM model on the NVIDIA NIM model serving platform.
The disableKServeMetrics Open Data Hub dashboard configuration option is set to its default value of false:
```
disableKServeMetrics: false
```
For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

From the Open Data Hub dashboard navigation menu, click Data science projects.

The Data science projects page opens.
Click the name of the project that contains the NIM model that you want to monitor.
In the project details page, click the Models tab.
Click the NIM model that you want to observe.
On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed to show the latest data. You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for performance metrics.

Verification

The Endpoint performance tab shows graphs of performance metrics for the deployed NIM model.

Serving models on the multi-model serving platform

For deploying small and medium-sized models, Open Data Hub includes a multi-model serving platform that is based on the ModelMesh component. On the multi-model serving platform, multiple models can be deployed from the same model server and share the server resources.

Important

The multi-model serving platform based on ModelMesh is deprecated. You can continue to deploy models on the multi-model serving platform, but it is recommended that you migrate to the single-model serving platform.

Configuring model servers

Enabling the multi-model serving platform

To use the multi-model serving platform, you must first enable the platform. The multi-model serving platform uses the ModelMesh component.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
The spec.dashboardConfig.disableModelMesh dashboard configuration option is set to false (the default).

For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

In the left menu of the Open Data Hub dashboard, click Settings → Cluster settings.
Locate the Model serving platforms section.
Select the Multi-model serving platform checkbox.
Click Save changes.

Adding a custom model-serving runtime for the multi-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. By default, the multi-model serving platform includes the OpenVINO Model Server runtime. You can also add your own custom runtime if the default runtime does not meet your needs, such as supporting a specific model format.

As an administrator, you can use the Open Data Hub dashboard to add and enable a custom model-serving runtime. You can then choose the custom runtime when you create a new model server for the multi-model serving platform.

Note	Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You are familiar with how to add a model server to your project. When you have added a custom model-serving runtime, you must configure a new model server to use the runtime.
You have reviewed the example runtimes in the kserve/modelmesh-serving repository. You can use these examples as starting points. However, each runtime requires some further modification before you can deploy it in Open Data Hub. The required modifications are described in the following procedure.

Note
Open Data Hub includes the OpenVINO Model Server runtime by default. You do not need to add this runtime to Open Data Hub.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example the OpenVINO Model Server runtime), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Multi-model serving platform.

Note
The multi-model serving platform supports only the REST protocol. Therefore, you cannot change the default value in the Select the API protocol this runtime supports list.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
- Upload a YAML file
  
  Click Upload files.
  
  In the file browser, select a YAML file on your computer. This file might be one of the example runtimes that you downloaded from the kserve/modelmesh-serving repository.
  
  The embedded YAML editor opens and shows the contents of the file that you uploaded.
- Enter YAML code directly in the editor
  
  Click Start from scratch.
  
  Enter or paste YAML code directly in the embedded editor. The YAML that you paste might be copied from one of the example runtimes in the kserve/modelmesh-serving repository.
Optional: If you are adding one of the example runtimes in the kserve/modelmesh-serving repository, perform the following modifications:
1. In the YAML editor, locate the kind field for your runtime. Update the value of this field to ServingRuntime.
2. In the kustomization.yaml file in the kserve/modelmesh-serving repository, take note of the newName and newTag values for the runtime that you want to add. You will specify these values in a later step.
3. In the YAML editor for your custom runtime, locate the containers.image field.
4. Update the value of the containers.image field in the format newName:newTag, based on the values that you previously noted in the kustomization.yaml file. Some examples are shown.
  
  NVIDIA Triton Inference Server
  
  image: nvcr.io/nvidia/tritonserver:23.04-py3
  
  Seldon Python MLServer
  
  image: seldonio/mlserver:1.3.2
  
  TorchServe
  
  image: pytorch/torchserve:0.7.1-cpu
In the metadata.name field, ensure that the value of the runtime you are adding is unique (that is, the value does not match a runtime that you have already added).
Optional: To configure a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-0.x
  annotations:
    openshift.io/display-name: MLServer
```
Note
If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
Click Add.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime you added is automatically enabled.
Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Additional resources

To learn how to configure a model server that uses a custom model-serving runtime that you have added, see Adding a model server to your data science project.

Adding a tested and verified model-serving runtime for the multi-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes such as the NVIDIA Triton Inference Server to support your needs. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Open Data Hub.

You can use the Open Data Hub dashboard to add and enable the NVIDIA Triton Inference Server runtime and then choose the runtime when you create a new model server for the multi-model serving platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You are familiar with how to add a model server to your project. After you have added a tested and verified model-serving runtime, you must configure a new model server to use the runtime.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a tested and verified runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Multi-model serving platform.

Note
The multi-model serving platform supports only the REST protocol. Therefore, you cannot change the default value in the Select the API protocol this runtime supports list.
Click Start from scratch.

Enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    enable-route: "true"
  name: modelmesh-triton
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    opendatahub.io/modelServingSupport: '["multi"]'
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  builtInAdapter:
    env:
      - name: CONTAINER_MEM_REQ_BYTES
        value: "268435456"
      - name: USE_EMBEDDED_PULLER
        value: "true"
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
  containers:
    - args:
        - -c
        - 'mkdir -p /models/_triton_models;  chmod 777
          /models/_triton_models;  exec
          tritonserver "--model-repository=/models/_triton_models" "--model-control-mode=explicit" "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true" "--allow-grpc=true"  '
      command:
        - /bin/sh
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: triton
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  multiModel: true
  protocolVersions:
    - grpc-v2
    - v2
  supportedModelFormats:
    - autoSelect: true
      name: onnx
      version: "1"
    - autoSelect: true
      name: pytorch
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: tensorrt
      version: "7"
    - autoSelect: false
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"

In the metadata.name field, make sure that the name of the runtime you are adding does not match the name of a runtime that you have already added.
Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: modelmesh-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```
Note
If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
Click Create.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime you added is automatically enabled.
Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
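
The byte and millisecond values in the example YAML are easier to review in larger units. The following Python sketch only does the arithmetic for values copied from that YAML. The helper names (`bytes_to_mib`, `mib_to_bytes`) are illustrative and are not part of Open Data Hub or KServe.

```python
# Values copied from the example ServingRuntime YAML in this procedure.
SETTINGS_BYTES = {
    "CONTAINER_MEM_REQ_BYTES": 268435456,
    "memBufferBytes": 134217728,
}
MODEL_LOADING_TIMEOUT_MILLIS = 90000

MIB = 1024 * 1024  # one mebibyte, in bytes


def bytes_to_mib(value: int) -> float:
    """Convert a byte count to mebibytes (MiB)."""
    return value / MIB


def mib_to_bytes(value: int) -> int:
    """Convert mebibytes (MiB) to the byte count that the YAML fields expect."""
    return value * MIB


if __name__ == "__main__":
    for name, value in SETTINGS_BYTES.items():
        print(f"{name}: {value} bytes = {bytes_to_mib(value):g} MiB")
    print(f"modelLoadingTimeoutMillis: {MODEL_LOADING_TIMEOUT_MILLIS / 1000:g} s")
```

In the example YAML, CONTAINER_MEM_REQ_BYTES works out to 256 MiB, memBufferBytes to 128 MiB, and modelLoadingTimeoutMillis to 90 seconds. If you change these values, `mib_to_bytes` gives the byte count to paste back into the editor.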

Additional resources

To learn how to configure a model server that uses a model-serving runtime that you have added, see Adding a model server to your data science project.

Adding a model server for the multi-model serving platform

When you have enabled the multi-model serving platform, you must configure a model server to deploy models. If you require extra computing power for use with large datasets, you can assign accelerators to your model server.

Prerequisites

You have logged in to Open Data Hub.
You have created a data science project that you can add a model server to.
You have enabled the multi-model serving platform.
If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See Adding a custom model-serving runtime.
If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU and AMD GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

Procedure

In the left menu of the Open Data Hub dashboard, click Data science projects.

The Data science projects page opens.
Click the name of the project that you want to configure a model server for.

A project details page opens.
Click the Models tab.
Perform one of the following actions:
- If you see a Multi-model serving platform tile, click Add model server on the tile.
- If you do not see any tiles, click the Add model server button.
The Add model server dialog opens.
In the Model server name field, enter a unique name for the model server.

From the Serving runtime list, select a model-serving runtime that is installed and enabled in your Open Data Hub deployment.

Note
If you are using a custom model-serving runtime with your model server and want to use GPUs, you must ensure that your custom runtime supports GPUs and is appropriately configured to use them.

In the Number of model replicas to deploy field, specify a value.

From the Accelerator profile list, select an accelerator profile.

Important

Optional: Click Customize resource requests and limit and update the following values:
1. In the CPUs requests field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
2. In the CPU limits field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
3. In the Memory requests field, specify the requested memory for the model server in gibibytes (Gi).
4. In the Memory limits field, specify the maximum memory limit for the model server in gibibytes (Gi).
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
Optional: In the Token authentication section, select the Require token authentication checkbox to require token authentication for your model server. To finish configuring token authentication, perform the following actions:
1. In the Service account name field, enter a service account name for which the token will be generated. The generated token is created and displayed in the Token secret field when the model server is configured.
2. To add an additional service account, click Add a service account and enter another service account name.
Click Add.
- The model server that you configured appears on the Models tab for the project, in the Models and model servers list.
Optional: To update the model server, click the action menu (⋮) beside the model server and select Edit model server.
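
Before you enter custom resource values, it can help to check that each request does not exceed its limit, because Kubernetes rejects containers whose requests are higher than their limits. The following Python sketch is one way to do that check offline. The function names are hypothetical, and the sketch does not describe how the dashboard itself validates the fields.

```python
from decimal import Decimal
from typing import List


def cpu_to_millicores(quantity: str) -> int:
    """Parse a CPU value given in cores ("1", "0.5") or millicores ("500m")."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def gib_to_bytes(gib) -> int:
    """Convert a memory value in gibibytes (Gi) to bytes."""
    return int(Decimal(str(gib)) * 1024 ** 3)


def check_resources(cpu_request: str, cpu_limit: str,
                    mem_request_gi, mem_limit_gi) -> List[str]:
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    if cpu_to_millicores(cpu_request) > cpu_to_millicores(cpu_limit):
        problems.append("CPU request exceeds CPU limit")
    if gib_to_bytes(mem_request_gi) > gib_to_bytes(mem_limit_gi):
        problems.append("memory request exceeds memory limit")
    return problems


if __name__ == "__main__":
    # Illustrative values only: 500 millicores requested, 1 core limit, 2 Gi / 4 Gi memory.
    print(check_resources("500m", "1", 2, 4))
```

The sketch compares CPU values in millicores so that a value entered in cores and a value entered in millicores can be compared directly.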

Deleting a model server

When you no longer need a model server to host models, you can remove it from your data science project.

Note
When you remove a model server, you also remove the models that are hosted on that model server. As a result, the models are no longer available to applications.

Prerequisites

You have created a data science project and an associated model server.
You have notified the users of the applications that access the models that the models will no longer be available.

Procedure

From the Open Data Hub dashboard, click Data science projects.

The Data science projects page opens.
Click the name of the project from which you want to delete the model server.

A project details page opens.
Click the Models tab.
Click the action menu (⋮) beside the model server that you want to delete and then click Delete model server.

The Delete model server dialog opens.
Enter the name of the model server in the text field to confirm that you intend to delete it.
Click Delete model server.

Verification

The model server that you deleted is no longer displayed on the Models tab for the project.

Working with deployed models

Deploying a model by using the multi-model serving platform

You can deploy trained models on Open Data Hub so that you can test them and integrate them into intelligent applications. Deploying a model makes it available as a service that you can access by using an API. This enables you to return predictions based on data inputs.

When you have enabled the multi-model serving platform, you can deploy models on the platform.

Prerequisites

You have logged in to Open Data Hub.
You have enabled the multi-model serving platform.
You have created a data science project and added a model server.
You have access to S3-compatible object storage.
For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.

Procedure

In the left menu of the Open Data Hub dashboard, click Data science projects.

The Data science projects page opens.
Click the name of the project that you want to deploy a model in.

A project details page opens.
Click the Models tab.
Click Deploy model.

Configure properties for deploying your model as follows:

In the Model name field, enter a unique name for the model that you are deploying.
From the Model framework list, select a framework for your model.

Note
The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.

To specify the location of the model you want to deploy from S3-compatible object storage, perform one of the following sets of actions:

To use an existing connection

Select Existing connection.
From the Name list, select a connection that you previously defined.

In the Path field, enter the folder path that contains the model in your specified data source.

Note

If you are deploying a registered model version with an existing S3 or URI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, bucket, and region might populate automatically. Matching connections will be labeled as Recommended.

To use a new connection
1. To define a new connection that your model can access, select New connection.
2. In the Add connection modal, select a Connection type. The S3 compatible object storage and URI options are pre-installed connection types. Additional options might be available if your Open Data Hub administrator added them.
  
  The Add connection form opens with fields specific to the connection type that you selected.
3. Complete the connection detail fields.

Optional: Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
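
After the model shows a checkmark, you can query it through its inference endpoint. The following Python sketch builds a REST request in the shape of the KServe v2 inference protocol (the `v2` protocol version listed in the Triton runtime example). The host name, model name, tensor name, shape, and token are placeholders, and the `/v2/models/<model_name>/infer` path is an assumption: if the endpoint shown in the dashboard already includes the path, use that URL as is.

```python
import json
import urllib.request
from typing import Optional


def build_infer_request(base_url: str, model_name: str, inputs: list,
                        token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a v2-style inference request."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    headers = {"Content-Type": "application/json"}
    if token:
        # Needed only if token authentication is enabled for the model server.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Placeholder tensor: the name, shape, and datatype depend on your model.
    example_inputs = [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32",
         "data": [5.1, 3.5, 1.4, 0.2]}
    ]
    request = build_infer_request(
        "https://example-route.apps.example.com", "my-model",
        example_inputs, token="<token>",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To send the request, pass it to urllib.request.urlopen(request).
```

The sketch separates building the request from sending it so that you can inspect the URL and payload before contacting the model server.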

Additional resources

To learn how to monitor your model for bias, see Monitoring data science models.

Viewing a deployed model

To analyze the results of your work, you can view a list of deployed models on Open Data Hub. You can also view the current statuses of deployed models and their endpoints.

Prerequisites

You have logged in to Open Data Hub.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.

The Model deployments page opens.

For each model, the page shows details such as the model name, the project in which the model is deployed, the model-serving runtime that the model uses, and the deployment status.
Optional: For a given model, click the link in the Inference endpoint column to see the inference endpoints for the deployed model.

Verification

A list of previously deployed data science models is displayed on the Model deployments page.

Additional resources

To learn how to monitor your model for bias, see Monitoring data science models.

Updating the deployment properties of a deployed model

You can update the deployment properties of a model that has been deployed previously. For example, you can change the model’s connection and name.

Prerequisites

You have logged in to Open Data Hub.
You have deployed a model on Open Data Hub.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.

The Model deployments page opens.
Click the action menu (⋮) beside the model whose deployment properties you want to update and click Edit.

The Edit model dialog opens.
Update the deployment properties of the model as follows:
1. In the Model name field, enter a new, unique name for your model.
2. From the Model servers list, select a model server for your model.
3. From the Model framework list, select a framework for your model.
  
  Note
  The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.
4. Optional: Update the connection by specifying an existing connection or by creating a new connection.
5. Click Redeploy.

Verification

The model whose deployment properties you updated is displayed on the Model deployments page of the dashboard.

Deleting a deployed model

You can delete a previously deployed model when you no longer require it.

Prerequisites

You have logged in to Open Data Hub.
You have deployed a model.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.

The Model deployments page opens.
Click the action menu (⋮) beside the deployed model that you want to delete and click Delete.

The Delete deployed model dialog opens.
Enter the name of the deployed model in the text field to confirm that you intend to delete it.
Click Delete deployed model.

Verification

The model that you deleted is no longer displayed on the Model deployments page.

Configuring monitoring for the multi-model serving platform

The multi-model serving platform includes model and model server metrics for the ModelMesh component. ModelMesh generates its own set of metrics and does not rely on the underlying model-serving runtimes to provide them. The set of metrics that ModelMesh generates includes metrics for model request rates and timings, model loading and unloading rates, times, and sizes, internal queuing delays, capacity and usage, cache state, and least recently used models. For more information, see ModelMesh metrics.

After you have configured monitoring, you can view metrics for the ModelMesh component.

Prerequisites

You have cluster administrator privileges for your OpenShift Container Platform cluster.
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.
You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.
You have assigned the monitoring-rules-view role to users who will monitor metrics.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: debug
      retention: 15d
```
The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.
Apply the configuration to create the user-workload-monitoring-config object.
```
$ oc apply -f uwm-cm-conf.yaml
```
Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```
The cluster-monitoring-config object enables monitoring for user-defined projects.
Apply the configuration to create the cluster-monitoring-config object.
```
$ oc apply -f uwm-cm-enable.yaml
```

Viewing model-serving runtime metrics for the multi-model serving platform

After a cluster administrator has configured monitoring for the multi-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the ModelMesh component.

Procedure

Log in to the OpenShift Container Platform web console.
Switch to the Developer perspective.
In the left menu, click Observe.
As described in Monitoring your project metrics, use the web console to run queries for modelmesh_* metrics.
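
As a starting point for those queries, the following Python sketch assembles a PromQL expression that sums the per-second rate of a counter-type modelmesh_* metric in one project. The helper name is hypothetical, the metric name in the usage example is an assumption that you should check against the metrics listed for your cluster, and the `namespace` label is assumed to be present on user workload metrics.

```python
def modelmesh_rate_query(metric: str, namespace: str, window: str = "5m") -> str:
    """Build a PromQL expression for the summed rate of a ModelMesh counter."""
    if not metric.startswith("modelmesh_"):
        raise ValueError("expected a modelmesh_* metric name")
    # Doubled braces produce the literal braces of the PromQL label selector.
    return f'sum(rate({metric}{{namespace="{namespace}"}}[{window}]))'


if __name__ == "__main__":
    # Assumed metric name; replace it with one that your cluster exposes.
    print(modelmesh_rate_query("modelmesh_api_request_milliseconds_count", "my-project"))
```

`rate()` is meaningful only for counters. For gauge-type metrics, query the metric directly instead of wrapping it in `rate()`.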

Monitoring model performance

In the multi-model serving platform, you can view performance metrics for all models deployed on a model server and for a specific model that is deployed on the model server.

Viewing performance metrics for all models on a model server

You can monitor the following metrics for all the models that are deployed on a model server:

HTTP requests per 5 minutes - The number of HTTP requests that have failed or succeeded for all models on the server.
Average response time (ms) - For all models on the server, the average time it takes the model server to respond to requests.
CPU utilization (%) - The percentage of the CPU’s capacity that is currently being used by all models on the server.
Memory utilization (%) - The percentage of the system’s memory that is currently being used by all models on the server.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the models are performing at a specified time.

Prerequisites

You have installed Open Data Hub.
On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.
You have logged in to Open Data Hub.
You have deployed models on the multi-model serving platform.

Procedure

From the Open Data Hub dashboard navigation menu, click Data science projects.

The Data science projects page opens.
Click the name of the project that contains the data science models that you want to monitor.
In the project details page, click the Models tab.
In the row for the model server that you are interested in, click the action menu (⋮) and then select View model server metrics.
Optional: On the metrics page for the model server, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
Scroll down to view data graphs for HTTP requests per 5 minutes, average response time, CPU utilization, and memory utilization.

Verification

On the metrics page for the model server, the graphs provide data on performance metrics.

Viewing HTTP request metrics for a deployed model

You can view a graph that illustrates the HTTP requests that have failed or succeeded for a specific model that is deployed on the multi-model serving platform.

Prerequisites

You have installed Open Data Hub.
On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.
The following dashboard configuration options are set to the default values as shown:
```
disablePerformanceMetrics: false
disableKServeMetrics: false
```
For more information about setting dashboard configuration options, see Customizing the dashboard.
You have logged in to Open Data Hub.
You have deployed models on the multi-model serving platform.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.
On the Model deployments page, select the model that you are interested in.
Optional: On the Endpoint performance tab, set the following options:
- Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
- Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.

Verification

The Endpoint performance tab shows a graph of the HTTP metrics for the model.
