Info alert:Important Notice

Please note that more information about the previous v2 releases can be found here. You can use "Find a release" search bar to search for a particular release.

Configuring your model-serving platform

Table of Contents

About model-serving platforms
Configuring model servers on the single-model serving platform
Configuring model servers on the NVIDIA NIM model serving platform
- Enabling the NVIDIA NIM model serving platform
Configuring model servers on the multi-model serving platform
Customizing model deployments

About model-serving platforms

As an Open Data Hub administrator, you can enable your preferred serving platform and make it available for serving models. You can also add a custom or a tested and verified model-serving runtime.

About model serving

When you serve a model, you upload a trained model into Open Data Hub for querying, which allows you to integrate your trained models into intelligent applications.

You can upload a model to an S3-compatible object storage, persistent volume claim, or Open Container Initiative (OCI) image. You can then access and train the model from your project workbench. After training the model, you can serve or deploy the model using a model-serving platform.

Serving or deploying the model makes the model available as a service, or model runtime server, that you can access using an API. You can then access the inference endpoints for the deployed model from the dashboard and see predictions based on data inputs that you provide through API calls. Querying the model through the API is also called model inferencing.

You can serve models on one of the following model-serving platforms:

Single-model serving platform
Multi-model serving platform
NVIDIA NIM model serving platform

The model-serving platform that you choose depends on your business needs:

If you want to deploy each model on its own runtime server, or want to use a serverless deployment, select the single-model serving platform. The single-model serving platform is recommended for production use.
If you want to deploy multiple models with only one runtime server, select the multi-model serving platform. This option is best if you are deploying more than 1,000 small and medium models and want to reduce resource consumption.
If you want to use NVIDIA Inference Microservices (NIMs) to deploy a model, select the NVIDIA NIM-model serving platform.

Single-model serving platform

You can deploy each model from a dedicated model server on the single-model serving platform. Deploying models from a dedicated model server can help you deploy, monitor, scale, and maintain models that require increased resources. This model serving platform is ideal for serving large models. The single-model serving platform is based on the KServe component.

The single-model serving platform is helpful for use cases such as:

Large language models (LLMs)
Generative AI

Multi-model serving platform

You can deploy multiple models from the same model server on the multi-model serving platform. Each of the deployed models shares the server resources. Deploying multiple models from the same model server can be advantageous on OpenShift clusters that have finite compute resources or pods. This model serving platform is ideal for serving small and medium models in large quantities. The multi-model serving platform is based on the ModelMesh component.

NVIDIA NIM model serving platform

You can deploy models using NVIDIA Inference Microservices (NIM) on the NVIDIA NIM model serving platform.

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high performance AI model inferencing across clouds, data centers and workstations.

NVIDIA NIM inference services are helpful for use cases such as:

Using GPU-accelerated containers inferencing models optimized by NVIDIA
Deploying generative AI for virtual screening, content generation, and avatar creation

Model-serving runtimes

You can serve models on the single-model serving platform by using model-serving runtimes. The configuration of a model-serving runtime is defined by the ServingRuntime and InferenceService custom resource definitions (CRDs).

ServingRuntime

The ServingRuntime CRD creates a serving runtime, an environment for deploying and managing a model. It creates the templates for pods that dynamically load and unload models of various formats and also exposes a service endpoint for inferencing requests.

The following YAML configuration is an example of the vLLM ServingRuntime for KServe model-serving runtime. The configuration includes various flags, environment variables and command-line arguments.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' (1)
    openshift.io/display-name: vLLM ServingRuntime for KServe (2)
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
  namespace: <namespace>
spec:
  annotations:
    prometheus.io/path: /metrics (3)
    prometheus.io/port: "8080" (4)
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models (5)
        - --served-model-name={{.Name}} (6)
      command: (7)
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21 (8)
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false (9)
  supportedModelFormats: (10)
    - autoSelect: true
      name: vLLM

The recommended accelerator to use with the runtime.
The name with which the serving runtime is displayed.
The endpoint used by Prometheus to scrape metrics for monitoring.
The port used by Prometheus to scrape metrics for monitoring.
The path to where the model files are stored in the runtime container.
Passes the model name that is specified by the {{.Name}} template variable inside the runtime container specification to the runtime environment. The {{.Name}} variable maps to the spec.predictor.name field in the InferenceService metadata object.
The entrypoint command that starts the runtime container.
The runtime container image used by the serving runtime. This image differs depending on the type of accelerator used.
Specifies that the runtime is used for single-model serving.
Specifies the model formats supported by the runtime.

InferenceService

The InferenceService CRD creates a server or inference service that processes inference queries, passes it to the model, and then returns the inference output.

The inference service also performs the following actions:

Specifies the location and format of the model.
Specifies the serving runtime used to serve the model.
Enables the passthrough route for gRPC or REST inference.
Defines HTTP or gRPC endpoints for the deployed model.

The following example shows the InferenceService YAML configuration file that is generated when deploying a granite model with the vLLM runtime:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vllm-runtime
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

Additional resources

KServe Serving Runtimes

Model-serving runtimes for accelerators

Open Data Hub provides support for accelerators through preinstalled model-serving runtimes.

NVIDIA GPUs

You can serve models with NVIDIA graphics processing units (GPUs) by using the vLLM NVIDIA GPU ServingRuntime for KServe runtime. To use the runtime, you must enable GPU support in Open Data Hub. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

Intel Gaudi accelerators

You can serve models with Intel Gaudi accelerators by using the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime. To use the runtime, you must enable hybrid processing (HPU) support in Open Data Hub. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation and Working with hardware profiles.

For information about recommended vLLM parameters, environment variables, supported configurations and more, see vLLM with Intel® Gaudi® AI Accelerators.

Note

Warm-up is a model initialization and performance optimization step that is useful for reducing cold-start delays and first-inference latency. Depending on the model size, warm-up can lead to longer model loading times.

While highly recommended in production environments to avoid performance limitations, you can choose to skip warm-up for non-production environments to reduce model loading times and accelerate model development and testing cycles. To skip warm-up, follow the steps described in Customizing the parameters of a deployed model-serving runtime to add the following environment variable in the Configuration parameters section of your model deployment:

`VLLM_SKIP_WARMUP="true"`

AMD GPUs

You can serve models with AMD GPUs by using the vLLM AMD GPU ServingRuntime for KServe runtime. To use the runtime, you must enable support for AMD graphic processing units (GPUs) in Open Data Hub. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.

IBM Spyre AI accelerators on x86

You can serve models with IBM Spyre AI accelerators on x86 by using the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime. To use the runtime, you must install the Spyre Operator and configure a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.

Additional resources

Supported model-serving runtimes

Supported model-serving runtimes

Open Data Hub includes several preinstalled model-serving runtimes. You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. You can also add a custom runtime to support a model.

See Supported configurations for a list of the supported model-serving runtimes and deployment requirements.

For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.

Additional resources

Inference endpoints

Tested and verified model-serving runtimes

Tested and verified runtimes are community versions of model-serving runtimes that have been tested and verified against specific versions of Open Data Hub.

Red Hat tests the current version of a tested and verified runtime each time there is a new version of Open Data Hub. If a new version of a tested and verified runtime is released in the middle of an Open Data Hub release cycle, it will be tested and verified in an upcoming release.

See Supported configurations for a list of tested and verified runtimes in Open Data Hub.

Note	Tested and verified runtimes are not directly supported by Red Hat. You are responsible for ensuring that you are licensed to use any tested and verified runtimes that you add, and for correctly configuring and maintaining them.

Additional resources

Inference endpoints

Configuring model servers on the single-model serving platform

On the single-model serving platform, you configure model servers by using model-serving runtimes. A model-serving runtime adds support for a specified set of model frameworks and the model formats that they support.

Enabling the single-model serving platform

When you have installed KServe, you can use the Open Data Hub dashboard to enable the single-model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have installed KServe.
The spec.dashboardConfig.disableKServe dashboard configuration option is set to false (the default).

For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

Enable the single-model serving platform as follows:
1. In the left menu, click Settings → Cluster settings.
2. Locate the Model serving platforms section.
3. To enable the single-model serving platform for projects, select the Single-model serving platform checkbox.
4. Select KServe RawDeployment or Knative Serverless deployment mode.
  
  For more information about these deployment mode options, see About KServe deployment modes.
5. Click Save changes.
Enable preinstalled runtimes for the single-model serving platform as follows:
1. In the left menu of the Open Data Hub dashboard, click Settings → Serving runtimes.
  
  The Serving runtimes page shows preinstalled runtimes and any custom runtimes that you have added.
  
  For more information about preinstalled runtimes, see Supported runtimes.
2. Set the runtime that you want to use to Enabled.
  
  The single-model serving platform is now available for model deployments.

You can configure the vLLM NVIDIA GPU ServingRuntime for KServe runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

The following procedure describes customizing the vLLM NVIDIA GPU ServingRuntime for KServe runtime for speculative decoding and multi-modal inferencing.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Procedure

Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
In the Serving runtime field, select the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--speculative-model=[ngram]
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
--use-v2-block-manager
```
1. Replace <NUM_SPECULATIVE_TOKENS> and <NGRAM_PROMPT_LOOKUP_MAX> with your own values.
  
  Note
  
  Inferencing throughput varies depending on the model used for speculating with n-grams.
To configure the vLLM model-serving runtime for speculative decoding with a draft model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--port=8080
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--model=/mnt/models/<path_to_original_model>
--speculative-model=/mnt/models/<path_to_speculative_model>
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--use-v2-block-manager
```
1. Replace <path_to_speculative_model> and <path_to_original_model> with the paths to the speculative model and original model on your S3-compatible object storage.
2. Replace <NUM_SPECULATIVE_TOKENS> with your own value.
To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--trust-remote-code
```
Note

Only use the --trust-remote-code argument with models from trusted sources.
Click Deploy.

Verification

If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:
```
curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
```

If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
-d '{"model":"<model_name>",
     "messages":
        [{"role":"<role>",
          "content":
             [{"type":"text", "text":"<text>"
              },
              {"type":"image_url", "image_url":"<image_url_link>"
              }
             ]
         }
        ]
    }'

Additional resources

Adding a custom model-serving runtime for the single-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with Open Data Hub. You can also add your own custom runtimes if the default runtimes do not meet your needs.

As an administrator, you can use the Open Data Hub interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.

Note	Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
- Upload a YAML file
  
  Click Upload files.
  
  In the file browser, select a YAML file on your computer.
  
  The embedded YAML editor opens and shows the contents of the file that you uploaded.
- Enter YAML code directly in the editor
  
  Click Start from scratch.
  
  Enter or paste YAML code directly in the embedded editor.
Note
In many cases, creating a custom runtime will require adding new or custom parameters to the env section of the ServingRuntime specification.
Click Add.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Adding a tested and verified model-serving runtime for the single-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes to support your requirements. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Open Data Hub.

You can use the Open Data Hub dashboard to add and enable tested and verified runtimes for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
If you are deploying the IBM Z Accelerated for NVIDIA Triton Inference Server runtime, you have access to IBM Cloud Container Registry to pull the container image. For more information about obtaining credentials to the IBM Cloud Container Registry, see Downloading the IBM Z Accelerated for NVIDIA Triton Inference Server container image.
If you are deploying the IBM Power Accelerated Triton Inference Server runtime, you can access the container image from the Triton Inference Server Quay repository.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
Click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
In the Select the API protocol this runtime supports list, select REST or gRPC.
Click Start from scratch.

Follow these steps to add the IBM Power Accelerated for NVIDIA Triton Inference Server runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-ppc64le-runtime
  annotations:
    openshift.io/display-name: Triton Server ServingRuntime for KServe(ppc64le)
spec:
  supportedModelFormats:
    - name: python
    - name: onnx
      autoSelect: true
  multiModel: false
  containers:
    - command:
        - tritonserver
        - --model-repository=/mnt/models
      name: kserve-container
      image: quay.io/powercloud/tritonserver:latest
      resources:
        requests:
          cpu: 2
          memory: 8Gi
        limits:
          cpu: 2
          memory: 8Gi
      ports:
        - containerPort: 8000

Follow these steps to add the IBM Z Accelerated for NVIDIA Triton Inference Server runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ibmz-triton-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  containers:
    - name: kserve-container
      command:
        - /bin/sh
        - -c
      args:
        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
      ports:
        - containerPort: 8000
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: onnx-mlir
      version: "1"
      autoSelect: true
    - name: snapml
      version: "1"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ibmz-triton-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  containers:
    - name: kserve-container
      command:
        - /bin/sh
        - -c
      args:
        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --grpc-port=8001 --http-port=8000 --metrics-port=8002
      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:<version>
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
      ports:
        - containerPort: 8001
          name: grpc
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: onnx-mlir
      version: "1"
      autoSelect: true
    - name: snapml
      version: "1"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true
  volumes:
    - emptyDir: null
      medium: Memory
      sizeLimit: 2Gi
      name: shm

Follow these steps to add the NVIDIA Triton Inference Server runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: kserve-container
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: tensorrt
      version: "8"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: onnx
      version: "1"
    - name: pytorch
      version: "1"
    - autoSelect: true
      name: triton
      version: "2"
    - autoSelect: true
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"
  volumes:
    - name: shm
      emptyDir: null
        medium: Memory
        sizeLimit: 2Gi

Follow these steps to add the Seldon MLServer runtime:

If you selected the REST API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 8080
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

If you selected the gRPC API protocol, enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-kserve-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    openshift.io/display-name: Seldon MLServer
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: /metrics
  containers:
    - name: kserve-container
      image: 'docker.io/seldonio/mlserver@sha256:07890828601515d48c0fb73842aaf197cbcf245a5c855c789e890282b15ce390'
      env:
        - name: MLSERVER_HTTP_PORT
          value: "8080"
        - name: MLSERVER_GRPC_PORT
          value: "9000"
        - name: MODELS_DIR
          value: /mnt/models
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
  protocolVersions:
    - v2
  multiModel: false
  supportedModelFormats:
    - name: sklearn
      version: "0"
      autoSelect: true
      priority: 2
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "1"
      autoSelect: true
      priority: 2
    - name: xgboost
      version: "2"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "3"
      autoSelect: true
      priority: 2
    - name: lightgbm
      version: "4"
      autoSelect: true
      priority: 2
    - name: mlflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: mlflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: catboost
      version: "1"
      autoSelect: true
      priority: 1
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1

In the metadata.name field, make sure that the value of the runtime you are adding does not match a runtime that you have already added.
Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```
Note
If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
Click Create.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Additional resources

Tested and verified model-serving runtimes

Configuring model servers on the NVIDIA NIM model serving platform

You configure and create a model server on the NVIDIA NIM model serving platform when you deploy an NVIDIA-optimized model. During the deployment process, you select a specific NIM from the available list and configure its properties, such as the number of replicas, server size, and the hardware profile.

Enabling the NVIDIA NIM model serving platform

As an Open Data Hub administrator, you can use the Open Data Hub dashboard to enable the NVIDIA NIM model serving platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have enabled the single-model serving platform. You do not need to enable a preinstalled runtime. For more information about enabling the single-model serving platform, see Enabling the single-model serving platform.
The disableNIMModelServing dashboard configuration option is set to false.

For more information about setting dashboard configuration options, see Customizing the dashboard.
You have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
You have an NVIDIA Cloud Account (NCA) and can access the NVIDIA GPU Cloud (NGC) portal. For more information, see NVIDIA GPU Cloud user guide.
Your NCA account is associated with the NVIDIA AI Enterprise Viewer role.
You have generated a personal API key on the NGC portal. For more information, see Generating a Personal API Key.

Procedure

In the left menu of the Open Data Hub dashboard, click Applications → Explore.
On the Explore page, find the NVIDIA NIM tile.
Click Enable on the application tile.
Enter your personal API key and then click Submit.

Verification

The NVIDIA NIM application that you enabled is displayed on the Enabled page.

Configuring model servers on the multi-model serving platform

On the multi-model serving platform, you configure model servers for your data science project before you deploy models. A model server can host multiple models, which share the server’s resources.

Enabling the multi-model serving platform

To use the multi-model serving platform, you must first enable the platform. The multi-model serving platform uses the ModelMesh component.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
The spec.dashboardConfig.disableModelMesh dashboard configuration option is set to false (the default).

For more information about setting dashboard configuration options, see Customizing the dashboard.

Procedure

In the left menu of the Open Data Hub dashboard, click Settings → Cluster settings.
Locate the Model serving platforms section.
Select the Multi-model serving platform checkbox.
Click Save changes.

Adding a custom model-serving runtime for the multi-model serving platform

A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. By default, the multi-model serving platform includes the OpenVINO Model Server runtime. You can also add your own custom runtime if the default runtime does not meet your needs, such as supporting a specific model format.

As an administrator, you can use the Open Data Hub dashboard to add and enable a custom model-serving runtime. You can then choose the custom runtime when you create a new model server for the multi-model serving platform.

Note	Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You are familiar with how to add a model server to your project. When you have added a custom model-serving runtime, you must configure a new model server to use the runtime.
You have reviewed the example runtimes in the kserve/modelmesh-serving repository. You can use these examples as starting points. However, each runtime requires some further modification before you can deploy it in Open Data Hub. The required modifications are described in the following procedure.

Note
Open Data Hub includes the OpenVINO Model Server runtime by default. You do not need to add this runtime to Open Data Hub.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a custom runtime, choose one of the following options:
- To start with an existing runtime (for example the OpenVINO Model Server runtime), click the action menu (⋮) next to the existing runtime and then click Duplicate.
- To add a new custom runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Multi-model serving platform.

Note
The multi-model serving platform supports only the REST protocol. Therefore, you cannot change the default value in the Select the API protocol this runtime supports list.
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
- Upload a YAML file
  
  Click Upload files.
  
  In the file browser, select a YAML file on your computer. This file might be the one of the example runtimes that you downloaded from the kserve/modelmesh-serving repository.
  
  The embedded YAML editor opens and shows the contents of the file that you uploaded.
- Enter YAML code directly in the editor
  
  Click Start from scratch.
  
  Enter or paste YAML code directly in the embedded editor. The YAML that you paste might be copied from one of the example runtimes in the kserve/modelmesh-serving repository.
Optional: If you are adding one of the example runtimes in the kserve/modelmesh-serving repository, perform the following modifications:
1. In the YAML editor, locate the kind field for your runtime. Update the value of this field to ServingRuntime.
2. In the kustomization.yaml file in the kserve/modelmesh-serving repository, take note of the newName and newTag values for the runtime that you want to add. You will specify these values in a later step.
3. In the YAML editor for your custom runtime, locate the containers.image field.
4. Update the value of the containers.image field in the format newName:newTag, based on the values that you previously noted in the kustomization.yaml file. Some examples are shown.
  
  Nvidia Triton Inference Server
  
  image: nvcr.io/nvidia/tritonserver:23.04-py3
  
  Seldon Python MLServer
  
  image: seldonio/mlserver:1.3.2
  
  TorchServe
  
  image: pytorch/torchserve:0.7.1-cpu
In the metadata.name field, ensure that the value of the runtime you are adding is unique (that is, the value does not match a runtime that you have already added).
Optional: To configure a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-0.x
  annotations:
    openshift.io/display-name: MLServer
```
Note
If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
Click Add.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime you added is automatically enabled.
Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification

The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Additional resources

To learn how to configure a model server that uses a custom model-serving runtime that you have added, see Adding a model server to your data science project.

Adding a tested and verified model-serving runtime for the multi-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use Red Hat tested and verified model-serving runtimes such as the NVIDIA Triton Inference Server to support your needs. For more information about Red Hat tested and verified runtimes, see Tested and verified runtimes for Open Data Hub.

You can use the Open Data Hub dashboard to add and enable the NVIDIA Triton Inference Server runtime and then choose the runtime when you create a new model server for the multi-model serving platform.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You are familiar with how to add a model server to your project. After you have added a tested and verified model-serving runtime, you must configure a new model server to use the runtime.

Procedure

From the Open Data Hub dashboard, click Settings → Serving runtimes.

The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
To add a tested and verified runtime, click Add serving runtime.
In the Select the model serving platforms this runtime supports list, select Multi-model serving platform.

Note
The multi-model serving platform supports only the REST protocol. Therefore, you cannot change the default value in the Select the API protocol this runtime supports list.
Click Start from scratch.

Enter or paste the following YAML code directly in the embedded editor.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    enable-route: "true"
  name: modelmesh-triton
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    opendatahub.io/modelServingSupport: '["multi"x`x`]'
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  builtInAdapter:
    env:
      - name: CONTAINER_MEM_REQ_BYTES
        value: "268435456"
      - name: USE_EMBEDDED_PULLER
        value: "true"
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
  containers:
    - args:
        - -c
        - 'mkdir -p /models/_triton_models;  chmod 777
          /models/_triton_models;  exec
          tritonserver "--model-repository=/models/_triton_models" "--model-control-mode=explicit" "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true" "--allow-grpc=true"  '
      command:
        - /bin/sh
      image: nvcr.io/nvidia/tritonserver@sha256:xxxxx
      name: triton
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  multiModel: true
  protocolVersions:
    - grpc-v2
    - v2
  supportedModelFormats:
    - autoSelect: true
      name: onnx
      version: "1"
    - autoSelect: true
      name: pytorch
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: tensorrt
      version: "7"
    - autoSelect: false
      name: xgboost
      version: "1"
    - autoSelect: true
      name: python
      version: "1"

In the metadata.name field, make sure that the value of the runtime you are adding does not match a runtime that you have already added).
Optional: To use a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:
```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: modelmesh-triton
  annotations:
    openshift.io/display-name: Triton ServingRuntime
```
Note
If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
Click Create.

The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime you added is automatically enabled.
Optional: To edit the runtime, click the action menu (⋮) and select Edit.

Verification

The model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Additional resources

To learn how to configure a model server that uses a model-serving runtime that you have added, see Adding a model server to your data science project.

Customizing model deployments

You can customize a model’s deployment on the single-model serving platform to suit your specific needs, for example, to deploy a particular family of models or to enhance an existing deployment. You can modify the runtime configuration for a specific deployment by setting additional serving runtime arguments and environment variables.

These customizations apply only to the selected model deployment and do not change the default runtime configuration. You can set these parameters when you first deploy a model or by editing an existing deployment.

Customizing the parameters of a deployed model-serving runtime

You might need additional parameters beyond the default ones to deploy specific models or to enhance an existing model deployment. In such cases, you can modify the parameters of an existing runtime to suit your deployment needs.

Note	Customizing the parameters of a runtime only affects the selected model deployment.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
You have deployed a model on the single-model serving platform.

Procedure

From the Open Data Hub dashboard, click Models → Model deployments.

The Model deployments page opens.
Click Stop next to the name of the model you want to customize.
Click the action menu (⋮) and select Edit.

The Configuration parameters section shows predefined serving runtime parameters, if any are available.
Customize the runtime parameters in the Configuration parameters section:
1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
2. Modify the values in Additional environment variables to define variables in the model’s environment.
  
  Note
  Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
After you are done customizing the runtime parameters, click Redeploy to save.
Click Start to deploy the model with your changes.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
Confirm that the arguments and variables that you set appear in spec.predictor.model.args and spec.predictor.model.env by one of the following methods:
- Checking the InferenceService YAML from the OpenShift Container Platform Console.
- Using the following command in the OpenShift Container Platform CLI:
  oc get -o json inferenceservice <inferenceservicename/modelname> -n <projectname>

Customizable model serving runtime parameters

You can modify the parameters of an existing model serving runtime to suit your deployment needs.

For more information about parameters for each of the supported serving runtimes, see the following table:

Serving runtime	Resource
Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe	Caikit NLP: Configuration TGIS: Model configuration
Caikit Standalone ServingRuntime for KServe	Caikit NLP: Configuration
NVIDIA Triton Inference Server	NVIDIA Triton Inference Server: Model Parameters
OpenVINO Model Server	OpenVINO Model Server Features: Dynamic Input Parameters
Seldon MLServer	MLServer Documentation: Model Settings
[Deprecated] Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe	TGIS: Model configuration
vLLM NVIDIA GPU ServingRuntime for KServe	vLLM: Engine Arguments OpenAI-Compatible Server
vLLM AMD GPU ServingRuntime for KServe	vLLM: Engine Arguments OpenAI-Compatible Server
vLLM Intel Gaudi Accelerator ServingRuntime for KServe	vLLM: Engine Arguments OpenAI-Compatible Server

Serving runtime

Resource

Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe

Caikit NLP: Configuration
TGIS: Model configuration

Caikit Standalone ServingRuntime for KServe

Caikit NLP: Configuration

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server: Model Parameters

OpenVINO Model Server

OpenVINO Model Server Features: Dynamic Input Parameters

Seldon MLServer

MLServer Documentation: Model Settings

[Deprecated] Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe

TGIS: Model configuration

vLLM NVIDIA GPU ServingRuntime for KServe

vLLM: Engine Arguments
OpenAI-Compatible Server

vLLM AMD GPU ServingRuntime for KServe

vLLM: Engine Arguments
OpenAI-Compatible Server

vLLM Intel Gaudi Accelerator ServingRuntime for KServe

vLLM: Engine Arguments
OpenAI-Compatible Server

Additional resources

Customizing the parameters of a deployed model serving runtime

Customizing the vLLM model-serving runtime

In certain cases, you may need to add additional flags or environment variables to the vLLM ServingRuntime for KServe runtime to deploy a family of LLMs.

The following procedure describes customizing the vLLM model-serving runtime to deploy a Llama, Granite or Mistral model.

Prerequisites

You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
For Llama model deployment, you have downloaded a meta-llama-3 model to your object storage.
For Granite model deployment, you have downloaded a granite-7b-instruct or granite-20B-code-instruct model to your object storage.
For Mistral model deployment, you have downloaded a mistral-7B-Instruct-v0.3 model to your object storage.
You have enabled the vLLM ServingRuntime for KServe runtime.
You have enabled GPU support in Open Data Hub and have installed and configured the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs

Procedure

Follow the steps to deploy a model as described in Deploying models on the single-model serving platform.
In the Serving runtime field, select vLLM ServingRuntime for KServe.
If you are deploying a meta-llama-3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
–-distributed-executor-backend=mp (1)
--max-model-len=6144 (2)
```
1. Sets the backend to multiprocessing for distributed model workers
2. Sets the maximum context length of the model to 6144 tokens
If you are deploying a granite-7B-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp (1)
```
1. Sets the backend to multiprocessing for distributed model workers
If you are deploying a granite-20B-code-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp (1)
–-tensor-parallel-size=4 (2)
--max-model-len=6448 (3)
```
1. Sets the backend to multiprocessing for distributed model workers
2. Distributes inference across 4 GPUs in a single node
3. Sets the maximum context length of the model to 6448 tokens
If you are deploying a mistral-7B-Instruct-v0.3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:
```
--distributed-executor-backend=mp (1)
--max-model-len=15344 (2)
```
1. Sets the backend to multiprocessing for distributed model workers
2. Sets the maximum context length of the model to 15344 tokens
Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.

For granite models, use the following example command to verify API requests to your deployed model:

curl -q -X 'POST' \
    "https://<inference_endpoint_url>:443/v1/chat/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{
    \"model\": \"<model_name>\",
    \"prompt\": \"<prompt>",
    \"max_tokens\": <max_tokens>,
    \"temperature\": <temperature>
    }"

Additional resources