Serving models

About model serving

Serving trained models on Open Data Hub means deploying the models on your OpenShift cluster to test and then integrate them into intelligent applications. Deploying a model makes it available as a service that you can access by using an API. This enables you to return predictions based on data inputs that you provide through API calls. This process is known as model inferencing. When you serve a model on Open Data Hub, the inference endpoints that you can access for the deployed model are shown in the dashboard.
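For example, after a model is deployed, you can send an inference request to its endpoint. The following sketch assumes a runtime that implements the KServe V2 REST inference protocol (many runtimes do, but check your runtime's documentation for the exact request format); the host, model name, and input tensor are placeholders that you replace with the values shown in the dashboard:

```shell
# Placeholder host and model name; copy the real inference endpoint from the dashboard.
ENDPOINT="https://my-model-route.example.com"
MODEL="my-model"

# A KServe V2 REST inference request (shape, datatype, and data depend on your model):
curl -sk "${ENDPOINT}/v2/models/${MODEL}/infer" \
  -H 'Content-Type: application/json' \
  -d '{"inputs": [{"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [6.8, 2.8, 4.8, 1.4]}]}'
```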

Open Data Hub provides the following model serving platforms:

Single model serving platform

For deploying large models such as large language models (LLMs), Open Data Hub includes a single model serving platform that is based on the KServe component. Because each model is deployed from its own model server, the single model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Multi-model serving platform

For deploying small and medium-sized models, Open Data Hub includes a multi-model serving platform that is based on the ModelMesh component. On the multi-model serving platform, you can deploy multiple models on the same model server. Each of the deployed models shares the server resources. This approach can be advantageous on OpenShift clusters that have finite compute resources or pods.

Serving small and medium-sized models

On the multi-model serving platform, multiple models can be deployed from the same model server and share the server resources.

Configuring model servers

Enabling the multi-model serving platform

To use the multi-model serving platform, you must first enable the platform.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the admin group (for example, odh-admins) in OpenShift.

Procedure
  1. In the left menu of the Open Data Hub dashboard, click Settings > Cluster settings.

  2. Locate the Model serving platforms section.

  3. Select the Multi-model serving platform checkbox.

  4. Click Save changes.

Adding a custom model-serving runtime for the multi-model serving platform

A model-serving runtime adds support for a specified set of model frameworks (that is, formats). By default, the multi-model serving platform includes the OpenVINO Model Server runtime. However, if this runtime doesn’t meet your needs (it doesn’t support a particular model format, for example), you can add your own, custom runtime.

As an administrator, you can use the Open Data Hub dashboard to add and enable a custom model-serving runtime. You can then choose the custom runtime when you create a new model server for the multi-model serving platform.

Note
Open Data Hub enables you to add your own custom runtimes, but does not support the runtimes themselves. You are responsible for correctly configuring and maintaining custom runtimes. You are also responsible for ensuring that you are licensed to use any custom runtimes that you add.
Prerequisites
  • You have logged in to Open Data Hub as an administrator.

  • You are familiar with how to add a model server to your project. When you have added a custom model-serving runtime, you must configure a new model server to use the runtime.

  • You have reviewed the example runtimes in the kserve/modelmesh-serving repository. You can use these examples as starting points. However, each runtime requires some further modification before you can deploy it in Open Data Hub. The required modifications are described in the following procedure.

    Note
    Open Data Hub includes the OpenVINO Model Server runtime by default. You do not need to add this runtime to Open Data Hub.
Procedure
  1. From the Open Data Hub dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, the OpenVINO Model Server runtime), click the action menu (⋮) next to the existing runtime and then click Duplicate.

    • To add a new custom runtime, click Add serving runtime.

  3. In the Select the model serving platforms this runtime supports list, select Multi-model serving platform.

    Note
    The multi-model serving platform supports only the REST protocol. Therefore, you cannot change the default value in the Select the API protocol this runtime supports list.
  4. Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.

      2. In the file browser, select a YAML file on your computer. This file might be one of the example runtimes that you downloaded from the kserve/modelmesh-serving repository.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.

      2. Enter or paste YAML code directly in the embedded editor. The YAML that you paste might be copied from one of the example runtimes in the kserve/modelmesh-serving repository.

  5. Optional: If you are adding one of the example runtimes in the kserve/modelmesh-serving repository, perform the following modifications:

    1. In the YAML editor, locate the kind field for your runtime. Update the value of this field to ServingRuntime.

    2. In the kustomization.yaml file in the kserve/modelmesh-serving repository, take note of the newName and newTag values for the runtime that you want to add. You will specify these values in a later step.

    3. In the YAML editor for your custom runtime, locate the containers.image field.

    4. Update the value of the containers.image field in the format newName:newTag, based on the values that you previously noted in the kustomization.yaml file. Some examples are shown.

      NVIDIA Triton Inference Server

      image: nvcr.io/nvidia/tritonserver:23.04-py3

      Seldon Python MLServer

      image: seldonio/mlserver:1.3.2

      TorchServe

      image: pytorch/torchserve:0.7.1-cpu

  6. In the metadata.name field, ensure that the value for the runtime that you are adding is unique (that is, it does not match the name of a runtime that you have already added).

  7. Optional: To configure a custom display name for the runtime that you are adding, add a metadata.annotations.openshift.io/display-name field and specify a value, as shown in the following example:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: mlserver-0.x
      annotations:
        openshift.io/display-name: MLServer
    Note
    If you do not configure a custom display name for your runtime, Open Data Hub shows the value of the metadata.name field.
  8. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the runtime you added is automatically enabled.

  9. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.
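Taken together, these modifications produce a ServingRuntime specification similar to the following sketch. It is loosely based on the Triton example in the kserve/modelmesh-serving repository, but the image tag, runtime name, and supported model formats shown here are illustrative; the repository examples also include additional fields (such as endpoint and argument settings) that you should keep when you adapt them:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime                # updated from the example's original kind value
metadata:
  name: triton-23.04                # must be unique among your runtimes
  annotations:
    openshift.io/display-name: Triton Inference Server
spec:
  supportedModelFormats:            # illustrative; list the formats your runtime serves
    - name: onnx
      version: "1"
      autoSelect: true
  multiModel: true                  # required for the multi-model (ModelMesh) platform
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:23.04-py3   # newName:newTag from kustomization.yaml
```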

Verification
  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.

Adding a model server for the multi-model serving platform

When you have enabled the multi-model serving platform, you must configure a model server to deploy models. If you require extra computing power for use with large datasets, you can assign accelerators to your model server.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you use specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have created a data science project that you can add a model server to.

  • You have enabled the multi-model serving platform.

  • If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See Adding a custom model-serving runtime.

  • If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

Procedure
  1. In the left menu of the Open Data Hub dashboard, click Data Science Projects.

  2. Click the name of the project that you want to configure a model server for.

    A project details page opens.

  3. In the Models and model servers section, perform one of the following actions:

    • If you see a Multi-model serving platform tile, click Add model server on the tile.

    • If you do not see any tiles, click the Add model server button.

    The Add model server dialog opens.

  4. In the Model server name field, enter a unique name for the model server.

  5. From the Serving runtime list, select a model-serving runtime that is installed and enabled in your Open Data Hub deployment.

    Note

    If you are using a custom model-serving runtime with your model server and want to use GPUs, you must ensure that your custom runtime supports GPUs and is appropriately configured to use them.

  6. In the Number of model replicas to deploy field, specify a value.

  7. From the Model server size list, select a value.

  8. Optional: If you selected Custom in the preceding step, configure the following settings in the Model server size section to customize your model server:

    1. In the CPUs requested field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.

    2. In the CPU limit field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.

    3. In the Memory requested field, specify the requested memory for the model server in gibibytes (Gi).

    4. In the Memory limit field, specify the maximum memory limit for the model server in gibibytes (Gi).

  9. Optional: From the Accelerator list, select an accelerator.

    1. If you selected an accelerator in the preceding step, specify the number of accelerators to use.

  10. Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.

  11. Optional: In the Token authorization section, select the Require token authentication checkbox to require token authentication for your model server. To finish configuring token authentication, perform the following actions:

    1. In the Service account name field, enter a service account name for which the token will be generated. The generated token is created and displayed in the Token secret field when the model server is configured.

    2. To add an additional service account, click Add a service account and enter another service account name.

  12. Click Add.

    The model server that you configured appears in the Models and model servers section of the project details page.

  13. Optional: To update the model server, click the action menu (⋮) beside the model server and select Edit model server.

Deleting a model server

When you no longer need a model server to host models, you can remove it from your data science project.

Note
When you remove a model server, you also remove the models that are hosted on that model server. As a result, the models are no longer available to applications.
Prerequisites
  • You have created a data science project and an associated model server.

  • You have notified the users of the applications that access the models that the models will no longer be available.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

Procedure
  1. From the Open Data Hub dashboard, click Data Science Projects.

    The Data science projects page opens.

  2. Click the name of the project from which you want to delete the model server.

    A project details page opens.

  3. In the Models and model servers section, click the action menu (⋮) beside the model server that you want to delete and then click Delete model server.

    The Delete model server dialog opens.

  4. Enter the name of the model server in the text field to confirm that you intend to delete it.

  5. Click Delete model server.

Verification
  • The model server that you deleted is no longer displayed in the Models and model servers section on the project details page.

Working with deployed models

Deploying a model by using the multi-model serving platform

You can deploy trained models on Open Data Hub to test them and integrate them into intelligent applications. Deploying a model makes it available as a service that you can access by using an API. This enables you to return predictions based on data inputs.

When you have enabled the multi-model serving platform, you can deploy models on the platform.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have enabled the multi-model serving platform.

  • You have created a data science project and added a model server.

  • You have access to S3-compatible object storage.

  • For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.

Procedure
  1. In the left menu of the Open Data Hub dashboard, click Data Science Projects.

  2. Click the name of the project that you want to deploy a model in.

    A project details page opens.

  3. In the Models and model servers section, click Deploy model.

  4. Configure properties for deploying your model as follows:

    1. In the Model name field, enter a unique name for the model that you are deploying.

    2. From the Model framework list, select a framework for your model.

      Note
      The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.
    3. To specify the location of the model you want to deploy from S3-compatible object storage, perform one of the following sets of actions:

      • To use an existing data connection

        1. Select Existing data connection.

        2. From the Name list, select a data connection that you previously defined.

        3. In the Path field, enter the folder path that contains the model in your specified data source.

      • To use a new data connection

        1. To define a new data connection that your model can access, select New data connection.

        2. In the Name field, enter a unique name for the data connection.

        3. In the Access key field, enter the access key ID for the S3-compatible object storage provider.

        4. In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.

        5. In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.

        6. In the Region field, enter the default region of your S3-compatible object storage account.

        7. In the Bucket field, enter the name of your S3-compatible object storage bucket.

        8. In the Path field, enter the folder path in your S3-compatible object storage that contains your data file.

    4. Click Deploy.

Verification
  • Confirm that the deployed model is shown in the Models and model servers section of your project, and on the Model Serving page of the dashboard with a checkmark in the Status column.

Viewing a deployed model

To analyze the results of your work, you can view a list of deployed models on Open Data Hub. You can also view the current statuses of deployed models and their endpoints.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

    The Deployed models page opens.

    For each model, the page shows details such as the model name, the project in which the model is deployed, the serving runtime that the model uses, and the deployment status.

  2. Optional: For a given model, click the link in the Inference endpoint column to see the inference endpoints for the deployed model.

Verification
  • A list of previously deployed data science models is displayed on the Deployed models page.

Updating the deployment properties of a deployed model

You can update the deployment properties of a model that has been deployed previously. This allows you to change the model’s data connection and name.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model on Open Data Hub.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

    The Deployed models page opens.

  2. Click the action menu (⋮) beside the model whose deployment properties you want to update and click Edit.

    The Deploy model dialog opens.

  3. Update the deployment properties of the model as follows:

    1. In the Model Name field, enter a new, unique name for the model.

    2. From the Model framework list, select a framework for your model.

      Note
      The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.
    3. To update how you have specified the location of your model, perform one of the following sets of actions:

      • If you previously specified an existing data connection

        1. In the Path field, update the folder path that contains the model in your specified data source.

      • If you previously specified a new data connection

        1. In the Name field, enter a new, unique name for the data connection.

        2. In the Access key field, update the access key ID for the S3-compatible object storage provider.

        3. In the Secret key field, update the secret access key for the S3-compatible object storage account that you specified.

        4. In the Endpoint field, update the endpoint of your S3-compatible object storage bucket.

        5. In the Region field, update the default region of your S3-compatible object storage account.

        6. In the Bucket field, update the name of your S3-compatible object storage bucket.

        7. In the Path field, update the folder path in your S3-compatible object storage that contains your data file.

    4. Click Deploy.

Verification
  • The model whose deployment properties you updated is displayed on the Model Serving page of the dashboard.

Deleting a deployed model

You can delete models you have previously deployed. This enables you to remove deployed models that are no longer required.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

    The Deployed models page opens.

  2. Click the action menu (⋮) beside the deployed model that you want to delete and click Delete.

    The Delete deployed model dialog opens.

  3. Enter the name of the deployed model in the text field to confirm that you intend to delete it.

  4. Click Delete deployed model.

Verification
  • The model that you deleted is no longer displayed on the Deployed models page.

Configuring monitoring for the multi-model serving platform

The multi-model serving platform includes metrics for the ModelMesh component. After you configure monitoring, Prometheus can scrape the available metrics.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.

  • You are familiar with creating a config map for monitoring a user-defined workload. You will perform similar steps in this procedure.

  • You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.

  • You have assigned the monitoring-rules-view role to users that will monitor metrics.

Procedure
  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        prometheus:
          logLevel: debug
          retention: 15d

    The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.

  3. Apply the configuration to create the user-workload-monitoring-config object.

    $ oc apply -f uwm-cm-conf.yaml
  4. Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true

    The cluster-monitoring-config object enables monitoring for user-defined projects.

  5. Apply the configuration to create the cluster-monitoring-config object.

    $ oc apply -f uwm-cm-enable.yaml
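After you apply both config maps, you can optionally confirm that user workload monitoring is running by checking for the Prometheus pods in the openshift-user-workload-monitoring namespace (the exact pod names vary by cluster and OpenShift version):

```shell
$ oc get pods -n openshift-user-workload-monitoring
```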

Viewing metrics for the multi-model serving platform

When a cluster administrator has configured monitoring for the multi-model serving platform, non-admin users can use the OpenShift web console to view metrics.

Procedure
  1. Switch to the Developer perspective.

  2. In the left menu, click Observe.

  3. As described in monitoring project metrics, use the web console to run queries for modelmesh_* metrics.
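For example, a query over the ModelMesh request counters might look like the following. The exact metric names depend on your ModelMesh version, so treat the metric name here as illustrative and browse the available modelmesh_* metrics in the web console:

```
sum(rate(modelmesh_api_request_milliseconds_count[5m])) by (namespace)
```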

Monitoring deployed models

Viewing performance metrics for all models on a model server

In Open Data Hub, you can monitor the following metrics for all the models that are deployed on a model server:

  • HTTP requests - The number of HTTP requests that have failed or succeeded for all models on the server.

    Note: You can also view the number of HTTP requests that have failed or succeeded for a specific model, as described in Viewing HTTP request metrics for a deployed model.

  • Average response time (ms) - For all models on the server, the average time it takes the model server to respond to requests.

  • CPU utilization (%) - The percentage of the CPU’s capacity that is currently being used by all models on the server.

  • Memory utilization (%) - The percentage of the system’s memory that is currently being used by all models on the server.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the models are performing at a specified time.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • There are deployed data science models in your data science project.

Procedure
  1. From the Open Data Hub dashboard navigation menu, click Data Science Projects and then select the project that contains the data science models that you want to monitor.

  2. On the project details page, scroll down to the Models and model servers section.

  3. In the row for the model server that you are interested in, click the action menu (⋮) and then select View model server metrics.

  4. Optional: On the metrics page for the model server, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.

    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.

  5. Scroll down to view data graphs for HTTP requests, average response time, CPU utilization, and memory utilization.

Verification

On the metrics page for the model server, the graphs provide performance metric data.

Viewing HTTP request metrics for a deployed model

You can view a graph that illustrates the HTTP requests that have failed or succeeded for a specific model.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model in a data science project.

Procedure
  1. From the Open Data Hub dashboard navigation menu, select Model Serving.

  2. On the Deployed models page, select the model that you are interested in.

  3. Optional: On the Endpoint performance tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.

    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.

Verification

The Endpoint performance tab shows a graph of the HTTP metrics for the model.

Serving large models

For deploying large models such as large language models (LLMs), Open Data Hub includes a single-model serving platform that is based on the KServe component. Because each model is deployed from its own model server, the single-model serving platform helps you to deploy, monitor, scale, and maintain large models that require increased resources.

About the single model serving platform

The single model serving platform consists of the following components:

  • KServe: A Kubernetes custom resource definition (CRD) that orchestrates model serving for all types of models. It includes model-serving runtimes that implement the loading of given types of models. KServe handles the lifecycle of the deployment object, storage access, and networking setup.

  • Red Hat OpenShift Serverless: A cloud-native development model that allows for serverless deployments of models. OpenShift Serverless is based on the open source Knative project.

To install the single model serving platform, you have the following options:

Automated installation

If you have not already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you can configure the Open Data Hub Operator to install KServe and its dependencies.

Manual installation

If you have already created a ServiceMeshControlPlane or KNativeServing resource on your OpenShift cluster, you cannot configure the Open Data Hub Operator to install KServe and its dependencies. In this situation, you must install KServe manually.

When you have installed KServe, you can use the Open Data Hub dashboard to deploy models using pre-installed or custom model-serving runtimes.

Open Data Hub includes the following pre-installed runtimes for KServe:

  • A standalone TGIS runtime

  • A composite Caikit-TGIS runtime

  • OpenVINO Model Server

Note

You can also configure monitoring for the single model serving platform and use Prometheus to scrape the available metrics.

Installing KServe

To learn how to perform both automated and manual installation of KServe, see Installation in the caikit-tgis-serving repository.

Deploying models by using the single-model serving platform

On the single-model serving platform, each model is deployed on its own model server. This helps you to deploy, monitor, scale, and maintain large models that require increased resources.

Important

If you want to use the single-model serving platform to deploy a model from S3-compatible storage that uses a self-signed SSL certificate, you must install a certificate authority (CA) bundle on your OpenShift cluster. For more information, see Understanding certificates in Open Data Hub.

Enabling the single model serving platform

When you have installed KServe, you can use the Open Data Hub dashboard to enable the single model serving platform. You can also use the dashboard to enable model-serving runtimes for the platform.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the admin group (for example, odh-admins) in OpenShift.

  • You have installed KServe.

Procedure
  1. Enable the single model serving platform as follows:

    1. In the left menu, click Settings > Cluster settings.

    2. Locate the Model serving platforms section.

    3. To enable the single model serving platform for projects, select the Single model serving platform checkbox.

    4. Click Save changes.

  2. Enable pre-installed runtimes for the single-model serving platform as follows:

    1. In the left menu of the Open Data Hub dashboard, click Settings > Serving runtimes.

      The Serving runtimes page shows any custom runtimes that you have added, as well as the following pre-installed runtimes:

      • Caikit TGIS ServingRuntime for KServe

      • OpenVINO Model Server

      • TGIS Standalone ServingRuntime for KServe

    2. Set the runtime that you want to use to Enabled.

      The single model serving platform is now available for model deployments.

Adding a custom model-serving runtime for the single-model serving platform

A model-serving runtime adds support for a specified set of model frameworks (that is, formats). You have the option of using the pre-installed runtimes included with Open Data Hub or adding your own, custom runtimes. This is useful in instances where the pre-installed runtimes don’t meet your needs. For example, you might find that the TGIS runtime does not support a particular model format that is supported by Hugging Face Text Generation Inference (TGI). In this case, you can create a custom runtime to add support for the model.

As an administrator, you can use the Open Data Hub interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.

Note
Open Data Hub enables you to add your own custom runtimes, but does not support the runtimes themselves. You are responsible for correctly configuring and maintaining custom runtimes. You are also responsible for ensuring that you are licensed to use any custom runtimes that you add.
Prerequisites
  • You have logged in to Open Data Hub as an administrator.

  • You have built your custom runtime and added the image to a container image repository such as Quay.

Procedure
  1. From the Open Data Hub dashboard, click Settings > Serving runtimes.

    The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.

  2. To add a custom runtime, choose one of the following options:

    • To start with an existing runtime (for example, TGIS Standalone ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.

    • To add a new custom runtime, click Add serving runtime.

  3. In the Select the model serving platforms this runtime supports list, select Single-model serving platform.

  4. In the Select the API protocol this runtime supports list, select REST or gRPC.

  5. Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:

    • Upload a YAML file

      1. Click Upload files.

      2. In the file browser, select a YAML file on your computer.

        The embedded YAML editor opens and shows the contents of the file that you uploaded.

    • Enter YAML code directly in the editor

      1. Click Start from scratch.

      2. Enter or paste YAML code directly in the embedded editor.

    Note
    In many cases, creating a custom runtime requires adding new or custom parameters to the env section of the ServingRuntime specification.
  6. Click Add.

    The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.

  7. Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.

Verification
  • The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
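A custom runtime is defined by a KServe ServingRuntime specification. The following minimal sketch shows the general shape of such a resource, including the env section where custom parameters are typically added; the name, image, model format, and env values are illustrative placeholders, not values from this document:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: my-custom-runtime              # illustrative name
spec:
  supportedModelFormats:
    - name: my-format                  # model format that this runtime serves
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/<your_org>/<your_runtime_image>:<tag>
      env:
        - name: EXAMPLE_RUNTIME_FLAG   # hypothetical custom parameter
          value: "example-value"
```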

Deploying models on the single-model serving platform

When you have enabled the single-model serving platform, you can enable a pre-installed or custom model-serving runtime and start to deploy models on the platform.

Note
Text Generation Inference Server (TGIS) is based on an early fork of Hugging Face TGI. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of Open Data Hub, support might be added in a future version. In the meantime, you can also add your own custom runtime to support a TGI model. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have installed KServe.

  • You have enabled the single-model serving platform.

  • You have created a data science project.

  • You have access to S3-compatible object storage.

  • For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.

  • To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.

  • If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

Procedure
  1. In the left menu, click Data Science Projects.

  2. Click the name of the project that you want to deploy a model in.

  3. In the Models and model servers section, perform one of the following actions:

    • If you see a Single model serving platform tile, click Deploy model on the tile.

    • If you do not see any tiles, click the Deploy model button.

    The Deploy model dialog opens.

  4. Configure properties for deploying your model as follows:

    1. In the Model name field, enter a unique name for the model that you are deploying.

    2. In the Serving runtime field, select an enabled runtime.

    3. From the Model framework list, select a value.

    4. In the Number of model replicas to deploy field, specify a value.

    5. From the Model server size list, select a value.

    6. To specify the location of your model, perform one of the following sets of actions:

      • To use an existing data connection

        1. Select Existing data connection.

        2. From the Name list, select a data connection that you previously defined.

        3. In the Path field, enter the folder path that contains the model in your specified data source.

      • To use a new data connection

        1. To define a new data connection that your model can access, select New data connection.

        2. In the Name field, enter a unique name for the data connection.

        3. In the Access key field, enter the access key ID for your S3-compatible object storage provider.

        4. In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.

        5. In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.

        6. In the Region field, enter the default region of your S3-compatible object storage account.

        7. In the Bucket field, enter the name of your S3-compatible object storage bucket.

        8. In the Path field, enter the folder path in your S3-compatible object storage that contains your data file.

    7. Click Deploy.

Verification
  • Confirm that the deployed model is shown in the Models and model servers section of your project, and on the Model Serving page of the dashboard with a check mark in the Status column.

Accessing the inference endpoints for models deployed on the single model serving platform

When you deploy a model by using the single model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know how to access the inference endpoints that are available.

Prerequisites
  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model by using the single model serving platform.

Procedure
  1. From the Open Data Hub dashboard, click Model Serving.

  2. From the Project list, select the project that you deployed a model in.

  3. In the Deployed models table, for the model that you want to access, copy the URL shown in the Inference endpoint column.

  4. Depending on what action you want to perform with the model (and whether the model supports that action), add one of the following paths to the end of the inference endpoint URL:

    Caikit TGIS ServingRuntime for KServe

    • :443/api/v1/task/text-generation

    • :443/api/v1/task/server-streaming-text-generation

    TGIS Standalone ServingRuntime for KServe

    • :443 fmaas.GenerationService/Generate

    • :443 fmaas.GenerationService/GenerateStream

      Note
      To query the endpoints for the TGIS standalone runtime, you must also download the files in the proto directory of the IBM text-generation-inference repository.

    OpenVINO Model Server

    • /v2/models/<model-name>/infer

    As indicated by the paths shown, the single model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.

  5. Use the endpoints to make API requests to your deployed model, as shown in the following example commands:

    Caikit TGIS ServingRuntime for KServe

    curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' \
    https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation

    TGIS Standalone ServingRuntime for KServe

    grpcurl -proto text-generation-inference/proto/generation.proto -d \
    '{"requests": [{"text":"<text>"}]}' \
    -H 'mm-model-id: <model_name>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

    OpenVINO Model Server

    curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
    '{ "model_name": "<model_name>",
    "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}'

Configuring monitoring for the single-model serving platform

The single-model serving platform includes metrics for supported runtimes. You can also configure monitoring for OpenShift Service Mesh. Service mesh metrics help you to understand dependencies and traffic flow between components in the mesh. When you have configured monitoring, you can grant Prometheus access to scrape the available metrics.

Prerequisites
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.

  • You have created OpenShift Service Mesh and Knative Serving instances and installed KServe.

  • You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.

  • You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.

  • You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.

  • You have assigned the monitoring-rules-view role to users that will monitor metrics.

Procedure
  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        prometheus:
          logLevel: debug
          retention: 15d

    The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.

  3. Apply the configuration to create the user-workload-monitoring-config object.

    $ oc apply -f uwm-cm-conf.yaml
  4. Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true

    The cluster-monitoring-config object enables monitoring for user-defined projects.

  5. Apply the configuration to create the cluster-monitoring-config object.

    $ oc apply -f uwm-cm-enable.yaml
  6. Create ServiceMonitor and PodMonitor objects to monitor metrics in the service mesh control plane as follows:

    1. Create an istiod-monitor.yaml YAML file with the following contents:

      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: istiod-monitor
        namespace: istio-system
      spec:
        targetLabels:
        - app
        selector:
          matchLabels:
            istio: pilot
        endpoints:
        - port: http-monitoring
          interval: 30s
    2. Deploy the ServiceMonitor CR in the specified istio-system namespace.

      $ oc apply -f istiod-monitor.yaml

      You see the following output:

      servicemonitor.monitoring.coreos.com/istiod-monitor created
    3. Create an istio-proxies-monitor.yaml YAML file with the following contents:

      apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        name: istio-proxies-monitor
        namespace: istio-system
      spec:
        selector:
          matchExpressions:
          - key: istio-prometheus-ignore
            operator: DoesNotExist
        podMetricsEndpoints:
        - path: /stats/prometheus
          interval: 30s
    4. Deploy the PodMonitor CR in the specified istio-system namespace.

      $ oc apply -f istio-proxies-monitor.yaml

      You see the following output:

      podmonitor.monitoring.coreos.com/istio-proxies-monitor created

Viewing metrics for the single model serving platform

When a cluster administrator has configured monitoring for the single model serving platform, non-admin users can use the OpenShift web console to view metrics.

Procedure
  1. Switch to the Developer perspective.

  2. In the left menu, click Observe.

  3. As described in monitoring project metrics, use the web console to run queries for caikit_*, tgi_*, ovms_*, or istio_* metrics.

Monitoring model performance

Viewing performance metrics for all models on a model server

In Open Data Hub, you can monitor the following metrics for all the models that are deployed on a model server:

  • HTTP requests - The number of HTTP requests that have failed or succeeded for all models on the server.

    Note: You can also view the number of HTTP requests that have failed or succeeded for a specific model, as described in Viewing HTTP request metrics for a deployed model.

  • Average response time (ms) - For all models on the server, the average time it takes the model server to respond to requests.

  • CPU utilization (%) - The percentage of the CPU’s capacity that is currently being used by all models on the server.

  • Memory utilization (%) - The percentage of the system’s memory that is currently being used by all models on the server.

You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the models are performing at a specified time.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • There are deployed data science models in your data science project.

Procedure
  1. From the Open Data Hub dashboard navigation menu, click Data Science Projects and then select the project that contains the data science models that you want to monitor.

  2. On the Components page, scroll down to the Models and model servers section.

  3. In the row for the model server that you are interested in, click the action menu (⋮) and then select View model server metrics.

  4. Optional: On the metrics page for the model server, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, or 30 days.

    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

  5. Scroll down to view data graphs for HTTP requests, average response time, CPU utilization, and memory utilization.

Verification

On the metrics page for the model server, the graphs provide performance metric data.

Viewing HTTP request metrics for a deployed model

You can view a graph that illustrates the HTTP requests that have failed or succeeded for a specific model.

Prerequisites
  • You have installed Open Data Hub.

  • On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.

  • You have logged in to Open Data Hub.

  • If you are using specialized Open Data Hub groups, you are part of the user group or admin group (for example, odh-users or odh-admins) in OpenShift.

  • You have deployed a model in a data science project.

Procedure
  1. From the Open Data Hub dashboard navigation menu, select Model Serving.

  2. On the Deployed models page, select the model that you are interested in.

  3. Optional: On the Endpoint performance tab, set the following options:

    • Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, or 30 days.

    • Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

Verification

The Endpoint performance tab shows a graph of the HTTP metrics for the model.