Deploying models
- Storing models
- Deploying models on the single-model serving platform
- Deploying models on the NVIDIA NIM model serving platform
- Deploying models on the multi-model serving platform
- Adding a model server for the multi-model serving platform
- Deleting a model server
- Deploying a model by using the multi-model serving platform
- Viewing a deployed model
- Updating the deployment properties of a deployed model
- Deleting a deployed model
- Configuring monitoring for the multi-model serving platform
- Viewing model-serving runtime metrics for the multi-model serving platform
- Viewing performance metrics for all models on a model server
- Viewing HTTP request metrics for a deployed model
- Making inference requests to deployed models
Storing models
You must store your model before you can deploy it. You can store a model in an S3-compatible object storage bucket, at a URI, or in an Open Container Initiative (OCI) container.
Using OCI containers for model storage
As an alternative to storing a model in an S3 bucket or URI, you can upload models to Open Container Initiative (OCI) containers. Deploying models from OCI containers is also known as modelcars in KServe.
Using OCI containers for model storage can help you:
-
Reduce startup times by avoiding downloading the same model multiple times.
-
Reduce disk space usage by reducing the number of models downloaded locally.
-
Improve model performance by allowing pre-fetched images.
Using OCI containers for model storage involves the following tasks:
-
Storing a model in an OCI image.
-
Deploying a model from an OCI image by using either the user interface or the command line interface. To deploy a model by using:
-
The user interface, see Deploying models on the single-model serving platform.
-
The command line interface, see Deploying a model stored in an OCI image by using the CLI.
-
Storing a model in an OCI image
You can store a model in an OCI image. The following procedure uses the example of storing a MobileNet v2-7 model in ONNX format.
-
You have a model in the ONNX format. The example in this procedure uses the MobileNet v2-7 model in ONNX format.
-
You have installed the Podman tool.
-
In a terminal window on your local machine, create a temporary directory for storing both the model and the support files that you need to create the OCI image, and change into it:
cd $(mktemp -d)
-
Create a models folder inside the temporary directory:
mkdir -p models/1
Note: This example command specifies the subdirectory 1 because OpenVINO requires numbered subdirectories for model versioning. If you are not using OpenVINO, you do not need to create the 1 subdirectory to use OCI container images.
-
Download the model and support files:
DOWNLOAD_URL=https://github.com/onnx/models/raw/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
curl -L $DOWNLOAD_URL -O --output-dir models/1/
-
Use the tree command to confirm that the model files are located in the directory structure as expected:
tree
The tree command should return a directory structure similar to the following example:
.
├── Containerfile
└── models
    └── 1
        └── mobilenetv2-7.onnx
-
Create a Docker file named Containerfile:
Note:
-
Specify a base image that provides a shell. In the following example, ubi9-micro is the base container image. You cannot specify an empty image that does not provide a shell, such as scratch, because KServe uses the shell to ensure the model files are accessible to the model server.
-
Change the ownership of the copied model files and grant read permissions to the root group to ensure that the model server can access the files. OpenShift runs containers with a random user ID and the root group ID.
FROM registry.access.redhat.com/ubi9/ubi-micro:latest
COPY --chown=0:0 models /models
RUN chmod -R a=rX /models
# nobody user
USER 65534
-
-
Use the podman build and podman push commands to create the OCI container image and upload it to a registry. The following commands use Quay as the registry.
Note: If your repository is private, ensure that you are authenticated to the registry before uploading your container image.
podman build --format=oci -t quay.io/<user_name>/<repository_name>:<tag_name> .
podman push quay.io/<user_name>/<repository_name>:<tag_name>
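Optionally, before deploying, you can confirm that the model files are present at the expected path inside the image. This is a minimal sketch, assuming the ubi9-micro base image from the previous step (which provides a shell and basic utilities such as ls) and the same placeholder image reference:
# List the model files that were copied into the OCI image
podman run --rm quay.io/<user_name>/<repository_name>:<tag_name> ls -lR /models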
Deploying models on the single-model serving platform
You can use the single-model serving platform to deploy large models, such as large language models (LLMs). This platform is based on the KServe component. You can deploy models in either advanced or standard mode. Advanced mode uses Red Hat OpenShift Serverless for serverless deployments, while standard mode uses KServe RawDeployment and does not require serverless dependencies.
On the single-model serving platform, each model is deployed on its own dedicated model server. This helps you to deploy, monitor, scale, and maintain large models that require more resources.
About KServe deployment modes
You can deploy models in either advanced or standard deployment mode.
Advanced deployment mode uses Knative Serverless. By default, KServe integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh to deploy models on the single-model serving platform. Red Hat OpenShift Serverless is based on the open source Knative project and requires the Red Hat OpenShift Serverless Operator.
Alternatively, you can use standard deployment mode, which uses KServe RawDeployment mode and does not require the Red Hat OpenShift Serverless Operator, Red Hat OpenShift Service Mesh, or Authorino.
If you configure KServe for advanced deployment mode, you can set up your data science project to serve models in both advanced and standard deployment mode. However, if you configure KServe for only standard deployment mode, you can only use standard deployment mode.
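Behind the dashboard, the chosen deployment mode is recorded on the KServe InferenceService resource through the serving.kserve.io/deploymentMode annotation. The following hedged sketch shows a minimal standard-mode (RawDeployment) deployment applied from the CLI; the project, model name, runtime, and storage URI are placeholders, and the resource shape mirrors the OCI example later in this document:
# Hedged sketch: request standard (raw) deployment mode for an InferenceService
oc apply -n <project_name> -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model_name>
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment   # omit or set to Serverless for advanced mode
spec:
  predictor:
    model:
      runtime: kserve-ovms
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
EOF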
There are both advantages and disadvantages to using each of these deployment modes:
Advanced mode
Advantages:
-
Enables autoscaling based on request volume:
-
Resources scale up automatically when receiving incoming requests.
-
Optimizes resource usage and maintains performance during peak times.
-
-
Supports scale down to and from zero using Knative:
-
Allows resources to scale down completely when there are no incoming requests.
-
Saves costs by not running idle resources.
-
Disadvantages:
-
Has customization limitations:
-
Serverless is backed by Knative and implicitly inherits the same design choices, such as when mounting multiple volumes.
-
-
Dependency on Knative for scaling:
-
Introduces additional complexity in setup and management compared to traditional scaling methods.
-
-
Cluster scoped component:
-
If the cluster already has Serverless configured, you must manually configure the cluster to make it work with Open Data Hub.
-
Standard mode
Advantages:
-
Enables deployment with Kubernetes resources, such as Deployment, Service, Route, and Horizontal Pod Autoscaler, without additional dependencies like Red Hat Serverless, Red Hat Service Mesh, and Authorino.
-
The resulting model deployment has a smaller resource footprint compared to advanced mode.
-
-
Enables traditional Deployment/Pod configurations, such as mounting multiple volumes, which is not available using Knative.
-
Beneficial for applications requiring complex configurations or multiple storage mounts.
-
Disadvantages:
-
Does not support automatic scaling:
-
Does not support automatic scaling down to zero resources when idle.
-
Might result in higher costs during periods of low traffic.
-
Deploying models on the single-model serving platform
When you have enabled the single-model serving platform, you can enable a preinstalled or custom model-serving runtime and deploy models on the platform.
You can use preinstalled model-serving runtimes to start serving models without modifying or defining the runtime yourself. For help adding a custom runtime, see Adding a custom model-serving runtime for the single-model serving platform.
-
You have logged in to Open Data Hub.
-
You have installed KServe.
-
You have enabled the single-model serving platform.
-
(Advanced deployments only) To enable token authentication and external model routes for deployed models, you have added Authorino as an authorization provider.
-
You have created a data science project.
-
You have access to S3-compatible object storage.
-
For the model that you want to deploy, you know the associated URI in your S3-compatible object storage bucket or Open Container Initiative (OCI) container.
-
To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.
-
To use the vLLM NVIDIA GPU ServingRuntime for KServe runtime or use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
To use the vLLM runtime on IBM Z and IBM Power, use the vLLM CPU ServingRuntime for KServe. On IBM Z and IBM Power, the vLLM runtime is supported only on CPU.
-
To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in Open Data Hub. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation and Working with hardware profiles.
-
To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphics processing units (GPUs) in Open Data Hub. This includes installing the AMD GPU Operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.
-
To deploy RHEL AI models:
-
You have enabled the vLLM NVIDIA GPU ServingRuntime for KServe runtime.
-
You have downloaded the model from the Red Hat container registry and uploaded it to S3-compatible object storage.
-
-
In the left menu, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that you want to deploy a model in.
A project details page opens.
-
Click the Models tab.
-
Perform one of the following actions:
-
If you see a Single-model serving platform tile, click Deploy model on the tile.
-
If you do not see any tiles, click the Deploy model button.
The Deploy model dialog opens.
-
-
In the Model deployment name field, enter a unique name for the model that you are deploying.
-
In the Serving runtime field, select an enabled runtime. If project-scoped runtimes exist, the Serving runtime list includes subheadings to distinguish between global runtimes and project-scoped runtimes.
-
From the Model framework (name - version) list, select a value.
-
From the Deployment mode list, select standard or advanced. For more information about deployment modes, see About KServe deployment modes.
-
In the Number of model server replicas to deploy field, specify a value.
-
The following options are only available if you have created a hardware profile:
-
From the Hardware profile list, select a hardware profile. If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.
Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list is displayed instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard.
-
Optional: To change these default values, click Customize resource requests and limit and enter new minimum (request) and maximum (limit) values. The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both.
-
-
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
-
To require token authentication for inference requests to the deployed model, perform the following actions:
-
Select Require token authentication.
-
In the Service account name field, enter the service account name that the token will be generated for.
-
To add an additional service account, click Add a service account and enter another service account name.
-
-
To specify the location of your model, perform one of the following sets of actions:
-
To use an existing connection
-
Select Existing connection.
-
From the Name list, select a connection that you previously defined.
-
For S3-compatible object storage: In the Path field, enter the folder path that contains the model in your specified data source.
-
For Open Container Image connections: In the OCI storage location field, enter the model URI where the model is located.
Note: If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections will be labeled as Recommended.
-
-
-
To use a new connection
-
To define a new connection that your model can access, select New connection.
-
In the Add connection modal, select a Connection type. The OCI-compliant registry, S3 compatible object storage, and URI options are pre-installed connection types. Additional options might be available if your Open Data Hub administrator added them.
The Add connection form opens with fields specific to the connection type that you selected.
-
-
Fill in the connection detail fields.
-
-
-
(Optional) Customize the runtime parameters in the Configuration parameters section:
-
Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
-
Modify the values in Additional environment variables to define variables in the model’s environment.
The Configuration parameters section shows predefined serving runtime parameters, if any are available.
Note: Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
-
-
Click Deploy.
-
Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
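If you also have CLI access to the cluster, you can verify the same deployment state from a terminal. A minimal sketch, assuming your data science project name as the namespace:
# The READY column reports True when the model has finished deploying
oc get inferenceservice -n <project_name>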
Deploying a model stored in an OCI image by using the CLI
You can deploy a model that is stored in an OCI image from the command line interface.
The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.
Note: By default in KServe, models are exposed outside the cluster and not protected with authentication.
-
You have stored a model in an OCI image as described in Storing a model in an OCI image.
-
If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
-
You are logged in to your OpenShift cluster.
-
Create a project to deploy the model:
oc new-project oci-model-example
-
Use the Open Data Hub project kserve-ovms template to create a ServingRuntime resource and configure the OpenVINO model server in the new project:
oc process -n opendatahub -o yaml kserve-ovms | oc apply -f -
-
Verify that the ServingRuntime named kserve-ovms is created:
oc get servingruntimes
The command should return output similar to the following:
NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
kserve-ovms              openvino_ir   kserve-container   1m
-
Create an InferenceService YAML resource, depending on whether the model is stored in a private or a public OCI repository:
-
For a model stored in a public OCI repository, create an InferenceService YAML file with the following values, replacing <user_name>, <repository_name>, and <tag_name> with values specific to your environment:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
-
For a model stored in a private OCI repository, create an InferenceService YAML file that specifies your pull secret in the spec.predictor.imagePullSecrets field, as shown in the following example:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-private-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms # Ensure this matches the name of the ServingRuntime resource
      modelFormat:
        name: onnx
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      resources:
        requests:
          memory: 500Mi
          cpu: 100m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
        limits:
          memory: 4Gi
          cpu: 500m
          # nvidia.com/gpu: "1" # Only required if you have GPUs available and the model and runtime will use it
    imagePullSecrets: # Specify image pull secrets to use for fetching container images, including OCI model images
    - name: <pull-secret-name>
After you create the InferenceService resource, KServe deploys the model stored in the OCI image referred to by the storageUri field.
-
Check the status of the deployment:
oc get inferenceservice
The command should return output that includes information, such as the URL of the deployed model and its readiness state.
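Once the InferenceService is ready, you can send a test inference request. The following is a hedged sketch that uses the OpenVINO Model Server v2 REST path described later in this document; the endpoint URL, model name, and tensor values are placeholders that you must adapt to the MobileNet v2-7 input format:
# Send a KServe v2 REST inference request to the deployed model
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer \
  -H 'Content-Type: application/json' \
  -d '{"inputs": [{"name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>]}]}'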
Monitoring models on the single-model serving platform
You can monitor models that are deployed on the single-model serving platform to view performance and resource usage metrics.
Viewing performance metrics for a deployed model
You can monitor the following metrics for a specific model that is deployed on the single-model serving platform:
-
Number of requests - The number of requests that have failed or succeeded for a specific model.
-
Average response time (ms) - The average time it takes a specific model to respond to requests.
-
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
-
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.
You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the model is performing at a specified time.
-
You have installed Open Data Hub.
-
A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.
-
You have logged in to Open Data Hub.
-
The following dashboard configuration options are set to the default values as shown (a sample command for restoring these defaults follows this list):
disablePerformanceMetrics: false
disableKServeMetrics: false
For more information about setting dashboard configuration options, see Customizing the dashboard.
-
You have deployed a model on the single-model serving platform by using a preinstalled runtime.
Note: Metrics are only supported for models deployed by using a preinstalled model-serving runtime or a custom runtime that is duplicated from a preinstalled runtime.
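If these dashboard options were changed from their defaults, a cluster administrator can restore them from the CLI. This is a hedged sketch: the custom resource name (odh-dashboard-config), namespace (opendatahub), and field path are assumptions that you should verify on your cluster:
# Restore the metrics-related dashboard options to their default values
oc patch odhdashboardconfig odh-dashboard-config -n opendatahub --type merge \
  -p '{"spec": {"dashboardConfig": {"disablePerformanceMetrics": false, "disableKServeMetrics": false}}}'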
-
From the Open Data Hub dashboard navigation menu, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that contains the data science models that you want to monitor.
-
In the project details page, click the Models tab.
-
Select the model that you are interested in.
-
On the Endpoint performance tab, set the following options:
-
Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
-
Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
-
-
Scroll down to view data graphs for number of requests, average response time, CPU utilization, and memory utilization.
The Endpoint performance tab shows graphs of metrics for the model.
Viewing model-serving runtime metrics for the single-model serving platform
When a cluster administrator has configured monitoring for the single-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the KServe component.
-
Log in to the OpenShift Container Platform web console.
-
Switch to the Developer perspective.
-
In the left menu, click Observe.
-
As described in Monitoring your project metrics, use the web console to run queries for model-serving runtime metrics. You can also run queries for metrics that are related to OpenShift Service Mesh. Some examples are shown.
-
The following query displays the number of successful inference requests over a period of time for a model deployed with the vLLM runtime:
sum(increase(vllm:request_success_total{namespace=${namespace},model_name=${model_name}}[${rate_interval}]))
Note: Certain vLLM metrics are available only after an inference request is processed by a deployed model. To generate and view these metrics, you must first make an inference request to the model.
-
The following query displays the number of successful inference requests over a period of time for a model deployed with the standalone TGIS runtime:
sum(increase(tgi_request_success{namespace=${namespace}, pod=~${model_name}-predictor-.*}[${rate_interval}]))
-
The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
sum(increase(predict_rpc_count_total{namespace=${namespace},code=OK,model_id=${model_name}}[${rate_interval}]))
-
The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
sum(increase(ovms_requests_success{namespace=${namespace},name=${model_name}}[${rate_interval}]))
-
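You can also run these queries outside the web console against the user workload monitoring stack. The following is a hedged sketch that uses the thanos-querier route in the openshift-monitoring namespace; the token handling and the example query are assumptions that you should adapt to your cluster and runtime:
# Resolve the Thanos Querier route and run a PromQL query with your OpenShift token
THANOS_HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  "https://${THANOS_HOST}/api/v1/query" \
  --data-urlencode 'query=sum(increase(vllm:request_success_total{namespace="<project_name>"}[5m]))'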
Deploying models on the NVIDIA NIM model serving platform
You can deploy models using NVIDIA NIM inference services on the NVIDIA NIM model serving platform.
NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations.
Deploying models on the NVIDIA NIM model serving platform
When you have enabled the NVIDIA NIM model serving platform, you can start to deploy NVIDIA-optimized models on the platform.
-
You have logged in to Open Data Hub.
-
You have enabled the NVIDIA NIM model serving platform.
-
You have created a data science project.
-
You have enabled support for graphics processing units (GPUs) in Open Data Hub. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform.
-
In the left menu, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that you want to deploy a model in.
A project details page opens.
-
Click the Models tab.
-
In the Models section, perform one of the following actions:
-
On the NVIDIA NIM model serving platform tile, click Select NVIDIA NIM, and then click Deploy model.
-
If you have previously selected the NVIDIA NIM model serving type, the Models page displays NVIDIA model serving enabled on the upper-right corner, along with the Deploy model button. To proceed, click Deploy model.
The Deploy model dialog opens.
-
-
Configure properties for deploying your model as follows:
-
In the Model deployment name field, enter a unique name for the deployment.
-
From the NVIDIA NIM list, select the NVIDIA NIM model that you want to deploy. For more information, see Supported Models.
-
In the NVIDIA NIM storage size field, specify the size of the cluster storage instance that will be created to store the NVIDIA NIM model.
Note: When resizing a PersistentVolumeClaim (PVC) backed by Amazon EBS in Open Data Hub, you may encounter VolumeModificationRateExceeded: You've reached the maximum modification rate per volume limit. To avoid this error, wait at least six hours between modifications per EBS volume. If you resize a PVC before the cooldown expires, the Amazon EBS CSI driver (ebs.csi.aws.com) fails with this error. This error is an Amazon EBS service limit that applies to all workloads using EBS-backed PVCs.
-
In the Number of model server replicas to deploy field, specify a value.
-
From the Model server size list, select a value.
-
-
From the Hardware profile list, select a hardware profile.
Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list is displayed instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard.
-
Optional: Click Customize resource requests and limit and update the following values:
-
In the CPUs requests field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
-
In the CPU limits field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
-
In the Memory requests field, specify the requested memory for the model server in gibibytes (Gi).
-
In the Memory limits field, specify the maximum memory limit for the model server in gibibytes (Gi).
-
-
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
-
To require token authentication for inference requests to the deployed model, perform the following actions:
-
Select Require token authentication.
-
In the Service account name field, enter the service account name that the token will be generated for.
-
To add an additional service account, click Add a service account and enter another service account name.
-
-
Click Deploy.
-
Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
Viewing NVIDIA NIM metrics for a NIM model
In Open Data Hub, you can observe the following NVIDIA NIM metrics for a NIM model deployed on the NVIDIA NIM model serving platform:
-
GPU cache usage over time (ms)
-
Current running, waiting, and max requests count
-
Tokens count
-
Time to first token
-
Time per output token
-
Request outcomes
You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.
-
You have enabled the NVIDIA NIM model serving platform.
-
You have deployed a NIM model on the NVIDIA NIM model serving platform.
-
A cluster administrator has enabled metrics collection and graph generation for your deployment.
-
The disableKServeMetrics Open Data Hub dashboard configuration option is set to its default value of false:
disableKServeMetrics: false
For more information about setting dashboard configuration options, see Customizing the dashboard.
-
From the Open Data Hub dashboard navigation menu, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that contains the NIM model that you want to monitor.
-
In the project details page, click the Models tab.
-
Click the NIM model that you want to observe.
-
On the NIM Metrics tab, set the following options:
-
Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
-
Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
-
-
Scroll down to view data graphs for NIM metrics.
The NIM Metrics tab shows graphs of NIM metrics for the deployed NIM model.
Viewing performance metrics for a NIM model
You can observe the following performance metrics for a NIM model deployed on the NVIDIA NIM model serving platform:
-
Number of requests - The number of requests that have failed or succeeded for a specific model.
-
Average response time (ms) - The average time it takes a specific model to respond to requests.
-
CPU utilization (%) - The percentage of the CPU limit per model replica that is currently utilized by a specific model.
-
Memory utilization (%) - The percentage of the memory limit per model replica that is utilized by a specific model.
You can specify a time range and a refresh interval for these metrics to help you determine, for example, the peak usage hours and model performance at a specified time.
-
You have enabled the NVIDIA NIM model serving platform.
-
You have deployed a NIM model on the NVIDIA NIM model serving platform.
-
A cluster administrator has enabled metrics collection and graph generation for your deployment.
-
The disableKServeMetrics Open Data Hub dashboard configuration option is set to its default value of false:
disableKServeMetrics: false
For more information about setting dashboard configuration options, see Customizing the dashboard.
-
From the Open Data Hub dashboard navigation menu, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that contains the NIM model that you want to monitor.
-
In the project details page, click the Models tab.
-
Click the NIM model that you want to observe.
-
On the Endpoint performance tab, set the following options:
-
Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
-
Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed to show the latest data. You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
-
-
Scroll down to view data graphs for performance metrics.
The Endpoint performance tab shows graphs of performance metrics for the deployed NIM model.
Deploying models on the multi-model serving platform
For deploying small and medium-sized models, Open Data Hub includes a multi-model serving platform that is based on the ModelMesh component. On the multi-model serving platform, multiple models can be deployed from the same model server and share the server resources.
Important: Starting with Open Data Hub version 2.19, the multi-model serving platform based on ModelMesh is deprecated. You can continue to deploy models on the multi-model serving platform, but it is recommended that you migrate to the single-model serving platform. For more information or for help on using the single-model serving platform, contact your account manager.
Adding a model server for the multi-model serving platform
When you have enabled the multi-model serving platform, you must configure a model server to deploy models. If you require extra computing power for use with large datasets, you can assign accelerators to your model server.
-
You have logged in to Open Data Hub.
-
You have created a data science project that you can add a model server to.
-
You have enabled the multi-model serving platform.
-
If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See Adding a custom model-serving runtime.
-
If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery Operator and the NVIDIA GPU or AMD GPU Operator. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
In the left menu of the Open Data Hub dashboard, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that you want to configure a model server for.
A project details page opens.
-
Click the Models tab.
-
Perform one of the following actions:
-
If you see a Multi-model serving platform tile, click Add model server on the tile.
-
If you do not see any tiles, click the Add model server button.
The Add model server dialog opens.
-
-
In the Model server name field, enter a unique name for the model server.
-
From the Serving runtime list, select a model-serving runtime that is installed and enabled in your Open Data Hub deployment.
Note: If you are using a custom model-serving runtime with your model server and want to use GPUs, you must ensure that your custom runtime supports GPUs and is appropriately configured to use them.
-
In the Number of model replicas to deploy field, specify a value.
-
From the Accelerator profile list, select an accelerator profile.
Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. If you enable hardware profiles, the Hardware profiles list is displayed instead of the Accelerator profiles list. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard.
-
Optional: Click Customize resource requests and limit and update the following values:
-
In the CPUs requests field, specify the number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
-
In the CPU limits field, specify the maximum number of CPUs to use with your model server. Use the list beside this field to specify the value in cores or millicores.
-
In the Memory requests field, specify the requested memory for the model server in gibibytes (Gi).
-
In the Memory limits field, specify the maximum memory limit for the model server in gibibytes (Gi).
-
-
Optional: In the Model route section, select the Make deployed models available through an external route checkbox to make your deployed models available to external clients.
-
Optional: In the Token authentication section, select the Require token authentication checkbox to require token authentication for your model server. To finish configuring token authentication, perform the following actions:
-
In the Service account name field, enter a service account name for which the token will be generated. The generated token is created and displayed in the Token secret field when the model server is configured.
-
To add an additional service account, click Add a service account and enter another service account name.
-
-
Click Add.
-
The model server that you configured is displayed on the Models tab for the project, in the Models and model servers list.
-
-
Optional: To update the model server, click the action menu (⋮) beside the model server and select Edit model server.
Deleting a model server
When you no longer need a model server to host models, you can remove it from your data science project.
Note: When you remove a model server, you also remove the models that are hosted on that model server. As a result, the models are no longer available to applications.
-
You have created a data science project and an associated model server.
-
You have notified the users of the applications that access the models that the models will no longer be available.
-
From the Open Data Hub dashboard, click Data science projects.
The Data science projects page opens.
-
Click the name of the project from which you want to delete the model server.
A project details page opens.
-
Click the Models tab.
-
Click the action menu (⋮) beside the project whose model server you want to delete and then click Delete model server.
The Delete model server dialog opens.
-
Enter the name of the model server in the text field to confirm that you intend to delete it.
-
Click Delete model server.
-
The model server that you deleted is no longer displayed on the Models tab for the project.
Deploying a model by using the multi-model serving platform
You can deploy trained models on Open Data Hub to enable you to test and implement them into intelligent applications. Deploying a model makes it available as a service that you can access by using an API. This enables you to return predictions based on data inputs.
When you have enabled the multi-model serving platform, you can deploy models on the platform.
-
You have logged in to Open Data Hub.
-
You have enabled the multi-model serving platform.
-
You have created a data science project and added a model server.
-
You have access to S3-compatible object storage.
-
For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
-
In the left menu of the Open Data Hub dashboard, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that you want to deploy a model in.
A project details page opens.
-
Click the Models tab.
-
Click Deploy model.
-
Configure properties for deploying your model as follows:
-
In the Model name field, enter a unique name for the model that you are deploying.
-
From the Model framework list, select a framework for your model.
Note: The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.
-
To specify the location of the model you want to deploy from S3-compatible object storage, perform one of the following sets of actions:
-
To use an existing connection
-
Select Existing connection.
-
From the Name list, select a connection that you previously defined.
-
In the Path field, enter the folder path that contains the model in your specified data source.
Note: If you are deploying a registered model version with an existing S3 or URI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, bucket, and region might populate automatically. Matching connections will be labeled as Recommended.
-
-
To use a new connection
-
To define a new connection that your model can access, select New connection.
-
In the Add connection modal, select a Connection type. The S3 compatible object storage and URI options are pre-installed connection types. Additional options might be available if your Open Data Hub administrator added them.
The Add connection form opens with fields specific to the connection type that you selected.
-
Enter the connection detail fields.
-
-
-
(Optional) Customize the runtime parameters in the Configuration parameters section:
-
Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
-
Modify the values in Additional environment variables to define variables in the model’s environment.
-
-
Click Deploy.
-
-
Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.
-
To learn how to monitor your model for bias, see Monitoring data science models.
Viewing a deployed model
To analyze the results of your work, you can view a list of deployed models on Open Data Hub. You can also view the current statuses of deployed models and their endpoints.
-
You have logged in to Open Data Hub.
-
From the Open Data Hub dashboard, click Models → Model deployments.
The Model deployments page opens.
For each model, the page shows details such as the model name, the project in which the model is deployed, the model-serving runtime that the model uses, and the deployment status.
-
Optional: For a given model, click the link in the Inference endpoint column to see the inference endpoints for the deployed model.
-
A list of previously deployed data science models is displayed on the Model deployments page.
-
To learn how to monitor your model for bias, see Monitoring data science models.
Updating the deployment properties of a deployed model
You can update the deployment properties of a model that has been deployed previously. For example, you can change the model’s connection and name.
-
You have logged in to Open Data Hub.
-
You have deployed a model on Open Data Hub.
-
From the Open Data Hub dashboard, click Models → Model deployments.
The Model deployments page opens.
-
Click the action menu (⋮) beside the model whose deployment properties you want to update and click Edit.
The Edit model dialog opens.
-
Update the deployment properties of the model as follows:
-
In the Model name field, enter a new, unique name for your model.
-
From the Model servers list, select a model server for your model.
-
From the Model framework list, select a framework for your model.
Note: The Model framework list shows only the frameworks that are supported by the model-serving runtime that you specified when you configured your model server.
-
Optionally, update the connection by specifying an existing connection or by creating a new connection.
-
Click Redeploy.
-
-
The model whose deployment properties you updated is displayed on the Model deployments page of the dashboard.
Deleting a deployed model
You can delete models you have previously deployed. This enables you to remove deployed models that are no longer required.
-
You have logged in to Open Data Hub.
-
You have deployed a model.
-
From the Open Data Hub dashboard, click Models → Model deployments.
The Model deployments page opens.
-
Click the action menu (⋮) beside the deployed model that you want to delete and click Delete.
The Delete deployed model dialog opens.
-
Enter the name of the deployed model in the text field to confirm that you intend to delete it.
-
Click Delete deployed model.
-
The model that you deleted is no longer displayed on the Model deployments page.
Configuring monitoring for the multi-model serving platform
The multi-model serving platform includes model and model server metrics for the ModelMesh component. ModelMesh generates its own set of metrics and does not rely on the underlying model-serving runtimes to provide them. The set of metrics that ModelMesh generates includes metrics for model request rates and timings, model loading and unloading rates, times and sizes, internal queuing delays, capacity and usage, cache state, and least recently-used models. For more information, see ModelMesh metrics.
After you have configured monitoring, you can view metrics for the ModelMesh component.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
-
You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.
-
You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.
-
You have assigned the monitoring-rules-view role to users that will monitor metrics. A sample command for assigning this role follows this list.
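A hedged example of assigning that role from the CLI; the user name and project are placeholders, and your cluster policy might require a different role-binding approach:
# Allow <username> to view monitoring rules in the data science project <project_name>
oc policy add-role-to-user monitoring-rules-view <username> -n <project_name>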
-
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
-
Define a ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: debug
      retention: 15d
The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.
-
Apply the configuration to create the user-workload-monitoring-config object:
$ oc apply -f uwm-cm-conf.yaml
-
Define another ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
The cluster-monitoring-config object enables monitoring for user-defined projects.
-
Apply the configuration to create the cluster-monitoring-config object:
$ oc apply -f uwm-cm-enable.yaml
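To confirm that monitoring for user-defined projects has started, you can check for the monitoring pods; a quick hedged check (pod names vary by OpenShift version):
# The prometheus-user-workload and thanos-ruler pods should reach the Running state
$ oc get pods -n openshift-user-workload-monitoring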
Viewing model-serving runtime metrics for the multi-model serving platform
After a cluster administrator has configured monitoring for the multi-model serving platform, non-admin users can use the OpenShift web console to view model-serving runtime metrics for the ModelMesh component.
-
Log in to the OpenShift Container Platform web console.
-
Switch to the Developer perspective.
-
In the left menu, click Observe.
-
As described in Monitoring your project metrics, use the web console to run queries for modelmesh_* metrics.
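For example, to discover which ModelMesh metric series are being collected for your project before writing more specific queries, you can run a label-matching query; this is a hedged sketch, and the project name is a placeholder:
# List all ModelMesh metric series currently reported for the selected data science project
{__name__=~"modelmesh_.+", namespace="<project_name>"}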
Viewing performance metrics for all models on a model server
You can monitor the following metrics for all the models that are deployed on a model server:
-
HTTP requests per 5 minutes - The number of HTTP requests that have failed or succeeded for all models on the server.
-
Average response time (ms) - For all models on the server, the average time it takes the model server to respond to requests.
-
CPU utilization (%) - The percentage of the CPU’s capacity that is currently being used by all models on the server.
-
Memory utilization (%) - The percentage of the system’s memory that is currently being used by all models on the server.
You can specify a time range and a refresh interval for these metrics to help you determine, for example, when the peak usage hours are and how the models are performing at a specified time.
-
You have installed Open Data Hub.
-
On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.
-
You have logged in to Open Data Hub.
-
You have deployed models on the multi-model serving platform.
-
From the Open Data Hub dashboard navigation menu, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that contains the data science models that you want to monitor.
-
In the project details page, click the Models tab.
-
In the row for the model server that you are interested in, click the action menu (⋮) and then select View model server metrics.
-
Optional: On the metrics page for the model server, set the following options:
-
Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
-
Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
-
-
Scroll down to view data graphs for HTTP requests per 5 minutes, average response time, CPU utilization, and memory utilization.
On the metrics page for the model server, the graphs provide data on performance metrics.
Viewing HTTP request metrics for a deployed model
You can view a graph that illustrates the HTTP requests that have failed or succeeded for a specific model that is deployed on the multi-model serving platform.
-
You have installed Open Data Hub.
-
On the OpenShift cluster where Open Data Hub is installed, user workload monitoring is enabled.
-
The following dashboard configuration options are set to the default values as shown:
disablePerformanceMetrics: false
disableKServeMetrics: false
For more information about setting dashboard configuration options, see Customizing the dashboard.
-
You have logged in to Open Data Hub.
-
You have deployed models on the multi-model serving platform.
-
From the Open Data Hub dashboard, click Models → Model deployments.
-
On the Model deployments page, select the model that you are interested in.
-
Optional: On the Endpoint performance tab, set the following options:
-
Time range - Specifies how long to track the metrics. You can select one of these values: 1 hour, 24 hours, 7 days, and 30 days.
-
Refresh interval - Specifies how frequently the graphs on the metrics page are refreshed (to show the latest data). You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 1 day.
-
The Endpoint performance tab shows a graph of the HTTP metrics for the model.
Making inference requests to deployed models
When you deploy a model, it is available as a service that you can access with API requests. This allows you to get predictions from your model based on the data you provide in the request.
Accessing the authentication token for a deployed model
If you secured your model inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify it in your inference requests.
-
You have logged in to Open Data Hub.
-
You have deployed a model by using the single-model serving platform.
-
From the Open Data Hub dashboard, click Data science projects.
The Data science projects page opens.
-
Click the name of the project that contains your deployed model.
A project details page opens.
-
Click the Models tab.
-
In the Models and model servers list, expand the section for your model.
Your authentication token is shown in the Token authentication section, in the Token secret field.
-
Optional: To copy the authentication token for use in an inference request, click the Copy button next to the token value.
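If you prefer the CLI, you can generate a token for the same service account instead of copying it from the dashboard. A hedged sketch; note that oc create token issues a new, time-limited token rather than reading the secret shown in the dashboard:
# Generate a short-lived token for the deployment's service account
oc create token <service_account_name> -n <project_name>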
Accessing the inference endpoint for a deployed model
To make inference requests to your deployed model, you must know how to access the inference endpoint that is available.
For a list of paths to use with the supported runtimes and example commands, see Inference endpoints.
-
You have logged in to Open Data Hub.
-
You have deployed a model by using the single-model serving platform.
-
If you enabled token authentication for your deployed model, you have the associated token value.
-
From the Open Data Hub dashboard, click Models → Model deployments.
The inference endpoint for the model is shown in the Inference endpoint field.
-
Depending on what action you want to perform with the model (and if the model supports that action), copy the inference endpoint and then add a path to the end of the URL.
-
Use the endpoint to make API requests to your deployed model.
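If you prefer to retrieve the endpoint programmatically, you can read it from the InferenceService resource; a hedged sketch, with the model name and project as placeholders:
# Print the base inference endpoint URL recorded in the InferenceService status
oc get inferenceservice <model_name> -n <project_name> -o jsonpath='{.status.url}'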
Making inference requests to models deployed on the single-model serving platform
When you deploy a model by using the single-model serving platform, the model is available as a service that you can access using API requests. This enables you to return predictions based on data inputs. To use API requests to interact with your deployed model, you must know the inference endpoint for the model.
In addition, if you secured your inference endpoint by enabling token authentication, you must know how to access your authentication token so that you can specify this in your inference requests.
Inference endpoints
These examples show how to use inference endpoints to query the model.
Note: If you enabled token authentication when deploying the model, add the Authorization header to your inference requests and specify a token value, as shown in the following examples.
Caikit TGIS ServingRuntime for KServe
-
:443/api/v1/task/text-generation
-
:443/api/v1/task/server-streaming-text-generation
curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' \
https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation \
-H 'Authorization: Bearer <token>'
Caikit Standalone ServingRuntime for KServe
If you are serving multiple models, you can query /info/models or :443 caikit.runtime.info.InfoService/GetModelsInfo to view a list of served models.
-
/api/v1/task/embedding
-
/api/v1/task/embedding-tasks
-
/api/v1/task/sentence-similarity
-
/api/v1/task/sentence-similarity-tasks
-
/api/v1/task/rerank
-
/api/v1/task/rerank-tasks
-
/info/models
-
/info/version
-
/info/runtime
-
:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
-
:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
-
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
-
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
-
:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
-
:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
-
:443 caikit.runtime.info.InfoService/GetModelsInfo
-
:443 caikit.runtime.info.InfoService/GetRuntimeInfo
Note: By default, the Caikit Standalone Runtime exposes REST endpoints. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see Adding a custom model-serving runtime for the single-model serving platform.
An example manifest is available in the caikit-tgis-serving GitHub repository.
REST
curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'
gRPC
grpcurl -d '{"text": "<text>"}' -H "mm-model-id: <model_id>" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'
TGIS Standalone ServingRuntime for KServe
Important: The Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe is deprecated. For more information, see Open Data Hub release notes.
-
:443 fmaas.GenerationService/Generate
-
:443 fmaas.GenerationService/GenerateStream
Note: To query the endpoint for the TGIS standalone runtime, you must also download the files in the proto directory of the Open Data Hub text-generation-inference repository.
grpcurl -proto text-generation-inference/proto/generation.proto -d \
'{"requests": [{"text":"<text>"}]}' \
-insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate \
-H 'Authorization: Bearer <token>'
OpenVINO Model Server
-
/v2/models/<model-name>/infer
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer \
  -d '{"model_name": "<model_name>", "inputs": [{"name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>]}]}' \
  -H 'Authorization: Bearer <token>'
vLLM NVIDIA GPU ServingRuntime for KServe
-
:443/version
-
:443/docs
-
:443/v1/models
-
:443/v1/chat/completions
-
:443/v1/completions
-
:443/v1/embeddings
-
:443/tokenize
-
:443/detokenize
Note:
-
The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see Supported models.
-
To use the embeddings inference endpoint in vLLM, you must use an embeddings model that the vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see Supported embeddings models in vLLM.
-
As of vLLM v0.5.5, you must provide a chat template while querying a model using the /v1/chat/completions endpoint. If your model does not include a predefined chat template, you can use the chat-template command-line parameter to specify a chat template in your custom vLLM runtime, as shown in the example. Replace <CHAT_TEMPLATE> with the path to your template.
containers:
- args:
  - --chat-template=<CHAT_TEMPLATE>
You can use the chat templates that are available as .jinja files here or with the vLLM image under /app/data/template. For more information, see Chat templates.
As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.
-
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "<role>", "content": "<content>"}]}' \
  -H 'Authorization: Bearer <token>'
vLLM Intel Gaudi Accelerator ServingRuntime for KServe
vLLM AMD GPU ServingRuntime for KServe
NVIDIA Triton Inference Server
-
v2/models/<model_name>[/versions/<model_version>]/infer
-
v2/models/<model_name>[/versions/<model_version>]
-
v2/health/ready
-
v2/health/live
-
v2/models/<model_name>[/versions/<model_version>]/ready
-
v2
Note: ModelMesh does not support the following REST endpoints:
-
v2/health/live
-
v2/health/ready
-
v2/models/<model_name>[/versions/<model_version>]/ready
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer \
  -d '{"model_name": "<model_name>", "inputs": [{"name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>]}]}' \
  -H 'Authorization: Bearer <token>'
-
:443 inference.GRPCInferenceService/ModelInfer
-
:443 inference.GRPCInferenceService/ModelReady
-
:443 inference.GRPCInferenceService/ModelMetadata
-
:443 inference.GRPCInferenceService/ServerReady
-
:443 inference.GRPCInferenceService/ServerLive
-
:443 inference.GRPCInferenceService/ServerMetadata
grpcurl -cacert ./openshift_ca_istio_knative.crt \
-proto ./grpc_predict_v2.proto \
-d @ \
-H "Authorization: Bearer <token>" \
<inference_endpoint_url>:443 \
inference.GRPCInferenceService/ModelMetadata
Seldon MLServer
-
v2/models/<model_name>[/versions/<model_version>]/infer
-
v2/models/<model_name>[/versions/<model_version>]
-
v2/health/ready
-
v2/health/live
-
v2/models/<model_name>[/versions/<model_version>]/ready
-
v2
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer \
  -d '{"model_name": "<model_name>", "inputs": [{"name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>]}]}' \
  -H 'Authorization: Bearer <token>'
-
:443 inference.GRPCInferenceService/ModelInfer
-
:443 inference.GRPCInferenceService/ModelReady
-
:443 inference.GRPCInferenceService/ModelMetadata
-
:443 inference.GRPCInferenceService/ServerReady
-
:443 inference.GRPCInferenceService/ServerLive
-
:443 inference.GRPCInferenceService/ServerMetadata
grpcurl -cacert ./openshift_ca_istio_knative.crt \
-proto ./grpc_predict_v2.proto \
-d @ \
-H "Authorization: Bearer <token>" \
<inference_endpoint_url>:443 \
inference.GRPCInferenceService/ModelMetadata