Managing and monitoring models
Managing model-serving runtimes
As a cluster administrator, you can create a custom model-serving runtime and edit the inference service for a model deployed in Open Data Hub.
Adding a custom model-serving runtime for the single-model serving platform
A model-serving runtime adds support for a specified set of model frameworks and the model formats supported by those frameworks. You can use the preinstalled runtimes that are included with Open Data Hub. You can also add your own custom runtimes if the default runtimes do not meet your needs.
As an administrator, you can use the Open Data Hub interface to add and enable a custom model-serving runtime. You can then choose the custom runtime when you deploy a model on the single-model serving platform.
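For reference, a custom runtime is defined as a KServe ServingRuntime resource. The following is a minimal sketch only, not a supported configuration: the runtime name, model format, image, arguments, and environment variables are placeholders that you replace with values for your own runtime.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: my-custom-runtime                 # placeholder name
spec:
  supportedModelFormats:
    - name: my-model-format               # placeholder model format
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/<org>/<custom-runtime-image>:<tag>   # your custom runtime image
      args:
        - --port=8080
      env:
        - name: EXAMPLE_RUNTIME_OPTION    # custom runtimes often need extra env parameters
          value: "example-value"
      ports:
        - containerPort: 8080
          protocol: TCP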
Note: Red Hat does not provide support for custom runtimes. You are responsible for ensuring that you are licensed to use any custom runtimes that you add, and for correctly configuring and maintaining them.
-
You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
-
You have built your custom runtime and added the image to a container image repository such as Quay.
-
From the Open Data Hub dashboard, click Settings → Serving runtimes.
The Serving runtimes page opens and shows the model-serving runtimes that are already installed and enabled.
-
To add a custom runtime, choose one of the following options:
-
To start with an existing runtime (for example, vLLM NVIDIA GPU ServingRuntime for KServe), click the action menu (⋮) next to the existing runtime and then click Duplicate.
-
To add a new custom runtime, click Add serving runtime.
-
-
In the Select the model serving platforms this runtime supports list, select Single-model serving platform.
-
In the Select the API protocol this runtime supports list, select REST or gRPC.
-
Optional: If you started a new runtime (rather than duplicating an existing one), add your code by choosing one of the following options:
-
Upload a YAML file
-
Click Upload files.
-
In the file browser, select a YAML file on your computer.
The embedded YAML editor opens and shows the contents of the file that you uploaded.
-
-
Enter YAML code directly in the editor
-
Click Start from scratch.
-
Enter or paste YAML code directly in the embedded editor.
-
Note: In many cases, creating a custom runtime requires adding new or custom parameters to the env section of the ServingRuntime specification.
-
Click Add.
The Serving runtimes page opens and shows the updated list of runtimes that are installed. Observe that the custom runtime that you added is automatically enabled. The API protocol that you specified when creating the runtime is shown.
-
Optional: To edit your custom runtime, click the action menu (⋮) and select Edit.
-
The custom model-serving runtime that you added is shown in an enabled state on the Serving runtimes page.
Managing and monitoring models on the single-model serving platform
As a cluster administrator, you can manage and monitor models on the single-model serving platform. You can configure monitoring for the single-model serving platform, deploy models across multiple GPU nodes and set up a Grafana dashboard to visualize real-time metrics, among other tasks.
Setting a timeout for KServe
When deploying large models or using node autoscaling with KServe, the operation may time out before a model is deployed because the default progress-deadline that Knative Serving sets is 10 minutes.
If a pod using Knative Serving takes longer than 10 minutes to deploy, the pod might be automatically marked as failed. This can happen if you are deploying large models that take longer than 10 minutes to pull from S3-compatible object storage or if you are using node autoscaling to reduce the consumption of GPU nodes.
To resolve this issue, you can set a custom progress-deadline in the KServe InferenceService for your application.
-
You have namespace edit access for your OpenShift Container Platform cluster.
-
Log in to the OpenShift Container Platform console as a cluster administrator.
-
Select the project where you have deployed the model.
-
In the Administrator perspective, click Home → Search.
-
From the Resources dropdown menu, search for InferenceService.
-
Under spec.predictor.annotations, modify the serving.knative.dev/progress-deadline annotation with the new timeout:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-inference-service
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 30m

Note: Ensure that you set the progress-deadline at the spec.predictor.annotations level, so that the KServe InferenceService can copy the progress-deadline back to the Knative Service object.
Deploying models by using multiple GPU nodes
Deploy models across multiple GPU nodes to handle large models, such as large language models (LLMs).
You can serve models on Open Data Hub across multiple GPU nodes using the vLLM serving framework. Multi-node inferencing uses the vllm-multinode-runtime custom runtime, which uses the same image as the vLLM NVIDIA GPU ServingRuntime for KServe runtime and also includes information necessary for multi-GPU inferencing.
You can deploy the model from a persistent volume claim (PVC) or from an Open Container Initiative (OCI) container image.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You have enabled the operators for your GPU type, such as the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information about enabling accelerators, see Working with accelerators.
-
You are using an NVIDIA GPU (
nvidia.com/gpu). -
You have specified the GPU type through either the ServingRuntime or the InferenceService. If the GPU type specified in the ServingRuntime differs from what is set in the InferenceService, both GPU types are assigned to the resource and can cause errors.
-
-
You have enabled KServe on your cluster.
-
You have only one head pod in your setup. Do not adjust the replica count using the min_replicas or max_replicas settings in the InferenceService. Creating additional head pods can cause them to be excluded from the Ray cluster.
-
To deploy from a PVC: You have a persistent volume claim (PVC) set up and configured for ReadWriteMany (RWX) access mode.
-
To deploy from an OCI container image:
-
You have stored a model in an OCI container image.
-
If the model is stored in a private OCI repository, you have configured an image pull secret.
-
-
In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift Container Platform CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password> -
Select or create a namespace for deploying the model. For example, run the following command to create the kserve-demo namespace:

oc new-project kserve-demo
-
(Deploying a model from a PVC only) Create a PVC for model storage in the namespace where you want to deploy the model. Create a storage class using Filesystem volumeMode and use this storage class for your PVC. The storage size must be larger than the size of the model files on disk. For example:
Note: If you have already configured a PVC or are deploying a model from an OCI container image, you can skip this step.

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-8b-code-base-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: <model size>
  storageClassName: <storage class>
EOF
-
Create a pod to download the model to the PVC you created. Update the sample YAML with your bucket name, model path, and credentials:
apiVersion: v1
kind: Pod
metadata:
  name: download-granite-8b-code
  labels:
    name: download-granite-8b-code
spec:
  volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: granite-8b-code-base-pvc
  restartPolicy: Never
  initContainers:
    - name: fix-volume-permissions
      image: quay.io/quay/busybox@sha256:92f3298bf80a1ba949140d77987f5de081f010337880cd771f7e7fc928f8c74d
      command: ["sh"]
      args: ["-c", "mkdir -p /mnt/models/$(MODEL_PATH) && chmod -R 777 /mnt/models"] (1)
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume
      env:
        - name: MODEL_PATH
          value: <model path> (2)
  containers:
    - resources:
        requests:
          memory: 40Gi
      name: download-model
      imagePullPolicy: IfNotPresent
      image: quay.io/opendatahub/kserve-storage-initializer:v0.14 (3)
      args:
        - 's3://$(BUCKET_NAME)/$(MODEL_PATH)/'
        - /mnt/models/$(MODEL_PATH)
      env:
        - name: AWS_ACCESS_KEY_ID
          value: <id> (4)
        - name: AWS_SECRET_ACCESS_KEY
          value: <secret> (5)
        - name: BUCKET_NAME
          value: <bucket_name> (6)
        - name: MODEL_PATH
          value: <model path> (2)
        - name: S3_USE_HTTPS
          value: "1"
        - name: AWS_ENDPOINT_URL
          value: <AWS endpoint> (7)
        - name: awsAnonymousCredential
          value: 'false'
        - name: AWS_DEFAULT_REGION
          value: <region> (8)
        - name: S3_VERIFY_SSL
          value: 'true' (9)
      volumeMounts:
        - mountPath: "/mnt/models/"
          name: model-volume
-
The chmod operation is permitted only if your pod is running as root. Remove chmod -R 777 from the arguments if you are not running the pod as root.
-
Specify the path to the model.
-
The value for containers.image, located in your download-model container. To access this value, run the following command:

oc get configmap inferenceservice-config -n opendatahub -oyaml | grep kserve-storage-initializer:
-
The access key ID to your S3 bucket.
-
The secret access key to your S3 bucket.
-
The name of your S3 bucket.
-
The endpoint to your S3 bucket.
-
The region for your S3 bucket if using an AWS S3 bucket. If using other S3-compatible storage, such as ODF or Minio, you can remove the
AWS_DEFAULT_REGION environment variable.
-
If you encounter SSL errors, change
S3_VERIFY_SSL to false.
-
-
-
Create the vllm-multinode-runtime custom runtime in your project namespace:

oc process vllm-multinode-runtime-template -n opendatahub | oc apply -f -
-
Deploy the model using the following InferenceService configuration:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: <inference service name>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: <storage_uri_path> (1)
    workerSpec: {} (2)
-
Specify the path to your model based on your deployment method:
-
For PVC:
pvc://<pvc_name>/<model_path> -
For an OCI container image:
oci://<registry_host>/<org_or_username>/<repository_name><tag_or_digest>
-
-
The following configuration can be added to the InferenceService (see the sketch after this procedure):
-
workerSpec.tensorParallelSize: Determines how many GPUs are used per node. The GPU type count in both the head and worker node deployment resources is updated automatically. Ensure that the value of workerSpec.tensorParallelSize is at least 1.
-
workerSpec.pipelineParallelSize: Determines how many nodes are used to balance the model in deployment. This variable represents the total number of nodes, including both the head and worker nodes. Ensure that the value of workerSpec.pipelineParallelSize is at least 2. Do not modify this value in production environments.
Note: You may need to specify additional arguments, depending on your environment and model size.
-
-
-
Deploy the model by applying the InferenceService configuration:

oc apply -f <inference-service-file.yaml>
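For reference, the following sketch shows how the workerSpec parallelism settings described earlier in this procedure might look when added to the predictor. The values are examples only; choose values that match the number of GPUs per node and the total number of nodes in your environment.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <inference service name>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: <storage_uri_path>
    workerSpec:
      tensorParallelSize: 4      # example: GPUs used per node (minimum 1)
      pipelineParallelSize: 2    # example: total nodes, head plus workers (minimum 2)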
To confirm that you have set up your environment to deploy models on multiple GPU nodes, check the GPU resource status, the InferenceService status, the Ray cluster status, and send a request to the model.
-
Check the GPU resource status:
-
Retrieve the pod names for the head and worker nodes:
# Get pod names
podName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(oc get pod -l app=isvc.granite-8b-code-base-pvc-predictor-worker --no-headers|cut -d' ' -f1)

oc wait --for=condition=ready pod/${podName} --timeout=300s

# Check the GPU memory size for both the head and worker pods:
echo "### HEAD NODE GPU Memory Size"
kubectl exec $podName -- nvidia-smi
echo "### Worker NODE GPU Memory Size"
kubectl exec $workerPodName -- nvidia-smi

Sample response:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   33C    P0             71W / 300W  | 19031MiB / 23028MiB <1>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
...
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0             69W / 300W  | 18959MiB / 23028MiB <2>|      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Confirm that the model loaded properly by checking the values of <1> and <2>. If the model did not load, the value of these fields is 0MiB.
-
-
Verify the status of your InferenceService using the following commands:

oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export MODEL_NAME=granite-8b-code-base-pvc

Sample response:

NAME                       URL                                                    READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
granite-8b-code-base-pvc   http://granite-8b-code-base-pvc.default.example.com
-
Send a request to the model to confirm that the model is available for inference:
oc wait --for=condition=ready pod/${podName} -n vllm-multinode --timeout=300s

oc port-forward $podName 8080:8080 &

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"$MODEL_NAME"'",
        "prompt": "At what temperature does Nitrogen boil?",
        "max_tokens": 100,
        "temperature": 0
      }'
Configuring an inference service for Kueue
To queue your inference service workloads and manage their resources, add the kueue.x-k8s.io/queue-name label to the service’s metadata. This label directs the workload to a specific LocalQueue for management and is required only if your project is enabled for Kueue.
For more information, see Managing workloads with Kueue.
-
You have permissions to edit resources in the project where the model is deployed.
-
As a cluster administrator, you have installed and activated the Red Hat build of Kueue Operator as described in Configuring workload management with Kueue.
To configure the inference service, complete the following steps:
-
Log in to the OpenShift Container Platform console.
-
In the Administrator perspective, navigate to your project and locate the
InferenceService resource for your model.
-
Click the name of the InferenceService to view its details.
-
Select the YAML tab to open the editor.
-
In the metadata section, add the kueue.x-k8s.io/queue-name label under labels. Replace <local-queue-name> with the name of your target LocalQueue.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model-name>
  namespace: <project-namespace>
  labels:
    kueue.x-k8s.io/queue-name: <local-queue-name>
...
-
Click Save.
-
The workload is submitted to the
LocalQueue specified in the kueue.x-k8s.io/queue-name label.
-
The workload starts when the required cluster resources are available and admitted by the queue.
-
Optional: To verify, run the following command and review the
Admitted Workloads section:

$ oc describe localqueue <local-queue-name> -n <project-namespace>
Configuring an inference service for Spyre
If you are deploying a model using a hardware profile that relies on Spyre schedulers, you must manually edit the InferenceService YAML after deployment to add the required scheduler name and tolerations. This step is necessary because the user interface does not currently provide an option to specify a custom scheduler.
-
You have deployed a model on OpenShift Container Platform by using the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime.
-
You have privileges to edit resources in the project where the model is deployed.
To configure the inference service, complete the following steps:
-
Log in to the OpenShift Container Platform console.
-
From the perspective dropdown menu, select Administrator.
-
From the Project dropdown menu, select the project where your model is deployed.
-
Navigate to Home > Search.
-
From the Resources dropdown menu, select
InferenceService. -
Click the name of the
InferenceService resource associated with your model.
-
Select the YAML tab.
-
Edit the spec.predictor section to add the schedulerName and tolerations fields as shown in the following example:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
spec:
  predictor:
    schedulerName: spyre-scheduler
    tolerations:
      - effect: NoSchedule
        key: ibm.com/spyre_pf
        operator: Exists
-
Click Save.
After you save the YAML, the existing pod for the model is terminated and a new pod is created.
-
Navigate to Workloads > Pods.
-
Click the new pod for your model to view its details.
-
On the Details page, verify that the pod is running on a Spyre node by checking the Node information.
Optimizing performance and tuning
You can optimize and tune your deployed models to balance speed, efficiency, and cost for different use cases.
To evaluate a model’s inference performance, consider these key metrics:
-
Latency: The time it takes to generate a response, which is critical for real-time applications. This includes Time-to-First-Token (TTFT) and Inter-Token Latency (ITL).
-
Throughput: The overall efficiency of the model server, measured in Tokens per Second (TPS) or Requests per Second (RPS).
-
Cost per million tokens: The cost-effectiveness of the model’s inference.
Performance is influenced by factors like model size, available GPU memory, and input sequence length, especially for applications like text-summarization and retrieval-augmented generation (RAG). To meet your performance requirements, you can use techniques such as quantization to reduce memory needs or parallelism to distribute very large models across multiple GPUs.
Determining GPU requirements for LLM-powered applications
There are several factors to consider when choosing GPUs for applications powered by a Large Language Model (LLM) hosted on Open Data Hub.
The following guidelines help you determine the hardware requirements for your application, depending on the size and expected usage of your model.
-
Estimating memory needs: A general rule of thumb is that a model with
N parameters in 16-bit precision requires approximately 2N bytes of GPU memory. For example, an 8-billion-parameter model requires around 16GB of GPU memory, while a 70-billion-parameter model requires around 140GB.
-
Quantization: To reduce memory requirements and potentially improve throughput, you can use quantization to load or run the model at lower-precision formats such as INT8, FP8, or INT4. This reduces the memory footprint at the expense of a slight reduction in model accuracy.
Note: The vLLM ServingRuntime for KServe model-serving runtime supports several quantization methods. For more information about supported implementations and compatible hardware, see Supported hardware for quantization kernels.
-
Additional memory for key-value cache: In addition to model weights, GPU memory is also needed to store the attention key-value (KV) cache, which increases with the number of requests and the sequence length of each request. This can impact performance in real-time applications, especially for larger models.
-
Recommended GPU configurations:
-
Small Models (1B–8B parameters): For models in this range, a GPU with 24GB of memory is generally sufficient to support a small number of concurrent users.
-
Medium Models (10B–34B parameters):
-
Models under 20B parameters require at least 48GB of GPU memory.
-
Models between 20B and 34B parameters require at least 80GB of memory on a single GPU.
-
-
Large Models (70B parameters): Models in this range may need to be distributed across multiple GPUs by using tensor parallelism techniques. Tensor parallelism allows the model to span multiple GPUs, improving inter-token latency and increasing the maximum batch size by freeing up additional memory for KV cache. Tensor parallelism works best when GPUs have fast interconnects such as NVLink.
-
Very Large Models (405B parameters): For extremely large models, quantization is recommended to reduce memory demands. You can also distribute the model using pipeline parallelism across multiple GPUs, or even across two servers. This approach allows you to scale beyond the memory limitations of a single server, but requires careful management of inter-server communication for optimal performance.
-
For best results, start with smaller models and then scale up to larger models as required, using techniques such as parallelism and quantization to meet your performance and memory requirements.
Performance considerations for text-summarization and retrieval-augmented generation (RAG) applications
There are additional factors that need to be taken into consideration for text-summarization and RAG applications, as well as for LLM-powered services that process large documents uploaded by users.
-
Longer Input Sequences: The input sequence length can be significantly longer than in a typical chat application, if each user query includes a large prompt or a large amount of context such as an uploaded document. The longer input sequence length increases the prefill time, the time the model takes to process the initial input sequence before generating a response, which can then lead to a higher Time-to-First-Token (TTFT). A longer TTFT may impact the responsiveness of the application. Minimize this latency for optimal user experience.
-
KV Cache Usage: Longer sequences require more GPU memory for the key-value (KV) cache. The KV cache stores intermediate attention data to improve model performance during generation. A high KV cache utilization per request requires a hardware setup with sufficient GPU memory. This is particularly crucial if multiple users are querying the model concurrently, as each request adds to the total memory load.
-
Optimal Hardware Configuration: To maintain responsiveness and avoid memory bottlenecks, select a GPU configuration with sufficient memory. For instance, instead of running an 8B model on a single 24GB GPU, deploying it on a larger GPU (e.g., 48GB or 80GB) or across multiple GPUs can improve performance by providing more memory headroom for the KV cache and reducing inter-token latency. Multi-GPU setups with tensor parallelism can also help manage memory demands and improve efficiency for larger input sequences.
In summary, to ensure optimal responsiveness and scalability for document-based applications, you must prioritize hardware with high GPU memory capacity and also consider multi-GPU configurations to handle the increased memory requirements of long input sequences and KV caching.
Inference performance metrics
Latency, throughput and cost per million tokens are key metrics to consider when evaluating the response generation efficiency of a model during inferencing. These metrics provide a comprehensive view of a model’s inference performance and can help balance speed, efficiency, and cost for different use cases.
Latency
Latency is critical for interactive or real-time use cases, and is measured using the following metrics:
-
Time-to-First-Token (TTFT): The delay in milliseconds between the initial request and the generation of the first token. This metric is important for streaming responses.
-
Inter-Token Latency (ITL): The time taken in milliseconds to generate each subsequent token after the first, also relevant for streaming.
-
Time-Per-Output-Token (TPOT): For non-streaming requests, the average time taken in milliseconds to generate each token in an output sequence.
Throughput
Throughput measures the overall efficiency of a model server and is expressed with the following metrics:
-
Tokens per Second (TPS): The total number of tokens generated per second across all active requests.
-
Requests per Second (RPS): The number of requests processed per second. RPS, like response time, is sensitive to sequence length.
Cost per million tokens
Cost per Million Tokens measures the cost-effectiveness of a model’s inference, indicating the expense incurred per million tokens generated. This metric helps to assess both the economic feasibility and scalability of deploying the model.
Configuring metrics-based autoscaling
Knative-based autoscaling is not available in KServe RawDeployment mode. However, you can enable metrics-based autoscaling for an inference service in this mode. Metrics-based autoscaling helps you efficiently manage accelerator resources, lower operational costs, and ensure that your inference services meet performance requirements.
To set up autoscaling for your inference service in KServe RawDeployment mode, install and configure the OpenShift Custom Metrics Autoscaler (CMA), which is based on Kubernetes Event-driven Autoscaling (KEDA). You can then use various model runtime metrics available in OpenShift Monitoring to trigger autoscaling of your inference service, such as KVCache utilization, Time to First Token (TTFT), and Concurrency.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have installed the CMA operator on your cluster. For more information, see Installing the custom metrics autoscaler.
Note-
You must configure the
KedaController resource after installing the CMA operator.
-
The
odh-controller automatically creates the TriggerAuthentication, ServiceAccount, Role, RoleBinding, and Secret resources to allow CMA access to OpenShift Monitoring metrics.
-
-
You have enabled User Workload Monitoring (UWM) for your cluster. For more information, see Configuring user workload monitoring.
-
You have deployed a model on the single-model serving platform in KServe RawDeployment mode.
-
Log in to the OpenShift Container Platform console as a cluster administrator.
-
In the Administrator perspective, click Home → Search.
-
Select the project where you have deployed your model.
-
From the Resources dropdown menu, select InferenceService.
-
Click the
InferenceService for your deployed model and then click YAML.
-
Under spec.predictor, define a metric-based autoscaling policy similar to the following example:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-inference-service
  namespace: my-namespace
  annotations:
    serving.kserve.io/autoscalerClass: keda
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoscaling:
      metrics:
        - type: External
          external:
            metric:
              backend: "prometheus"
              serverAddress: "https://thanos-querier.openshift-monitoring.svc:9092"
              query: vllm:num_requests_waiting
            authenticationRef:
              name: inference-prometheus-auth
              authModes: bearer
            target:
              type: Value
              value: 2

The example configuration sets up the inference service to autoscale between 1 and 5 replicas based on the number of requests waiting to be processed, as indicated by the vllm:num_requests_waiting metric.
-
Click Save.
-
Confirm that the KEDA
ScaledObject resource is created:

oc get scaledobject -n <namespace>
Guidelines for metrics-based autoscaling
You can use metrics-based autoscaling to scale your AI workloads based on latency or throughput-focused Service Level Objectives (SLOs) as opposed to traditional request concurrency. Metrics-based autoscaling is based on Kubernetes Event-driven Autoscaling (KEDA).
Traditional scaling methods, which depend on factors such as request concurrency, request rate, or CPU utilization, are not effective for scaling LLM inference servers that operate on GPUs. In contrast, vLLM capacity is determined by the size of the GPU and the total number of tokens processed simultaneously. You can use custom metrics to help with autoscaling decisions to meet your SLOs.
The following guidelines can help you autoscale AI inference workloads, including selecting metrics, defining sliding windows, configuring HPA scale-down settings, and taking model size into account for optimal scaling performance.
Choosing metrics for latency and throughput-optimized scaling
For latency-sensitive applications, choose scaling metrics depending on the characteristics of the requests:
-
When sequence lengths vary, use service level objectives (SLOs) for Time to First Token (TTFT) and Inter-Token Latency (ITL). These metrics provide more scaling signals because they are less affected by changes in sequence length.
-
Use
end-to-end request latency to trigger autoscaling when requests have similar sequence lengths.
End-to-end (e2e) request latency depends on sequence length, posing challenges for use cases with high variance in input/output token counts. A 10-token completion and a 2000-token completion have vastly different latencies even under identical system conditions. To maximize throughput without latency constraints, use the vllm:num_requests_waiting > 0.1 metric (a KEDA ScaledObject does not support a threshold of 0) to scale your workloads. This metric scales up the system as soon as a request is queued, which maximizes utilization and prevents a backlog. This strategy works best when input and output sequence lengths are consistent.
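For illustration, a KEDA ScaledObject Prometheus trigger that implements this queue-based strategy might look similar to the following sketch. On the single-model serving platform the ScaledObject is typically generated for you from the InferenceService autoscaling configuration, so treat this as a reference for the trigger shape rather than an object you must create by hand; the resource names are placeholders.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-inference-service-scaler          # placeholder name
spec:
  scaleTargetRef:
    name: my-inference-service-predictor     # placeholder deployment name
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc:9092
        query: vllm:num_requests_waiting
        threshold: "0.1"                     # scale up as soon as a request is queued; 0 is not supported
        authModes: bearer
      authenticationRef:
        name: inference-prometheus-auth      # assumed TriggerAuthentication name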
To build effective metrics-based autoscaling, follow these best practices:
-
Select the right metrics:
-
Analyze your load patterns to determine sequence length variance.
-
Choose TTFT/ITL for high-variance workloads, and E2E latency for uniform workloads.
-
Implement multiple metrics with different priorities for robust scaling decisions.
-
-
Progressively tune configurations:
-
Start with conservative thresholds and longer windows.
-
Monitor scaling behavior and SLO compliance over time.
-
Optimize the configuration based on observed patterns and business needs.
-
-
Validate behavior through testing:
-
Run load tests with realistic sequence length distributions.
-
Validate scaling under various traffic patterns.
-
Test edge cases, such as traffic spikes and gradual load increases.
-
Choosing the right sliding window
The sliding window length is the time period over which metrics are aggregated or evaluated to make scaling decisions. The sliding window length affects scaling responsiveness and stability.
The ideal window length depends on the metric you use:
-
For Time to First Token (TTFT) and Inter-Token Latency (ITL) metrics, you can use shorter windows (1-2 minutes) because they are less noisy.
-
For end-to-end latency metrics, you need longer windows (4-5 minutes) to account for variations in sequence length.
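For example, the evaluation window often appears as the range selector in the underlying Prometheus query. The following sketches reuse the vLLM histogram metrics shown later in this document; the label values are placeholders and the exact windows should be tuned for your workload:

# TTFT-based signal with a shorter window:
histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", model_name="$model_name"}[2m])))

# End-to-end latency signal with a longer window:
histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", model_name="$model_name"}[5m])))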
| Window length | Characteristics | Best for |
|---|---|---|
| Short (less than 30 seconds) | Does not effectively trigger autoscaling if the metric scraping interval is too long. | Not recommended. |
| Medium (60 seconds) | Responds quickly to load changes, but may lead to higher costs. Can cause rapid scaling up and down, also known as thrashing. | Workloads with sharp, unpredictable spikes. |
| Long (over 4 minutes) | Balances responsiveness and stability while reducing unnecessary scaling. Might miss brief spikes and adapt slowly to load changes. | Production workloads with moderate variability. |
Optimizing HPA scale-down configuration
Effective scale-down configuration is crucial for cost optimization and resource efficiency. It requires balancing the need to quickly terminate idle pods to reduce cluster load, with the consideration of maintaining them to avoid cold startup times. The Horizontal Pod Autoscaler (HPA) configuration for scale-down plays a critical role in removing idle pods promptly and preventing unnecessary resource usage.
You can control the HPA scale-down behavior by managing the KEDA ScaledObject custom resource (CR), which enables event-driven autoscaling for a specific workload.
To set the time that the HPA waits before scaling down, adjust the stabilizationWindowSeconds field as shown in the following example:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: my-app-scaler
spec:
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleDown:
stabilizationWindowSeconds: 300
Considering model size for optimal scaling
Model size affects autoscaling behavior and resource use. The following table describes the typical characteristics for different model sizes and describes a scaling strategy to select when implementing metrics-based autoscaling for AI inference workloads.
| Model size | Memory footprint | Scaling strategy | Cold start time |
|---|---|---|---|
| Small (less than 3B) | Less than 6 GiB | Use aggressive scaling with lower resource buffers. | Up to 10 minutes to download and 30 seconds to load. |
| Medium (3B-10B) | 6-20 GiB | Use a more conservative scaling strategy. | Up to 30 minutes to download and 1 minute to load. |
| Large (greater than 10B) | Greater than 20 GiB | May require model sharding or quantization. | Up to several hours to download and minutes to load. |
For models with fewer than 3 billion parameters, you can reduce cold start latency with the following strategies:
-
Optimize container images by embedding models directly into the image instead of downloading them at runtime. You can also use multi-stage builds to reduce the final image size and use image layer caching for faster container pulls.
-
Cache models on a Persistent Volume Claim (PVC) to share storage across replicas. Configure your inference service to use the PVC to access the cached model.
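As a hedged sketch of the PVC caching strategy, the inference service can point its storageUri at a model that has already been downloaded to a shared PVC, so new replicas skip the download step. The PVC name and model path are placeholders:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model-name>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: pvc://<pvc_name>/<model_path>   # model cached once on a ReadWriteMany PVC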
Monitoring models on the single-model serving platform
Configuring monitoring for the single-model serving platform
The single-model serving platform includes metrics for supported runtimes of the KServe component. KServe does not generate its own metrics, and relies on the underlying model-serving runtimes to provide them. The set of available metrics for a deployed model depends on its model-serving runtime.
In addition to runtime metrics for KServe, you can also configure monitoring for OpenShift Service Mesh. The OpenShift Service Mesh metrics help you to understand dependencies and traffic flow between components in the mesh.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have created OpenShift Service Mesh and Knative Serving instances and installed KServe.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.
-
You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.
-
You have assigned the
monitoring-rules-view role to users that will monitor metrics.
-
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI (
oc) as shown in the following example:

$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
-
Define a
ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: debug
      retention: 15d

The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.
-
Apply the configuration to create the
user-workload-monitoring-config object.

$ oc apply -f uwm-cm-conf.yaml
-
Define another
ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

The cluster-monitoring-config object enables monitoring for user-defined projects.
-
Apply the configuration to create the
cluster-monitoring-config object.

$ oc apply -f uwm-cm-enable.yaml
-
Create
ServiceMonitor and PodMonitor objects to monitor metrics in the service mesh control plane as follows:
-
Create an
istiod-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-monitor
  namespace: istio-system
spec:
  targetLabels:
    - app
  selector:
    matchLabels:
      istio: pilot
  endpoints:
    - port: http-monitoring
      interval: 30s
-
Deploy the
ServiceMonitor CR in the specified istio-system namespace.

$ oc apply -f istiod-monitor.yaml

You see the following output:
servicemonitor.monitoring.coreos.com/istiod-monitor created -
Create an
istio-proxies-monitor.yaml YAML file with the following contents:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-proxies-monitor
  namespace: istio-system
spec:
  selector:
    matchExpressions:
      - key: istio-prometheus-ignore
        operator: DoesNotExist
  podMetricsEndpoints:
    - path: /stats/prometheus
      interval: 30s
-
Deploy the
PodMonitor CR in the specified istio-system namespace.

$ oc apply -f istio-proxies-monitor.yaml

You see the following output:
podmonitor.monitoring.coreos.com/istio-proxies-monitor created
-
Using Grafana to monitor model performance
You can deploy a Grafana metrics dashboard to monitor the performance and resource usage of your models. Metrics dashboards can help you visualize key metrics for your model-serving runtimes and hardware accelerators.
Deploying a Grafana metrics dashboard
You can deploy a Grafana metrics dashboard for User Workload Monitoring (UWM) to monitor performance and resource usage metrics for models deployed on the single-model serving platform.
You can create a Kustomize overlay, similar to this example. Use the overlay to deploy preconfigured metrics dashboards for models deployed with OpenVino Model Server (OVMS) and vLLM.
-
You have cluster admin privileges for your OpenShift cluster.
-
A cluster admin has enabled user workload monitoring (UWM) for user-defined projects on your OpenShift cluster. For more information, see Enabling monitoring for user-defined projects and Configuring monitoring for the single-model serving platform.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You have created an overlay to deploy a Grafana instance, similar to this example.
-
For vLLM deployments, see examples in Monitoring Dashboards.
NoteTo view GPU metrics, you must enable the NVIDIA GPU monitoring dashboard as described in Enabling the GPU monitoring dashboard. The GPU monitoring dashboard provides a comprehensive view of GPU utilization, memory usage, and other metrics for your GPU nodes.
-
-
In a terminal window, log in to the OpenShift CLI (
oc) as a cluster administrator. -
If you have not already created the overlay to install the Grafana operator and metrics dashboards, refer to the RHOAI UWM repository to create it.
-
Install the Grafana instance and metrics dashboards on your OpenShift cluster with the overlay that you created. Replace
<overlay-name> with the name of your overlay.

oc apply -k overlays/<overlay-name>
-
Retrieve the URL of the Grafana instance. Replace
<namespace> with the namespace that contains the Grafana instance.

oc get route -n <namespace> grafana-route -o jsonpath='{.spec.host}'
-
Use the URL to access the Grafana instance:
grafana-<namespace>.apps.example-openshift.com
-
You can access the preconfigured dashboards available for KServe, vLLM and OVMS on the Grafana instance.
Deploying a vLLM/GPU metrics dashboard on a Grafana instance
Deploy Grafana dashboards to monitor accelerator and vLLM performance metrics.
-
You have deployed a Grafana metrics dashboard, as described in Deploying a Grafana metrics dashboard.
-
You can access a Grafana instance.
-
You have installed
envsubst, a command-line tool used to substitute environment variables in configuration files. For more information, see the GNU gettext documentation.
-
Define a
GrafanaDashboard object in a YAML file, similar to the following examples:
-
To monitor NVIDIA accelerator metrics, see
nvidia-vllm-dashboard.yaml. -
To monitor AMD accelerator metrics, see
amd-vllm-dashboard.yaml. -
To monitor Intel accelerator metrics, see
gaudi-vllm-dashboard.yaml. -
To monitor vLLM metrics, see
grafana-vllm-dashboard.yaml.
-
-
Create an
inputs.env file similar to the following example. Replace the NAMESPACE and MODEL_NAME parameters with your own values:

NAMESPACE=<namespace> (1)
MODEL_NAME=<model-name> (2)
-
NAMESPACE is the target namespace where the model will be deployed.
-
MODEL_NAME is the model name as defined in your InferenceService. The model name is also used to filter the pod name in the Grafana dashboard.
-
-
Replace the
NAMESPACE and MODEL_NAME parameters in your YAML file with the values from the inputs.env file by performing the following actions:
-
Export the parameters described in the
inputs.env file as environment variables:

export $(cat inputs.env | xargs)
-
Update the following YAML file, replacing the
${NAMESPACE} and ${MODEL_NAME} variables with the values of the exported environment variables, and dashboard_template.yaml with the name of the GrafanaDashboard object YAML file that you created earlier:

envsubst '${NAMESPACE} ${MODEL_NAME}' < dashboard_template.yaml > dashboard_template-replaced.yaml
-
-
Confirm that your YAML file contains updated values.
-
Deploy the dashboard object:
oc create -f dashboard_template-replaced.yaml
You can see the accelerator and vLLM metrics dashboard on your Grafana instance.
Grafana metrics
You can use Grafana dashboards to monitor the accelerator and vLLM performance metrics. The datasource, instance, and gpu are variables defined inside the dashboard.
Accelerator metrics
Track metrics on your accelerators to ensure the health of the hardware.
- NVIDIA GPU utilization
Tracks the percentage of time the GPU is actively processing tasks, indicating GPU workload levels.
Query
DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}
- NVIDIA GPU memory utilization
Compares memory usage against free memory, which is critical for identifying memory bottlenecks in GPU-heavy workloads.
Query
DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"}
Sum
sum(DCGM_FI_DEV_POWER_USAGE{instance=~"$instance", gpu=~"$gpu"})
- NVIDIA GPU temperature
Ensures the GPU operates within safe thermal limits to prevent hardware degradation.
Query
DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"}
Avg
avg(DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"})
- NVIDIA GPU throttling
GPU throttling occurs when the GPU automatically reduces the clock to avoid damage from overheating.
You can access the following metrics to identify GPU throttling:
-
GPU temperature: Monitor the GPU temperature. Throttling often occurs when the GPU reaches a certain temperature, for example, 85-90°C.
-
SM clock speed: Monitor the core clock speed. A significant drop in the clock speed while the GPU is under load indicates throttling.
CPU metrics
You can track metrics on your CPU to ensure the health of the hardware.
- CPU utilization
Tracks CPU usage to identify workloads that are CPU-bound.
Query
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)
- CPU-GPU bottlenecks
A combination of CPU throttling and GPU usage metrics to identify resource allocation inefficiencies. The following table outlines the combination of CPU throttling and GPU utilizations, and what these metrics mean for your environment:
| CPU throttling | GPU utilization | Meaning |
|---|---|---|
Low |
High |
System well-balanced. GPU is fully used without CPU constraints. |
High |
Low |
CPU resources are constrained. The CPU is unable to keep up with the GPU’s processing demands, and the GPU may be underused. |
High |
High |
Workload is increasing for both CPU and GPU, and you might need to scale up resources. |
Query
sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$model_name.*"}[5m])) by (namespace)
avg_over_time(DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}[5m])
vLLM metrics
You can track metrics related to your vLLM model.
- GPU and CPU cache utilization
Tracks the percentage of GPU memory used by the vLLM model, providing insights into memory efficiency.
Query
sum_over_time(vllm:gpu_cache_usage_perc{namespace="${namespace}",pod=~"$model_name.*"}[24h])
- Running requests
The number of requests actively being processed. Helps monitor workload concurrency.
num_requests_running{namespace="$namespace", pod=~"$model_name.*"}
- Waiting requests
Tracks requests in the queue, indicating system saturation.
num_requests_waiting{namespace="$namespace", pod=~"$model_name.*"}
- Prefix cache hit rates
High hit rates imply efficient reuse of cached computations, optimizing resource usage.
Queries
vllm:gpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
vllm:cpu_cache_usage_perc{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
- Request total count
Query
vllm:request_success_total{finished_reason="length",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
The request ended because it reached the maximum token limit set for the model inference.
Query
vllm:request_success_total{finished_reason="stop",namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}
The request completed naturally based on the model’s output or a stop condition, for example, the end of a sentence or token completion.
- End-to-end latency
-
Measures the overall time to process a request for an optimal user experience.
Histogram queries
histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:e2e_request_latency_seconds_sum{namespace="$namespace", pod=~"$model_name.*",model_name="$model_name"}[5m])
rate(vllm:e2e_request_latency_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
- Time to first token (TTFT) latency
The time taken to generate the first token in a response.
Histogram queries
histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_to_first_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_to_first_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
- Time per output token (TPOT) latency
The average time taken to generate each output token.
Histogram queries
histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])))
rate(vllm:time_per_output_token_seconds_sum{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:time_per_output_token_seconds_count{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
- Prompt token throughput and generation throughput
Tracks the speed of processing prompt tokens for LLM optimization.
Queries
rate(vllm:prompt_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
rate(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"}[5m])
- Total tokens generated
-
Measures the efficiency of generating response tokens, critical for real-time applications.
Query
sum(vllm:generation_tokens_total{namespace="$namespace", pod=~"$model_name.*", model_name="$model_name"})
Managing and monitoring models on the NVIDIA NIM model serving platform
As a cluster administrator, you can manage and monitor models on the NVIDIA NIM model serving platform. You can customize your NVIDIA NIM model selection options and enable metrics for a NIM model, among other tasks.
Customizing model selection options for the NVIDIA NIM model serving platform
The NVIDIA NIM model serving platform provides access to all available NVIDIA NIM models from the NVIDIA GPU Cloud (NGC). You can deploy a NIM model by selecting it from the NVIDIA NIM list in the Deploy model dialog. To customize the models that appear in the list, you can create a ConfigMap object specifying your preferred models.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have an NVIDIA Cloud Account (NCA) and can access the NVIDIA GPU Cloud (NGC) portal.
-
You know the IDs of the NVIDIA NIM models that you want to make available for selection on the NVIDIA NIM model serving platform.
Note-
You can find the model ID from the NGC Catalog. The ID is usually part of the URL path.
-
You can also find the model ID by using the NGC CLI. For more information, see NGC CLI reference.
-
-
You know the name and namespace of your
Account custom resource (CR).
-
In a terminal window, log in to the OpenShift Container Platform CLI as a cluster administrator as shown in the following example:
oc login <openshift_cluster_url> -u <admin_username> -p <password> -
Define a
ConfigMap object in a YAML file, similar to the one in the following example, containing the model IDs that you want to make available for selection on the NVIDIA NIM model serving platform:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-nim-enabled-models
data:
  models: |-
    [
      "mistral-nemo-12b-instruct",
      "llama3-70b-instruct",
      "phind-codellama-34b-v2-instruct",
      "deepseek-r1",
      "qwen-2.5-72b-instruct"
    ]
-
Confirm the name and namespace of your
Account CR:

oc get account -A

You see output similar to the following example:

NAMESPACE     NAME              TEMPLATE   CONFIGMAP   SECRET
opendatahub   odh-nim-account
-
Deploy the
ConfigMap object in the same namespace as your Account CR:

oc apply -f <configmap-name> -n <namespace>

Replace <configmap-name> with the name of your YAML file, and <namespace> with the namespace of your Account CR.
-
Add the
ConfigMap object that you previously created to the spec.modelListConfig section of your Account CR:

oc patch account <account-name> \
  --type='merge' \
  -p '{"spec": {"modelListConfig": {"name": "<configmap-name>"}}}'

Replace <account-name> with the name of your Account CR, and <configmap-name> with your ConfigMap object.
-
Confirm that the
ConfigMap object is added to your Account CR:

oc get account <account-name> -o yaml

You see the ConfigMap object in the spec.modelListConfig section of your Account CR, similar to the following output:

spec:
  enabledModelsConfig:
    modelListConfig:
      name: <configmap-name>
-
Follow the steps to deploy a model as described in Deploying models on the NVIDIA NIM model serving platform to deploy a NIM model. You see that the NVIDIA NIM list in the Deploy model dialog displays your preferred list of models instead of all the models available in the NGC catalog.
Enabling NVIDIA NIM metrics for an existing NIM deployment
If you have previously deployed a NIM model in Open Data Hub, and then upgraded to the latest version, you must manually enable NIM metrics for your existing deployment by adding annotations to enable metrics collection and graph generation.
Note: NIM metrics and graphs are automatically enabled for new deployments in the latest version of Open Data Hub.
Enabling graph generation for an existing NIM deployment
The following procedure describes how to enable graph generation for an existing NIM deployment.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You have an existing NIM deployment in Open Data Hub.
-
In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift CLI (
oc). -
Confirm the name of the
ServingRuntime associated with your NIM deployment:

oc get servingruntime -n <namespace>

Replace <namespace> with the namespace of the project where your NIM model is deployed.
-
Check for an existing
metadata.annotations section in the ServingRuntime configuration:

oc get servingruntime -n <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'

Replace <servingruntime-name> with the name of the ServingRuntime from the previous step.
-
Perform one of the following actions:
-
If the
metadata.annotations section is not present in the configuration, add the section with the required annotations:

oc patch servingruntime -n <namespace> <servingruntime-name> --type json --patch \
  '[{"op": "add", "path": "/metadata/annotations", "value": {"runtimes.opendatahub.io/nvidia-nim": "true"}}]'

You see output similar to the following:
servingruntime.serving.kserve.io/nim-serving-runtime patched -
If there is an existing
metadata.annotations section, add the required annotations to the section:

oc patch servingruntime -n <project-namespace> <runtime-name> --type json --patch \
  '[{"op": "add", "path": "/metadata/annotations/runtimes.opendatahub.io~1nvidia-nim", "value": "true"}]'

You see output similar to the following:
servingruntime.serving.kserve.io/nim-serving-runtime patched
-
-
Confirm that the annotation has been added to the
ServingRuntime of your existing NIM deployment.

oc get servingruntime -n <namespace> <servingruntime-name> -o json | jq '.metadata.annotations'

The annotation that you added is displayed in the output:

...
"runtimes.opendatahub.io/nvidia-nim": "true"

Note: For metrics to be available for graph generation, you must also enable metrics collection for your deployment. See Enabling metrics collection for an existing NIM deployment.
Enabling metrics collection for an existing NIM deployment
To enable metrics collection for your existing NIM deployment, you must manually add the Prometheus endpoint and port annotations to the InferenceService of your deployment.
The following procedure describes how to add the required Prometheus annotations to the InferenceService of your NIM deployment.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You have an existing NIM deployment in Open Data Hub.
-
In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift CLI (
oc). -
Confirm the name of the
InferenceService associated with your NIM deployment:

oc get inferenceservice -n <namespace>

Replace <namespace> with the namespace of the project where your NIM model is deployed.
-
Check if there is an existing
spec.predictor.annotations section in the InferenceService configuration:

oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'

Replace <inferenceservice-name> with the name of the InferenceService from the previous step.
-
Perform one of the following actions:
-
If the
spec.predictor.annotations section does not exist in the configuration, add the section and required annotations:

oc patch inferenceservice -n <namespace> <inference-name> --type json --patch \
  '[{"op": "add", "path": "/spec/predictor/annotations", "value": {"prometheus.io/path": "/metrics", "prometheus.io/port": "8000"}}]'

You see output similar to the following:
inferenceservice.serving.kserve.io/nim-serving-runtime patched -
If there is an existing
spec.predictor.annotations section, add the Prometheus annotations to the section:

oc patch inferenceservice -n <namespace> <inference-service-name> --type json --patch \
  '[{"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1path", "value": "/metrics"}, {"op": "add", "path": "/spec/predictor/annotations/prometheus.io~1port", "value": "8000"}]'

You see output similar to the following:
inferenceservice.serving.kserve.io/nim-serving-runtime patched
-
-
Confirm that the annotations have been added to the
InferenceService.

oc get inferenceservice -n <namespace> <inferenceservice-name> -o json | jq '.spec.predictor.annotations'

You see the annotations that you added in the output:

{
  "prometheus.io/path": "/metrics",
  "prometheus.io/port": "8000"
}
Managing and monitoring models on the multi-model serving platform
As a cluster administrator, you can manage and monitor models on the multi-model serving platform.
Configuring monitoring for the multi-model serving platform
The multi-model serving platform includes model and model server metrics for the ModelMesh component. ModelMesh generates its own set of metrics and does not rely on the underlying model-serving runtimes to provide them. The set of metrics that ModelMesh generates includes metrics for model request rates and timings, model loading and unloading rates, times and sizes, internal queuing delays, capacity and usage, cache state, and least recently-used models. For more information, see ModelMesh metrics.
After you have configured monitoring, you can view metrics for the ModelMesh component.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You are familiar with creating a config map for monitoring a user-defined workflow. You will perform similar steps in this procedure.
-
You are familiar with enabling monitoring for user-defined projects in OpenShift. You will perform similar steps in this procedure.
-
You have assigned the
monitoring-rules-view role to users that will monitor metrics.
-
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI (
oc) as shown in the following example:

$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
-
Define a
ConfigMap object in a YAML file called uwm-cm-conf.yaml with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: debug
      retention: 15d

The user-workload-monitoring-config object configures the components that monitor user-defined projects. Observe that the retention time is set to the recommended value of 15 days.
-
Apply the configuration to create the
user-workload-monitoring-config object.

$ oc apply -f uwm-cm-conf.yaml
-
Define another
ConfigMap object in a YAML file called uwm-cm-enable.yaml with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

The cluster-monitoring-config object enables monitoring for user-defined projects.
-
Apply the configuration to create the
cluster-monitoring-config object.

$ oc apply -f uwm-cm-enable.yaml