import rich
from openai import OpenAI

# We'll be using a llama-stack server deployed in Open Data Hub.
# Once all pods associated with the LlamaStackDistribution are running,
# build the base_url from the llama-stack service hostname (append /v1 when using the OpenAI SDK).
base_url = "http://llama-stack-distribution-service.my-project.svc.cluster.local:8321/v1"

client = OpenAI(
    api_key="your-llama-stack-key",
    base_url=base_url
)
Important: The Open Data Hub documentation and the opendatahub-documentation repository are archived as of March 2026. To see the latest documentation, go to the Red Hat OpenShift AI Self-Managed documentation.
Working with Llama Stack
Overview of Llama Stack
Llama Stack is a unified AI runtime environment designed to simplify the deployment and management of generative AI workloads on Open Data Hub. In OpenShift Container Platform, the Llama Stack Operator manages the deployment lifecycle of these components, ensuring scalability, consistency, and integration with Open Data Hub projects. Llama Stack integrates model inference, embedding generation, vector storage, and retrieval services into a single stack that is optimized for retrieval-augmented generation (RAG) and agent-based AI workflows.
Llama Stack concepts
- Llama Stack Operator: Installs and manages Llama Stack server instances in Open Data Hub, handling lifecycle operations such as deployment, scaling, and updates.
- The run.yaml file: Defines which APIs are enabled and how backend providers are configured for a Llama Stack server. Red Hat ships a default run.yaml that supports common deployment scenarios. You can provide a custom run.yaml to enable advanced workflows or integrate additional providers.
- LlamaStackDistribution custom resource: Declares the runtime configuration for a Llama Stack server, including model providers, embedding configuration, vector storage, and persistence settings.
Open Data Hub ships with a Llama Stack Distribution that runs the Llama Stack server in a containerized environment. Open Data Hub 3.3.0 includes Open Data Hub Llama Stack version 0.4.2.1+rhai0, which is based on upstream Llama Stack version 0.4.2.
Llama Stack includes the following core components:
- Integration with Open Data Hub: Uses the LlamaStackDistribution custom resource to simplify configuration and deployment of AI workloads.
- Inference model connections: Acts as a proxy between Llama Stack APIs and model inference servers, such as vLLM deployments.
- Embedding generation: Generates vector embeddings used for retrieval. In Open Data Hub 3.2, remote embedding models are the recommended and default option for production deployments. Inline embedding models remain available for development and testing scenarios.
- Vector storage: Stores and indexes embeddings by using supported vector databases, such as Milvus or PostgreSQL with the pgvector extension.
- Metadata persistence: Stores vector store metadata, file references, and configuration state. In Open Data Hub 3.2, PostgreSQL is the default backend for production-grade deployments.
- Retrieval workflows: Manages ingestion, chunking, embedding, and similarity search to support RAG workflows.
- Agentic workflows: Enables agent-based interactions through supported APIs, such as OpenAI-compatible Responses and Chat Completions.
For information about deploying Llama Stack in Open Data Hub, see Deploying a RAG stack in a project.
Note: The Llama Stack Operator is not currently supported on IBM Power or IBM Z platforms.
Llama Stack APIs
You can use the following APIs from Llama Stack for AI actions such as evaluation, scoring, and inference:
Supported Llama Stack APIs in Open Data Hub
Dataset_IO API
- Endpoint: /v1beta/datasetio
- Providers: All dataset_io backends deployed through Open Data Hub.
- Support level: Technology Preview.
The Dataset_IO API manages the input and output of datasets and their content.
Evaluation API
- Endpoint: /v1beta/eval
- Providers: All evaluation backends deployed through Open Data Hub.
- Support level: Developer Preview.
The Evaluation API defines an evaluation task for models and datasets.
Inference API
- Endpoint: /v1alpha/inference
- Providers: All inference backends deployed through Open Data Hub.
- Support level: Developer Preview.
Warning: Most of the Inference API is deprecated. The Inference providers now use the Completions and Chat Completions APIs.
The Inference API enables conversational, message-based interactions with models served by Llama Stack in Open Data Hub.
Safety API
- Endpoint: /v1/safety
- Providers: All safety backends deployed through Open Data Hub.
- Support level: Technology Preview.
The Safety API detects and prevents harmful content in model inputs and outputs.
Tool Runtime API
- Endpoint: /v1/tool-runtime
- Providers: All tool runtime backends deployed through Open Data Hub.
- Support level: Developer Preview.
The Tool Runtime API allows a model to dynamically call a tool at runtime.
Vector_IO API
- Endpoint: /v1/vector-io
- Providers: All vector_io backends deployed through Open Data Hub.
- Support level: Developer Preview.
The Vector_IO API allows you to manage and query vector embeddings, which are numeric representations of data.
OpenAI compatibility for RAG APIs in Llama Stack
Open Data Hub supports OpenAI-compatible request and response schemas for Llama Stack retrieval-augmented generation (RAG) workflows. This compatibility allows you to use OpenAI clients, tools, and schemas with Llama Stack for managing files, vector stores, and executing RAG queries through the Responses API.
OpenAI compatibility enables the following capabilities:
- You can use OpenAI SDKs and tools with Llama Stack by pointing the client to the Llama Stack OpenAI-compatible API path.
- You can manage files and vector stores by using OpenAI-compatible endpoints and invoke RAG workflows by using the Responses API with the file_search tool.
When configuring clients, the required base_url depends on the SDK that you use:
- OpenAI SDKs: When you use an OpenAI-compatible SDK (for example, the OpenAI Python client), you must include the /v1 path suffix in the base URL. For example: http://llama-stack-service:8321/v1
- Llama Stack SDK (llama_stack_client): When you use the native Llama Stack SDK, set the base URL to the Llama Stack service endpoint without the /v1 suffix. The SDK automatically appends the correct API paths. For example: http://llama-stack-service:8321
Important: When you use OpenAI-compatible SDKs or send raw HTTP requests to Llama Stack, always include the /v1 path suffix in the base URL. Using the service endpoint without /v1 causes requests to fail.
OpenAI-compatible APIs in Llama Stack
Open Data Hub includes a Llama Stack component that exposes OpenAI-compatible APIs. These APIs enable you to reuse existing OpenAI SDKs, tools, and workflows directly within your OpenShift Container Platform environment, without changing your client code. This compatibility layer supports retrieval-augmented generation (RAG), inference, and embedding workloads by using OpenAI-compatible endpoints, schemas, and authentication patterns.
This compatibility layer has the following capabilities:
-
Standardized endpoints: REST API paths align with OpenAI specifications.
-
Schema parity: Request and response fields follow OpenAI data structures.
Important: When connecting OpenAI SDKs or third-party tools to Open Data Hub, you must update the client configuration to use your deployment's Llama Stack route as the base_url. When you use OpenAI SDKs or send raw HTTP requests to Llama Stack, always include the /v1 path suffix in the base URL. For example: http://llama-stack-service:8321/v1. Using the service endpoint without /v1 causes requests to fail.
These endpoints are exposed under the OpenAI compatibility layer and are distinct from the native Llama Stack APIs.
Supported OpenAI-compatible APIs in Open Data Hub
Before running the following examples, ensure that you have:
- The OpenAI Python SDK installed: pip install -q openai rich
- A configured client pointing to your Llama Stack endpoint.
- Model IDs from your deployment (see the Models API section).
For more information, see Deploying a Llama Stack server.
Models API
- Endpoint: /v1/models
- Providers: All model-serving back ends configured within Open Data Hub.
- Support level: Technology Preview.
The Models API lists and retrieves available model resources from the Llama Stack deployment running on Open Data Hub. By using the Models API, you can enumerate models, view their capabilities, and verify deployment status through a standardized OpenAI-compatible interface.
Example code in Python:
# List models available in the llama-stack server
models = client.models.list()
rich.print(models)

# Select the first LLM and the first embedding model
model_id = next(m for m in models if m.custom_metadata["model_type"] == "llm").id
embedding_model = next(m for m in models if m.custom_metadata["model_type"] == "embedding")
embedding_model_id = embedding_model.id
embedding_dimension = embedding_model.custom_metadata["embedding_dimension"]
Chat Completions API
- Endpoint: /v1/chat/completions
- Providers: All inference back ends deployed through Open Data Hub.
- Support level: Technology Preview.
The Chat Completions API enables conversational, message-based interactions with models served by Llama Stack in Open Data Hub.
Example code in Python:
# Test chat completion functionality with a simple question
response = client.chat.completions.create(
model=model_id,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
],
temperature=0,
)
# Optional verification check
assert len(response.choices) > 0, "No response after basic inference on llama-stack server"
content = response.choices[0].message.content
rich.print(content)
Completions API
- Endpoint: /v1/completions
- Providers: All inference back ends managed by Open Data Hub.
- Support level: Technology Preview.
The Completions API supports single-turn text generation and prompt completion.
Example code in Python:
# Test completion functionality with a simple question
response = client.completions.create(
model=model_id,
prompt="Answer with one word only: What is the capital of France?",
max_tokens=64,
temperature=0.1
)
# Optional verification check
assert len(response.choices) > 0, "No response after basic inference on llama-stack server"
content = response.choices[0].text
rich.print(content)
Embeddings API
- Endpoint: /v1/embeddings
- Providers: All embedding models enabled in Open Data Hub.
The Embeddings API generates numerical embeddings for text or documents that can be used in downstream semantic search or RAG applications.
Example code in Python:
# Create text embeddings
response = client.embeddings.create(
input="Your text string goes here",
model=embedding_model_id
)
embedding = response.data[0].embedding
rich.print(embedding[:5] + ["..."] + embedding[-5:])
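Embeddings feed directly into the similarity computations used by semantic search and RAG. The following sketch computes cosine similarity; the toy vectors stand in for real `response.data[0].embedding` values so the example runs without a server:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding outputs.
query_vec = [0.1, 0.3, 0.5]
doc_vec = [0.1, 0.29, 0.52]       # semantically close to the query
unrelated_vec = [0.9, -0.2, 0.0]  # semantically distant

# The related document scores higher than the unrelated one.
assert cosine_similarity(query_vec, doc_vec) > cosine_similarity(query_vec, unrelated_vec)
```

In a real retrieval workflow the vector store performs this comparison for you; the sketch only illustrates the underlying operation.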
Files API
- Endpoint: /v1/files
- Providers: File system-based file storage provider for managing files and documents stored locally in your cluster.
- Support level: Technology Preview.
The Files API manages file uploads for use in embedding and retrieval workflows.
Example code in Python:
import requests
from rich import print
from rich.rule import Rule
import time
# -----------------------------
# Download the PDF from url
# -----------------------------
print(Rule("[bold cyan]Downloading PDF[/bold cyan]"))
# We'll use IBM 2025-Q4 report to test RAG, as models don't have that info
pdf_url = "https://www.ibm.com/downloads/documents/us-en/1550f7eea8c0ded6"
filename = "ibm-Q4-2025-4q25-press-release.pdf"
title = "IBM-4Q25-Earnings-Press-Release"
print("📥 Fetching PDF from URL...")
response = requests.get(pdf_url)
response.raise_for_status()
print("✅ PDF fetched successfully")
print(f"💾 Saving PDF as [bold]{filename}[/bold]...")
with open(filename, "wb") as f:
f.write(response.content)
print(f"✅ Downloaded and saved: [green]{filename}[/green]")
# -----------------------------
# Upload the PDF
# -----------------------------
print(Rule("[bold cyan]Uploading File[/bold cyan]"))
print("☁️ Uploading file to Files API...")
with open(filename, "rb") as f:
file_info = client.files.create(
file=(filename, f),
purpose="assistants"
)
print("✅ File uploaded successfully")
print(file_info)
# -----------------------------
# Create vector store
# -----------------------------
print(Rule("[bold cyan]Creating Vector Store[/bold cyan]"))
provider_id = "milvus"
print("🧠 Creating vector store with Milvus provider...")
vector_store = client.vector_stores.create(
name="test_vector_store",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": provider_id,
},
)
print("✅ Vector store created")
print(vector_store)
# -----------------------------
# Add file to vector store
# -----------------------------
print(Rule("[bold cyan]Indexing File[/bold cyan]"))
print("📎 Adding uploaded file to vector store...")
vector_store_file = client.vector_stores.files.create(
vector_store_id=vector_store.id,
file_id=file_info.id,
chunking_strategy={
"type": "static",
"static": {
"max_chunk_size_tokens": 700,
"chunk_overlap_tokens": 100,
}
},
attributes={
"title": title,
},
)
print("✅ File added to vector store")
print(vector_store_file)
# -----------------------------
# Verify file is completed
# -----------------------------
print(Rule("[bold cyan]Waiting until file status is complete[/bold cyan]"))
# Wait for file processing to complete
print("Waiting for file processing to complete...")
max_wait_time = 300 # 5 minutes
start_time = time.time()
while time.time() - start_time < max_wait_time:
files = client.vector_stores.files.list(vector_store_id=vector_store.id)
if files.data:
file_status = files.data[0].status
print(f"File status: {file_status}")
if file_status == "completed":
print("✅ File processing completed!")
break
elif file_status == "failed":
print("✗ File processing failed!")
break
time.sleep(5)
else:
print("⚠ Timeout waiting for file processing")
# Verify file is completed
files = client.vector_stores.files.list(vector_store_id=vector_store.id)
if files.data:
print(f"\nFinal file status: {files.data[0].status}")
print(f"File details: {files.data[0]}")
else:
print("No files found in vector store")
print(Rule("[bold green]All tasks completed successfully ✔[/bold green]"))
Vector Stores API
- Endpoint: /v1/vector_stores
- Providers: Inline and remote vector store providers configured in Open Data Hub.
- Support level: Technology Preview.
The Vector Stores API manages the creation, configuration, and lifecycle of vector store resources in Llama Stack. Through this API, you can create new vector stores, list existing ones, delete unused stores, and query their metadata, all using OpenAI-compatible request and response formats.
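The lifecycle described above can be sketched with the OpenAI SDK as follows. The function is defined but not invoked here because it requires a live Llama Stack server; the store name, embedding model ID, dimension, and provider ID are placeholders:

```python
def vector_store_lifecycle(client):
    """Sketch of the vector store lifecycle via OpenAI-compatible calls.

    `client` is assumed to be an OpenAI client configured with the
    Llama Stack base URL, as shown earlier in this document.
    """
    # Create a store; embedding settings are passed in extra_body,
    # matching the Files API example above (values are placeholders).
    store = client.vector_stores.create(
        name="example_store",
        extra_body={
            "embedding_model": "your-embedding-model-id",
            "embedding_dimension": 768,
            "provider_id": "milvus",
        },
    )
    # List existing stores and inspect their metadata.
    for vs in client.vector_stores.list():
        print(vs.id, vs.name)
    # Retrieve a single store by ID, then delete it when no longer needed.
    client.vector_stores.retrieve(vector_store_id=store.id)
    client.vector_stores.delete(vector_store_id=store.id)
```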
Vector Store Files API
- Endpoint: /v1/vector_stores/{vector_store_id}/files
- Providers: Local inline provider configured for file storage and retrieval.
- Support level: Developer Preview.
The Vector Store Files API implements the OpenAI Vector Store Files interface and manages the association between document files and vector stores used for RAG workflows.
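A short sketch of inspecting and detaching file associations follows. It is defined but not called here because it needs a reachable server; `client` is assumed to be an OpenAI client configured as shown earlier:

```python
def inspect_vector_store_files(client, vector_store_id):
    """Sketch: list file associations in a vector store and detach one.

    Mirrors the calls used in the Files API example above.
    """
    files = client.vector_stores.files.list(vector_store_id=vector_store_id)
    for f in files.data:
        # Status values seen earlier: "in_progress", "completed", "failed".
        print(f.id, f.status)
    # Detach the first file from the store; this removes the association,
    # not the underlying uploaded file.
    if files.data:
        client.vector_stores.files.delete(
            vector_store_id=vector_store_id,
            file_id=files.data[0].id,
        )
```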
Responses API
- Endpoint: /v1/responses
- Providers: All agents, inference, and vector providers configured in Open Data Hub.
- Support level: Developer Preview.
The Responses API generates model outputs by combining inference, file search, and tool-calling capabilities through a single OpenAI-compatible endpoint. It is particularly useful for retrieval-augmented generation (RAG) workflows that rely on the file_search tool to retrieve context from vector stores.
Example code in Python:
from rich import print
from rich.table import Table
system_instructions = """You are a financial document analysis assistant specialized in quarterly earnings reports, annual filings, press releases, and earnings call transcripts.
You are designed to answer questions in a concise and professional manner.
Answer questions strictly using only the provided documents.
Base every answer strictly on the retrieved document content and cite the relevant section or excerpt ID.
Do not use outside knowledge.
Do not guess, infer missing data, or fabricate numbers.
If the answer is not found in the retrieved content, reply: "I couldn't find relevant information in the available files or my own knowledge."
Be concise, precise, and factual."""
examples = [
{
"input_query": "What do you know about IBM earnings in Q4, 2025? Summarize in one sentence",
"expected_answer": "IBM reported strong fourth-quarter results with revenue rising 12% to $19.7 billion, driven by double-digit growth in its Software and Infrastructure segments and a generative AI book of business that has now surpassed $12.5 billion"
},
{
"input_query": "What was the total value of IBM's generative AI book of business as reported in the fourth quarter of 2025?",
"expected_answer": "IBM reported that its generative AI book of business now stands at more than $12.5 billion."
},
{
"input_query": "What was IBM's reported free cash flow for the full year of 2025?",
"expected_answer": (
"IBM reported a full-year free cash flow of $14.7 billion, which was an increase of $2.0 billion year-over-year"
)
},
{
"input_query": "How did the Software segment perform in terms of revenue during the fourth quarter of 2025?",
"expected_answer": (
"The Software segment generated $9.0 billion in revenue, representing an increase of 14 percent (or 11 percent at constant currency)"
)
},
]
# Use the Responses API to create a results table comparing not using vs using
# the vector_store
table = Table(
title="Answer Comparison (With vs Without Vector Store)",
show_lines=True,
)
table.add_column("Question", style="cyan", no_wrap=False)
table.add_column("Expected Answer", style="magenta", no_wrap=False)
table.add_column("Answer (No Vector Store)", style="yellow", no_wrap=False)
table.add_column("Answer (With Vector Store)", style="green", no_wrap=False)
for example in examples:
question = example["input_query"]
expected_answer = example["expected_answer"]
# Ask question without vector_store
response_no_vs = client.responses.create(
model=model_id,
input=question,
instructions=system_instructions,
)
answer_no_vs = response_no_vs.output_text.strip()
# Ask question with vector_store
response_vs = client.responses.create(
model=model_id,
input=question,
instructions=system_instructions,
tools=[
{
"type": "file_search",
"vector_store_ids": [vector_store.id],
}
],
)
answer_vs = response_vs.output_text.strip()
table.add_row(
question,
expected_answer,
answer_no_vs,
answer_vs,
)
# The table will take a while to be printed, as multiple queries to the responses API will be done
print(table)
Note: The Responses API is an experimental feature that is still under active development in Open Data Hub. While the API is already functional and suitable for evaluation, some endpoints and parameters remain under implementation and might change in future releases. This API is provided for testing and feedback purposes only and is not recommended for production use.
OpenAI-compatible file citation annotations
Llama Stack supports OpenAI-compatible file citation annotations in Responses API outputs when using the file_search tool. These annotations enable applications to trace generated responses back to source documents without requiring changes to existing OpenAI client code.
OpenAI-compatible file citation annotations in Llama Stack
Open Data Hub provides OpenAI-compatible file citation annotations in Responses API outputs when using retrieval-augmented generation (RAG) with the file_search tool. These annotations enable applications to trace generated responses back to the source files used during retrieval without requiring changes to existing OpenAI client code. When you use the Responses API with the file_search tool, Llama Stack returns citation metadata that references the source file used to generate the response. Annotations are enabled by default.
Citation annotations have the following characteristics:
- They follow the same response structure defined by OpenAI.
- They appear in the annotations field of output_text response content.
- They identify the source file by ID and filename.
- They provide document-level attribution.
This feature improves transparency for RAG workflows while maintaining schema compatibility with OpenAI request and response formats.
In Open Data Hub, the following annotation capabilities are supported:
- Annotations are returned only through the Responses API.
- Annotations are returned only when using the file_search tool.
- The file_citation annotation type is supported.
- Attribution is provided at the document level.
Viewing file citation annotations in Responses API output
When you query ingested content by using the file_search tool with the Responses API, Llama Stack returns OpenAI-compatible file_citation annotations. These annotations identify the source files used during retrieval.
- You have deployed a Llama Stack server.
- You have configured at least one inference model.
- You have created a vector store and ingested content into it.
- You can successfully execute a RAG query by using the file_search tool, as described in Querying ingested content in a Llama model.
- You have access to a client environment, such as a Jupyter notebook or an OpenAI SDK client, that is correctly configured to send authenticated requests to the Llama Stack server.
Note: This procedure requires that content has already been ingested into a vector store. If no content is available, RAG queries return empty or non-contextual responses.
- In a Jupyter notebook cell or other configured client environment, run a RAG query by using the file_search tool:
response = client.responses.create(
    model=model_id,
    input=query,
    instructions=system_instructions,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ],
)
- Inspect the full response object rather than only the output_text property:
response.output
- Access the annotations array:
annotations = response.output[0].content[0].annotations
print(annotations)
- Review the file_citation annotation fields. Example output:
[
  {
    "type": "file_citation",
    "file_id": "file-57610eaac6364459bfefae60377837b7",
    "filename": "redbankfinancial_about.pdf",
    "index": 139
  }
]
Each file_citation annotation includes the following fields:
- file_id: The identifier of the retrieved file.
- filename: The name of the source file.
- index: The index of the cited file in the list of files.
Multiple annotations can reference the same index position.
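As a brief illustration, an application might aggregate these fields to show per-file attribution. The payload below copies the example output shown earlier, so the sketch runs without a server:

```python
from collections import Counter

# Example annotations payload in the shape described above.
annotations = [
    {
        "type": "file_citation",
        "file_id": "file-57610eaac6364459bfefae60377837b7",
        "filename": "redbankfinancial_about.pdf",
        "index": 139,
    },
]

# Count citations per source file; multiple annotations can share an index,
# so counting by filename gives document-level attribution.
citations_per_file = Counter(
    a["filename"] for a in annotations if a["type"] == "file_citation"
)
print(citations_per_file)  # Counter({'redbankfinancial_about.pdf': 1})
```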
If you use raw HTTP requests or an OpenAI SDK, send requests to the following endpoint:
/v1/responses
Ensure that your base URL includes the /v1 path suffix, as described in OpenAI compatibility for RAG APIs in Llama Stack.
Note: The accuracy and consistency of citation annotations depend on the capabilities of the underlying language model. Smaller or less capable models might produce less precise attributions, even when retrieval is functioning correctly. If citation results are incomplete or inconsistent, verify the model configuration and consider using a larger or more capable model.
When you use an OpenAI SDK, configure the client base_url to include the /v1 path suffix. The SDK automatically appends the appropriate endpoint path, such as /responses.
For example:
http://llama-stack-service:8321/v1
When you send raw HTTP requests, include both the /v1 path suffix and the /responses endpoint in the full request URL.
For example:
http://llama-stack-service:8321/v1/responses
Ensure that /v1 is included only once in the base URL. Do not append /v1 multiple times.
For more information, see OpenAI compatibility for RAG APIs in Llama Stack.
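A minimal helper, shown here as an illustrative sketch rather than part of any SDK, can normalize a configured base URL so that /v1 appears exactly once:

```python
def with_v1_suffix(base_url: str) -> str:
    """Return base_url with exactly one /v1 suffix (illustrative helper)."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return trimmed + "/v1"

# The helper is idempotent: applying it to an already-suffixed URL is safe.
assert with_v1_suffix("http://llama-stack-service:8321") == "http://llama-stack-service:8321/v1"
assert with_v1_suffix("http://llama-stack-service:8321/v1") == "http://llama-stack-service:8321/v1"
assert with_v1_suffix("http://llama-stack-service:8321/v1/") == "http://llama-stack-service:8321/v1"
```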
- The response includes an annotations array under output[].content[].
- Each annotation has "type": "file_citation".
- The file_id and filename correspond to files stored in the specified vector store.
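These checks can be expressed programmatically. The sketch below runs against a hand-written response-shaped dictionary (the ID and filename are hypothetical) rather than a live server:

```python
# A payload matching the response structure described above;
# the file_id and filename values are hypothetical examples.
response = {
    "output": [
        {
            "content": [
                {
                    "type": "output_text",
                    "text": "Example generated response.",
                    "annotations": [
                        {
                            "type": "file_citation",
                            "file_id": "file-abc123",
                            "filename": "example.pdf",
                            "index": 0,
                        }
                    ],
                }
            ]
        }
    ]
}

# Verify the three conditions listed above.
annotations = response["output"][0]["content"][0]["annotations"]
assert annotations, "expected at least one annotation"
assert all(a["type"] == "file_citation" for a in annotations)
assert all(a["file_id"] and a["filename"] for a in annotations)
print("verification checks passed")
```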
File citation annotation reference
This reference describes the file_citation annotation type returned by Llama Stack through the OpenAI-compatible Responses API.
Annotation location
Annotations are returned in the annotations field of output_text content items within the output[].content[] structure of the Responses API response.
"output": [
{
"content": [
{
"type": "output_text",
"text": "Example generated response.",
"annotations": [ ... ]
}
]
}
]
Supported annotation type
In Open Data Hub, Llama Stack returns the file_citation annotation type when using the file_search tool.
The url_citation type is defined in the OpenAI schema but is not produced by Llama Stack in Open Data Hub 3.3.
File citation fields
The file_citation annotation includes the following fields:
| Field | Type | Description |
|---|---|---|
| type | string | Always file_citation. |
| file_id | string | Identifier of the source file used during retrieval. |
| filename | string | Name of the source file. |
| index | integer | Index of the cited file in the list of files. |
Annotation behavior
- Attribution is provided at the document level.
- Multiple annotations can reference the same index position.
- Chunk-level and token-level attribution are not supported.
- Annotations follow the OpenAI response schema without modification.
Llama Stack API provider support
You can use Llama Stack to enable various provider APIs and providers in Open Data Hub. The following table lists the supported providers included in Open Data Hub.
Warning: The support status of the Llama Stack API providers has shifted between Technology Preview and Developer Preview across Open Data Hub versions.
| Provider API | Providers | How to Enable | Disconnected support | Support status | ||
|---|---|---|---|---|---|---|
Agents |
|
Enabled by default |
Yes |
Developer Preview |
||
Dataset_IO |
|
Enabled by default |
Yes |
Technology Preview |
||
|
Enabled by default |
No |
Technology Preview |
|||
Evaluation |
|
Set the |
No |
Technology Preview |
||
|
See the "Configuring the Ragas remote provider for production" documentation |
No |
Technology Preview |
|||
|
Enabled by default |
No |
Technology Preview |
|||
Files |
|
Enabled by default |
No |
Technology Preview |
||
Inference |
|
Set the |
Yes |
Technology Preview |
||
|
Enabled by default |
Yes |
Technology Preview |
|||
|
Set the |
No |
Technology Preview |
|||
|
Set the |
No |
Technology Preview |
|||
|
Set the |
No |
Technology Preview |
|||
|
Set the |
No |
Technology Preview |
|||
|
Set the |
No |
Technology Preview |
|||
Safety |
|
Enabled by default |
No |
Technology Preview |
||
Scoring |
|
Enabled by default |
No |
Technology Preview |
||
|
Enabled by default |
No |
Technology Preview |
|||
|
Enabled by default |
No |
Technology Preview |
|||
Tool_Runtime |
|
Enabled by default |
No |
Developer Preview |
||
|
Enabled by default |
No |
Developer Preview |
|||
|
Enabled by default |
No |
Developer Preview |
|||
|
Enabled by default |
No |
Developer Preview |
|||
Vector_IO |
|
Set the |
No |
Technology Preview |
||
|
Enabled by default |
Yes |
Technology Preview |
|||
|
Set the |
Yes |
Technology Preview |
|||
|
Set the |
Yes |
Technology Preview |
|||
|
Set the |
Yes |
Technology Preview |
Activating the Llama Stack Operator
You can activate the Llama Stack Operator on your OpenShift Container Platform cluster by setting its managementState to Managed in the Open Data Hub Operator DataScienceCluster custom resource (CR). This setting enables Llama-based model serving without reinstalling or directly editing Operator subscriptions. You can edit the CR in the OpenShift Container Platform web console or by using the OpenShift CLI (oc).
Note: As an alternative to following the steps in this procedure, you can activate the Llama Stack Operator from the OpenShift CLI (oc). Replace <name> with the name of your DataScienceCluster custom resource.
- You have installed OpenShift Container Platform 4.19 or newer.
- You have cluster administrator privileges.
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Container Platform
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
- You have installed the Open Data Hub Operator on your cluster.
- You have a DataScienceCluster custom resource in your environment; the default is default-dsc.
- Your infrastructure supports GPU-enabled instance types, for example, g4dn.xlarge on AWS.
- You have enabled GPU support in Open Data Hub, including installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
- You have created a NodeFeatureDiscovery resource instance on your cluster, as described in Installing the Node Feature Discovery Operator and creating a NodeFeatureDiscovery instance in the NVIDIA documentation.
- You have created a ClusterPolicy resource instance with default values on your cluster, as described in Creating the ClusterPolicy instance in the NVIDIA documentation.
- Log in to the OpenShift Container Platform web console as a cluster administrator.
- In the Administrator perspective, click Ecosystem → Installed Operators.
- Click the Open Data Hub Operator to open its details.
- Click the Data Science Cluster tab.
- On the DataScienceClusters page, click the default-dsc object.
- Click the YAML tab. An embedded YAML editor opens, displaying the configuration for the DataScienceCluster custom resource.
- In the YAML editor, locate the spec.components section. If the llamastackoperator field does not exist, add it. Then, set the managementState field to Managed:
spec:
  components:
    llamastackoperator:
      managementState: Managed
- Click Save to apply your changes.
After you activate the Llama Stack Operator, verify that it is running in your cluster:
- In the OpenShift Container Platform web console, click Workloads → Pods.
- From the Project list, select the opendatahub namespace.
- Confirm that a pod with the label app.kubernetes.io/name=llama-stack-operator is displayed and has a status of Running.
Deploying a Llama Stack server
Llama Stack allows you to create and deploy a server that enables various APIs for accessing AI services in your Open Data Hub cluster. You can create a LlamaStackDistribution custom resource for your desired use cases. You are responsible for provisioning and managing the PostgreSQL instance. The PostgreSQL database can be deployed in-cluster or hosted externally, as long as it is reachable from the cluster network.
The included procedure provides an example LlamaStackDistribution CR that deploys a Llama Stack server that enables the following setup:
- A connection to a vLLM inference service with a llama32-3b model.
- A connection to a remote vector database.
- Allocated persistent storage.
- Orchestration endpoints.
- You have installed OpenShift Container Platform 4.19 or newer.
- You have logged in to Open Data Hub.
- You have cluster administrator privileges for your OpenShift cluster.
- You have activated the Llama Stack Operator in your cluster.
- You have access to a PostgreSQL version 14 or later instance that is reachable from the OpenShift Container Platform cluster network.
- You have PostgreSQL credentials for that instance that allow Llama Stack to create the database and tables.
- You know the PostgreSQL hostname and database port to use for the POSTGRES_HOST and POSTGRES_PORT environment variables.
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Container Platform
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
In the OpenShift web console, select Administrator → Quick Create ( ) → Import YAML, and create a CR similar to the following example llamastack-custom-distribution.yaml file:

Example llamastack-custom-distribution.yaml

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack-custom-distribution
  namespace: <project-name> # Replace with your OpenShift project
spec:
  replicas: 1
  server:
    containerSpec:
      env:
        - name: VLLM_URL
          value: 'https://llama32-3b.llamastack.svc.cluster.local/v1'
        - name: INFERENCE_MODEL
          value: llama32-3b
        - name: VLLM_TLS_VERIFY
          value: 'false'
        - name: POSTGRES_HOST
          value: <postgres-host>
        - name: POSTGRES_PORT
          value: '<postgres-port>' # Default PostgreSQL port is 5432
        - name: POSTGRES_DB
          value: llamastack
        - name: POSTGRES_USER
          value: llamastack
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: postgres-secret (1)
      name: llama-stack
      port: 8321
    distribution:
      name: 'rh-dev'
    storage:
      size: 20Gi
      mountPath: <custom-mount-path> ## Defaults to /opt/app-root/src/.llama/distributions/rh/
-
Create the secret in the same namespace as the LlamaStackDistribution resource. Avoid placing passwords directly on the command line, because they can be stored in shell history. Instead, create a file that contains only the database password and use that file to create the secret, or create the secret by using the OpenShift web console.
For example:
$ oc create secret generic postgres-secret --from-file=password=pg-password.txt -n <project-name>
$ rm -f pg-password.txt

For more information about creating and managing Secrets, see Providing sensitive data to pods by using secrets.
Ensure that the file pg-password.txt contains only the database password and is deleted after the secret is created.

Llama Stack automatically creates the metadata database specified by the POSTGRES_DB environment variable if it does not already exist, provided that the PostgreSQL user has sufficient privileges.
-
-
Check that the custom resource was created with the following command:
$ oc get llamastackdistribution -n <project-name>
-
Check the running pods with the following command:
$ oc get pods -n <project-name> | grep llamastack-custom-distribution
-
Check the logs with the following command:
$ oc logs -n <project-name> -l app=llama-stack

Example output

INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321
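After the server reports startup, you can reach it through its cluster service with any OpenAI-compatible client, as in the earlier examples in this document. The following sketch builds the in-cluster base URL; the service name llamastack-custom-distribution-service and the my-project namespace are assumptions based on the example CR, so substitute the names from your own deployment.

```python
def llama_stack_base_url(service: str, namespace: str, port: int = 8321) -> str:
    """Build the in-cluster base URL for a Llama Stack server.

    The OpenAI SDK expects the /v1 suffix on the base URL.
    """
    return f"http://{service}.{namespace}.svc.cluster.local:{port}/v1"

# Hypothetical service and namespace; replace with your deployment's names.
base_url = llama_stack_base_url("llamastack-custom-distribution-service", "my-project")
print(base_url)

# With the openai package installed, create the client as shown earlier:
# import openai
# client = openai.OpenAI(api_key="your-llama-stack-key", base_url=base_url)
```

The port defaults to 8321 because that is the server port set in the example CR.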
Llama Stack application examples
Use the following examples to deploy and configure Llama Stack applications on Open Data Hub. These examples include deploying a RAG stack, evaluating RAG systems, and configuring authentication and availability options.
The following documentation includes example workflows:
-
Deploying a RAG stack in a data science project
-
Evaluating RAG systems with Llama Stack
-
Using supported vector stores with Llama Stack
-
Configuring Llama Stack with OAuth authentication
Deploying a RAG stack in a project
As an OpenShift Container Platform cluster administrator, you can deploy a Retrieval‑Augmented Generation (RAG) stack in Open Data Hub. This stack provides the infrastructure, including LLM inference, vector storage, and retrieval services that data scientists and AI engineers use to build conversational workflows in their projects.
To deploy the RAG stack in a project, complete the following tasks:
-
Activate the Llama Stack Operator in Open Data Hub.
-
Enable GPU support on the OpenShift Container Platform cluster. This task includes installing the required NVIDIA Operators.
-
Deploy an inference model, for example, the llama-3.2-3b-instruct model. This task includes creating a storage connection and configuring GPU allocation.
-
Ingest domain data into the configured vector store by running Docling in an AI pipeline or Jupyter notebook. This process keeps the embeddings synchronized with the source data.
-
Expose and secure the model endpoints.
Overview of RAG
Retrieval-augmented generation (RAG) in Open Data Hub enhances large language models (LLMs) by integrating domain-specific data sources directly into the model’s context. Domain-specific data sources can be structured data, such as relational database tables, or unstructured data, such as PDF documents.
RAG indexes content and builds an embedding store that data scientists and AI engineers can query. When data scientists or AI engineers pose a question to a RAG chatbot, the RAG pipeline retrieves the most relevant pieces of data, passes them to the LLM as context, and generates a response that reflects both the prompt and the retrieved content.
By implementing RAG, data scientists and AI engineers can obtain tailored, accurate, and verifiable answers to complex queries based on their own datasets within a project.
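The retrieve-then-generate loop described above can be sketched in a few lines. This toy example ranks chunks with a bag-of-words cosine similarity instead of a real embedding model, purely to illustrate the flow; in a Llama Stack deployment, embedding and similarity search are delegated to the configured providers.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline calls an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    # Return the k chunks most similar to the question.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Milvus stores embedding vectors and performs similarity search.",
    "KServe deploys the vLLM model server on the cluster.",
]
question = "Which component stores embedding vectors?"
context = retrieve(question, chunks)[0]
# The retrieved chunk is passed to the LLM as context alongside the prompt.
prompt = f"Context:\n{context}\n\nQuestion: {question}"
```

The final prompt string is what the RAG pipeline sends to the inference endpoint, grounding the answer in the retrieved content.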
Audience for RAG
The target audience for RAG is practitioners who build data-grounded conversational AI applications using Open Data Hub infrastructure.
- For Data Scientists
-
Data scientists can use RAG to prototype and validate models that answer natural-language queries against data sources without managing low-level embedding pipelines or vector stores. They can focus on creating prompts and evaluating model outputs instead of building retrieval infrastructure.
- For MLOps Engineers
-
MLOps engineers typically deploy and operate RAG pipelines in production. Within Open Data Hub, they manage LLM endpoints, monitor performance, and ensure that both retrieval and generation scale reliably. RAG decouples vector store maintenance from the serving layer, enabling MLOps engineers to apply CI/CD workflows to data ingestion and model deployment alike.
- For Data Engineers
-
Data engineers build workflows to load data into storage that Open Data Hub indexes. They keep embeddings in sync with source systems, such as S3 buckets or relational tables, to ensure that chatbot responses are accurate.
- For AI Engineers
-
AI engineers architect RAG chatbots by defining prompt templates, retrieval methods, and fallback logic. They configure agents and add domain-specific tools, such as OpenShift Container Platform job triggers, enabling rapid iteration.
Overview of vector databases
Vector databases are a core component of retrieval-augmented generation (RAG) in Open Data Hub. They store and index vector embeddings that represent the semantic meaning of text or other data. When integrated with Llama Stack, vector databases enable applications to retrieve relevant context and combine it with large language model (LLM) inference.
Vector databases provide the following capabilities:
-
Store vector embeddings generated by embedding models.
-
Support efficient similarity search to retrieve semantically related content.
-
Enable RAG workflows by supplying the LLM with contextually relevant data.
In Open Data Hub, vector databases are configured and managed through the Llama Stack Operator as part of a LlamaStackDistribution.
Starting with version 3.2, PostgreSQL is the default and recommended metadata store for Llama Stack, supporting production-ready persistence, concurrency, and scalability.
The following vector database options are supported in Open Data Hub:
-
Inline Milvus Inline Milvus runs embedded within the Llama Stack Distribution (LSD) pod and is suitable for development and small-scale RAG workloads. In Open Data Hub 3.2 and later, Inline Milvus uses PostgreSQL as the backing metadata store by default. This option provides a simplified deployment model while retaining durable metadata storage.
-
Inline FAISS Inline FAISS uses the FAISS (Facebook AI Similarity Search) library to provide an in-process vector store for RAG workflows. Inline FAISS is designed for experimentation, prototyping, and development scenarios where simplicity and low operational overhead are priorities. In Open Data Hub 3.2 and later, Inline FAISS also relies on PostgreSQL for metadata storage.
-
Remote Milvus Remote Milvus runs as a standalone vector database service, either within the cluster or as an external managed deployment. This option is suitable for large-scale or production-grade RAG workloads that require high availability, horizontal scalability, and isolation from the Llama Stack server. In OpenShift Container Platform environments, Milvus typically requires an accompanying etcd service for coordination. For more information, see Providing redundancy with etcd.
-
Remote PostgreSQL with pgvector PostgreSQL with the pgvector extension provides a production-ready vector database option that integrates vector similarity search directly into PostgreSQL. This option is well suited for environments that already operate PostgreSQL and require durable storage, transactional consistency, and centralized management. pgvector enables Llama Stack to store embeddings and perform similarity search without deploying a separate vector database service.
Consider the following guidance when choosing a vector database for your RAG workloads:
-
Use Inline Milvus or Inline FAISS for development, testing, or early experimentation.
-
Use Remote Milvus when you require large-scale vector indexing and high-throughput similarity search.
-
Use PostgreSQL with pgvector when you want production-ready persistence and integration with existing PostgreSQL-based data platforms.
Starting with Open Data Hub 3.2, SQLite-based storage is no longer recommended for production deployments. PostgreSQL-based backends provide improved reliability, concurrency, and scalability as Llama Stack moves toward general availability.
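The selection guidance above can be condensed into a small decision helper. This is an illustrative sketch only, not an API that ships with Llama Stack; the return values are the informal provider labels used in this section.

```python
def choose_vector_store(production: bool,
                        large_scale: bool = False,
                        existing_postgres: bool = False) -> str:
    """Map the vector store guidance above onto a choice (illustrative only)."""
    if not production:
        # Development, testing, or early experimentation.
        return "inline-milvus or inline-faiss"
    if large_scale:
        # Large-scale vector indexing and high-throughput similarity search.
        return "remote-milvus"
    if existing_postgres:
        # Durable storage integrated with an existing PostgreSQL platform.
        return "remote-pgvector"
    return "remote-milvus or remote-pgvector"

print(choose_vector_store(production=False))
```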
Overview of Milvus vector databases
Milvus is an open source vector database designed for high-performance similarity search across large volumes of embedding data. In Open Data Hub, Milvus is supported as a vector store provider for Llama Stack and enables retrieval-augmented generation (RAG) workloads that require efficient vector indexing, scalable search, and durable storage.
Starting with Open Data Hub 3.2, production-grade Llama Stack deployments default to PostgreSQL for metadata persistence. When Milvus is used as the vector store, PostgreSQL is typically used for Llama Stack metadata, while Milvus manages vector indexes and similarity search.
Milvus vector databases provide the following capabilities in Open Data Hub:
-
High-performance similarity search using Approximate Nearest Neighbor (ANN) algorithms
-
Efficient indexing and query optimization for dense embeddings
-
Persistent storage of vector data
-
Integration with Llama Stack through an OpenAI-compatible Vector Stores API
In a typical RAG workflow in Open Data Hub, the following responsibilities are separated:
-
Embedding generation Embeddings are generated by the configured embedding provider. In Open Data Hub 3.2, remote embedding models are the recommended and default option for production deployments.
-
Vector storage and retrieval Milvus stores embedding vectors and performs similarity search operations.
-
Metadata persistence Llama Stack stores vector store metadata, file references, and configuration state using PostgreSQL in production deployments.
-
Llama Stack server Coordinates ingestion, retrieval, and model inference through a unified API surface.
In Open Data Hub, Milvus can be used in the following operational modes:
-
Inline Milvus Lite Runs embedded within the Llama Stack Distribution pod. Inline Milvus Lite is intended for experimentation, development, or small datasets. It does not provide high availability or horizontal scalability and is not recommended for production use.
-
Remote Milvus Runs as a standalone service within your OpenShift Container Platform project or as an external managed Milvus deployment. Remote Milvus is recommended for production-grade RAG workloads.
A remote Milvus deployment typically includes the following components:
-
A Milvus service that exposes a gRPC endpoint (port 19530) for client traffic
-
An etcd service that Milvus uses for metadata coordination, collection state, and index management
-
Persistent storage for durable vector data
Milvus requires a dedicated etcd instance for metadata coordination, even when running in standalone mode. Do not use the OpenShift control plane etcd for this purpose. For more information about etcd, see Providing redundancy with etcd.
Important: You must deploy a dedicated etcd service for Milvus or connect Milvus to an external etcd instance. Do not share the OpenShift control plane etcd with application workloads.
Use Remote Milvus when you require scalable vector search, high-performance retrieval, and integration with production-grade Llama Stack deployments in Open Data Hub.
For instructions on deploying Milvus as a remote vector database, see Deploying a remote Milvus vector database.
Overview of Qdrant vector databases
Qdrant is an open source vector database optimized for high-performance similarity search and advanced filtering. In Open Data Hub, Qdrant is supported as a remote vector store provider for Llama Stack and can be used in retrieval-augmented generation (RAG) workloads that require efficient vector indexing and durable storage.
When used with Llama Stack in Open Data Hub, Qdrant provides:
-
High-performance similarity search using Hierarchical Navigable Small World (HNSW) indexing
-
Filtering based on stored metadata during vector search
-
Persistent storage of vector data
-
Integration through the OpenAI-compatible Vector Stores API
In a RAG workflow:
-
Embeddings are generated by the configured embedding provider.
-
Qdrant stores embedding vectors and performs similarity search.
-
Llama Stack manages ingestion, retrieval, and model inference through a unified API.
In Open Data Hub, you must deploy Qdrant as a remote service, either within your OpenShift Container Platform project or as an externally managed deployment.
Note: Inline Qdrant is not supported. To use Qdrant with Llama Stack in Open Data Hub, deploy Qdrant as a remote service.
A typical remote deployment includes:
-
A Qdrant service exposing HTTP (port 6333) and gRPC (port 6334) endpoints
-
Persistent storage for vector data
-
Optional API key authentication
For deployment and configuration instructions, see Using Qdrant in Llama Stack.
Overview of FAISS vector databases
The FAISS (Facebook AI Similarity Search) library is an open source framework for high-performance vector search and clustering. It is optimized for dense numerical embeddings and supports both CPU and GPU execution. In Open Data Hub, FAISS is supported as an inline vector store provider for Llama Stack, enabling fast, in-process similarity search without requiring a separate vector database service.
When you enable inline FAISS in a LlamaStackDistribution, Llama Stack uses FAISS as an embedded vector index that runs inside the Llama Stack server container. This configuration is designed for lightweight development, experimentation, and single-node retrieval-augmented generation (RAG) workflows.
Inline FAISS provides the following capabilities in Open Data Hub:
-
In-process similarity search using FAISS indexes.
-
Low-latency embedding ingestion and query operations.
-
Simple deployment with no external vector database service.
-
Compatibility with OpenAI-compatible Vector Stores API endpoints.
In Open Data Hub 3.2, inline FAISS relies on the Llama Stack metadata and persistence backend for managing vector store state. PostgreSQL is the default and recommended backend for production-grade deployments, even when FAISS is used as the inline vector index.
SQLite can be explicitly configured for local or on-the-fly development scenarios, but it is not recommended for production use.
Inline FAISS is suitable for the following use cases:
-
Rapid prototyping of RAG workflows.
-
Development or testing environments.
-
Disconnected or single-node deployments where external vector databases are not required.
Note: Inline FAISS does not provide distributed storage, replication, or high availability. For production-grade RAG workloads that require durability, scalability, or multi-node access, use a remote vector database such as Milvus or PostgreSQL with the pgvector extension.
For an example of deploying a LlamaStackDistribution instance with inline FAISS, see Example C: LlamaStackDistribution with Inline FAISS.
Overview of pgvector vector databases
pgvector is an open source PostgreSQL extension that enables vector similarity search on embedding data stored in relational tables. In Open Data Hub, PostgreSQL with the pgvector extension is supported as a remote vector database provider for the Llama Stack Operator. pgvector supports retrieval augmented generation workflows that require persistent vector storage while integrating with existing PostgreSQL environments.
pgvector vector databases provide the following capabilities in Open Data Hub:
-
Storage of vector embeddings in PostgreSQL tables.
-
Similarity search across embeddings by using pgvector distance metrics.
-
Persistent storage of vectors alongside structured relational data.
-
Integration with existing PostgreSQL security and operational tooling.
In a typical retrieval augmented generation workflow in Open Data Hub, your application uses the following components:
-
Inference provider Generates embeddings and model responses.
-
Vector store provider Stores embeddings and performs similarity search. When you use pgvector, PostgreSQL provides this capability as a remote vector store.
-
File storage provider Stores the source files that are ingested into vector stores.
-
Llama Stack server Provides a unified API surface, including an OpenAI compatible Vector Stores API.
When you ingest content, Llama Stack splits source material into chunks, generates embeddings, and stores them in PostgreSQL through the pgvector extension. When you query a vector store, Llama Stack performs similarity search and returns the most relevant chunks for use in prompts.
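The chunking step described above can be illustrated with a simple word-window splitter. The chunk size and overlap values here are arbitrary, and Llama Stack's actual chunking strategy is configured by the server, so treat this only as a sketch of the idea.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk would then be embedded and stored in a pgvector table.
print(chunk_text("a b c d e f", chunk_size=4, overlap=2))
```

Overlapping windows reduce the chance that a relevant passage is split across a chunk boundary and lost to similarity search.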
In Open Data Hub, pgvector is used in the following operational mode:
-
Remote PostgreSQL with pgvector, which runs as a standalone PostgreSQL database service accessed by the Llama Stack server. This mode is suitable for development and production workloads that require persistent storage and integration with existing PostgreSQL infrastructure.
When you deploy PostgreSQL with the pgvector extension, you typically manage the following components:
-
Secrets for PostgreSQL connection credentials.
-
Persistent storage for durable database data.
-
A PostgreSQL service that exposes a network endpoint.
PostgreSQL with pgvector does not require an external coordination service. Vector data, indexes, and metadata are stored directly in PostgreSQL tables and managed through standard database mechanisms.
Use PostgreSQL with pgvector when you require persistent vector storage and want to integrate vector search into existing PostgreSQL based data platforms within Open Data Hub. For instructions on deploying PostgreSQL with the pgvector extension, see Deploying a PostgreSQL instance with pgvector.
Deploying a Llama model with KServe
To use Llama Stack and retrieval-augmented generation (RAG) workloads in Open Data Hub, you must deploy a Llama model with a vLLM model server and configure KServe in KServe RawDeployment mode.
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have logged in to Open Data Hub.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have activated the Llama Stack Operator. For more information, see Installing the Llama Stack Operator.
-
You have installed KServe.
-
You have enabled the model serving platform. For more information about enabling the model serving platform, see Enabling the model serving platform.
-
You can access the model serving platform in the dashboard configuration. For more information about setting dashboard configuration options, see Customizing the dashboard.
-
You have enabled GPU support in Open Data Hub, including installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You have created a project.
-
The vLLM serving runtime is installed and available in your environment.
-
You have created a storage connection for your model that contains a URI - v1 connection type. This storage connection must define the location of your Llama 3.2 model artifacts. For example, oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct. For more information about creating storage connections, see Adding a connection to your project.
Important: These steps are only supported in Open Data Hub versions 2.19 and later.

Procedure
-
In the Open Data Hub dashboard, navigate to the project details page and click the Deployments tab.
-
In the Model serving platform tile, click Select model.
-
Click the Deploy model button.
The Deploy model dialog opens.
-
Configure the deployment properties for your model:
-
In the Model deployment name field, enter a unique name for your deployment.
-
In the Serving runtime field, select vLLM NVIDIA GPU serving runtime for KServe from the drop-down list.
-
In the Deployment mode field, select KServe RawDeployment from the drop-down list.
-
Set Number of model server replicas to deploy to 1.
-
In the Model server size field, select Custom from the drop-down list.
-
Set CPUs requested to 1 core.
-
Set Memory requested to 10 GiB.
-
Set CPU limit to 2 cores.
-
Set Memory limit to 14 GiB.
-
Set Accelerator to NVIDIA GPUs.
-
Set Accelerator count to 1.
-
-
From the Connection type drop-down list, select a relevant data connection.
-
-
In the Additional serving runtime arguments field, specify the following recommended arguments:
--dtype=half --max-model-len=20000 --gpu-memory-utilization=0.95 --enable-chunked-prefill --enable-auto-tool-choice --tool-call-parser=llama3_json --chat-template=/opt/app-root/template/tool_chat_template_llama3.2_json.jinja-
Click Deploy.
Note: Model deployment can take several minutes, especially for the first model that is deployed on the cluster. Initial deployment may take more than 10 minutes while the relevant images download.
-
-
Verify that the kserve-controller-manager and odh-model-controller pods are running:
-
Open a new terminal window.
-
Log in to your OpenShift Container Platform cluster from the CLI:
-
In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
-
After you have logged in, click Display token.
-
Copy the Log in with this token command and paste it in the OpenShift CLI (oc).

$ oc login --token=<token> --server=<openshift_cluster_url>
-
Enter the following command to verify that the kserve-controller-manager and odh-model-controller pods are running:

$ oc get pods -n opendatahub | grep -E 'kserve-controller-manager|odh-model-controller'
-
Confirm that you see output similar to the following example:

kserve-controller-manager-7c865c9c9f-xyz12   1/1   Running   0   4m21s
odh-model-controller-7b7d5fd9cc-wxy34        1/1   Running   0   3m55s
-
If you do not see the kserve-controller-manager or odh-model-controller pod, there might be a problem with your deployment. In addition, if the pods appear in the list but their Status is not set to Running, check the pod logs for errors:

$ oc logs <pod-name> -n opendatahub
-
Check the status of the inference service:

$ oc get inferenceservice -n <project name>
$ oc get pods -n <project name> | grep llama
-
The deployment automatically creates the following resources:
-
A ServingRuntime resource.
-
An InferenceService resource, a Deployment, a pod, and a service pointing to the pod.
-
-
Verify that the server is running. For example:
$ oc logs llama-32-3b-instruct-predictor-77f6574f76-8nl4r -n <project name>

Check for output similar to the following example log:
INFO 2025-05-15 11:23:52,750 __main__:498 server: Listening on ['::', '0.0.0.0']:8321
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO 2025-05-15 11:23:52,765 __main__:151 server: Starting up
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
-
The deployed model displays in the Deployments tab on the project details page for the project it was deployed under.
-
-
-
If you see a ConvertTritonGPUToLLVM error in the pod logs when querying the /v1/chat/completions API, and the vLLM server restarts or returns a 500 Internal Server error, apply the following workaround:

Before deploying the model, remove the --enable-chunked-prefill argument from the Additional serving runtime arguments field in the deployment dialog.

The error is displayed similar to the following:
/opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: error: Failures have been detected while processing an MLIR pass pipeline
/opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: note: Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
INFO:     10.129.2.8:0 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
Testing your vLLM model endpoints
To verify that your deployed Llama 3.2 model is accessible externally, ensure that your vLLM model server is exposed as a network endpoint. You can then test access to the model from outside both the OpenShift Container Platform cluster and the Open Data Hub interface.
Important: If you selected Make deployed models available through an external route during deployment, your vLLM model endpoint is already accessible outside the cluster. You do not need to manually expose the model server. Manually exposing vLLM model endpoints, for example, by using
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have logged in to Open Data Hub.
-
You have activated the Llama Stack Operator in Open Data Hub.
-
You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
-
You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
Open a new terminal window.
-
Log in to your OpenShift Container Platform cluster from the CLI:
-
In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
-
After you have logged in, click Display token.
-
Copy the Log in with this token command and paste it in the OpenShift CLI (
oc).$ oc login --token=<token> --server=<openshift_cluster_url>
-
-
If you enabled Require token authentication during model deployment, retrieve your token:
$ export MODEL_TOKEN=$(oc get secret default-name-llama-32-3b-instruct-sa -n <project name> --template='{{ .data.token }}' | base64 -d)
-
Obtain your model endpoint URL:
-
If you enabled Make deployed models available through an external route during model deployment, click Endpoint details on the Deployments page in the Open Data Hub dashboard to obtain your model endpoint URL.
-
If you did not enable Require token authentication during model deployment, you can also retrieve the endpoint URL by entering the following command:
$ export MODEL_ENDPOINT="https://$(oc get route llama-32-3b-instruct -n <project name> --template='{{ .spec.host }}')"
-
-
Test the endpoint with a sample chat completion request:
-
If you did not enable Require token authentication during model deployment, enter a chat completion request. For example:
$ curl -X POST $MODEL_ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-32-3b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'
-
If you enabled Require token authentication during model deployment, include a token in your request. For example:
curl -s -k $MODEL_ENDPOINT/v1/chat/completions \
  --header "Authorization: Bearer $MODEL_TOKEN" \
  --header 'Content-Type: application/json' \
  -d '{
    "model": "llama-32-3b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "can you tell me a funny joke?"
      }
    ]
  }' | jq .

Note: The -k flag disables SSL verification and should only be used in test environments or with self-signed certificates.
-
Confirm that you received a JSON response containing a chat completion. For example:
{
"id": "chatcmpl-05d24b91b08a4b78b0e084d4cc91dd7e",
"object": "chat.completion",
"created": 1747279170,
"model": "llama-32-3b-instruct",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": null,
"content": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}],
"usage": {
"prompt_tokens": 37,
"total_tokens": 62,
"completion_tokens": 25,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
If you do not receive a response similar to the example, verify that the endpoint URL and token are correct, and ensure your model deployment is running.
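When you script this check instead of reading the JSON by eye, you can extract the assistant message and token counts from the response body with the standard library. The sketch below parses a trimmed copy of the example response above.

```python
import json

# Trimmed copy of the example chat completion response shown above.
raw = """
{
  "model": "llama-32-3b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant",
                  "content": "Hello! It's nice to meet you."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 37, "total_tokens": 62, "completion_tokens": 25}
}
"""

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
total_tokens = resp["usage"]["total_tokens"]
print(answer)        # the assistant's reply
print(total_tokens)  # 62 in the example above
```

A finish_reason of "stop" indicates the model completed its reply normally rather than hitting a token limit.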
Deploying a remote Milvus vector database
To use Milvus as a remote vector database provider for Llama Stack in Open Data Hub, you must deploy Milvus and its required etcd service in your OpenShift project. This procedure shows how to deploy Milvus in standalone mode without the Milvus Operator.
Note: The following example configuration is intended for testing or evaluation environments. For production-grade deployments, see the Milvus documentation at https://milvus.io/docs.
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You are logged in to Open Data Hub.
-
You have a StorageClass available that can provision persistent volumes.
-
You created a root password to secure your Milvus service.
-
You have deployed an inference model with vLLM, for example, the llama-3.2-3b-instruct model, and you have selected Make deployed models available through an external route and Require token authentication during model deployment.
-
You have the correct inference model identifier, for example, llama-3-2-3b.
-
You have the model endpoint URL, ending with /v1, such as https://llama-32-3b-instruct-predictor:8443/v1.
-
You have the API token required to access the model endpoint.
-
You have installed the OpenShift command line interface (oc) as described in Installing the OpenShift CLI.
-
In the OpenShift Container Platform console, click the Quick Create ( ) icon and then click the Import YAML option.
-
Verify that your project is the selected project.
-
In the Import YAML editor, paste the following manifest and click Create:
apiVersion: v1 kind: Secret metadata: name: milvus-secret type: Opaque stringData: root-password: "MyStr0ngP@ssw0rd" --- kind: PersistentVolumeClaim apiVersion: v1 metadata: name: milvus-pvc spec: accessModes: - ReadWriteOnce resources: requests: storage: 20Gi volumeMode: Filesystem --- apiVersion: apps/v1 kind: Deployment metadata: name: etcd-deployment labels: app: etcd spec: replicas: 1 selector: matchLabels: app: etcd strategy: type: Recreate template: metadata: labels: app: etcd spec: containers: - name: etcd image: quay.io/coreos/etcd:v3.5.5 command: - etcd - --advertise-client-urls=http://127.0.0.1:2379 - --listen-client-urls=http://0.0.0.0:2379 - --data-dir=/etcd ports: - containerPort: 2379 volumeMounts: - name: etcd-data mountPath: /etcd env: - name: ETCD_AUTO_COMPACTION_MODE value: revision - name: ETCD_AUTO_COMPACTION_RETENTION value: "1000" - name: ETCD_QUOTA_BACKEND_BYTES value: "4294967296" - name: ETCD_SNAPSHOT_COUNT value: "50000" volumes: - name: etcd-data emptyDir: {} restartPolicy: Always --- apiVersion: v1 kind: Service metadata: name: etcd-service spec: ports: - port: 2379 targetPort: 2379 selector: app: etcd --- apiVersion: apps/v1 kind: Deployment metadata: labels: app: milvus-standalone name: milvus-standalone spec: replicas: 1 selector: matchLabels: app: milvus-standalone strategy: type: Recreate template: metadata: labels: app: milvus-standalone spec: containers: - name: milvus-standalone image: milvusdb/milvus:v2.6.0 args: ["milvus", "run", "standalone"] env: - name: DEPLOY_MODE value: standalone - name: ETCD_ENDPOINTS value: etcd-service:2379 - name: COMMON_STORAGETYPE value: local - name: MILVUS_ROOT_PASSWORD valueFrom: secretKeyRef: name: milvus-secret key: root-password livenessProbe: exec: command: ["curl", "-f", "http://localhost:9091/healthz"] initialDelaySeconds: 90 periodSeconds: 30 timeoutSeconds: 20 failureThreshold: 5 ports: - containerPort: 19530 protocol: TCP - containerPort: 9091 protocol: TCP volumeMounts: - name: 
milvus-data mountPath: /var/lib/milvus restartPolicy: Always volumes: - name: milvus-data persistentVolumeClaim: claimName: milvus-pvc --- apiVersion: v1 kind: Service metadata: name: milvus-service spec: selector: app: milvus-standalone ports: - name: grpc port: 19530 targetPort: 19530 - name: http port: 9091 targetPort: 9091
Note
-
Use the gRPC port (
19530) for the MILVUS_ENDPOINT setting in Llama Stack.
-
The HTTP port (
9091) is reserved for health checks. -
If you deploy Milvus in a different namespace, use the fully qualified service name in your Llama Stack configuration. For example:
http://milvus-service.<namespace>.svc.cluster.local:19530
-
-
In the OpenShift Container Platform web console, click Workloads → Deployments.
-
Verify that both
etcd-deployment and milvus-standalone show a status of 1 of 1 pods available.
-
Click Pods in the navigation panel and confirm that pods for both deployments are Running.
-
Click the
milvus-standalone pod name, then select the Logs tab.
-
Verify that Milvus reports a healthy startup with output similar to:
Milvus Standalone is ready to serve ... Listening on 0.0.0.0:19530 (gRPC) -
Click Networking → Services and confirm that the
milvus-service and etcd-service resources exist and are exposed on ports 19530 and 2379, respectively.
-
(Optional) Click Pods → milvus-standalone → Terminal and run the following health check:
curl http://localhost:9091/healthz
A response of {"status": "healthy"} confirms that Milvus is running correctly.
Deploying a LlamaStackDistribution instance
You can deploy Llama Stack with retrieval-augmented generation (RAG) by pairing it with a vLLM-served Llama 3.2 model. This module provides the following deployment examples of the LlamaStackDistribution custom resource (CR):
-
Example A: Inline Milvus (embedded, single-node, remote embeddings)
-
Example B: Remote Milvus (external service, inline embeddings served with the sentence-transformers library)
-
Example C: Inline FAISS (embedded, single node, inline embeddings served with the sentence-transformers library)
-
Example D: Remote PostgreSQL with pgvector (external service, remote embeddings)
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You are logged in to Open Data Hub.
-
You have activated the Llama Stack Operator in Open Data Hub.
-
You have deployed an inference model with vLLM (for example, llama-3.2-3b-instruct) and selected Make deployed models available through an external route and Require token authentication during model deployment. In addition, in Add custom runtime arguments, you have added --enable-auto-tool-choice.
-
You have the correct inference model identifier, for example,
llama-3-2-3b. -
You have the model endpoint URL ending with
/v1, for example, https://llama-32-3b-instruct-predictor:8443/v1.
-
You have the API token required to access the model endpoint.
-
You have installed the PostgreSQL Operator version 14 or later and configured a PostgreSQL database for Llama Stack metadata storage. For more information, see the documentation for "Deploying a Llama Stack server".
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:
-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
Open a new terminal window and log in to your OpenShift Container Platform cluster from the CLI:
In the upper-right corner of the OpenShift web console, click your user name and select Copy login command. After you have logged in, click Display token. Copy the Log in with this token command and paste it in the OpenShift CLI (
oc).
$ oc login --token=<token> --server=<openshift_cluster_url>
-
Create a secret that contains the inference model and the remote embeddings environment variables:
# Remote LLM
export INFERENCE_MODEL="llama-3-2-3b"
export VLLM_URL="https://llama-32-3b-instruct-predictor:8443/v1"
export VLLM_TLS_VERIFY="false"  # Use "true" in production
export VLLM_API_TOKEN="<token identifier>"
export VLLM_MAX_TOKENS=16384

# Remote embedding configuration
export EMBEDDING_MODEL="nomic-embed-text-v1-5"
export EMBEDDING_PROVIDER_MODEL_ID="nomic-embed-text-v1-5"
export VLLM_EMBEDDING_URL="<embedding-endpoint>/v1"
export VLLM_EMBEDDING_API_TOKEN="<embedding-token>"
export VLLM_EMBEDDING_MAX_TOKENS=8192
export VLLM_EMBEDDING_TLS_VERIFY="true"

oc create secret generic llama-stack-secret -n <project-name> \
  --from-literal=INFERENCE_MODEL="$INFERENCE_MODEL" \
  --from-literal=VLLM_URL="$VLLM_URL" \
  --from-literal=VLLM_TLS_VERIFY="$VLLM_TLS_VERIFY" \
  --from-literal=VLLM_API_TOKEN="$VLLM_API_TOKEN" \
  --from-literal=VLLM_MAX_TOKENS="$VLLM_MAX_TOKENS" \
  --from-literal=EMBEDDING_MODEL="$EMBEDDING_MODEL" \
  --from-literal=EMBEDDING_PROVIDER_MODEL_ID="$EMBEDDING_PROVIDER_MODEL_ID" \
  --from-literal=VLLM_EMBEDDING_URL="$VLLM_EMBEDDING_URL" \
  --from-literal=VLLM_EMBEDDING_TLS_VERIFY="$VLLM_EMBEDDING_TLS_VERIFY" \
  --from-literal=VLLM_EMBEDDING_API_TOKEN="$VLLM_EMBEDDING_API_TOKEN" \
  --from-literal=VLLM_EMBEDDING_MAX_TOKENS="$VLLM_EMBEDDING_MAX_TOKENS"
-
Choose one of the following deployment examples:
|
Important
|
To enable inline embeddings in a disconnected environment, add the following parameters to your
The built-in Llama Stack tool |
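Before applying any of the examples, it can help to confirm that every key the CRs reference is present and non-empty in the secret data. A minimal sketch, assuming the key list created in the secret step earlier (the helper itself is illustrative, not part of the product):

```python
# Keys the deployment examples read from llama-stack-secret
REQUIRED_KEYS = [
    "INFERENCE_MODEL", "VLLM_URL", "VLLM_TLS_VERIFY", "VLLM_API_TOKEN",
    "VLLM_MAX_TOKENS", "EMBEDDING_MODEL", "EMBEDDING_PROVIDER_MODEL_ID",
    "VLLM_EMBEDDING_URL", "VLLM_EMBEDDING_TLS_VERIFY",
    "VLLM_EMBEDDING_API_TOKEN", "VLLM_EMBEDDING_MAX_TOKENS",
]

def missing_keys(secret_data: dict) -> list:
    """Return the required keys that are absent or empty in the secret data."""
    return [k for k in REQUIRED_KEYS if not secret_data.get(k)]

# Example: a secret that forgot the embedding token
data = {k: "set" for k in REQUIRED_KEYS}
data.pop("VLLM_EMBEDDING_API_TOKEN")
print(missing_keys(data))  # -> ['VLLM_EMBEDDING_API_TOKEN']
```

You could feed `missing_keys` the decoded `data` of the secret (for example, from `oc get secret llama-stack-secret -o json`) before creating the CR.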
Example A: LlamaStackDistribution with Inline Milvus
Use this example for development or small datasets where an embedded, single-node Milvus is sufficient. This example uses remote embeddings.
-
In the OpenShift web console, select Administrator → Quick Create (
) → Import YAML, and create a CR similar to the following:apiVersion: llamastack.io/v1alpha1 kind: LlamaStackDistribution metadata: name: lsd-llama-milvus-inline spec: replicas: 1 server: containerSpec: resources: requests: cpu: "250m" memory: "500Mi" limits: cpu: 4 memory: "12Gi" env: # PostgreSQL metadata store (required in {productname-short} 3.2) - name: POSTGRES_HOST value: <postgres-host> - name: POSTGRES_PORT value: "5432" - name: POSTGRES_DB value: <postgres-database> - name: POSTGRES_USER value: <postgres-username> - name: POSTGRES_PASSWORD valueFrom: secretKeyRef: name: <postgres-secret-name> key: <postgres-password-key> # Remote LLM configuration - name: INFERENCE_MODEL valueFrom: secretKeyRef: name: llama-stack-secret key: INFERENCE_MODEL - name: VLLM_URL valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_URL - name: VLLM_TLS_VERIFY valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_TLS_VERIFY - name: VLLM_API_TOKEN valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_API_TOKEN - name: VLLM_MAX_TOKENS valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_MAX_TOKENS # Remote embedding configuration - name: EMBEDDING_MODEL valueFrom: secretKeyRef: name: llama-stack-secret key: EMBEDDING_MODEL - name: EMBEDDING_PROVIDER_MODEL_ID valueFrom: secretKeyRef: name: llama-stack-secret key: EMBEDDING_PROVIDER_MODEL_ID - name: VLLM_EMBEDDING_URL valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_URL - name: VLLM_EMBEDDING_TLS_VERIFY valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_TLS_VERIFY - name: VLLM_EMBEDDING_API_TOKEN valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_API_TOKEN - name: VLLM_EMBEDDING_MAX_TOKENS valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_MAX_TOKENS - name: FMS_ORCHESTRATOR_URL value: "http://localhost" name: llama-stack port: 8321 distribution: name: upstream storage: size: 5GiNoteThe
upstream value is an example distribution name. If your deployment uses a different distribution name, replace upstream with the name that matches your Llama Stack Distribution image and configuration.
Example B: LlamaStackDistribution with Remote Milvus
Use this example for production-grade or large datasets with an external Milvus service. This example uses inline embeddings served with the sentence-transformers library.
-
Create the Milvus connection secret:
# Required: gRPC endpoint on port 19530
export MILVUS_ENDPOINT="tcp://milvus-service:19530"
export MILVUS_TOKEN="<milvus-root-or-user-token>"
export MILVUS_CONSISTENCY_LEVEL="Bounded"  # Optional; choose per your deployment

oc create secret generic milvus-secret \
  --from-literal=MILVUS_ENDPOINT="$MILVUS_ENDPOINT" \
  --from-literal=MILVUS_TOKEN="$MILVUS_TOKEN" \
  --from-literal=MILVUS_CONSISTENCY_LEVEL="$MILVUS_CONSISTENCY_LEVEL"
Important
Use the gRPC port 19530 for MILVUS_ENDPOINT. Ports such as 9091 are typically used for health checks and are not valid for client traffic.
-
In the OpenShift web console, select Administrator → Quick Create (
) → Import YAML, and create a CR similar to the following:apiVersion: llamastack.io/v1alpha1 kind: LlamaStackDistribution metadata: name: lsd-llama-milvus-remote spec: replicas: 1 server: containerSpec: resources: requests: cpu: "250m" memory: "500Mi" limits: cpu: 4 memory: "12Gi" env: # PostgreSQL metadata store (required in {productname-short} 3.2) - name: POSTGRES_HOST value: <postgres-host> - name: POSTGRES_PORT value: "5432" - name: POSTGRES_DB value: <postgres-database> - name: POSTGRES_USER value: <postgres-username> - name: POSTGRES_PASSWORD valueFrom: secretKeyRef: name: <postgres-secret-name> key: <postgres-password-key> # Inline embeddings (sentence-transformers) - name: ENABLE_SENTENCE_TRANSFORMERS value: "true" - name: EMBEDDING_PROVIDER value: "sentence-transformers" # Remote LLM configuration - name: INFERENCE_MODEL valueFrom: secretKeyRef: name: llama-stack-secret key: INFERENCE_MODEL - name: VLLM_MAX_TOKENS value: "4096" - name: VLLM_URL valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_URL - name: VLLM_TLS_VERIFY valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_TLS_VERIFY - name: VLLM_API_TOKEN valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_API_TOKEN # Remote Milvus configuration from secret - name: MILVUS_ENDPOINT valueFrom: secretKeyRef: name: milvus-secret key: MILVUS_ENDPOINT - name: MILVUS_TOKEN valueFrom: secretKeyRef: name: milvus-secret key: MILVUS_TOKEN - name: MILVUS_CONSISTENCY_LEVEL valueFrom: secretKeyRef: name: milvus-secret key: MILVUS_CONSISTENCY_LEVEL name: llama-stack port: 8321 distribution: name: upstreamNoteThe
upstream value is an example distribution name. If your deployment uses a different distribution name, replace upstream with the name that matches your Llama Stack Distribution image and configuration.
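Example B fails in subtle ways if MILVUS_ENDPOINT targets the HTTP health port instead of the gRPC port 19530 called out in the important note above. A small validation sketch (an illustrative helper, not part of Llama Stack):

```python
from urllib.parse import urlparse

def valid_milvus_endpoint(endpoint: str) -> bool:
    """Accept only endpoints that target the Milvus gRPC port 19530."""
    parsed = urlparse(endpoint)
    return parsed.port == 19530

print(valid_milvus_endpoint("tcp://milvus-service:19530"))  # True
print(valid_milvus_endpoint("http://milvus-service:9091"))  # False: health-check port
```

A check like this can run in a notebook or CI step before the secret is created.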
Example C: LlamaStackDistribution with Inline FAISS
Use this example to enable the inline FAISS vector store. This example uses inline embeddings served with the sentence-transformers library.
-
In the OpenShift web console, select Administrator → Quick Create (
) → Import YAML, and create a CR similar to the following:apiVersion: llamastack.io/v1alpha1 kind: LlamaStackDistribution metadata: name: lsd-llama-faiss-inline spec: replicas: 1 server: containerSpec: resources: requests: cpu: "250m" memory: "500Mi" limits: cpu: "8" memory: "12Gi" env: # PostgreSQL metadata store (required in {productname-short} 3.2) - name: POSTGRES_HOST value: <postgres-host> - name: POSTGRES_PORT value: "5432" - name: POSTGRES_DB value: <postgres-database> - name: POSTGRES_USER value: <postgres-username> - name: POSTGRES_PASSWORD valueFrom: secretKeyRef: name: <postgres-secret-name> key: <postgres-password-key> # Inline embeddings (sentence-transformers) - name: ENABLE_SENTENCE_TRANSFORMERS value: "true" - name: EMBEDDING_PROVIDER value: "sentence-transformers" # Remote LLM configuration - name: INFERENCE_MODEL valueFrom: secretKeyRef: name: llama-stack-secret key: INFERENCE_MODEL - name: VLLM_URL valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_URL - name: VLLM_TLS_VERIFY valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_TLS_VERIFY - name: VLLM_API_TOKEN valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_API_TOKEN # Enable inline FAISS - name: ENABLE_FAISS value: "faiss" - name: FMS_ORCHESTRATOR_URL value: "http://localhost" name: llama-stack port: 8321 distribution: name: upstreamNoteThe
upstream value is an example distribution name. If your deployment uses a different distribution name, replace upstream with the name that matches your Llama Stack Distribution image and configuration.
Example D: LlamaStackDistribution with Remote PostgreSQL with pgvector
Use this example when you want to use a PostgreSQL database with the pgvector extension as the vector store backend. This configuration enables the pgvector provider and reads connection values from a secret. This example uses remote embeddings.
-
Create the pgvector connection secret:
export PGVECTOR_HOST="<pgvector-hostname>"
export PGVECTOR_PORT="5432"
export PGVECTOR_DB="<pgvector-database>"
export PGVECTOR_USER="<pgvector-username>"
export PGVECTOR_PASSWORD="<pgvector-password>"

oc create secret generic pgvector-connection -n <project-name> \
  --from-literal=PGVECTOR_HOST="$PGVECTOR_HOST" \
  --from-literal=PGVECTOR_PORT="$PGVECTOR_PORT" \
  --from-literal=PGVECTOR_DB="$PGVECTOR_DB" \
  --from-literal=PGVECTOR_USER="$PGVECTOR_USER" \
  --from-literal=PGVECTOR_PASSWORD="$PGVECTOR_PASSWORD"
-
In the OpenShift web console, select Administrator → Quick Create (
) → Import YAML, and create a custom resource similar to the following:apiVersion: llamastack.io/v1alpha1 kind: LlamaStackDistribution metadata: name: lsd-llama-pgvector-remote spec: replicas: 1 server: containerSpec: resources: requests: cpu: "250m" memory: "500Mi" limits: cpu: 4 memory: "12Gi" env: # PostgreSQL metadata store (required in {productname-short} 3.2) - name: POSTGRES_HOST value: <postgres-host> - name: POSTGRES_PORT value: "5432" - name: POSTGRES_DB value: <postgres-database> - name: POSTGRES_USER value: <postgres-username> - name: POSTGRES_PASSWORD valueFrom: secretKeyRef: name: <postgres-secret-name> key: <postgres-password-key> # Remote LLM configuration - name: INFERENCE_MODEL valueFrom: secretKeyRef: name: llama-stack-secret key: INFERENCE_MODEL - name: VLLM_URL valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_URL - name: VLLM_TLS_VERIFY valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_TLS_VERIFY - name: VLLM_API_TOKEN valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_API_TOKEN - name: VLLM_MAX_TOKENS valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_MAX_TOKENS # Remote embedding configuration - name: EMBEDDING_MODEL valueFrom: secretKeyRef: name: llama-stack-secret key: EMBEDDING_MODEL - name: EMBEDDING_PROVIDER_MODEL_ID valueFrom: secretKeyRef: name: llama-stack-secret key: EMBEDDING_PROVIDER_MODEL_ID - name: VLLM_EMBEDDING_URL valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_URL - name: VLLM_EMBEDDING_TLS_VERIFY valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_TLS_VERIFY - name: VLLM_EMBEDDING_API_TOKEN valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_API_TOKEN - name: VLLM_EMBEDDING_MAX_TOKENS valueFrom: secretKeyRef: name: llama-stack-secret key: VLLM_EMBEDDING_MAX_TOKENS # Enable and configure pgvector provider - name: ENABLE_PGVECTOR value: "true" - name: PGVECTOR_HOST valueFrom: secretKeyRef: name: pgvector-connection key: 
PGVECTOR_HOST - name: PGVECTOR_PORT valueFrom: secretKeyRef: name: pgvector-connection key: PGVECTOR_PORT - name: PGVECTOR_DB valueFrom: secretKeyRef: name: pgvector-connection key: PGVECTOR_DB - name: PGVECTOR_USER valueFrom: secretKeyRef: name: pgvector-connection key: PGVECTOR_USER - name: PGVECTOR_PASSWORD valueFrom: secretKeyRef: name: pgvector-connection key: PGVECTOR_PASSWORD - name: FMS_ORCHESTRATOR_URL value: "http://localhost" name: llama-stack port: 8321 distribution: name: upstream
Note
The upstream value is an example distribution name. If your deployment uses a different distribution name, replace upstream with the name that matches your Llama Stack Distribution image and configuration.
-
Click Create.
-
In the left-hand navigation, click Workloads → Pods and verify that the Llama Stack pod is running in the correct namespace.
-
To verify that the Llama Stack server is running, click the pod name and select the Logs tab. Look for output similar to the following:
INFO 2025-05-15 11:23:52,750 __main__:498 server: Listening on ['::', '0.0.0.0']:8321
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2025-05-15 11:23:52,765 __main__:151 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
|
Tip
|
If you switch between vector store configurations, delete the existing pod to ensure the new environment variables and backing store are picked up cleanly. |
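The log check in the verification steps can be automated by scanning for the startup markers shown in the sample output. A sketch (the marker strings are taken from that sample; the helper is illustrative):

```python
# Lines that appear in the sample server log once startup has finished
READY_MARKERS = (
    "Application startup complete.",
    "Uvicorn running on",
)

def server_ready(log_text: str) -> bool:
    """Return True once the Llama Stack server log shows a completed startup."""
    return all(marker in log_text for marker in READY_MARKERS)

log = (
    "INFO: Waiting for application startup.\n"
    "INFO: Application startup complete.\n"
    "INFO: Uvicorn running on http://['::', '0.0.0.0']:8321\n"
)
print(server_ready(log))  # True
```

You could feed it the output of `oc logs` for the Llama Stack pod in a readiness-polling loop.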
Ingesting content into a Llama model
You can quickly customize and prototype retrievable content by uploading a document and adding it to a vector store from inside a Jupyter notebook. This approach avoids building a separate ingestion pipeline. By using the Llama Stack SDK, you can ingest documents into a vector store and enable retrieval-augmented generation (RAG) workflows.
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have deployed a Llama 3.2 model with a vLLM model server.
-
You have created a
LlamaStackDistribution instance.
-
You have configured a PostgreSQL database for Llama Stack metadata storage.
-
You have configured an embedding model:
-
Recommended: You have configured a remote embedding model by using environment variables in the
LlamaStackDistribution. -
Optional: You have enabled inline embeddings with the sentence-transformers library for development or testing.
-
-
You have created a workbench within a project.
-
You have opened a Jupyter notebook and it is running in your workbench environment.
-
You have installed
llama_stack_client version 0.3.1 or later in your workbench environment.
-
You have installed
requests in your workbench environment. This is required for downloading example documents.
-
In a new notebook cell, install the client:
%pip install llama_stack_client -
Install the
requests library if it is not already available:
%pip install requests
-
Import
LlamaStackClient and create a client instance:
from llama_stack_client import LlamaStackClient

# Use the Llama Stack service or route URL that is reachable from the workbench.
# Do not append /v1 when using llama_stack_client.
client = LlamaStackClient(base_url="<llama-stack-base-url>")
-
List the available models:
models = client.models.list() -
Verify that the list includes:
-
At least one LLM model.
-
At least one embedding model (remote or inline).
[Model(identifier='llama-32-3b-instruct', model_type='llm', provider_id='vllm-inference'), Model(identifier='nomic-embed-text-v1-5', model_type='embedding', metadata={'embedding_dimension': 768})]
-
-
Select one LLM and one embedding model:
model_id = next(m.identifier for m in models if m.model_type == "llm")
embedding_model = next(m for m in models if m.model_type == "embedding")
embedding_model_id = embedding_model.identifier
embedding_dimension = int(embedding_model.metadata["embedding_dimension"])
-
(Optional) Create a vector store. Skip this step if you already have one.
NoteProvider IDs can differ between interfaces. In the Python SDK, you typically use the provider name directly (for example,
provider_id: "pgvector"). In some CLI tools and examples, remote providers might use a prefixed identifier (for example,--vector-db-provider-id remote-pgvector). Use the provider ID format that matches the interface you are using.
vector_store = client.vector_stores.create(
name="my_inline_milvus",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "milvus",
},
)
vector_store_id = vector_store.id
|
Note
|
Inline Milvus is suitable for development and small datasets. In Open Data Hub 3.2 and later, metadata persistence uses PostgreSQL by default. |
vector_store = client.vector_stores.create(
name="my_remote_milvus",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "milvus-remote",
},
)
vector_store_id = vector_store.id
|
Note
|
Ensure your LlamaStackDistribution is configured with MILVUS_ENDPOINT and MILVUS_TOKEN.
|
vector_store = client.vector_stores.create(
name="my_inline_faiss",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "faiss",
},
)
vector_store_id = vector_store.id
|
Note
|
Inline FAISS is an in-process vector store intended for development and testing. In Open Data Hub 3.2 and later, FAISS uses PostgreSQL as the default metadata store. |
vector_store = client.vector_stores.create(
name="my_pgvector_store",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "pgvector",
},
)
vector_store_id = vector_store.id
|
Note
|
Ensure that the pgvector provider is enabled in your LlamaStackDistribution and that the PostgreSQL instance has the pgvector extension installed.
|
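The four vector store options differ only in the provider_id passed to client.vector_stores.create(). A small lookup sketch can keep notebook code uniform when switching backends (the mapping reflects the provider IDs used in the options above; the helper itself is illustrative):

```python
# provider_id values used by the vector store options above
PROVIDER_IDS = {
    "inline-milvus": "milvus",
    "remote-milvus": "milvus-remote",
    "inline-faiss": "faiss",
    "pgvector": "pgvector",
}

def provider_for(option: str) -> str:
    """Map a deployment option name to its Llama Stack provider_id."""
    try:
        return PROVIDER_IDS[option]
    except KeyError:
        raise ValueError(f"Unknown vector store option: {option!r}")

print(provider_for("pgvector"))  # -> pgvector
```

The option names on the left are arbitrary labels for this sketch; only the values must match your Llama Stack configuration.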
-
If you already have a vector store, set its identifier:
# vector_store_id = "<existing-vector-store-id>" -
Download a PDF, upload it to Llama Stack, and add it to your vector store:
import requests

pdf_url = "https://www.federalreserve.gov/aboutthefed/files/quarterly-report-20250822.pdf"
filename = "quarterly-report-20250822.pdf"

response = requests.get(pdf_url)
response.raise_for_status()
with open(filename, "wb") as f:
    f.write(response.content)

with open(filename, "rb") as f:
    file_info = client.files.create(
        file=(filename, f),
        purpose="assistants",
    )

vector_store_file = client.vector_stores.files.create(
    vector_store_id=vector_store_id,
    file_id=file_info.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 800,
            "chunk_overlap_tokens": 400,
        },
    },
)
print(vector_store_file)
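With the static chunking strategy used above (800-token chunks with a 400-token overlap), the effective stride between chunk starts is 400 tokens, so you can roughly predict how many chunks a document produces. A back-of-the-envelope sketch, not the exact splitter the server uses:

```python
import math

def estimated_chunks(total_tokens: int,
                     max_chunk_size_tokens: int = 800,
                     chunk_overlap_tokens: int = 400) -> int:
    """Estimate chunk count for a static strategy with overlapping chunks."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= max_chunk_size_tokens:
        return 1
    stride = max_chunk_size_tokens - chunk_overlap_tokens
    return 1 + math.ceil((total_tokens - max_chunk_size_tokens) / stride)

print(estimated_chunks(10_000))  # roughly 24 chunks for a 10k-token document
```

Larger overlaps improve retrieval recall at the cost of more chunks (and therefore more embedding calls and storage).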
-
The call to
client.vector_stores.files.create() succeeds and returns metadata for the ingested file.
-
The vector store contains indexed chunks associated with the uploaded document.
-
Subsequent RAG queries can retrieve content from the vector store.
Querying ingested content in a Llama model
You can use the Llama Stack SDK in your Jupyter notebook to run retrieval-augmented generation (RAG) queries against content stored in your vector store, performing one-off lookups without setting up a separate retrieval service.
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
If you are using GPU acceleration, you have at least one NVIDIA GPU available.
-
You have activated the Llama Stack Operator in Open Data Hub.
-
You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
-
You have created a
LlamaStackDistribution instance with:
-
PostgreSQL configured as the metadata store.
-
An embedding model configured, preferably as a remote embedding provider.
-
-
You have created a workbench within a project and opened a running Jupyter notebook.
-
You have installed
llama_stack_client version 0.3.1 or later in your workbench environment.
-
You have already ingested content into a vector store.
|
Note
|
This procedure requires that content has already been ingested into a vector store. If no content is available, RAG queries return empty or non-contextual responses. |
-
In a new notebook cell, install the client:
%pip install -q llama_stack_client -
Import
LlamaStackClient:
from llama_stack_client import LlamaStackClient
-
Create a client instance:
# Use the Llama Stack service or route URL that is reachable from the workbench.
# Do not append /v1 when using llama_stack_client.
client = LlamaStackClient(base_url="<llama-stack-base-url>")
-
List available models:
models = client.models.list() -
Select an LLM. If you plan to register a new vector store, also capture an embedding model:
model_id = next(m.identifier for m in models if m.model_type == "llm")
embedding = next((m for m in models if m.model_type == "embedding"), None)
if embedding:
    embedding_model_id = embedding.identifier
    embedding_dimension = int(embedding.metadata.get("embedding_dimension", 768))
-
If you do not already have a vector store ID, register a vector store (choose one):
Example 5. Option 1: Inline Milvus (embedded)
vector_store = client.vector_stores.create(
    name="my_inline_milvus",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus",
    },
)
vector_store_id = vector_store.id
Note
Inline Milvus is suitable for development and small datasets. In Open Data Hub 3.2 and later, metadata persistence uses PostgreSQL by default.
vector_store = client.vector_stores.create(
name="my_remote_milvus",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "milvus-remote",
},
)
vector_store_id = vector_store.id
|
Note
|
Ensure your LlamaStackDistribution sets MILVUS_ENDPOINT (gRPC port 19530) and MILVUS_TOKEN.
|
vector_store = client.vector_stores.create(
name="my_inline_faiss",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "faiss",
},
)
vector_store_id = vector_store.id
|
Note
|
Inline FAISS is an in-process vector store intended for development and testing. In Open Data Hub 3.2 and later, FAISS uses PostgreSQL as the default metadata store. |
vector_store = client.vector_stores.create(
name="my_pgvector_store",
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "pgvector",
},
)
vector_store_id = vector_store.id
|
Note
|
Ensure the pgvector provider is enabled in your LlamaStackDistribution and that the PostgreSQL instance has the pgvector extension installed. This option is suitable for production-grade RAG workloads that require durability and concurrency.
|
-
If you already have a vector store, set its identifier:
# vector_store_id = "<existing-vector-store-id>" -
Query without using a vector store:
system_instructions = """You are a precise and reliable AI assistant. Use retrieved context when it is available. If nothing relevant is found, say so clearly.""" query = "How do you do great work?" response = client.responses.create( model=model_id, input=query, instructions=system_instructions, ) print(response.output_text) -
Query by using the Responses API with file search:
response = client.responses.create(
    model=model_id,
    input=query,
    instructions=system_instructions,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ],
)
print(response.output_text)
|
Note
|
When you include the |
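Because retrieval can legitimately come back empty, notebooks often guard the printed answer rather than assume text is present. A minimal sketch (the fallback wording is arbitrary, not produced by the SDK):

```python
def answer_or_notice(output_text: str) -> str:
    """Return the model answer, or a notice when nothing came back."""
    text = (output_text or "").strip()
    return text if text else "[no answer returned - check that content was ingested]"

print(answer_or_notice(""))         # prints the notice
print(answer_or_notice("Chunk A"))  # prints the answer
```

You would call it as `print(answer_or_notice(response.output_text))` after the file_search query above.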
-
The notebook returns a response without vector stores and a context-aware response when vector stores are enabled.
-
No errors appear, confirming successful retrieval and model execution.
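The two verification points above can be checked mechanically in the notebook, for example by confirming that the grounded and ungrounded answers are both non-empty and not identical. A simple sketch; real comparisons may need fuzzier matching:

```python
def responses_differ(plain_answer: str, rag_answer: str) -> bool:
    """True when both answers exist and retrieval changed the output."""
    a, b = plain_answer.strip(), rag_answer.strip()
    return bool(a) and bool(b) and a != b

print(responses_differ("General advice.", "Per the report, ..."))  # True
print(responses_differ("Same text", "Same text"))                  # False
```

Identical answers do not always indicate a failure, but they are a useful prompt to inspect whether the vector store returned any chunks.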
Preparing documents with Docling for Llama Stack retrieval
You can transform your source documents with a Docling-enabled pipeline and ingest the output into a Llama Stack vector store by using the Llama Stack SDK. This modular approach separates document preparation from ingestion while still enabling an end-to-end, retrieval-augmented generation (RAG) workflow.
The pipeline registers a vector store and downloads the source PDFs, then splits them for parallel processing and converts each batch to Markdown with Docling. It generates embeddings from the Markdown and stores them in the vector store, making the documents searchable through Llama Stack.
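The split-for-parallel-processing step described above amounts to dividing the PDF list across the workers. A sketch of that partitioning (illustrative only, not the sample pipeline's actual code):

```python
def split_batches(filenames: list, num_workers: int) -> list:
    """Partition filenames into up to num_workers roughly equal batches."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Never create more batches than files (but keep at least one).
    batches = [[] for _ in range(min(num_workers, len(filenames)) or 1)]
    for i, name in enumerate(filenames):
        batches[i % len(batches)].append(name)
    return batches

print(split_batches(["a.pdf", "b.pdf", "c.pdf"], 2))  # [['a.pdf', 'c.pdf'], ['b.pdf']]
```

Each batch would then be converted to Markdown by its own Docling worker before the embeddings are generated and stored.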
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
You have logged in to the OpenShift Container Platform web console.
-
You have a project and access to pipelines in the Open Data Hub dashboard.
-
You have created and configured a pipeline server within the project that contains your workbench.
-
You have activated the Llama Stack Operator in Open Data Hub.
-
You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
-
You have configured a Llama Stack deployment by creating a
LlamaStackDistributioninstance to enable RAG functionality. -
You have created a workbench within a project.
-
You have opened a Jupyter notebook and it is running in your workbench environment.
-
You have installed local object storage buckets and created connections, as described in Adding a connection to your project.
-
You have installed the
llama_stack_clientversion 0.3.1 or later in your workbench environment. -
You have compiled to YAML a pipeline that includes a Docling transform, either one of the RAG demo samples or your own custom pipeline.
-
Your project quota allows between 500 millicores (0.5 CPU) and 4 CPU cores for the pipeline run.
-
Your project quota allows from 2 GiB up to 6 GiB of RAM for the pipeline run.
-
If you are using GPU acceleration, you have at least one NVIDIA GPU available.
-
In a new notebook cell, install the client:
%pip install -q llama_stack_client -
In a new notebook cell, import
LlamaStackClient:
from llama_stack_client import LlamaStackClient
-
In a new notebook cell, assign your deployment endpoint to the
base_url parameter to create a LlamaStackClient instance:
client = LlamaStackClient(base_url="http://<llama-stack-service>:8321")
Note
LlamaStackClient requires the service root without the /v1 path suffix. For example, use http://llama-stack-service:8321. The /v1 suffix is required only when you use OpenAI-compatible SDKs or send raw HTTP requests to the OpenAI-compatible API surface.
-
List the available models:
models = client.models.list() -
Select the first LLM and the first embedding model:
model_id = next(m.identifier for m in models if m.model_type == "llm")
embedding_model = next(m for m in models if m.model_type == "embedding")
embedding_model_id = embedding_model.identifier
embedding_dimension = int(embedding_model.metadata.get("embedding_dimension", 768))
-
Register a vector store (choose one option). Skip this step if your pipeline registers the store automatically.
vector_store_name = "my_inline_db"
vector_store = client.vector_stores.create(
name=vector_store_name,
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "milvus", # inline Milvus Lite
},
)
vector_store_id = vector_store.id
print(f"Registered inline Milvus Lite DB: {vector_store_id}")
|
Note
|
Inline Milvus Lite is best for development. Data durability and scale are limited compared to remote Milvus. |
vector_store_name = "my_remote_db"
vector_store = client.vector_stores.create(
name=vector_store_name,
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "milvus-remote", # remote Milvus provider
},
)
vector_store_id = vector_store.id
print(f"Registered remote Milvus DB: {vector_store_id}")
|
Note
|
Ensure your LlamaStackDistribution includes MILVUS_ENDPOINT and MILVUS_TOKEN (gRPC :19530).
|
vector_store_name = "my_faiss_db"
vector_store = client.vector_stores.create(
name=vector_store_name,
extra_body={
"embedding_model": embedding_model_id,
"embedding_dimension": embedding_dimension,
"provider_id": "faiss", # inline FAISS provider
},
)
vector_store_id = vector_store.id
print(f"Registered inline FAISS DB: {vector_store_id}")
Note
Inline FAISS (available in Open Data Hub 3.0 and later) is a lightweight, in-process vector store. It is best for local experimentation, disconnected environments, or single-node RAG deployments.
Important
If you are using the sample Docling pipeline from the RAG demo repository, the pipeline registers the vector store automatically and you can skip the previous step. If you are using your own pipeline, you must register the vector store yourself.
-
In the OpenShift Container Platform web console, import your YAML file containing your Docling pipeline into your project, as described in Importing a pipeline version.
-
Create a pipeline run to execute your Docling pipeline, as described in Executing a pipeline run. The pipeline run inserts your PDF documents into the vector store. If you run the Docling pipeline from the RAG demo samples repository, you can optionally customize the following parameters before starting the pipeline run:
-
base_url: The base URL to fetch PDF files from. -
pdf_filenames: A comma-separated list of PDF filenames to download and convert. -
num_workers: The number of parallel workers. -
vector_store_id: The vector store identifier. -
service_url: The Milvus service URL (only for remote Milvus). -
embed_model_id: The embedding model to use. -
max_tokens: The maximum tokens for each chunk. -
use_gpu: Enable or disable GPU acceleration.
-
-
In your Jupyter notebook, query the LLM with a question that relates to the ingested content:
system_instructions = """You are a precise and reliable AI assistant.
Use retrieved context when it is available. If nothing relevant is found
in the available files, say so clearly."""

prompt = "What can you tell me about the birth of word processing?"

# Query using the Responses API with file search
response = client.responses.create(
    model=model_id,
    input=prompt,
    instructions=system_instructions,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ],
)
print("Answer (with vector stores):")
print(response.output_text)
-
Query chunks from the vector store:
query_result = client.vector_io.query(
    vector_store_id=vector_store_id,
    query="word processing",
)
print(query_result)
-
The pipeline run completes successfully in your project.
-
Document embeddings are stored in the vector store and are available for retrieval.
-
No errors or warnings appear in the pipeline logs or your notebook output.
-
About Llama Stack search types
Llama Stack supports keyword, vector, and hybrid search modes for retrieving context in retrieval-augmented generation (RAG) workloads. Each mode offers different tradeoffs in precision, recall, semantic depth, and computational cost.
Supported search modes
Keyword search
Keyword search applies lexical matching techniques, such as TF-IDF or BM25, to locate documents that contain exact or near-exact query terms. This approach is effective when precise term-matching is required, such as searching for identifiers, names, or regulatory terms.
query_result = client.vector_io.query(
vector_store_id=vector_store_id,
query="FRBNY",
params={
"mode": "keyword",
"max_chunks": 3,
"score_threshold": 0.7,
},
)
print(query_result)
For more information about keyword-based retrieval, see The Probabilistic Relevance Framework: BM25 and Beyond.
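To make the lexical-matching idea concrete, the following sketch scores tokenized documents against a query with the BM25 formula. This is an illustration of the technique, not Llama Stack code; the sample documents and parameter values (k1, b) are hypothetical.

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Inverse document frequency for each query term.
    idf = {}
    for t in query_terms:
        df = sum(1 for d in docs if t in d)
        idf[t] = math.log((n - df + 0.5) / (df + 0.5) + 1)
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)  # term frequency in this document
            s += idf[t] * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    ["FRBNY", "issues", "guidance"],
    ["the", "bank", "issues", "guidance"],
]
print(bm25_scores(["FRBNY"], docs))  # only the first document matches the exact term
```

Documents that do not contain the exact term score zero, which is why keyword search excels at identifiers and regulatory terms but misses paraphrases.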
Vector search
Vector search encodes documents and queries as dense numerical vectors, known as embeddings, and measures similarity using metrics such as cosine similarity or inner product. This approach captures semantic meaning and supports contextual matching beyond exact word overlap.
query_result = client.vector_io.query(
vector_store_id=vector_store_id,
query="FRBNY",
params={
"mode": "vector",
"max_chunks": 3,
"score_threshold": 0.7,
},
)
print(query_result)
For more information, see Billion-scale similarity search with GPUs.
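The similarity metric at the core of vector search can be sketched in a few lines. This toy example uses 3-dimensional vectors for readability; real embedding models produce vectors with hundreds of dimensions, and production stores use optimized indexes rather than brute-force comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": semantically close documents point in similar directions.
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]
doc_far = [0.0, 0.1, 0.9]
print(cosine_similarity(query, doc_close))  # close to 1.0
print(cosine_similarity(query, doc_far))    # close to 0.0
```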
Hybrid search
Hybrid search combines keyword and vector-based retrieval techniques, typically by blending lexical and semantic relevance scores. This approach aims to balance exact term matching with semantic similarity.
query_result = client.vector_io.query(
vector_store_id=vector_store_id,
query="FRBNY",
params={
"mode": "hybrid",
"max_chunks": 3,
"score_threshold": 0.7,
},
)
print(query_result)
For more information, see Sparse, Dense, and Hybrid Retrieval for Answer Ranking.
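One common way to blend lexical and semantic relevance is to normalize each score list and take a weighted sum. The sketch below illustrates that idea with made-up scores; the actual blending strategy depends on the vector store provider, and other schemes such as reciprocal rank fusion are also widely used.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] so lexical and semantic scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(keyword_scores, vector_scores, alpha=0.5):
    """Blend normalized scores; alpha weights the keyword side."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    return [alpha * k + (1 - alpha) * v for k, v in zip(kw, vec)]

# Document 0 wins on exact keywords, document 2 wins semantically;
# document 1 is reasonable on both and wins the blended ranking.
keyword = [5.2, 3.1, 0.0]
vector = [0.10, 0.55, 0.80]
print(hybrid_scores(keyword, vector, alpha=0.5))
```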
Note
Search mode availability depends on the selected vector store provider and its configured capabilities. Not all providers support every search mode. For example, some providers might support vector search only, while keyword or hybrid search might be unavailable or return empty results. Always verify supported search modes for your chosen vector store provider.
Evaluating RAG systems with Llama Stack
You can use the evaluation providers that Llama Stack exposes to measure and improve the quality of your Retrieval-Augmented Generation (RAG) workloads in Open Data Hub. This section introduces RAG evaluation providers, describes how to use Ragas with Llama Stack, shows how to benchmark embedding models with BEIR, and helps you choose the right provider for your use case.
Understanding RAG evaluation providers
Llama Stack supports pluggable evaluation providers that measure the quality and performance of Retrieval-Augmented Generation (RAG) pipelines. Evaluation providers assess how accurately, faithfully, and relevantly the generated responses align with the retrieved context and the original user query. Each provider implements its own metrics and evaluation methodology. You can enable a specific provider through the configuration of the LlamaStackDistribution custom resource.
Open Data Hub supports the following evaluation providers:
-
Ragas: A lightweight, Python-based framework that evaluates factuality, contextual grounding, and response relevance.
-
BEIR: A benchmarking framework for retrieval performance across multiple datasets.
-
TrustyAI: A Red Hat framework that evaluates explainability, fairness, and reliability of model outputs.
Evaluation providers operate independently of model serving and retrieval components. You can run evaluations asynchronously and aggregate results for quality tracking over time.
Using Ragas with Llama Stack
You can use the Ragas (Retrieval-Augmented Generation Assessment) evaluation provider with Llama Stack to measure the quality of your Retrieval-Augmented Generation (RAG) workflows in Open Data Hub. Ragas integrates with the Llama Stack evaluation API to compute metrics such as faithfulness, answer relevancy, and context precision for your RAG workloads.
Llama Stack exposes evaluation providers as part of its API surface. When you configure Ragas as a provider, the Llama Stack server sends RAG inputs and outputs to Ragas and records the resulting metrics for later analysis.
Ragas evaluation with Llama Stack in Open Data Hub supports the following deployment modes:
-
Inline provider for development and small-scale experiments.
-
Remote provider for production-scale evaluations that run as Open Data Hub AI pipelines.
You choose the mode that best fits your workflow:
-
Use the inline provider when you want fast, low-overhead evaluation while you iterate on prompts, retrieval configuration, or model choices.
-
Use the remote provider when you need to evaluate large datasets, integrate with CI/CD pipelines, or run repeated benchmarks at scale.
For information on evaluating RAG systems with Ragas in Open Data Hub, see Evaluating RAG systems with Llama Stack.
Benchmarking embedding models with BEIR datasets and Llama Stack
This procedure explains how to set up, run, and verify embedding model benchmarks by using the Llama Stack framework. Embedding models are neural networks that convert text or other data into dense numerical vectors called embeddings, which capture semantic meaning. In retrieval augmented generation systems, embeddings enable semantic search so that the system retrieves the documents most relevant to a query.
Selecting an embedding model depends on several factors, such as the content type, accuracy requirements, performance needs, and model license. The beir_benchmarks.py script compares the retrieval accuracy of embedding models by using standardized information retrieval benchmarks from the BEIR framework. The script is included in the RAG repository, which provides demonstrations, benchmarking scripts, and deployment guides for the RAG stack on OpenShift Container Platform.
The examples use the sentence-transformers inference provider, which you can replace with another provider if required.
-
You have cloned the
https://github.com/opendatahub-io/rag repository. -
You have changed into the
/rag/benchmarks/beir-benchmarks directory. -
You have initialized and activated a virtual environment.
-
You have defined and installed the relevant script package dependencies to a
requirements.txt file. -
You have built the Llama Stack starter distribution to install all dependencies.
-
You have verified that your vector database is accessible and configured in the
run.yaml file, and that any required embedding models were preloaded or registered with Llama Stack.
Important
Before you run the benchmark script, the Llama Stack server must be running and a vector database provider must be enabled and reachable. If you plan to compare embedding models beyond the default set, you must also register those embedding models with Llama Stack.
-
Optional: Start the Llama Stack server and enable a vector database provider. If you have not already started Llama Stack with a vector database provider enabled, start the server by using a configuration similar to one of the following examples:
-
Using inline Milvus:
MILVUS_URL=milvus uv run llama stack run run.yaml
-
Using remote PostgreSQL with pgvector:
ENABLE_PGVECTOR=true PGVECTOR_DB=pgvector PGVECTOR_USER=<user> PGVECTOR_PASSWORD=<password> uv run llama stack run run.yaml
-
-
Optional: Register additional embedding models. The default supported embedding models are
granite-embedding-30m and granite-embedding-125m, served by the sentence-transformers framework. If you want to benchmark additional embedding models, register them with Llama Stack before running the benchmark script. For example, register an embedding model by using the Llama Stack client:
llama-stack-client models register all-MiniLM-L6-v2 \
  --provider-id sentence-transformers \
  --provider-model-id all-minilm:latest \
  --metadata '{"embedding_dimension": 384}' \
  --model-type embedding
Note
Any embedding models specified in the --embedding-models option must be registered before running the benchmark script.
-
Run the
beir_benchmarks.pybenchmarking script.-
Enter the following command to use the configuration from
run.yaml and the default dataset scifact with inline Milvus:
MILVUS_URL=milvus uv run python beir_benchmarks.py
-
Enter the following command to run the benchmark by using remote PostgreSQL with pgvector:
ENABLE_PGVECTOR=true PGVECTOR_DB=pgvector uv run python beir_benchmarks.py \
  --vector-db-provider-id pgvector
-
Alternatively, enter the following command to connect to a custom Llama Stack server:
LLAMA_STACK_URL="http://localhost:8321" MILVUS_URL=milvus uv run python beir_benchmarks.py
-
-
Use environment variables and command line options to modify the benchmark run. For example, set the environment variable for the vector database provider before executing the script.
-
Enter the following command to use a larger batch size for document ingestion:
MILVUS_URL=milvus uv run python beir_benchmarks.py --batch-size 300
-
Enter the following command to benchmark multiple datasets, for example,
scifact and scidocs:
MILVUS_URL=milvus uv run python beir_benchmarks.py \
  --dataset-names scifact scidocs
-
Enter the following command to compare embedding models, for example,
granite-embedding-30m and all-MiniLM-L6-v2:
MILVUS_URL=milvus uv run python beir_benchmarks.py \
  --embedding-models granite-embedding-30m all-MiniLM-L6-v2
Note
Ensure that all-MiniLM-L6-v2 is registered with Llama Stack before running this command. See step 2 for registration instructions.
-
Enter the following command to use a custom BEIR-compatible dataset:
MILVUS_URL=milvus uv run python beir_benchmarks.py \
  --dataset-names my-dataset \
  --custom-datasets-urls https://example.com/my-beir-dataset.zip
-
Enter the following command to change the vector database provider:
# Use remote PostgreSQL with pgvector
ENABLE_PGVECTOR=true PGVECTOR_DB=llama-stack PGVECTOR_USER=<user> PGVECTOR_PASSWORD=<password> \
  uv run python beir_benchmarks.py \
  --vector-db-provider-id pgvector
-
For information on command line options for benchmarking embedding models with BEIR datasets, see Benchmarking embedding models with BEIR datasets and Llama Stack command line options.
To verify that the benchmark completed successfully and to review the results, perform the following steps:
-
Locate the
results directory. All output files are saved to the following path:
<path-to>/rag/benchmarks/beir-benchmarks/results
-
Examine the output. Compare your results with the sample output structure. The report includes performance metrics such as
map@cut_k and ndcg@cut_k for each dataset and embedding model pair. The script also calculates a statistical significance test called a p value.
Example output for scifact and map_cut_10:
scifact map_cut_10
granite-embedding-125m : 0.6879
granite-embedding-30m  : 0.6578
p_value                : 0.0150
A p_value < 0.05 indicates a statistically significant difference. The granite-embedding-125m model performs better for this dataset and metric.
-
Interpret the results. A p value below
0.05 indicates that the performance difference between models is statistically significant. The model with the higher metric value performs better. Use these results to identify which embedding model performs best for your dataset.
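The reasoning behind the p value can be illustrated with a paired significance test over per-query metric scores. The exact test that beir_benchmarks.py runs is not shown here; a paired permutation test, sketched below on hypothetical per-query scores, is one common way to compute such a p value.

```python
import random

def paired_permutation_p_value(scores_a, scores_b, iterations=10000, seed=0):
    """Two-sided paired permutation test on per-query metric scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(iterations):
        # Under the null hypothesis the sign of each per-query difference
        # is arbitrary, so flip each sign at random and re-measure.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / iterations

# Hypothetical per-query scores for two embedding models on one dataset.
model_a = [0.71, 0.66, 0.69, 0.72, 0.68, 0.70, 0.67, 0.73]
model_b = [0.62, 0.60, 0.61, 0.64, 0.59, 0.63, 0.58, 0.65]
print(paired_permutation_p_value(model_a, model_b))
```

Because model_a outperforms model_b on every query, very few random sign flips reproduce the observed gap, so the p value comes out well below 0.05.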
BEIR benchmarking command-line options
The BEIR benchmarking script accepts the following command-line options:
-
--vector-db-provider-id-
Description: Specifies the vector database provider to use. The provider must also be enabled through the appropriate environment variable.
-
Type: String.
-
Default:
milvus. -
Example values:
milvus,pgvector,faiss. -
Example:
--vector-db-provider-id pgvector
-
-
--dataset-names-
Description: Specifies which BEIR datasets to use for benchmarking. Use this option together with
--custom-datasets-urls when testing custom datasets. -
Type: List of strings.
-
Default:
["scifact"]. -
Example:
--dataset-names scifact scidocs nq
-
-
--embedding-models-
Description: Specifies the embedding models to compare. Models must be defined in the
run.yaml file. -
Type: List of strings.
-
Default:
["granite-embedding-30m", "granite-embedding-125m"]. -
Example:
--embedding-models all-MiniLM-L6-v2 granite-embedding-125m
-
-
--batch-size-
Description: Controls how many documents are processed per batch during ingestion. Larger batch sizes improve speed but use more memory.
-
Type: Integer.
-
Default:
150. -
Example:
--batch-size 50
--batch-size 300
-
-
--custom-datasets-urls-
Description: Specifies URLs for custom BEIR compatible datasets. Use this option with
--dataset-names. -
Type: List of strings.
-
Default:
[]. -
Example:
--dataset-names my-custom-dataset \
  --custom-datasets-urls https://example.com/my-dataset.zip
-
Note
Custom BEIR datasets must follow the required file structure and format:
dataset-name.zip/
├── qrels/
│   └── test.tsv
├── corpus.jsonl
└── queries.jsonl
For information on benchmarking embedding models with BEIR datasets, see Benchmarking embedding models with BEIR datasets and Llama Stack.
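The --batch-size option described above controls how many documents are processed per ingestion request. Conceptually, the ingestion loop looks like this minimal sketch (illustrative only; the real script's batching logic is not reproduced here):

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list of documents."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"doc-{i}" for i in range(7)]
for batch in batched(docs, batch_size=3):
    # A real ingestion loop would embed and insert each batch here.
    print(len(batch), batch)  # batches of 3, 3, and 1 documents
```

Larger batches mean fewer round trips but higher peak memory, which is the tradeoff the option documentation describes.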
Using PostgreSQL in Llama Stack
PostgreSQL is a dependency for Llama Stack deployments in Open Data Hub, where it serves as the mandatory metadata storage backend for supported vector storage configurations. Additionally, you can configure PostgreSQL as a remote vector database provider by enabling the pgvector extension.
In Open Data Hub, PostgreSQL serves the following roles in Llama Stack deployments:
-
Required metadata storage for Llama Stack APIs and orchestration services.
-
An optional remote vector database when the pgvector provider is enabled.
Depending on your deployment requirements, these roles can be fulfilled by the same PostgreSQL instance or separate instances. For example, you might use a single instance for development and testing environments, and separate instances for production deployments that require independent scaling or isolation.
Important
The procedures provide basic configuration suitable for development and testing. Production deployments require additional planning, including the following considerations:
Understanding PostgreSQL in Llama Stack
Understanding Llama Stack metadata storage
In Open Data Hub, Llama Stack requires PostgreSQL as a metadata storage backend to persist state and configuration data across multiple components. Metadata storage provides durable persistence for vector stores, file management, agent state, conversation history, and other Llama Stack services.
PostgreSQL is required as a metadata storage backend for all Open Data Hub deployments.
Role of metadata storage in Llama Stack
Llama Stack components require persistent storage beyond in-memory data structures. Without metadata storage, component state would be lost on pod restarts or application failures.
Llama Stack uses metadata storage to persist:
-
Vector store metadata, such as collection identifiers and document mappings.
-
File metadata, including file locations, identifiers, and attributes.
-
Agent state and conversation history.
-
Dataset configurations and batch processing state.
-
Model registry information and prompt templates.
This persistent storage allows Llama Stack to maintain operational state across pod restarts, rescheduling, and application updates.
PostgreSQL metadata storage backends
Llama Stack uses PostgreSQL to store multiple categories of metadata, including vector store metadata, file records, agent state, conversation history, and configuration data. These data types have different storage characteristics but are managed automatically within a single PostgreSQL instance.
Important
Starting with Open Data Hub 3.2, PostgreSQL version 14 or later is required for all Llama Stack deployments, including development, testing, and production environments. If validation errors occur, confirm that the deployed Llama Stack image version matches the configuration schema referenced by your
Llama Stack does not provision or manage the PostgreSQL instance used for metadata storage. You must deploy and manage the PostgreSQL database and supply its connection details when deploying Llama Stack.
Overview of pgvector vector databases
pgvector is an open source PostgreSQL extension that enables vector similarity search on embedding data stored in relational tables. In Open Data Hub, PostgreSQL with the pgvector extension is supported as a remote vector database provider for the Llama Stack Operator. pgvector supports retrieval augmented generation workflows that require persistent vector storage while integrating with existing PostgreSQL environments.
pgvector vector databases provide the following capabilities in Open Data Hub:
-
Storage of vector embeddings in PostgreSQL tables.
-
Similarity search across embeddings by using pgvector distance metrics.
-
Persistent storage of vectors alongside structured relational data.
-
Integration with existing PostgreSQL security and operational tooling.
In a typical retrieval augmented generation workflow in Open Data Hub, your application uses the following components:
-
Inference provider Generates embeddings and model responses.
-
Vector store provider Stores embeddings and performs similarity search. When you use pgvector, PostgreSQL provides this capability as a remote vector store.
-
File storage provider Stores the source files that are ingested into vector stores.
-
Llama Stack server Provides a unified API surface, including an OpenAI compatible Vector Stores API.
When you ingest content, Llama Stack splits source material into chunks, generates embeddings, and stores them in PostgreSQL through the pgvector extension. When you query a vector store, Llama Stack performs similarity search and returns the most relevant chunks for use in prompts.
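The chunking step in that flow can be illustrated with a simple token-budget splitter. This is a conceptual sketch only; Llama Stack's actual chunker and the max_tokens pipeline parameter shown earlier are more sophisticated than whitespace splitting.

```python
def chunk_text(text, max_tokens=16):
    """Split text into chunks of at most max_tokens whitespace tokens.
    Illustrative only; real chunkers respect sentence and token boundaries."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

sample = " ".join(f"word{i}" for i in range(40))
chunks = chunk_text(sample, max_tokens=16)
print(len(chunks))  # 3 chunks: 16 + 16 + 8 tokens
```

Each resulting chunk would then be embedded and inserted as one pgvector row, and similarity search returns the closest chunks at query time.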
In Open Data Hub, pgvector is used in the following operational mode:
-
Remote PostgreSQL with pgvector, which runs as a standalone PostgreSQL database service accessed by the Llama Stack server. This mode is suitable for development and production workloads that require persistent storage and integration with existing PostgreSQL infrastructure.
When you deploy PostgreSQL with the pgvector extension, you typically manage the following components:
-
Secrets for PostgreSQL connection credentials.
-
Persistent storage for durable database data.
-
A PostgreSQL service that exposes a network endpoint.
PostgreSQL with pgvector does not require an external coordination service. Vector data, indexes, and metadata are stored directly in PostgreSQL tables and managed through standard database mechanisms.
Use PostgreSQL with pgvector when you require persistent vector storage and want to integrate vector search into existing PostgreSQL based data platforms within Open Data Hub. For instructions on deploying PostgreSQL with the pgvector extension, see Deploying a PostgreSQL instance with pgvector.
Deploying and Configuring PostgreSQL
Deploying a PostgreSQL instance with pgvector
You can connect Llama Stack in Open Data Hub to an existing PostgreSQL instance that has the pgvector extension enabled. For development or evaluation, you can also deploy a PostgreSQL instance with the pgvector extension directly in your OpenShift Container Platform project by creating Kubernetes resources through the OpenShift web console. This procedure focuses on deploying PostgreSQL with the pgvector extension for use as a remote vector store. It does not cover preparing a PostgreSQL database for use as Llama Stack metadata storage.
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have permissions to create resources in a project in your OpenShift Container Platform cluster.
-
You have PostgreSQL connection details available, including the database name, user name, and password.
-
If you plan to deploy PostgreSQL in-cluster, you have a StorageClass that can provision persistent volumes.
-
If you are using an existing PostgreSQL instance, the pgvector extension is installed and enabled on the target database.
-
Log in to the OpenShift Container Platform web console.
-
Select the project where you want to deploy the PostgreSQL instance.
-
Click the Quick Create icon, and then click Import YAML.
Verify that the correct project is selected.
-
Copy the following YAML, replace the placeholder values, paste it into the YAML editor, and then click Create.
Important
This example deploys a standalone PostgreSQL service with the pgvector extension enabled.
Llama Stack does not automatically use this database. To use this PostgreSQL instance as a vector store, you must explicitly configure the pgvector provider in a
LlamaStackDistribution. This example is intended for development or evaluation purposes. For production deployments, review and adapt the configuration to meet your organization’s security, availability, backup, and lifecycle requirements.
Example PostgreSQL deployment with pgvector (development or evaluation)
apiVersion: v1
kind: Secret
metadata:
  name: <pgvector-postgresql-credentials-secret>
type: Opaque
stringData:
  POSTGRES_DB: "<database-name>"
  POSTGRES_USER: "<database-username>"
  POSTGRES_PASSWORD: "<database-password>"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pgvector-postgresql-pvc>
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: <storage-size>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <pgvector-postgresql-deployment>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <pgvector-postgresql-app-label>
  template:
    metadata:
      labels:
        app: <pgvector-postgresql-app-label>
    spec:
      containers:
      - name: postgres
        image: pgvector/pgvector:pg16
        ports:
        - name: postgres
          containerPort: 5432
        env:
        - name: POSTGRES_DB
          valueFrom:
            secretKeyRef:
              name: <pgvector-postgresql-credentials-secret>
              key: POSTGRES_DB
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: <pgvector-postgresql-credentials-secret>
              key: POSTGRES_USER
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: <pgvector-postgresql-credentials-secret>
              key: POSTGRES_PASSWORD
        volumeMounts:
        - name: pgdata
          mountPath: /var/lib/postgresql/data
        # Replace TCP socket probes with exec probes that validate SQL readiness.
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB"
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB"
          initialDelaySeconds: 30
          periodSeconds: 20
          timeoutSeconds: 5
          failureThreshold: 6
        # Create the pgvector extension after PostgreSQL is actually accepting SQL.
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                set -e
                echo "Waiting for PostgreSQL to be ready before enabling pgvector..."
                until PGPASSWORD="$POSTGRES_PASSWORD" psql -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT 1" >/dev/null 2>&1; do
                  sleep 2
                done
                PGPASSWORD="$POSTGRES_PASSWORD" psql -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "CREATE EXTENSION IF NOT EXISTS vector;"
      volumes:
      - name: pgdata
        persistentVolumeClaim:
          claimName: <pgvector-postgresql-pvc>
---
apiVersion: v1
kind: Service
metadata:
  name: <pgvector-postgresql-service>
spec:
  selector:
    app: <pgvector-postgresql-app-label>
  ports:
  - name: postgres
    port: 5432
    targetPort: 5432
  type: ClusterIP
-
Click Create.
-
Navigate to Networking → Services.
-
Confirm that the PostgreSQL Service is listed and exposes port
5432. -
Navigate to Workloads → Pods.
-
Confirm that the PostgreSQL pod is running.
Note
This procedure verifies only that PostgreSQL with pgvector is deployed and reachable within the project. It does not verify integration with Llama Stack.
Configuring the pgvector remote provider in Llama Stack
To use PostgreSQL with the pgvector extension as a remote vector store, configure pgvector in your existing LlamaStackDistribution and provide PostgreSQL connection details as environment variables. Ensure that your LlamaStackDistribution already includes the PostgreSQL metadata storage configuration. This setup enables retrieval augmented generation (RAG) workflows in Open Data Hub by using PostgreSQL-based vector storage.
-
You have installed and enabled the Llama Stack Operator in Open Data Hub.
-
You have a PostgreSQL database with the pgvector extension enabled. Llama Stack uses PostgreSQL for two purposes: metadata storage and the optional pgvector remote vector store. You can use a single PostgreSQL instance for both roles or deploy separate instances.
-
You have the PostgreSQL connection details, including the host name, port number, database name, user name, and password.
-
You have permissions to create Secrets and edit custom resources in your project.
-
In the OpenShift web console, switch to the Administrator perspective.
-
Create a Secret that stores the PostgreSQL connection details.
-
Ensure that the correct project is selected.
-
Click Workloads → Secrets.
-
Click Create → From YAML.
-
Paste the following YAML, update the placeholder values, and then click Create.
Example Secret for pgvector connection details
apiVersion: v1
kind: Secret
metadata:
  name: pgvector-connection
type: Opaque
stringData:
  PGVECTOR_HOST: "<pgvector-hostname>"
  PGVECTOR_PORT: "<pgvector-port>"
  PGVECTOR_DB: "<database-name>"
  PGVECTOR_USER: "<database-username>"
  PGVECTOR_PASSWORD: "<database-password>"
Important
The pgvector provider is not enabled automatically. You must explicitly enable pgvector and supply its connection details through environment variables in your LlamaStackDistribution. In Open Data Hub, the pgvector provider is enabled when the ENABLE_PGVECTOR environment variable is set.
-
-
Update your
LlamaStackDistribution custom resource to enable pgvector and reference the Secret.
-
Click Ecosystem → Installed Operators.
-
Select the Llama Stack Operator.
-
Click the LlamaStackDistribution tab.
-
Select your
LlamaStackDistribution resource. -
Click YAML.
-
Update the resource to include the following fields, and then click Save.
Before you enable pgvector, deploy a Llama Stack server and configure the PostgreSQL metadata store.
-
For more information, see Deploying a Llama Stack server.
Then update your existing LlamaStackDistribution to add the pgvector configuration shown in the following example. The example shows only the additional environment variables required to enable the pgvector provider.
LlamaStackDistribution configuration for pgvectorapiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
name: llamastack
spec:
server:
containerSpec:
env:
- name: ENABLE_PGVECTOR
value: "true"
- name: PGVECTOR_HOST
valueFrom:
secretKeyRef:
name: pgvector-connection
key: PGVECTOR_HOST
- name: PGVECTOR_PORT
valueFrom:
secretKeyRef:
name: pgvector-connection
key: PGVECTOR_PORT
- name: PGVECTOR_DB
valueFrom:
secretKeyRef:
name: pgvector-connection
key: PGVECTOR_DB
- name: PGVECTOR_USER
valueFrom:
secretKeyRef:
name: pgvector-connection
key: PGVECTOR_USER
- name: PGVECTOR_PASSWORD
valueFrom:
secretKeyRef:
name: pgvector-connection
key: PGVECTOR_PASSWORD
-
Click Workloads → Pods.
-
Confirm that the Llama Stack pod restarts and reaches the Running state.
-
Open the pod logs and confirm that the server starts successfully and initializes the pgvector provider without errors.
Using Qdrant in Llama Stack
Qdrant is a supported remote vector store provider for Llama Stack in Open Data Hub. You can deploy Qdrant in your OpenShift Container Platform project or connect to an existing Qdrant instance, and configure Llama Stack to use Qdrant for retrieval-augmented generation (RAG) workloads.
To use Qdrant with Llama Stack, complete the following tasks:
-
Review how Qdrant integrates with Llama Stack.
-
Deploy a Qdrant instance or connect to an existing deployment.
-
Configure your
LlamaStackDistribution to use Qdrant as the vector store provider. -
Perform vector operations through the OpenAI-compatible Vector Stores API.
Overview of Qdrant vector databases
Qdrant is an open source vector database optimized for high-performance similarity search and advanced filtering. In Open Data Hub, Qdrant is supported as a remote vector store provider for Llama Stack and can be used in retrieval-augmented generation (RAG) workloads that require efficient vector indexing and durable storage.
When used with Llama Stack in Open Data Hub, Qdrant provides:
-
High-performance similarity search using Hierarchical Navigable Small World (HNSW) indexing
-
Filtering based on stored metadata during vector search
-
Persistent storage of vector data
-
Integration through the OpenAI-compatible Vector Stores API
In a RAG workflow:
-
Embeddings are generated by the configured embedding provider.
-
Qdrant stores embedding vectors and performs similarity search.
-
Llama Stack manages ingestion, retrieval, and model inference through a unified API.
In Open Data Hub, you must deploy Qdrant as a remote service, either within your OpenShift Container Platform project or as an externally managed deployment.
Note
Inline Qdrant is not supported. To use Qdrant with Llama Stack in Open Data Hub, deploy Qdrant as a remote service.
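After a remote Qdrant service is configured in your LlamaStackDistribution, registering a vector store follows the same client pattern shown earlier for Milvus and FAISS. The sketch below only builds the request arguments so the shape is clear; the provider_id value "qdrant" and the store name are assumptions here, so use the provider identifier configured in your run.yaml.

```python
def vector_store_request(name, embedding_model_id, embedding_dimension, provider_id):
    """Build kwargs for client.vector_stores.create(), mirroring the earlier
    Milvus and FAISS examples in this document."""
    return {
        "name": name,
        "extra_body": {
            "embedding_model": embedding_model_id,
            "embedding_dimension": embedding_dimension,
            "provider_id": provider_id,  # assumed identifier for the remote Qdrant provider
        },
    }

request = vector_store_request("my_qdrant_db", "granite-embedding-125m", 768, "qdrant")
# With a running Llama Stack server you would then call:
# vector_store = client.vector_stores.create(**request)
print(request["extra_body"]["provider_id"])
```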
A typical remote deployment includes:
-
A Qdrant service exposing HTTP (port 6333) and gRPC (port 6334) endpoints
-
Persistent storage for vector data
-
Optional API key authentication
For deployment and configuration instructions, see Using Qdrant in Llama Stack.
Deploying a Qdrant vector database
You can connect Llama Stack in Open Data Hub to an existing Qdrant instance or deploy a Qdrant vector database in your OpenShift Container Platform project. For development or evaluation purposes, you can deploy Qdrant by creating Kubernetes resources in the OpenShift web console.
-
You have installed OpenShift Container Platform 4.19 or later.
-
You have permission to create resources in a project.
-
A StorageClass is available that can provision a PersistentVolume for the PersistentVolumeClaim used by this deployment.
Note: This example uses a single PersistentVolumeClaim. If your cluster uses dynamic provisioning, the StorageClass provisions the required PersistentVolume automatically.
-
Optional: You have an API key for Qdrant authentication. If your Qdrant instance does not require authentication, remove the Secret and the QDRANT__SERVICE__API_KEY environment variable from the deployment example.
-
Log in to the OpenShift Container Platform web console.
-
From the Project list, select the project where you want to deploy Qdrant.
-
Click Import YAML.
-
Paste the following YAML:
Important: This example deploys a standalone Qdrant service for development or evaluation. For production deployments, review and adapt the configuration to meet your organization’s security, availability, backup, and lifecycle requirements.
apiVersion: v1
kind: Secret
metadata:
  name: <qdrant_credentials_secret>
type: Opaque
stringData:
  QDRANT_API_KEY: "<api_key>"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <qdrant_pvc>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: <storage_size>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <qdrant_deployment>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <qdrant_app_label>
  template:
    metadata:
      labels:
        app: <qdrant_app_label>
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.12.0
          ports:
            - name: http
              containerPort: 6333
            - name: grpc
              containerPort: 6334
          env:
            - name: QDRANT__SERVICE__API_KEY
              valueFrom:
                secretKeyRef:
                  name: <qdrant_credentials_secret>
                  key: QDRANT_API_KEY
          volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
            - name: qdrant-storage
              mountPath: /qdrant/snapshots
              subPath: snapshots
          readinessProbe:
            httpGet:
              path: /readyz
              port: 6333
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 6333
            initialDelaySeconds: 10
            periodSeconds: 20
      volumes:
        - name: qdrant-storage
          persistentVolumeClaim:
            claimName: <qdrant_pvc>
---
apiVersion: v1
kind: Service
metadata:
  name: <qdrant_service>
spec:
  selector:
    app: <qdrant_app_label>
  ports:
    - name: http
      port: 6333
      targetPort: 6333
    - name: grpc
      port: 6334
      targetPort: 6334
  type: ClusterIP
Note: If your Qdrant instance does not require authentication, remove the Secret and the QDRANT__SERVICE__API_KEY environment variable from the Deployment configuration.
-
Replace the placeholder values as follows:
-
<qdrant_credentials_secret>: A name for the Secret that stores the Qdrant API key, for exampleqdrant-credentials. -
<api_key>: An API key for authenticating with Qdrant. If authentication is not required, remove the Secret and theQDRANT__SERVICE__API_KEYenvironment variable from the Deployment. -
<qdrant_pvc>: A name for the PersistentVolumeClaim, for exampleqdrant-pvc. -
<storage_size>: The storage capacity to request, for example10Gi. -
<qdrant_deployment>: A name for the Deployment, for exampleqdrant. -
<qdrant_app_label>: A label for the application, for exampleqdrant. -
<qdrant_service>: A name for the Service, for exampleqdrant-service.
-
-
Click Create.
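The placeholder substitution above can be sketched in Python. The template fragment and values here are illustrative, not the full manifest:

```python
# Hypothetical fragment of the manifest, using placeholders from this procedure.
template = """\
metadata:
  name: <qdrant_pvc>
spec:
  resources:
    requests:
      storage: <storage_size>
"""

# Concrete values chosen for this example.
values = {
    "<qdrant_pvc>": "qdrant-pvc",
    "<storage_size>": "10Gi",
}

def fill_placeholders(text, mapping):
    """Replace each <placeholder> token with its concrete value."""
    for key, val in mapping.items():
        text = text.replace(key, val)
    return text

manifest = fill_placeholders(template, values)
```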
-
The Qdrant Service is present in the project and exposes ports 6333 (HTTP) and 6334 (gRPC). You can confirm this on the Networking → Services page in the OpenShift Container Platform web console. -
The Qdrant pod reaches the Running state. You can confirm this on the Workloads → Pods page in the OpenShift Container Platform web console.
Note: This verification confirms only that Qdrant is deployed and reachable within the project. To use this Qdrant instance with Llama Stack, configure the Qdrant provider in a LlamaStackDistribution resource.
Configuring the Qdrant remote provider in Llama Stack
To use Qdrant as a remote vector store, configure your LlamaStackDistribution resource with the connection details for your Qdrant service. This configuration enables Llama Stack to store and retrieve embedding vectors using Qdrant in Open Data Hub.
-
You have installed and enabled the Llama Stack Operator in Open Data Hub.
-
You have a running Qdrant instance that is accessible from your OpenShift Container Platform cluster.
-
You have the Qdrant connection details, including the service URL and, if required, an API key.
-
You have permission to create Secrets and modify custom resources in your project.
-
In the OpenShift web console, switch to the Administrator perspective.
-
Create a Secret that stores the Qdrant connection details used by Llama Stack. This Secret must contain the URL of the Qdrant service and, if required, the API key.
Note: If you deployed Qdrant by using the procedure in Deploying a Qdrant vector database, create this Secret separately for the Llama Stack configuration. The Secret created during the Qdrant deployment does not contain the QDRANT_URL value required by the Llama Stack provider.
-
From the Project list, select the project where the LlamaStackDistribution resource is deployed.
Click Workloads → Secrets.
-
Click Create → From YAML.
-
Paste the following YAML:
apiVersion: v1
kind: Secret
metadata:
  name: qdrant-connection
type: Opaque
stringData:
  QDRANT_URL: "<qdrant_url>"
  QDRANT_API_KEY: "<api_key>"
-
Replace the placeholder values as follows:
-
<qdrant_url>: The full URL to the Qdrant service, for examplehttp://qdrant-service:6333. For in-cluster deployments, use the Service name and port. For external deployments, use the external URL. -
<api_key>: The API key for authenticating with Qdrant. If authentication is not enabled for your Qdrant instance, remove theQDRANT_API_KEYentry from both the Secret and theenvsection in theLlamaStackDistributionconfiguration.
-
-
Click Create.
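Inside the Llama Stack pod, each secretKeyRef entry surfaces as an environment variable that the provider reads at startup. A minimal Python sketch of that lookup; the default URL shown is only an example:

```python
import os

# Assumption: these names match the QDRANT_URL / QDRANT_API_KEY keys in the
# qdrant-connection Secret. The default below is illustrative only.
os.environ.setdefault("QDRANT_URL", "http://qdrant-service:6333")

qdrant_url = os.environ["QDRANT_URL"]
qdrant_api_key = os.environ.get("QDRANT_API_KEY")  # None when auth is disabled
```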
-
-
Update your LlamaStackDistribution custom resource to reference the Secret and supply the required environment variables.
-
Click Operators → Installed Operators.
-
Select the Llama Stack Operator.
-
Click the LlamaStackDistribution tab.
-
Select your LlamaStackDistribution resource. -
Click YAML.
-
Update the resource to include the following fields.
Note: The environment variable names and configuration fields used by the Qdrant provider can vary depending on the Llama Stack version included with Open Data Hub. Before applying this configuration, verify that the variables and fields match the supported versions listed in Supported Configurations for 3.x.
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack
spec:
  server:
    containerSpec:
      env:
        - name: ENABLE_QDRANT
          value: "true"
        - name: QDRANT_URL
          valueFrom:
            secretKeyRef:
              name: qdrant-connection
              key: QDRANT_URL
        - name: QDRANT_API_KEY
          valueFrom:
            secretKeyRef:
              name: qdrant-connection
              key: QDRANT_API_KEY
-
Click Save.
-
-
The Llama Stack pod reaches the Running state. You can confirm this on the Workloads → Pods page in the OpenShift Container Platform web console.
-
The pod logs show that the Qdrant provider initializes successfully and does not report connection errors.
-
Vector operations executed through the Llama Stack API complete successfully, confirming that Llama Stack can communicate with Qdrant.
For information about performing vector operations, see Performing vector operations with Qdrant.
Performing vector operations with Qdrant
After configuring Qdrant as the vector store provider in Llama Stack, you can perform vector operations by using the OpenAI-compatible Vector Stores API exposed by Llama Stack. These operations include creating vector stores, adding documents, performing similarity search, and deleting vector stores. You interact with the Llama Stack API rather than connecting directly to Qdrant. Llama Stack manages collection creation, embedding generation, and query execution on your behalf.
-
You have installed and enabled the Llama Stack Operator in Open Data Hub.
-
You have configured Qdrant as the vector store provider in your
LlamaStackDistribution. -
You have an embedding model available through a configured inference provider.
-
You have network access to the Llama Stack API endpoint.
-
You have installed the jq command-line utility. For installation instructions, see jq.
-
You have the curl command-line tool installed.
-
Determine how you will access the Llama Stack API.
You can access the API from within the cluster or from outside the cluster.
-
In-cluster access: Run the curl commands from a pod in the same project, or from a workstation that has network access to the Llama Stack Service. -
External access: Expose the Llama Stack Service by creating a Route, and then use the Route URL from your local workstation.
For this procedure, set LLAMA_STACK_URL to the service or route root URL without the /v1 suffix. The example commands append /v1 as part of the endpoint path. For more information about API compatibility and base URL requirements, see
Example base URL for in-cluster access:
LLAMA_STACK_URL="http://llamastack-service:8321"
Example base URL for external access through a Route:
LLAMA_STACK_URL="https://llamastack-route.example.com"
-
-
Create a vector store and capture its ID.
CREATE_RESPONSE=$(curl -s -X POST "${LLAMA_STACK_URL}/v1/vector_stores" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-rag-store",
    "embedding_model": "sentence-transformers/ibm-granite/granite-embedding-125m-english",
    "embedding_dimension": 768,
    "provider_id": "qdrant-remote"
  }')
VECTOR_STORE_ID=$(echo "$CREATE_RESPONSE" | jq -r '.id')
echo "Vector store ID: ${VECTOR_STORE_ID}"
Ensure that the VECTOR_STORE_ID variable contains a valid value before continuing.
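The same call can be made from Python with only the standard library. This is a sketch against the endpoint shown above, not an official client; running create_vector_store requires a reachable Llama Stack server:

```python
import json
import urllib.request

def build_create_payload(name, embedding_model, embedding_dimension, provider_id):
    """JSON body for POST /v1/vector_stores, matching the curl example."""
    return json.dumps({
        "name": name,
        "embedding_model": embedding_model,
        "embedding_dimension": embedding_dimension,
        "provider_id": provider_id,
    })

def create_vector_store(base_url, **kwargs):
    """POST the payload and return the new vector store ID (needs a live server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/vector_stores",
        data=build_create_payload(**kwargs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]
```

The add, search, and delete operations in the following sections follow the same pattern with different endpoint paths and methods.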
Add files to a vector store
Upload files to the vector store for ingestion. Llama Stack automatically splits the content into chunks, generates embeddings, and stores them in Qdrant.
FILE_RESPONSE=$(curl -s -X POST "${LLAMA_STACK_URL}/v1/vector_stores/${VECTOR_STORE_ID}/files" \
-F "file=@/path/to/document.pdf" \
-F "purpose=assistants")
FILE_ID=$(echo "$FILE_RESPONSE" | jq -r '.id')
echo "File ID: ${FILE_ID}"
Query a vector store
Perform similarity search to retrieve relevant content from the vector store. The search query is converted into an embedding and compared with stored vectors in Qdrant.
curl -X POST "${LLAMA_STACK_URL}/v1/vector_stores/${VECTOR_STORE_ID}/search" \
-H "Content-Type: application/json" \
-d '{
"query": "What is retrieval-augmented generation?",
"max_results": 5
}'
Delete a vector store
Delete a vector store when it is no longer required. This removes the vector store and its associated data from Qdrant.
curl -X DELETE "${LLAMA_STACK_URL}/v1/vector_stores/${VECTOR_STORE_ID}"
-
Creating a vector store returns a valid vector store ID.
-
File uploads complete successfully and are accepted by the API.
-
Search queries return results from the ingested content.
Configuring Llama Stack with OAuth Authentication
You can configure Llama Stack to enable Role-Based Access Control (RBAC) for model access using OAuth authentication on Open Data Hub. The following example shows how to configure Llama Stack so that a vLLM model can be accessed by all authenticated users, while an OpenAI model is restricted to specific users. This example uses Keycloak to issue and validate tokens.
This example assumes that the Keycloak server is available at https://my-keycloak-server.com.
Important: When accessing Llama Stack APIs, the required base URL depends on the client you are using. Using an incorrect base URL results in request failures.
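The base URL distinction can be sketched in Python, using a hypothetical in-cluster service host: OpenAI-compatible SDK clients expect the /v1 suffix on the base URL, while the curl examples in this document add /v1 explicitly in each endpoint path.

```python
# Hypothetical Llama Stack service root (no /v1 suffix).
service_url = "http://llamastack-service:8321"

# OpenAI SDK clients append endpoint paths after /v1, so the base URL must end in /v1.
openai_base_url = f"{service_url}/v1"

# Raw HTTP clients (for example curl) use the service root and include /v1 in each path.
chat_endpoint = f"{service_url}/v1/openai/v1/chat/completions"
```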
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have logged in to Open Data Hub.
-
You have cluster administrator privileges for your OpenShift cluster.
-
You have a Keycloak instance configured with the following settings:
-
Realm:
llamastack-demo -
Client: llamastack with direct access grants enabled -
Role:
inference_max -
A protocol mapper that adds realm roles to the access token under the llamastack_roles claim -
Two test users:
-
user1 with no assigned roles -
user2 assigned the inference_max role
-
-
-
You have saved the Keycloak client secret for token requests.
-
Your Keycloak server is reachable at
https://my-keycloak-server.com. -
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
To configure Llama Stack to use Role-Based Access Control (RBAC) for model access, you must first view and verify the OAuth provider token structure.
-
Generate a Keycloak test token to view the structure with the following command:
$ curl -d client_id=llamastack -d client_secret=YOUR_CLIENT_SECRET -d username=user1 -d password=user-password -d grant_type=password https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/token | jq -r .access_token > test.token -
View the token claims by running the following command:
$ cat test.token | cut -d . -f 2 | base64 -d 2>/dev/null | jq .
Example token structure from Keycloak:
{
  "iss": "http://my-keycloak-server.com/realms/llamastack-demo",
  "aud": "account",
  "sub": "761cdc99-80e5-4506-9b9e-26a67a8566f7",
  "preferred_username": "user1",
  "llamastack_roles": [
    "inference_max"
  ]
}
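The shell pipeline above can be sketched in Python. This decodes the claims segment of the token without verifying the signature, so use it only for inspection:

```python
import base64
import json

def decode_jwt_claims(token):
    """Decode the middle (claims) segment of a JWT without verifying the
    signature, equivalent to: cut -d . -f 2 | base64 -d | jq ."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore base64 padding stripped by JWT encoding
    return json.loads(base64.urlsafe_b64decode(payload))
```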
-
-
Update your existing run.yaml file by adding the OAuth parameters.
run.yaml file:
server:
port: 8321
auth:
provider_config:
type: "oauth2_token"
jwks:
uri: "https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/certs" (1)
key_recheck_period: 3600
issuer: "https://my-keycloak-server.com/realms/llamastack-demo" (1)
audience: "account"
verify_tls: true
claims_mapping:
llamastack_roles: "roles" (2)
access_policy:
- permit: (3)
actions: [read]
resource: model::vllm-inference/llama-3-2-3b
description: Allow all authenticated users to access Llama 3.2 model
- permit: (4)
actions: [read]
resource: model::openai/gpt-4o-mini
when: user with inference_max in roles
description: Allow only users with inference_max role to access OpenAI models
<1> Specify your Keycloak host and Realm in the URL.
<2> Maps the llamastack_roles path from the token to the roles field.
<3> Policy 1: Allow all authenticated users to access vLLM models.
<4> Policy 2: Restrict OpenAI models to users with the inference_max role.
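The two permit rules can be illustrated with a conceptual Python sketch. This is not Llama Stack's actual policy engine, only a model of the intended outcomes for the resources and role names used in this example:

```python
# Conceptual model of the access_policy rules above (illustration only).
POLICIES = [
    # Policy 1: any authenticated user may read the vLLM model.
    {"resource": "model::vllm-inference/llama-3-2-3b", "required_role": None},
    # Policy 2: the OpenAI model requires the inference_max role.
    {"resource": "model::openai/gpt-4o-mini", "required_role": "inference_max"},
]

def can_read(model, roles):
    """Return True if a user with the given roles may read the model."""
    for policy in POLICIES:
        if policy["resource"] == f"model::{model}":
            required = policy["required_role"]
            return required is None or required in roles
    return False  # no matching permit rule: deny
```

Under this model, user1 (no roles) can read the vLLM model but not the OpenAI model, and user2 (inference_max) can read both, matching the verification steps later in this procedure.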
-
Create a ConfigMap that uses the updated run.yaml configuration by running the following command:
$ oc create configmap llamastack-custom-config --from-file=run.yaml=run.yaml -n redhat-ods-operator
-
Create a llamastack-distribution.yaml file with the following parameters:
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack-distribution
  namespace: redhat-ods-operator
spec:
  replicas: 1
  server:
    distribution:
      name: rh-dev
    containerSpec:
      name: llama-stack
      port: 8321
      env:
        # vLLM Provider Configuration
        - name: VLLM_URL
          value: "https://your-vllm-service:8000/v1"
        - name: VLLM_API_TOKEN
          value: "your-vllm-token"
        - name: VLLM_TLS_VERIFY
          value: "false"
        # OpenAI Provider Configuration
        - name: OPENAI_API_KEY
          value: "your-openai-api-key"
        - name: OPENAI_BASE_URL
          value: "https://api.openai.com/v1"
    userConfig:
      configMapName: llamastack-custom-config
      configMapNamespace: redhat-ods-operator
-
To apply the distribution, run the following command:
$ oc apply -f llamastack-distribution.yaml -
Wait for the distribution to be ready by running the following command:
$ oc wait --for=jsonpath='{.status.phase}'=Ready llamastackdistribution/llamastack-distribution -n redhat-ods-operator --timeout=300s -
Generate the OAuth tokens for each user account to authenticate API requests.
-
To request a basic access token, and to add the token to a user1.token file, run the following command:
$ curl -d client_id=llamastack \
  -d client_secret=YOUR_CLIENT_SECRET \
  -d username=user1 \
  -d password=user1-password \
  -d grant_type=password \
  https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/token \
  | jq -r .access_token > user1.token
-
To request a full access token and add it to a user2.token file, run the following command:
$ curl -d client_id=llamastack \
  -d client_secret=YOUR_CLIENT_SECRET \
  -d username=user2 \
  -d password=user2-password \
  -d grant_type=password \
  https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/token \
  | jq -r .access_token > user2.token
-
Verify the credentials by running the following command:
$ cat user2.token | cut -d . -f 2 | base64 -d 2>/dev/null | jq .
-
-
Set the Llama Stack base URL:
export LLAMASTACK_URL="http://<llama-stack-host>:8321" -
Verify basic access for user1 (no privileged roles).
Load the token:
USER1_TOKEN=$(cat user1.token)
Confirm that user1 can access the vLLM-served model:
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${USER1_TOKEN}" \
  -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
Expected result: HTTP 200.
Confirm that user1 is denied access to the restricted OpenAI model:
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${USER1_TOKEN}" \
  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
Expected result: HTTP 403 (forbidden).
-
Verify privileged access for user2 (assigned the inference_max role).
Load the token:
USER2_TOKEN=$(cat user2.token)
Confirm that user2 can access both models:
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${USER2_TOKEN}" \
  -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${USER2_TOKEN}" \
  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
Expected result: HTTP 200 for both requests.
-
Verify that requests without a Bearer token are denied.
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
Expected result: HTTP 401 (unauthorized).
Enabling high availability and autoscaling on Llama Stack
Llama Stack servers can be configured to remain operational in the event of a single point of failure.
If a pod restarts, an application crashes, or node maintenance occurs, you can maintain availability by enabling PostgreSQL high-availability settings in your Llama Stack server.
You can also enable autoscaling to adjust server capacity automatically based on resource utilization.
The following procedures show how to configure high availability and autoscaling in your LlamaStackDistribution custom resource.
-
You have installed OpenShift Container Platform 4.19 or newer.
-
You have logged in to Open Data Hub.
-
You have cluster administrator privileges for your OpenShift cluster.
-
You have installed the PostgreSQL Operator version 14 or later.
-
You have activated the Llama Stack Operator in your cluster.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
To enable high availability for your Llama Stack server, add the following parameters to your LlamaStackDistribution CR:
spec:
  replicas: 2 (1)
  server:
    podDisruptionBudget:
      minAvailable: 1 (2)
    topologySpreadConstraints: (3)
      - maxSkew: 1 (4)
        topologyKey: topology.kubernetes.io/zone (5)
        whenUnsatisfiable: ScheduleAnyway (6)
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: llamastackdistribution-sample (7)
-
This example runs two Llama Stack pods for high availability.
-
Specifies voluntary disruption tolerance for the pods. For example, in a voluntary disruption, this configuration keeps at least one server pod available.
-
Specifies how to spread matching pods in the topology.
-
Instructs the scheduler to minimize replica imbalance across zones. With a skew of one and two replicas, the scheduler targets one Pod per zone when multiple zones are available.
-
Configures and uses the node’s zone label as the failure-domain for pod spreading.
-
Configures and allows scheduling to proceed even when spread constraints cannot be met. For example, if the cluster has insufficient capacity, Pods are scheduled instead of remaining
Pending. -
Ensures that only pods from the same application instance are considered when calculating spread.
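The maxSkew constraint can be illustrated with a short Python sketch: skew is the difference between the most- and least-populated zones, and the scheduler works to keep it at or below the configured maximum.

```python
def skew(pods_per_zone):
    """Topology skew: difference between the fullest and emptiest zone."""
    return max(pods_per_zone) - min(pods_per_zone)

# With replicas: 2 across two zones:
balanced = skew([1, 1])    # one pod per zone satisfies maxSkew: 1
unbalanced = skew([2, 0])  # both pods in one zone violates maxSkew: 1
```

Because whenUnsatisfiable is ScheduleAnyway, a violating placement is still allowed when the cluster cannot satisfy the constraint, rather than leaving pods Pending.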
-
-
To enable autoscaling for your Llama Stack server, add the following parameters to your LlamaStackDistribution CR:
spec:
  server:
    autoscaling: (1)
      minReplicas: 1 (2)
      maxReplicas: 5 (3)
      targetCPUUtilizationPercentage: 75 (4)
      targetMemoryUtilizationPercentage: 70 (5)
-
Configures HorizontalPodAutoscaler (HPA) for the server pods.
-
Specifies the lower bound replica count maintained by the HPA.
-
Specifies the upper bound replica count maintained by the HPA.
-
Configures CPU-based scaling.
-
Configures memory-based scaling.
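The HPA targets can be illustrated with the standard Kubernetes scaling rule, desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. A Python sketch using the values from this example:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=5):
    """Kubernetes HPA rule: ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]. Defaults match this example's CR."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))
```

For example, two replicas averaging 150% of the 75% CPU target scale out to four; sustained load beyond that is capped at maxReplicas: 5.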
-