Ollama Deployment

Example deployment of the Ollama server for OpenShift.

This deployment uses a UBI9 image defined in Containerfile and accessible at quay.io/rh-aiservices-bu/ollama-ubi9 (check for the latest version available). The image is compiled for the avx2 instruction set, supported by virtually all Intel and AMD processors released after 2016. However, to keep its size down, the image does not include GPU acceleration. It is meant to be used as a simple LLM server for tests when you don't have access to GPUs.
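If you want to try the image locally before deploying it to OpenShift, a minimal smoke test with Podman could look like the sketch below. This assumes the image serves the Ollama API on its default port, 11434, and uses a placeholder tag:

podman run --rm -p 11434:11434 quay.io/rh-aiservices-bu/ollama-ubi9:<tag>
# in another terminal, the root path should answer with a short status message
curl http://localhost:11434/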

A notebook example using Langchain is available here.

Installation

The default installation deploys the Mistral-7B-Instruct-v0.2 model in its quantized Ollama version. See Advanced installation for instructions on how to change the model as well as various settings.
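As an illustration, once the server is running you can also pull additional models through Ollama's /api/pull endpoint (the service hostname below is a placeholder, and the model tag must exist in the Ollama library):

curl http://service:11434/api/pull \
      -H "Content-Type: application/json" \
      -d '{"name": "mistral"}'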

Automated Deployment:

  • Use the OpenShift GitOps/ArgoCD Application definition at gitops/vllm-app.yaml
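If OpenShift GitOps (Argo CD) is already installed on the cluster, registering the application can be a single command; the sketch below assumes the operator's default openshift-gitops namespace:

oc apply -f gitops/vllm-app.yaml -n openshift-gitops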

Manual Deployment (from the gitops folder; the corresponding commands are sketched after this list):

  • Create the PVC using the file pvc.yaml to hold the models cache.
  • Create the Deployment using the file deployment.yaml.
  • Create the Service using the file service.yaml.
  • If you want to expose the server outside of your OpenShift cluster, create the Route using the file route.yaml.
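A possible sequence of commands, assuming you are logged in with oc and working in the target project:

# run from the gitops folder of this repository
oc apply -f pvc.yaml          # models cache
oc apply -f deployment.yaml   # Ollama server
oc apply -f service.yaml      # internal endpoint on port 11434
oc apply -f route.yaml        # optional: external HTTPS endpoint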

The API is now accessible at the endpoints:

  • defined by your Service, accessible internally on port 11434 over http, e.g. http://ollama.your-project.svc.cluster.local:11434/
  • defined by your Route, accessible externally through https, e.g. https://ollama.your-project.your-cluster/
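A quick way to check connectivity (from inside the cluster for the Service URL, or from anywhere for the Route) is a plain GET on the root path, which Ollama normally answers with a short "Ollama is running" message:

curl http://ollama.your-project.svc.cluster.local:11434/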

Usage

You can query the model directly. From the command line, use the /api/generate endpoint:

curl http://service:11434/api/generate \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mistral",
        "prompt":"Why is the sky blue?"
      }'
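By default, /api/generate streams the answer back as a series of newline-delimited JSON objects. If you prefer a single JSON response, add "stream": false to the request body:

curl http://service:11434/api/generate \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mistral",
        "prompt": "Why is the sky blue?",
        "stream": false
      }'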

or from Python, using the ollama client library (pip install ollama):

import ollama

# point the client at the Ollama Service (internal URL) or Route
client = ollama.Client(host='http://service:11434')

# stream the chat completion and print it token by token
stream = client.chat(
  model='mistral',
  messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
  stream=True,
)

for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)
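If you don't need streaming, omit the stream parameter (or set it to False) and the call returns the complete response in a single object.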

You can also find a notebook example using Langchain to query Ollama in this repo here.