Brandon Rozek

Setting up Ollama with CUDA on Podman Quadlets


Open WebUI provides a nice chat interface for interacting with LLMs over Ollama and OpenAI-compatible APIs. Using Ollama, we can self-host many different open-source LLMs! This post documents the steps I took to get Ollama working with CUDA in my Podman setup. However, given how fast machine learning projects iterate, I wouldn’t be surprised if these exact steps no longer work. In that case, I’ll provide links to the official documentation, which will hopefully help.

I’ll assume that you have the NVIDIA driver installed on your machine. The steps vary by OS/distribution and by how modern a driver you want, but I generally recommend sticking with what’s packaged in your distribution’s repository. This minimizes headaches…
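
If you want to double-check that the driver is loaded before involving containers at all, running nvidia-smi on the host should print a table like the one shown later in this post:

nvidia-smi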

With that, our first step is to install the nvidia-container-toolkit. This package contains a collection of libraries and scripts that let containers make use of the GPU.

sudo dnf install nvidia-container-toolkit

At the time of writing, instructions for installing the toolkit can be found on NVIDIA’s website.

We can use this toolkit to generate a Container Device Interface (CDI) specification file, which Podman will use to talk to the GPU.

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Note: Every time you update your NVIDIA driver, you’ll have to run this command.

NVIDIA also documents the steps for configuring CDI on their website.

From here, we should make sure that the NVIDIA toolkit found the appropriate GPU(s) and generated their CDI entries.

nvidia-ctk cdi list

I only have one GPU on my machine; the three devices listed below correspond to its index, its UUID, and the catch-all all entry:

INFO[0000] Found 3 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=GPU-52785a8a-f8ca-99b9-0312-01a1f59e789b
nvidia.com/gpu=all

If we want the container to be able to access all of the GPUs, we can use the nvidia.com/gpu=all device. Otherwise, we can reference a specific one by its index or UUID.
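
For example, based on the list above, a container could be limited to just the first GPU by passing its index (or UUID) instead of all:

--device nvidia.com/gpu=0

The AddDevice= option in the Quadlet later in this post accepts the same device names.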

Then, we restart Podman so that the CDI files are loaded.

sudo systemctl restart podman

For our first test, we’ll make sure that the container can appropriately access the GPU by running the nvidia-smi command.

sudo podman run --rm \
	--device nvidia.com/gpu=all \
	docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
	nvidia-smi

For my GPU it outputs:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.04             Driver Version: 570.124.04     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:02:00.0  On |                  N/A |
|  0%   50C    P8             19W /  170W |    1546MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Now we are ready to set up Ollama! To save time when running our systemd commands, let’s pull the image ahead of time.

sudo podman pull docker.io/ollama/ollama

We’ll have to save the models somewhere, so in this example we’ll save them to /opt/ollama.

sudo mkdir /opt/ollama

Let’s configure the Quadlet. Save the following to /etc/containers/systemd/ollama.container:

[Container]
ContainerName=ollama
HostName=ollama
Image=docker.io/ollama/ollama
AutoUpdate=registry
Volume=/opt/ollama:/root/.ollama
PublishPort=11434:11434
AddDevice=nvidia.com/gpu=all

[Unit]

[Service]
Restart=always

[Install]
WantedBy=default.target

This file specifies the flags that get passed to the podman command (a rough equivalent invocation is sketched after the list below):

  • Publish the port 11434: This is the port we’ll use when sending messages to Ollama from Open WebUI. Of course you’re welcome to use other networking tricks to pull that off.
  • Mount the folder /opt/ollama on the filesystem to /root/.ollama within the container: We don’t want to have to re-download the LLM models each time!
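
For reference, the Quadlet above corresponds to roughly the following podman invocation (there’s no need to run this yourself; systemd manages the container for us):

sudo podman run -d --name ollama --hostname ollama \
	--device nvidia.com/gpu=all \
	--publish 11434:11434 \
	--volume /opt/ollama:/root/.ollama \
	docker.io/ollama/ollama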

For the moment of truth, let’s start it!

sudo systemctl start ollama
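
If the service doesn’t come up cleanly, the unit status and the container logs are the first places to look:

sudo systemctl status ollama
sudo podman logs ollama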

I won’t show how to configure Open WebUI in this post, but we can make sure that everything is working from within the Ollama container itself.

sudo podman exec -it ollama /bin/bash

We’ll perform a test with a smaller model (1.2 GB):

ollama run llama3.2:1b

Depending on your Internet connection, this will take a couple of minutes to download and load onto the GPU.

When it’s done, the prompt will be replaced with:

>>> 

From here you can chat with the LLM!
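
Since port 11434 is published on the host, we can also verify that Ollama answers over its HTTP API from outside the container. Here’s a quick sketch with curl (the model name assumes the llama3.2:1b pull from above):

curl http://localhost:11434/api/generate \
	-d '{"model": "llama3.2:1b", "prompt": "Why is the sky blue?", "stream": false}'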

