
Ollama GPU acceleration on Windows — CUDA & DirectML guide

On Windows, Ollama uses your GPU for inference automatically, provided the right drivers are installed. This guide covers NVIDIA CUDA, AMD DirectML, how to verify that the GPU is active, and how to fix it when it is not.

NVIDIA CUDA · AMD DirectML · 5-10x faster

Check if GPU acceleration is already active

Ollama detects your GPU automatically on install. Before changing anything, verify whether it is already working:

cmd.exe
C:\> ollama run llama3 --verbose
# After typing a message, check the stats:
total duration: 1.24s
gpu layers: 32
# Or check nvidia-smi while model is running:
C:\> nvidia-smi
| ollama runner ... 2048MiB |

If gpu layers shows 0, the GPU is not active. Follow the relevant section below.
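
As a cross-check that does not depend on reading the verbose stats, the share of a loaded model held in VRAM can be estimated from Ollama's local REST API. The sketch below assumes the server listens on the default http://localhost:11434 and that GET /api/ps returns a models list whose entries carry size and size_vram in bytes. The helper names (gpu_fraction, fetch_running_models, report) are hypothetical and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_PS_URL = "http://localhost:11434/api/ps"  # assumption: default local address


def gpu_fraction(model_entry):
    """Return the share (0.0-1.0) of a loaded model that resides in VRAM."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(model_entry.get("size_vram", 0) / size, 1.0)


def fetch_running_models(url=OLLAMA_PS_URL):
    """Ask the local Ollama server which models are currently loaded."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response).get("models", [])


def report(models):
    """Build one human-readable line per loaded model."""
    lines = []
    for entry in models:
        share = gpu_fraction(entry)
        state = "GPU not used" if share == 0 else f"{share:.0%} in VRAM"
        lines.append(f"{entry.get('name', '?')}: {state}")
    return lines
```

While a model is loaded, print("\n".join(report(fetch_running_models()))) prints one line per model. The ollama ps command shows comparable information in its PROCESSOR column.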

NVIDIA GPU — CUDA setup

Ollama uses CUDA for NVIDIA GPUs. Requirements: GTX 1000 series or newer, driver version 527+.

  1. Update NVIDIA drivers

    Download and install the latest Game Ready or Studio driver from nvidia.com/drivers. Minimum version 527.

  2. Verify CUDA is available

    cmd.exe
    C:\> nvidia-smi
    NVIDIA-SMI 546.01 | Driver Version: 546.01 | CUDA Version: 12.3
    | GeForce RTX 3080 |

  3. Restart Ollama and verify

    Right-click the Ollama tray icon → Quit, then restart. Run ollama run llama3 --verbose and confirm gpu layers: 32.
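
The driver check in the steps above can also be scripted. This is a minimal sketch, assuming nvidia-smi is on PATH and supports the --query-gpu=driver_version and --format=csv,noheader options. The 527 threshold is the minimum stated in this guide, and the function names are hypothetical.

```python
import subprocess

MIN_DRIVER_MAJOR = 527  # minimum driver version stated in this guide


def driver_major(version_text):
    """Extract the major number from a driver string such as '546.01'."""
    return int(version_text.strip().split(".")[0])


def meets_minimum(version_text, minimum=MIN_DRIVER_MAJOR):
    """True if the driver's major version is at least the required minimum."""
    return driver_major(version_text) >= minimum


def query_driver_version():
    """Read the installed driver version from nvidia-smi (first GPU only)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return output.splitlines()[0].strip()
```

On a machine with an NVIDIA GPU, meets_minimum(query_driver_version()) returns False when the driver needs updating.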

With 8 GB of VRAM you can run quantized 7B models fully on the GPU. With 16 GB or more you can run 13B–34B models, depending on the quantization level and how many layers fit in VRAM.

AMD GPU — DirectML on Windows

Ollama uses Microsoft DirectML for AMD GPUs on Windows. DirectML is built into Windows 10 v1903+ and Windows 11 — no separate install needed. Just update your AMD Adrenalin driver from amd.com/support.

cmd.exe
C:\> ollama run mistral --verbose
gpu layers: 32
backend: directml

AMD DirectML performance is generally lower than NVIDIA CUDA for equivalent hardware. If inference is slow, try smaller quantized models.

For a detailed AMD DirectML walkthrough, see the DirectML guide.

Get the most from your GPU

Keep models on an SSD
Ollama loads model weights from disk into VRAM when a model is first run. Compared with a hard disk, an NVMe SSD can cut load time from 10–30 seconds to 1–3 seconds. By default, models are stored in the .ollama\models folder in your user profile.

Use quantized models for lower VRAM usage
Models come in several quantization levels: for a 7B model, Q4_K_M uses ~4 GB of VRAM and Q8_0 uses ~8 GB. To pull a specific quantization, run ollama pull llama3:8b-instruct-q4_K_M. A lower Q number means less VRAM and slightly lower quality.

Close GPU-heavy apps while running Ollama
Games, video editors and other ML apps compete for VRAM. Close them before running large models to give Ollama the maximum VRAM budget.

GPU not detected after driver update
Fully quit Ollama from the system tray, then restart it from the Start menu. A new driver version sometimes needs a fresh service start before Ollama picks it up.
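
The VRAM figures quoted for quantized models follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below uses nominal values of 4 and 8 bits. Real quantization formats store extra scaling data, so actual files are somewhat larger. The 1 GB overhead for context cache and buffers is an assumption chosen for illustration, and both function names are hypothetical.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=1.0):
    """True if weights plus an assumed fixed overhead fit in the VRAM budget."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb
```

For example, weight_gb(7, 4) gives 3.5 and weight_gb(7, 8) gives 7.0, in line with the ~4 GB and ~8 GB figures above once format overhead is included.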

Want to measure your GPU speed?

Run benchmarks to measure tokens/second on your hardware.
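
For a quick single measurement, tokens/second can be derived from the statistics Ollama returns with a generation. This is a rough sketch, assuming the server is at the default http://localhost:11434 and that a non-streaming POST to /api/generate returns eval_count (tokens generated) and eval_duration (nanoseconds). The helper names are hypothetical, and one request is not a substitute for a proper benchmark.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # assumption: default local address


def tokens_per_second(stats):
    """Generation speed from a response's eval_count and eval_duration (ns)."""
    duration_ns = stats.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return stats.get("eval_count", 0) / (duration_ns / 1e9)


def run_once(model, prompt, url=GENERATE_URL):
    """Send one non-streaming request and return the parsed JSON response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)
```

With a model already pulled, tokens_per_second(run_once("llama3", "Hello")) returns the generation rate for that one request.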

Benchmarks guide