Ollama benchmarks on Windows — measure LLM speed

The key metric for local LLM performance is tokens per second (t/s). This guide shows how to measure it with the built-in --verbose flag, explains what each metric means, and gives reference speeds for common Windows hardware.

Tokens per second counts how many tokens the model generates each second, so higher is faster. A typical conversation feels fluid at 15 t/s or more.

Run Ollama's built-in benchmark

Running a model with the --verbose flag makes Ollama print timing statistics after each response, which works as a built-in benchmark of inference speed on your hardware:

cmd.exe
# Benchmark a specific model:
C:\> ollama run llama3 --verbose
# Type a prompt, then press Enter. After the response:
total duration: 4.5s
load duration: 423ms
prompt eval count: 26 token(s)
prompt eval duration: 312ms
prompt eval rate: 83.3 tokens/s
eval count: 312 token(s)
eval rate: 24.8 tokens/s

eval rate is the key number — how fast the model generates output tokens. prompt eval rate is how fast it processes your input.
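To compare runs without copying numbers by hand, the statistics can be parsed from text. The following is a minimal sketch, assuming the output has been pasted or captured as a string in the "name: value" form shown above. The function names parse_duration and parse_verbose_stats are hypothetical helpers, not part of Ollama.

```python
import re

# Duration units as they appear in the statistics (for example 423ms, 4.5s, 1m2.5s),
# expressed in seconds.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "µs": 1e-6, "us": 1e-6, "ns": 1e-9}
# Two-letter units are listed first so "ms" is not read as "m".
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|µs|us|ns|h|m|s)")


def parse_duration(text):
    """Convert a duration such as '423ms', '4.5s' or '1m2.5s' to seconds."""
    parts = _DURATION_PART.findall(text.strip())
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def parse_verbose_stats(output):
    """Parse the 'name: value' statistics lines into a dict of floats."""
    stats = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not value:
            continue
        if name.endswith("duration"):
            stats[name] = parse_duration(value)        # seconds
        elif name.endswith("count") or name.endswith("rate"):
            stats[name] = float(value.split()[0])      # tokens or tokens/s
    return stats


# Sample statistics copied from the example run above.
SAMPLE = """\
total duration: 4.5s
load duration: 423ms
prompt eval count: 26 token(s)
prompt eval duration: 312ms
prompt eval rate: 83.3 tokens/s
eval count: 312 token(s)
eval rate: 24.8 tokens/s
"""

stats = parse_verbose_stats(SAMPLE)
# A rate is a token count divided by the matching duration.
prompt_rate = stats["prompt eval count"] / stats["prompt eval duration"]
print(f"eval rate: {stats['eval rate']} t/s")
print(f"recomputed prompt eval rate: {prompt_rate:.1f} t/s")
```

Recomputing a rate from its count and duration is a quick sanity check that the numbers were copied correctly.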

What the numbers mean

| Metric | What it measures | Good value |
| --- | --- | --- |
| eval rate | Output tokens per second | 15+ t/s for fluid conversation |
| prompt eval rate | Input processing speed | 50+ t/s typical |
| load duration | Time to load model into memory | <5s on SSD |
| total duration | Wall-clock time for full response | Depends on response length |
| gpu layers | Model layers running on GPU | Same as total layers = full GPU |
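The guideline values in the table can be turned into a quick check. This is only a sketch: the thresholds are the rules of thumb from the table, and the assess function and constant names are hypothetical.

```python
# Rule-of-thumb thresholds taken from the table above.
FLUID_EVAL_RATE = 15.0       # tokens/s for fluid conversation
TYPICAL_PROMPT_RATE = 50.0   # tokens/s for input processing
MAX_LOAD_SECONDS = 5.0       # seconds, assuming the model is stored on an SSD


def assess(eval_rate, prompt_eval_rate, load_seconds):
    """Map each metric to True when it meets its guideline value."""
    return {
        "eval rate": eval_rate >= FLUID_EVAL_RATE,
        "prompt eval rate": prompt_eval_rate >= TYPICAL_PROMPT_RATE,
        "load duration": load_seconds < MAX_LOAD_SECONDS,
    }


# Values from the example run: 24.8 t/s, 83.3 t/s, 423 ms load time.
for metric, ok in assess(24.8, 83.3, 0.423).items():
    print(f"{metric}: {'ok' if ok else 'below guideline'}")
```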

What affects benchmark results

GPU vs CPU
A mid-range GPU like an RTX 3070 (8 GB VRAM) typically achieves 40–80 t/s on a 7B model. CPU-only on a modern Ryzen 9 is typically 5–15 t/s. GPU acceleration is the single biggest performance factor. See GPU Acceleration.
Model size and quantization
Smaller models and lower-bit quantizations are faster. A Q4_K_M 7B model is roughly 2x faster than a Q8_0 7B model on the same hardware, with a modest quality trade-off. To get a specific quantization, pull its tag, for example ollama pull llama3:8b-instruct-q4_K_M.
RAM and VRAM
If the model fits entirely in VRAM, it runs at GPU speed. If it overflows into system RAM, the overflowing layers run on the CPU, which is much slower. A GPU system that reaches only 2–5 t/s is usually overflowing VRAM. Try a smaller model or a lower-bit quantization.
Warm-up effect
The first run after loading a model is slower because the model is still being read from disk into memory. Run at least 2–3 prompts before comparing benchmarks; the later runs represent steady-state performance.
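The warm-up advice can be automated by repeating a prompt and discarding the first result. The sketch below assumes Ollama's local HTTP API is listening on its default address (http://localhost:11434) and uses the /api/generate endpoint with streaming disabled, whose response reports eval_count and eval_duration (in nanoseconds). The function names, the number of runs, and the use of a median are illustrative choices, not something the guide prescribes.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def tokens_per_second(response):
    """Generation rate from a non-streaming /api/generate response.

    eval_duration is reported in nanoseconds, hence the factor of 1e9.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(model, prompt):
    """Send one prompt to the local Ollama server and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def steady_state_rate(rates, warmup=1):
    """Median of the measured rates after discarding the first `warmup` runs."""
    kept = rates[warmup:]
    if not kept:
        raise ValueError("need more runs than warm-up runs")
    return statistics.median(kept)


def benchmark(model, prompt, runs=4, warmup=1):
    """Run the same prompt several times and report the steady-state rate."""
    rates = [tokens_per_second(generate(model, prompt)) for _ in range(runs)]
    return steady_state_rate(rates, warmup)


# Example call (requires a running Ollama server with the model pulled):
# print(benchmark("llama3", "Explain what a token is in one paragraph."))
```

A median is used here instead of a mean so that one unusually slow run, for instance during a spike in system load, does not skew the result.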

Typical benchmark results on Windows

| Hardware | Model | Tokens/s (approx) |
| --- | --- | --- |
| RTX 4090 (24 GB) | Llama 3 8B Q4 | 120–150 t/s |
| RTX 3080 (10 GB) | Llama 3 8B Q4 | 60–80 t/s |
| RTX 3060 (12 GB) | Mistral 7B Q4 | 40–55 t/s |
| RX 7900 XTX (DirectML) | Llama 3 8B Q4 | 25–45 t/s |
| Ryzen 9 7900X (CPU only) | Mistral 7B Q4 | 8–14 t/s |
| Intel Core i7-12700 (CPU only) | Mistral 7B Q4 | 5–10 t/s |

These are approximate real-world figures. Actual results vary with context length, system load, and exact quantization. Use them as a rough reference only.

GPU not performing well?

Enable NVIDIA CUDA or AMD DirectML for maximum inference speed. See the GPU Acceleration guide.