The main metric for local LLM speed is tokens per second (t/s) — how many tokens the model generates per second. Higher is faster. A typical conversation feels fluid at 15+ t/s.
Built-in benchmark
Run Ollama's built-in benchmark
Ollama includes a built-in benchmark command that measures inference speed on your hardware:
# Benchmark a specific model:
C:\> ollama run llama3 --verbose
# Type a prompt, then press Enter. After the response:
total duration: 4.5s
load duration: 423ms
prompt eval count: 26 token(s)
prompt eval duration: 312ms
prompt eval rate: 83.3 tokens/s
eval count: 312 token(s)
eval rate: 24.8 tokens/s
eval rate is the key number — how fast the model generates output tokens. prompt eval rate is how fast it processes your input.
Understanding metrics
What the numbers mean
| Metric | What it measures | Good value |
|---|---|---|
| eval rate | Output tokens per second | 15+ t/s for fluid conversation |
| prompt eval rate | Input processing speed | 50+ t/s typical |
| load duration | Time to load model into memory | <5s on SSD |
| total duration | Wall-clock time for full response | Depends on response length |
| gpu layers | Model layers running on GPU | Same as total layers = full GPU |
Performance factors
What affects benchmark results
GPU vs CPU
A mid-range GPU like an RTX 3070 (8 GB VRAM) typically achieves 40–80 t/s on a 7B model. CPU-only on a modern Ryzen 9 is typically 5–15 t/s. GPU acceleration is the single biggest performance factor. See GPU Acceleration.
Model size and quantization
Smaller models and lower quantizations are faster. A Q4_K_M 7B model is roughly 2x faster than a Q8_0 7B model on the same hardware, with a modest quality trade-off. Use
ollama pull llama3:8b-instruct-q4_K_M to get a specific quantization.RAM and VRAM
If the model fits entirely in VRAM, it runs at GPU speed. If it overflows to RAM, parts run on CPU which is much slower. Running at 2–5 t/s usually means VRAM overflow. Try a smaller model or lower quantization.
Warm-up effect
The first run after loading a model is always slower because the model is being paged from disk to memory. Run at least 2–3 prompts before comparing benchmarks — subsequent runs are representative of steady-state performance.
Reference speeds
Typical benchmark results on Windows
| Hardware | Model | Tokens/s (approx) |
|---|---|---|
| RTX 4090 (24 GB) | Llama 3 8B Q4 | 120–150 t/s |
| RTX 3080 (10 GB) | Llama 3 8B Q4 | 60–80 t/s |
| RTX 3060 (12 GB) | Mistral 7B Q4 | 40–55 t/s |
| RX 7900 XTX (DirectML) | Llama 3 8B Q4 | 25–45 t/s |
| Ryzen 9 7900X (CPU only) | Mistral 7B Q4 | 8–14 t/s |
| Intel Core i7-12700 (CPU only) | Mistral 7B Q4 | 5–10 t/s |
These are approximate real-world figures. Actual results vary by context length, system load and exact quantization. Use them as a rough reference only.