Benchmarks

Measure latency, throughput (tokens/s), and memory use to tune your local LLM setup on Windows.

Simple, repeatable method

• Warm up once to load weights into memory.

ollama run llama3 "Warm up run"

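• Use a fixed prompt and measure wall‑clock time with PowerShell's Measure-Command. It hides the model's reply and prints a timing summary; read the TotalSeconds value.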

Measure-Command { ollama run llama3 "Summarize this paragraph about Windows local LLMs in 2 sentences." }

• Repeat 3–5 times and average results. Keep other apps closed to reduce noise.
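The warm-up, fixed-prompt, repeat-and-average steps above can be scripted so every run is measured the same way. The following is a minimal sketch in Python, not part of the original guide: the helper names `time_command` and `benchmark` are made up for illustration, and it assumes `ollama` is on your PATH and that the `llama3` model is already pulled.

```python
import shutil
import statistics
import subprocess
import sys
import time


def time_command(cmd):
    """Return the wall-clock seconds that one run of cmd (an argument list) takes."""
    start = time.perf_counter()
    subprocess.run(
        cmd,
        check=True,                   # raise if the command exits with an error
        stdin=subprocess.DEVNULL,     # no interactive input
        stdout=subprocess.DEVNULL,    # discard the model's reply
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def benchmark(cmd, runs=5):
    """Run cmd once as an untimed warm-up, then time it `runs` times."""
    time_command(cmd)  # warm-up: loads weights, result discarded
    times = [time_command(cmd) for _ in range(runs)]
    return {
        "runs": runs,
        "mean_s": statistics.mean(times),
        "stdev_s": statistics.stdev(times) if runs > 1 else 0.0,
        "min_s": min(times),
        "max_s": max(times),
    }


if __name__ == "__main__":
    if shutil.which("ollama"):
        # Optional first argument: model name (defaults to llama3).
        model = sys.argv[1] if len(sys.argv) > 1 else "llama3"
        prompt = "Summarize this paragraph about Windows local LLMs in 2 sentences."
        print(benchmark(["ollama", "run", model, prompt], runs=5))
    else:
        print("ollama not found on PATH; nothing to benchmark.")
```

The standard deviation is reported next to the mean because a fast average with a large spread usually points to background load rather than a real speedup.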

Tokens/s and latency depend on quantization, context length, GPU/CPU, and disk speed.

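• First‑token latency: time from sending the prompt to the first output token. It grows with model size and prompt length, and is highest on a cold start, when the model is not yet loaded.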

• Throughput (tokens/s): steady‑state speed once generation starts.

• Memory use: RAM and VRAM consumption. Check Task Manager's Performance tab (Memory, and GPU > Dedicated GPU memory) while the model is running.

• Stability: consistency across runs (low variance is better).
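Wall-clock time mixes loading, prompt processing, and generation together. One way to separate the metrics above is to read the timing fields that Ollama's local REST API returns with a non-streaming `/api/generate` response (`eval_count`, `eval_duration`, `prompt_eval_duration`, `load_duration`, `total_duration`, with durations in nanoseconds). The sketch below assumes the server is listening on the default `http://localhost:11434`. The helper names are illustrative, and treating load time plus prompt-evaluation time as first-token latency is an approximation, not an exact measurement.

```python
import json
import urllib.request

NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_response(resp):
    """Derive benchmark metrics from an /api/generate response dict."""
    eval_s = resp["eval_duration"] / NS_PER_S
    # Approximation (assumption): time before the first token is roughly
    # model load time plus prompt evaluation time.
    before_first_token_ns = resp.get("load_duration", 0) + resp.get("prompt_eval_duration", 0)
    return {
        "tokens_per_s": resp["eval_count"] / eval_s,
        "approx_first_token_s": before_first_token_ns / NS_PER_S,
        "total_s": resp["total_duration"] / NS_PER_S,
    }


def generate(model, prompt, url="http://localhost:11434/api/generate"):
    """Send one non-streaming request to a local Ollama server and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as response:
        return json.load(response)
```

With the server running, `summarize_response(generate("llama3", "Summarize this paragraph about Windows local LLMs in 2 sentences."))` returns the three metrics for a single run. Collect several runs and compare their spread to judge stability.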

• Quantization: lower‑bit quantizations (e.g., Q4) use less memory and usually run faster, at some cost in output quality.

• GPU vs CPU: GPUs deliver higher tokens/s when the model fits in VRAM; CPU‑only is workable for small models.

• Context length: longer prompts/history slow decoding; trim when testing.

• Storage: keep models on an SSD to avoid stalls during loading.
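To see why quantization and VRAM size interact, a back-of-the-envelope estimate helps: the memory needed for the weights alone is roughly parameters × bits per weight ÷ 8 bytes. The sketch below applies that formula to a hypothetical 8-billion-parameter model. It is only a rough lower bound: real quantized files carry some extra overhead for scaling data, and the context cache needs memory on top of the weights.

```python
def weight_memory_gib(params_billion, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical 8B-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"8B model at {bits}-bit: ~{weight_memory_gib(8, bits):.1f} GiB")
```

At 16 bits the weights come to roughly 15 GiB, while at 4 bits they drop below 4 GiB, which is the difference between spilling out of a mid-range GPU's VRAM and fitting inside it.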

• Close GPU‑heavy apps and set your Windows power plan to “High performance”.

• Benchmark with the same prompt, temperature=0, and similar output length for apples‑to‑apples results.

• Compare quantizations of the same model before switching models.

• Log results (model, quant, prompt, tokens/s, hardware) for future reference.
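Logging is easier to keep up if each run appends one row to a CSV file. The following is a minimal sketch under stated assumptions: the column set mirrors the fields listed above, `log_result` and the file name are made-up for illustration, and the hardware label defaults to whatever Python's `platform` module reports. `BENCH_OPTIONS` shows one way to pin sampling for comparable runs, using Ollama's `temperature`, `seed`, and `num_predict` request options; the specific values are arbitrary.

```python
import csv
import os
import platform
from datetime import datetime, timezone

FIELDS = ["timestamp", "model", "quant", "prompt", "tokens_per_s", "hardware"]

# Example request options for repeatable runs (values are arbitrary choices):
# deterministic sampling, fixed seed, and a capped output length.
BENCH_OPTIONS = {"temperature": 0, "seed": 42, "num_predict": 128}


def log_result(path, model, quant, prompt, tokens_per_s, hardware=None):
    """Append one benchmark row to a CSV file, writing the header on first use."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "model": model,
        "quant": quant,
        "prompt": prompt,
        "tokens_per_s": tokens_per_s,
        # Fall back to the machine type if no processor name is available.
        "hardware": hardware or platform.processor() or platform.machine(),
    }
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

A call such as `log_result("benchmarks.csv", "llama3", "Q4", "fixed prompt", 42.5)` adds one row. Keeping the prompt text in the log makes it obvious later when two results are not directly comparable.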

Community‑driven guide. Not affiliated with the official Ollama project.