Simple, repeatable method
• Warm up once to load weights into memory.
ollama run llama3 "Warm up run"
• Use a fixed prompt and measure wall‑clock time with PowerShell's Measure-Command.
Measure-Command { ollama run llama3 "Summarize this paragraph about Windows local LLMs in 2 sentences." }
• Repeat 3–5 times and average the results. Keep other apps closed to reduce noise. A scripted version of this loop is sketched below.
Tokens/s and latency depend on quantization, context length, GPU/CPU, and disk speed.
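A minimal PowerShell sketch of that loop (model tag, prompt, and run count are placeholders; adjust for your setup):
# Warm-up run so later timings exclude model loading
ollama run llama3 "Warm up run" | Out-Null
# Run the same fixed prompt several times and average the wall-clock seconds
$prompt = "Summarize this paragraph about Windows local LLMs in 2 sentences."
$times = 1..5 | ForEach-Object { (Measure-Command { ollama run llama3 $prompt }).TotalSeconds }
$avg = ($times | Measure-Object -Average).Average
"Average: {0:N2} s over {1} runs" -f $avg, $times.Count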
Metrics that matter
• First‑token latency: time until the first token appears; driven by prompt processing, model size, and cold starts.
• Throughput (tokens/s): steady‑state generation speed once tokens start streaming (see the measurement sketch after this list).
• Memory use: RAM and VRAM consumption — check Task Manager while running.
• Stability: consistency across runs (low variance is better).
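Two quick ways to capture these numbers, assuming a recent Ollama build (the --verbose output format can vary by version) and that the server process name starts with "ollama":
# Ollama's own timing breakdown: load duration, prompt eval rate (prefill), eval rate (tokens/s)
ollama run llama3 --verbose "Summarize this paragraph about Windows local LLMs in 2 sentences."
# Snapshot RAM use of the Ollama server process while a generation is running
Get-Process ollama* | Select-Object Name, @{ n='WorkingSetMB'; e={ [math]::Round($_.WorkingSet64 / 1MB) } }
VRAM use is easiest to read from Task Manager's Performance tab (or nvidia-smi on NVIDIA GPUs).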
Key factors
• Quantization: lower bits (e.g., Q4) use less memory and run faster with some quality tradeoff.
• GPU vs CPU: GPUs deliver higher tokens/s when the model fits in VRAM; CPU‑only can still handle small models (a quick offload check is sketched after this list).
• Context length: longer prompts/history slow decoding; trim when testing.
• Storage: keep models on an SSD to avoid stalls during loading.
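A quick way to confirm where a loaded model is running, and to line up two quantizations of the same model for comparison (the tags below are examples; check the Ollama library for the tags actually published):
# Shows loaded models and whether they run on GPU, CPU, or a split of both
ollama ps
# Pull two quantizations of the same model to compare speed, memory, and quality
ollama pull llama3:8b-instruct-q4_K_M
ollama pull llama3:8b-instruct-q8_0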
Best practices
• Close GPU‑heavy apps and set your Windows power plan to “High performance”.
• Benchmark with the same prompt, temperature=0, and similar output length for apples‑to‑apples results.
• Compare quantizations of the same model before switching models.
• Log results (model, quant, prompt, tokens/s, hardware) for future reference; a logging sketch follows below.
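Recent ollama run builds don't expose a temperature flag, so one way to pin temperature and log a comparable row is Ollama's local REST API (default endpoint http://localhost:11434; durations come back in nanoseconds). A sketch, with the quant and hardware fields as placeholders to fill in:
# Request a non-streamed generation with temperature pinned to 0
$body = @{
    model   = "llama3"
    prompt  = "Summarize this paragraph about Windows local LLMs in 2 sentences."
    stream  = $false
    options = @{ temperature = 0 }
} | ConvertTo-Json
$r = Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -Body $body -ContentType "application/json"
# Append one benchmark row to a CSV; eval_count and eval_duration come from the API response
[pscustomobject]@{
    Date        = Get-Date -Format s
    Model       = "llama3"
    Quant       = "q4_K_M"                      # whichever quantization you pulled
    TokensPerS  = [math]::Round($r.eval_count / ($r.eval_duration / 1e9), 1)
    FirstTokenS = [math]::Round(($r.load_duration + $r.prompt_eval_duration) / 1e9, 2)   # approximate
    Hardware    = "describe your CPU/GPU here"
} | Export-Csv -Path ollama-benchmarks.csv -Append -NoTypeInformation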
Community‑driven guide. Not affiliated with the official Ollama project.