Python guide

Ollama Python API on Windows — chat, stream & REST

Run Ollama from Python on Windows using the official ollama library or the REST API at localhost:11434. This guide covers install, chat calls, streaming and practical tips.

Ollama must be running locally before you can call its API. Check with curl http://localhost:11434 — you should see "Ollama is running".
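
The same check can be run from Python before your application makes its first API call. This is a minimal sketch using only the standard library; the helper name is_ollama_running and the 2-second timeout are illustrative choices, not part of Ollama or its Python library.

```python
import urllib.request


def is_ollama_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if base_url answers with the "Ollama is running" banner.

    Hypothetical helper for illustration; not part of the ollama library.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection refused, timeouts and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False
    return "Ollama is running" in body
```

Calling is_ollama_running() at startup lets the application show a clear "start Ollama first" message instead of failing on its first chat call.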

Install the Ollama Python library

PowerShell or cmd.exe
C:\> pip install ollama
Successfully installed ollama
# Or with uv:
C:\> uv add ollama

Your first chat call

Python
import ollama

# Send a single user message and wait for the complete reply.
response = ollama.chat(
    model='llama3',
    messages=[{
        'role': 'user',
        'content': 'Explain Python generators in one paragraph'
    }]
)
print(response['message']['content'])
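
For a conversation with more than one turn, a common pattern is to resend the whole message history on every call and append the model's reply before the next user turn. The sketch below assumes that pattern; the ask helper and its chat_fn parameter are hypothetical names introduced here so the history handling can be tried without a running server. In real use you would pass ollama.chat as chat_fn.

```python
def ask(history, user_text, chat_fn, model="llama3"):
    """Append a user turn, call the chat function, then record and return the reply.

    Hypothetical helper: chat_fn is assumed to behave like ollama.chat,
    accepting model= and messages= and returning a response whose
    ['message']['content'] holds the assistant text.
    """
    history.append({"role": "user", "content": user_text})
    response = chat_fn(model=model, messages=history)
    reply = response["message"]["content"]
    # Store the assistant turn so the next call sees the full conversation.
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage, assuming Ollama is running and llama3 has been pulled:
#   import ollama
#   history = []
#   print(ask(history, "Name one Python web framework.", ollama.chat))
#   print(ask(history, "Show a hello-world example for it.", ollama.chat))
```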

Stream responses token by token

For long responses, stream the output so tokens appear as they are generated rather than waiting for the full response:

Python
import ollama

# With stream=True, ollama.chat returns an iterator of partial responses.
for chunk in ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Write a haiku'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
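
If you also need the complete text once streaming finishes (for logging, or to store the reply), you can collect the pieces while printing them. A minimal sketch; print_and_collect is a hypothetical helper name, and it assumes each chunk has the same ['message']['content'] shape used in the loop above.

```python
def print_and_collect(chunks):
    """Print each streamed piece as it arrives and return the full text.

    Hypothetical helper: chunks is any iterable shaped like the stream from
    ollama.chat(..., stream=True).
    """
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # end the line once the stream is exhausted
    return "".join(parts)


# Usage, assuming Ollama is running and llama3 has been pulled:
#   import ollama
#   stream = ollama.chat(model="llama3",
#                        messages=[{"role": "user", "content": "Write a haiku"}],
#                        stream=True)
#   full_text = print_and_collect(stream)
```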

Call the REST API directly

Ollama exposes a REST API at http://localhost:11434. You can call it from any language or tool that supports HTTP:

cmd.exe
# Single-turn generation:
C:\> curl http://localhost:11434/api/generate -d "{\"model\":\"llama3\",\"prompt\":\"Why is the sky blue?\",\"stream\":false}"
# Chat endpoint (multi-turn):
C:\> curl http://localhost:11434/api/chat -d "{\"model\":\"llama3\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"
# List running models:
C:\> curl http://localhost:11434/api/ps

Full API documentation: github.com/ollama/ollama/blob/main/docs/api.md
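
The same /api/generate call can be made from Python without the ollama library. This is a minimal sketch using only the standard library; the helper names build_generate_request and generate and the 120-second timeout are assumptions made for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def build_generate_request(prompt, model="llama3", base_url=OLLAMA_URL):
    """Build a non-streaming POST request for /api/generate (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3", timeout=120):
    """Send the request and return the generated text."""
    req = build_generate_request(prompt, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # With "stream": false the server replies with one JSON object whose
    # "response" field holds the generated text.
    return body["response"]
```

Building the body with json.dumps avoids the quote escaping that the command line requires.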

Tips for using Ollama from Python

Preload a model to avoid cold-start latency
The first request to a model takes longer because the weights load from disk. Send a warm-up prompt at app startup to load the model into memory, then subsequent calls respond quickly.
Use keep_alive to control model unloading
By default Ollama unloads a model after 5 minutes of inactivity. To keep it loaded longer, pass keep_alive with the request: ollama.chat(model="llama3", messages=[...], keep_alive="60m"). Set keep_alive to "0" to unload the model immediately after the call.
Use smaller quantizations for lower latency
For many application tasks, latency matters more than peak output quality. In those cases use a 4-bit quantization such as Q4_K_M or Q4_0, for example model="llama3:8b-instruct-q4_K_M". These are roughly 2x faster than Q8_0 with minimal quality loss, although the exact gain depends on your hardware.
Handle connection errors gracefully
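Wrap calls in try/except to handle the case where Ollama is not running. The ollama library raises ollama.ResponseError for API errors, such as a request for a model that has not been pulled. If the service is not reachable, the exception depends on the library version: older releases let httpx.ConnectError through, while newer releases raise Python's built-in ConnectionError, so catching both is the safe choice.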
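
The warm-up, keep_alive and error-handling tips above can be combined in one small wrapper. This is a sketch under stated assumptions: chat_or_none and warm_up are hypothetical helper names, "60m" is an arbitrary choice, and both connection exception types are caught because which one you see depends on the installed library version.

```python
def chat_or_none(messages, model="llama3", keep_alive="60m"):
    """Return the reply text, or None if Ollama is unreachable or rejects the call.

    Hypothetical helper for illustration; not part of the ollama library.
    """
    # Imported inside the function so this module still imports on a machine
    # where the ollama package (and its httpx dependency) is not installed yet.
    import httpx
    import ollama

    try:
        response = ollama.chat(model=model, messages=messages, keep_alive=keep_alive)
    except ollama.ResponseError as err:
        # The server answered but reported an error (e.g. unknown model).
        print(f"Ollama API error: {err}")
        return None
    except (httpx.ConnectError, ConnectionError) as err:
        # Nothing is listening on the Ollama port.
        print(f"Could not reach Ollama: {err}")
        return None
    return response["message"]["content"]


def warm_up(model="llama3", keep_alive="60m"):
    """Send a tiny prompt at startup so the weights are loaded before real traffic."""
    reply = chat_or_none([{"role": "user", "content": "Hi"}], model, keep_alive)
    return reply is not None
```

Calling warm_up() once at application startup pays the cold-start cost up front; it returns False instead of raising if Ollama is not running, so the application can decide how to report that.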
