Ollama must be running locally before you can call its API. Check with
curl http://localhost:11434 and you should see "Ollama is running".
Setup
Install the Ollama Python library
C:\> pip install ollama
Successfully installed ollama
# Or with uv:
C:\> uv add ollama
Quick start
Your first chat call
import ollama

response = ollama.chat(
    model='llama3',
    messages=[{
        'role': 'user',
        'content': 'Explain Python generators in one paragraph'
    }]
)
print(response['message']['content'])
Streaming
Stream responses token by token
For long responses, stream the output so tokens appear as they are generated rather than waiting for the full response:
import ollama

for chunk in ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Write a haiku'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
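If you also need the complete text once streaming finishes (for example to store it or append it to the message history), you can accumulate the pieces while printing them. This is a minimal sketch, assuming each chunk supports chunk['message']['content'] as in the loop above. stream_and_collect is a hypothetical helper name, not part of the ollama library:

```python
def stream_and_collect(chunks, write=print):
    """Print each streamed piece as it arrives and return the full text.

    `chunks` is any iterable of chat chunks, such as the result of
    ollama.chat(..., stream=True). `write` defaults to print and is
    only a parameter so the helper can be exercised without a terminal.
    """
    parts = []
    for chunk in chunks:
        piece = chunk['message']['content']
        write(piece, end='', flush=True)  # show the piece immediately
        parts.append(piece)               # keep it for the final result
    return ''.join(parts)
```

With Ollama running, it could be called as full_text = stream_and_collect(ollama.chat(model='llama3', messages=[...], stream=True)).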
REST API
Call the REST API directly
Ollama exposes a REST API at http://localhost:11434. You can call it from any language or tool that supports HTTP:
# Single-turn generation:
C:\> curl http://localhost:11434/api/generate -d "{\"model\":\"llama3\",\"prompt\":\"Why is the sky blue?\",\"stream\":false}"
# Chat endpoint (multi-turn):
C:\> curl http://localhost:11434/api/chat -d "{\"model\":\"llama3\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"
# List running models:
C:\> curl http://localhost:11434/api/ps
Full API documentation: github.com/ollama/ollama/blob/main/docs/api.md
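The same single-turn request can be made from Python without the ollama package, using only the standard library. The sketch below assumes the default address shown above and that the non-streaming /api/generate reply is a JSON object with a "response" field. build_generate_request and generate are hypothetical helper names:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def build_generate_request(prompt, model="llama3", base_url=OLLAMA_URL):
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3", timeout=120):
    """Send the request and return the generated text (needs Ollama running)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

Calling generate("Why is the sky blue?") sends the same payload as the first curl command. Building the payload with json.dumps also sidesteps shell quoting problems.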
Best practices
Tips for using Ollama from Python
Preload a model to avoid cold-start latency
The first request to a model takes longer because the weights load from disk. Send a warm-up prompt at app startup to load the model into memory, then subsequent calls respond quickly.
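A warm-up step could look like the sketch below. It assumes you pass in the function that talks to Ollama (normally ollama.chat). warm_up is a hypothetical helper name and the "hi" prompt is an arbitrary choice:

```python
import time


def warm_up(chat, model="llama3"):
    """Send one tiny request so the model is loaded before real traffic.

    `chat` is the callable that talks to Ollama, normally ollama.chat.
    Returns the seconds the warm-up took, which is useful for logging.
    """
    start = time.perf_counter()
    chat(model=model, messages=[{"role": "user", "content": "hi"}])
    return time.perf_counter() - start
```

At application startup this would be called once, for example elapsed = warm_up(ollama.chat), before serving user requests. Passing the chat function as an argument keeps the helper easy to exercise without a running server.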
Use keep_alive to control model unloading
By default Ollama unloads a model after 5 minutes of inactivity. To keep it loaded:
ollama.chat(model="llama3", messages=[...], keep_alive="60m"). Set it to "0" to unload the model immediately after use.
Use smaller quantizations for lower latency
When building applications, latency matters more than quality for many tasks. For those, use Q4_K_M or Q4_0 quantizations, for example
model="llama3:8b-instruct-q4_K_M". These are roughly 2x faster than Q8_0 with minimal quality loss, although the exact speedup depends on your hardware.
Handle connection errors gracefully
Wrap calls in try/except to handle the case where Ollama is not running. The ollama library raises
ollama.ResponseError for API errors. If the service is not reachable, expect httpx.ConnectError or, in some library versions, the built-in ConnectionError.
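One way to structure that handling is sketched below. safe_chat is a hypothetical wrapper name, and returning None on failure is just one possible policy. The built-in ConnectionError is caught as well, on the assumption that some library versions re-raise connection failures as that type. The guarded import only exists so the sketch still loads where the library is not installed:

```python
try:
    import httpx
    import ollama
    API_ERRORS = (ollama.ResponseError,)
    CONNECT_ERRORS = (httpx.ConnectError, ConnectionError)
except ImportError:
    # Library not installed: fall back to built-ins so the sketch still imports.
    ollama = None
    API_ERRORS = ()
    CONNECT_ERRORS = (ConnectionError,)


def safe_chat(prompt, model="llama3", chat=None):
    """Return the reply text, or None if Ollama is unreachable or returns an error.

    `chat` defaults to ollama.chat; it is a parameter so a stand-in
    can be supplied when no server is available.
    """
    chat = chat or ollama.chat
    try:
        response = chat(model=model, messages=[{"role": "user", "content": prompt}])
        return response["message"]["content"]
    except CONNECT_ERRORS:
        print("Could not reach Ollama. Is it running on localhost:11434?")
    except API_ERRORS as err:
        print(f"Ollama returned an error: {err}")
    return None
```

Callers can then check for None and show a fallback message instead of crashing, for example reply = safe_chat("Hello") or "Model unavailable".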