Python (Developers)

Use the Python client or REST API to build apps with local LLMs. Private, fast, and simple on Windows.

Python quick start

Install the client and run your first chat call with a local model.

pip install ollama

Minimal example:

import ollama response = ollama.chat(model='llama3', messages=[{'role':'user','content':'Hello!'}]) print(response['message']['content'])

Tip: For streaming tokens, look for a streaming option in the client and handle incremental chunks.

Base URL: http://localhost:11434. Common endpoints:

• POST /api/generate — single‑turn generation (prompt → response)

• POST /api/chat — multi‑turn chat (messages array)

• POST /api/embeddings — vector embeddings

Example request (generate):

curl -s http://localhost:11434/api/generate -d '{"model":"llama3","prompt":"Hello"}'

Responses may stream. Buffer or process line‑by‑line (SSE/JSONL style) depending on your HTTP client.

• Keep models on SSD and preload frequently used models to reduce cold starts.

• Use smaller quantizations for faster latency; pick larger variants only if needed.

• Set reasonable timeouts/retries in your client; handle streaming gracefully.

• Separate system/user roles in chat for better control over responses.

• Log prompts/responses locally for debugging (respect privacy requirements).

Community‑driven guide. Not affiliated with the official Ollama project.