Run Qwen 3.5 Locally: The Model That Beats GPT-5 Nano on Your Own Hardware
Qwen 3.5 just dropped — 0.8B to 9B parameters, 262K context, natively multimodal, Apache 2.0. Here's how to run it on everything from a Raspberry Pi to a gaming PC.
Why Qwen 3.5 Changes Things
Alibaba dropped the Qwen 3.5 "Small" series on March 2, 2026 and the local LLM community lost its collective mind. Here's why:
- The 9B model beats GPT-5 Nano on vision benchmarks. Let that sink in — a model you run on a gaming laptop outperforms OpenAI's latest small model on image understanding.
- 262K context window. That's ~200,000 words. You can feed it an entire book and ask questions about it.
- Natively multimodal. Not a vision adapter bolted on — it was trained from scratch to understand both text and images.
- Runs on a $300 phone. The 0.8B variant is so small it runs on mobile hardware.
- Apache 2.0 license. Use it for anything — personal, commercial, whatever. No restrictions.
The gap between cloud AI and local AI has been closing fast. Qwen 3.5 just closed another huge chunk of it. For everyday tasks — writing, summarizing, coding, answering questions — this is "good enough" for most people. And "good enough" running on YOUR hardware with ZERO data sharing is better than "slightly better" running on someone else's server that just signed a Pentagon deal.
The Four Models (and Which One You Want)
| Model | Size on Disk | Min RAM | Best For | Speed |
|---|---|---|---|---|
| Qwen 3.5 0.8B | ~0.5 GB | 2 GB | Raspberry Pi, phones, IoT, lightweight tasks | Very fast |
| Qwen 3.5 2B | ~1.5 GB | 4 GB | Old laptops, basic chat, simple coding | Fast |
| Qwen 3.5 4B | ~2.5 GB | 8 GB | Good balance of speed and quality | Fast |
| Qwen 3.5 9B ⭐ | ~5.5 GB | 8 GB (16 GB ideal) | Everything. This is the one. | Medium-Fast |
TL;DR: If you have 16GB of RAM or 8GB of VRAM, get the 9B. It's the best quality-per-resource model available right now. If you're on a tight machine (8GB RAM laptop), the 4B is surprisingly capable. If you're doing something weird like running AI on a Raspberry Pi (respect), grab the 0.8B.
All four models share the same architecture and training approach — they're not "dumbed down" versions of a bigger model. Alibaba specifically designed each size to be the best possible model at that parameter count. The 9B outperforms last-generation Qwen 3 30B on several benchmarks. That's a 3x size reduction for better performance.
Quickstart: Running in 60 Seconds
This is genuinely a 60-second setup. If you have a Mac, Linux box, or Windows machine with WSL:
# Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh
# Pick your size:
ollama pull qwen3.5:0.8b # 0.5GB — runs on anything, even phones
ollama pull qwen3.5:2b # 1.5GB — good for 8GB RAM machines
ollama pull qwen3.5:4b # 2.5GB — balanced
ollama pull qwen3.5:9b # 5.5GB — the sweet spot, beats GPT-5 Nano
# Run it
ollama run qwen3.5:9b

That's it. You're chatting with a model that beats GPT-5 Nano. On your machine. No account, no API key, no internet required after the download.
Type /bye to exit. Type /set parameter num_ctx 65536 to increase the context window mid-conversation. Type /show info to see model details.
Add a ChatGPT-like UI
The terminal is fine for quick questions, but if you want the full ChatGPT experience — conversation history, file uploads, model switching, a pretty interface — pair Ollama with Open WebUI:
# docker-compose.yml — Ollama + Open WebUI
services:
ollama:
image: ollama/ollama
volumes:
- ollama_data:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- capabilities: [gpu] # Remove if no NVIDIA GPU
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
volumes:
- open-webui:/app/backend/data
depends_on:
- ollama
volumes:
ollama_data:
open-webui:
Run docker compose up -d, open http://localhost:3000, create an account, and you've got your own private ChatGPT running Qwen 3.5. Conversation history, dark mode, mobile-friendly — the works.
Don't want Docker? Open WebUI also has a pip install: pip install open-webui && open-webui serve. Or use Jan for a dead-simple desktop app — download, open, pick Qwen 3.5, chat. For more UI options beyond Open WebUI, see our 5 Self-Hosted ChatGPT Alternatives guide.
Using the Vision Capabilities
Unlike previous small models where vision was an afterthought, Qwen 3.5 was trained from the ground up to understand images. You can ask it to describe photos, read text from screenshots, analyze charts, identify objects — all locally.
# Qwen 3.5 is natively multimodal — it can see images
# Use the /api/chat endpoint with images
curl http://localhost:11434/api/chat -d '{
"model": "qwen3.5:9b",
"messages": [{
"role": "user",
"content": "What is in this image?",
"images": ["base64_encoded_image_here"]
}]
}'
# Or with Ollama CLI:
# ollama run qwen3.5:9b "Describe this image: /path/to/photo.jpg"

In Open WebUI, you can just drag and drop images into the chat. It works exactly like ChatGPT's image understanding — except it's running on your machine and nobody sees your screenshots.
How good is the vision? On the OmniDocBench benchmark (document understanding), the 9B scores 87.7 — leading its weight class. It can read handwritten notes, parse receipts, describe complex diagrams. It won't match GPT-4V on tricky edge cases, but for everyday "what's in this image?" tasks, it's excellent.
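The same vision request is easy to script. Here's a minimal Python sketch of the API call above, using only the standard library and assuming Ollama's default endpoint at localhost:11434 (the helper names `image_to_base64` and `describe_image` are mine, not part of Ollama):

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint

def image_to_base64(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def describe_image(path: str, prompt: str = "What is in this image?") -> str:
    """Send an image plus a prompt to local Qwen 3.5 via Ollama's chat API."""
    payload = {
        "model": "qwen3.5:9b",
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [image_to_base64(path)],
        }],
        "stream": False,  # ask for one complete JSON response
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Example usage (with Ollama running):
# print(describe_image("receipt.jpg", "What's the total on this receipt?"))
```

Point it at a screenshot or a photo of a receipt and it reads like the drag-and-drop flow in Open WebUI, just callable from your own scripts.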
Using the API in Your Code
Ollama exposes an OpenAI-compatible API. That means any library, tool, or app that works with the OpenAI API works with your local Qwen 3.5 — just change the base URL.
# Ollama exposes an OpenAI-compatible API at localhost:11434
# Use it with any tool that supports the OpenAI API format
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5:9b",
"messages": [
{"role": "user", "content": "Explain quantum computing in 3 sentences"}
]
}'

# Works with the official OpenAI Python library
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required but ignored
)
response = client.chat.completions.create(
model="qwen3.5:9b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a Python function to merge two sorted lists"}
]
)
print(response.choices[0].message.content)

This is where local AI gets really powerful. Build apps, scripts, and automations that use AI — without paying per token, without rate limits, without sending your data anywhere. Process sensitive documents, analyze private data, run 24/7 batch jobs — all for the cost of electricity. If you're using Claude Code and watching token costs, check our Claude Code token management guide — running local models for some tasks can save significant money.
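To make the batch-job idea concrete, here's a hedged sketch that summarizes every text file in a folder through the same OpenAI-compatible endpoint, using only the standard library. The `summarize_folder` helper and the summarization prompt are illustrative choices of mine, not anything Ollama ships:

```python
import json
import pathlib
import urllib.request

API_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

def build_request(text: str, model: str = "qwen3.5:9b") -> dict:
    """Build an OpenAI-style chat payload asking for a 3-sentence summary."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the user's text in 3 sentences."},
            {"role": "user", "content": text},
        ],
    }

def summarize(text: str) -> str:
    """Send one completion request to the local model and return the reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

def summarize_folder(folder: str) -> dict:
    """Summarize every .txt file in a folder — no per-token cost, no rate limit."""
    return {p.name: summarize(p.read_text()) for p in pathlib.Path(folder).glob("*.txt")}

# Example usage (with Ollama running):
# print(summarize_folder("./notes"))
```

Leave it running overnight on a thousand documents and the only bill is your electricity.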
Creating Custom Models
Want a coding-focused Qwen? A creative writing version? A customer support bot? Ollama's Modelfile system lets you create custom personalities on top of any base model:
# Create a custom Qwen 3.5 with specific behavior
# Save as "Modelfile" (no extension)
FROM qwen3.5:9b
PARAMETER temperature 0.7
PARAMETER num_ctx 32768
SYSTEM """
You are a senior software engineer. Be concise. Write clean code.
When asked to write code, output ONLY the code with brief comments.
No explanations unless explicitly asked.
""" # Build your custom model
ollama create code-assistant -f Modelfile
# Use it
ollama run code-assistant

You can create as many custom models as you want — they all share the same base weights, so they don't eat extra disk space. Just different system prompts, temperatures, and context sizes. Think of it like ChatGPT's "Custom GPTs" but running locally.
Running on a Raspberry Pi
Yes, really. The Qwen 3.5 0.8B model runs on a Raspberry Pi 5 with 8GB of RAM. It's not fast — expect 2-5 tokens per second — but it works. And there's something deeply satisfying about having a private AI running on an $80 computer the size of a credit card.
# Raspberry Pi 5 (8GB) — yes, it actually works
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull the smallest model
ollama pull qwen3.5:0.8b
# Run it — expect ~2-5 tokens/second
ollama run qwen3.5:0.8b
# For a slightly better experience with more patience:
ollama pull qwen3.5:2b
# ~1-2 tokens/second on Pi 5, but noticeably smarter

What it's good for on Pi
- Home automation: Process voice commands, parse sensor data, generate responses
- Smart home assistant: Pair with a mic and speaker for a private, offline AI assistant
- Learning: Teach yourself about LLMs on cheap hardware
- Novelty: Honestly, just showing people AI running on a Pi is a conversation starter
Pro tip: If you want faster inference on a Pi, use the Q4_0 quantization: ollama pull qwen3.5:0.8b-q4_0. Slightly lower quality, noticeably faster.
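If you want a number instead of a feel for the speed difference between quantizations, Ollama's /api/generate response includes `eval_count` and `eval_duration` (in nanoseconds) timing fields you can turn into tokens per second. A small sketch, assuming the default local endpoint (the function names are mine):

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds; convert to tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str = "qwen3.5:0.8b", prompt: str = "Why is the sky blue?") -> float:
    """Run one generation and compute decode speed from Ollama's timing fields."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.loads(resp.read())
    return tokens_per_second(stats["eval_count"], stats["eval_duration"])

# Example usage (with Ollama running):
# print(f"{benchmark():.1f} tokens/sec")
# print(f"{benchmark('qwen3.5:0.8b-q4_0'):.1f} tokens/sec")
```

Run it once against the default tag and once against the Q4_0 tag and you'll see exactly what the quantization buys you on your Pi.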
How Good Is It, Really?
Numbers don't lie. Here's how Qwen 3.5 9B stacks up against models you've heard of:
| Benchmark | Qwen 3.5 9B | GPT-5 Nano | Llama 3.3 8B | Gemma 3 12B |
|---|---|---|---|---|
| MMMLU (multilingual) | 81.2 | 79.8 | 73.0 | 78.5 |
| OmniDocBench (doc understanding) | 87.7 | 83.2 | N/A | 84.1 |
| Context window | 262K | 128K | 128K | 128K |
| License | Apache 2.0 | Proprietary | Llama License | Gemma License |
| Vision built-in | ✅ Native | ✅ | ❌ | ✅ |
| Runs on 8GB VRAM | ✅ | Cloud only | ✅ | Tight |
The headline number: Qwen 3.5 9B outperforms last-generation Qwen 3 30B — a model 3x its size — on multiple benchmarks. That's the kind of efficiency gain that makes local AI practical for normal people, not just researchers with GPU clusters.
Performance Tips
Get the most out of your hardware
- Apple Silicon: Ollama uses Metal acceleration automatically. A 16GB M2 MacBook Air runs the 9B at 20-30 tokens/sec. Best consumer local AI setup per dollar.
- NVIDIA GPU: Make sure you have the CUDA drivers installed. Ollama auto-detects and uses your GPU. RTX 3060 12GB handles 9B comfortably.
- AMD GPU: ROCm support is improving but still hit-or-miss. Check Ollama's compatibility list.
- CPU-only: Works fine with the 0.8B and 2B models. For 4B+, expect slower speeds. Use Q4_0 quantization for a speed boost.
Context window tuning
The default context in Ollama is usually 2048 or 4096 tokens. Qwen 3.5 supports up to 262K but uses more RAM as you increase it. Set what you need:
- Quick chat: Default (4096) is fine
- Long documents: /set parameter num_ctx 32768
- Entire books: /set parameter num_ctx 131072 (needs 32GB+ RAM for 9B)
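If you're calling the API rather than chatting interactively, the same knob goes in the request's `options` field, which overrides the context size for that request only. A sketch, assuming the default endpoint (the helper names are mine):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, num_ctx: int, model: str = "qwen3.5:9b") -> dict:
    """Chat payload with a per-request context size via Ollama's options field."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # applies to this request only
        "stream": False,
    }

def ask_about_document(document: str, question: str) -> str:
    """Feed a long document plus a question using a 32K context window."""
    payload = build_chat_payload(f"{document}\n\nQuestion: {question}", num_ctx=32768)
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

This way a long-document job can request the big window without bloating RAM usage for your everyday quick chats.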
Keep it running
By default, Ollama unloads models after 5 minutes of inactivity. To keep it warm (instant responses):
- Set OLLAMA_KEEP_ALIVE=-1 in your environment — model stays loaded until you stop Ollama
- Or set a longer timeout: OLLAMA_KEEP_ALIVE=1h
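The same setting is also accepted per request via the `keep_alive` field, and sending an empty prompt loads a model without generating anything, which makes a handy warm-up script at boot. A sketch against the default endpoint (the helper names are mine):

```python
import json
import urllib.request

def build_warm_payload(prompt: str, keep_alive: str = "-1",
                       model: str = "qwen3.5:9b") -> dict:
    """Generate payload with keep_alive: "-1" pins the model in memory,
    a duration like "1h" sets an unload timeout instead."""
    return {
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,
    }

def preload(model: str = "qwen3.5:9b") -> None:
    """An empty prompt loads the model into memory without generating text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_warm_payload("", "-1", model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()

# Example usage (with Ollama running): preload() at boot, then every
# later request gets an instant first token instead of a cold load.
```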
Your AI, Your Rules
Three commands. That's all it takes to go from "my conversations feed someone else's business model" to "my conversations stay on my machine."
curl -fsSL https://ollama.com/install.sh | sh && ollama pull qwen3.5:9b && ollama run qwen3.5:9b
No account. No API key. No Pentagon deals. Just you and a really good AI model.