Run Qwen 3.5 Locally: The Model That Beats GPT-5 Nano on Your Own Hardware
Qwen 3.5 just dropped — 0.8B to 9B parameters, 262K context, natively multimodal, Apache 2.0. Here's how to run it on everything from a Raspberry Pi to a gaming PC.
Why Qwen 3.5 Changes Things
Alibaba dropped the Qwen 3.5 "Small" series on March 2, 2026 and the local LLM community lost its collective mind. Here's why:
- The 9B model beats GPT-5 Nano on vision benchmarks. Let that sink in — a model you run on a gaming laptop outperforms OpenAI's latest small model on image understanding.
- 262K context window. That's ~200,000 words. You can feed it an entire book and ask questions about it.
- Natively multimodal. Not a vision adapter bolted on — it was trained from scratch to understand both text and images.
- Runs on a $300 phone. The 0.8B variant is so small it runs on mobile hardware.
- Apache 2.0 license. Use it for anything — personal, commercial, whatever. No restrictions.
The gap between cloud AI and local AI has been closing fast. Qwen 3.5 just closed another huge chunk of it. For everyday tasks — writing, summarizing, coding, answering questions — this is "good enough" for most people. And "good enough" running on YOUR hardware with ZERO data sharing is better than "slightly better" running on someone else's server that just signed a Pentagon deal.
The Four Models (and Which One You Want)
| Model | Size on Disk | Min RAM | Best For | Speed |
|---|---|---|---|---|
| Qwen 3.5 0.8B | ~0.5 GB | 2 GB | Raspberry Pi, phones, IoT, lightweight tasks | Very fast |
| Qwen 3.5 2B | ~1.5 GB | 4 GB | Old laptops, basic chat, simple coding | Fast |
| Qwen 3.5 4B | ~2.5 GB | 8 GB | Good balance of speed and quality | Fast |
| Qwen 3.5 9B ⭐ | ~5.5 GB | 8 GB (16 GB ideal) | Everything. This is the one. | Medium-Fast |
TL;DR: If you have 16GB of RAM or 8GB of VRAM, get the 9B. It's the best quality-per-resource model available right now. If you're on a tight machine (8GB RAM laptop), the 4B is surprisingly capable. If you're doing something weird like running AI on a Raspberry Pi (respect), grab the 0.8B.
All four models share the same architecture and training approach — they're not "dumbed down" versions of a bigger model. Alibaba specifically designed each size to be the best possible model at that parameter count. The 9B outperforms last-generation Qwen 3 30B on several benchmarks. That's a 3x size reduction for better performance.
Quickstart: Running in 60 Seconds
This is genuinely a 60-second setup. If you have a Mac, Linux box, or Windows machine with WSL:
# Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh
# Pick your size:
ollama pull qwen3.5:0.8b # 0.5GB — runs on anything, even phones
ollama pull qwen3.5:2b # 1.5GB — good for 8GB RAM machines
ollama pull qwen3.5:4b # 2.5GB — balanced
ollama pull qwen3.5:9b # 5.5GB — the sweet spot, beats GPT-5 Nano
# Run it
ollama run qwen3.5:9b

That's it. You're chatting with a model that beats GPT-5 Nano. On your machine. No account, no API key, no internet required after the download.
Type /bye to exit. Type /set parameter num_ctx 65536 to increase the context window mid-conversation. Type /show info to see model details.
Add a ChatGPT-like UI
The terminal is fine for quick questions, but if you want the full ChatGPT experience — conversation history, file uploads, model switching, a pretty interface — pair Ollama with Open WebUI:
# docker-compose.yml — Ollama + Open WebUI
services:
ollama:
image: ollama/ollama
volumes:
- ollama_data:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- capabilities: [gpu] # Remove if no NVIDIA GPU
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
volumes:
- open-webui:/app/backend/data
depends_on:
- ollama
volumes:
ollama_data:
open-webui:
Run docker compose up -d, open http://localhost:3000, create an account, and you've got your own private ChatGPT running Qwen 3.5. Conversation history, dark mode, mobile-friendly — the works.
Don't want Docker? Open WebUI also has a pip install: pip install open-webui && open-webui serve. Or use Jan for a dead-simple desktop app — download, open, pick Qwen 3.5, chat. For more UI options beyond Open WebUI, see our 5 Self-Hosted ChatGPT Alternatives guide.
Using the Vision Capabilities
Unlike previous small models where vision was an afterthought, Qwen 3.5 was trained from the ground up to understand images. You can ask it to describe photos, read text from screenshots, analyze charts, identify objects — all locally.
# Qwen 3.5 is natively multimodal — it can see images
# Use the /api/chat endpoint with images
curl http://localhost:11434/api/chat -d '{
"model": "qwen3.5:9b",
"messages": [{
"role": "user",
"content": "What is in this image?",
"images": ["base64_encoded_image_here"]
}]
}'
# Or with Ollama CLI:
# ollama run qwen3.5:9b "Describe this image: /path/to/photo.jpg"

In Open WebUI, you can just drag and drop images into the chat. It works exactly like ChatGPT's image understanding — except it's running on your machine and nobody sees your screenshots.
How good is the vision? On the OmniDocBench benchmark (document understanding), the 9B scores 87.7 — leading its weight class. It can read handwritten notes, parse receipts, describe complex diagrams. It won't match GPT-4V on tricky edge cases, but for everyday "what's in this image?" tasks, it's excellent.
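The same vision request is easy to script. Here's a minimal Python sketch of the API call above, using only the standard library and assuming Ollama's default endpoint at localhost:11434 (the helper names `image_to_base64` and `describe_image` are mine, not part of Ollama):

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint

def image_to_base64(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def describe_image(path: str, prompt: str = "What is in this image?") -> str:
    """Send an image plus a prompt to local Qwen 3.5 via Ollama's chat API."""
    payload = {
        "model": "qwen3.5:9b",
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [image_to_base64(path)],
        }],
        "stream": False,  # ask for one complete JSON response
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Example usage (with Ollama running):
# print(describe_image("receipt.jpg", "What's the total on this receipt?"))
```

Point it at a screenshot or a photo of a receipt and it reads like the drag-and-drop flow in Open WebUI, just callable from your own scripts.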
Using the API in Your Code
Ollama exposes an OpenAI-compatible API. That means any library, tool, or app that works with the OpenAI API works with your local Qwen 3.5 — just change the base URL.
# Ollama exposes an OpenAI-compatible API at localhost:11434
# Use it with any tool that supports the OpenAI API format
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5:9b",
"messages": [
{"role": "user", "content": "Explain quantum computing in 3 sentences"}
]
}'

# Works with the official OpenAI Python library
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required but ignored
)
response = client.chat.completions.create(
model="qwen3.5:9b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a Python function to merge two sorted lists"}
]
)
print(response.choices[0].message.content)

This is where local AI gets really powerful. Build apps, scripts, and automations that use AI — without paying per token, without rate limits, without sending your data anywhere. Process sensitive documents, analyze private data, run 24/7 batch jobs — all for the cost of electricity. If you're using Claude Code and watching token costs, check our Claude Code token management guide — running local models for some tasks can save significant money.
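To make the batch-job idea concrete, here's a hedged sketch that summarizes every text file in a folder through the same OpenAI-compatible endpoint, using only the standard library. The `summarize_folder` helper and the summarization prompt are illustrative choices of mine, not anything Ollama ships:

```python
import json
import pathlib
import urllib.request

API_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

def build_request(text: str, model: str = "qwen3.5:9b") -> dict:
    """Build an OpenAI-style chat payload asking for a 3-sentence summary."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the user's text in 3 sentences."},
            {"role": "user", "content": text},
        ],
    }

def summarize(text: str) -> str:
    """Send one completion request to the local model and return the reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

def summarize_folder(folder: str) -> dict:
    """Summarize every .txt file in a folder — no per-token cost, no rate limit."""
    return {p.name: summarize(p.read_text()) for p in pathlib.Path(folder).glob("*.txt")}

# Example usage (with Ollama running):
# print(summarize_folder("./notes"))
```

Leave it running overnight on a thousand documents and the only bill is your electricity.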
Creating Custom Models
Want a coding-focused Qwen? A creative writing version? A customer support bot? Ollama's Modelfile system lets you create custom personalities on top of any base model:
# Create a custom Qwen 3.5 with specific behavior
# Save as "Modelfile" (no extension)
FROM qwen3.5:9b
PARAMETER temperature 0.7
PARAMETER num_ctx 32768
SYSTEM """
You are a senior software engineer. Be concise. Write clean code.
When asked to write code, output ONLY the code with brief comments.
No explanations unless explicitly asked.
""" # Build your custom model
ollama create code-assistant -f Modelfile
# Use it
ollama run code-assistant

You can create as many custom models as you want — they all share the same base weights, so they don't eat extra disk space. Just different system prompts, temperatures, and context sizes. Think of it like ChatGPT's "Custom GPTs" but running locally.
Running on a Raspberry Pi
Yes, really. The Qwen 3.5 0.8B model runs on a Raspberry Pi 5 with 8GB of RAM. It's not fast — expect 2-5 tokens per second — but it works. And there's something deeply satisfying about having a private AI running on an $80 computer the size of a credit card.
# Raspberry Pi 5 (8GB) — yes, it actually works
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull the smallest model
ollama pull qwen3.5:0.8b
# Run it — expect ~2-5 tokens/second
ollama run qwen3.5:0.8b
# For a slightly better experience with more patience:
ollama pull qwen3.5:2b
# ~1-2 tokens/second on Pi 5, but noticeably smarter

What it's good for on Pi
- Home automation: Process voice commands, parse sensor data, generate responses
- Smart home assistant: Pair with a mic and speaker for a private, offline AI assistant
- Learning: Teach yourself about LLMs on cheap hardware
- Novelty: Honestly, just showing people AI running on a Pi is a conversation starter
Pro tip: If you want faster inference on a Pi, use the Q4_0 quantization: ollama pull qwen3.5:0.8b-q4_0. Slightly lower quality, noticeably faster.
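If you want a number instead of a feel for the speed difference between quantizations, Ollama's /api/generate response includes `eval_count` and `eval_duration` (in nanoseconds) timing fields you can turn into tokens per second. A small sketch, assuming the default local endpoint (the function names are mine):

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds; convert to tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str = "qwen3.5:0.8b", prompt: str = "Why is the sky blue?") -> float:
    """Run one generation and compute decode speed from Ollama's timing fields."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.loads(resp.read())
    return tokens_per_second(stats["eval_count"], stats["eval_duration"])

# Example usage (with Ollama running):
# print(f"{benchmark():.1f} tokens/sec")
# print(f"{benchmark('qwen3.5:0.8b-q4_0'):.1f} tokens/sec")
```

Run it once against the default tag and once against the Q4_0 tag and you'll see exactly what the quantization buys you on your Pi.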
How Good Is It, Really?
Numbers don't lie. Here's how Qwen 3.5 9B stacks up against models you've heard of:
| Benchmark | Qwen 3.5 9B | GPT-5 Nano | Llama 3.3 8B | Gemma 3 12B |
|---|---|---|---|---|
| MMMLU (multilingual) | 81.2 | 79.8 | 73.0 | 78.5 |
| OmniDocBench (doc understanding) | 87.7 | 83.2 | N/A | 84.1 |
| Context window | 262K | 128K | 128K | 128K |
| License | Apache 2.0 | Proprietary | Llama License | Gemma License |
| Vision built-in | ✅ Native | ✅ | ❌ | ✅ |
| Runs on 8GB VRAM | ✅ | Cloud only | ✅ | Tight |
The headline number: Qwen 3.5 9B outperforms last-generation Qwen 3 30B — a model 3x its size — on multiple benchmarks. That's the kind of efficiency gain that makes local AI practical for normal people, not just researchers with GPU clusters.
Performance Tips
Get the most out of your hardware
- Apple Silicon: Ollama uses Metal acceleration automatically. A 16GB M2 MacBook Air runs the 9B at 20-30 tokens/sec. Best consumer local AI setup per dollar.
- NVIDIA GPU: Make sure you have the CUDA drivers installed. Ollama auto-detects and uses your GPU. RTX 3060 12GB handles 9B comfortably.
- AMD GPU: ROCm support is improving but still hit-or-miss. Check Ollama's compatibility list.
- CPU-only: Works fine with the 0.8B and 2B models. For 4B+, expect slower speeds. Use Q4_0 quantization for a speed boost.
Context window tuning
The default context in Ollama is usually 2048 or 4096 tokens. Qwen 3.5 supports up to 262K but uses more RAM as you increase it. Set what you need:
- Quick chat: Default (4096) is fine
- Long documents: /set parameter num_ctx 32768
- Entire books: /set parameter num_ctx 131072 (needs 32GB+ RAM for 9B)
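If you're calling the API rather than chatting interactively, the same knob goes in the request's `options` field, which overrides the context size for that request only. A sketch, assuming the default endpoint (the helper names are mine):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, num_ctx: int, model: str = "qwen3.5:9b") -> dict:
    """Chat payload with a per-request context size via Ollama's options field."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # applies to this request only
        "stream": False,
    }

def ask_about_document(document: str, question: str) -> str:
    """Feed a long document plus a question using a 32K context window."""
    payload = build_chat_payload(f"{document}\n\nQuestion: {question}", num_ctx=32768)
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

This way a long-document job can request the big window without bloating RAM usage for your everyday quick chats.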
Keep it running
By default, Ollama unloads models after 5 minutes of inactivity. To keep it warm (instant responses):
- Set OLLAMA_KEEP_ALIVE=-1 in your environment — model stays loaded until you stop Ollama
- Or set a longer timeout: OLLAMA_KEEP_ALIVE=1h
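The same setting is also accepted per request via the `keep_alive` field, and sending an empty prompt loads a model without generating anything, which makes a handy warm-up script at boot. A sketch against the default endpoint (the helper names are mine):

```python
import json
import urllib.request

def build_warm_payload(prompt: str, keep_alive: str = "-1",
                       model: str = "qwen3.5:9b") -> dict:
    """Generate payload with keep_alive: "-1" pins the model in memory,
    a duration like "1h" sets an unload timeout instead."""
    return {
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,
    }

def preload(model: str = "qwen3.5:9b") -> None:
    """An empty prompt loads the model into memory without generating text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_warm_payload("", "-1", model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()

# Example usage (with Ollama running): preload() at boot, then every
# later request gets an instant first token instead of a cold load.
```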
Your AI, Your Rules
Three commands. That's all it takes to go from "my conversations feed someone else's business model" to "my conversations stay on my machine."
curl -fsSL https://ollama.com/install.sh | sh && ollama pull qwen3.5:9b && ollama run qwen3.5:9b
No account. No API key. No Pentagon deals. Just you and a really good AI model.