๐Ÿ“Š

GPU VRAM Guide: How Much Do You Need for Each Model Size?

Complete VRAM reference guide for LLM model sizes. See exactly how much GPU VRAM you need for 7B to 70B models at Q4, Q5, Q8, and FP16 quantization levels.

Last updated: February 7, 2026

๐ŸŽฏ Why This Matters

VRAM (Video RAM) is the #1 bottleneck for running AI locally. If your model doesn't fit in VRAM, it either won't load or will spill to system RAM, dropping speed by 10-20x. This guide gives you the exact VRAM numbers for every common model size and quantization level so you can match your GPU to your target models.

๐Ÿ† Our Recommendations

Tested and ranked by real-world AI performance

๐Ÿ’š Budget

8GB VRAM (RTX 4060, RTX 3060 8GB)

$249-299
VRAM: 8 GB
Specs: Fits 7B Q4/Q5, 3B at any quant. Won't fit: 13B+
Performance: 7B Q4 ~25-30 tok/s, 3B Q8 ~40 tok/s
Best For: 7B models only, entry-level local AI

โœ… Pros

  • Cheapest GPU option
  • Handles 7B models well
  • Good for Stable Diffusion 1.5

โŒ Cons

  • Can't fit 13B models
  • Limited to Q4/Q5 for 7B
  • No room for large context windows
Check Price on Amazon โ†’
๐Ÿ’™ Mid-Range

16GB VRAM (RTX 4060 Ti 16GB, RTX 4070 Ti Super)

$399-799
VRAM: 16 GB
Specs: Fits 13B Q4/Q5, 7B at any quant. Tight: 30B Q2/Q3
Performance: 13B Q4 ~12-18 tok/s, 7B Q8 ~25-40 tok/s
Best For: 7B-13B models, the sweet spot for most users

โœ… Pros

  • 13B models fit comfortably
  • 7B at high quantization (Q8)
  • Good for SDXL and Flux
  • Best value per GB of VRAM

โŒ Cons

  • 30B models only at very low quant
  • Can't run 70B at all
  • 16GB becoming baseline โ€” may want more
Check Price on Amazon โ†’
๐Ÿ’œ High-End

24GB VRAM (RTX 4090, RTX 3090)

$800-1,599
VRAM: 24 GB
Specs: Fits 30B Q4, 13B Q8, 7B at any quant (up to FP16). Won't fit: 70B, 13B FP16 (~27 GB)
Performance: 30B Q4 ~15 tok/s, 13B Q8 ~35 tok/s
Best For: Up to 30B models, professional use, training

โœ… Pros

  • 30B Q4 fits in VRAM
  • 13B at near-lossless Q8 quality
  • Enough for LoRA training
  • Handles any image gen task

โŒ Cons

  • 70B still doesn't fit
  • $1,599 for RTX 4090
  • Used RTX 3090 runs ~$800 but is an older architecture
Check Price on Amazon โ†’
๐Ÿ”ด Extreme

48GB+ VRAM (RTX A6000, dual 4090, Apple 128GB)

$3,200-4,500
VRAM: 48+ GB
Specs: Fits 70B Q4 (48 GB); 70B Q8 requires 80 GB+. Everything smaller fits easily.
Performance: 70B Q4 ~15-25 tok/s (varies by setup)
Best For: 70B models, research, no-compromise setup

โœ… Pros

  • Can run 70B models
  • No model size limitations
  • Future-proof

โŒ Cons

  • Extremely expensive
  • Complex setup for multi-GPU
  • High power consumption
Check Price on Amazon โ†’

๐Ÿ’ก Prices may vary. Links may earn us a commission at no extra cost to you. We only recommend products we'd actually use.

๐Ÿค– Compatible Models

Models you can run at each VRAM tier above
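The recommendation cards above already spell out what fits at each tier. As a quick reference, here is a minimal Python sketch that restates those fit lists as a lookup; the tier boundaries and descriptions come straight from the cards, and the function name is just for illustration.

```python
# Quick lookup: which model sizes/quants fit at each VRAM tier.
# The fit descriptions restate the recommendation cards above;
# they are rules of thumb, not exact measurements.
TIERS = [
    (8,  "7B at Q4/Q5, 3B at any quant; 13B+ will not fit"),
    (16, "13B at Q4/Q5, 7B at any quant; 30B only at Q2/Q3"),
    (24, "30B at Q4, 13B at Q8, 7B at any quant; 70B will not fit"),
    (48, "70B at Q4; 70B Q8 needs 80 GB+; everything smaller fits easily"),
]

def what_fits(vram_gb: float) -> str:
    """Return the fit summary for the largest tier at or below vram_gb."""
    summary = "Below 8 GB: outside the tiers covered in this guide"
    for tier_gb, fits in TIERS:
        if vram_gb >= tier_gb:
            summary = f"{tier_gb} GB tier: {fits}"
    return summary

for gb in (8, 12, 16, 24, 48):
    print(f"{gb} GB -> {what_fits(gb)}")
```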

โ“ Frequently Asked Questions

What's the VRAM formula for model sizes?

Quick formula: parameters (in billions) × bytes per parameter ≈ GB needed. For Q4, multiply the parameter count by ~0.6 (7B: 7 × 0.6 ≈ 4.2 GB); for Q8, multiply by ~1.1; for FP16, multiply by ~2.1. Then add 1-2 GB of overhead for the KV cache and inference engine.
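Here is that formula as a small Python helper. The Q4, Q8, and FP16 multipliers come straight from the answer above; the Q5 multiplier (~0.75) and the 1.5 GB default overhead are assumptions, and the function name is just for illustration, so treat the output as a ballpark rather than a guarantee.

```python
# Rough VRAM estimate: parameters (in billions) x per-quant multiplier + overhead.
# Q4/Q8/FP16 multipliers follow the formula above; the Q5 value (~0.75)
# and the 1.5 GB default overhead are assumptions, not measured figures.
BYTES_PER_PARAM = {
    "Q4": 0.6,    # ~4-bit weights plus quantization metadata
    "Q5": 0.75,   # assumption: ~5-bit weights plus metadata
    "Q8": 1.1,    # ~8-bit weights
    "FP16": 2.1,  # 2 bytes per weight plus a little slack
}

def estimate_vram_gb(params_billion: float, quant: str = "Q4",
                     overhead_gb: float = 1.5) -> float:
    """Estimate total VRAM in GB for a model with params_billion parameters."""
    return params_billion * BYTES_PER_PARAM[quant] + overhead_gb

for size in (7, 13, 30, 70):
    row = ", ".join(f"{q} {estimate_vram_gb(size, q):.1f} GB"
                    for q in ("Q4", "Q5", "Q8", "FP16"))
    print(f"{size}B -> {row}")
```

Running it reproduces the tiers above: 7B Q4 lands around 5-6 GB (8 GB card), 13B Q4 around 9 GB (16 GB card), 30B Q4 around 20 GB (24 GB card), and 70B Q4 around 43 GB (48 GB card).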

Q4 vs Q5 vs Q8 vs FP16 โ€” what's the quality difference?

Q4 loses ~1-3% quality vs FP16 but uses 75% less VRAM. Q5 is ~0.5-1.5% quality loss. Q8 is nearly lossless (<0.5% difference). For most users, Q4_K_M or Q5_K_M gives the best balance of quality and size. FP16 is only worth it for research or if you have VRAM to spare.

What happens if my model doesn't fit in VRAM?

Most inference engines (Ollama, llama.cpp) will automatically split the model between GPU VRAM and system RAM. The GPU-loaded layers run fast; the CPU layers run slow. If 80%+ of the model fits in VRAM, you'll still get decent speed; below 50% in VRAM, you're essentially running on CPU.
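To get a feel for how a partial fit plays out, here is a rough sketch of the split. The layer count and equal-size-per-layer assumption are illustrative only (real values depend on the model architecture and quantization); the point is just that speed tracks the fraction of layers that land in VRAM.

```python
# Rough sketch of GPU/CPU layer splitting. The layer count and equal-size
# assumption are illustrative only; real models have uneven layer sizes.
def offload_split(model_gb: float, vram_gb: float, n_layers: int):
    """Estimate how many of n_layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    gpu_layers = min(n_layers, int(vram_gb / per_layer_gb))
    return gpu_layers, gpu_layers / n_layers

# Example: a ~19.5 GB 30B Q4 model on a 16 GB card (assumed 60 layers).
gpu_layers, frac = offload_split(model_gb=19.5, vram_gb=16.0, n_layers=60)
speed = "decent" if frac >= 0.8 else "mixed" if frac >= 0.5 else "essentially CPU"
print(f"{gpu_layers}/60 layers on GPU ({frac:.0%} in VRAM) -> {speed} speed")
```

In llama.cpp this split is controlled with the -ngl / --n-gpu-layers option; Ollama picks a layer split automatically based on available VRAM.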

Ready to build your AI setup?

Pick your hardware, install Ollama, and start running models in minutes.