๐Ÿ“Š

GPU VRAM Guide: How Much Do You Need for Each Model Size?

Complete VRAM reference guide for LLM model sizes. See exactly how much GPU VRAM you need for 7B to 70B models at Q4, Q5, Q8, and FP16 quantization levels.

Last updated: February 7, 2026

๐ŸŽฏ Why This Matters

VRAM (Video RAM) is the #1 bottleneck for running AI locally. If your model doesn't fit in VRAM, it either won't load or will spill to system RAM, dropping speed by 10-20x. This guide gives you the exact VRAM numbers for every common model size and quantization level so you can match your GPU to your target models.

๐Ÿ† Our Recommendations

Tested and ranked by real-world AI performance

๐Ÿ’š Budget

8GB VRAM (RTX 4060, RTX 3060 8GB)

$249-299
VRAM: 8 GB
Specs: Fits 7B Q4/Q5, 3B at any quant. Won't fit: 13B+
Performance: 7B Q4 ~25-30 tok/s, 3B Q8 ~40 tok/s
Best For: 7B models only, entry-level local AI

โœ… Pros

  • Cheapest GPU option
  • Handles 7B models well
  • Good for Stable Diffusion 1.5

โŒ Cons

  • Can't fit 13B models
  • Limited to Q4/Q5 for 7B
  • No room for large context windows
Check Price on Amazon โ†’
๐Ÿ’™ Mid-Range

16GB VRAM (RTX 4060 Ti 16GB, RTX 4070 Ti Super)

$399-799
VRAM: 16 GB
Specs: Fits 13B Q4/Q5, 7B at any quant. Tight: 30B Q2/Q3
Performance: 13B Q4 ~12-18 tok/s, 7B Q8 ~25-40 tok/s
Best For: 7B-13B models, the sweet spot for most users

โœ… Pros

  • 13B models fit comfortably
  • 7B at high quantization (Q8)
  • Good for SDXL and Flux
  • Best value per GB of VRAM

โŒ Cons

  • 30B models only at very low quant
  • Can't run 70B at all
  • 16GB becoming baseline โ€” may want more
Check Price on Amazon โ†’
๐Ÿ’œ High-End

24GB VRAM (RTX 4090, RTX 3090)

$800-1,599
VRAM: 24 GB
Specs: Fits 30B Q4, 13B Q8, 7B at any quant (up to FP16). Won't fit: 70B, 13B FP16 (~27 GB)
Performance: 30B Q4 ~15 tok/s, 13B Q8 ~35 tok/s
Best For: Up to 30B models, professional use, training

โœ… Pros

  • 30B Q4 fits in VRAM
  • 13B at near-lossless Q8 quality
  • Enough for LoRA training
  • Handles any image gen task

โŒ Cons

  • 70B still doesn't fit
  • $1,599 for RTX 4090
  • Used RTX 3090 runs ~$800 but is an older architecture
Check Price on Amazon โ†’
๐Ÿ”ด Extreme

48GB+ VRAM (RTX A6000, dual 4090, Apple 128GB)

$3,200-4,500
VRAM: 48+ GB
Specs: Fits 70B Q4 (48 GB); 70B Q8 requires 80 GB+. Everything smaller fits easily.
Performance: 70B Q4 ~15-25 tok/s (varies by setup)
Best For: 70B models, research, no-compromise setup

โœ… Pros

  • Can run 70B models
  • No model size limitations
  • Future-proof

โŒ Cons

  • Extremely expensive
  • Complex setup for multi-GPU
  • High power consumption
Check Price on Amazon โ†’

๐Ÿ’ก Prices may vary. Links may earn us a commission at no extra cost to you. We only recommend products we'd actually use.

๐Ÿค– Compatible Models

Models you can run at each VRAM tier above
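The recommendation cards above already spell out what fits at each tier. As a quick reference, here is a minimal Python sketch that restates those fit lists as a lookup; the tier boundaries and descriptions come straight from the cards, and the function name is just for illustration.

```python
# Quick lookup: which model sizes/quants fit at each VRAM tier.
# The fit descriptions restate the recommendation cards above;
# they are rules of thumb, not exact measurements.
TIERS = [
    (8,  "7B at Q4/Q5, 3B at any quant; 13B+ will not fit"),
    (16, "13B at Q4/Q5, 7B at any quant; 30B only at Q2/Q3"),
    (24, "30B at Q4, 13B at Q8, 7B at any quant; 70B will not fit"),
    (48, "70B at Q4; 70B Q8 needs 80 GB+; everything smaller fits easily"),
]

def what_fits(vram_gb: float) -> str:
    """Return the fit summary for the largest tier at or below vram_gb."""
    summary = "Below 8 GB: outside the tiers covered in this guide"
    for tier_gb, fits in TIERS:
        if vram_gb >= tier_gb:
            summary = f"{tier_gb} GB tier: {fits}"
    return summary

for gb in (8, 12, 16, 24, 48):
    print(f"{gb} GB -> {what_fits(gb)}")
```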

โ“ Frequently Asked Questions

What's the VRAM formula for model sizes?

Quick formula: parameters (in billions) × bytes per parameter ≈ GB needed. For Q4, multiply the parameter count by ~0.6 (7B: 7 × 0.6 ≈ 4.2 GB); for Q8, multiply by ~1.1; for FP16, multiply by ~2.1. Then add 1-2 GB of overhead for the KV cache and inference engine.
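Here is that formula as a small Python helper. The Q4, Q8, and FP16 multipliers come straight from the answer above; the Q5 multiplier (~0.75) and the 1.5 GB default overhead are assumptions, and the function name is just for illustration, so treat the output as a ballpark rather than a guarantee.

```python
# Rough VRAM estimate: parameters (in billions) x per-quant multiplier + overhead.
# Q4/Q8/FP16 multipliers follow the formula above; the Q5 value (~0.75)
# and the 1.5 GB default overhead are assumptions, not measured figures.
BYTES_PER_PARAM = {
    "Q4": 0.6,    # ~4-bit weights plus quantization metadata
    "Q5": 0.75,   # assumption: ~5-bit weights plus metadata
    "Q8": 1.1,    # ~8-bit weights
    "FP16": 2.1,  # 2 bytes per weight plus a little slack
}

def estimate_vram_gb(params_billion: float, quant: str = "Q4",
                     overhead_gb: float = 1.5) -> float:
    """Estimate total VRAM in GB for a model with params_billion parameters."""
    return params_billion * BYTES_PER_PARAM[quant] + overhead_gb

for size in (7, 13, 30, 70):
    row = ", ".join(f"{q} {estimate_vram_gb(size, q):.1f} GB"
                    for q in ("Q4", "Q5", "Q8", "FP16"))
    print(f"{size}B -> {row}")
```

Running it reproduces the tiers above: 7B Q4 lands around 5-6 GB (8 GB card), 13B Q4 around 9 GB (16 GB card), 30B Q4 around 20 GB (24 GB card), and 70B Q4 around 43 GB (48 GB card).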

Q4 vs Q5 vs Q8 vs FP16 โ€” what's the quality difference?

Q4 loses ~1-3% quality vs FP16 but uses 75% less VRAM. Q5 is ~0.5-1.5% quality loss. Q8 is nearly lossless (<0.5% difference). For most users, Q4_K_M or Q5_K_M gives the best balance of quality and size. FP16 is only worth it for research or if you have VRAM to spare.

What happens if my model doesn't fit in VRAM?

Most inference engines (Ollama, llama.cpp) will automatically split the model between GPU VRAM and system RAM. The GPU-loaded layers run fast; the CPU layers run slow. If 80%+ of the model fits in VRAM, you'll still get decent speed; below 50% in VRAM, you're essentially running on CPU.
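To get a feel for how a partial fit plays out, here is a rough sketch of the split. The layer count and equal-size-per-layer assumption are illustrative only (real values depend on the model architecture and quantization); the point is just that speed tracks the fraction of layers that land in VRAM.

```python
# Rough sketch of GPU/CPU layer splitting. The layer count and equal-size
# assumption are illustrative only; real models have uneven layer sizes.
def offload_split(model_gb: float, vram_gb: float, n_layers: int):
    """Estimate how many of n_layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    gpu_layers = min(n_layers, int(vram_gb / per_layer_gb))
    return gpu_layers, gpu_layers / n_layers

# Example: a ~19.5 GB 30B Q4 model on a 16 GB card (assumed 60 layers).
gpu_layers, frac = offload_split(model_gb=19.5, vram_gb=16.0, n_layers=60)
speed = "decent" if frac >= 0.8 else "mixed" if frac >= 0.5 else "essentially CPU"
print(f"{gpu_layers}/60 layers on GPU ({frac:.0%} in VRAM) -> {speed} speed")
```

In llama.cpp this split is controlled with the -ngl / --n-gpu-layers option; Ollama picks a layer split automatically based on available VRAM.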

Ready to build your AI setup?

Pick your hardware, install Ollama, and start running models in minutes.