GPU VRAM Guide: How Much Do You Need for Each Model Size?
Complete VRAM reference guide for LLM model sizes. See exactly how much GPU VRAM you need for 7B to 70B models at Q4, Q5, Q8, and FP16 quantization levels.
Last updated: February 7, 2026
🎯 Why This Matters
VRAM (Video RAM) is the #1 bottleneck for running AI locally. If your model doesn't fit in VRAM, it either won't load or will spill to system RAM, dropping speed by 10-20x. This guide gives you the exact VRAM numbers for every common model size and quantization level so you can match your GPU to your target models.
🏆 Our Recommendations
Tested and ranked by real-world AI performance
8GB VRAM (RTX 4060, RTX 3060 8GB)
✅ Pros
- Cheapest GPU option
- Handles 7B models well
- Good for Stable Diffusion 1.5
❌ Cons
- Can't fit 13B models
- Limited to Q4/Q5 for 7B
- No room for large context windows
16GB VRAM (RTX 4060 Ti 16GB, RTX 4070 Ti Super)
✅ Pros
- 13B models fit comfortably
- 7B at high-quality quantization (Q8)
- Good for SDXL and Flux
- Best value per GB of VRAM
❌ Cons
- 30B models only at very low quant
- Can't run 70B at all
- 16GB is becoming the baseline; you may want more
24GB VRAM (RTX 4090, RTX 3090)
✅ Pros
- 30B Q4 fits in VRAM
- 13B at full quality
- Enough for LoRA training
- Handles any image gen task
❌ Cons
- 70B still doesn't fit
- $1,599 for RTX 4090
- Used RTX 3090 runs ~$800 but is an older architecture
48GB+ VRAM (RTX A6000, dual RTX 4090, Apple Silicon with 128GB unified memory)
✅ Pros
- Can run 70B models
- No model size limitations
- Future-proof
❌ Cons
- Extremely expensive
- Complex setup for multi-GPU
- High power consumption
💡 Prices may vary. Links may earn us a commission at no extra cost to you. We only recommend products we'd actually use.
🤖 Compatible Models
Models you can run, with the minimum VRAM each one needs (a quick filter sketch follows the list)
DeepSeek R1 14B
14B · 10 GB min VRAM · DeepSeek
DeepSeek R1 32B
32B · 20 GB min VRAM · DeepSeek
DeepSeek R1 70B
70B · 40 GB min VRAM · DeepSeek
Gemma 2 27B
27B · 18 GB min VRAM · Google
DeepSeek R1 7B
7B · 6 GB min VRAM · DeepSeek
Mistral 7B
7B · 6 GB min VRAM · Mistral AI
Llama 3.3 70B
70B · 40 GB min VRAM · Meta
Phi-4
14B · 10 GB min VRAM · Microsoft
Qwen 2.5 14B
14B · 10 GB min VRAM · Alibaba
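If you want to map these numbers to your own card, a few lines of Python will filter the list by available VRAM. This is a minimal sketch: the MODELS dict simply restates the minimum-VRAM figures above, and runnable_models is an illustrative helper, not part of Ollama or any other tool.

```python
# Minimum VRAM (GB) per model, taken from the list above.
MODELS = {
    "DeepSeek R1 7B": 6,
    "Mistral 7B": 6,
    "DeepSeek R1 14B": 10,
    "Phi-4": 10,
    "Qwen 2.5 14B": 10,
    "Gemma 2 27B": 18,
    "DeepSeek R1 32B": 20,
    "DeepSeek R1 70B": 40,
    "Llama 3.3 70B": 40,
}

def runnable_models(vram_gb: float) -> list[str]:
    """Return the models whose minimum VRAM fits in the given budget."""
    return [name for name, need_gb in MODELS.items() if need_gb <= vram_gb]

print(runnable_models(16))
# ['DeepSeek R1 7B', 'Mistral 7B', 'DeepSeek R1 14B', 'Phi-4', 'Qwen 2.5 14B']
```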
❓ Frequently Asked Questions
What's the VRAM formula for model sizes?
Quick formula: Model params × bytes per param ÷ 1 billion = GB needed. For Q4: multiply params by 0.6 (7B × 0.6 = 4.2GB). For Q8: multiply by 1.1. For FP16: multiply by 2.1. Add 1-2GB overhead for KV cache and inference engine.
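Here is the same rule of thumb as a short Python sketch. The multipliers and the flat overhead figure are just the approximations from this answer, so treat the output as a ballpark rather than a guarantee.

```python
# Rough VRAM estimate using the rule of thumb above:
# GB-per-billion-params multipliers per precision, plus a flat
# allowance for the KV cache and inference engine.
GB_PER_BILLION = {"Q4": 0.6, "Q8": 1.1, "FP16": 2.1}

def estimate_vram_gb(params_billion: float, quant: str,
                     overhead_gb: float = 1.5) -> float:
    """Ballpark VRAM (GB) for a model with `params_billion` parameters."""
    return params_billion * GB_PER_BILLION[quant] + overhead_gb

for quant in GB_PER_BILLION:
    print(f"7B at {quant}: ~{estimate_vram_gb(7, quant):.1f} GB")
# 7B at Q4: ~5.7 GB, Q8: ~9.2 GB, FP16: ~16.2 GB
```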
Q4 vs Q5 vs Q8 vs FP16: what's the quality difference?
Q4 loses ~1-3% quality vs FP16 but uses 75% less VRAM. Q5 is ~0.5-1.5% quality loss. Q8 is nearly lossless (<0.5% difference). For most users, Q4_K_M or Q5_K_M gives the best balance of quality and size. FP16 is only worth it for research or if you have VRAM to spare.
What happens if my model doesn't fit in VRAM?
Most inference engines (Ollama, llama.cpp) will automatically split between GPU and CPU RAM. The GPU-loaded layers run fast, CPU layers run slow. If 80%+ fits in VRAM, you'll still get decent speed. Below 50% in VRAM, you're essentially running on CPU.
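For intuition, here is a rough sketch of that split. It is purely illustrative: Ollama and llama.cpp decide the layer split internally, and the 60-layer count and equal-layer-size assumption in the example are simplifications, not real model specs.

```python
def gpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> tuple[int, float]:
    """Estimate how many layers fit on the GPU and what fraction of the
    model that covers, assuming roughly equal-sized layers and reserving
    some VRAM for the KV cache and engine overhead."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(budget_gb / per_layer_gb))
    return gpu_layers, gpu_layers / n_layers

# Example: an ~18 GB 30B Q4 model with 60 layers on a 16 GB card.
layers, fraction = gpu_layer_split(model_gb=18, n_layers=60, vram_gb=16)
print(f"{layers}/60 layers on GPU ({fraction:.0%} of the model in VRAM)")
# 48/60 layers on GPU (80% of the model in VRAM) -> still decent speed
```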
Ready to build your AI setup?
Pick your hardware, install Ollama, and start running models in minutes.