As of 2026-05-15
Running a model locally means downloading the weights and performing inference on hardware you own. It is more accessible than it sounds; the whole stack is three pieces.
The three pieces
- A runtime. The program that loads the model into memory and decodes tokens. Pick from Ollama, LM Studio, llama.cpp directly, vLLM for high-throughput serving, or one of the GPU-vendor-specific stacks. All do the same job; they differ in ergonomics and throughput.
- A model in a runnable format. The community standardized on GGUF for CPU and CPU/GPU-hybrid runtimes (Ollama, LM Studio, llama.cpp). For pure-GPU serving you tend to use safetensors with vLLM or Hugging Face Transformers.
- A model that fits. This is the constraint that decides everything. Your usable VRAM (or unified memory on Apple Silicon) caps which model you can run at which quantization.
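To make the stack concrete, here is a minimal sketch of talking to a local runtime over HTTP. It assumes Ollama's default local endpoint and a model you have already pulled; the model tag is an example, not a recommendation.

```python
# Minimal sketch: query a locally running Ollama server over its HTTP API.
# Assumes the Ollama server is running and a model has been pulled
# (e.g. `ollama pull llama3`). Endpoint and port are Ollama's defaults.
import json
import urllib.request

def generate(prompt: str, model: str = "llama3") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(generate("Explain GGUF in one sentence."))
```

LM Studio and vLLM expose OpenAI-compatible endpoints instead, so the same pattern works with a different URL and payload shape.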
VRAM rough rules
At 4-bit quantization (Q4_K_M, the common default):
- 8 GB VRAM → 7B–8B class models, comfortably.
- 12 GB VRAM → 13B class, or 7B with a long context window.
- 16 GB VRAM → 13B with room, or 27B class squeezed.
- 24 GB VRAM (one 3090 / 4090) → 27B class comfortably; 70B class only with offload or heavy quant.
- 48 GB VRAM (two 3090s, an A6000, an RTX 6000 Ada) → 70B class natively, long contexts.
- 96 GB+ → frontier-tier open-weight models without compromise.
Apple Silicon: a 36 GB M3 Pro behaves roughly like a 24 GB GPU for these models, but ~3× slower per token. A 128 GB M3 Max can run 70B+ models that no consumer PC can touch — just slowly.
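These tiers fall out of simple arithmetic: weight memory scales with parameter count times bits per weight, and the KV cache scales with context length. A back-of-envelope sketch; the model shape is Llama-3-8B-like and the ~4.8 effective bits per weight for Q4_K_M is approximate:

```python
# Back-of-envelope fit check: weights + KV cache. Real usage adds
# activations and runtime overhead (roughly another 1-2 GB), so treat
# this as a floor, not a guarantee.
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    # billions of params * bits per weight / 8 bits per byte = GB
    return params_b * bits_per_weight / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values; fp16 cache assumed
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# 8B model at Q4_K_M (~4.8 effective bits/weight), 8k context.
# Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head dim 128.
weights = weight_gb(8, 4.8)          # ~4.8 GB
kv = kv_cache_gb(32, 8, 128, 8192)   # ~1.1 GB
print(f"weights ~{weights:.1f} GB + KV cache ~{kv:.1f} GB")
# ~5.9 GB plus overhead: an 8 GB card fits, with a little headroom.
```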
Quantization in one paragraph
You almost never run weights at full 16-bit precision locally. You quantize them to 8, 6, 5, or 4 bits per parameter. Memory drops proportionally; speed often improves; quality drops a little. 4-bit (Q4_K_M for GGUF) is the modern sweet spot for chat. Drop to 3-bit only if you must. Stay above 4-bit only for fine-grained reasoning or generation work where you have actually observed quality cliffs at 4-bit.
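The proportionality is easy to see numerically. A sweep over common GGUF quant levels for a 70B-class model; the bits-per-weight figures are approximate effective values (K-quants mix block formats), not exact:

```python
# Approximate effective bits per weight for common GGUF quant levels.
QUANTS = [("F16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.6),
          ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]

for name, bpw in QUANTS:
    print(f"{name:7s} -> ~{70 * bpw / 8:5.1f} GB of weights for a 70B model")
# F16 ~140 GB, Q8_0 ~74 GB, Q4_K_M ~42 GB: the gap between "impossible
# locally" and "fits a 48 GB setup" is the quantization, nothing else.
```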
What runs well right now
This is the dated part of the page; see the latest snapshot for a current rundown. Generally, in mid-2026, the workable open-weight tiers are:
- 7B–8B for fast chat on consumer hardware.
- 27B–32B as the new "default smart model" for 24 GB cards.
- 70B+ for power users with multi-GPU setups; competitive with mid-tier API models on many tasks.
When local beats API
- Privacy. Data never leaves your machine. For regulated work, this is the whole game.
- Latency. Chat-style interaction has near-zero round-trip time, since nothing crosses the network. Useful for fast UIs.
- Cost at high volume. Hardware is a fixed cost; API bills scale with usage. If a workload would cost you four figures a month on an API, the math flips fast (a break-even sketch follows this list).
- No-network or low-network environments. Field work, secure facilities, intermittent connections.
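The break-even sketch promised above. Every number here is an assumed placeholder, not data from this page; substitute your own API rate, hardware price, and electricity cost:

```python
# Hedged break-even sketch for local hardware vs. API spend.
# ALL prices below are ASSUMED for illustration.
HW_COST = 1800.0                   # used 24 GB GPU + PSU, assumed
POWER_KW, KWH_PRICE = 0.35, 0.15   # draw under load (kW), $/kWh, assumed
API_PER_MTOK = 3.0                 # blended $/million tokens, assumed
MONTHLY_MTOK = 400                 # your workload, millions of tokens/month

api_monthly = API_PER_MTOK * MONTHLY_MTOK        # $1200/month at these rates
power_monthly = POWER_KW * 24 * 30 * KWH_PRICE   # ~$38/month if pinned 24/7
months = HW_COST / (api_monthly - power_monthly)
print(f"hardware pays for itself in ~{months:.1f} months")  # ~1.5 months
```

At these assumed numbers the hardware pays for itself in under two months; at a tenth of the volume it takes nearly two years, which is the whole decision in one line.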
When API beats local
- Raw capability per dollar. The frontier APIs are still ahead on the hardest reasoning, the longest context, and the latest skills.
- Multimodal. Open-weight models are catching up on vision but still lag on audio and video.
- Operational simplicity. No GPU babysitting. Updates are someone else's problem.
- Scaling spikes. APIs absorb bursts; your local GPU does not.
Most real setups are hybrid: a frontier API for the hard stuff, a local model for the high-volume or sensitive stuff. Pick local where it pays for itself.