Running LLMs locally gives you privacy, speed, and zero per-token costs. Getting there requires navigating driver installations, quantization formats, memory constraints, and model selection. This guide compresses what took most developers a weekend of trial and error into a single readable document.
Step 1: Hardware Prerequisites
Before installing anything, confirm your hardware meets minimum requirements:
- GPU: NVIDIA RTX 30xx or 40xx with ≥12GB VRAM (minimum). 70B models at Q4_K_M need ~48GB, e.g. 2×24GB cards. AMD GPUs work via ROCm but are typically 20–30% slower for most workloads.
- System RAM: 32GB minimum. 64GB if you'll be doing RAG with large document sets.
- Storage: NVMe SSD required. 70B model in Q4_K_M format = ~40GB. Budget 500GB for model storage.
- OS: Ubuntu 22.04 LTS is the recommended OS for GPU AI workloads. macOS works for Apple Silicon. Windows works but adds WSL2 complexity.
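The sizing question above boils down to simple arithmetic: quantized weights plus a few GB of overhead for the KV cache and activations must fit in VRAM. A minimal sketch (the 2GB overhead figure is an assumption for illustration, not a measured value):

```python
def fits_in_vram(model_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights + KV-cache/activation overhead vs. available VRAM."""
    return model_gb + overhead_gb <= vram_gb

print(fits_in_vram(40, 48))  # 70B Q4_K_M (~40GB) across 2×24GB cards → True
print(fits_in_vram(40, 24))  # the same model on a single 4090 → False
```

Real overhead grows with context length, so treat this as a lower bound when planning hardware.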
Step 2: NVIDIA Driver and CUDA Setup (Linux)
# Add NVIDIA's CUDA apt repository via the keyring package
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install CUDA toolkit (12.4 as of 2025)
sudo apt-get install -y cuda-toolkit-12-4
# Verify installation
nvidia-smi
nvcc --version
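Beyond eyeballing `nvidia-smi` output, you can verify the driver programmatically by parsing its CSV query mode. A sketch; the sample string stands in for a live `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` call:

```python
def parse_gpu_info(csv_line: str) -> tuple[str, int]:
    """Parse one 'name, memory.total' CSV row from nvidia-smi into (name, MiB)."""
    name, mem = [field.strip() for field in csv_line.split(",")]
    return name, int(mem.split()[0])  # "24564 MiB" -> 24564

# Live version: subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total",
#                               "--format=csv,noheader"], capture_output=True, text=True)
name, vram_mib = parse_gpu_info("NVIDIA GeForce RTX 4090, 24564 MiB")
print(name, vram_mib)
```

This is handy in provisioning scripts that should fail fast when the driver install didn't take.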
Step 3: Install Ollama (Easiest Path)
Ollama is the fastest way to get a model running locally. One command installs a local model server with an OpenAI-compatible API:
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Llama 3.1 8B (fastest; ~5GB download)
ollama run llama3.1
# Pull Llama 3.1 70B (best quality, ~40GB needed)
ollama pull llama3.1:70b
# List downloaded models
ollama list
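Which tag to pull depends on your VRAM. A hypothetical helper (the function and thresholds are illustrative heuristics based on the quantized sizes above, not part of Ollama):

```python
def pick_llama_tag(vram_gb: float) -> str:
    """Suggest an Ollama tag for Llama 3.1 given available VRAM (rough heuristic)."""
    if vram_gb >= 48:
        return "llama3.1:70b"  # ~40GB of weights; fits across 2×24GB GPUs
    return "llama3.1"          # 8B default tag; runs comfortably in ~8GB

print(pick_llama_tag(48))  # → llama3.1:70b
print(pick_llama_tag(12))  # → llama3.1
```

Ollama will partially offload an oversized model to CPU RAM rather than fail, but throughput drops sharply, so it's worth choosing deliberately.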
Step 4: API Access
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. You can point any existing OpenAI SDK code at it by changing only the base URL:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works; Ollama ignores the key
)

response = client.chat.completions.create(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
Step 5: Performance Tuning
Default Ollama settings leave performance on the table. Key tuning options:
# Raise the context window (in tokens) before starting the server
OLLAMA_NUM_CTX=8192 ollama serve
# Set GPU layers (default is auto-detected, but explicit is more reliable)
OLLAMA_NUM_GPU=999 ollama serve # 999 = offload every layer to the GPU
# Parallel requests (increase for serving to multiple users)
OLLAMA_NUM_PARALLEL=4 ollama serve
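A larger context window is not free: with grouped-query attention, the KV cache costs roughly 2 × layers × kv_heads × head_dim × context × bytes-per-value. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128) at FP16:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_val: int = 2) -> int:
    """KV cache size: keys + values for every layer, KV head, and position."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_val

# Llama 3.1 8B at ctx=8192, FP16 cache
gb = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(round(gb, 2))  # → 1.0
```

So an 8192-token context adds about 1GB per loaded model on top of the weights, and it scales linearly with both context length and OLLAMA_NUM_PARALLEL.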
Choosing the Right Quantization
Quantization reduces model size and speeds up inference at the cost of small quality degradation. The practical tradeoffs:
- Q8_0 — nearly lossless quality, 8-bit. Use when you have VRAM to spare. Llama 3.1 70B Q8 = ~75GB, more than 2×4090 can hold (48GB total)
- Q4_K_M — the sweet spot. Minimal quality loss at roughly half the size of Q8. Llama 3.1 70B Q4_K_M = ~43GB (fits on 2×4090)
- Q3_K_M — smaller but noticeable quality drop. Use for 7B–13B models on constrained hardware.
- Q2_K — emergency option only. Quality degrades significantly. Better to use a smaller model at Q4.
Rule of thumb: Use the largest model that fits in your VRAM at Q4_K_M before dropping to a smaller model at Q8. A 70B Q4_K_M outperforms a 13B Q8 by a significant margin on most reasoning tasks.
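The size column in the list above follows from weights-only arithmetic: billions of parameters × bits-per-weight ÷ 8 gives GB. The bits-per-weight figures below are approximate community values for llama.cpp quant formats, not exact specs:

```python
# Approximate effective bits per weight for common llama.cpp quant formats
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6}

def weights_gb(params_b: float, quant: str) -> float:
    """Weights-only footprint in GB for a model with params_b billion parameters."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B {quant}: ~{weights_gb(70, quant):.0f} GB")
```

Running this reproduces the ~43GB Q4_K_M figure and shows why Q8 at 70B pushes past what a dual-4090 box can hold.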
Need Help Setting Up Your AI Infrastructure?
We set up production-grade local LLM infrastructure for development teams. Start with a free consultation.
Talk to the Team