Running LLMs locally gives you privacy, speed, and zero per-token costs. Getting there requires navigating driver installations, quantization formats, memory constraints, and model selection. This guide compresses what took most developers a weekend of trial and error into a single readable document.
Step 1: Hardware Prerequisites
Before installing anything, confirm your hardware meets minimum requirements:
- GPU: NVIDIA RTX 30xx or 40xx with ≥12GB VRAM (minimum). 70B models at Q4_K_M need ~48GB, e.g. 2×24GB cards. AMD GPUs work via ROCm but are typically 20–30% slower for most workloads.
- System RAM: 32GB minimum. 64GB if you'll be doing RAG with large document sets.
- Storage: NVMe SSD required. 70B model in Q4_K_M format = ~40GB. Budget 500GB for model storage.
- OS: Ubuntu 22.04 LTS is the recommended OS for GPU AI workloads. macOS works for Apple Silicon. Windows works but adds WSL2 complexity.
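The sizing question above boils down to simple arithmetic: quantized weights plus a few GB of overhead for the KV cache and activations must fit in VRAM. A minimal sketch (the 2GB overhead figure is an assumption for illustration, not a measured value):

```python
def fits_in_vram(model_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights + KV-cache/activation overhead vs. available VRAM."""
    return model_gb + overhead_gb <= vram_gb

print(fits_in_vram(40, 48))  # 70B Q4_K_M (~40GB) across 2×24GB cards → True
print(fits_in_vram(40, 24))  # the same model on a single 4090 → False
```

Real overhead grows with context length, so treat this as a lower bound when planning hardware.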
Step 2: NVIDIA Driver and CUDA Setup (Linux)
# Add NVIDIA's CUDA apt repository via the keyring package
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install CUDA toolkit (12.4 as of 2025)
sudo apt-get install -y cuda-toolkit-12-4
# Verify installation
nvidia-smi
nvcc --version
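Beyond eyeballing `nvidia-smi` output, you can verify the driver programmatically by parsing its CSV query mode. A sketch; the sample string stands in for a live `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` call:

```python
def parse_gpu_info(csv_line: str) -> tuple[str, int]:
    """Parse one 'name, memory.total' CSV row from nvidia-smi into (name, MiB)."""
    name, mem = [field.strip() for field in csv_line.split(",")]
    return name, int(mem.split()[0])  # "24564 MiB" -> 24564

# Live version: subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total",
#                               "--format=csv,noheader"], capture_output=True, text=True)
name, vram_mib = parse_gpu_info("NVIDIA GeForce RTX 4090, 24564 MiB")
print(name, vram_mib)
```

This is handy in provisioning scripts that should fail fast when the driver install didn't take.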
Step 3: Install Ollama (Easiest Path)
Ollama is the fastest way to get a model running locally. One command installs a local model server with an OpenAI-compatible API:
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Llama 3.1 8B (fastest; ~5GB download)
ollama run llama3.1
# Pull Llama 3.1 70B (best quality, ~40GB needed)
ollama pull llama3.1:70b
# List downloaded models
ollama list
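Which tag to pull depends on your VRAM. A hypothetical helper (the function and thresholds are illustrative heuristics based on the quantized sizes above, not part of Ollama):

```python
def pick_llama_tag(vram_gb: float) -> str:
    """Suggest an Ollama tag for Llama 3.1 given available VRAM (rough heuristic)."""
    if vram_gb >= 48:
        return "llama3.1:70b"  # ~40GB of weights; fits across 2×24GB GPUs
    return "llama3.1"          # 8B default tag; runs comfortably in ~8GB

print(pick_llama_tag(48))  # → llama3.1:70b
print(pick_llama_tag(12))  # → llama3.1
```

Ollama will partially offload an oversized model to CPU RAM rather than fail, but throughput drops sharply, so it's worth choosing deliberately.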
Step 4: API Access
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. You can point any existing OpenAI SDK code at it by changing only the base URL:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works; Ollama ignores the key
)

response = client.chat.completions.create(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
Step 5: Performance Tuning
Default Ollama settings leave performance on the table. Key tuning options:
# Raise the context window (in tokens) before starting the server
OLLAMA_NUM_CTX=8192 ollama serve
# Set GPU layers (default is auto-detected, but explicit is more reliable)
OLLAMA_NUM_GPU=999 ollama serve # 999 = offload every layer to the GPU
# Parallel requests (increase for serving to multiple users)
OLLAMA_NUM_PARALLEL=4 ollama serve
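A larger context window is not free: with grouped-query attention, the KV cache costs roughly 2 × layers × kv_heads × head_dim × context × bytes-per-value. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128) at FP16:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_val: int = 2) -> int:
    """KV cache size: keys + values for every layer, KV head, and position."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_val

# Llama 3.1 8B at ctx=8192, FP16 cache
gb = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(round(gb, 2))  # → 1.0
```

So an 8192-token context adds about 1GB per loaded model on top of the weights, and it scales linearly with both context length and OLLAMA_NUM_PARALLEL.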
Choosing the Right Quantization
Quantization reduces model size and speeds up inference at the cost of small quality degradation. The practical tradeoffs:
- Q8_0 — nearly lossless quality, 8-bit. Use when you have VRAM to spare. Llama 3.1 70B Q8 = ~75GB, more than 2×4090 can hold (48GB total)
- Q4_K_M — the sweet spot. Minimal quality loss at roughly half the size of Q8. Llama 3.1 70B Q4_K_M = ~43GB (fits on 2×4090)
- Q3_K_M — smaller but noticeable quality drop. Use for 7B–13B models on constrained hardware.
- Q2_K — emergency option only. Quality degrades significantly. Better to use a smaller model at Q4.
Rule of thumb: Use the largest model that fits in your VRAM at Q4_K_M before dropping to a smaller model at Q8. A 70B Q4_K_M outperforms a 13B Q8 by a significant margin on most reasoning tasks.
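The size column in the list above follows from weights-only arithmetic: billions of parameters × bits-per-weight ÷ 8 gives GB. The bits-per-weight figures below are approximate community values for llama.cpp quant formats, not exact specs:

```python
# Approximate effective bits per weight for common llama.cpp quant formats
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6}

def weights_gb(params_b: float, quant: str) -> float:
    """Weights-only footprint in GB for a model with params_b billion parameters."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B {quant}: ~{weights_gb(70, quant):.0f} GB")
```

Running this reproduces the ~43GB Q4_K_M figure and shows why Q8 at 70B pushes past what a dual-4090 box can hold.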
Need Help Setting Up Your AI Infrastructure?
We set up production-grade local LLM infrastructure for development teams. Start with a free consultation.
Talk to the Team