Many people believe that running large language models (LLMs) locally requires a powerful GPU. However, recent developments in model formats like GGUF and quantization techniques have made it possible to run LLMs efficiently on CPUs, even older ones. The metric that actually determines usability isn't model size or RAM but tokens per second (tok/s): a model producing 3–5 tok/s feels painfully slow, while 15–30 tok/s feels responsive. This guide, based on hands-on testing of 8 models on a modest Intel i5 laptop with 12 GB RAM (no GPU acceleration), focuses on models that are truly usable on low-end hardware for everyday tasks.
What Changed to Make CPU Inference Possible?
Two innovations transformed CPU-based LLM inference. First, the GGUF format allows models to be stored and loaded in reduced precision, dramatically shrinking file sizes. Second, aggressive quantization—especially 4-bit variants like Q4_K_M—cuts memory requirements while preserving acceptable output quality. Combined with runtimes such as llama.cpp, which are highly optimized for CPU architectures, even older processors can now run small to medium models without a dedicated GPU.
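As a concrete illustration, here is a minimal sketch of loading a quantized GGUF file for CPU-only inference through the llama-cpp-python bindings to llama.cpp; the model path and settings below are placeholders, not specific recommendations from the tests in this guide.

```python
# Minimal CPU-only load of a quantized GGUF model via llama-cpp-python
# (pip install llama-cpp-python). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-1b-instruct-Q4_K_M.gguf",  # any Q4_K_M GGUF file
    n_ctx=2048,     # context window; smaller values use less RAM
    n_threads=4,    # match your physical core count
)

out = llm("Explain in one sentence why quantization helps CPU inference:", max_tokens=64)
print(out["choices"][0]["text"])
```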

Why Tokens per Second Is the Real Metric You Should Care About
Raw model size or RAM usage can be misleading. The true measure of usability is tokens per second (tok/s). In my tests, a model delivering below 5 tok/s felt glacial—waiting tens of seconds for each sentence. By contrast, reaching 15–30 tok/s made interactions feel natural and suitable for daily use. Quantization dramatically affects this number: Q8 offers higher quality but is slower, while Q4_K_M often doubles token speed, moving a model from frustrating to practical. Always test tok/s on your own hardware instead of relying solely on parameter counts.
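If you want to measure tok/s yourself rather than rely on published numbers, a rough timing check is enough. The sketch below assumes the llama-cpp-python setup shown earlier and a placeholder model path; the timing includes prompt processing, so treat the printed figure as an approximation.

```python
# Rough tokens-per-second check on your own hardware.
# The elapsed time includes prompt processing, so the number is slightly pessimistic.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/example-Q4_K_M.gguf",  # placeholder path
            n_threads=4, verbose=False)

start = time.time()
out = llm("Summarize what tokens per second measures.", max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]  # tokens actually produced
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```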
Which Model Sizes Perform Best on Limited Hardware?
Models in the 1–2 billion parameter range consistently offer the best balance on CPUs. They comfortably fit within 8 GB RAM when quantized (e.g., Q4_K_M) and maintain steady token speeds of 25–40+ tok/s on an Intel i5. Despite their size, they handle basic reasoning, summarization, and chat tasks well. Larger models (3–4B) can run but often drop to 4–7 tok/s, which feels sluggish unless you are willing to wait. For low-end laptops or Raspberry Pis, stick to 1–2B models for a responsive experience.
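For a quick feasibility check before downloading anything, you can estimate the weight footprint from the parameter count. The bits-per-weight figure below is an approximation for Q4_K_M, and it ignores the KV cache and runtime buffers, which add more on top.

```python
# Back-of-the-envelope RAM estimate for quantized weights.
# ~4.8 bits/weight is an approximation for Q4_K_M; the KV cache and
# runtime buffers consume additional memory on top of this figure.
def estimate_weight_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

for size in (1.0, 1.5, 2.0, 3.0, 4.0):
    print(f"{size:g}B parameters @ Q4_K_M ~= {estimate_weight_gb(size):.1f} GB of weights")
```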
What Quantization Should You Choose for CPU?
Based on extensive testing, Q4_K_M strikes the best compromise. Compared to Q8 (highest quality, slowest) or Q2 (fast but degraded output), Q4_K_M provides fast inference, low RAM consumption (often half of the full-precision model), and only a minor drop in answer coherence. It can push a borderline model from a painful 3 tok/s to a usable 15 tok/s. For most everyday tasks, this quantization level is the sweet spot. Try it first, then adjust to Q5_K_M if you have extra memory and need slightly better quality, or Q3_K_S for speed-critical scenarios.
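Most GGUF repositories on Hugging Face publish several quantization levels side by side, so you can fetch only the file you need. The repository and filename below are illustrative examples, not the models tested in this guide; substitute whichever 1–2B GGUF repo you actually use.

```python
# Download only the quantization you want from a GGUF repository
# (pip install huggingface_hub). Repo and filename are illustrative examples.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",  # example repo, substitute your own
    filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf",    # start with the Q4_K_M file
)
print("Saved to:", path)
```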

What Hardware Do You Actually Need?
I ran all tests on an older Intel i5 (8th gen) laptop with 12 GB of RAM and integrated UHD Graphics 620—standard low-end hardware. The iGPU played no role; inference happened entirely on the CPU. The system stayed responsive (no swapping) with quantized 1–2B models. For a comfortable setup, aim for at least 8 GB free RAM after OS overhead. A Linux distribution with minimal background processes helps maximize available memory. Even a Raspberry Pi 4 or 5 with 4–8 GB RAM can run tiny models (e.g., 0.5B–1B quantized) at usable speeds (5–15 tok/s).
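Before loading a model, it is worth checking how much memory is actually available rather than merely installed. A small Linux-only helper like the following (reading /proc/meminfo, no extra packages) is one way to do that; the 2.5 GB threshold is an assumption for a small quantized model, not a measured limit.

```python
# Check available (not just installed) memory on Linux before loading a model.
# Reads /proc/meminfo directly, so no extra packages are needed.
def available_ram_gb() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

free_gb = available_ram_gb()
print(f"Available RAM: {free_gb:.1f} GB")
if free_gb < 2.5:  # assumed rough floor for a 1-2B Q4_K_M model plus context
    print("Close some applications or pick a smaller model/quant.")
```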
Real-World Examples: What Token Speeds Can You Expect?
On my test hardware, a quantized 1.5B model (Q4_K_M) ran at around 40 tok/s, making it feel instant. A 3B model under the same quantization dropped to 12–15 tok/s, still usable for chat but with a slight pause. A 4B model (Q4_K_M) crawled at 4–5 tok/s, which was too slow for interactive use. For comparison, a 7B model (any quantization) failed entirely due to RAM constraints. These numbers highlight why small quantized models are the practical choice for CPU-only machines.
Tips for Running LLMs on Linux Without a GPU
- Use llama.cpp or Ollama (which bundles llama.cpp) – both are CPU-first tools with excellent quantization support.
- Monitor memory usage with htop before loading a model; leave at least 1 GB free for the system.
- Start with Q4_K_M quantization and a 1–2B model, then tune --threads to match your physical core count (see the thread-count sketch after this list); llama.cpp's --numa option can also help on multi-socket machines.
- Disable browser tabs and background services to free RAM – every GB counts.
- For headless servers, use the embedding or batch inference modes to improve throughput on non-interactive tasks.
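As referenced above, thread count is often the easiest knob to turn on a CPU-only setup. The sketch below is one way to compare settings with llama-cpp-python; the model path is a placeholder and the results will vary with your hardware.

```python
# Compare throughput at a few thread counts; on most laptops the sweet spot
# is the number of physical cores, not logical (hyper-threaded) ones.
import time
from llama_cpp import Llama

MODEL = "models/example-Q4_K_M.gguf"  # placeholder path

for threads in (2, 4, 8):
    llm = Llama(model_path=MODEL, n_threads=threads, verbose=False)
    start = time.time()
    out = llm("Write one sentence about CPUs.", max_tokens=64)
    tps = out["usage"]["completion_tokens"] / (time.time() - start)
    print(f"{threads} threads: {tps:.1f} tok/s")
```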