Workflow strategy · 06/08/2026

Local AI agents on 16 GB: What Gemma 4 QAT means for European businesses

Google's Gemma 4 QAT models now run on 16 GB laptops, and the smallest fits in about 1 GB on mobile devices. What Quantization-Aware Training changes technically, which model fits which hardware, and how European businesses get into production in ten minutes, including the gotcha and GDPR-compliant deployment scenarios.

On June 5, 2026, Google shipped a release that changes what is practical for local AI agents: Gemma 4 in a Quantization-Aware Training (QAT) variant. The models need roughly 72 percent less memory than full precision while preserving nearly all of the original quality. Concretely, a 26B model that previously would not fit on a 16 GB laptop now runs there, and the smallest model, E2B, drops to about 1 GB in the mobile format. Capable local AI agents now run on standard hardware, with no cloud dependency, no API costs, and no data leaving the building.

Memory: -72% VRAM vs. full precision
Languages: 140+ supported, including English and German
Context: 256K tokens for 12B and larger

Why Quantization-Aware Training changes the game

Most users know post-training quantization: a model trains in full precision, then weights get rounded down to 4 bits. That saves memory but costs quality, because the model never learned to deal with the rounding. The errors compound through dozens of transformer layers.

QAT flips the order. Quantization is built into the training process itself, so the model learns weights that survive the compression. Google reports that the 4-bit variant gets significantly closer to full-precision quality than a naive PTQ conversion. In practice that means fewer hallucinations, more reliable tool calls, and a better fit for multi-step agent workflows.
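To make the rounding step concrete, here is a minimal sketch of 4-bit "fake quantization" on one block of weights. It uses a simplified symmetric scheme (integer levels -7 to 7 with one scale per block), not Google's training recipe, which this article does not describe. The function names are illustrative. In post-training quantization this rounding is applied once, after training; in QAT, a step like it typically runs inside the forward pass during training, so the model can adapt to the error it introduces.

```python
import random


def fake_quantize_block(weights, bits=4):
    """Round a block of floats to a signed integer grid, then map back to floats.

    Returns the dequantized values and the scale that was used.
    Simplified symmetric scheme for illustration only.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 0.0           # all-zero block needs no rounding
    scale = max_abs / qmax                  # one shared scale per block
    dequantized = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized, scale


def mean_abs_error(original, approximated):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(original, approximated)) / len(original)


if __name__ == "__main__":
    rng = random.Random(0)
    block = [rng.gauss(0.0, 0.02) for _ in range(32)]   # synthetic weights
    dequantized, scale = fake_quantize_block(block)
    print("scale:", scale)
    print("mean abs rounding error:", mean_abs_error(block, dequantized))
```

Each weight moves by at most half a scale step. That error is small per weight, but it is the quantity that compounds across layers when the model was never trained with it.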

The QAT release ships in four formats, each targeted at a different runtime:

Format	Target runtime	Available models
GGUF (Q4_0)	llama.cpp, Ollama, LM Studio	E2B, E4B, 12B, 26B-A4B, 31B
Compressed Tensors (w4a16)	vLLM, SGLang (server)	E2B, E4B, 12B, 31B
Mobile (wNa8o8)	LiteRT-LM, edge runtimes	E2B, E4B
Unquantized QAT	Custom conversion	All sizes + drafters

Laptops and desktops use GGUF. High-concurrency server deployments use w4a16 with vLLM or SGLang. Mobile or edge setups use the new mobile format.
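A quick back-of-the-envelope check shows where a figure near 72 percent can come from. The GGUF Q4_0 layout stores weights in blocks of 32, each block holding 32 four-bit values plus one 16-bit scale. The sketch below assumes a 16-bit full-precision baseline and takes parameter counts from the model names; the article does not state how the 72 percent was computed, so treat this as one plausible reading.

```python
BLOCK_WEIGHTS = 32                      # weights per Q4_0 block
NIBBLE_BYTES = BLOCK_WEIGHTS // 2       # 4 bits per weight -> 16 bytes
SCALE_BYTES = 2                         # one fp16 scale per block


def q4_0_bits_per_weight():
    """Effective storage cost per weight, including the per-block scale."""
    return (NIBBLE_BYTES + SCALE_BYTES) * 8 / BLOCK_WEIGHTS


def reduction_vs_16bit():
    """Fraction of memory saved relative to 16 bits per weight (assumed baseline)."""
    return 1 - q4_0_bits_per_weight() / 16


def estimated_size_gb(params_billion):
    """Rough weight size in decimal GB; ignores metadata and non-quantized tensors."""
    return params_billion * 1e9 * q4_0_bits_per_weight() / 8 / 1e9


if __name__ == "__main__":
    print(q4_0_bits_per_weight())       # 4.5
    print(reduction_vs_16bit())         # 0.71875
    for size in (12, 26, 31):
        print(size, "B ->", round(estimated_size_gb(size), 1), "GB")
```

Under these assumptions the estimate gives about 6.8, 14.6, and 17.4 GB for 12B, 26B, and 31B parameters.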

Which model fits which hardware

The QAT family comes in five sizes, each matched to common devices:

E2B (dense, ~3 GB, mobile ~1 GB): Phones, Raspberry Pi 5, basic laptops. Transcription, summaries, classification, simple chat.
E4B (dense, ~5 GB): 8 GB laptops, entry GPUs. Solid day-to-day quality.
12B (dense, ~7 GB, 256K context): The comfortable all-rounder for 16 GB Macs and 8-12 GB GPUs. Multimodal, encoder-free, covers most agent workloads.
26B-A4B (Mixture of Experts, ~15 GB, 256K context): Only 3.8B parameters active per token. Feels like a 4B model in speed, reasons far better. The standout QAT unlock.
31B (dense, ~18 GB, 256K context): Maximum accuracy for hard reasoning and coding. Needs a 24 GB GPU or a 32 GB Mac.
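The list above can be turned into a small selection helper. This is a sketch only: the sizes are the approximate figures quoted in the list, and `reserve_gb` is a hypothetical headroom parameter for the operating system and KV cache, not a value from Google or Ollama.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    weights_gb: float   # approximate QAT weight size from the list above


MODELS = [
    Model("E2B", 3.0),
    Model("E4B", 5.0),
    Model("12B", 7.0),
    Model("26B-A4B", 15.0),
    Model("31B", 18.0),
]


def largest_fitting_model(memory_gb: float, reserve_gb: float = 1.0) -> Optional[Model]:
    """Return the largest model whose weights fit into memory minus the reserve.

    reserve_gb is an assumed safety margin; pick it based on your own
    context length and whatever else runs on the machine.
    """
    budget = memory_gb - reserve_gb
    fitting = [m for m in MODELS if m.weights_gb <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m.weights_gb)


if __name__ == "__main__":
    print(largest_fitting_model(16))                 # tight fit: 26B-A4B
    print(largest_fitting_model(16, reserve_gb=4))   # more headroom: 12B
```

With only 1 GB held back, a 16 GB machine selects the 26B-A4B; reserving 4 GB moves the choice to the 12B.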

The memory figures assume moderate context lengths. Going to the full 256K window adds KV cache on top. On a 16 GB machine, the 12B with reduced context or the 26B-A4B with short-to-medium context is the practical choice.
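How much the KV cache adds can be estimated with the standard formula for plain full attention: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and context length. The layer count, head count, and head dimension below are hypothetical placeholders, not Gemma 4's published architecture, and 256K is assumed to mean 262,144 tokens. Architectures that use sliding-window layers or a quantized cache need less, so read the result as a rough upper bound.

```python
def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size in GiB for plain full attention.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to illustrate the scaling.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    print(kv_cache_gib(8_192, **shape))      # 1.5
    print(kv_cache_gib(262_144, **shape))    # 48.0
```

The cache grows linearly with context length, which is why a model that fits comfortably at a few thousand tokens may not fit at the full window.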

Three scenarios that are realistic now

1. GDPR-compliant customer service agent. A 12B model runs on a company-owned server. Customer data, contract information, and internal knowledge bases never leave the building. The agent handles 70 to 80 percent of standard inquiries; a human reviews anything sensitive. Running costs drop to electricity; licensing costs disappear.

2. Industry-specific edge agent. An HVAC installer puts a mini PC or SolidRun board on the workbench. The local agent helps with material research, suggests solutions to common installation problems, and translates customer requests. Works offline when site Wi-Fi fails.

3. Compliance-grade knowledge assistant for tax advisors and law firms. Client data is highly sensitive. A 26B-A4B runs on a 16 GB workstation, searches the client database, and never sends a token to external providers. The audit trail and data sovereignty stay in-house.

Practical start in ten minutes

Three commands and you have a running QAT model with Ollama:

1. Install Ollama (macOS: brew install ollama; Linux: curl -fsSL https://ollama.com/install.sh | sh).
2. Pull a model: ollama pull gemma4:26b-it-qat for 16 GB machines, or ollama pull gemma4:e4b-it-qat for 8 GB laptops.
3. Test it: ollama run gemma4:26b-it-qat "Explain QAT in one sentence." Then verify the API with curl http://localhost:11434/api/tags.

Recommended sampling: temperature 1.0, top_p 0.95, top_k 64. QAT does not change these settings; they are the same as for the full-precision models.
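For use from code rather than the terminal, the same model can be called through Ollama's local REST API. The sketch below sends one non-streaming request to /api/generate with the sampling values above passed as options. It assumes Ollama is running on its default port and that the model tag from the steps above has been pulled; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local endpoint
DEFAULT_MODEL = "gemma4:26b-it-qat"                  # model tag as used in this article


def build_payload(prompt, model=DEFAULT_MODEL):
    """Build the JSON body for a single non-streaming generate request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,    # return one JSON object instead of a token stream
        "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
    }


def generate(prompt, model=DEFAULT_MODEL, url=OLLAMA_URL, timeout=120):
    """Send the prompt to a local Ollama server and return the generated text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the server running, generate("Explain QAT in one sentence.") returns the model's answer as a string. Only the standard library is used, so nothing beyond Ollama itself needs to be installed.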

The gotcha everyone should know

Users running QAT checkpoints in llama.cpp or Ollama should avoid naive Q4_0 conversions. Unsloth measured that only about 25 percent of bytes match exactly between llama.cpp's Q4_0 output and the true QAT weights, and the accuracy drop is noticeable.

The fix: Unsloth's dynamic GGUFs (UD-Q4_K_XL) force better agreement between the llama.cpp format and the QAT weights. The accuracy recovery is significant: the 26B-A4B jumps from 70.2 to 85.6 percent top-1, the 31B from 87.9 to 96.7 percent. Anyone on llama.cpp or Ollama should grab the Unsloth files.
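To make "byte exactness" concrete, here is a minimal sketch of how such a comparison can work in principle. It is not Unsloth's measurement procedure, which this article does not describe. The packing is simplified (adjacent 4-bit codes share a byte), and the function names are illustrative.

```python
def pack_nibbles(codes):
    """Pack pairs of 4-bit codes (0..15) into bytes, low nibble first.

    Simplified layout for illustration; real file formats order values differently.
    """
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (low & 0x0F) | ((high & 0x0F) << 4)
        for low, high in zip(codes[0::2], codes[1::2])
    )


def byte_agreement(a: bytes, b: bytes) -> float:
    """Fraction of positions at which two equally long byte strings are identical."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    reference = pack_nibbles([1, 2, 3, 4])
    converted = pack_nibbles([1, 2, 3, 5])      # one code differs
    print(byte_agreement(reference, converted))  # 0.5
```

Because two weights share each byte, a single differing 4-bit code makes the whole byte count as a mismatch, so byte-level agreement is a strict measure.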

What about the cloud?

The question is not cloud versus local, but which workloads run where. Local models shine for data-sensitive tasks, at the edge, and anywhere latency or availability is critical. Cloud APIs keep their edge for very large contexts, rare specialized cases, and tasks that need frontier capability. A pragmatic stack combines both: a local 12B or 26B-A4B for standard work, a cloud model as fallback for the hard five percent.
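One way to picture such a hybrid stack is a small routing rule in front of both models. The sketch below is illustrative only: the task fields, the context threshold, and the rule that personal data always stays local are assumptions made for the example, not a prescribed policy.

```python
from dataclasses import dataclass

LOCAL_CONTEXT_LIMIT = 32_000   # hypothetical token budget for the local model


@dataclass
class Task:
    contains_personal_data: bool
    context_tokens: int
    needs_frontier_model: bool = False


def route(task: Task) -> str:
    """Decide where a task runs: 'local' or 'cloud'.

    Data-sensitive work never leaves the building, even if it is hard
    or long; everything else goes to the cloud only when the local
    model is unlikely to cope.
    """
    if task.contains_personal_data:
        return "local"
    if task.needs_frontier_model or task.context_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"
    return "local"


if __name__ == "__main__":
    print(route(Task(contains_personal_data=True, context_tokens=200_000)))   # local
    print(route(Task(contains_personal_data=False, context_tokens=200_000)))  # cloud
    print(route(Task(contains_personal_data=False, context_tokens=2_000)))    # local
```

Checking data sensitivity first means the privacy rule cannot be overridden by a convenience rule further down.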

The QAT release fundamentally shifts the economics of many applications. What used to require a dedicated GPU server with 80 GB of VRAM now runs on an office laptop. For SMEs, trades businesses, and professional services firms bound by professional privilege, this is the most exciting AI development of the year. Anyone who puts a local model into production in the next three months gains a data protection advantage that pays off directly in tenders, audits, and customer trust.

The code to try it is ready. What's missing is just the first step.
