The privacy calculus for small teams has shifted. Running a 7B-parameter model on a consumer GPU no longer requires a GPU farm — it requires an afternoon, a USB stick, and a willingness to ignore the SaaS sales funnel.
Why Local Inference Changes the Equation
For teams handling sensitive client data — accountants, legal professionals, HR administrators — the cloud inference paradigm carries a systemic risk that enterprise data processing agreements rarely fully resolve. Local inference eliminates that risk category entirely.
Gemma 3, Google’s latest open-weights model family, arrives at a pivotal moment. The 4B and 12B quantised variants run efficiently on consumer RTX-class GPUs with 8–16GB of VRAM, making departmental deployment viable without a capital approval process.
Hardware Baseline
For this deployment we worked with an accounts team running a reconditioned workstation: an RTX 3090 (24GB VRAM), 64GB of DDR4, and a 2TB NVMe. Total hardware cost was roughly £1,800 — equivalent to about four months of a mid-tier LLM API subscription at moderate usage volume.
- GPU — RTX 3090, 24GB VRAM. Runs the 12B model with comfortable headroom.
- Memory — 64GB DDR4. Enough for inference plus the surrounding workflow tooling.
- Storage — 2TB NVMe. Holds the model weights and a local document cache.
- Total cost — ~£1,800, reconditioned. Roughly four months of mid-tier API spend.
Model Selection
The team evaluated three quantisation levels across Gemma 3 variants. The Q4_K_M quantisation of the 12B model emerged as the practical default: strong instruction-following on document tasks, a ~6GB VRAM footprint at rest, and sub-two-second first-token latency on a cold start.
40%reduction in admin overhead across invoice processing and document classification within the first operational month.
The Deployment Stack
The integration stack is deliberately minimal: Ollama as the inference server, a lightweight FastAPI wrapper for structured output, and n8n for workflow orchestration. No Docker knowledge required beyond a basic install; the entire stack runs as persistent background services that start with the machine.
- Ollama. The local inference server, exposing an OpenAI-compatible API on port 11434.
- FastAPI wrapper. A thin layer that enforces structured JSON output for downstream steps.
- n8n. Orchestrates the end-to-end workflow, from email trigger to spreadsheet write.
# Install Ollama and pull the model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:12b-instruct-q4_K_M
# Verify the deployment
ollama run gemma3:12b-instruct-q4_K_M \
"Summarise this invoice in JSON format: ..."
Connecting to n8n
With Ollama exposing a local OpenAI-compatible API on port 11434, integration with n8n requires nothing more than configuring an HTTP Request node to point at http://localhost:11434/v1/chat/completions, with the model name specified in the request body.
The workflow receives an email trigger, calls the local inference endpoint with a structured extraction prompt, and writes the parsed output to a shared spreadsheet — no human in the loop unless confidence falls below a defined threshold.
Limitations and Caveats
Local inference is not a universal replacement for cloud APIs. For tasks requiring real-time web retrieval, multi-modal reasoning at scale, or seamless multi-user collaborative access, cloud inference remains the pragmatic choice. The value proposition is narrower but sharp: sensitive document processing, repeatable structured extraction, and workflows where data residency is non-negotiable.
The hardware pays back in roughly six to eight months at moderate usage. After that, inference cost is effectively zero.
Next Steps
The complete blueprint for this deployment — including the Ollama install script, the FastAPI structured-output wrapper, and the n8n workflow JSON — is available in the Blueprints section. It is designed to be deployed in under 90 minutes on compatible hardware.


