Gemma 4 Brings Vision and Tool Calling — Agents That See, on Your Own Box

A model that can read text is useful. A model that can look at a photograph of a shelf and then do something about it — call a stock system, flag a gap, write a reorder line — is a different kind of tool. Gemma 4 brings both of those abilities together, and crucially, it brings them to hardware you can keep in the back office.

What Changed in Gemma 4

Gemma 4 was released on 2 April 2026 with two additions that matter for anyone building practical workflows: built-in tool calling and vision support, with the vision capabilities updated again in June 2026. Per the local-models guide from Hugging Face, the 26B “A4B” variant is the one recommended for local deployment — the sensible default for a team that wants capability without a server rack.

“Tool calling” is the unglamorous feature doing the heavy lifting here. It means the model can decide, mid-task, to call a function you have given it — look up a SKU, query a price, write to a spreadsheet — rather than just producing text and leaving you to wire up the next step. Combine that with vision, and you have the makings of an agent that can see an image and act on what it sees.

The Runtime Caught Up Fast

Capabilities in a model only matter once the software you run it with supports them. That arrived quickly. Ollama v0.24.0, released on 14 May 2026, added full Gemma 4 support — including “thinking” and tool calling — according to the May 2026 local-runtime round-up. The same update notes a meaningful speed gain on Apple hardware.

2xfaster: Gemma 4 MTP speculative decoding on Mac delivers over a twofold speed increase on the 31B model for coding tasks.

Speculative decoding is a technique where a small, fast model drafts tokens and the larger model verifies them in bulk — the net effect is more output per second without a hardware upgrade. For a small team on a Mac, that is the difference between a model that feels usable and one that feels like waiting.

What Vision and Tools Unlock for Retail

This is where the abstract features become a shop-floor proposition. Because the whole pipeline runs locally — Ollama on a single box, as documented on the Ollama site — the images you feed it never leave the building. That is the part that makes a real difference for retail.

Shelf and planogram checks. Photograph a shelf, ask the model what is out of stock or mis-faced, and have it call your inventory tool to flag the gaps — no photos of your store layout uploaded to a third party.
Stock-in and delivery reconciliation. Point the model at a delivery note or a pallet photo, extract the line items, and reconcile against what was ordered.
Document handling at the till and back office. Supplier invoices, returns paperwork and handwritten notes become structured data the model can write straight into a spreadsheet.
Faster internal tooling. With tool calling, the model can chain these steps itself rather than handing each one back to a person.

The common thread is sensitive imagery — your shelves, your suppliers, your paperwork — being processed on premises rather than streamed to a cloud API.

A minimal starting point

# Pull the locally-recommended Gemma 4 variant with vision support
ollama pull gemma4:26b

# Ask it about an image and let it call your stock-check tool
ollama run gemma4:26b \
  "Look at shelf-photo.jpg. List any empty facings, then call check_stock for each SKU."

Begin with one workflow — shelf gaps, or invoice extraction — and confirm both the vision accuracy and the tool calls behave before you let it run unattended.

What This Means for a Small UK Team

For a retail team leader, Gemma 4 is the first local model that genuinely earns the word “agent”: it can see an image, decide what to do, and call your own tools to do it — all on a single machine in the back office. Pick the 26B variant for deployment, make sure your runtime is current (Ollama v0.24.0 or later), and start with one camera-to-action workflow such as shelf checks or delivery reconciliation. The payoff is automation that reads your shop floor without ever sending a photograph of it to the cloud.

Filed under Local Inference · Models

Gemma 4 Brings Vision and Tool Calling — Agents That See, on Your Own Box

What Changed in Gemma 4

The Runtime Caught Up Fast

What Vision and Tools Unlock for Retail

A minimal starting point

What This Means for a Small UK Team

Continue Reading

OpenJarvis v1.0: The Local-First Agent Framework Ollama Has Been Waiting For

The $19 Agentic Stack: More Tokens for Your Money Than a $20 Seat

The UK's £500M Sovereign AI Unit: what it actually means for SMEs