Guides • July 1, 2026

Gemma 4 on Apple Silicon: running Google's latest open models with MLX and Ollama

By Maxime

Google's Gemma 4 family — released under Apache 2.0 in April 2026 — delivers frontier-level intelligence across five sizes: E2B (2.3B effective), E4B (4.5B effective), 12B, 26B MoE (3.8B active), and 31B Dense. While these models run everywhere from Raspberry Pis to server GPUs, Apple Silicon users have a unique advantage: MLX, Apple's open-source array framework built specifically for the Mac's unified memory architecture.

Since June 2026, Ollama ships MLX-optimized Gemma 4 variants (gemma4:e2b-mlx, gemma4:12b-mlx, gemma4:26b-mlx, etc.), and with Ollama 0.31+, multi-token prediction (MTP) delivers nearly 90% faster generation on Apple Silicon.

What is MLX?

MLX is an open-source array framework for machine learning on Apple Silicon, developed by Apple's ML research team. Unlike general-purpose frameworks, MLX is purpose-built to exploit the strengths of Apple's M-series chips:

Unified Memory Architecture — The CPU and GPU share a single pool of memory, eliminating costly data copies between devices. Operations simply declare their target device, and the framework handles the rest.
Lazy Computation — MLX builds a computation graph that executes only when a result is needed, enabling automatic optimization of operation ordering and fusion.
Function Transformations — Automatic differentiation (autograd), compilation, and graph optimization are applied as function transforms, making it simple to experiment with gradient-based techniques without manual plumbing.
Metal GPU Acceleration — All compute is accelerated via Apple's Metal GPU framework, and on M5 chips, MLX leverages the Neural Accelerators via Metal 4's Tensor Operations for up to 4x faster time-to-first-token compared to M4.

The MLX ecosystem includes:

MLX — Core array framework (Python, Swift, C++, C APIs)
MLX-LM — High-level package for loading, running, quantizing, and fine-tuning language models
MLX-VLM — Vision-language model support (used by Gemma 4's multimodal capabilities)
MLX-LM Server — OpenAI-compatible HTTP server for local inference

MLX performance on Gemma 4

The combination of Gemma 4's efficient architecture and MLX's Apple Silicon optimization yields impressive results:

E2B / E4B variants run comfortably on any M-series Mac with 8 GB+ unified memory, achieving 30-50 tok/s on M4 hardware.
12B model runs on 16 GB M-series machines at ~25 tok/s with 4-bit quantization.
26B MoE — despite 25B total parameters, only 3.8B are active per token, making it viable on 24 GB Macs with excellent quality-per-GB.
31B Dense requires 32 GB+ but ranks #3 on LMArena among all open models.

Ollama's contributed MLX kernel for batched speculation reads and unpacks each block of weights once, reusing them across the entire MTP batch rather than re-reading for every token. On M5 Max with nvfp4, this makes Gemma 4's largest matrix multiplications 2× to 2.5× faster.

Running Gemma 4 with Ollama on macOS

1. Install Ollama 0.31+

Ensure you have the latest Ollama:

ollama --version

If below 0.31, download the latest from ollama.com.

2. Pull an MLX-optimized Gemma 4 variant

# Recommended for most M-series Macs (12B, balanced quality/speed)
ollama pull gemma4:12b-mlx

# Lightweight edge model (runs on 8 GB Macs)
ollama pull gemma4:e2b-mlx

# MoE model — high quality, moderate memory
ollama pull gemma4:26b-mlx

# Maximum quality (needs 32 GB+)
ollama pull gemma4:31b-mlx

3. Run inference

# Text prompt
ollama run gemma4:12b-mlx "Explain how MLX leverages unified memory."

# With an image (Gemma 4 is natively multimodal)
ollama run gemma4:12b-mlx "Describe this diagram." /path/to/diagram.png

4. Use with the OpenAI-compatible API

Ollama runs a local server on port 11434 automatically:

curl http://localhost:11434/api/generate \
  -d '{"model": "gemma4:12b-mlx", "prompt": "What is MLX?"}'

Or use any OpenAI SDK by pointing the base URL to http://localhost:11434/v1.

5. Launch a coding agent with MTP speedups

Ollama 0.31+ enables multi-token prediction by default. To use Gemma 4 as a coding agent:

ollama launch claude --model gemma4:12b-mlx

The MTP draft model runs alongside the main model, proposing multiple tokens at once. The main model verifies the entire proposal in a single pass — no configuration needed.

Using MLX directly (without Ollama)

For developers who want fine-grained control, MLX-LM provides direct Python access:

pip install mlx mlx-lm mlx-vlm

from mlx_lm import generate, load

model, tokenizer = load("mlx-community/gemma-4-e2b-it-4bit")
response = generate(model, tokenizer, "What is MLX?", verbose=True)

Or start an OpenAI-compatible server:

mlx_lm.server --model mlx-community/gemma-4-e2b-it-4bit --port 8080

This serves the model at http://localhost:8080/v1, compatible with any OpenAI client library.

Choosing the right variant

Tag	Disk	RAM Needed	Best For
`gemma4:e2b-mlx`	6.5 GB	8 GB	Older/entry-level Macs
`gemma4:e4b-mlx`	8.8 GB	12 GB	Most MacBooks
`gemma4:12b-mlx`	7.7 GB	16 GB	Best balance (256K context)
`gemma4:26b-mlx`	18 GB	24 GB	High quality, MoE efficiency
`gemma4:31b-mlx`	19 GB	32 GB	Maximum reasoning power

⚡ MTP Speedup Note

If you downloaded a Gemma 4 MLX variant before Ollama 0.31, re-pull it with ollama pull gemma4:12b-mlx to get the version with multi-token prediction support. The speedup (up to 90% on coding benchmarks) is automatic — no flags or config changes needed.

With MLX, Gemma 4 transforms Apple Silicon Macs into first-class local AI workstations. Whether you use Ollama's convenience or MLX-LM's flexibility, you get the full power of Google's most capable open models — completely private, offline, and running at native Metal speeds.