Gemma 4 on Apple Silicon: running Google's latest open models with MLX and Ollama
Google's Gemma 4 family — released under Apache 2.0 in April 2026 — delivers frontier-level intelligence across five sizes: E2B (2.3B effective), E4B (4.5B effective), 12B, 26B MoE (3.8B active), and 31B Dense. While these models run everywhere from Raspberry Pis to server GPUs, Apple Silicon users have a unique advantage: MLX, Apple's open-source array framework built specifically for the Mac's unified memory architecture.
Since June 2026, Ollama ships MLX-optimized Gemma 4 variants (gemma4:e2b-mlx, gemma4:12b-mlx, gemma4:26b-mlx, etc.), and with Ollama 0.31+, multi-token prediction (MTP) delivers nearly 90% faster generation on Apple Silicon.
What is MLX?
MLX is an open-source array framework for machine learning on Apple Silicon, developed by Apple's ML research team. Unlike general-purpose frameworks, MLX is purpose-built to exploit the strengths of Apple's M-series chips:
- Unified Memory Architecture — The CPU and GPU share a single pool of memory, eliminating costly data copies between devices. Operations simply declare their target device, and the framework handles the rest.
- Lazy Computation — MLX builds a computation graph that executes only when a result is needed, enabling automatic optimization of operation ordering and fusion.
- Function Transformations — Automatic differentiation (autograd), compilation, and graph optimization are applied as function transforms, making it simple to experiment with gradient-based techniques without manual plumbing.
- Metal GPU Acceleration — All compute is accelerated via Apple's Metal GPU framework, and on M5 chips, MLX leverages the Neural Accelerators via Metal 4's Tensor Operations for up to 4x faster time-to-first-token compared to M4.
The MLX ecosystem includes:
- MLX — Core array framework (Python, Swift, C++, C APIs)
- MLX-LM — High-level package for loading, running, quantizing, and fine-tuning language models
- MLX-VLM — Vision-language model support (used by Gemma 4's multimodal capabilities)
- MLX-LM Server — OpenAI-compatible HTTP server for local inference
MLX performance on Gemma 4
The combination of Gemma 4's efficient architecture and MLX's Apple Silicon optimization yields impressive results:
- E2B / E4B variants run comfortably on any M-series Mac with 8 GB+ unified memory, achieving 30-50 tok/s on M4 hardware.
- 12B model runs on 16 GB M-series machines at ~25 tok/s with 4-bit quantization.
- 26B MoE — despite 25B total parameters, only 3.8B are active per token, making it viable on 24 GB Macs with excellent quality-per-GB.
- 31B Dense requires 32 GB+ but ranks #3 on LMArena among all open models.
Ollama's contributed MLX kernel for batched speculation reads and unpacks each block of weights once, reusing them across the entire MTP batch rather than re-reading for every token. On M5 Max with nvfp4, this makes Gemma 4's largest matrix multiplications 2× to 2.5× faster.
Running Gemma 4 with Ollama on macOS
1. Install Ollama 0.31+
Ensure you have the latest Ollama:
ollama --version
If below 0.31, download the latest from ollama.com.
2. Pull an MLX-optimized Gemma 4 variant
# Recommended for most M-series Macs (12B, balanced quality/speed)
ollama pull gemma4:12b-mlx
# Lightweight edge model (runs on 8 GB Macs)
ollama pull gemma4:e2b-mlx
# MoE model — high quality, moderate memory
ollama pull gemma4:26b-mlx
# Maximum quality (needs 32 GB+)
ollama pull gemma4:31b-mlx
3. Run inference
# Text prompt
ollama run gemma4:12b-mlx "Explain how MLX leverages unified memory."
# With an image (Gemma 4 is natively multimodal)
ollama run gemma4:12b-mlx "Describe this diagram." /path/to/diagram.png
4. Use with the OpenAI-compatible API
Ollama runs a local server on port 11434 automatically:
curl http://localhost:11434/api/generate \
-d '{"model": "gemma4:12b-mlx", "prompt": "What is MLX?"}'
Or use any OpenAI SDK by pointing the base URL to http://localhost:11434/v1.
5. Launch a coding agent with MTP speedups
Ollama 0.31+ enables multi-token prediction by default. To use Gemma 4 as a coding agent:
ollama launch claude --model gemma4:12b-mlx
The MTP draft model runs alongside the main model, proposing multiple tokens at once. The main model verifies the entire proposal in a single pass — no configuration needed.
Using MLX directly (without Ollama)
For developers who want fine-grained control, MLX-LM provides direct Python access:
pip install mlx mlx-lm mlx-vlm
from mlx_lm import generate, load
model, tokenizer = load("mlx-community/gemma-4-e2b-it-4bit")
response = generate(model, tokenizer, "What is MLX?", verbose=True)
Or start an OpenAI-compatible server:
mlx_lm.server --model mlx-community/gemma-4-e2b-it-4bit --port 8080
This serves the model at http://localhost:8080/v1, compatible with any OpenAI client library.
Choosing the right variant
| Tag | Disk | RAM Needed | Best For |
|---|---|---|---|
gemma4:e2b-mlx |
6.5 GB | 8 GB | Older/entry-level Macs |
gemma4:e4b-mlx |
8.8 GB | 12 GB | Most MacBooks |
gemma4:12b-mlx |
7.7 GB | 16 GB | Best balance (256K context) |
gemma4:26b-mlx |
18 GB | 24 GB | High quality, MoE efficiency |
gemma4:31b-mlx |
19 GB | 32 GB | Maximum reasoning power |
If you downloaded a Gemma 4 MLX variant before Ollama 0.31, re-pull it with ollama pull gemma4:12b-mlx to get the version with multi-token prediction support. The speedup (up to 90% on coding benchmarks) is automatic — no flags or config changes needed.
With MLX, Gemma 4 transforms Apple Silicon Macs into first-class local AI workstations. Whether you use Ollama's convenience or MLX-LM's flexibility, you get the full power of Google's most capable open models — completely private, offline, and running at native Metal speeds.