Gemma 4
by Google DeepMind
Google’s open-weight model family delivering near-frontier performance on consumer and edge hardware — multimodal, efficient, and Apache 2.0 licensed.
See https://deepmind.google/models/gemma/gemma-4/
Model specs
- Parameters: multiple sizes including 26B and 31B variants (Mixture of Experts architecture)
- Context window: 256K tokens
- Modalities: text, vision (images), audio, video input; text output
- Release date: April 2026
Capabilities
- Multimodal reasoning: vision + text tasks (demonstrated with F1 Donut simulation, product image analysis)
- Frontend and UI code generation — competitive with frontier models in practical demos
- Agentic tool use and function calling
- Reasoning and instruction following
- Local deployment on consumer hardware — tested at ~3,005 tokens/s on Mac Studio M2
- Fine-tuning support (LoRA and full fine-tune); serverless inference on Google Cloud
- Edge and on-device AI deployment
- Open-weight: download and self-host under Apache 2.0 license
Benchmark highlights
- Achieves near-frontier performance while being significantly more token-efficient than models of comparable quality
- 31B vs 26B: 31B produces higher-quality frontend generation outputs; 26B is faster for iteration
- Surprised the AI industry with quality-to-size ratio at April 2026 launch
- Runs locally on Mac M2 at speeds previously requiring cloud infrastructure for comparable models
- Strong multimodal benchmark results across visual reasoning and document understanding tasks
Access
- Open-weight download via HuggingFace and Google’s model hub (Apache 2.0)
- Local inference: Ollama (
ollama pull gemma4), llama.cpp, Open WebUI - Google Cloud: serverless inference via Vertex AI
- Google AI Studio: free web access for testing
- No per-token API cost for self-hosted deployments