Gemma 4

by Google DeepMind

Google’s open-weight model family delivering near-frontier performance on consumer and edge hardware — multimodal, efficient, and Apache 2.0 licensed.

See https://deepmind.google/models/gemma/gemma-4/

Model specs

  • Parameters: multiple sizes including 26B and 31B variants (Mixture of Experts architecture)
  • Context window: 256K tokens
  • Modalities: text, vision (images), audio, video input; text output
  • Release date: April 2026

Capabilities

  • Multimodal reasoning: vision + text tasks (demonstrated with F1 Donut simulation, product image analysis)
  • Frontend and UI code generation — competitive with frontier models in practical demos
  • Agentic tool use and function calling
  • Reasoning and instruction following
  • Local deployment on consumer hardware — tested at ~3,005 tokens/s on Mac Studio M2
  • Fine-tuning support (LoRA and full fine-tune); serverless inference on Google Cloud
  • Edge and on-device AI deployment
  • Open-weight: download and self-host under Apache 2.0 license

Benchmark highlights

  • Achieves near-frontier performance while being significantly more token-efficient than models of comparable quality
  • 31B vs 26B: 31B produces higher-quality frontend generation outputs; 26B is faster for iteration
  • Surprised the AI industry with quality-to-size ratio at April 2026 launch
  • Runs locally on Mac M2 at speeds previously requiring cloud infrastructure for comparable models
  • Strong multimodal benchmark results across visual reasoning and document understanding tasks

Access

  • Open-weight download via HuggingFace and Google’s model hub (Apache 2.0)
  • Local inference: Ollama (ollama pull gemma4), llama.cpp, Open WebUI
  • Google Cloud: serverless inference via Vertex AI
  • Google AI Studio: free web access for testing
  • No per-token API cost for self-hosted deployments