Mixtral 8x7B

Overview

Mixtral 8x7B is a high-quality Sparse Mixture-of-Experts (SMoE) model released by Mistral AI on December 11, 2023 under the Apache 2.0 license. It was one of the first openly licensed Mixture-of-Experts models to match or exceed the performance of much larger dense models while keeping inference fast.

Key Specifications

  • Total Parameters: ~47 billion (46.7B)
  • Active Parameters per Token: ~13 billion (12.9B)
  • Number of Experts: 8 experts per MLP layer
  • Active Experts per Token: 2 (top_k = 2)
  • Context Window: 32,768 tokens
  • License: Apache 2.0
  • Release Date: December 11, 2023
  • Architecture Type: Decoder-only Transformer with MoE
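
These values can be read directly from the published model configuration. The snippet below is a minimal sketch using the Hugging Face transformers library (attribute names follow MixtralConfig in recent transformers releases; the model ID is the official base checkpoint on the Hub):

    from transformers import AutoConfig  # Mixtral support requires transformers >= 4.36

    # Official base-model repository on the Hugging Face Hub.
    cfg = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

    print(cfg.num_local_experts)        # 8      -> experts per MoE layer
    print(cfg.num_experts_per_tok)      # 2      -> experts routed per token (top_k)
    print(cfg.max_position_embeddings)  # 32768  -> context window
    print(cfg.hidden_size, cfg.num_hidden_layers, cfg.vocab_size)  # 4096, 32, 32000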

Architecture

Core Parameters

  • Model Dimension: 4,096
  • Layers: 32
  • Attention Heads: 32
  • Head Dimension: 128
  • KV Heads (grouped-query attention): 8
  • Hidden Dimension (per expert FFN): 14,336
  • Vocabulary Size: 32,000
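
These dimensions are enough to reconstruct the headline parameter counts. The back-of-the-envelope sketch below assumes grouped-query attention with the 8 KV heads listed above and untied input/output embeddings, and ignores small terms such as layer norms and the router weights:

    # Rough parameter count for Mixtral 8x7B from the dimensions above.
    dim, n_layers, hidden = 4096, 32, 14336
    n_heads, n_kv_heads, head_dim = 32, 8, 128
    n_experts, top_k, vocab = 8, 2, 32000

    # Attention: W_q, W_o are dim x dim; W_k, W_v are dim x (n_kv_heads * head_dim).
    attn = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)

    # One SwiGLU expert: gate, up and down projections.
    expert = 3 * dim * hidden

    # Untied input embedding plus output head.
    embeddings = 2 * vocab * dim

    total = n_layers * (attn + n_experts * expert) + embeddings
    active = n_layers * (attn + top_k * expert) + embeddings

    print(f"total  ~ {total / 1e9:.1f}B")   # ~46.7B
    print(f"active ~ {active / 1e9:.1f}B")  # ~12.9B

The result matches the publicly reported figures of 46.7B total and 12.9B active parameters per token.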

Mixture-of-Experts Design

The model uses a Sparse MoE architecture where:

  • Each layer contains 8 expert networks
  • For each token, a router selects the top 2 most relevant experts
  • Only the selected experts process the token, keeping inference efficient
  • Total of ~47B parameters, but only ~13B active per token
  • Uses the SwiGLU activation in the expert feed-forward networks (see the sketch below)
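
To make the routing concrete, here is a minimal PyTorch sketch of a top-2 sparse MoE block in the Mixtral style: a linear router picks 2 of 8 SwiGLU experts per token and combines their outputs with softmax-normalized router weights. It is illustrative only (class and variable names are ours, not Mistral's reference implementation) and omits production details such as load-balancing losses and fused expert kernels.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUExpert(nn.Module):
        """One expert: a SwiGLU feed-forward network (gate, up, down projections)."""
        def __init__(self, dim, hidden):
            super().__init__()
            self.w_gate = nn.Linear(dim, hidden, bias=False)
            self.w_up = nn.Linear(dim, hidden, bias=False)
            self.w_down = nn.Linear(hidden, dim, bias=False)

        def forward(self, x):
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    class SparseMoEBlock(nn.Module):
        """Top-k sparse MoE layer: each token is processed by k of n experts."""
        def __init__(self, dim, hidden, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(dim, n_experts, bias=False)
            self.experts = nn.ModuleList(SwiGLUExpert(dim, hidden) for _ in range(n_experts))
            self.top_k = top_k

        def forward(self, x):                      # x: (n_tokens, dim)
            logits = self.router(x)                # (n_tokens, n_experts)
            weights, idx = torch.topk(logits, self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                token_ids, slot = torch.where(idx == e)   # tokens routed to expert e
                if token_ids.numel() == 0:
                    continue                              # expert unused for this batch
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
            return out

    # Toy usage with small dimensions so it runs quickly on CPU.
    moe = SparseMoEBlock(dim=64, hidden=128, n_experts=8, top_k=2)
    tokens = torch.randn(10, 64)
    print(moe(tokens).shape)  # torch.Size([10, 64])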

Performance

Benchmark Highlights

vs LLaMA 2 70B:

  • Outperforms on most benchmarks
  • Roughly 6x faster inference at comparable or better quality
  • Mathematics: 58.4% (GSM8K) vs 53.6%
  • Code: 60.7% (MBPP) vs 49.8%

vs GPT-3.5:

  • Matches or outperforms on most standard benchmarks

Multilingual Capabilities

Mixtral excels in multiple European languages:

  • Supported Languages: English, French, German, Spanish, Italian
  • Consistently matches or outperforms LLaMA 2 70B on multilingual benchmarks:
    • ARC-c (multilingual)
    • HellaSwag (multilingual)
    • MMLU (multilingual)

Instruction-Following

Mixtral 8x7B Instruct:

  • MT-Bench Score: 8.3
  • Ranked above GPT-3.5 Turbo on the LMSYS Chatbot Arena leaderboard in the weeks after release
  • Optimized for conversational and instruction-following tasks (see the prompt-formatting sketch below)
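
The instruct model expects the Mistral chat format, in which user turns are wrapped in [INST] ... [/INST] tags. A minimal sketch using the tokenizer's built-in chat template from the Hugging Face transformers library (the exact whitespace of the rendered prompt may differ slightly between tokenizer versions):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

    messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    print(prompt)
    # Roughly: "<s>[INST] Explain mixture-of-experts in one sentence. [/INST]"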

Technical Advantages

  1. Efficiency: 6x faster inference than comparable dense models (LLaMA 2 70B)
  2. Scalability: MoE architecture allows scaling without proportional compute increase
  3. Flexibility: Handles 32K token contexts vs typical 4K-8K
  4. Cost-Effective: Lower inference costs due to sparse activation
  5. Multilingual: Native support for 5 European languages

Use Cases

  • Code Generation: Strong performance on programming tasks
  • Mathematical Reasoning: Excellent on GSM8K and other math benchmarks
  • Multilingual Applications: Production-ready for European languages
  • Long-Context Tasks: 32K context enables document analysis
  • Real-Time Applications: Fast inference suitable for interactive use

Deployment

Integration Options

  • Cloud Platforms: AWS, GCP, Azure
  • Hugging Face: Full integration with transformers library
  • Ollama: Local deployment support
  • vLLM: Optimized inference engine and server (see the example below)
  • API: Available via Mistral AI’s API platform
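
As an illustration of the vLLM route listed above, here is a minimal sketch using vLLM's offline LLM API (the same engine that backs its OpenAI-compatible server). The model ID is the official instruct checkpoint; the tensor_parallel_size value is an assumption to adjust to your hardware, since the bf16 weights alone are on the order of 90 GB:

    from vllm import LLM, SamplingParams

    # Official instruct checkpoint; ~90 GB of bf16 weights, so shard across GPUs.
    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        tensor_parallel_size=2,  # adjust to the number of available GPUs
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    prompt = "[INST] Summarize mixture-of-experts in two sentences. [/INST]"

    outputs = llm.generate([prompt], params)
    print(outputs[0].outputs[0].text)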

Significance

Mixtral 8x7B was groundbreaking because it:

  • Demonstrated that open-source MoE models could compete with proprietary systems
  • Showed that sparse architectures can deliver dense-model quality at a fraction of the inference cost
  • Established Mistral AI as a leader in efficient model design
  • Made high-performance multilingual AI accessible under permissive licensing
  • Showed that European AI companies could innovate at the frontier

The model’s success led to widespread adoption and influenced subsequent MoE architectures across the industry.
