Mixtral 8x7B
Overview
Mixtral 8x7B is a high-quality Sparse Mixture of Experts (SMoE) model released by Mistral AI on December 11, 2023. It was among the first open-weight Mixture-of-Experts models to match or exceed the performance of much larger dense models while maintaining fast inference, and it is released under the Apache 2.0 license.
Key Specifications
- Total Parameters: ~46.7 billion
- Active Parameters per Token: ~12.9 billion
- Number of Experts: 8 experts per MLP layer
- Active Experts per Token: 2 (top_k = 2)
- Context Window: 32,768 tokens
- License: Apache 2.0
- Release Date: December 11, 2023
- Architecture Type: Decoder-only Transformer with MoE
Architecture
Core Parameters
- Model Dimension: 4,096
- Layers: 32
- Attention Heads: 32
- Head Dimension: 128
- Feed-Forward Hidden Dimension (per expert): 14,336
- Vocabulary Size: 32,000
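For reference, a minimal configuration sketch that collects the hyperparameters listed above (the class and field names are illustrative and do not necessarily match the official Hugging Face config keys):

```python
from dataclasses import dataclass

@dataclass
class MixtralConfigSketch:
    """Illustrative Mixtral 8x7B hyperparameters (names are not the official config keys)."""
    dim: int = 4096                        # model (embedding) dimension
    n_layers: int = 32                     # decoder layers
    n_heads: int = 32                      # attention heads
    head_dim: int = 128                    # dimension per head (dim / n_heads)
    hidden_dim: int = 14336                # feed-forward hidden size per expert
    vocab_size: int = 32000                # tokenizer vocabulary
    num_experts: int = 8                   # experts per MoE layer
    num_experts_per_tok: int = 2           # top-k routing
    max_position_embeddings: int = 32768   # context window
```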
Mixture-of-Experts Design
The model uses a Sparse MoE architecture (see the routing sketch after this list) where:
- Each layer contains 8 expert networks
- For each token, a router selects the top 2 most relevant experts
- Only the selected experts process the token, keeping inference efficient
- Total of ~46.7B parameters, but only ~12.9B active per token
- Uses SwiGLU activation function for expert networks
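A minimal PyTorch sketch of the routing idea, simplified for readability (no load-balancing loss, no capacity limits; the class names and the per-expert loop are illustrative, not Mistral's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block, w2(silu(w1(x)) * w3(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoELayer(nn.Module):
    """Top-2 routing over 8 experts; only the selected experts process each token."""
    def __init__(self, dim: int = 4096, hidden_dim: int = 14336,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)   # router
        self.experts = nn.ModuleList([SwiGLUExpert(dim, hidden_dim) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        logits = self.gate(x)                                  # (tokens, experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = torch.where(indices == e)        # tokens routed to expert e
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out
```

Real implementations additionally use an auxiliary load-balancing loss during training and batch tokens per expert for efficiency; the explicit loop above is only to make the routing logic visible.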
Performance
Benchmark Highlights
vs LLaMA 2 70B:
- Matches or outperforms LLaMA 2 70B on most benchmarks
- Roughly 6x faster inference
- Mathematics: 58.4% (GSM8K) vs 53.6%
- Code: 60.7% (MBPP) vs 49.8%
vs GPT-3.5:
- Matches or outperforms on most standard benchmarks
Multilingual Capabilities
Mixtral excels in multiple European languages:
- Supported Languages: English, French, German, Spanish, Italian
- Consistently matches or outperforms LLaMA 2 70B on multilingual benchmarks:
  - ARC-c (multilingual)
  - HellaSwag (multilingual)
  - MMLU (multilingual)
Instruction-Following
Mixtral 8x7B Instruct:
- MT-Bench Score: 8.3
- Ranked above GPT-3.5 Turbo on the LMSYS Chatbot Arena leaderboard shortly after release
- Optimized for conversational and instruction-following tasks (see the prompt-formatting sketch below)
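For illustration, a conversation can be formatted for the Instruct model with the Hugging Face tokenizer's chat template; the model id below is the public Hugging Face repository for the Instruct variant, and downloading the tokenizer requires network access:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Summarize the Mixtral 8x7B architecture in one sentence."},
]

# Wraps the user turn in the model's [INST] ... [/INST] instruction format.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
```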
Technical Advantages
- Efficiency: 6x faster inference than comparable dense models (LLaMA 2 70B)
- Scalability: MoE architecture allows scaling without proportional compute increase
- Flexibility: Handles 32K token contexts vs the 4K-8K windows typical of contemporary open models
- Cost-Effective: Lower inference costs due to sparse activation
- Multilingual: Native support for 5 European languages
Use Cases
- Code Generation: Strong performance on programming tasks
- Mathematical Reasoning: Excellent on GSM8K and other math benchmarks
- Multilingual Applications: Production-ready for European languages
- Long-Context Tasks: 32K context enables document analysis
- Real-Time Applications: Fast inference suitable for interactive use
Deployment
Integration Options
- Cloud Platforms: AWS, GCP, Azure
- Hugging Face: Full integration with the transformers library (see the example after this list)
- Ollama: Local deployment support
- vLLM: Optimized inference server
- API: Available via Mistral AI’s API platform
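A minimal sketch of local inference with the transformers library, assuming enough GPU memory for the chosen precision (the model id is the public Hugging Face repository; 4-bit or 8-bit quantization via bitsandbytes is a common way to fit the model on smaller hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision still needs roughly 90 GB of GPU memory unquantized
    device_map="auto",           # shard the model across available GPUs
)

prompt = "[INST] Explain what a sparse Mixture of Experts is in two sentences. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For production serving, an optimized inference server such as vLLM or Mistral's hosted API is generally more efficient than raw transformers generation.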
Significance
Mixtral 8x7B was groundbreaking because it:
- Demonstrated that open-source MoE models could compete with proprietary systems
- Showed that sparse architectures can deliver better inference efficiency than dense models of comparable quality
- Established Mistral AI as a leader in efficient model design
- Made high-performance multilingual AI accessible under permissive licensing
- Showed that European AI companies could innovate at the frontier
The model’s success led to widespread adoption and influenced subsequent MoE architectures across the industry.
Resources
- Official Announcement: https://mistral.ai/news/mixtral-of-experts
- Paper: Mixtral of Experts (arXiv:2401.04088)
- Hugging Face: Mixtral Documentation
- Developer: Mistral AI
- Related Models: Mistral 7B, Mistral Large