7 Coding LLMs, 1 Prompt—Here’s What I Found
AI Summary
This video compares 7 coding LLMs on the same prompt: Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, O3, Gemini 2.5 Pro, Qwen 2.5 Max, and DeepSeek R1.
Key Findings:
Model Performance Overview:
- Claude Opus 4 leads in agentic terminal coding tasks, but Sonnet 4 performs better on SWE-bench Verified
- On the Aider LLM leaderboard: Opus 4 ranks #5 at 72%, while Sonnet 4 surprisingly lags behind Sonnet 3.7
- All models showed hit-and-miss results despite using identical prompts
Pricing Comparison (per million output tokens):
- Claude Opus 4: $75 (most expensive)
- O3: $40
- Claude Sonnet 4: $15
- Gemini 2.5 Pro: $15 (best value; the rate is lower still for prompts under 200k tokens)
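To make the price gap concrete, here is a minimal sketch that turns the per-million rates listed above into a cost per run. The 20,000-token workload is a hypothetical figure chosen for illustration, not a number from the video, and input-token costs are ignored.

```python
# Output-token prices in USD per million tokens, as listed above.
OUTPUT_PRICE_PER_MILLION = {
    "Claude Opus 4": 75.0,
    "O3": 40.0,
    "Claude Sonnet 4": 15.0,
    "Gemini 2.5 Pro": 15.0,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Return the output-token cost in USD for one response."""
    return OUTPUT_PRICE_PER_MILLION[model] * output_tokens / 1_000_000


# Hypothetical workload (an assumption): 20,000 output tokens per dashboard.
for model in OUTPUT_PRICE_PER_MILLION:
    print(f"{model}: ${output_cost(model, 20_000):.2f}")
```

At that assumed size, one Opus 4 response costs five times as much as a Sonnet 4 or Gemini 2.5 Pro response, which is the ratio of the listed rates.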
Test Methodology:
- Task: Create a web app dashboard displaying LLM information using web search tools
- Evaluated instruction following, tool usage, and information synthesis
- All models had web search capabilities enabled
Key Capabilities Differences:
- Sequential tool calling: Only O3 and Claude 4 models can make multiple web searches during reasoning
- Limited tool usage: Gemini 2.5 Pro, Qwen, and DeepSeek search only at the beginning and cannot issue new searches mid-reasoning
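The difference between the two behaviors can be sketched as two control flows. This is an illustrative toy, not any vendor's API: `fake_search`, `search_once`, `search_in_loop`, and `follow_up_policy` are all hypothetical names, and the "model" is a stub that walks through a fixed list of queries.

```python
from typing import Callable, List, Optional


def fake_search(query: str) -> str:
    """Hypothetical stand-in for a web search tool; returns canned text."""
    return f"results for '{query}'"


def search_once(question: str, search: Callable[[str], str]) -> List[str]:
    """Single upfront search: the context is fixed before reasoning starts."""
    return [search(question)]


def search_in_loop(
    question: str,
    search: Callable[[str], str],
    next_query: Callable[[str, List[str]], Optional[str]],
    max_steps: int = 5,
) -> List[str]:
    """Sequential tool calling: after each result, the policy may ask again."""
    context: List[str] = []
    for _ in range(max_steps):
        query = next_query(question, context)
        if query is None:  # the policy decides it has enough information
            break
        context.append(search(query))
    return context


def follow_up_policy(question: str, context: List[str]) -> Optional[str]:
    """Stub for the model's reasoning: issue one planned query per step."""
    planned = ["latest Claude models", "latest OpenAI models", "latest Gemini models"]
    return planned[len(context)] if len(context) < len(planned) else None


question = "Build a dashboard of current LLMs"
print(len(search_once(question, fake_search)))
print(len(search_in_loop(question, fake_search, follow_up_policy)))
```

In the looped version each new query can depend on what earlier searches returned, which is what lets a model notice a gap (for example, a missing vendor) and go back for it.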
Test Results:
- Gemini 2.5 Pro: Best performer - found latest models (missed O3), accurate benchmarks, good UI with visual charts
- Sonnet 4: Colorful interface, only found Claude 4 (not Opus), non-functional benchmark tabs
- Opus 4: Professional look, found Claude 3.5 and O3, included filtering options but had accuracy issues
- Sonnet 3.7: Similar to #1, included GPT 4.5, hallucinated some benchmark scores
- Qwen 2.5 Max: Poor performance, limited model info, incorrect specifications
- O3: Found Claude Opus 4, correct release dates, poor visual presentation
- DeepSeek R1: Failed to render correctly, had to be discarded
Conclusions:
- No single model significantly outperformed others
- All models (except R1) created functional UIs but had information accuracy issues
- For multi-agent systems, combining multiple models is recommended over relying on one
- Author favors Gemini 2.5 Pro for its cost-effectiveness
- Rate limits are a concern for premium models like Opus and Sonnet
Recommendation: Choose based on priorities - Gemini 2.5 Pro for cost-efficiency, Claude models for advanced agentic capabilities, but expect to verify information regardless of model choice.