7 Coding LLMs, 1 Prompt—Here’s What I Found

AI Summary

This video compares 7 coding LLMs using the same prompt to test their capabilities: Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, o3, Gemini 2.5 Pro, Qwen 2.5 Max, and DeepSeek R1.

Key Findings:

Model Performance Overview:

Claude Opus 4 leads on agentic terminal coding tasks, but Sonnet 4 performs better on SWE-bench Verified

On the Aider LLM leaderboard, Opus 4 ranks #5 at 72%, while Sonnet 4 surprisingly lags behind Sonnet 3.7

All models showed hit-and-miss results despite using identical prompts

Pricing Comparison (per million output tokens):

Claude Opus 4: $75 (most expensive)

o3: $40

Claude Sonnet 4: $15

Gemini 2.5 Pro: $15 (best value; cheaper still for prompts under 200k tokens)
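
To make the price gap concrete, here is a minimal Python sketch that turns the per-million output prices listed above into a cost for a single run. The 20,000-token workload is a hypothetical figure chosen for illustration, not a number from the video, and input-token costs are ignored.

```python
# Output-token prices in USD per million tokens, as listed in the summary above.
OUTPUT_PRICE_PER_MILLION = {
    "Claude Opus 4": 75.0,
    "o3": 40.0,
    "Claude Sonnet 4": 15.0,
    "Gemini 2.5 Pro": 15.0,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Return the output-token cost in USD. Input tokens are not counted."""
    return OUTPUT_PRICE_PER_MILLION[model] * output_tokens / 1_000_000


# Hypothetical workload: 20,000 output tokens for one generated dashboard.
for model in OUTPUT_PRICE_PER_MILLION:
    print(f"{model}: ${output_cost(model, 20_000):.2f}")
```

Under that assumption, one run on Opus 4 costs five times as much as the same run on Sonnet 4 or Gemini 2.5 Pro, which is the ratio of the listed prices.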

Test Methodology:

Task: Create a web app dashboard displaying LLM information using web search tools

Evaluated instruction following, tool usage, and information synthesis

All models had web search capabilities enabled

Key Capabilities Differences:

Sequential tool calling: Only o3 and the Claude 4 models can make multiple web searches during reasoning

Limited tool usage: Gemini 2.5 Pro, Qwen, and DeepSeek search only at the beginning and can't issue new searches mid-reasoning
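
The difference between the two tool-usage patterns can be sketched in Python. Everything here is a stand-in: `fake_search` replaces a real web search tool, and `scripted_follow_ups` is a hypothetical placeholder for the model deciding what to look up next. It is not how any of these vendors implement tool calling.

```python
from typing import Callable, List, Optional


def fake_search(query: str) -> str:
    """Stand-in for a web search tool; returns a canned snippet."""
    return f"results for '{query}'"


def search_once_then_reason(queries: List[str],
                            search: Callable[[str], str]) -> List[str]:
    """Upfront pattern: all searches run before reasoning starts."""
    return [search(q) for q in queries]


def search_while_reasoning(first_query: str,
                           search: Callable[[str], str],
                           next_query: Callable[[List[str]], Optional[str]],
                           max_steps: int = 5) -> List[str]:
    """Sequential pattern: each result may trigger a follow-up search."""
    context: List[str] = []
    query: Optional[str] = first_query
    for _ in range(max_steps):
        if query is None:
            break
        context.append(search(query))
        query = next_query(context)  # decide the next step from what was found
    return context


def scripted_follow_ups(context: List[str]) -> Optional[str]:
    """Hypothetical policy: two fixed follow-ups, then stop."""
    follow_ups = ["Claude Opus 4 benchmarks", "o3 release date"]
    idx = len(context) - 1  # number of follow-ups already issued
    return follow_ups[idx] if idx < len(follow_ups) else None


print(search_once_then_reason(["latest coding LLMs"], fake_search))
print(search_while_reasoning("latest coding LLMs", fake_search, scripted_follow_ups))
```

In the upfront pattern the context is fixed before reasoning begins, so a gap noticed later cannot be filled. In the sequential pattern the loop can keep querying until the policy returns `None` or `max_steps` is reached.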

Test Results:

Gemini 2.5 Pro: Best performer; found the latest models (but missed o3), accurate benchmarks, good UI with visual charts

Sonnet 4: Colorful interface, only found Claude 4 (not Opus), non-functional benchmark tabs

Opus 4: Professional look, found Claude 3.5 and o3, included filtering options but had accuracy issues

Sonnet 3.7: Similar to #1, included GPT 4.5, hallucinated some benchmark scores

Qwen 2.5 Max: Poor performance, limited model info, incorrect specifications

o3: Found Claude Opus 4, correct release dates, poor visual presentation

DeepSeek R1: Failed to render correctly, had to be discarded

Conclusions:

No single model significantly outperformed others

All models (except R1) created functional UIs but had information accuracy issues

For multi-agent systems, combining multiple models is recommended over relying on one

Author favors Gemini 2.5 Pro for its cost-effectiveness

Rate limits are a concern for premium models like Opus and Sonnet

Recommendation: Choose based on priorities - Gemini 2.5 Pro for cost-efficiency, Claude models for advanced agentic capabilities, but expect to verify information regardless of model choice.
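
One simple way to combine several models and still verify their output is to ask each the same factual question and accept an answer only when enough of them agree. The Python sketch below assumes that approach. The model names and answers are placeholders, not results from the video, and the `min_agree` threshold is an arbitrary choice.

```python
from collections import Counter
from typing import Dict, Optional


def consensus(answers: Dict[str, str], min_agree: int = 2) -> Optional[str]:
    """Return the most common answer if at least min_agree models gave it."""
    if not answers:
        return None
    value, count = Counter(answers.values()).most_common(1)[0]
    return value if count >= min_agree else None


# Placeholder answers (not real model outputs) to one question asked of three models.
answers = {"model_a": "2025-05", "model_b": "2025-05", "model_c": "2025-04"}
print(consensus(answers))
```

When `consensus` returns `None`, the models disagree and the value needs a manual check. Agreement does not prove correctness, since models can share the same mistake.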

ThirdBrAIn.tech
