Most LLMs are Bad at this Simple Benchmark Test!
AI Summary
The video introduces a new benchmark called "SOLO Bench," designed to evaluate large language models (LLMs) with a deceptively simple task: generate 250 unique four-word sentences, each following a fixed grammatical structure (verb + adjective + noun + noun), using only words from a provided word list. The benchmark tests a model's comprehension, memory, reasoning, and instruction-following without external tools or programming languages. Current results show that even top models struggle, with Gemini 2.5 Pro achieving the highest score at 75%. The video highlights the benchmark's open-source nature and straightforward, unbiased scoring, and encourages developers to try it via the linked GitHub repository.
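The scoring rules described above (unique sentences, exactly four words, all drawn from the word list, matching the verb + adjective + noun + noun pattern) can be sketched as a simple checker. The word list and its part-of-speech groupings below are illustrative assumptions, not the benchmark's actual data or official grader:

```python
# Minimal sketch of a SOLO-Bench-style checker.
# The word list and POS tags are hypothetical examples.
WORDS = {
    "verb": {"paint", "carry", "build"},
    "adjective": {"bright", "heavy", "tall"},
    "noun": {"wall", "box", "tower", "artist"},
}
PATTERN = ("verb", "adjective", "noun", "noun")

def score(sentences):
    """Fraction of sentences that are unique, four words long,
    and match the verb+adjective+noun+noun pattern using only
    words from the list."""
    seen = set()
    valid = 0
    for s in sentences:
        words = tuple(s.lower().split())
        if words in seen:
            continue  # duplicates never count
        seen.add(words)
        if len(words) == 4 and all(
            w in WORDS[pos] for w, pos in zip(words, PATTERN)
        ):
            valid += 1
    return valid / len(sentences) if sentences else 0.0

print(score([
    "paint bright wall tower",   # valid
    "carry heavy box box",       # valid
    "paint bright wall tower",   # duplicate, rejected
    "tall build wall box",       # wrong word order, rejected
]))
```

With the four sample sentences above, two pass and two fail, giving a score of 0.5. The real benchmark presumably applies stricter or additional rules; this sketch only illustrates the constraint structure the summary describes.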