OpenAI has published a new study, and the outcome is not what most expected: according to OpenAI’s own evaluation, Anthropic’s Claude beat GPT-5, Gemini, and Grok on real-world job tasks. The finding challenges assumptions about which AI models are best suited for actual workplace productivity.
For years, AI benchmarks have been criticized for focusing on academic or artificial tests that don’t match everyday use. To address this, OpenAI has introduced a new system called GDPval, designed to measure how AI models perform in real-world work scenarios.
Instead of abstract puzzles or coding-only challenges, GDPval evaluates models on tasks drawn from 44 occupations, covering everything from writing a customer service email to drafting legal documents or analyzing software bugs.
The biggest surprise? Claude Opus 4.1 from Anthropic outperformed every other model, including OpenAI’s own GPT-5, Google’s Gemini, and xAI’s Grok.
In the results:
Claude Opus 4.1 ranked highest for handling realistic work tasks.
GPT-5 (high variant) came in second place.
Gemini and Grok trailed further behind.
This suggests that when it comes to practical, job-focused AI performance, Claude may currently be the best option for professionals.
Unlike standard benchmarks, this study was designed to reflect real-world workflows. That means instead of checking whether an AI can ace an exam or summarize academic papers, GDPval looks at tasks people actually rely on AI for at work.
Examples include:
Writing a polite but firm reply to a dissatisfied customer
Drafting HR communications
Reviewing legal language
Assisting in engineering documentation
This approach offers a more accurate picture of how AI tools might replace or complement human workers in daily jobs.
That Claude beat GPT-5, Gemini, and Grok in OpenAI’s own study raises important questions. If OpenAI’s own benchmarking shows a competitor’s model outperforming the model behind ChatGPT, it could shift user trust and enterprise adoption.
For businesses, the takeaway is clear: choosing an AI assistant isn’t just about brand recognition—it’s about performance in the tasks that matter most.
As AI adoption grows across industries, task-based evaluations like GDPval may become a common reference point for deciding which models to use.
OpenAI’s new GDPval benchmark has revealed a surprising twist in the AI race. Despite the hype around GPT-5, Google Gemini, and Grok, Claude Opus 4.1 came out as the best at real-world job performance.
If this trend continues, Claude may emerge as the top choice for professionals who rely on AI to handle everyday work tasks.