
How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report

AI-Powered Research Assistants: Soaring to New Heights

Imagine having an assistant that can manage complex research tasks: interpreting conflicting information, sourcing data from across the web, and synthesizing it into actionable insights. That’s what large language models (LLMs) have been evolving into. No longer limited to answering simple factual queries, these models are now marketed as tools adept at carrying out “deep research.” And in the AI world, this capability goes by many names: OpenAI brands it “Deep Research,” Anthropic prefers “Extended Thinking,” Google’s Gemini calls it “Search + Pro,” and Perplexity uses phrases like “Pro Search” and “Deep Research.”

A FutureSearch study called the Deep Research Bench (DRB) gave these systems a thorough examination, providing the most comprehensive insights yet.

Evaluating AI’s Research Prowess: Deep Research Bench and its Findings

The Deep Research Bench (DRB), developed by FutureSearch, is an evaluation tool designed to assess how well AI agents can handle complex, web-based research tasks. Think of it as a simulated arena for the vexing problems encountered by researchers, analysts, and decision-makers in the real world. The benchmark includes an array of real-world task types, such as “Find Number,” “Validate Claim,” and “Compile Dataset.”

To make comparisons fair and consistent, human-verified answers accompany each task, and RetroSearch, a static archive of web pages, is employed. This tactic eliminates the unpredictable nature of live web data, providing an even footing for different AI agents.
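The “frozen archive” idea can be sketched in a few lines: every page request is answered from a fixed snapshot rather than the live web, so each agent run sees identical data. The snapshot contents and the `fetch` function below are illustrative stand-ins, not RetroSearch’s actual interface.

```python
# Sketch of reproducible web access via a frozen snapshot (the idea
# behind RetroSearch). All names here are hypothetical examples.

SNAPSHOT = {
    "https://example.com/report": "Page text as captured at archive time ...",
}

def fetch(url: str) -> str:
    """Serve the archived copy of a page; never hit the live web."""
    try:
        return SNAPSHOT[url]
    except KeyError:
        # A page outside the snapshot is an error, not a live fetch,
        # so results cannot drift between runs.
        raise LookupError(f"{url} is not in the frozen archive") from None
```

Because the archive never changes, any difference between two agents’ scores reflects the agents themselves, not shifting search results.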

At the heart of DRB is the ReAct (Reason + Act) architecture. It replicates how a human researcher operates: contemplating a problem, taking an action such as a web search, scrutinizing the results, and then iterating. Although newer LLMs have integrated this loop into a smoother process, the ReAct setup still offers a valuable structure for AI reasoning. For tasks like “Gather Evidence,” the RetroSearch archive used in DRB includes up to 189,000 web pages, all frozen in time to ensure repeatability, built with tools such as Serper, Playwright, and ScraperAPI.
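The reason–act–observe cycle described above can be sketched as a simple loop. The `llm` and `search` functions are hypothetical stubs standing in for a real model call and a real search tool; the control flow is what matters.

```python
# Minimal sketch of a ReAct (Reason + Act) loop. The agent alternates
# between a reasoning step and a tool action until it commits to an
# answer. `llm` and `search` are illustrative stubs, not a real API.

def llm(transcript: str) -> str:
    """Hypothetical model call returning the next thought/action."""
    return "FINAL: 42"  # a real model would reason over the transcript

def search(query: str) -> str:
    """Hypothetical search tool (e.g. over a frozen page archive)."""
    return f"results for {query!r}"

def react_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = llm(transcript)                    # Reason: model thinks
        if step.startswith("FINAL:"):             # model commits to answer
            return step.removeprefix("FINAL:").strip()
        if step.startswith("SEARCH:"):            # Act: run a tool
            query = step.removeprefix("SEARCH:").strip()
            observation = search(query)           # Observe: read results
            transcript += f"\n{step}\nObservation: {observation}"
    return "no answer within step budget"
```

The step budget is one place where the memory-degradation problems discussed below bite: the longer the transcript grows, the more earlier observations a real model tends to lose track of.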

When it came to performance, OpenAI’s o3 led the pack with a score of 0.51 out of 1.0. This might seem low, but considering the benchmark’s complexity, it’s a significant achievement. The researchers estimate that even a perfect agent would likely top out around 0.8 due to the inherent vagueness in task definitions and scoring.

Trailing close behind were Anthropic’s Claude 3.7 Sonnet and Google’s Gemini 2.5 Pro; Claude demonstrated adeptness in both structured and flexible thinking, while Gemini excelled in step-by-step planning tasks. Interestingly, DeepSeek-R1 performed nearly on par with GPT-4 Turbo, indicating a closing gap between open- and closed-source models.

Unsolved Hurdles: Where AI Still Lags

Despite enormous progress, AI models do struggle with certain aspects. Memory degradation is a significant issue; as tasks become longer, models tend to overlook crucial details, lose sight of targets, and provide disjointed or irrelevant responses. Other common weaknesses include repetitive tool use, unproductive search queries, and jumping to premature conclusions.

Even the top-performing models suffer certain susceptibilities. For example, GPT-4 Turbo often drops earlier steps from memory, and DeepSeek-R1 has a tendency to generate false yet plausible-sounding insights. A common fault across all models is their frequent failure to confirm findings or cross-reference sources, which is critical in serious research tasks.

The report also looked at “toolless” agents—language models that rely solely on their internal training data, without external tools such as web search. Surprisingly, these agents managed almost as well as tool-enabled ones on some tasks. This finding suggests that some LLMs have robust internal priors and can efficiently judge the plausibility of common claims. However, their limitations become apparent on more challenging tasks where up-to-date, exhaustive information is indispensable.

The comprehensive Deep Research Bench report underscores one thing: while today’s AI agents are gaining ground, they are still playing catch-up with skilled human researchers, especially in tasks that demand strategic planning, flexible thinking, and subtle reasoning.

These gaps become particularly noticeable during longer or more complex research sessions, where agents often lose coherence or wander off track. Yet, DRB’s beauty lies in its ability to evaluate not just rudimentary knowledge but also the deeper interplay of memory, reasoning, and tool use. As LLMs continue integrating into professional workflows, tools like the DRB from FutureSearch will be crucial for measuring real-world AI performance.

For those captivated by the advancement of AI research capabilities, the full Deep Research Bench report is an absolute must-read.
