{"id":5665,"date":"2025-06-03T01:14:00","date_gmt":"2025-06-02T23:14:00","guid":{"rendered":"https:\/\/aitrends.center\/how-good-are-ai-agents-at-real-research-inside-the-deep-research-bench-report\/"},"modified":"2025-06-03T01:14:00","modified_gmt":"2025-06-02T23:14:00","slug":"jak-dobrzy-sa-agenci-ai-w-prawdziwych-badaniach-w-raporcie-deep-research-bench","status":"publish","type":"post","link":"https:\/\/aitrendscenter.eu\/pl\/how-good-are-ai-agents-at-real-research-inside-the-deep-research-bench-report\/","title":{"rendered":"Jak dobrzy s\u0105 agenci AI w prawdziwych badaniach? Wewn\u0105trz raportu Deep Research Bench"},"content":{"rendered":"<h3>AI-Powered Research Assistants: Soaring to New Heights<\/h3>\n<p>Imagine having an assistant that can manage complex research tasks, interpreting conflicting information, sourcing data from across the web, and synthesizing it into actionable insights. That\u2019s what large language models (LLMs) have been evolving into. No longer limited to answering simple factual queries, developers are marketing them as tools adept at carrying out \u201cdeep research.\u201d And in the AI world, this capability seems to have a lot of names. 
OpenAI brands it as \u201cDeep Research,\u201d Anthropic prefers \u201cExtended Thinking,\u201d for Google\u2019s Gemini, it&#8217;s \u201cSearch + Pro,\u201d and Perplexity uses phrases like \u201cPro Search\u201d and \u201cDeep Research.\u201d<\/p>\n<p>A <a href=\"https:\/\/futuresearch.ai\/\" target=\"_blank\" rel=\"noopener\">FutureSearch<\/a> study called the <a href=\"https:\/\/futuresearch.ai\/deep-research-bench\" target=\"_blank\" rel=\"noopener\">Deep Research Bench (DRB)<\/a> gave these systems a thorough examination, providing the most comprehensive insights yet.<\/p>\n<h3>Evaluating AI&#8217;s Research Prowess: Deep Research Bench and Its Findings<\/h3>\n<p>The Deep Research Bench (DRB), developed by FutureSearch, is an evaluation tool designed to assess how well AI agents can handle complex, web-based research tasks. Think of it as a simulated arena for the vexing problems encountered by researchers, analysts, and decision-makers in the real world. The benchmark includes an array of real-world tasks, such as &#8220;Find Number&#8221; problems and tasks that require agents to &#8220;Validate Claim&#8221; or &#8220;Compile Dataset.&#8221;<\/p>\n<p>To make comparisons fair and consistent, human-verified answers accompany each task, and RetroSearch, a static archive of web pages, is employed. This tactic eliminates the unpredictable nature of live web data, providing an even footing for different AI agents.<\/p>\n<p>At the heart of DRB is the ReAct (Reason + Act) architecture. It replicates how a human researcher operates: contemplating a problem, taking an action such as a web search, scrutinizing the results, and then iterating. Although newer LLMs have integrated this loop into a smoother process, the ReAct setup still offers a valuable structure for AI reasoning. 
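<\/p>
<p>The reason-act-observe cycle described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: <code>llm_reason<\/code> and <code>web_search<\/code> are stand-ins for a real model call and a search backend such as RetroSearch, not part of any actual DRB code.<\/p>

```python
# Minimal sketch of a ReAct (Reason + Act) loop. All functions here are
# hypothetical stand-ins, not real DRB or FutureSearch APIs.

def llm_reason(question, observations):
    '''Stand-in for an LLM call: pick the next action or commit to an answer.'''
    if not observations:
        return {'action': 'search', 'query': question}   # Reason: need evidence first
    return {'action': 'finish', 'answer': observations[-1]}

def web_search(query):
    '''Stand-in for a search tool; a real agent would query the live web
    or a frozen archive like RetroSearch.'''
    return 'result for: ' + query

def react_agent(question, max_steps=5):
    '''Iterate reason -> act -> observe until the model commits to an answer.'''
    observations = []
    for _ in range(max_steps):
        step = llm_reason(question, observations)        # Reason
        if step['action'] == 'finish':
            return step['answer']
        observations.append(web_search(step['query']))   # Act + Observe
    return None  # step budget exhausted without an answer

print(react_agent('When was the transistor invented?'))
```

<p>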
For tasks like \u201cGather Evidence,\u201d DRB\u2019s RetroSearch archive includes up to 189,000 web pages, all frozen in time to ensure repeatability, collected with tools such as <a href=\"https:\/\/serper.dev\/\" target=\"_blank\" rel=\"noopener\">Serper<\/a>, <a href=\"https:\/\/playwright.dev\/\" target=\"_blank\" rel=\"noopener\">Playwright<\/a>, and <a href=\"https:\/\/www.scraperapi.com\/\" target=\"_blank\" rel=\"noopener\">ScraperAPI<\/a>.<\/p>\n<p>When it came to performance, OpenAI\u2019s o3 led the pack with a score of 0.51 out of 1.0. This might seem low, but considering the benchmark&#8217;s complexity, it&#8217;s a significant achievement. The researchers estimate that even a perfect agent would likely top out around 0.8 due to the inherent vagueness in task definitions and scoring.<\/p>\n<p>Trailing closely behind were Anthropic\u2019s Claude 3.7 Sonnet and Google\u2019s Gemini 2.5 Pro; Claude demonstrated adeptness in both structured and flexible thinking, while Gemini excelled in step-by-step planning tasks. Interestingly, DeepSeek-R1 performed nearly on par with GPT-4 Turbo, indicating a closing gap between open- and closed-source models.<\/p>\n<h3>The Hurdles Still Ahead: Where AI Lags Behind<\/h3>\n<p>Despite enormous progress, AI models still struggle with certain aspects of research. Memory degradation is a significant issue; as tasks grow longer, models tend to overlook crucial details, lose sight of their targets, and produce disjointed or irrelevant responses. Other common weaknesses include repetitive tool use, unproductive search queries, and jumping to premature conclusions.<\/p>\n<p>Even the top-performing models have their weaknesses. For example, GPT-4 Turbo often drops earlier steps from memory, and DeepSeek-R1 has a tendency to generate false yet plausible-sounding insights. 
A common fault across all models is their frequent failure to confirm findings or cross-reference sources, which is critical in serious research tasks.<\/p>\n<p>The report also looked at \u201ctoolless\u201d agents: language models that rely solely on their internal training data, without access to external tools such as web search. Surprisingly, these agents managed almost as well as tool-enabled ones on some tasks. This finding suggests that some LLMs have robust internal priors and can efficiently judge the plausibility of common claims. However, their limitations become apparent on more challenging tasks, where up-to-date, exhaustive information is indispensable.<\/p>\n<p>The comprehensive Deep Research Bench report underscores one thing: while today\u2019s AI agents are gaining ground, they are still playing catch-up with skilled human researchers, especially in tasks that demand strategic planning, flexible thinking, and subtle reasoning.<\/p>\n<p>These gaps become particularly noticeable during longer or more complex research sessions, where agents often lose coherence or wander off track. Yet DRB\u2019s strength lies in its ability to evaluate not just rudimentary knowledge but also the deeper interplay of memory, reasoning, and tool use. 
As LLMs continue to integrate into professional workflows, tools like the DRB from <a href=\"https:\/\/futuresearch.ai\/\" target=\"_blank\" rel=\"noopener\">FutureSearch<\/a> will be crucial for measuring real-world AI performance.<\/p>\n<p>For those captivated by the advancement of AI research capabilities, the full <a href=\"https:\/\/www.unite.ai\/how-good-are-ai-agents-at-real-research-inside-the-deep-research-bench-report\/\" target=\"_blank\" rel=\"noopener\">Deep Research Bench report<\/a> is an absolute must-read.<\/p>","protected":false},"excerpt":{"rendered":"<p>AI-Powered Research Assistants: Soaring to New Heights Imagine having an assistant that can manage complex research tasks: interpreting conflicting information, sourcing data from across the web, and synthesizing it into actionable insights. That\u2019s what large language models (LLMs) have been evolving into. No longer limited to answering simple factual queries, these models are now marketed by their developers as tools adept at carrying out \u201cdeep research.\u201d And in the AI world, this capability goes by many names. 
OpenAI brands it as \u201cDeep Research,\u201d Anthropic prefers \u201cExtended Thinking,\u201d for Google\u2019s Gemini, it&#8217;s \u201cSearch + Pro,\u201d and Perplexity uses phrases like \u201cPro Search\u201d [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":5666,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[43],"tags":[],"class_list":["post-5665","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-agents","post--single"],"_links":{"self":[{"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/posts\/5665","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/comments?post=5665"}],"version-history":[{"count":0,"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/posts\/5665\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/media\/5666"}],"wp:attachment":[{"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/media?parent=5665"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/categories?post=5665"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aitrendscenter.eu\/pl\/wp-json\/wp\/v2\/tags?post=5665"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}