MCP-Universe benchmark shows GPT-5 fails more than half of real-world orchestration tasks

The adoption of interoperability standards such as the Model Context Protocol (MCP) can give companies insight into how agents and models function outside their walled gardens. However, many benchmarks fail to capture real interactions with MCP.

Salesforce AI Research has developed a new open-source benchmark, called MCP-Universe, that asks LLMs to interact with MCP servers in the real world, arguing that this paints a better picture of how models perform in real-life, real-time interactions with the tools they actually use. In its first tests, models like OpenAI's recently released GPT-5 proved strong, but still did not perform well in real-life scenarios.

“Existing benchmarks focus primarily on isolated aspects of LLM performance, such as instruction following, mathematical reasoning or function calling, without giving an extensive assessment of how models interact with real-world MCP servers in different scenarios,” Salesforce said in a paper.

MCP-Universe measures model performance across tool use, multi-turn tool calling, long context windows and large tool spaces. It is built on existing MCP servers with access to actual data sources and environments.
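To make that interaction concrete, here is a minimal sketch of what calling a real MCP server looks like from Python, using the official `mcp` client SDK. The server launched here (the reference filesystem server) and the tool name are illustrative stand-ins for the servers the benchmark actually wires up; MCP-Universe's own harness is not shown in the article.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch an MCP server as a subprocess. The reference filesystem
    # server is used purely as a stand-in for servers like Google Maps,
    # GitHub or Yahoo Finance that MCP-Universe connects to.
    params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # The "tool space" an LLM must navigate: every tool the
            # server advertises, with its input schema.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # One tool call in a (potentially multi-turn) exchange.
            result = await session.call_tool("list_directory", {"path": "/tmp"})
            print(result.content)


asyncio.run(main())
```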


Junnan Li, director of AI research at Salesforce, told VentureBeat that many models “still face limitations that hold them back on enterprise tasks.”

“Two of the biggest are long-context challenges, where models can lose sight of information or struggle to reason when handling very long or complex inputs,” Li said, “and unfamiliar-tool challenges, where models often cannot adapt to unknown tools or systems the way people can.”

MCP-Universe joins other proposed MCP-based benchmarks, such as MCP-Radar from the University of Massachusetts Amherst and Xi’an Jiaotong University, as well as MCPWorld from the Beijing University of Posts and Telecommunications. It also builds on MCPEval, which Salesforce released in July and which focuses mainly on agents. Li said the biggest difference between MCP-Universe and MCPEval is that the latter evaluates with synthetic tasks.

How it works

MCP-Universe evaluates how well each model performs a series of tasks that mimic those enterprises run every day. Salesforce said MCP-Universe was designed to cover six core domains used by companies: location navigation, repository management, financial analysis, 3D design, browser automation and web search. The benchmark connects to 11 MCP servers for a total of 231 tasks, distributed across the following domains (a short sketch of the mapping appears after the list):

  • Location navigation focuses on geographical reasoning and spatial task execution. The researchers tapped the Google Maps MCP server for this domain.
  • The repository management domain covers codebase operations and connects to the GitHub MCP server to expose version-control tools such as repository search, issue tracking and code operations.
  • Financial analysis connects to the Yahoo Finance MCP server to evaluate quantitative reasoning and financial-market decision-making.
  • 3D design evaluates the use of computer-aided design tools via the Blender MCP server.
  • Browser automation, connected to Playwright’s MCP server, tests browser interaction.
  • The web search domain uses the Google Search MCP server and the Fetch MCP server to test open-domain information seeking, and is structured as a more open-ended task.
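Summarized in a few lines, the wiring above looks roughly like this. This is only a restatement of the list, not MCP-Universe's actual configuration format, and the server labels are informal:

```python
# Domain-to-MCP-server mapping as described in the article.
# The benchmark spans 11 MCP servers in total; the article names
# these seven explicitly.
DOMAIN_SERVERS = {
    "location_navigation": ["google-maps"],
    "repository_management": ["github"],
    "financial_analysis": ["yahoo-finance"],
    "3d_design": ["blender"],
    "browser_automation": ["playwright"],
    "web_search": ["google-search", "fetch"],
}
```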

Salesforce said it designed the new MCP tasks to reflect real use cases. For each domain, the researchers created four to five task types that, they say, LLMs cannot easily complete. For example, the researchers assigned the models a goal that involved planning a route, identifying the optimal stops and then finding the destination.
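A task of that shape might look like the following. The schema is purely hypothetical for illustration; the article does not show MCP-Universe's actual task format.

```python
# Hypothetical task record for the location-navigation domain.
# Field names are assumptions, not MCP-Universe's real schema.
route_planning_task = {
    "domain": "location_navigation",
    "mcp_servers": ["google-maps"],
    "goal": (
        "Plan a route between two cities, identify the optimal "
        "intermediate stops, and report the final destination."
    ),
    # Graded by execution-based evaluators rather than an LLM judge
    # (see the evaluator sketch below).
    "evaluators": ["format", "static", "dynamic"],
}
```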

Each model is evaluated on how it completes the tasks. Li and his team chose an execution-based evaluation paradigm instead of the more common LLM-as-a-judge approach. The researchers noted that the LLM-as-a-judge paradigm “is not well suited for our MCP-Universe scenario, because some tasks are designed to use real-time data, while the knowledge of the LLM judge is static.”

Salesforce researchers used three types of evaluators: format evaluators to check whether agents and models follow formatting requirements, static evaluators to assess correctness against answers that do not change over time, and dynamic evaluators for fluctuating answers such as flight prices or GitHub issues.
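As a rough sketch, the three evaluator types could look like the following in code. The function names and signatures are assumptions for illustration; the paper's actual evaluator interfaces may differ.

```python
import re
from typing import Callable


def format_evaluator(answer: str) -> bool:
    # Format check: did the agent wrap its result the way the task demands?
    return re.search(r"ANSWER:\s*\S+", answer) is not None


def static_evaluator(answer: str, ground_truth: str) -> bool:
    # Static check: compare against a reference that never changes
    # (e.g. the distance between two fixed landmarks).
    return ground_truth.lower() in answer.lower()


def dynamic_evaluator(answer: str, fetch_live_value: Callable[[], str]) -> bool:
    # Dynamic check: re-query the live source at grading time, because
    # the correct answer fluctuates (flight prices, open GitHub issues).
    return fetch_live_value() in answer
```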

“MCP-Universe focuses on creating challenging real-world tasks with execution-based evaluators, which can test agents in complex scenarios. Moreover, MCP-Universe offers an extensible framework and codebase for building and evaluating agents,” Li said.

Even the big models have problems

To test MCP-Universe, Salesforce evaluated a range of popular proprietary and open-source models. These include xAI’s Grok-4; Anthropic’s Claude-4 Sonnet and Claude 3.7 Sonnet; OpenAI’s GPT-5, o4-mini, o3, GPT-4.1 and GPT-OSS; Google’s Gemini 2.5 Pro and Gemini 2.5 Flash; Zhipu AI’s GLM-4.5; Moonshot AI’s Kimi-K2; Qwen’s Qwen3-Coder and Qwen3-235B-A22B-Instruct-2507; and DeepSeek’s DeepSeek-V3-0324. Each tested model had at least 120B parameters.

In its tests, Salesforce found that GPT-5 had the best overall success rate, particularly on financial analysis tasks. Grok-4 followed, beating all other models on browser automation, and Claude-4.0 Sonnet rounded out the top three, although it did not post any domain scores higher than the two models ahead of it. GLM-4.5 performed best among the open-source models.

MCP-Universe also showed that the models struggled to handle long contexts, especially in location navigation, browser automation and financial analysis, where performance dropped considerably. When the LLMs encountered unfamiliar tools, their performance fell as well. Overall, the LLMs failed to complete more than half of the tasks that enterprises typically perform.

“These findings highlight that current frontier LLMs remain inadequate at reliably executing tasks across different MCP servers in practice,” the researchers wrote. “Our MCP-Universe benchmark therefore offers a challenging and necessary testbed for evaluating LLM performance in areas not well served by existing benchmarks.”

Li told VentureBeat that he hopes enterprises will use MCP-Universe to gain a deeper insight into where agents and models fail at tasks, so that they can improve their frameworks or the implementation of their MCP tools.
