Alibaba’s AgentEvolver improves tooling model performance by ~30% using synthetic, auto-generated tasks

Researchers at Alibaba’s Tongyi Lab have developed a new framework for self-evolving agents that create their own training data by exploring their application environments. The framework, AgentEvolver, leverages the knowledge and reasoning power of large language models for autonomous learning, addressing the high cost and manual effort typically required to collect task-specific datasets.

Experiments show that, compared to traditional reinforcement learning frameworks, AgentEvolver is more efficient at exploring its environment, makes better use of data, and adapts more quickly to application environments. For enterprises, this is important because it lowers the barrier to training agents for custom applications, making powerful, custom AI assistants more accessible to a wider range of organizations.

The high cost of training AI agents

Reinforcement learning has become an important paradigm for training LLMs to act as agents that can interact with digital environments and learn from feedback. However, developing agents with RL faces fundamental challenges. First, collecting the necessary training datasets is often prohibitively expensive, requiring significant manual labor to create sample tasks, especially in new or proprietary software environments where ready-made datasets are not available.

Second, the RL techniques commonly used for LLMs require the model to go through a large number of trial and error attempts to learn effectively. This process is computationally expensive and inefficient. As a result, training capable LLM agents through RL remains labor-intensive and expensive, limiting their deployment in customized business environments.

How AgentEvolver works

The main idea behind AgentEvolver is to give models more autonomy in their own learning process. The researchers describe it as a “self-evolving agent system,” designed to “achieve autonomous and efficient capability evolution through interaction with the environment.” It uses the reasoning power of an LLM to create a self-training loop, allowing the agent to continuously improve through direct interaction with its target environment without needing predefined tasks or reward functions.

“We envision an agent system in which the LLM actively guides exploration, task generation, and performance refinement,” the researchers wrote in their paper.

The self-evolution process is driven by three core mechanisms working together.

The first is self-questioning, where the agent explores its environment to discover the limits of its functions and identify useful states. It’s like a new user clicking through an application to see what’s possible. Based on this exploration, the agent generates its own diverse set of tasks that match a user’s general preferences. This reduces the need for hand-crafted datasets and allows the agent and its tasks to evolve together, tackling increasingly complex challenges.
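To make the idea concrete, here is a minimal sketch of that explore-then-generate loop. All function and API names are hypothetical, and random probing stands in for the LLM-guided exploration the paper describes:

```python
import random

def explore_environment(api_surface, n_steps=20, seed=0):
    """Probe an environment's API surface and record which calls were
    discovered. A random walk stands in for LLM-guided exploration."""
    rng = random.Random(seed)
    discovered = set()
    for _ in range(n_steps):
        discovered.add(rng.choice(api_surface))
    return sorted(discovered)

def synthesize_tasks(discovered_functions, preference_keywords):
    """Turn discovered capabilities into candidate training tasks,
    keeping only those that match the user's general preferences."""
    tasks = [f"Write a workflow that uses '{fn}' to satisfy a user request."
             for fn in discovered_functions]
    matching = [t for t in tasks if any(k in t for k in preference_keywords)]
    return matching or tasks  # fall back to all tasks if nothing matches

# Hypothetical enterprise app with four tools
apis = ["search_orders", "refund_payment", "send_email", "export_report"]
found = explore_environment(apis)
tasks = synthesize_tasks(found, preference_keywords=["refund", "report"])
```

In AgentEvolver the task generator is the LLM itself, which can produce far richer task descriptions than this template; the point is only the flow: explore, discover capabilities, synthesize preference-aligned tasks.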

According to Yunpeng Zhai, a researcher at Alibaba and co-author of the paper, who spoke to VentureBeat, the self-questioning mechanism effectively changes the model from a “data consumer to a data producer,” dramatically reducing the time and cost required to deploy an agent in a new environment.

The second mechanism is self-navigating, which improves the efficiency of exploration by reusing and generalizing past experiences. AgentEvolver extracts insights from both successful and failed attempts and uses them to guide future actions. For example, if an agent tries to use an API function that doesn’t exist in an application, it logs the failure as an experience and learns to verify that functions exist before trying to use them in the future.
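A toy version of that experience reuse can be sketched as a store that distills attempts into textual insights and retrieves them before the next try. The class and the lessons below are illustrative, not the paper’s implementation:

```python
class ExperienceStore:
    """Minimal sketch of self-navigation: outcomes of past attempts are
    distilled into lessons keyed by the tool involved, then retrieved
    to guide future actions on the same tool."""

    def __init__(self):
        self._insights = {}

    def record(self, tool, outcome, lesson):
        """Store a lesson learned from a successful or failed attempt."""
        self._insights.setdefault(tool, []).append((outcome, lesson))

    def guidance_for(self, tool):
        """Return all lessons relevant to a tool, to be injected into
        the agent's context before it acts."""
        return [lesson for _, lesson in self._insights.get(tool, [])]

store = ExperienceStore()
store.record("get_invoice", "failure",
             "'get_invoice' does not exist; list available functions first.")
store.record("list_functions", "success",
             "Calling 'list_functions' before acting avoids invalid calls.")
```

In the real system the lessons are generalized by the LLM rather than stored verbatim, so an insight learned on one tool can transfer to similar ones.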

The third mechanism, self-attributing, improves learning efficiency by providing more detailed feedback. Instead of just a definitive success or failure signal (a common practice in RL that can result in sparse rewards), this mechanism uses an LLM to assess the contribution of each individual action in a multi-step task. It determines afterwards whether each step contributed positively or negatively to the final result, giving the agent fine-grained feedback that accelerates learning.
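The shift from a single terminal reward to step-level credit can be illustrated with a simple weighted blend. The judgments here are hard-coded (+1 helpful, -1 harmful, 0 neutral) where AgentEvolver would use an LLM judge, and the 0.5 mixing weight is an illustrative choice, not the paper’s formula:

```python
def attribute_rewards(judgments, final_reward, mix=0.5):
    """Step-level credit assignment sketch: blend a per-step judgment
    (+1 helpful, -1 harmful, 0 neutral) with the terminal outcome,
    instead of assigning the terminal reward to every step."""
    return [mix * j + (1 - mix) * final_reward for j in judgments]

# A three-step trajectory that ultimately succeeded (final_reward = 1.0):
# step 1 was neutral, step 2 called a wrong API, step 3 recovered.
rewards = attribute_rewards([0, -1, 1], final_reward=1.0)
# The harmful middle step is no longer rewarded just because the
# trajectory succeeded overall.
```

This kind of dense signal is what lets the agent learn from fewer trajectories than a pure outcome-reward setup.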

This is critical for regulated industries, where the way an agent solves a problem is just as important as the outcome. “Rather than just rewarding a student for the final answer, we also evaluate the clarity and correctness of each step in their reasoning,” Zhai explains. This improves transparency and encourages the agent to adopt more robust and auditable problem-solving patterns.

“By shifting the training initiative from human-designed pipelines to LLM-guided self-improvement, AgentEvolver establishes a new paradigm that paves the way for scalable, cost-effective, and continuously improving intelligent systems,” the researchers said.

The team has also developed a practical, end-to-end training framework that integrates these three mechanisms. An important part of this foundation is the context manager, a component that controls the agent’s memory and interaction history. While current benchmarks test a limited number of tools, real enterprise environments can contain thousands of APIs.

Zhai acknowledges that this is a core challenge for the field, but notes that AgentEvolver is designed to be extensible. “Retrieval over extremely large action spaces will always present computational challenges, but AgentEvolver’s architecture provides a clear path to scalable tool reasoning in enterprise environments,” he says.

A more efficient path to agent training

To measure the effectiveness of their framework, the researchers tested it on AppWorld and BFCL v3, two benchmarks that require agents to perform long, multi-step tasks using external tools. They used models from Alibaba’s Qwen2.5 family (7B and 14B parameters) and compared their performance against a baseline trained with GRPO, a popular RL technique used to develop reasoning models such as DeepSeek-R1.

The results showed that integrating all three mechanisms into AgentEvolver led to significant performance improvements. For the 7B model, the average score improved by 29.4%, and for the 14B model, it increased by 27.8% over the baseline. The framework consistently improved the reasoning and task-execution capabilities of the models on both benchmarks. The largest improvement came from the self-questioning module, which autonomously generates diverse training tasks and directly addresses the problem of data scarcity.

The experiments also demonstrated that AgentEvolver can efficiently synthesize a large amount of high-quality training data. The tasks generated by the self-questioning module turned out to be diverse enough to achieve good training efficiency even with a small amount of data.

For enterprises, this provides the ability to create agents for custom applications and internal workflows while minimizing the need for manual data annotation. By providing high-level goals and letting the agent generate its own training experiences, organizations can more easily and cost-effectively develop custom AI assistants.

“This combination of algorithmic design and technical pragmatics positions AgentEvolver as both a research vehicle and a reusable foundation for building adaptive, tool-enabled agents,” the researchers concluded.

Looking ahead, the ultimate goal is much greater. “A true ‘singular model’ that can be installed in any software environment and mastered overnight is certainly the holy grail of agentic AI,” said Zhai. “We see AgentEvolver as a necessary step in that direction.” While that future still requires breakthroughs in model reasoning and infrastructure, self-evolving approaches are paving the way.
