Trust in agentic AI: Why eval infrastructure should come first



As AI agents move into real-world use, organizations are under pressure to determine where agents belong, how to build them effectively, and how to operationalize them at scale. At VentureBeat's Transform 2025, tech leaders met to talk about how they are transforming their businesses with agents: Joanne Chen, general partner at Foundation Capital; Shailesh Nalawadi, VP of product management at Sendbird; Thys Waanders, SVP of AI transformation at Cognigy; and Shawn Malhotra, CTO of Rocket Companies.

https://www.youtube.com/watch?v=DCHZGCF1POO

A few top agentic AI use cases

“The initial attraction of any of these AI agent deployments is usually around saving human capital; the math is fairly simple,” said Nalawadi. “However, that understates the transformational power you get with AI agents.”

At Rocket, AI agents have proven to be powerful tools for increasing website conversion.

“We have found that with our agent-based conversational experience on the website, customers are three times more likely to convert when they come through that channel,” said Malhotra.

But that is just scratching the surface. For example, a Rocket engineer built an agent in just two days to automate a highly specialized task: calculating transfer taxes during mortgage underwriting.

“Those two days of effort saved us a million dollars a year,” said Malhotra. “In 2024 we saved more than a million team member hours, mostly driven by our AI solutions. That is not just cost savings. It also lets our team members focus their time on people who are often making the largest financial transaction of their lives.”

Agents are essentially supercharging individual team members. That million saved hours is not often one person's entire job being replicated. It is fractions of jobs: tasks employees do not enjoy doing, or that were not adding value to the customer. And that saved million hours gives Rocket the capacity to handle more business.

“Some of our team members were able to handle 50% more clients last year than the year before,” Malhotra added. “It means we can have higher throughput, drive more business, and again, we see higher conversion rates because they are spending the time understanding the customer's needs versus doing a lot of the rote work that the AI can do now.”

Tackling agentic complexity

“Part of the journey for our engineering teams is moving from the software-engineering mindset, where you write it once, test it, and it runs and gives the same answer 1,000 times, to the more probabilistic approach, where you ask the same thing of an LLM and it gives different answers with some probability,” said Nalawadi. “A lot of it has been bringing people along. Not just software engineers, but product managers and UX designers.”
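The mindset shift Nalawadi describes can be made concrete in how you test. Instead of asserting one exact output, a test for a probabilistic component samples several responses and checks a property every acceptable answer should share. A minimal sketch, where `call_llm` is a hypothetical stand-in for a real model call:

```python
import random

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a non-deterministic model call:
    # the same prompt can yield differently worded answers.
    templates = [
        "The transfer tax is 1.5% of the sale price.",
        "Transfer tax: 1.5% of the sale price.",
        "You owe 1.5% of the sale price in transfer tax.",
    ]
    return random.choice(templates)

def property_holds(answer: str) -> bool:
    # Deterministic check on a property every acceptable answer shares,
    # rather than an exact string match.
    return "1.5%" in answer and "transfer tax" in answer.lower()

# Sample several times and require the property on every response.
results = [property_holds(call_llm("How much transfer tax is due?"))
           for _ in range(20)]
print(all(results))
```

The point of the design is that the assertion survives wording changes: the test pins down the invariant, not the surface form.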

What has helped is that LLMs have come a long way, said Waanders. If they built something 18 months or two years ago, they really had to pick the right model, or the agent would not perform as expected. Now, he says, most mainstream models behave very well. They are more predictable. But today the challenge is combining models, ensuring responsiveness, orchestrating the right models in the right order, and weaving in the right data.

“We have customers pushing tens of millions of conversations a year,” said Waanders. “If you automate, say, 30 million conversations in a year, how does that scale in the LLM world? Those are all things we had to discover, even simple things, like getting model availability with the cloud providers. Having enough quota with a ChatGPT model, for example. Those are all new worlds for us, and for our customers too.”

A layer above orchestrating the LLM is orchestrating a network of agents, Malhotra said. A conversational experience has a network of agents under the hood, and the orchestrator decides which of the available agents to farm the request out to.

“If you play that forward and think about having hundreds or thousands of agents capable of different things, you get some really interesting engineering problems,” he said. “It becomes a bigger problem, because latency and time matter. That agent routing is going to be a very interesting problem to solve over the coming years.”
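The routing layer Malhotra describes can be sketched minimally: score each registered agent against the request and dispatch to the best match. This uses keyword overlap purely for illustration; the `Agent` class and the registry are assumptions, not any particular framework, and a production orchestrator would use a trained classifier or an LLM and weigh latency and load:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    keywords: set[str]            # crude capability description
    handle: Callable[[str], str]  # the agent's entry point

def route(request: str, agents: list[Agent]) -> Agent:
    # Score each agent by keyword overlap with the request and pick the best.
    words = set(request.lower().split())
    return max(agents, key=lambda a: len(a.keywords & words))

agents = [
    Agent("tax", {"tax", "taxes", "transfer"},
          lambda r: "handled by tax agent"),
    Agent("rates", {"rate", "rates", "mortgage"},
          lambda r: "handled by rates agent"),
]

chosen = route("What transfer taxes do I owe?", agents)
print(chosen.name)  # tax
```

With hundreds or thousands of agents, the interesting problems Malhotra flags appear in exactly this function: the scorer must stay fast, and ties, overload, and fallbacks all need policies.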

Tapping vendor relationships

Until now, the first step for most companies launching agentic AI has been building in-house, because specialized tools did not yet exist. But you cannot differentiate and create value by building generic LLM or AI infrastructure, and you need specialized expertise to go beyond the first build: to debug, iterate on, and improve what was built, and to maintain the infrastructure.

“We often find that the most successful conversations we have with prospective customers tend to be with someone who has already built something in-house,” said Nalawadi. “They quickly realize that getting to a 1.0 is fine, but as the world evolves and as the infrastructure evolves, and as they need to swap out technology for something newer, they don't have the ability to orchestrate all these things.”

Preparing for agentic AI complexity

Agentic AI will only grow in complexity: the number of agents in an organization will rise, they will learn from each other, and the number of use cases will explode. How can organizations prepare for the challenge?

“It means that the checks and balances in your system will get stressed more,” said Malhotra. “For something that has a regulatory process, you have a human in the loop to make sure someone is signing off on it. For critical internal processes or data access, do you have observability? Do you have the right alerting and monitoring so that if something goes wrong, you know it is going wrong and can bring a human in?”

So how can you be confident that an AI agent will behave reliably as it evolves?

“That part is really hard if you haven't thought about it from the beginning,” said Nalawadi. “The short answer is, before you even start building, you should have eval infrastructure in place. Make sure you have a rigorous environment in which you know what good looks like from an AI agent, and that you have this test set. Keep referring back to it as you make improvements. That is a very simplistic way of thinking about evals.”
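Nalawadi's advice can be sketched as a tiny eval harness: a fixed test set of prompts, each paired with a grader that decides whether the answer "looks good," re-run after every change so regressions show up as a falling pass rate. `stub_agent` and the cases are hypothetical placeholders for whatever system is under test:

```python
from typing import Callable

# Each case pairs an input with a grader that judges the agent's answer.
# Graders check properties, since exact matching rarely works for LLMs.
EVAL_SET: list[tuple[str, Callable[[str], bool]]] = [
    ("What is the transfer tax on a $400,000 sale at 1.5%?",
     lambda ans: "6,000" in ans or "6000" in ans),
    ("Is a mortgage a loan?",
     lambda ans: "yes" in ans.lower()),
]

def run_evals(agent: Callable[[str], str]) -> float:
    # Returns the pass rate; track this number across releases.
    passed = sum(grader(agent(prompt)) for prompt, grader in EVAL_SET)
    return passed / len(EVAL_SET)

# Hypothetical stub standing in for the real agent.
def stub_agent(prompt: str) -> str:
    if "transfer tax" in prompt:
        return "The transfer tax would be $6,000."
    return "Yes, a mortgage is a type of loan."

print(run_evals(stub_agent))  # 1.0
```

The value is in the loop, not the harness: the test set defines "what good looks like," and every improvement is measured against the same fixed reference.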

The problem is that agent behavior is non-deterministic, Waanders added. Unit testing is crucial, but the biggest challenge is that you don't know what you don't know: what incorrect behaviors an agent might display, or how it might react in any given situation.

“You can only discover that by simulating conversations at scale, by pushing it through thousands of different scenarios and then analyzing how it holds up and how it reacts,” Waanders said.
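The scale testing Waanders describes can be sketched as generating many scenario variants by combining axes of variation (intent, tone, message length), running the agent over all of them, and aggregating failures so unknown bad behaviors surface as patterns. Every name here is illustrative; `simulate_agent` is a toy that deliberately misbehaves on one combination:

```python
import itertools

def simulate_agent(message: str) -> str:
    # Toy agent under test; imagine it misbehaves only on angry,
    # long messages -- the kind of failure you only find at scale.
    if "!!!" in message and len(message) > 60:
        return "ERROR"
    return "OK"

# Build many scenarios by combining variation axes.
intents = ["cancel my order", "track my package", "refund please"]
tones = ["", " now", " NOW!!!", " please"]
paddings = ["", " " + "I have been waiting for a very long time. " * 2]

failures = []
for intent, tone, pad in itertools.product(intents, tones, paddings):
    msg = intent + tone + pad
    if simulate_agent(msg) != "OK":
        failures.append(msg)

# Analyzing the failure set reveals the pattern (angry + long),
# which no single hand-written unit test was looking for.
total = len(intents) * len(tones) * len(paddings)
print(f"{len(failures)} failures out of {total} scenarios")
```

In practice the scenario generator would itself be an LLM producing varied user utterances, but the shape is the same: enumerate broadly, run, then mine the failures.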

