Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage, such as cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly became more complicated.
Along the way, we ran into problems with hallucinations, unreliable outputs and images that were not even laptops. To solve them, we ended up applying an agentic framework in an atypical way: not for task automation, but to improve the model's performance.
In this post, we will walk through what we tried, what did not work and how a combination of approaches ultimately helped us build something reliable.
Where we started: monolithic prompting
Our first approach was fairly standard for a multimodal model. We used a single, large prompt: we passed an image to an image-capable LLM and asked it to identify any visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
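For illustration, here is a minimal sketch of what such a monolithic baseline can look like, assuming an OpenAI-compatible vision endpoint; the model name and prompt wording are our own placeholders, not the production configuration:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def detect_damage(image_path: str) -> str:
    """Single-shot, monolithic prompt: one big ask against one image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You are inspecting a laptop for physical damage. "
                          "List any cracked screens, missing keys or broken "
                          "hinges you can see. If there is none, say so.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```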
We came across three major problems early:
- Hallucinations: The model would sometimes invent damage that did not exist or mislabel what it saw.
- Junk image detection: It had no reliable way to flag images that were not laptops at all; photos of desks, walls or people occasionally slipped through and received nonsensical damage reports.
- Inconsistent accuracy: The combination of these problems made the model too unreliable for operational use.
This was the point at which it became clear we needed to iterate.
First fix: mixing image resolutions
One thing that struck us was how much image quality influenced the model's output. Users uploaded all kinds of images, ranging from sharp, high-resolution shots to blurry snapshots. This led us to research emphasizing how image resolution affects the performance of deep learning models.
We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core problems of hallucination and junk-image handling remained.
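One simple way to implement this kind of resolution mixing is an augmentation step that randomly degrades a fraction of the training images; the scale factors and probability below are illustrative assumptions, not the values we used:

```python
import random
from PIL import Image  # pip install Pillow

def degrade(img: Image.Image) -> Image.Image:
    """Simulate a low-quality upload: downscale, then upscale back."""
    w, h = img.size
    scale = random.uniform(0.25, 0.6)  # how aggressively to soften detail
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))),
                       Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

def mixed_resolution(img: Image.Image, p_low: float = 0.5) -> Image.Image:
    """With probability p_low, swap the image for a degraded copy."""
    return degrade(img) if random.random() < p_low else img
```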
The multimodal detour: a text-only LLM goes multimodal
Encouraged by recent experiments combining image captioning with text-only LLMs, such as the technique covered in The Batch, where captions are generated from images and then interpreted by a language model, we decided to give it a try.
Here is how it works:
- The LLM starts by generating multiple candidate captions for an image.
- Another model, a multimodal embedding model, checks how well each caption fits the image. In our case, we used SigLIP to score the similarity between the image and the text.
- The system keeps the top few captions based on these scores.
- The LLM uses those top captions to write new ones, trying to get closer to what the image actually shows.
- It repeats this process until the captions stop improving or it hits a set iteration limit.
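As a rough sketch of the scoring step, the snippet below uses a SigLIP checkpoint from Hugging Face Transformers to rank candidate captions by image-text similarity; the checkpoint name and `keep` parameter are illustrative choices, not our exact setup:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor  # pip install transformers

CKPT = "google/siglip-base-patch16-224"  # illustrative checkpoint choice
model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

def rank_captions(image: Image.Image, captions: list[str], keep: int = 3):
    """Return the `keep` captions that best match the image."""
    inputs = processor(text=captions, images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        # logits_per_image holds one similarity score per caption
        scores = model(**inputs).logits_per_image[0]
    best = scores.topk(min(keep, len(captions))).indices.tolist()
    return [captions[i] for i in best]

# The text LLM proposes captions, SigLIP keeps the best-matching ones,
# and the survivors seed the next round of caption writing.
```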
Although smart in theory, this approach introduced new problems for our use case:
- Persistent hallucinations: The captions themselves sometimes contained imagined damage, which the LLM then reported with confidence.
- Incomplete coverage: Even with multiple captions, some issues were missed entirely.
- Increased complexity, little benefit: The added steps made the system more complicated without reliably outperforming the earlier setup.
It was an interesting experiment, but ultimately not a solution.
A creative use of agentic frameworks
This was the turning point. Agentic frameworks are usually used for orchestrating task flows (think of agents that coordinate calendar invites or customer service actions), but we wondered whether breaking the image-interpretation task into smaller, specialized agents might help.
We built an agentic framework structured like this (a simplified sketch follows the list):
- Orchestrator agent: Checked the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
- Component agents: Dedicated agents inspected each component for specific types of damage; for example, one for cracked screens, another for missing keys.
- Junk detection agent: A separate agent flagged whether the image was even a laptop in the first place.
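Here is what that layout can look like in code, where each agent wraps its own narrowly scoped vision-LLM prompt; `ask_vision_llm` is a hypothetical helper standing in for whichever multimodal API sits underneath, and the prompts are illustrative:

```python
def ask_vision_llm(prompt: str, image_bytes: bytes) -> str:
    """Hypothetical helper: send one prompt plus one image to a vision LLM."""
    raise NotImplementedError  # wire up your multimodal endpoint here

# Each component agent is just a narrow, single-purpose prompt.
COMPONENT_PROMPTS = {
    "screen":   "Is the screen cracked or scratched? Answer yes/no, then explain.",
    "keyboard": "Are any keys missing or broken? Answer yes/no, then explain.",
    "hinges":   "Are the hinges broken or loose? Answer yes/no, then explain.",
}

def inspect(image_bytes: bytes) -> dict:
    # Junk detection agent: bail out early if this is not a laptop at all.
    is_laptop = ask_vision_llm(
        "Does this photo show a laptop? Answer only yes or no.", image_bytes)
    if "no" in is_laptop.lower():
        return {"junk": True, "findings": {}}

    # Orchestrator agent: decide which components are actually visible.
    visible = ask_vision_llm(
        "Which of these are visible: screen, keyboard, hinges? "
        "Answer as a comma-separated list.", image_bytes).lower()

    # Component agents: each inspects one component for one damage type.
    findings = {part: ask_vision_llm(prompt, image_bytes)
                for part, prompt in COMPONENT_PROMPTS.items()
                if part in visible}
    return {"junk": False, "findings": findings}
```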
This modular, task-driven approach produced far more precise and explainable results. Hallucinations dropped dramatically, junk images were reliably flagged, and each agent's task was simple and focused enough to control quality properly.
The blind spots: trade-offs of an agentic approach
As effective as this was, it was not perfect. Two main limitations showed up:
- Increased latency: Running multiple sequential agents added to the total inference time.
- Coverage gaps: Agents could only detect issues they were explicitly programmed to look for. If an image showed something unexpected that no agent was tasked with identifying, it would go unnoticed.
We needed a way to balance precision with coverage.
The hybrid solution: combining agentic and monolithic approaches
To bridge the gaps, we created a hybrid system (sketched after the list):
- The agentic framework ran first, handling precise detection of known damage types and junk images. We limited the agents to the most essential ones to improve latency.
- Then, a monolithic image-LLM prompt scanned the image for anything else the agents might have missed.
- Finally, we fine-tuned the model using a curated set of images for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability.
This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting and the confidence boost of targeted fine-tuning.
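Putting it together, the control flow is short. This sketch reuses the illustrative `inspect` and `detect_damage` helpers from the earlier snippets and, like them, is an assumption about the shape of the pipeline rather than the production code:

```python
def hybrid_inspect(image_path: str) -> dict:
    """Agents first for precision, then a monolithic sweep for coverage."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()

    report = inspect(image_bytes)  # agentic pass: precise, structured checks
    if report["junk"]:
        return report              # junk image: skip the expensive sweep

    # Monolithic catch-all: scan for anything no agent was built to spot.
    report["other_observations"] = detect_damage(image_path)
    return report
```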
What we learned
A few things became clear by the time we completed this project:
- Agentic frameworks are more versatile than they get credit for: While they are usually associated with workflow management, we found they could meaningfully boost model performance when applied in a structured, modular way.
- Mixing approaches beats relying on just one: Combining precise, agent-based detection with the broad coverage of LLMs, plus a bit of fine-tuning where it mattered most, gave us far more reliable results than any single method on its own.
- Visual models are prone to hallucinations: Even the more advanced setups can jump to conclusions or see things that are not there. Thoughtful system design is needed to keep those errors in check.
- Image-quality variation matters: Training and testing with both clear, high-resolution images and everyday, lower-quality ones helped the model stay resilient when confronted with unpredictable, real-world photos.
- You need a way to catch junk images: A dedicated check for junk or unrelated images was one of the simplest changes we made, and it had an outsized impact on overall system reliability.
Final thoughts
What started as a simple idea, using an LLM prompt to detect physical damage in laptop images, quickly turned into a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were not originally designed for this type of work.
Agentic frameworks, often seen as workflow utilities, proved surprisingly effective when repurposed for tasks such as structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more accurate, but also easier to understand and maintain in practice.
Shruti Tiwari is an AI product manager at Dell Technologies.
Vadiraj Kulkarni is a data scientist at Dell Technologies.