I can’t even browse LinkedIn without seeing some product manager hyping agents as being “just around the corner”.
And before you jump into the comments section, I am not biased. I’ve worked with large language models since before ChatGPT, back when GPT-3 lived on the OpenAI website and only predicted the next words in a sentence (as opposed to the now-familiar chat interface).
I’ve built AI applications from scratch and trained all types of AI models. I’ve taken Deep Learning courses at the best AI and Computer Science school in the world, Carnegie Mellon, and obtained my Master’s Degree there.
And yet, when I see yet another video on my TikTok feed, I can’t help but cringe and think about how “Web 3 was going to transform the internet”.
Like I swear, this must be bot farms, ignorant non-technical people, and manufactured hype from OpenAI so that they can receive more funding. I mean, how many software engineers do you know who have released production-ready agents?
That’s right. None.
Here’s why all of this manufactured hype is nonsense.
What is an “AI Agent”?
Agents actually have a long history within artificial intelligence. More recently, since the release of ChatGPT, the term has come to mean a large language model structured to perform reasoning and complete tasks autonomously.
This model MIGHT be fine-tuned with reinforcement learning, but in practice people tend to just use OpenAI’s GPT, Google Gemini, or Anthropic’s Claude.
The difference between an agent and a plain language model is that agents complete tasks autonomously.
Here’s an example.
I have an algorithmic trading and financial research platform, NexusTrade.
Let’s say I wanted to stop paying an external data provider to get fundamental data for US companies.
With traditional language models, I would have to write the orchestration code myself. The workflow would look like the following (see the sketch after this list):
- Build a script that scrapes the SEC website or use a GitHub repo to fetch company information (conforming to the 10 requests per second guideline in their terms of service)
- Use a Python library like pypdf to transform the PDFs to text
- Send it to a large language model to format the data
- Validate the response
- Save it in the database
- Repeat for all companies
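Stitched together, that pipeline looks roughly like the sketch below. To be clear, this is a minimal illustration and not the actual NexusTrade code; the filing URLs, model choice, prompt, and schema are all placeholder assumptions.

```python
import json
import time
from io import BytesIO

import requests
from openai import OpenAI
from pymongo import MongoClient
from pypdf import PdfReader

client = OpenAI()
db = MongoClient()["fundamentals"]

def fetch_filing(url: str) -> bytes:
    # SEC asks for a descriptive User-Agent and no more than ~10 requests/second
    resp = requests.get(url, headers={"User-Agent": "research-bot contact@example.com"})
    resp.raise_for_status()
    time.sleep(0.1)  # stay under the rate limit
    return resp.content

def pdf_to_text(raw: bytes) -> str:
    reader = PdfReader(BytesIO(raw))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_fundamentals(text: str) -> dict:
    # Have the model reformat raw filing text into structured JSON
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Extract revenue, net income, and EPS as JSON."},
            {"role": "user", "content": text[:100_000]},  # truncate to fit the context window
        ],
    )
    return json.loads(resp.choices[0].message.content)

def validate(record: dict) -> bool:
    # Minimal sanity check before anything touches the database
    return all(key in record for key in ("revenue", "net_income", "eps"))

def run(filing_urls: list[str]) -> None:
    for url in filing_urls:
        record = extract_fundamentals(pdf_to_text(fetch_filing(url)))
        if validate(record):
            db.filings.insert_one(record)
```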
With an AI agent, you should theoretically just be able to say:
Scrape the past and future fundamental data for all US companies and save it to a MongoDB database
Maybe it’ll ask you some clarifying questions. It might ask if you have an idea for what the schema should look like or which information is most important.
But the idea is you give it a goal and it will complete the task fully autonomously.
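Under the hood, most agents today are just a loop: the model picks the next tool call, you execute it, and you feed the result back until it says it’s done. Here’s a simplified sketch; call_llm and the two tools are hypothetical stand-ins, not any particular framework’s API.

```python
def scrape_sec(ticker: str) -> dict:
    ...  # placeholder tool: fetch filings for one company

def save_to_mongo(record: dict) -> None:
    ...  # placeholder tool: persist a structured record

def call_llm(history: list[dict], tools: list[str]) -> dict:
    ...  # placeholder: ask the model for its next action as JSON

TOOLS = {"scrape_sec": scrape_sec, "save_to_mongo": save_to_mongo}

def run_agent(goal: str, max_steps: int = 50) -> None:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_llm(history, tools=list(TOOLS))  # the model decides what to do next
        if action.get("type") == "done":
            break
        result = TOOLS[action["tool"]](**action.get("args", {}))
        history.append({"role": "tool", "content": str(result)})
```

Every line of that loop depends on the model making the right call, every single time.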
Sounds too good to be true, right?
That’s because it is.
The Problem with AI Agents in Practice
Now, if the cheapest small language model were free, as strong as Claude 3.5, and able to run locally on any AWS T2 instance, this article would have a completely different tone.
It wouldn’t be a critique. It’d be a warning.
However, as it stands, AI agents do not work in the real world, and here’s why.
1. Smaller Models are not NEARLY strong enough
The core problem of agents is that they rely on large language models.
More specifically, they rely on a GOOD model.
GPT-4o mini, the cheapest large language model other than Gemini Flash, is AMAZING for the price.
But it is quite simply not strong enough to complete real-world agentic tasks.
It will veer off course, forget its goals, or make simple mistakes no matter how well you prompt it.
And if deployed live, your business will pay the price. When the large language model makes a mistake, it’s not easy to detect unless you also build a (likely LLM-based) validation framework. One small error at the beginning, and everything downstream from it is cooked.
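The usual workaround is to have a second model grade the first one’s output, something like the sketch below (my own illustration; the judge model, prompt, and JSON verdict format are assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()

def looks_valid(task: str, output: str) -> bool:
    # Ask a second model to grade the first model's work.
    # It catches some mistakes, but the judge is itself an LLM and can also be wrong.
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": 'Grade the output for the task. Reply as JSON: {"ok": true or false, "reason": "why"}'},
            {"role": "user", "content": f"Task: {task}\n\nOutput: {output}"},
        ],
    )
    verdict = json.loads(resp.choices[0].message.content)
    return bool(verdict.get("ok"))
```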
In practice, here’s how this works.
2. Compounding of Errors
Let’s say you’re using GPT-4o mini for agentic work.
Your agent breaks the task of extracting financial information for a company into smaller subtasks. Let’s say the probability it does each subtask correctly is 90%.
With this, the errors compound. Even for a moderately difficult task with four subtasks, the probability of the final output being correct drops fast.
For example, if we break this down:
- The probability of completing one subtask is 90%
- The probability of completing two subtasks is 0.9*0.9 = 81%
- The probability of completing four subtasks is 0.9^4 ≈ 66%
See where I’m headed?
To mitigate this, you will want to use a better language model. A stronger model might increase the accuracy of each subtask to 99%, so after four subtasks the final accuracy is 0.99^4 ≈ 96%. A whole lot better (but still not perfect).
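A quick back-of-the-envelope calculation makes the decay concrete (this assumes subtasks succeed or fail independently, which is a simplification):

```python
# Chance the whole chain succeeds when every subtask must succeed independently
def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

for p in (0.90, 0.99):
    for n in (1, 2, 4, 10):
        print(f"per-step {p:.0%}, {n:2d} subtasks -> {chain_success(p, n):.0%}")

# per-step 90%:  4 subtasks -> 66%, 10 subtasks -> 35%
# per-step 99%:  4 subtasks -> 96%, 10 subtasks -> 90%
```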
Most importantly, changing to these stronger models comes with an explosion of costs.