Building Effective "Agents"
I put "agents" in quotes because Anthropic actually talks more about what they call "workflows". And imo this is where the real value of LLMs currently lies: workflow automation.
They also say that LangChain and other frameworks are mostly unnecessary and do more harm than good. They instead argue for a few simple patterns applied directly at the API level, not dissimilar to the old-school Gang of Four software engineering patterns.
Really like this post as guidance for how to actually build useful tools with LLMs. Keep it simple, stupid.
My personal view is that the roadmap to AGI requires an LLM acting as a prefrontal cortex: something designed to think about thinking.
It would decide what circumstances call for double-checking facts for accuracy, which would hopefully catch hallucinations. It would write its own acceptance criteria for its answers, etc.
It's not clear to me how to train each of the sub-models required, or how big (or small!) they need to be, or what architecture works best. But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
> Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
The questions then become:
1. When can you (i.e. a person who wants to build systems with them) trust them to make decisions on their own?
2. What type of trusted environments are we talking about? (Sandboxing?)
So, that all requires more thought -- perhaps by some folks who hang out at this site. :)
I suspect that someone will come up with a "real-world" application at a non-tech-first enterprise company and let us know.
Couldn’t agree more with this - too many people rush to build autonomous agents when their problem could easily be defined as a DAG workflow. Agents increase the degrees of freedom in your system exponentially, which makes it much more challenging to evaluate systematically.
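To make the "DAG workflow" point concrete, here is a minimal sketch of a fixed workflow where code, not the model, decides the order of steps. The step names and the `call_llm` stub are hypothetical placeholders, not anything from the article; only `graphlib.TopologicalSorter` is a real standard-library call.

```python
from graphlib import TopologicalSorter


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only echoes the prompt."""
    return f"<answer to: {prompt}>"


# Each step lists the steps it depends on. The graph is fixed ahead of time,
# so every run takes the same path and can be evaluated step by step.
STEPS = {
    "extract": [],
    "summarize": ["extract"],
    "translate": ["extract"],
    "report": ["summarize", "translate"],
}


def run_workflow(document: str, llm=call_llm) -> dict:
    """Run every step once, in dependency order, feeding outputs forward."""
    results = {}
    for step in TopologicalSorter(STEPS).static_order():
        # Steps without dependencies read the raw document instead.
        inputs = [results[dep] for dep in STEPS[step]] or [document]
        results[step] = llm(f"{step}: " + " | ".join(inputs))
    return results
```

Because the path is known up front, each step can be tested in isolation with a fake `llm`, which is much harder once the model chooses its own next step.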
Slightly off topic, but does anyone have a suggestion for a tool to make the visualizations of the different architectures like in this post?
My wish list for LLM APIs to make them more useful for 'agentic' workflows:
Finer-grained control over the tools the LLM is supposed to use. The 'tool_choice' parameter should accept a list of tools to choose from. The point is that the full list of available tools is needed to interpret past tool calls, so you cannot also use that list to limit the LLM's choice at a particular step. See also: https://zzbbyy.substack.com/p/two-roles-of-tool-schemas
Control over how many tool calls can go in one request. For stateful tools, multiple tool calls in one request lead to confusion.
By the way - is anyone working with stateful tools? Often they seem very natural, and you would think that the LLM would have encountered lots of stateful interactions during training and be skilled at using them. But there aren't many examples, and the libraries are not really geared towards that.
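Until an API offers the two wish-list controls above, one workaround is to enforce them on the client after the model replies. The sketch below is only that: `ToolCall`, `ToolPolicyError` and `enforce_policy` are made-up names for illustration and do not correspond to any provider's SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical, provider-neutral shape of one tool call from the model."""
    name: str
    arguments: dict = field(default_factory=dict)


class ToolPolicyError(ValueError):
    """Raised when the model's reply violates the policy for this step."""


def enforce_policy(calls, allowed, max_calls=1):
    """Reject replies that use a tool outside `allowed` or make too many calls.

    The full tool list can still be sent with the request (so past calls stay
    interpretable); this check narrows what is accepted at the current step.
    """
    if len(calls) > max_calls:
        raise ToolPolicyError(f"{len(calls)} tool calls, at most {max_calls} allowed")
    for call in calls:
        if call.name not in allowed:
            raise ToolPolicyError(f"tool {call.name!r} is not allowed at this step")
    return list(calls)
```

A caller would typically catch `ToolPolicyError` and re-prompt the model with the error message rather than execute the call.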
While I agree with the premise of keeping it simple (especially when it comes to using opaque and overcomplicated frameworks like LangChain/LangGraph!) I do believe there’s a lot more to building agentic systems than this article covers.
I recently wrote[1] about the 4 main components of autonomous AI agents (Profile, Memory, Planning & Action) and all of that can still be accomplished with simple LLM calls, but there’s simply a lot more to think about than simple workflow orchestration if you are thinking of building production-ready autonomous agentic systems.
[1] https://melvintercan.com/p/anatomy-of-an-autonomous-ai-agent
Agents are still a misaligned concept in AI. While this article offers a lot on orchestration, memory (only mentioned once in the post) and governance are not really covered. The latter is important for increasing reliability -- something Ilya Sutskever has called important, since agents can be less deterministic in their responses. Interestingly, "agency", i.e. the ability of the agent to make its own decisions, is not mentioned once.
I work on CAAs and document my journey on my substack (https://jdsmerau.substack.com)
This was an excellent writeup - felt a bit surprised at how much they considered "workflow" instead of agent but I think it's good to start to narrow down the terminology
I think these days the main value of the LLM "agent" frameworks is being able to trivially switch between model providers, though even that breaks down when you start to use more esoteric features that may not be implemented in cleanly overlapping ways
It looks like agents are less about DAG workflows or fully autonomous "networks of agents", and more about a stateful network:
* A "network of agents" is a system of agents and tools
* That run and build up state (both "memory" and actual state via tool use)
* Which is then inspected when routing as a kind of "state machine".
* Routing should specify which agent (or agents, in parallel) to run next, via that state.
* Routing can also use other agents (routing agents) to figure out what to do next, instead of code.
We're codifying this with durable workflows in a prototypical library — AgentKit: https://github.com/inngest/agent-kit/ (docs: https://agentkit.inngest.com/overview).
It took less than a day to get a network of agents to correctly fix swebench-lite examples. It's super early, but very fun. One of the cool things is that this uses Inngest under the hood, so you get all of the classic durable execution/step function/tracing/o11y for free, but it's just regular code that you write.
When thinking about AI agents, there is still conflation between how to decide the next step to take vs what information is needed to decide the next step.
If runtime information is insufficient, we can use AI/ML models to fill that information. But deciding the next step could be done ahead of time assuming complete information.
Most AI agent examples short circuit these two steps. When faced with unstructured or insufficient information, the program asks the LLM/AI model to decide the next step. Instead, we could ask the LLM/AI model to structure/predict necessary information and use pre-defined rules to drive the process.
This approach will translate most [1] "Agent" examples into "Workflow" examples. The quotes here are meant to imply Anthropic's definition of these terms.
[1] I said "most" because there might be continuous-world systems (such as a real-world simulacrum) that would require a very large number of rules, where defining each of them is probably impractical. I believe those systems are an exception, not the rule.
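A small sketch of the split described above: the model only turns unstructured text into structured fields, and pre-defined rules pick the next step. Everything here is illustrative; `extract_ticket` is a keyword heuristic standing in for an LLM call, and the ticket fields and thresholds are invented.

```python
import re
from dataclasses import dataclass


@dataclass
class Ticket:
    category: str
    amount: float


def extract_ticket(text: str) -> Ticket:
    """Hypothetical stand-in for an LLM extraction call (keyword heuristic)."""
    category = "refund" if "refund" in text.lower() else "other"
    match = re.search(r"\d+(?:\.\d+)?", text)
    return Ticket(category=category, amount=float(match.group()) if match else 0.0)


def next_step(ticket: Ticket) -> str:
    """Deterministic rules, written ahead of time, decide what happens next."""
    if ticket.category == "refund" and ticket.amount <= 50:
        return "auto_refund"
    if ticket.category == "refund":
        return "human_review"
    return "reply_with_faq"
```

The model's output is confined to two fields that can be validated, while the branching logic stays in ordinary code that can be unit-tested without a model.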
Good article. I think it could put a bit more emphasis on supporting human interaction in agentic workflows. While composing workflows isn't new, putting a human in the loop introduces huge complexity, especially for long-running, async processes. Waiting for human input (which could take days), managing retries, and avoiding errors like duplicate refunds or missed updates require careful orchestration.
I think this is where durable execution shines. By ensuring every step in an async processing workflow is fault-tolerant and durable, even interruptions won't lose progress. For example, in a refund workflow, a durable system can resume exactly where it left off: no duplicate refunds, no lost state.
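As a toy illustration of the refund example, the sketch below checkpoints each completed step to a JSON file so that a restarted run skips work it already did. `DurableRun` and `refund_workflow` are hypothetical names, and this is far from a real engine: the write is not atomic, and a crash between running a step and saving its result could still repeat that step, which is why production systems pair this with idempotency keys.

```python
import json
from pathlib import Path


class DurableRun:
    """Toy checkpoint store: remembers the result of every finished step."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload earlier progress if this run was interrupted before.
        self.done = json.loads(self.path.read_text()) if self.path.exists() else {}

    def step(self, name, fn):
        """Run `fn` once per step name; on replay, return the saved result."""
        if name in self.done:
            return self.done[name]
        result = fn()  # must return something JSON-serializable
        self.done[name] = result
        self.path.write_text(json.dumps(self.done))
        return result


def refund_workflow(run, issue_refund, notify):
    """Two-step workflow; both callables are supplied by the caller."""
    refund_id = run.step("issue_refund", issue_refund)
    run.step("notify_customer", lambda: notify(refund_id))
    return refund_id
```

Calling `refund_workflow` a second time with a fresh `DurableRun` on the same file returns the stored refund id without invoking `issue_refund` again.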
Note how much the principles here resemble general programming principles: keep complexity down, avoid frameworks if you can, avoid unnecessary layers, make debugging easy, document, and test.
It’s as if AI took over the writing-the-program part of software engineering, but sort of left all the rest.
The whole Agent thing can easily blow up in complexity.
Here are some challenges I personally faced recently:
- Durable Execution Paradigm: You may need the system to operate in a "durable execution" fashion like Temporal, Hatchet, Inngest, and Windmill. Your processes need to run for months, and be upgraded and restarted. Links below.
- FSM vs. DAG: Sometimes, a Finite State Machine (FSM) is more appropriate than a Directed Acyclic Graph (DAG) for my use cases. FSMs support cyclic behavior, allowing for repeated states or loops (e.g., in marketing sequences). FSM done right is hard. If you need an FSM, you can't use most tools without "magic" hacks.
- Observability and Tracing - it takes time to set everything up nicely in Grafana (Alloy, Tempo, Loki, Prometheus) or whatever you prefer. Switching attention between multiple systems is not an option due to limited attention span and "skills" issues. Most "out of the box" functionality, or a new agent framework, quickly becomes a liability.
- Token/Inference Economy - tracking token consumption and identifying edge cases with poor token management is a challenge, similar to Ethereum's gas consumption issues. Building a billing system based on actual consumption on top of Stripe was a challenge. This is even 10x harder ... at least for me ;)
- Context Switching - managing context switching is akin to handling concurrency and scheduling with async/await paradigms, which can become complex. Simple prompts are OK, but once you start juggling documents, screenshots or screen reading, it's another game.
What I like about all of the above is that it's nothing new - the design patterns and architectures have been known for a while.
It's just hard to see through the storm of AI/ML buzzwords ... but once you start looking at source code ... the fog of mind wars clears.
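On the FSM vs. DAG point, the difference is easy to see in code: a transition table can send a process back to an earlier state, which a DAG cannot express. The states and events below are an invented marketing-sequence example, not taken from any of the tools listed.

```python
# (current state, event) -> next state. Note the loop: "waiting" can go back
# to "new", so the same states may be visited many times.
TRANSITIONS = {
    ("new", "email_sent"): "waiting",
    ("waiting", "no_reply"): "new",
    ("waiting", "replied"): "qualified",
    ("waiting", "unsubscribed"): "closed",
}


def advance(state: str, event: str) -> str:
    """Apply one event; unknown (state, event) pairs are an error, not a no-op."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


def run(events, state: str = "new") -> str:
    """Replay a sequence of events and return the final state."""
    for event in events:
        state = advance(state, event)
    return state
```

Keeping the table as data makes it straightforward to persist the current state between events, which is where the durable execution engines come in.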
Durable Execution / Workflow Engines
- Temporal https://github.com/temporalio - https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
- Hatchet https://news.ycombinator.com/item?id=39643136
- Inngest https://news.ycombinator.com/item?id=36403014
- Windmill https://news.ycombinator.com/item?id=35920082
Any comments and links on the above challenges and solutions are greatly appreciated!
I’m part of a team that is currently #1 at the SWEBench-lite benchmark. Interesting times!
Have been building agents for the past 2 years, my tl;dr is that:
Agents are Interfaces, Not Implementations
The current zeitgeist seems to think of agents as passthrough agents: e.g. a light wrapper around a core that's almost 100% an LLM.
The most effective agents I've seen, and have built, are largely traditional software engineering with a sprinkling of LLM calls for "LLM hard" problems. LLM hard problems are problems that can ONLY be solved by application of an LLM (creative writing, text synthesis, intelligent decision making). Leave all the problems that are amenable to decades of software engineering best practice to good old deterministic code.
I've been calling systems like this "Transitional Software Design." That is, they're mostly a traditional software application under the hood (deterministic, well-structured code, separation of concerns) with judicious use of LLMs where required.
Ultimately, users care about what the agent does, not how it does it.
The biggest differentiator I've seen between agents that work and get adoption, and those that are eternally in a demo phase, is the cardinality of the state space the agent is operating in. Too many folks try to "boil the ocean" and implement a general-purpose capability: e.g. generating Python code to do something, or synthesizing SQL from natural language.
The projects I've seen that work really focus on reducing the state space of agent decision making down to the smallest possible set that delivers user value.
e.g. Rather than generating arbitrary SQL, work out a set of ~20 SQL templates that are hyper-specific to the business problem you're solving. Parameterize them with the options for select, filter, group by, order by, and the subset of aggregate operations that are relevant. Then let the agent choose the right template + parameters from a relatively small finite set of options.
^^^ the delta in agent quality between "boiling the ocean" vs "agent's free choice over a small state space" is night and day. It lets you deploy early, deliver value, and start getting user feedback.
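A rough sketch of the template idea, assuming a made-up `orders` table: the agent only returns a template id plus a few choices, and code checks each choice against a whitelist before building the SQL. The template names, columns and `render` helper are all hypothetical.

```python
# Each template fixes the query shape; the agent may only pick from the
# listed values for each slot.
TEMPLATES = {
    "revenue_by": {
        "sql": (
            "SELECT {group_by}, SUM(amount) AS total FROM orders "
            "GROUP BY {group_by} ORDER BY total {direction}"
        ),
        "params": {"group_by": {"region", "product"}, "direction": {"ASC", "DESC"}},
    },
    "order_count_by": {
        "sql": "SELECT {group_by}, COUNT(*) AS n FROM orders GROUP BY {group_by}",
        "params": {"group_by": {"region", "product"}},
    },
}


def render(template_id: str, **choices: str) -> str:
    """Validate the agent's choices against the whitelist, then fill the template."""
    spec = TEMPLATES[template_id]  # KeyError for an unknown template
    if set(choices) != set(spec["params"]):
        raise ValueError(f"expected parameters {sorted(spec['params'])}")
    for key, value in choices.items():
        if value not in spec["params"][key]:
            raise ValueError(f"{value!r} is not an allowed value for {key!r}")
    # Safe to format: every inserted value came from a fixed whitelist.
    return spec["sql"].format(**choices)
```

With two templates and a handful of options, the whole decision space can be enumerated and tested, which is the point of shrinking it; user-supplied filter values would still go through normal bound parameters rather than string formatting.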
Building Transitional Software Systems:
1. Deeply understand the domain and CUJs (critical user journeys),
2. Segment out the system into "problems that traditional software is good at solving" and "LLM-hard problems",
3. For the LLM hard problems, work out the smallest possible state space of decision making,
4. Build the system, and get users using it,
5. Gradually expand the state space as feedback flows in from users.
The Claude API lacks structured output, and without uniform output it's not useful for agents. I've had agent systems suddenly break down due to degraded output, at which point the previously suggested JSON output hacks (from the official cookbook) stopped working.
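One common client-side mitigation for the problem described above is to validate every reply and re-ask on failure, so a format regression surfaces as a clear error instead of a silent breakage. This is a generic sketch, not the cookbook's approach; `llm` is any callable that takes a prompt and returns text, and the `REQUIRED` schema is invented.

```python
import json

# Hypothetical schema: the keys the rest of the system depends on.
REQUIRED = {"action": str, "confidence": float}


def parse_reply(text: str) -> dict:
    """Parse and validate one reply; raises ValueError on any mismatch."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    for key, expected in REQUIRED.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"field {key!r} is missing or not {expected.__name__}")
    return data


def ask_structured(llm, prompt: str, max_attempts: int = 3) -> dict:
    """Call `llm` until its reply validates, feeding the error back each time."""
    error = None
    for _ in range(max_attempts):
        retry_note = "" if error is None else f"\nYour last reply was invalid: {error}. Reply with JSON only."
        try:
            return parse_reply(llm(prompt + retry_note))
        except ValueError as exc:
            error = exc
    raise RuntimeError(f"no valid reply after {max_attempts} attempts: {error}")
```

Logging how often the retry path is taken also gives an early signal when output quality starts to degrade.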
I have always voted for Unix-style black boxes, each doing one thing well, as the plumbing under the ruling agent.
Divide and conquer me hearties.
Tangent, but does anyone know what software is used to draw those workflow diagrams?
Indeed, we've seen this approach as well. All these "frameworks" become too complicated in real business cases.
Does anyone have solid examples of a real agent deployed in production?
Anthropic keeps advertising its MCP (Model Context Protocol), but to the extent it doesn't support other LLMs, e.g. GPT, it couldn't possibly gain adoption. I have yet to see any example of MCP that can be extended to use a random LLM.
Key to understanding the power of agentic workflows is tool usage. You don't have to write logic anymore; you simply give an agent the tools it needs to accomplish a task and ask it to do so. Models like the latest Sonnet have gotten so advanced now that coding abilities are reaching superhuman levels. All the hallucinations and "jitter" of models from 1-2 years ago have gone away. They can be reasoned about now and you can build reliable systems with them.
This is by far the most practical piece of writing I've seen on the subject of "agents" - it includes actionable definitions, then splits most of the value out into "workflows" and describes those in depth with example applications.
There's also a cookbook with useful code examples: https://github.com/anthropics/anthropic-cookbook/tree/main/p...
Blogged about this here: https://simonwillison.net/2024/Dec/20/building-effective-age...