AI Agents: What Actually Happens Internally

This is fully written by me without any help from AI, except for learning the concepts. This blog consists of all the things that everyone should be aware of in the AI world, which took me weeks to learn to make it easy for you.

> Practice agentic AI concepts hands-on for interviews and real-world development on AgenticPrep.io.

Overview

In the world of AI, LLMs are not sufficient.

LLMs are simple(and yet complex) until you convert them to agents.

This is where all the powers are unlocked, and LLMs can be used to do anything and everything in the world.

This blog explains how a chatbot(or simple LLM) is actually converted into an AI Agent to do real-world, productive things.

I'll explain from the basics what LLMs are, how they should be used, and how agents are created, but on a high level:

LLMs(or large language models) are models trained(like Gemini, Opus, etc) on a huge amount of data and can be considered as "a brain in a glass jar" that can dictate what to do.
Agents(or wrappers) are tools written which used LLMs internally to do real-world tasks.
Chatbot - a common way to interact with LLM, either through a UI interface or LLM APIs. In this blog, we will mostly assume API based interaction.

The Basics: How LLM works

This section only describes the things that someone needs to know while building LLM-based agents and not how LLMs are created or trained.

Think of LLMs as a black box that takes in inputs as tokens and predicts the output tokens.

A token = a unit used in LLMs, for example, a byte of disk space or RAM, a centimeter of length, etc.

One token nearly equals 4 English characters.

The tokens predicted by LLMs are sequential in nature.

A T+1th token can only be predicted once it predicts the Tth token and not before that.

This is exactly how we chat with an LLM.

But it does have its own limitations that we need to take care of while using LLMs

Fixed context window - Input Tokens + Output Tokens <= Total Context Window.

If we pass more input or expect more output, it will either start hallucinating(saying random shit) or error out.

Input tokens are cheaper than output tokens - this is one thing to take care of while building AI agents. This is also one thing that due to which LLMs have a max output tokens limit defined even though there is a big context window left after input.
Session management - not a limitation, but by design, LLMs are stateless. Once you send some data(via API calls), get the output back, it will forget what you told it instantly. In the next call, you will have to remind the LLM of all the things.

The state management is a very big topic in LLM-based agents, so it will be covered separately in Part 2.

But one thing we should be aware of is that LLMs are stateless, and the state management should be taken care of by the agent(aka the wrapper) itself.

The most important things to taken care of while developing LLM-based agents:

System prompts(dynamic, based on user input with extra information added internally), like:

User inputs - define the intent of the user and are taken as input
User's name, location, time, etc - internally defined
Environment - prod, test - internally defined
Past context of the user(if any) - internally defined from the user's past conversations

Restrictions it should follow - these are the guardrails and safety mechanisms defined to prevent cases of prompt injection. These are in the form of prompts like:

The output should not contain any secrets
Remember the RBAC for the given user

Formatting constraints

If the output is expected in JSON format, ensure it is a correct JSON
If a user asks that LLM should behave like a gardener, reply like a gardener.
What should be the fallback if the LLM doesn't give output in the format that the user expects?

Forcing predictability - the building block of an Agent

The difference between a chatbot and an agent is that the agent can do things, but more predictably and less randomly.

This can be achieved by various mechanisms:

Native LLM APIs(provided by OpenAI, Gemini, etc) using prompting and decreasing their probability distribution to 0.

The LLMs are smart; if you provide them constraints with an example structure of output like JSON, they will try to take care of it.
Decrease the probability distribution(or temperature) to 0 for structured outputs so they are less "creative" - this is a parameter that can be tuned while working with LLMs so we can control its output.
Ask it to throw an error in case it is not able to do that, instead of letting it hallucinate and return random things.

Use an instructor, which is an abstraction layer between your LLM and your agent, to take care of structured outputs. This will parse and validate the output of the LLM, and if incorrect, pass it back to the LLM along with why it was incorrect to let it fix. Examples are PydanticAI for Python or Zod for TypeScript.

This is like a self-healing or smart auto-retry mechanism used to force predictability of output so your code actually compiles.
This is what takes care of syntax correctness.

Sandboxed testing - for generating code, this is what takes care of semantics rather than syntax correctness, so your code has fewer logical errors.

Executes the code in a sandbox environment not visible to the user and doesn't have access to the outside world, so if something goes wrong, it can be contained inside the sandbox. This can be a Docker container, WASM sandbox etc which executes the code, returns the errors back to the LLM for fixing, which works as a feedback loop until it is sure that the code is correct.
This is why you might have observed tools like Claude and Codex don't write code in a streaming manner like they do for their chatbots on UI.

This is very crucial for an AI agent wrapper, so we don't use a lot of tokens due to a high number of retries or hallucinations, which will increase customer cost and output time required - both of which will upset the user.

Tools: the one which convert a chatbot into an agent

An LLM tool is nothing but a standard, deterministic piece of software in the form of a bash command, a script, a Python code, a REST API, a GQL API, or an MCP tool(a little special one), which makes LLM aware of the outside world, the one that "dumb" LLM is not fully aware of.

All AI "wrappers" or agents are simply an LLM + tools + system prompts combination and nothing more.

If you take LLMs as a black box and something not in your control, the better you build your tools and system prompts, the better the agent will behave.

One thing you should take care of is NOT to write your tools and system prompts in such a way that the LLM restricts its thinking.

For example: Instead of writing a system prompt to contain the exact DB query for "what is the difference between my cloud architecture from last Monday", just provide the prompts with enough context, the DB schema, how to connect with DB or REST APIs, what the query inputs and output parameters are, etc.

If you give it the exact query, it will make the LLM dumber and restrict its natural "out of the box" thinking.

Every tool has a name, a description, and the input-output format, which enables LLM to figure out which one to use.

The duty of LLM is, with the above information, to figure out what tool(s) to call.

It returns the list of tools to the caller and forgets everything(remember that the LLMs are stateless) and now comes the magic of the agent.

The "wrapper" or the Agent

By design, LLMs do not remember things. Once you call an LLM API with input, it provides you with the output and forgets everything. LLM cannot fetch the data from the external environment; it always relies on the data that you provided it in the input and the data they are trained on.

Now, the agent's duty is to call the tools, get the data and give the whole context back to LLM so LLMs are more aware of the context. Once we get the additional data, we perform another LLM API call with the updated additional details.

The magic(or "agentic" tech) is in the while loop inside the agent - "while there are no tools to call":

build the context using input, tools output, system prompts, etc
call the LLM APIs with the context
wait for LLM to return either the list of another set of tools or the final output

Every step of the agent, whether it is input or an interrupt from the user, output from the LLM or a tool call, should be persisted in a suitable DB or a local file, based on the use case for how it should be used later.

But some persistence layer should mostly be present(stay tuned for part 2 of this blog on how to choose one) for crashes, history maintenance, and resumability.

Though all the concepts and designs are straightforward which are explained above, there are still some unanswered questions:

Should the while loop be sync or async? If async, why and how?
What persistence layer to choose for maintaining history, context, etc.?
How to handle stale tool results in case of resumability?

For example, today you asked "if you should take an umbrella to work", we will have stale data from the tool about the weather conditions. In case you resume the same session tomorrow, LLM should decide if it should re-call those tools in the history and use their old results and move ahead.

Preamble or agentic chatter problem - making LLM more "human".

Thanks for reading through it, and stay tuned for the next set of blogs for things that are left unanswered.

Search This Blog

A Cup Of Code