
Step-by-Step Guide to Building Advanced LLM Agents

A Large Language Model, at its core, is just a next-token predictor. While powerful, this fundamental architecture does not natively support the complex, multi-step reasoning and tool execution that characterize advanced agentic AI systems like ChatGPT's Agent mode, Manus, or Perplexity. The transformation from a raw probabilistic model into an autonomous agent requires a layered engineering approach.

This tutorial provides a practical, code-driven guide to building such an agent from first principles. We will incrementally construct the necessary components, demonstrating with functional code how each layer adds critical capabilities. The progression will be as follows:

  1. Core LLM Abstraction: We'll begin with standard LLM APIs.
  2. Stateful Memory: To enable coherent, multi-turn dialogue, we will implement a session memory system to track conversational history.
  3. Tool Integration: We will then grant the agent external capabilities by instructing it to call tools, like a web search API, through a structured output format.
  4. The Agentic Loop: The central component of any agent is the reason-act loop. We'll construct this loop, allowing the model to make a plan, execute a tool, observe the outcome, and repeat the cycle.
  5. Secure Sandboxed Execution: Finally, to handle real-life tasks requiring code execution or file system access, we will integrate a secure sandbox environment, enabling the agent to safely perform advanced operations.

By the end of this post, you will have a complete, functional codebase for a minimal but powerful agent capable of orchestrating tools to solve non-trivial problems.

Here is a demo of the agent you will build. The chat client you see in the demo is built with assistant-ui. We won't cover the UI-building aspect in this post, although it is quite straightforward with this library.


Demo video of the AI agent querying the web, downloading content, spawning containers, installing libraries, creating and executing scripts, and finally returning download links

LLM API

An LLM, irrespective of how advanced it seems, takes a sequence of tokens as input and predicts the next token, just one next token.

To get the complete, coherent responses that you are used to, you need an LLM trained to respond in a chat-like fashion. Given an input sequence of tokens, you generate the first output token and append it to the input sequence. You then autoregressively feed the new sequence back as input to get the second output token, and loop until the LLM emits a special token that marks the end of the response.

All LLM APIs and chat interfaces hide this loop from you because you rarely need to control generation token by token.
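Conceptually, that hidden loop looks like the sketch below. This is purely illustrative: predict_next_token is a hypothetical stand-in for the model's forward pass, not a real API.

def generate(model, prompt_tokens, eos_token, max_new_tokens=512):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.predict_next_token(tokens)  # one forward pass, one token
        if next_token == eos_token:  # special token marking end of response
            break
        tokens.append(next_token)  # autoregressive: output fed back as input
    return tokens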

LLM APIs do a lot more than that. For instance, the agent we are building needs to serve many users simultaneously, so inputs from multiple users must be batched intelligently. Every query in the batch may differ in length, and so will the outputs. Your agent may have a long system prompt prepended to the first query of every conversation; such common prefixes can be cached. A large model's weights will not fit in a single GPU's memory and need a multi-GPU distributed setup for inference. You might want to use specialized releases of the models (for instance, quantized versions or attention kernels adapted to your hardware) to speed up responses. Modern LLM inference APIs handle all of that and more so that you can focus on building your agent.

For building our agent, let's use the API provided by OpenRouter (which makes it easy to switch models).

import os
import requests
import json

OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENROUTER_MODEL = os.environ.get("OPENROUTER_MODEL", "x-ai/grok-code-fast-1")

def llm(messages, model=OPENROUTER_MODEL, **opts):
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://your-app.example",
        "X-Title": "My Minimal Agent",
    }
    body = {"model": model, "messages": messages} | opts
    r = requests.post(url, headers=headers, data=json.dumps(body), timeout=60)
    
    if not r.ok:
        print(f"API Error {r.status_code}: {r.text}")
    
    r.raise_for_status()
    
    data = r.json()
    return data["choices"][0]["message"]
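A quick sanity check of this wrapper might look like the following (assuming OPENROUTER_API_KEY is set in your environment):

reply = llm([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "In one sentence, what is a token?"},
])
print(reply["content"])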

Multi-turn Conversations with the LLM

Now our agent can take in text input and respond with text. But to have multi-turn conversations, we also need to pass prior messages and responses along with the current message as input to the LLM. This means we need to have a memory that stores every conversation’s messages as sessions. Let’s build a simple memory system.

from collections import defaultdict
import uuid
import time

class Memory:
    def __init__(self):
        self.sessions = defaultdict(list)
    
    def new_session(self):
        """Create a new conversation session."""
        sid = str(uuid.uuid4())
        self.sessions[sid] = []
        return sid
    
    def add(self, sid, role, content):
        """Add a message to session history."""
        self.sessions[sid].append({
            "role": role, 
            "content": content, 
            "ts": time.time()
        })
    
    def history(self, sid):
        """Get session history"""
        return [{"role": m["role"], "content": m["content"]} 
                for m in self.sessions[sid]]

The output of the history method is passed to the LLM on each call. This lets our LLM hold long, coherent conversations with us. But how can we make it do more, like find information on the internet?
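Here is a minimal sketch of how Memory and llm fit together for multi-turn chat. The chat function is our own helper, not part of any API, and later snippets reuse this mem instance.

mem = Memory()

def chat(sid, user_text):
    # store the user message, send the full history, store the reply
    mem.add(sid, "user", user_text)
    reply = llm(mem.history(sid))
    mem.add(sid, "assistant", reply["content"])
    return reply["content"]

sid = mem.new_session()
print(chat(sid, "My name is Asha."))
print(chat(sid, "What is my name?"))  # answerable only because history is passed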

Tool Calling

LLM APIs just return text and have no ability to perform tasks like web browsing, calling APIs, running scripts or executing any sort of action on a computer.

To address this, we use the system prompt to tell the LLM that there are certain tools we can run on its behalf; if a tool should be run to get results, the LLM returns a structured response with the tool name and its parameters. Here is how a simple system prompt would look.

You are an assistant with one tool: search_web.
If a user query benefits from web info, return a JSON object:
{"type":"TOOL_CALL","tool":"search_web","args":{"query":"..."}}.
Otherwise, answer directly with:
{"type":"RESPONSE","content":"..."}.
Keep tool args minimal.

Here we ask the LLM to always respond with a JSON object whose type key takes one of two values: TOOL_CALL or RESPONSE. This value determines whether we make a tool call or return the response to the user. For a RESPONSE, the content key holds the text to return to the user; for a TOOL_CALL, the tool and args keys hold the tool name and the parameter values to invoke it with.

Let's add support for a web search tool to our agent. For the implementation, we will directly use a standard web search API like Exa, and handle the two response types with a simple if-else condition.

import os
import json
import requests

EXA_API_KEY = os.environ.get("EXA_API_KEY", "YOUR_EXA_KEY")

def exa_search(query, num_results=5):
    url = "https://api.exa.ai/search"
    headers = {"Authorization": f"Bearer {EXA_API_KEY}", "Content-Type": "application/json"}
    payload = {"query": query, "numResults": num_results}
    resp = requests.post(url, headers=headers, data=json.dumps(payload), timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return [{"title": r.get("title"), "url": r.get("url"), "snippet": r.get("text", "")[:240]} for r in data.get("results", [])]


# The system prompt shown above, stored for reuse
SYSTEM_PROMPT = """You are an assistant with one tool: search_web.
If a user query benefits from web info, return a JSON object:
{"type":"TOOL_CALL","tool":"search_web","args":{"query":"..."}}.
Otherwise, answer directly with:
{"type":"RESPONSE","content":"..."}.
Keep tool args minimal."""

def respond_or_search(sid, user_text):
    # `mem` is the Memory instance created earlier
    mem.add(sid, "user", user_text)
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}] + mem.history(sid)
    m = llm(msgs, temperature=0.2)
    content = m.get("content", "").strip()

    if content.startswith("{") and '"TOOL_CALL"' in content:
        tool_req = json.loads(content)
        if tool_req.get("tool") == "search_web":
            results = exa_search(tool_req["args"]["query"])
            mem.add(sid, "tool", json.dumps({"tool": "search_web", "results": results}))
            # give the results back to the LLM to compose the final answer
            follow_msgs = msgs + [{"role": "tool", "content": json.dumps(results)}]
            final = llm(follow_msgs, temperature=0.2)
            mem.add(sid, "assistant", final["content"])
            return final["content"]

    # direct answer, or an unrecognized tool request: return the raw content
    mem.add(sid, "assistant", content)
    return content
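A quick test of this single-tool agent might look like this (the exact output depends on the model):

sid = mem.new_session()
print(respond_or_search(sid, "Who won the most recent FIFA World Cup?"))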

The Reason-Act Agent Loop

Now we have an agent that can either respond to us or search the web and return results. But what if we wanted to support multi-step tasks? Like “Find me top 5 cities in the USA ordered by population desc”. Here the agent has to first search the web, then analyze the results to generate a response containing the list.

To enable this, we need a loop which we call the agent loop. The loop consists of an LLM call followed by an optional tool call. Like before, the LLM call is forced to return a structured JSON response with a key type. The value of type, for now, can take two values - RESPONSE or TOOL_CALL. We end the agent loop when the LLM call returns a response of type RESPONSE.

# generic loop: LLM step -> optional tool step -> repeat
import json
import requests

def fetch_webpage_tool(url):
    r = requests.get(url, timeout=30, headers={"User-Agent": "agent/0.1"})
    r.raise_for_status()
    # Limit to 128000 characters to avoid context length issues
    return r.text[:128000]

TOOLS = {
    "search_web": lambda args: exa_search(args["query"], args.get("k", 5)),
    "fetch_webpage": lambda args: {"html": fetch_webpage_tool(args["url"])}
}

def agent_loop(sid, user_text, max_steps=6):
    mem.add(sid, "user", user_text)
    messages = [{"role":"system","content":SYSTEM_PROMPT}] + mem.history(sid)

    for _ in range(max_steps):
        step = llm(messages, temperature=0)
        content = step.get("content","").strip()
        if content.startswith("{") and '"TOOL_CALL"' in content:
            req = json.loads(content)
            tool_name, args = req["tool"], req.get("args", {})
            tool_fn = TOOLS.get(tool_name)
            # keep the assistant's tool request in the transcript so the LLM
            # can see which call produced the result that follows
            messages.append({"role": "assistant", "content": content})
            if not tool_fn:
                messages.append({
                    "role": "tool",
                    "content": json.dumps({"error": f"unknown tool {tool_name}"})})
                continue

            result = tool_fn(args)
            messages.append({
                "role": "tool",
                "content": json.dumps({"tool": tool_name, "result": result})})
        else:
            try:
                maybe = json.loads(content)
                if maybe.get("type") == "RESPONSE":
                    final = maybe["content"]
                else:
                    final = content
            except Exception:
                final = content
            mem.add(sid, "assistant", final)
            return final

    return "I hit the max steps limit."

Before we run this, we need to add another response type. This is because many LLMs will respond with a preamble instead of a tool call. For example, the LLM might return "alright, let me search the web first". Since this is a RESPONSE type, it would be returned to the user and the agent loop would stop. (We introduce an UPDATE type for such intermediate messages in the full loop later.)

Also note that we added another tool, fetch_webpage, to fetch the pages that Exa search returns (Exa search returns only URLs with snippets).
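A minimal driver for the loop (a sketch, reusing the mem instance from before):

sid = mem.new_session()
print(agent_loop(sid, "Give me top 5 cities in India ordered by population"))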

If you run the above with some print statements inside the loop, you should see:

User Query: Give me top 5 cities in India ordered by population

Agent Response:
Step 1: {"type":"TOOL_CALL","tool":"search_web","args":{"query":"top 5 most populous cities in India 2024","k":5}}

Step 2: {"type":"TOOL_CALL","tool":"fetch_webpage","args":{"url":"https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population"}}

Step 3: {"type":"RESPONSE","content":"Based on the latest data from the 2011 Census of India (as listed on Wikipedia), here are the top 5 cities in India ordered by population:

1. **Mumbai** - 12,442,373
2. **Delhi** - 11,034,555
3. **Bangalore** - 8,443,675
4. **Hyderabad** - 6,731,790
5. **Ahmedabad** - 5,577,940

Note: These figures are from the 2011 census; more recent estimates may vary slightly, but this is the official data used for ranking."}

You now have a minimal agent ready (albeit single-turn, with access only to web search and summarization).

Tools for Code Execution and Filesystem Operations

Now let's try to support more complex scenarios like "Visit airbnb.com and download all images in the homepage". To execute this, the agent first needs to fetch the URL's entire HTML content, then detect all images in the HTML, then download them into a folder, zip the folder, upload it somewhere, and share the link with the user. The last part is the main challenge: to download images, you need a filesystem; to compress a folder, you need a computer attached to that filesystem.

For security reasons, execution of any code or any filesystem operation needs to be isolated from the computer our agent runs on. This means, for such cases, our agent first has to launch a container with a filesystem, common CLI tools, and environments for different programming languages pre-installed.

Additionally, such a sandboxed environment should also be cut off from the internet to avoid unintended/malicious side effects.

Therefore, safe and restricted tools like web search and fetch webpage run on the agent's computer (or are invoked via an API call from the agent's computer), while code execution and filesystem tools run in the offline sandboxed environment.

For creating sandboxes, executing code, and managing the filesystem, we shall use a sandbox-as-a-service provider rather than building this infrastructure on our own. Here, we shall use E2B, which is used by many popular agents like Perplexity and Manus. Here is the minimal interface we need from this sandbox.

class Sandbox:
    def __init__(self):
        self.id = None

    def create(self, image="python:3.11"):
        # create and boot an isolated VM/container; return sandbox id/handle
        self.id = "sandbox-123"  # placeholder
        return self.id

    def exec(self, cmd, workdir="/workspace"):
        # run a command inside the sandbox and return stdout/stderr/exit_code
        return {"stdout":"", "stderr":"", "code":0}

    def write_file(self, path, content_bytes):
        # upload bytes into the sandbox filesystem
        return True

    def read_file(self, path):
        # download bytes from sandbox
        return b"..."
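To expose these capabilities to the LLM, each sandbox operation gets registered in the TOOLS dict just like search_web. Here is a sketch against the placeholder interface above; the tool names match the system prompt shown later, and a real build would call E2B's SDK instead of this stub.

sbx = Sandbox()
sbx.create()

TOOLS.update({
    "create_file": lambda a: {"ok": sbx.write_file(a["path"], a["content"].encode())},
    "read_file": lambda a: {"content": sbx.read_file(a["path"]).decode()},
    "exec": lambda a: sbx.exec(a["command"], workdir=a.get("workdir", "/")),
    # the fetch happens outside the sandbox; bytes are saved inside via its API
    "download_file": lambda a: {"ok": sbx.write_file(
        a["path"], requests.get(a["url"], timeout=30).content)},
    # "get_download_link" would wrap the provider's file-sharing API (E2B offers one)
})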

Then, to execute the query "Visit airbnb.com and download all images in the homepage", the agent has to

  1. Launch secure sandbox
  2. Pass raw query (with system prompt) to the LLM. Based on instructions in the system prompt, the LLM will likely respond with a TOOL_CALL type with the fetch webpage tool name and parameter https://airbnb.com
  3. Invoke fetch webpage tool with parameter https://airbnb.com
  4. Pass resulting HTML file to LLM with context of past messages. The LLM will most likely return another TOOL_CALL response to invoke download tool with the image URL
  5. The download tool will fetch the image contents and then save them on the sandboxed filesystem as an image file. The download itself happens outside the sandbox but uses the sandbox’s API to save the file.
  6. The download tool will return a success or a failure which shall be passed back to the LLM and the loop will continue till all images are downloaded
  7. Finally, after the last download’s result is passed to the LLM, it should return another TOOL_CALL response to invoke exec tool to run folder compression command. This tool will again be executed using the sandbox environment’s API
  8. E2B also provides an API to upload files (in our case the compressed folder) to blob storage and return a download link with access control. This will also be exposed as a tool in the system prompt and will likely be invoked next
  9. Return the downloadable link of the compressed folder to the LLM
  10. LLM composes the final response with the link

The steps might look complex, but it is a single loop with alternating LLM calls and tool calls. This is how it would look in code:

def agent_loop(mem, llm_func, sid, user_text, max_steps=6):
    """Main agent loop handling tool calls and responses."""
    mem.add(sid, "user", user_text)

    # history[0] holds the system prompt seeded at session creation; skip it
    # and re-add the prompt fresh so it is never duplicated
    messages = [{"role": "system", "content": SYSTEM_WITH_TOOLS}] + mem.history(sid)[1:]
    
    for step_num in range(max_steps):
        try:
            step = llm_func(messages, temperature=0)
            content = step.get("content", "").strip()
            print(f"Step {step_num + 1}: {content}")
        except Exception as e:
            # If LLM call fails, add error to conversation and try to continue
            error_msg = f"LLM API error: {str(e)}"
            print(f"Step {step_num + 1}: {error_msg}")
            messages.append({"role": "user", "content": error_msg})
            continue
        
        if content.startswith("{") and '"TOOL_CALL"' in content:
            req = json.loads(content)
            tool_name, args = req["tool"], req.get("args", {})
            tool_fn = TOOLS.get(tool_name)
            
            if not tool_fn:
                error_msg = f"Error: unknown tool {tool_name}"
                messages.append({"role": "assistant", "content": content})
                messages.append({"role": "user", "content": error_msg})
                continue
            
            # Execute the tool with error handling
            try:
                if tool_name == "subtask":
                    result = tool_fn(args, mem=mem, llm_func=llm_func)
                else:
                    result = tool_fn(args)
                
                # Append tool call and result to conversation
                messages.append({"role": "assistant", "content": content})
                messages.append({"role": "user", "content": f"Tool result: {json.dumps(result)}"})
            except Exception as e:
                # Send error to LLM instead of crashing
                error_msg = f"Tool {tool_name} failed with error: {str(e)}"
                messages.append({"role": "assistant", "content": content})
                messages.append({"role": "user", "content": error_msg})
            # Continue the loop to get next response from LLM
        else:
            # Handle different response types
            try:
                maybe = json.loads(content)
                response_type = maybe.get("type")
                
                if response_type == "RESPONSE":
                    # Final response - exit the loop
                    final = maybe["content"]
                    mem.add(sid, "assistant", final)
                    return final
                elif response_type == "UPDATE":
                    # Intermediate update - add to conversation and continue
                    update_content = maybe["content"]
                    messages.append({"role": "assistant", "content": content})
                    print(f"Update: {update_content}")
                    continue
                else:
                    # Unknown type, treat as final response
                    final = content
                    mem.add(sid, "assistant", final)
                    return final
            except Exception:
                # Not valid JSON, treat as final response
                final = content
                mem.add(sid, "assistant", final)
                return final
    
    return "I hit the step limit—try narrowing the task."

And here is a simple system prompt passed to the LLM at the beginning of the conversation. Note how long it is and how it is structured into sections using XML tags. Bullet points provide additional structure, and formatting like bold text is used to signal importance.

You are Pumpkin AI Agent, an AI agent created to excel at programming, problem-solving, and task automation.

<intro>
You excel at the following tasks:
1. Web scraping, data processing, and analysis
2. Information gathering and research using web search 
3. Task decomposition and multi-step problem solving
4. Creating tools, scripts, and automation solutions
</intro>

<language_settings>
- Working language: **English**
- Use the language specified by user in messages as the working language when explicitly provided
- All responses must be in the working language
- Natural language arguments in tool calls must be in the working language
</language_settings>

<system_capability>
- Execute commands in a secure E2B sandbox environment with internet access
- Search the web for current information and research
- Create, read, update, and delete files and folders
- Download files from URLs and process web content
- Run Python, shell commands, and other programming languages
- Install packages and dependencies as needed
- Break down complex tasks into manageable sub-problems
</system_capability>

<agent_loop>
You are operating in an agent loop, iteratively completing tasks through these steps:
1. Analyze Request: Understand user needs and current task state
2. Select Tools: Choose appropriate tool calls based on the task requirements
3. Execute Action: Use selected tool and process the results
4. Iterate: Continue with additional tool calls if needed to complete the task
5. Provide Results: Return final response with deliverables and explanations
6. Enter Standby: Wait for new tasks when current work is completed
</agent_loop>

<tool_use_rules>
- Must respond with either a tool call or final response in JSON format
- Choose tools strategically based on task requirements
- Use web search for current information and research
- Use sandbox tools for code execution, file operations, and system tasks
- Chain multiple tool calls when complex tasks require several steps
- Always provide clear, actionable results to the user
</tool_use_rules>

<available_tools>
**Web & Research Tools:**
- search_web(query: string, k: integer = 5): Search the web for information using Exa API
- fetch_webpage(url: string): Retrieve HTML content from a specific URL

**File Operations (CRUD):**
- create_file(path: string, content: string): Create or write content to a file in sandbox
- read_file(path: string): Read and return contents of a file from sandbox
- update_file(path: string, content: string): Update/overwrite existing file with new content
- delete_file(path: string): Delete a file from the sandbox filesystem

**Folder Operations (CRUD):**
- create_folder(path: string): Create a new directory in the sandbox
- read_folder(path: string = "/workspace"): List contents of a directory
- delete_folder(path: string): Delete a folder and all its contents

**System & Execution Tools:**
- exec(command: string, workdir: string = "/"): Execute shell commands in sandbox
- download_file(url: string, path: string): Download file from URL to sandbox location
- get_download_link(path: string, expiration_seconds: integer = 3600): Get a downloadable URL for a file in sandbox

**Subtask Tool:**
- subtask(goal: string, max_steps: integer = 4): Delegate a sub-problem to a recursive agent instance
</available_tools>

<sandbox_environment>
Sandbox Environment Details:
- Secure E2B sandbox with internet access
- Default working directory: /
- Do not assume a directory exists
- Supports Python, shell commands, and package installation
- Nothing apart from Python is installed. You have to install all dependencies first
- Persistent during task execution, automatically cleaned up after completion
- Use absolute paths for file operations (e.g., /workspace/data/file.txt)
</sandbox_environment>

<response_format>
Return responses in one of these formats:

For tool calls:
{"type":"TOOL_CALL","tool":"tool_name","args":{"param":"value"}}

Example tool call:
{"type":"TOOL_CALL","tool":"search_web","args":{"query":"latest AI developments 2024","k":3}}

For intermediate update responses before finishing the entire task:
{"type":"UPDATE","content":"your detailed response"}

For final responses (chat messages to the user AFTER entire task is completed):
{"type":"RESPONSE","content":"your natural, helpful response to the user"}

For error handling:
{"type":"RESPONSE","content":"I encountered an issue: [explanation and next steps]"}
</response_format>

<task_completion_guidelines>
- Provide thorough, actionable responses
- Ask clarifying questions when needed
- Include relevant code, data, or files when applicable
- Explain your approach and reasoning
- Offer suggestions for next steps or improvements
- Handle errors gracefully with clear explanations
- Ensure all deliverables are accessible and well-documented. To return downloadable files from sandbox, use the get_download_link tool to get its link
- All responses **MUST** be in the format specified in response_format section above
- If response type is "RESPONSE", it should be the final response. Do not return it unless the task is completed or you have to ask clarifying question.
</task_completion_guidelines>

Now let's run this agent on a query
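In code, kicking off the run might look like this (a sketch: we seed the session with the system prompt, which agent_loop skips via [1:], and raise max_steps for long tasks):

sid = mem.new_session()
mem.add(sid, "system", SYSTEM_WITH_TOOLS)  # first history entry, skipped by [1:]
print(agent_loop(mem, llm, sid,
                 "download the first image on webpage https://en.wikipedia.org/wiki/India "
                 "and save it to a folder and compress the folder and give me download link",
                 max_steps=20))  # headroom for multi-step tasks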

User Query: download the first image on webpage https://en.wikipedia.org/wiki/India and save it to a folder and compress the folder and give me download link

Agent:
Step 1: {"type":"UPDATE","content":"I will start by fetching the webpage content from the provided URL to locate the first image. Then, I will download the image, save it to a folder, and compress that folder."}

Step 2: {"type":"TOOL_CALL","tool":"fetch_webpage","args":{"url":"https://en.wikipedia.org/wiki/India"}}

Step 3: {"type":"UPDATE","content":"I have successfully fetched the webpage content. The first image on the page is the flag of India, located at the following URL: `https://upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/1200px-Flag_of_India.svg.png`. I will now download this image, save it to a folder, and then compress that folder."}

Step 4: {"type":"TOOL_CALL","tool":"download_file","args":{"url":"https://upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/1200px-Flag_of_India.svg.png","path":"/workspace/Flag_of_India.png"}}

Step 5: {"type":"UPDATE","content":"The image has been successfully downloaded and saved as `Flag_of_India.png`. I will now create a folder, move the image into that folder, and then compress the folder."}

Step 6: {"type":"TOOL_CALL","tool":"create_folder","args":{"path":"/workspace/India_Flag"}}

Step 7: {"type":"TOOL_CALL","tool":"exec","args":{"command":"mv /workspace/Flag_of_India.png /workspace/India_Flag/","workdir":"/workspace"}}

Step 8: {"type":"TOOL_CALL","tool":"exec","args":{"command":"zip -r /workspace/India_Flag.zip /workspace/India_Flag","workdir":"/workspace"}}

Step 9: {"type":"UPDATE","content":"It seems that the `zip` command is not available in the environment. I will use an alternative method to compress the folder."}

Step 10: {"type":"TOOL_CALL","tool":"exec","args":{"command":"tar -czf /workspace/India_Flag.tar.gz -C /workspace India_Flag","workdir":"/workspace"}}

Step 11: {"type":"TOOL_CALL","tool":"get_download_link","args":{"path":"/workspace/India_Flag.tar.gz"}}

Step 12: {"type":"RESPONSE","content":"The folder containing the image has been successfully compressed. You can download the compressed file using the link below:\n\n[Download India_Flag.tar.gz](https://link-redacted-for-tutorial)\n\nPlease note that the download link is only valid while the sandbox is running."}

Final Response Sent To User:
The folder containing the image has been successfully compressed. You can download the compressed file using the link below:

[Download India_Flag.tar.gz](https://link-redacted-for-tutorial)

This is an almost fully-fledged AI agent. Let's try another complex scenario.

User Query: find a picture of taj mahal from wikipedia and then download and convert it to black and white then give me download link of resulting image

Agent Response:
Step 1: {"type":"TOOL_CALL","tool":"fetch_webpage","args":{"url":"https://en.wikipedia.org/wiki/Taj_Mahal"}}

Step 2: {"type":"TOOL_CALL","tool":"download_file","args":{"url":"https://upload.wikimedia.org/wikipedia/commons/1/1d/Taj_Mahal_%28Edited%29.jpeg","path":"/workspace/taj.jpg"}}

Step 3: {"type":"TOOL_CALL","tool":"exec","args":{"command":"pip install pillow"}}

Step 4: {"type":"TOOL_CALL","tool":"exec","args":{"command":"python3 -c \"from PIL import Image; img = Image.open('/workspace/taj.jpg').convert('L'); img.save('/workspace/taj_bw.jpg')\""}}

Step 5: {"type":"TOOL_CALL","tool":"get_download_link","args":{"path":"/workspace/taj_bw.jpg"}}

Step 6: {"type":"RESPONSE","content":"I successfully found a picture of the Taj Mahal from Wikipedia, downloaded it, converted it to black and white, and created a download link for the resulting image.        

Here's the download link for the black and white version of the Taj Mahal image: [Download Black and White Taj Mahal Image](https://link-redacted-for-tutorial)     

Note: This link is temporary and valid for about 1 hour while the sandbox is running."}

Here is the image at the returned link

Note that our secure sandbox here is not cut off from the internet (as evident from the pip install command it ran). In a production scenario, you would use a pre-configured container that has all necessary libraries installed and then isolate it from the internet.

Sub-Tasks

Notice that large tasks containing loops within loops can be decomposed into subtasks. This abstraction can be achieved by simply adding a new tool named subtask. Subtask is a special tool that is just another instance of the agent itself: the agent executes the subtask by recursively calling itself with the subtask description. Let's see how this would work.

First, we expose another tool called subtask in the agent's system prompt (already listed in the prompt above). Then, the code for the tool is as simple as:

def run_subtask(args, mem=None, llm_func=None):
    """Recursively run the same agent loop on a sub-goal."""
    goal, max_steps = args["goal"], args.get("max_steps", 4)
    sid2 = mem.new_session()
    # seed the fresh session with the system prompt as its first message
    mem.add(sid2, "system", SYSTEM_WITH_TOOLS)
    return agent_loop(mem, llm_func, sid2, f"Subtask: {goal}", max_steps=max_steps)
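The tool is then registered like any other; the agent loop above special-cases this entry so the recursive call receives the shared memory and LLM function:

TOOLS["subtask"] = run_subtask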

Long Context and Planning

We are almost there. In the airbnb walkthrough above, the LLM manages the loop over all images on the webpage (by suggesting the next tool call and its parameters). But such loops can get long, or there can be multiple loops. In general, agentic tasks, when decomposed, can turn into quite long and complex DAGs (Directed Acyclic Graphs). At every step, the agent passes the context of all previous messages to the LLM.

This presents two problems. First, the context can grow beyond the LLM's maximum supported context length. Second, as the context gets longer, the LLM fails to execute the sequence of tasks correctly. Moreover, the plan might change depending on the outputs of previous tool calls; for instance, if an image fails to download, the LLM has to invoke the download tool on the same file again, retrying until the maximum retry limit is reached.

To address this, agents use two tricks. First, when the context grows beyond a certain limit, the agent summarizes the past conversation and uses the summary as context going forward. It can also store certain important information separately so that it is not lost in summarization. For the second problem, agents use something we humans do too: create a to-do list at the start. For both, the agent can use variables or the filesystem as memory. If the filesystem is used to store dynamic but important plans and instructions, the agent is designed to read this file and send it to the LLM whenever necessary (for instance, when the to-do list is updated).
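Here is a sketch of the first trick, context compaction. The character threshold, the number of recent turns kept, and the summarizer prompt are all illustrative assumptions:

def compact(messages, max_chars=60000, keep_last=6):
    """Summarize older turns once the transcript grows past max_chars."""
    if sum(len(m["content"]) for m in messages) <= max_chars:
        return messages
    head, tail = messages[1:-keep_last], messages[-keep_last:]
    summary = llm([
        {"role": "system", "content": "Summarize this agent transcript. "
         "Preserve tool results, file paths, URLs, and pending steps."},
        {"role": "user", "content": json.dumps(head)},
    ])["content"]
    # system prompt + rolling summary + the most recent turns
    return [messages[0],
            {"role": "user", "content": f"Summary of earlier steps: {summary}"}] + tail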

For example, when the user asks a complex query, the system prompt is designed so that the LLM first creates a to-do list of tasks for the query. This to-do list is stored in a temporary file (or an in-memory variable). Before each subsequent LLM call, if the list has changed (because a task was completed or modified), the to-do list is appended to the context. The agent also exposes additional tool definitions that let the LLM edit the to-do list, so items can be marked as complete or added and removed. This enables the LLM to faithfully execute dynamic plans for complex queries.
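And a sketch of the second trick: a to-do list kept in a variable (it could equally live in a file inside the sandbox), with a tool the LLM calls to rewrite it and a helper that re-injects it into the context. The names update_todo and inject_todo are our own, not from any library:

TODO = []

def update_todo(args):
    """Tool the LLM calls to add, remove, or tick off plan items."""
    global TODO
    TODO = args["items"]  # e.g. [{"task": "download images", "done": False}]
    return {"todo": TODO}

TOOLS["update_todo"] = update_todo

def inject_todo(messages):
    # called before each LLM step so the current plan is never lost in context
    if TODO:
        messages.append({"role": "user",
                         "content": "Current to-do list: " + json.dumps(TODO)})
    return messages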

We shall implement these, along with some more common cases like online ordering (which involves interaction with a web browser) and document upload (which may need additional AI models for document-to-markdown conversion), in the next blog post. With that, we shall have a fully functioning agent that can do most of your digital tasks!