Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concept demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potential of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

- Planning
  - Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
  - Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.
- Memory
  - Short-term memory: I would consider all the in-context learning (see Prompt Engineering) as utilizing short-term memory of the model to learn.
  - Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.
- Tool use
  - The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.
Component One: Planning
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.

Task Decomposition
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and sheds light on the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
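To make the search concrete, here is a minimal sketch of a BFS-style Tree of Thoughts, assuming a hypothetical `llm(prompt)` helper that returns a text completion; the prompts, beam width and scoring scheme are illustrative, not taken from the paper:

```python
# Minimal BFS-style Tree of Thoughts sketch.
# `llm` is an assumed helper: prompt string -> completion text.

def propose_thoughts(llm, problem, partial, k=3):
    """Generate k candidate next thought steps for a partial solution."""
    steps = "; ".join(partial) or "(none yet)"
    prompt = f"Problem: {problem}\nSteps so far: {steps}\nPropose one next step:"
    return [llm(prompt) for _ in range(k)]

def score_state(llm, problem, partial):
    """Ask the LLM (as a prompt-based evaluator) to rate progress 0-10."""
    steps = "; ".join(partial)
    try:
        return float(llm(f"Problem: {problem}\nSteps: {steps}\nRate progress 0-10:"))
    except ValueError:
        return 0.0

def tree_of_thoughts_bfs(llm, problem, depth=3, beam=5, k=3):
    frontier = [[]]  # each state is a list of thought steps
    for _ in range(depth):
        candidates = [state + [t]
                      for state in frontier
                      for t in propose_thoughts(llm, problem, state, k)]
        candidates.sort(key=lambda s: score_state(llm, problem, s), reverse=True)
        frontier = candidates[:beam]  # keep only the most promising states
    return frontier[0]
```

A majority-vote evaluator would simply replace `score_state` with several sampled judgments per state.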
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner, which is common in certain robotic setups but not in many other domains.
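The division of labor can be summarized in a short sketch; `llm` and `run_planner` (a wrapper around a classical PDDL planner such as Fast Downward) are assumed helpers, not part of the paper’s code:

```python
# Sketch of the LLM+P pipeline: the LLM handles translation in and out of
# PDDL, while an external classical planner does the actual planning.

def llm_plus_p(llm, run_planner, task_description, domain_pddl):
    # (1) Translate the natural language task into Problem PDDL.
    problem_pddl = llm(
        "Translate this task into Problem PDDL for the given domain.\n"
        f"Domain PDDL:\n{domain_pddl}\nTask: {task_description}"
    )
    # (2) Outsource planning to the classical planner.
    pddl_plan = run_planner(domain_pddl, problem_pddl)
    # (3) Translate the symbolic plan back into natural language.
    return llm(f"Explain this PDDL plan as step-by-step instructions:\n{pddl_plan}")
```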
Self-Reflection
Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.
ReAct (Yao et al. 2023) integrates reasoning and acting within LLM by extending the action space to be a combination of task-specific discrete actions and the language space. The former enables LLM to interact with the environment (e.g. use Wikipedia search API), while the latter prompts LLM to generate reasoning traces in natural language.
The ReAct prompt template incorporates explicit steps for LLM to think, roughly formatted as:

```
Thought: ...
Action: ...
Observation: ...
... (Repeated many times)
```

In both experiments on knowledge-intensive tasks and decision-making tasks, ReAct works better than the Act-only baseline where the Thought: … step is removed.
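In sketch form, a ReAct-style loop alternates between sampling the model and executing parsed actions; the `llm` helper, the `tools` registry and the `Finish` convention below are illustrative assumptions, not the paper’s exact implementation:

```python
import re

# Minimal ReAct-style agent loop (illustrative sketch).
# `llm`: prompt -> completion text; `tools`: maps action names to callables.

def react_agent(llm, tools, question, max_steps=10):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Expect output like: "Thought: ...\nAction: Search[query]"
        output = llm(transcript + "Thought:")
        transcript += "Thought:" + output + "\n"
        match = re.search(r"Action: (\w+)\[(.*)\]", output)
        if match is None:
            break
        name, arg = match.groups()
        if name == "Finish":            # the model decides it has the answer
            return arg
        observation = tools[name](arg)  # e.g. a Wikipedia search API call
        transcript += f"Observation: {observation}\n"
    return transcript
```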
Reflexion (Shinn & Labash 2023) is a framework that equips agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion has a standard RL setup, in which the reward model provides a simple binary reward and the action space follows the setup in ReAct where the task-specific action space is augmented with language to enable complex reasoning steps. After each action $a_t$, the agent computes a heuristic $h_t$ and optionally may decide to reset the environment to start a new trial depending on the self-reflection results.
The heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped. Inefficient planning refers to trajectories that take too long without success. Hallucination is defined as encountering a sequence of consecutive identical actions that lead to the same observation in the environment.
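That stopping rule can be sketched as a simple check over the recent trajectory; the thresholds here are illustrative, not the paper’s values:

```python
# Sketch of Reflexion's stopping heuristic: reset when the trajectory loops
# on identical (action, observation) pairs or runs too long without success.

def should_reset(trajectory, max_steps=30, repeat_threshold=3):
    """trajectory: list of (action, observation) string pairs, oldest first."""
    if len(trajectory) > max_steps:                # inefficient planning
        return True
    tail = trajectory[-repeat_threshold:]
    if len(tail) == repeat_threshold and len(set(tail)) == 1:
        return True                                # hallucinated action loop
    return False
```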
Self-reflection is created by showing two-shot examples to LLM, where each example is a pair of (failed trajectory, ideal reflection for guiding future changes in the plan). Then reflections are added into the agent’s working memory, up to three, to be used as context for querying LLM.
Chain of Hindsight (CoH; Liu et al. 2023) encourages the model to improve on its own outputs by explicitly presenting it with a sequence of past outputs, each annotated with feedback. Human feedback data is a collection of $D_h = \{(x, y_i, r_i, z_i)\}_{i=1}^n$, where $x$ is the prompt, each $y_i$ is a model completion, $r_i$ is the human rating of $y_i$, and $z_i$ is the corresponding human-provided hindsight feedback. Assume the feedback tuples are ranked by reward, $r_n \geq r_{n-1} \geq \dots \geq r_1$. The process is supervised fine-tuning where the data is a sequence in the form of $\tau_h = (x, z_i, y_i, z_j, y_j, \dots, z_n, y_n)$, where $1 \leq i \leq j \leq n$. The model is fine-tuned to only predict $y_n$ conditioned on the sequence prefix, such that the model can self-reflect to produce better output based on the feedback sequence. The model can optionally receive multiple rounds of instructions with human annotators at test time.
To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens during training.
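Concretely, a training sequence can be assembled by ranking the feedback tuples and interleaving feedback with completions; a minimal sketch (tokenization, masking and the regularization term are omitted):

```python
# Sketch: build a Chain-of-Hindsight training example. The model is trained
# to predict the best completion y_n given the prefix (x, z_1, y_1, ..., z_n).

def build_coh_example(prompt, completions):
    """completions: list of (y, r, z) = (output, human rating, feedback)."""
    ranked = sorted(completions, key=lambda t: t[1])  # ascending reward
    prefix = [prompt]
    for y, _, z in ranked[:-1]:
        prefix.extend([z, y])          # feedback followed by the completion
    y_n, _, z_n = ranked[-1]
    prefix.append(z_n)                 # feedback for the best completion
    return "\n".join(prefix), y_n      # (input sequence, supervision target)
```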
The training dataset in their experiments is a combination of WebGPT comparisons, summarization from human feedback and human preference dataset.
The idea of CoH is to present a history of sequentially improved outputs in context and train the model to take on the trend to produce better outputs. Algorithm Distillation (AD; Laskin et al. 2023) applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy. Considering that an agent interacts with the environment many times and in each episode the agent gets a little better, AD concatenates this learning history and feeds that into the model. Hence we should expect the next predicted action to lead to better performance than previous trials. The goal is to learn the process of RL instead of training a task-specific policy itself.
(Image source: Laskin et al. 2023)
The paper hypothesizes that any algorithm that generates a set of learning histories can be distilled into a neural network by performing behavioral cloning over actions. The history data is generated by a set of source policies, each trained for a specific task. At the training stage, during each RL run, a random task is sampled and a subsequence of multi-episode history is used for training, such that the learned policy is task-agnostic.
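A sketch of the data construction, assuming a flat record of the source policy’s training run (helper names are illustrative):

```python
# Sketch of Algorithm Distillation training pairs: condition on a long
# multi-episode history and behavior-clone the next action.

def make_ad_examples(history, context_len):
    """history: list of (state, action, reward) tuples, in the order they
    were generated while the source policy was learning (across episodes)."""
    examples = []
    for t in range(context_len, len(history)):
        context = history[t - context_len:t]  # spans several episodes
        state, target_action, _ = history[t]
        examples.append((context, state, target_action))
    return examples
```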
In reality, the model has a limited context window length, so episodes should be short enough to construct multi-episode history. Multi-episodic contexts of 2-4 episodes are necessary to learn a near-optimal in-context RL algorithm. The emergence of in-context RL requires a long enough context.
\nIn comparison with three baselines, including ED (expert - distillation, behavior cloning with expert trajectories instead of learning - history), source policy (used for generating trajectories for distillation - by UCB), - RL^2 (Duan et al. 2017; used - as upper bound since it needs online RL), AD demonstrates in-context RL with - performance getting close to RL^2 despite only using offline RL and learns - much faster than other baselines. When conditioned on partial training history - of the source policy, AD also improves much faster than ED baseline.
(Image source: Laskin et al. 2023)
Component Two: Memory
(Big thank you to ChatGPT for helping me draft this section. I’ve learned a lot about the human brain and data structures for fast MIPS in my conversations with ChatGPT.)
Types of Memory
Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in human brains.
- Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information (visual, auditory, etc) after the original stimuli have ended. Sensory memory typically only lasts for up to a few seconds. Subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch).
- Short-Term Memory (STM) or Working Memory: It stores information that we are currently aware of and need to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have the capacity of about 7 items (Miller 1956) and lasts for 20-30 seconds.
- Long-Term Memory (LTM): Long-term memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. There are two subtypes of LTM:
  - Explicit / declarative memory: This is memory of facts and events, and refers to those memories that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
  - Implicit / procedural memory: This type of memory is unconscious and involves skills and routines that are performed automatically, like riding a bike or typing on a keyboard.
We can roughly consider the following mappings:

- Sensory memory as learning embedding representations for raw inputs, including text, image or other modalities;
- Short-term memory as in-context learning. It is short and finite, as it is restricted by the finite context window length of Transformer.
- Long-term memory as the external vector store that the agent can attend to at query time, accessible via fast retrieval.
Maximum Inner Product Search (MIPS)
The external memory can alleviate the restriction of finite attention span. A standard practice is to save the embedding representation of information into a vector store database that can support fast maximum inner-product search (MIPS). To optimize the retrieval speed, the common choice is an approximate nearest neighbors (ANN) algorithm that returns approximately the top k nearest neighbors, trading a little accuracy for a huge speedup.

A couple of common choices of ANN algorithms for fast MIPS:

- LSH (Locality-Sensitive Hashing): It introduces a hashing function such that similar input items are mapped to the same buckets with high probability, where the number of buckets is much smaller than the number of inputs. (See the toy sketch after this section.)
- ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure is a set of random projection trees, binary trees where each non-leaf node represents a hyperplane splitting the input space in half and each leaf stores one data point. Trees are built independently and at random, so to some extent, it mimics a hashing function. ANNOY search happens in all the trees to iteratively search through the half that is closest to the query and then aggregates the results. The idea is quite related to KD tree but a lot more scalable.
- HNSW (Hierarchical Navigable Small World): It is inspired by the idea of small world networks where most nodes can be reached from any other node within a small number of steps; e.g. the “six degrees of separation” feature of social networks. HNSW builds hierarchical layers of these small-world graphs, where the bottom layer contains the actual data points. The layers in the middle create shortcuts to speed up search. When performing a search, HNSW starts from a random node in the top layer and navigates towards the target. When it can’t get any closer, it moves down to the next layer, until it reaches the bottom layer. Each move in the upper layers can potentially cover a large distance in the data space, and each move in the lower layers refines the search quality.
- FAISS (Facebook AI Similarity Search): It operates on the assumption that in high dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clustering of data points. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within clusters. Search first looks for cluster candidates with coarse quantization and then further looks into each cluster with finer quantization.
- ScaNN (Scalable Nearest Neighbors): The main innovation in ScaNN is anisotropic vector quantization. It quantizes a data point $x_i$ to $\tilde{x}_i$ such that the inner product $\langle q, \tilde{x}_i \rangle$ is as close to the original $\langle q, x_i \rangle$ as possible, instead of picking the closest quantization centroid point.

Check more MIPS algorithms and performance comparison in ann-benchmarks.com.
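For intuition, here is a toy random-projection LSH index in the spirit of the first bullet above (a sketch, not any particular library’s API). It assumes L2-normalized embeddings so that maximum inner product coincides with cosine similarity; production MIPS systems handle unnormalized vectors with asymmetric transformations:

```python
import numpy as np

# Toy random-projection LSH for approximate nearest neighbor search.
# Assumes embeddings are L2-normalized so MIPS reduces to cosine similarity.

class RandomProjectionLSH:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.buckets = {}

    def _hash(self, v):
        # One bit per hyperplane: which side the vector falls on.
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, key, v):
        self.buckets.setdefault(self._hash(v), []).append((key, v))

    def query(self, q, k=5):
        # Only rank candidates in the query's bucket (approximate search),
        # with an exact inner-product rerank inside the bucket.
        candidates = self.buckets.get(self._hash(q), [])
        scored = sorted(candidates, key=lambda kv: -np.dot(kv[1], q))
        return [key for key, _ in scored[:k]]
```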
Component Three: Tool Use
Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify and utilize external objects to do things that go beyond our physical and cognitive limits. Equipping LLMs with external tools can significantly extend the model’s capabilities.
MRKL (Karpas et al. 2022), short for “Modular Reasoning, Knowledge and Language”, is a neuro-symbolic architecture for autonomous agents. A MRKL system is proposed to contain a collection of “expert” modules, while the general-purpose LLM works as a router to route inquiries to the best suitable expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. math calculator, currency converter, weather API).
They did an experiment on fine-tuning LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because LLMs (7B Jurassic1-large model) failed to extract the right arguments for the basic arithmetic reliably. The results highlight that even when external symbolic tools can work reliably, knowing when and how to use the tools is crucial, and that is determined by the LLM’s capability.
Both TALM (Tool Augmented Language Models; Parisi et al. 2022) and Toolformer (Schick et al. 2023) fine-tune a LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation can improve the quality of model outputs. See more details in the “External APIs” section of Prompt Engineering.
ChatGPT Plugins and OpenAI API function calling are good examples of LLMs augmented with tool use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls).
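As a minimal sketch of the function calling flow (using the 2023-era `openai` Python ChatCompletion interface; the weather function and its schema are made-up examples):

```python
import openai  # 2023-era openai-python ChatCompletion interface

# Tools are declared as JSON-schema function specs.
functions = [{
    "name": "get_current_weather",          # hypothetical example tool
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
        },
        "required": ["location"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call a tool
)
# If the model chose to call the function, the name and JSON-encoded
# arguments are in response["choices"][0]["message"]["function_call"].
```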
HuggingGPT (Shen et al. 2023) is a framework that uses ChatGPT as the task planner to select models available on the HuggingFace platform according to the model descriptions and to summarize the response based on the execution results.
The system comprises 4 stages:
(1) Task planning: LLM works as the brain and parses the user requests into multiple tasks. There are four attributes associated with each task: task type, ID, dependencies, and arguments. They use few-shot examples to guide LLM to do task parsing and planning.
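Each parsed task carries the four attributes above; an illustrative example of what such a machine-readable plan can look like (field values are made up, not copied from the paper’s prompt):

```python
# Illustrative HuggingGPT-style task-planning output: each task records its
# type, id, dependency task ids, and arguments. "<GENERATED>-0" marks a
# resource produced by the task with id 0.
parsed_tasks = [
    {"task": "image-to-text", "id": 0, "dep": [-1],
     "args": {"image": "example.jpg"}},
    {"task": "text-to-image", "id": 1, "dep": [0],
     "args": {"text": "<GENERATED>-0"}},
]
```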
(2) Model selection: LLM distributes the tasks to expert models, where the request is framed as a multiple-choice question. LLM is presented with a list of models to choose from. Due to the limited context length, task-type-based filtration is needed.
(3) Task execution: Expert models execute on the specific tasks and log results.
(4) Response generation: LLM receives the execution results and provides summarized results to users.
To put HuggingGPT into real world usage, a couple of challenges need to be solved: (1) efficiency improvement is needed as both LLM inference rounds and interactions with other models slow down the process; (2) it relies on a long context window to communicate over complicated task content; (3) stability of LLM outputs and of external model services needs improvement.
API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, calculator, calendar queries, smart home control, schedule management, health data management, account authentication workflow and more. Because there are a large number of APIs, LLM first has access to an API search engine to find the right API to call and then uses the corresponding documentation to make a call.
In the API-Bank workflow, LLMs need to make a couple of decisions and at each step we can evaluate how accurate that decision is. Decisions include:
- Whether an API call is needed.
- Identify the right API to call: if not good enough, LLMs need to iteratively modify the API inputs (e.g. deciding search keywords for Search Engine API).
- Response based on the API results: the model can choose to refine and call again if results are not satisfying.
This benchmark evaluates the agent’s tool use capabilities at three levels:

- Level-1 evaluates the ability to call the API. Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns.
- Level-2 examines the ability to retrieve the API. The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation.
- Level-3 assesses the ability to plan API calls beyond retrieving and calling. Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it.
Case Studies

Scientific Discovery Agent
ChemCrow (Bran et al. 2023) is a domain-specific example in which LLM is augmented with 13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain, reflects what was previously described in the ReAct and MRKL sections and combines CoT reasoning with tools relevant to the tasks:

- The LLM is provided with a list of tool names, descriptions of their utility, and details about the expected input/output.
- It is then instructed to answer a user-given prompt using the tools provided when necessary. The instruction suggests the model follow the ReAct format: Thought, Action, Action Input, Observation.
One interesting observation is that while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations by experts, oriented towards the completion and chemical correctness of the solutions, showed that ChemCrow outperforms GPT-4 by a large margin. This indicates a potential problem with using LLM to evaluate its own performance on domains that require deep expertise. The lack of expertise may cause LLMs not to know their own flaws, and thus they cannot judge the correctness of task results well.
Boiko et al. (2023) also looked into LLM-empowered agents for scientific discovery, to handle autonomous design, planning, and performance of complex scientific experiments. This agent can use tools to browse the Internet, read documentation, execute code, call robotics experimentation APIs and leverage other LLMs.
For example, when requested to "develop a novel anticancer drug", the model came up with the following reasoning steps:

- inquired about current trends in anticancer drug discovery;
- selected a target;
- requested a scaffold targeting these compounds;
- once the compound was identified, the model attempted its synthesis.
They also discussed the risks, especially with illicit drugs and bioweapons. They developed a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted to obtain a synthesis solution and the agent attempted to consult documentation to execute the procedure. 7 out of 11 were rejected and among these 7 rejected cases, 5 happened after a Web search while 2 were rejected based on prompt only.
Generative Agents Simulation
Generative Agents (Park, et al. 2023) is a super fun experiment where 25 virtual characters, each controlled by a LLM-powered agent, are living and interacting in a sandbox environment, inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications.
The design of generative agents combines LLM with memory, planning and reflection mechanisms to enable agents to behave conditioned on past experience, as well as to interact with other agents.
- Memory stream: a long-term memory module (external database) that records a comprehensive list of agents’ experience in natural language.
  - Each element is an observation, an event directly provided by the agent. Inter-agent communication can trigger new natural language statements.
- Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance (see the scoring sketch after this list).
  - Recency: recent events have higher scores.
  - Importance: distinguishes mundane from core memories; ask LM directly.
  - Relevance: based on how related it is to the current situation / query.
- Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events (note that this is a bit different from self-reflection above).
  - Prompt LM with the 100 most recent observations to generate the 3 most salient high-level questions given a set of observations/statements, then ask LM to answer those questions.
- Planning & Reacting: translate the reflections and the environment information into actions.
  - Planning is essentially about optimizing believability at the moment vs over time.
  - Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
  - Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting.
  - Environment information is presented in a tree structure.
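A sketch of the retrieval scoring from the list above, combining recency, importance and relevance into one ranking score; the weights and the exponential decay rate are illustrative assumptions:

```python
import math

# Sketch: rank memories by a weighted sum of recency, importance and
# relevance, then surface the top-k into the agent's context.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieval_score(memory, now, query_embedding,
                    w_rec=1.0, w_imp=1.0, w_rel=1.0, decay=0.99):
    hours = (now - memory["last_accessed"]) / 3600
    recency = decay ** hours                   # exponential time decay
    importance = memory["importance"] / 10     # LM-rated on a 1-10 scale
    relevance = cosine(memory["embedding"], query_embedding)
    return w_rec * recency + w_imp * importance + w_rel * relevance

def retrieve(memories, now, query_embedding, k=5):
    ranked = sorted(memories,
                    key=lambda m: retrieval_score(m, now, query_embedding),
                    reverse=True)
    return ranked[:k]
```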

This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing the conversation topic) and coordination of social events (e.g. host a party and invite many others).
Proof-of-Concept Examples
AutoGPT has drawn a lot of attention to the possibility of setting up autonomous agents with LLM as the main controller. It has quite a lot of reliability issues given the natural language interface, but is nevertheless a cool proof-of-concept demo. A lot of code in AutoGPT is about format parsing.
Here is the system message used by AutoGPT, where {{...}} are user inputs:

```
You are {{ai-name}}, {{user-provided AI bot description}}.
Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. {{user-provided goal 1}}
2. {{user-provided goal 2}}
3. ...
4. ...
5. ...

Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"
5. Use subprocesses for commands that will not terminate within a few minutes

Commands:
1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
5. List GPT Agents: "list_agents", args:
6. Delete GPT Agent: "delete_agent", args: "key": "<key>"
7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>"
8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
9. Read file: "read_file", args: "file": "<file>"
10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
11. Delete file: "delete_file", args: "file": "<file>"
12. Search Files: "search_files", args: "directory": "<directory>"
13. Analyze Code: "analyze_code", args: "code": "<full_code_string>"
14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
16. Execute Python File: "execute_python_file", args: "file": "<file>"
17. Generate Image: "generate_image", args: "prompt": "<prompt>"
18. Send Tweet: "send_tweet", args: "text": "<text>"
19. Do Nothing: "do_nothing", args:
20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below
Response Format:
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
}
Ensure the response can be parsed by Python json.loads
```
GPT-Engineer is another project to create a whole repository of code given a task specified in natural language. GPT-Engineer is instructed to think over a list of smaller components to build and to ask for user input to clarify questions as needed.
Here is a sample conversation for task clarification sent to the OpenAI ChatCompletion endpoint used by GPT-Engineer. The user inputs are wrapped in {{user input text}}.

```json
[
  {
    "role": "system",
    "content": "You will read instructions and not carry them out, only seek to clarify them.\nSpecifically you will first summarise a list of super short bullets of areas that need clarification.\nThen you will pick one clarifying question, and wait for an answer from the user.\n"
  },
  {
    "role": "user",
    "content": "We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}}\n"
  },
  {
    "role": "assistant",
    "content": "Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?"
  },
  {
    "role": "user",
    "content": "{{There are 10 levels in total. The main character is a plumber named Mario, who can walk and jump. It is a classical platform game just like Super Mario. The main character moves from left to right, trying to get to the destination, where there are many obstacles and attacks from enemies in the process.}}\n\nIs anything else unclear? If yes, only answer in the form:\n{remaining unclear areas} remaining questions.\n{Next question}\nIf everything is sufficiently clear, only answer \"Nothing more to clarify.\"."
  },
  {
    "role": "assistant",
    "content": "Remaining unclear areas: 2 remaining questions.\nCan you provide more information about how the MVC components are split into separate files?"
  },
  {
    "role": "user",
    "content": "{{Make your own assumptions and state them explicitly before starting}}"
  }
]
```
Then after these clarifications, the agent moved into the code writing mode with a different system message.
System message:

```
Think step by step and reason yourself to the right decisions to make sure we get it right.
You will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.

Then you will output the content of each file including ALL code.
Each file must strictly follow a markdown code block format, where the following tokens must be replaced such that
FILENAME is the lowercase file name including the file extension,
LANG is the markup code block language for the code's language, and CODE is the code:

FILENAME
CODE

You will start with the "entrypoint" file, then go to the ones that are imported by that file, and so on.
Please note that the code should be fully functional. No placeholders.

Follow a language and framework appropriate best practice file naming convention.
Make sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.
Ensure to implement all code, if you are unsure, write a plausible implementation.
Include module dependency or package manager dependency definition file.
Before you finish, double check that all parts of the architecture is present in the files.

Useful to know:
You almost always put different classes in different files.
For Python, you always create an appropriate requirements.txt file.
For NodeJS, you always create an appropriate package.json file.
You always add a comment briefly describing the purpose of the function definition.
You try to add comments explaining very complex bits of logic.
You always follow the best practices for the requested languages in terms of describing the code written as a defined
package/project.

Python toolbelt preferences:
- pytest
- dataclasses
```

Conversation samples:
```
[
  {
    "role": "system",
    "content": "You will get instructions for code to write.\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\nMake sure that every detail of the architecture is, in the end, implemented as code.\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nPlease note that the code should be fully functional. No placeholders.\n\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.\nEnsure to implement all code, if you are unsure, write a plausible implementation.\nInclude module dependency or package manager dependency definition file.\nBefore you finish, double check that all parts of the architecture is present in the files.\n\nUseful to know:\nYou almost always put different classes in different files.\nFor Python, you always create an appropriate requirements.txt file.\nFor NodeJS, you always create an appropriate package.json file.\nYou always add a comment briefly describing the purpose of the function definition.\nYou try to add comments explaining very complex bits of logic.\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\npackage/project.\n\n\nPython toolbelt preferences:\n- pytest\n- dataclasses\n"
  },
  # … same conversation as earlier, ended with "Make your own assumptions and state them explicitly before starting".
  {
    "role": "assistant",
    "content": "Assumptions:\n1. Model: The model will contain the game's data, such as level information, character states, and enemy positions.\n2. View: The view will handle the game's visuals, including rendering the game objects, backgrounds, and updating the display.\n3. Controller: The controller will manage user input, such as keyboard controls, and update the model accordingly.\n\nWith these assumptions in mind, is there any specific aspect of the keyboard control implementation you would like me to clarify?"
  },
  {
    "role": "user",
    "content": "Please now remember the steps:\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nFirst lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nPlease note that the code should be fully functional. No placeholders.\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.\nBefore you finish, double check that all parts of the architecture is present in the files.\n"
  }
]
```
Challenges
After going through key ideas and demos of building LLM-centered agents, I started to see a couple of common limitations:
- Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as strong as full attention.
- Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.
- Reliability of natural language interface: The current agent system relies on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refuse to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output.
Citation
Cited as:

Weng, Lilian. (Jun 2023). “LLM-powered Autonomous Agents”. Lil’Log. https://lilianweng.github.io/posts/2023-06-23-agent/.
Or

```
@article{weng2023agent,
  title   = "LLM-powered Autonomous Agents",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2023",
  month   = "Jun",
  url     = "https://lilianweng.github.io/posts/2023-06-23-agent/"
}
```
References
[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022.

[2] Yao et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv preprint arXiv:2305.10601 (2023).

[3] Liu et al. “Chain of Hindsight Aligns Language Models with Feedback.” arXiv preprint arXiv:2302.02676 (2023).

[4] Liu et al. “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency.” arXiv preprint arXiv:2304.11477 (2023).

[5] Yao et al. “ReAct: Synergizing reasoning and acting in language models.” ICLR 2023.

[6] Google Blog. “Announcing ScaNN: Efficient Vector Similarity Search.” July 28, 2020.

[7] https://chat.openai.com/share/46ff149e-a4c7-4dd7-a800-fc4a642ea389

[8] Shinn & Labash. “Reflexion: an autonomous agent with dynamic memory and self-reflection.” arXiv preprint arXiv:2303.11366 (2023).

[9] Laskin et al. “In-context Reinforcement Learning with Algorithm Distillation.” ICLR 2023.

[10] Karpas et al. “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.” arXiv preprint arXiv:2205.00445 (2022).

[11] Nakano et al. “WebGPT: Browser-assisted question-answering with human feedback.” arXiv preprint arXiv:2112.09332 (2021).

[12] Parisi et al. “TALM: Tool Augmented Language Models.”

[13] Schick et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv preprint arXiv:2302.04761 (2023).

[14] Weaviate Blog. “Why is Vector Search so fast?” Sep 13, 2022.

[15] Li et al. “API-Bank: A Benchmark for Tool-Augmented LLMs.” arXiv preprint arXiv:2304.08244 (2023).

[16] Shen et al. “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace.” arXiv preprint arXiv:2303.17580 (2023).

[17] Bran et al. “ChemCrow: Augmenting large-language models with chemistry tools.” arXiv preprint arXiv:2304.05376 (2023).

[18] Boiko et al. “Emergent autonomous scientific research capabilities of large language models.” arXiv preprint arXiv:2304.05332 (2023).

[19] Joon Sung Park, et al. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv preprint arXiv:2304.03442 (2023).

[20] AutoGPT. https://github.com/Significant-Gravitas/Auto-GPT

[21] GPT-Engineer. https://github.com/AntonOsika/gpt-engineer