Nascent AGI architectures like BabyAGI and AutoGPT have captured a great deal of public interest by demonstrating LLMs' agentic capabilities and capacity for introspective step-by-step reasoning. As proofs-of-concept, they make great strides, but leave a few things wanting. This document is an attempt to sketch an architecture which builds upon and extends the conceptual foundations of the aforementioned products.
The primary contributions I would like to make are twofold:
- Allowing an LLM to read from a corpus of information and write text according to that information.
- Enabling more robust reproducibility and modularity.
A few attributes of the imagined architecture:
- Must be modular and easily extensible (e.g., it must be easy to swap out vector/data stores). See the usage sketch below.
- Is ideally compatible and interoperable with multiple models of GPT-3 quality or better. Practically speaking, this means that whatever prompts are used should ideally be very short and ask the LLM for single-token answers (e.g., from a multiple-choice set of options). In practice, I think that ChatGPT and GPT-4 (and perhaps Claude) are still the best models for this sort of task.
- Must include mechanisms to receive and remember human feedback. A heavyweight (and undesirable) approach is to define a domain-specific language which can be used to reproduce instructions.
- Related to the above point, must be interpretable and deterministic so that users can have some guarantee of reusability. That is, the next time I run an agent flow, I can ideally just load in a set of natural language instructions that it generated from a previous run and it will run in an identical manner.
- Must include some way for the agent to access a corpus of information, i.e., a "Memory" module. Accordingly, must allow the agent to take actions based on information in that memory (e.g., using information in its memory bank to fill out forms, or using information in its memory to reason about future actions to take).
- Must be able to re-evaluate its plan after every step, e.g., maybe an OODA loop.
- Must be easily debuggable, either by keeping a clean audit log of its introspections and the actions it takes, or otherwise.
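The re-evaluation loop and audit log described above could be sketched roughly as follows. This is a hypothetical illustration, not an existing library: `AuditLog`, `run_ooda`, and all of the callback names are assumptions standing in for real components (the `decide` callback is where the LLM call would live).

```python
# Minimal sketch of an OODA-style control loop with an audit log.
# All names here are hypothetical, not part of any existing library.
import json
from datetime import datetime, timezone

class AuditLog:
    """Records every observation, decision, and action for later debugging."""
    def __init__(self):
        self.entries = []

    def record(self, kind, payload):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "kind": kind,        # e.g. "observe", "decide", "act"
            "payload": payload,
        })

    def dump(self):
        return json.dumps(self.entries, indent=2)

def run_ooda(observe, decide, act, is_done, log, max_steps=10):
    """Observe -> decide -> act, re-evaluating the plan after every step."""
    for _ in range(max_steps):
        state = observe()
        log.record("observe", state)
        if is_done(state):
            return state
        action = decide(state)           # the LLM call would live here
        log.record("decide", action)
        result = act(action)
        log.record("act", result)
    return None
```

Because every step passes through `log.record`, the full trace of introspections and actions can be dumped as JSON for debugging after a run.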
An initial concept ontology would include:
- Agents are instantiated to accomplish text objectives.
- Actions are interfaces to non-LLM capabilities that Agents can take.
- Memories are things that the Agent can remember and store that might be useful.
- Tasks are things that need to be completed to accomplish an objective.
- State describes what is in the environment, giving the Agent clues about what to do next.

Some relationships between these concepts:
- Users instantiate Agents. Agents generate and complete Tasks using the Actions at their disposal and their Memory.
- Tasks are discrete objectives which can plausibly be achieved by using an Action and Memory.
- Agents can use Actions, which are exposed to Agents via prompts like "You can use the following Actions: (A) Python interpreter (B) Google API" etc.
- Users can define new Actions. The simplest architecture for an Action is a function, but you would likely want more shared state between the Action and the Agent.
- Users can define new Agents.
- Agents can have a Memory. Memory files can either be structured or unstructured. If structured, they are machine readable and can be processed according to standard methods. If unstructured, then some form of retrieval must be implemented.
- The State of the world is exposed to the Agent, giving it information about its current environment.
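One way to render this ontology in code is as a set of plain dataclasses. This is a hypothetical sketch under the assumption that each concept maps to one class; the field names mirror the descriptions above rather than any existing API.

```python
# Hypothetical dataclass rendering of the concept ontology above.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Action:
    """An interface to a non-LLM capability that the Agent can invoke."""
    name: str
    function: Callable[..., Any]
    function_signature: str
    description: str

@dataclass
class Task:
    """A discrete objective achievable with an Action and Memory."""
    description: str
    done: bool = False

@dataclass
class Memory:
    """Structured records are machine readable; unstructured text needs retrieval."""
    structured: dict = field(default_factory=dict)
    unstructured: list = field(default_factory=list)  # raw text chunks

@dataclass
class State:
    """What the environment currently looks like, as exposed to the Agent."""
    observations: dict = field(default_factory=dict)

@dataclass
class Agent:
    objective: str
    actions: list
    memory: Memory = field(default_factory=Memory)
    tasks: list = field(default_factory=list)
```

Keeping these as dumb data containers, with the planning logic elsewhere, is one way to get the modularity and swappability called for above.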
The agent:
- Comes up with a Task list to accomplish the goal.
- Progressively completes Tasks, taking account of the state of the world after each task is complete.
- Continues until finished.
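The task loop above can be sketched as follows. The helpers `generate_tasks`, `complete_task`, and `observe_state` are hypothetical stand-ins for LLM-backed components; the point of the sketch is that the task list is regenerated against the fresh state of the world after every completed task.

```python
# Sketch of the agent's task loop: plan, complete one task, re-plan.
# generate_tasks / complete_task / observe_state are hypothetical helpers
# that would wrap LLM calls in a real system.
def run_agent(goal, generate_tasks, complete_task, observe_state, max_steps=20):
    tasks = generate_tasks(goal, observe_state())   # initial Task list
    completed = []
    for _ in range(max_steps):
        if not tasks:
            return completed                        # finished
        task = tasks.pop(0)
        completed.append(complete_task(task))
        # Re-plan against the new state of the world after each task.
        tasks = generate_tasks(goal, observe_state(), remaining=tasks)
    return completed
```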
Anyone can build arbitrary Actions according to their APIs. They can be open-sourced or proprietary. Some default Actions might include: querying memory, and asking the user for feedback and clarification.
Ideally all the user would have to do is define a set of Actions that they expose to the Agent, and the magic under the hood routes actions accordingly. I envision something like the below.
```python
agent = Agent(
    actions=[
        Action(
            name="google",
            function=google_search,
            function_signature="...",
            description="Can be used to search Google. Returns a JSON list of results."
        ),
        Action(
            name="selenium_click",
            function=selenium_click,
            function_signature="selenium_click(element)",
            description="Used to click on an element in the Selenium instance."
        ),
        Action(
            name="selenium_type",
            function=selenium_type,
            function_signature="selenium_type(text)",
            description="Used to type something in the Selenium instance."
        ),
    ],
    memory=directory_of_files
)
goal = "Reserve me a spot at restaurant X at 7pm."
agent.run(goal)
```

which would set off a subroutine like:
- Search Google for restaurant X.
- Select the right restaurant link.
- Select the reserve button.
- ... etc.
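The "magic under the hood" that routes actions could work by presenting the Actions to the LLM as a multiple-choice prompt (keeping the answer to a single token, per the compatibility goal above) and then dispatching the chosen letter to the matching function. The sketch below is hypothetical; the real LLM call is elided, and the `dict`-based Action shape is an assumption for brevity.

```python
# Sketch of action routing: expose Actions as a multiple-choice prompt,
# then dispatch the LLM's single-letter answer to the matching function.
# The LLM call itself is elided; "choice" stands in for its answer.
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def build_action_prompt(actions):
    """Render the available Actions as a single-token multiple-choice question."""
    lines = ["You can use the following Actions:"]
    for letter, action in zip(LETTERS, actions):
        lines.append(f"({letter}) {action['name']}: {action['description']}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def dispatch(actions, choice, *args, **kwargs):
    """Route the model's one-letter answer to the corresponding function."""
    index = LETTERS.index(choice.strip().upper())
    return actions[index]["function"](*args, **kwargs)
```

Because the prompt asks for exactly one letter, the same routing should work across any reasonably capable model, which is the interoperability property the architecture asks for.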
Because restaurants' reservation flows can differ dramatically, the Agent must be able to reconsider its objectives after every step, given the new information it is presented with. A built-in Action would probably be to pause and ask for user feedback.
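The built-in "pause and ask the user" Action might look like the following. This is a hypothetical example: `input()` stands in for whatever feedback channel the real system uses, and the dict shape mirrors the Action fields from the usage example above.

```python
# Hypothetical built-in Action: pause the agent and ask the human for feedback.
# input() is a stand-in for the real user-feedback channel.
def ask_user(question, input_fn=input):
    """Pause the agent and return the user's free-text answer."""
    return input_fn(f"[agent needs input] {question}\n> ")

ask_user_action = {
    "name": "ask_user",
    "function": ask_user,
    "function_signature": "ask_user(question)",
    "description": "Pause and ask the human user for feedback or clarification.",
}
```

Injecting the feedback channel as `input_fn` keeps the Action testable and lets the same Action work in a CLI, a web UI, or a test harness.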