Study Notes - Agents Course Unit 1
status: Published
slug: agent-course-unit
type: Post
category: Technology
date: Feb 16, 2025 → Feb 20, 2025
tags: LLM, Notes, AI Agent, Course
summary: Study notes for Unit 1 of the official Hugging Face AI Agents course, covering the concept, principles, and implementation of AI Agents
The official Hugging Face AI Agents course, link: https://huggingface.co/learn/agents-course/unit0/introduction
Unit1 Introduction to Agents
What is an Agent?
- an AI model capable of reasoning, planning, and interacting with its environment.

- Formal definition: An Agent is a system that leverages an AI model to interact with its environment in order to achieve a user-defined objective. It combines reasoning, planning, and the execution of actions (often via external tools) to fulfill tasks.
- Two main parts
- The Brain: the AI model (LLM), responsible for reasoning and planning
- The Body: capabilities and tools, responsible for acting
- Summary
- Understand natural language: Interpret and respond to human instructions in a meaningful way.
- Reason and plan: Analyze information, make decisions, and devise strategies to solve problems.
- Interact with its environment: Gather information, take actions, and observe the results of those actions.
What are LLMs
- The brain of the Agent
- a type of AI model that excels at understanding and generating human language.
- trained on vast amounts of text data
- consist of many millions of parameters
- Mainstream architecture: Transformer
- Key Aspect: Attention
- identify the most relevant words to predict the next token
| Type | Example | Use Cases | Typical Size |
| --- | --- | --- | --- |
| Encoders | BERT from Google | Text classification, semantic search, Named Entity Recognition | Millions of parameters |
| Decoders | Llama from Meta | Text generation, chatbots, code generation | Billions (10^9) of parameters |
| Seq2Seq (Encoder + Decoder) | T5, BART | Translation, summarization, paraphrasing | Millions of parameters |
- token: the unit of information an LLM works with
- e.g., interest, ing, ed
- special tokens:
- The LLM uses these tokens to open and close the structured components of its generation.
- End of sequence token (EOS)
- prompt: the input sequence that is provided to an LLM
- Careful design of the prompt makes it easier to guide the generation of the LLM toward the desired output.
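The subword tokens mentioned above can be illustrated with a toy greedy longest-match tokenizer (a simplification: real tokenizers such as BPE are learned from data, and the vocabulary here is made up):

```python
# Toy illustration (not a real LLM tokenizer): greedy longest-match
# splitting of a word into subword tokens, mimicking how "interesting"
# can become "interest" + "ing".
VOCAB = {"interest", "ing", "ed", "s", "in", "ter", "est"}

def tokenize(word: str) -> list[str]:
    """Split `word` into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("interesting"))  # ['interest', 'ing']
print(tokenize("interested"))   # ['interest', 'ed']
```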
Next token prediction
- Autoregressive
- the output of each step becomes part of the input for the next
- Decode strategies
- Greedy decoding: always select the token with the maximum score
- Beam Search: explore multiple candidate sequences to find the one with the maximum total score
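The two decoding strategies can be sketched over a toy next-token probability table (the probabilities are made up for illustration); note how beam search can find a higher-scoring sequence that greedy decoding misses:

```python
import math

# Toy next-token distributions, keyed by the sequence so far.
# All numbers are assumptions for illustration only.
PROBS = {
    ("<s>",): {"the": 0.5, "a": 0.4},
    ("<s>", "the"): {"cat": 0.3, "dog": 0.3},
    ("<s>", "a"): {"cat": 0.9, "dog": 0.1},
}

def greedy(seq):
    """Always take the single highest-probability next token."""
    while tuple(seq) in PROBS:
        dist = PROBS[tuple(seq)]
        seq = seq + [max(dist, key=dist.get)]
    return seq

def beam_search(seq, width=2):
    """Keep the `width` best partial sequences by total log-probability."""
    beams = [(0.0, seq)]  # (total log-prob, sequence)
    while any(tuple(s) in PROBS for _, s in beams):
        candidates = []
        for score, s in beams:
            if tuple(s) not in PROBS:  # finished sequence: keep as-is
                candidates.append((score, s))
                continue
            for tok, p in PROBS[tuple(s)].items():
                candidates.append((score + math.log(p), s + [tok]))
        beams = sorted(candidates, reverse=True)[:width]
    return beams[0][1]

print(greedy(["<s>"]))       # greedy commits to "the" first (0.5)
print(beam_search(["<s>"]))  # beam search finds "a cat" (0.4 * 0.9 total)
```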
LLMs’ training
- pre-training: trained on a large dataset with self-supervised learning
- learns the structure of the language and underlying patterns in text, allowing the model to generalize to unseen data.
- fine-tuning: supervised learning on a custom dataset
- perform specific task
Messages and Special Tokens
- how LLMs structure their generations through chat templates
- Chat templates: the bridge between conversational messages (user and assistant turns) and the specific formatting requirements of your chosen LLM
Messages: The Underlying System of LLMs
- System Messages: define how the model should behave.
- persistent instructions
- Effect: gives information about the available tools, provides instructions to the model on how to format the actions to take, and includes guidelines on how the thought process should be
- Conversations: User and Assistant Messages
- exchange between the user (Human) and the assistant (LLM)
- different models have different templates
Chat-Templates
- Base models: trained on raw text data to predict the next token.
- Instruct models: fine-tuned specifically to follow instructions and engage in conversations.
- Chat templates: format prompts in a consistent way
- Conversations → Prompts
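What a chat template does can be sketched by hand (the ChatML-style `<|im_start|>`/`<|im_end|>` markers below are one common convention, not universal; in practice each model ships its own template, applied via the tokenizer):

```python
# A hand-written sketch of what a chat template does: turn a list of
# role/content messages into the single prompt string the model sees.
def apply_chat_template(messages, add_generation_prompt=True):
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"  # cue the model to answer next
    return prompt

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an Agent?"},
]
print(apply_chat_template(messages))
```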
What are Tools?
- A Tool is a function given to the LLM.
- common tools
| Tool | Description |
| --- | --- |
| Web Search | Allows the agent to fetch up-to-date information from the internet. |
| Image Generation | Creates images based on text descriptions. |
| Retrieval | Retrieves information from an external source. |
| API Interface | Interacts with an external API (GitHub, YouTube, Spotify, etc.). |
- A Tool contains:
- A textual description of what the function does.
- A Callable (something to perform an action).
- Arguments with typings.
- (Optional) Outputs with typings.
- Why tools are needed: LLMs predict the completion of a prompt based only on their training data.
- their internal knowledge only includes events prior to their training ⇒ if your agent needs up-to-date data you must provide it through some tool.
- Hallucination problem: the model may make things up when it lacks the needed knowledge
- How do tools work?
- Teach the LLM about the existence of tools
- LLM will generate text, in the form of code, to invoke that tool
- The output from the tool will then be sent back to the LLM, which will compose its final response for the user.
- Tool class:
- core function:
- to_string(): converts the tool’s attributes into a textual representation
- __call__(): Calls the function when the tool instance is invoked.
- Summary:
- What Tools Are: Functions that give LLMs extra capabilities, such as performing calculations or accessing external data.
- How to Define a Tool: By providing a clear textual description, inputs, outputs, and a callable function.
- Why Tools Are Essential: They enable Agents to overcome the limitations of static model training, handle real-time tasks, and perform specialized actions.
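The Tool class described above, with its `to_string()` and `__call__()` methods, can be sketched as follows (the field names and the `to_string()` output format are assumptions modeled on the course, not a real library API):

```python
# A minimal sketch of a Tool class: a description the LLM can read,
# plus a callable the runtime can execute.
class Tool:
    def __init__(self, name, description, func, arguments, outputs):
        self.name = name
        self.description = description
        self.func = func
        self.arguments = arguments  # list of (arg_name, type) pairs
        self.outputs = outputs

    def to_string(self) -> str:
        """Textual representation injected into the system prompt."""
        args = ", ".join(f"{n}: {t}" for n, t in self.arguments)
        return (f"Tool Name: {self.name}, Description: {self.description}, "
                f"Arguments: {args}, Outputs: {self.outputs}")

    def __call__(self, *args, **kwargs):
        """Invoke the underlying function when the tool instance is called."""
        return self.func(*args, **kwargs)

def calculator(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

tool = Tool("calculator", "Multiply two integers.", calculator,
            [("a", "int"), ("b", "int")], "int")
print(tool.to_string())
print(tool(6, 7))  # 42
```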
Thought-Action-Observation Cycle
- Agents work in a continuous cycle: thinking (Thought) → acting (Action) → observing (Observation).
- Thought: The LLM part of the Agent decides what the next step should be.
- Action: The agent takes an action, by calling the tools with the associated arguments.
- Observation: The model reflects on the response from the tool.
- In many Agent frameworks, the rules and guidelines are embedded directly into the system prompt.
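The cycle can be sketched as a loop, with stub functions standing in for the LLM and the tools (`llm_decide` and `run_tool` are hypothetical placeholders, not a real framework's API):

```python
# A schematic Thought-Action-Observation loop. A real agent would call
# an LLM and real tools; here both are stubbed for illustration.
def llm_decide(history):
    """Stub 'Thought': decide the next step from the context so far."""
    if "Observation" in history[-1]:
        return {"action": "final_answer", "args": history[-1]}
    return {"action": "get_weather", "args": "London"}

def run_tool(action):
    """Stub 'Action': execute the chosen tool with its arguments."""
    return f"Observation: the weather in {action['args']} is sunny."

def agent_loop(task, max_steps=5):
    history = [task]
    for _ in range(max_steps):
        thought = llm_decide(history)       # Thought
        if thought["action"] == "final_answer":
            return thought["args"]
        observation = run_tool(thought)     # Action
        history.append(observation)         # Observation feeds back in
    return "Stopped: step limit reached."

print(agent_loop("What is the weather in London?"))
```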
Thought: Internal Reasoning and the Re-Act Approach
- What are thoughts?
- The Agent’s internal reasoning and planning processes to solve the task
- Re-Act approach: the concatenation of “Reasoning” (Think) with “Acting” (Act)
- A simple technique that appends “Let’s think step by step” to the prompt
- Encourages the model to generate a plan
- Encourages the model to decompose the problem into sub-tasks
- Produces fewer errors than generating the final solution directly
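In code, this trick amounts to a trivial prompt transformation (the helper name is made up for illustration):

```python
# Zero-shot chain-of-thought: append a fixed suffix that nudges the
# model to reason before answering.
def add_reasoning_suffix(prompt: str) -> str:
    return prompt + "\nLet's think step by step."

print(add_reasoning_suffix("If I have 3 apples and eat one, how many remain?"))
```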
Actions: Enabling the Agent to Engage with Its Environment
- Types of Agent Actions
- By action format
- By purpose
| Type of Agent | Description |
| --- | --- |
| JSON Agent | The action to take is specified in JSON format. |
| Code Agent | The Agent writes a code block that is interpreted externally. |
| Function-calling Agent | A subcategory of the JSON Agent, fine-tuned to generate a new message for each action. |

| Type of Action | Description |
| --- | --- |
| Information Gathering | Performing web searches, querying databases, or retrieving documents. |
| Tool Usage | Making API calls, running calculations, and executing code. |
| Environment Interaction | Manipulating digital interfaces or controlling physical devices. |
| Communication | Engaging with users via chat or collaborating with other agents. |
- The Stop and Parse Approach
- The LLM only handles text and uses it to describe the action it wants to take and the parameters to supply to the tool.
- Generation in a Structured Format
- e.g., JSON
- Halting Further Generation
- Parsing the Output
> A crucial ability: STOP generating new tokens when an action is complete
- Code Agents
- Generate an executable code block
- e.g., Python
- Advantages:
- Expressiveness
- Modularity and Reusability
- Enhanced Debuggability
- Direct Integration
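The stop-and-parse approach above can be sketched like this (the `Action:` marker, the `Observation:` stop sequence, and the JSON shape are an assumed convention, not a fixed standard):

```python
import json

# Stop and parse: generation is halted at a stop sequence, then the
# structured action is extracted from the raw text.
STOP_SEQUENCE = "Observation:"  # the runtime stops generation here

raw_generation = """Thought: I should check the weather.
Action:
{"action": "get_weather", "action_input": {"location": "London"}}
Observation:"""

def parse_action(text: str) -> dict:
    # 1. Halt at the stop sequence (a real runtime passes it to the LLM).
    text = text.split(STOP_SEQUENCE)[0]
    # 2. Extract the JSON block after the "Action:" marker.
    json_part = text.split("Action:")[1].strip()
    # 3. Parse it into a structured call the runtime can execute.
    return json.loads(json_part)

action = parse_action(raw_generation)
print(action["action"], action["action_input"])
```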
Observe, Integrating Feedback to Reflect and Adapt
- Observations are how an Agent perceives the consequences of its actions.
- The observation phase
- Collects Feedback: Receives data or confirmation that its action was successful (or not).
- Appends Results: Integrates the new information into its existing context, effectively updating its memory.
- Adapts its Strategy: Uses this updated context to refine subsequent thoughts and actions.
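Appending results boils down to adding the tool output to the message history so the next LLM call sees it (the `Observation:` prefix and the use of a `user` turn are assumptions; some runtimes use a dedicated `tool` role instead):

```python
# Observation phase: the tool result becomes a new turn in the
# conversation, effectively updating the agent's working memory.
messages = [
    {"role": "user", "content": "What is the weather in London?"},
    {"role": "assistant",
     "content": 'Action: {"action": "get_weather", "action_input": "London"}'},
]

def integrate_observation(messages, tool_result):
    """Append the tool output so it is part of the model's next context."""
    messages.append({"role": "user",
                     "content": f"Observation: {tool_result}"})
    return messages

integrate_observation(messages, "sunny, 18°C")
print(messages[-1]["content"])
```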
Dummy Agent Library
Notebook: ‣
First Agent Template
repo: ‣
What is smolagents?
- a library that provides a framework for developing your agents with ease.