Agent Q: Advanced Reasoning and Learning
for Autonomous AI Agents
Pranav Putta1 , Edmund Mills1 , Naman Garg1 , Sumeet Motwani1 , Chelsea Finn2 , Divyansh Garg1 and Rafael
Rafailov1, 2
1 The AGI Company (MultiOn), 2 Stanford University
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex
reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a
difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous
agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous
attempts to bridge this gap—through supervised fine-tuning on curated expert demonstrations—often suffer
from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome
these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) search
with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the
Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both
successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning
tasks. We validate our approach in the WebShop environment—a simulated e-commerce platform—where
it consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human
performance when equipped with the capability to do online search. In real-world booking scenarios, our
methodology boosts the Llama-3 70B model's zero-shot success rate from 18.6% to 81.7% (a 340%
relative increase) after a single day of data collection and further to 95.4% with online search. We believe
this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more
sophisticated and reliable decision-making in real-world settings.
1. Introduction
The recent advances in Large Language Models (LLMs) represent a significant leap in artificial
intelligence. Frontier models like ChatGPT (John Schulman et al., 2022), Gemini (Anil et al., 2023),
Opus (Anthropic, 2024), and LLaMA-3 (Touvron et al., 2023) demonstrate promising reasoning
capabilities that approach average human performance in a number of domains. These breakthroughs
have extended the utility of LLMs from traditional chat and text-based applications to more dynamic,
agentic roles, in which they do not just generate text but can take actions autonomously in a number
of environments including code and software engineering (Holt et al., 2024; Zhang et al., 2024d;
Jimenez et al., 2024; Yang et al., 2024), device control (Wang et al., 2024a; Zhang et al., 2023; Chen
and Li, 2024) and web applications (Hong et al., 2023; Deng et al., 2023; Zhou et al., 2024b; Lai
et al., 2024a; Gur et al., 2024) among others. However, despite these advancements, significant
challenges persist: LLMs still struggle to generalize effectively in interactive, multi-step environments,
since they are not natively trained for such applications. This is true even for some of the strongest
models of the current generation, such as GPT-4 (Achiam et al., 2023).
A growing literature on agentic formulations seeks to address these issues; however, these works mostly
focus on building frameworks around prompt-based learning on existing models or limited fine-tuning
on static datasets, and are thus limited by the base models’ reasoning and decision making capabilities.
Reasoning and planning have indeed been highlighted as core challenges for current LLMs. Since
the seminal work on chain-of-thought reasoning (Wei et al., 2022), significant efforts have been
made to improve these capabilities via prompt-based strategies (Kojima et al., 2022; Wang et al., 2023; Qiao et al., 2023; Yao et al., 2023a).
Corresponding author(s): [email protected], [email protected]
Figure 1: We use Monte Carlo Tree Search (MCTS) to guide trajectory collection and iteratively
improve model performance using direct preference optimization (DPO). We begin on the left by
sampling a user query from the list of tasks in the dataset. We iteratively expand the search tree
using UCB1 as a heuristic to balance exploration and exploitation of different actions. We store the
accumulated reward obtained for each node in the tree, where in this image darker green indicates
higher reward and darker red indicates lower reward. To construct the preference dataset, we compute
a weighted score of the MCTS average Q-value and the score generated by a feedback language model, which we use to construct contrastive pairs for DPO. The policy is optimized on these pairs and can be iteratively improved.
While successful, these approaches are still bounded by
the base model’s performance. Another direction of research has explored fine-tuning approaches
(Zelikman et al., 2022; Pang et al., 2024), and more recently combining them with inference-time
search prompting (Yao et al., 2023a) to produce fine-grained feedback. Concurrent works (Xie et al.,
2024; Hwang et al., 2024; Zhang et al., 2024e; Tian et al., 2024) utilize the traces produced by search
algorithms and combine them with optimization approaches (Rafailov et al., 2023; Zelikman et al.,
2022) to achieve a significant boost in capabilities, especially in mathematical problem solving and code
generation.
In this work we explore improving planning and reasoning capabilities of a web agent, which interacts
with a real world website. Our goal is to design an approach that allows the agent to improve with
autonomous experience and limited supervision. Indeed, prior works (Yao et al., 2023b; Zhang et al.,
2024c; Masterman et al., 2024; Sumers et al., 2024) have shown strong reasoning to be critical for
performance of autonomous agents, where challenges are even greater than during text generation,
as the model needs to further understand how its actions affect its environment. Towards this goal,
we introduce Agent Q—a novel approach that combines several key concepts in reasoning, search,
self-critique and reinforcement learning. Our method takes inspiration from Sutton’s The Bitter
Lesson on the power of general purpose methods that continue to scale with increased computation,
showing the significant benefits of combining search and learning.
Inspired by the success of search-based methods in prior game-playing settings (Silver et al., 2017a;
Brown and Sandholm, 2019; Gray et al., 2021) and mathematical reasoning (Yao et al., 2023a; Besta
et al., 2024), we deploy a Monte Carlo Tree Search (MCTS) based search routine over web pages to
guide agent exploration. Given the complexity of the environment, we use a base LLM for sampling
possible rationales and web actions to explore. While this simple search strategy shows a meaningful
improvement in the success rate, it still struggles on long-horizon tasks due to the sparsity of environment
rewards. Indeed, even a small mistake across the trajectory can cause the final agent output to be
wrong, creating significant credit assignment problems. To overcome this, we use AI feedback (Bai
et al., 2022) and self-criticism (Yuan et al., 2024) to further prompt the LLM to provide self-evaluation
feedback at each node, which serves as intermediate reward and helps guide the search steps. This
meaningfully improves the final agent success rate, but requires significant online interaction and,
moreover, the capability to roll back actions, which is not always possible in realistic online settings.
Such autonomous online search with little supervision can also produce a weak or unsafe
agent that makes many errors, leading to risky behaviors in sensitive online settings like bank
transfers and the sharing of sensitive information.
To correct this, we use the traces generated by the search process to improve capabilities of the
model by learning from both the successful and unsuccessful trajectories with offline reinforcement
learning, utilizing the Direct Preference Optimization (DPO) algorithm. We create preferences over
different branches at the node level, which are scored using a mixture of the AI process feedback
rewards and the final success rate of the explored branch. We evaluate our approach on the simulated
WebShop benchmark (Yao et al., 2022)—a simulated e-commerce platform—as well as a real-world
reservations booking website. We utilize LLaMa 3-70B as the base model in our experiments. In the
WebShop environment, our approach consistently outperforms behavior cloning and reinforcement
learning fine-tuned baselines, and beats average human performance when equipped with the
capability to do online search.
In our real-world booking experiments, using our Agent Q framework we improve the model's zero-shot
absolute success rate from 18.6% to 81.7% (a 340% relative increase), outperforming GPT-4's
performance after a single day of autonomous data collection. When we equip Agent Q with online
search capability, our absolute success further improves to 95.4%. We believe that our approach
represents a significant step forward in the development of autonomous web agents through its
search and self-critique capabilities, setting a new benchmark for reliable multi-step decision-making
in interactive settings.
2. Related Work
Our work touches on a large number of research directions around agent design, self-improvement,
reasoning and reinforcement learning. We include a short overview of related works from those
various fields below.
2.1. Guided Search for Reasoning and Planning
The latest generation of Large Language Models (LLMs) have demonstrated promising emerging
properties around reasoning and planning. Moreover, such behaviours can be directly elicited from
strong models using only simple prompting techniques (Wei et al., 2022; Kojima et al., 2022; Qiao
et al., 2023). These have also become an integral part of agentic design (Yao et al., 2023b; Zhang
et al., 2024c), which we also utilize for our approach. Another emerging research direction is based
around step-by-step verifiers or “Process Reward Models” (Uesato et al., 2022; Lightman et al., 2023),
specifically for mathematical reasoning. These have been shown to improve performance beyond purely
outcome-based training; however, they require a large amount of human effort to label individual steps.
Some recent approaches have proposed self-supervised methods for step-level supervision (Hwang
et al., 2024; Wang et al., 2024b; Setlur et al., 2024a). A number of concurrent works (Xie et al.,
2024; Zhang et al., 2024e; Tian et al., 2024) have further explored tree-based search approaches
(Yao et al., 2023a) in combination with DPO (Rafailov et al., 2023) training for math-based reasoning.
These algorithms optimize actions at the node level, using different branches produced by the search
algorithm to create preference pairs. Our approach shares similarities with the self-supervised search
proposed in (Yao et al., 2023a), combined with AI-based feedback (Bai et al., 2022; Yuan et al.,
2024) to guide intermediate search steps, but we are the first to scale this to a realistic agent setting.
Similar approaches were proposed in (Zhou et al., 2024a; Hao et al., 2023), and other works (Koh
et al., 2024); however, these works only use the base model's zero-shot capability and do not train it
further. Moreover, they are evaluated only in simulated environments. Beyond the search stage, our
work further adopts the training methodology of (Xie et al., 2024; Zhang et al., 2024e; Tian et al.,
2024), which significantly boosts our agent’s zero-shot capabilities.
2.2. Web Agents
The strength and capabilities of recent pretrained Large Language (Vision) Models (LL(V)Ms) have
significantly boosted progress in developing autonomous web agents. Improved code understanding
and long context have allowed agents to represent the environment state and action space with the document
object model (DOM), allowing for deployment in complex and realistic domains. Moreover, strong
reasoning (Yao et al., 2023b) and planning (Liu et al., 2023; Zhang et al., 2024c) capabilities have
also led to the development of a number of promising agents (Zhang and Zhang, 2023; Hong et al.,
2023; Zhou et al., 2024b; Deng et al., 2023; Gur et al., 2024). Beyond using LL(V)Ms as plug-and-play
planners/policies, recent works have sought to improve agentic-specific performance. Examples
include online exploration (Zhang et al., 2024a), planning (Zhang et al., 2024b), error-correction
(Wang et al., 2024a), and self- (Wu et al., 2024) or AI-critique (He et al., 2024; Pan et al., 2024).
However, with small exceptions (Nakano et al., 2022), which are still limited in scope, these agents
mostly provide a framework around a strong pre-existing model like GPT-4V or deploy limited
fine-tuning and adaptation. In this work we show that model training is crucial for continuous
improvement. We combine a planning and reasoning agent with MCTS inference-time search and AI
self-critique for self-supervised data collection, which we then use for RL type training.
2.3. Reinforcement Learning for LLMs and Agents
Reinforcement Learning has become a significant component of training modern generative AI systems
(Ouyang et al., 2022; Bai et al., 2022; Touvron et al., 2023). Classical approaches have deployed the
PPO algorithm (Schulman et al., 2017)—or similar policy-gradient based methods—and have even
been scaled to autonomous web search agents (Nakano et al., 2022) as well as embodied applications
with vision-language models (Zhai et al., 2024) (in simulation). However, these algorithms are
challenging due to their complexity and the need for a high number of online samples from the
model. This is especially prominent in potentially risky situations, such as autonomous agentic models
that could make a number of impactful mistakes during training. Implicit Language Q-learning
(Snell et al., 2022) and the Q-transformer (Chebotar et al., 2023) are offline RL algorithms (Levine
et al., 2020) designed for auto-regressive transformer models, and hence can be safely trained on
pre-collected datasets; however they have not been successfully scaled to modern LLMs. While these
methods represent a token-level MDP, (Zhou et al., 2024c) has shown success formulating the RL
problem at a step level and these ideas have recently been scaled to a general device-control agent
(Bai et al., 2024). However, these algorithms still have high complexity and require auxiliary models,
such as value functions, so in our approach we instead opt to use the Direct Preference Optimization
(DPO) algorithm (Rafailov et al., 2023) due to its simplicity and natural fit for the branching nature
of tree-search based data.
[Figure 2 contents: Agent Input: <SYSTEM PROMPT>, <EXECUTION HISTORY>, CURRENT OBSERVATION, and USER QUERY ("Book a reservation for the restaurant Cecconi's on OpenTable for 2 people on June 17 2024 at 7:00pm"). Agent Output: PLAN (select the date, choose the time, and select the party size for the reservation, then click on the "Find a table" button to book the reservation); THOUGHT ("I am currently on the Cecconi's restaurant page on OpenTable, and I need to select the date and time for the reservation and choose the number of guests. I will focus on selecting the date, time, and party size for the reservation."); COMMANDS (CLICK <select>Date</select>); STATUS (CONTINUE).]
Figure 2: We provide the following input format to the Agent, consisting of the system prompt, execution history, the current observation as a DOM representation, and the user query containing the goal. We divide our Agent output format into an overall step-by-step plan, thought, a command, and a status code.
3. Preliminaries
In this section we will outline the preliminaries of our agent training process.
3.1. Agent Formulation
We consider a general POMDP setup (𝒪, 𝒮, 𝒜, 𝑇, 𝑅, 𝜇0 , 𝛾) where 𝒪 denotes the observation space, 𝒮
the unobserved state space, 𝒜 the action space, 𝑇 (s𝑡+1 |s𝑡 , a𝑡 ) the transition distribution (in this case
the dynamics of a web browser), 𝑅(s, a) the reward function (in this work we use sparse rewards
of 1/0 representing success/failure), 𝜇0 (s0 ) the initial state distribution, and 𝛾 the discount factor,
which we set to 1. A POMDP is the most suitable framework to model web interactions for several reasons. First, novel environments with which the agent is unfamiliar require exploration in order to locate the task objective, consistent with the meta-reinforcement-learning-as-task-inference view of Humplik et al. (2019). Moreover, the real web is dynamic, which creates partial observability of the current state each time the agent is deployed; for example, the agent does not know a priori the current booking availability before attempting a booking. We will outline the main parts of our web agent below.
The agent observations o_t ∈ 𝒪 are commands/information given by the user and the web browser.
The first observation o1 is a user text instruction, such as
"Book reservation for restaurant Cecconi’s on OpenTable for 4 people on May 22 2024 at 7:00 PM"
together with a browser home page. Subsequent observations consist of web pages from the
browser, represented in HTML DOM format. Occasionally, for some tasks, the agent might ask for
confirmation/feedback from the user, which then also becomes part of the observation.
The agent actions a_t ∈ 𝒜 are composite, based on the agent history h_t. Our base approach is a ReAct agent Yao et al. (2023b) with a preliminary planning step (PlanReAct) Liu et al. (2023) and a few additional components.
• Planning: For the first action after the initial observation we leverage the base LLM's planning capabilities Huang et al. (2022a) and prompt the agent to generate a plan a_1^plan ∼ π(a_1^plan | h_1) of sequential steps to execute in language.
• Reasoning: Subsequently, all actions contain a thought action a_t^tht ∼ π(a_t^tht | h_t), which is a reasoning step Wei et al. (2022).
• Environment action: Next we generate the browser interaction command a_t^env ∼ π(a_t^env | h_t, a_t^tht), which consists of a finite set of options like "CLICK [ELEMENT ID]", "SCROLL", "TYPE [CONTENT]" or "ASK USER [CONTENT]", etc. This is the only part of the action generation which interacts with the environment.
• Explanation action: After the environment interaction action has been generated, we additionally prompt the model for an explanation action a_t^expl ∼ π(a_t^expl | h_t, a_t^tht, a_t^env).
We denote the step action a_t as a tuple of plan, thought, environment and explanation actions for the first step and thought, environment and explanation actions for subsequent steps. When optimizing models we consider the joint likelihood

\log \pi(a_1 | h_1) = \log \pi(a_1^{expl} | h_1, a_1^{env}, a_1^{tht}, a_1^{plan}) + \log \pi(a_1^{env} | h_1, a_1^{plan}, a_1^{tht}) + \log \pi(a_1^{tht} | h_1, a_1^{plan}) + \log \pi(a_1^{plan} | h_1)    (1)

for the initial action and

\log \pi(a_t | h_t) = \log \pi(a_t^{expl} | h_t, a_t^{env}, a_t^{tht}) + \log \pi(a_t^{env} | h_t, a_t^{tht}) + \log \pi(a_t^{tht} | h_t)

for subsequent actions, unlike some prior works Zhai et al. (2024), which down-weight the reasoning likelihood.
The agent state is the current state of the web, which may not be observable. In this POMDP
formulation we also need to build an agent memory component h_t. Prior works have used the entire
trajectory of observations and actions; however, HTML DOMs can be hundreds of thousands of tokens
long. Moreover, realistic web tasks can require many more interactions than static benchmarks such
as WebShop Yao et al. (2022) and WebArena Zhou et al. (2024b), which most prior works use. This
makes it impractical to use full web trajectories due to limited context windows, potential out-of-distribution issues, and practical inference speed and cost. Instead, we build the history representation
of the agent as h𝑡 = (a1 , . . . , a𝑡−1 , o𝑡 ). That is, the agent history consists of the actions generated
so far and the current browser state. With some abuse of notation we will also refer to this as the
agent state. Even though only the environment action is used for interacting with the browser, we
construct the agent thought and explanation actions to act as a form of inner monologue Huang et al.
(2022b) and adequately represent its state and intentions. This allows us to use a significantly more
compact history representation. We should note that, while only the environment action affects the
browser state, the planning, reasoning and explanation components affect subsequent decisions due
to conditioning. For this reason, when we optimize the agent, we compute likelihoods over
the composite action.
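To make the composite action and compact history representation concrete, the following minimal Python sketch (our own illustration; the class and field names are assumptions, not the paper's code) stores a step action and assembles the history h_t from the prior composite actions plus only the current observation.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StepAction:
    # Composite action: the plan component is only present on the first step.
    plan: Optional[str]      # a_t^plan (first step only)
    thought: str             # a_t^tht, the reasoning step
    command: str             # a_t^env, e.g. 'CLICK [ELEMENT ID]'
    explanation: str         # a_t^expl, post-hoc explanation

@dataclass
class AgentHistory:
    past_actions: List[StepAction] = field(default_factory=list)
    current_observation: str = ""   # DOM of the current page only

    def as_prompt(self) -> str:
        # h_t = (a_1, ..., a_{t-1}, o_t): previous composite actions plus the
        # current observation; earlier DOMs are deliberately dropped.
        lines = []
        for i, a in enumerate(self.past_actions, 1):
            if a.plan:
                lines.append(f"PLAN {i}: {a.plan}")
            lines.append(f"THOUGHT {i}: {a.thought}")
            lines.append(f"COMMAND {i}: {a.command}")
            lines.append(f"EXPLANATION {i}: {a.explanation}")
        lines.append(f"CURRENT OBSERVATION: {self.current_observation}")
        return "\n".join(lines)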
3.2. Fine-Tuning Language Models From Feedback
Classical approaches to RLHF in foundation models Stiennon et al. (2022); Ouyang et al. (2022) use
the model as a policy 𝜋𝜃 and optimize an objective of the form:
\mathbb{E}_{a \sim \pi_\theta(a|h)}\left[r(a, h)\right] - \beta \mathbb{D}_{KL}\left[\pi_\theta(a|h) \,\|\, \pi_{ref}(a|h)\right]    (2)
where 𝜋ref is some reference policy (usually the initial model). The goal of this formulation is to
optimize some target objective (expressed by the reward 𝑟(a, h)) while preventing out-of-distribution
drift. This objective can be extended to multi-step agentic problems, where the model interacts with
an external environment env such as in Nakano et al. (2021) which focuses on information retrieval
using web navigation. In this case we use an objective of the kind
\mathbb{E}_{\pi_\theta, env}\left[\sum_t r(a_t, h_t) - \beta \mathbb{D}_{KL}\left[\pi_\theta(a_t|h_t) \,\|\, \pi_{ref}(a_t|h_t)\right]\right]    (3)
Classical RLHF has used policy-gradient algorithms, such as PPO Schulman et al. (2017); however, they are complex and require online data, which can be costly/dangerous to collect autonomously in the agent setting. While PPO has shown some success in prior web agent applications Nakano et al. (2021), the issues above largely make the approach impractical for general web tasks beyond information retrieval. In this work we utilize some recent alternatives, outlined below.
3.2.1. Reinforced Fine-Tuning
Reinforced fine-tuning (RFT) algorithms Zelikman et al. (2022); Gulcehre et al. (2023); Yuan et al.
(2023); Singh et al. (2024) have grown in popularity due to their simplicity and scalability. These
methods aggregate data and filter out the sub-optimal samples based on some reward model or
a verifier to construct a growing dataset of high-quality trajectories 𝒟. Given this dataset and a
parameterized model 𝜋𝜃 we can carry out standard supervised fine-tuning (SFT):
\mathcal{L}(\pi_\theta, \mathcal{D}) = -\mathbb{E}_{\mathcal{D}}\left[\sum_{t=1}^{T} \log \pi_\theta(a_t|h_t)\right]    (4)
In this objective the divergence penalty is only applied implicitly by limiting the number of training
rounds. While simple and relatively successful, empirically these methods tend to under-perform
standard RL and alternatives Dubois et al. (2024); Tajwar et al. (2024); Setlur et al. (2024b) in the
text generation domain, particularly in reasoning. We largely observe similar empirical results, and
we use these methods mostly as baselines to build intuition.
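As an illustration of the reinforced fine-tuning pipeline described above, the sketch below (using our own simplified types and helper names) filters rollouts by the sparse 1/0 outcome reward and accumulates the successful ones into a growing SFT dataset, which is then used to maximize the objective in Eq. 4.

from typing import List, Tuple

# A trajectory is a list of (history_prompt, action_text) pairs plus a final
# sparse reward in {0, 1}; these types are our own simplification.
Trajectory = Tuple[List[Tuple[str, str]], float]

def build_rft_dataset(trajectories: List[Trajectory],
                      dataset: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Filter rollouts by outcome and grow the SFT dataset D used in Eq. 4."""
    for steps, reward in trajectories:
        if reward == 1.0:             # keep only successful rollouts
            dataset.extend(steps)     # each (h_t, a_t) becomes an SFT example
    return dataset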
3.2.2. Direct Preference Optimization
Direct Preference Optimization (DPO) Rafailov et al. (2023) is an offline RL Levine et al. (2020)
alternative to the classical RLHF optimization pipeline. It is a suitable algorithm for agent fine-tuning,
as it can use fully offline data and does not require online rollouts. The original formulation in the
pure text generation setting considers feedback of pairwise comparisons (h, a_w, a_l), where h is a single
prompt and a_w and a_l are two responses with a_w ≻ a_l indicating that a_w is preferred over a_l. The
DPO objective then minimizes the following loss:

\mathcal{L}_{DPO}(\pi_\theta; \mathcal{D}) = -\mathbb{E}_{(h, a_w, a_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(a_w|h)}{\pi_{ref}(a_w|h)} - \beta \log \frac{\pi_\theta(a_l|h)}{\pi_{ref}(a_l|h)}\right)\right]    (5)
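For concreteness, a minimal PyTorch sketch of this pairwise objective is given below; it assumes the per-response log-probabilities (summed over the response tokens) have already been computed, and beta = 0.1 is an illustrative default rather than a value reported here.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,   # log pi_theta(a_w | h)
             policy_logp_l: torch.Tensor,   # log pi_theta(a_l | h)
             ref_logp_w: torch.Tensor,      # log pi_ref(a_w | h)
             ref_logp_l: torch.Tensor,      # log pi_ref(a_l | h)
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss of Eq. 5 from pre-computed sequence log-probabilities."""
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    # -log sigma(beta * (chosen - rejected)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()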
While the algorithm was developed in a bandit setting, Hejna et al. (2024); Rafailov et al. (2024)
have extended it to multi-turn settings with preferences over trajectories. In our setting, we can
directly utilize this objective as:

\mathcal{L}_{T\text{-}DPO}(\pi_\theta; \mathcal{D}) = -\mathbb{E}_{(\tau_w, \tau_l) \sim \mathcal{D}}\left[\log \sigma\left(\sum_{t=0}^{|\tau^w|} \beta \log \frac{\pi_\theta(a_t^w|h_t^w)}{\pi_{ref}(a_t^w|h_t^w)} - \sum_{t=0}^{|\tau^l|} \beta \log \frac{\pi_\theta(a_t^l|h_t^l)}{\pi_{ref}(a_t^l|h_t^l)}\right)\right]    (6)
Figure 3: Success rate of different approaches on the WebShop Yao et al. (2022) task. All models
are based on xLAM-v0.1-r Zhang et al. (2024c). RFT and DPO over xLAM-v0.1-r demonstrate
improvements in performance from 28.6% to 31.3% and 37.5% respectively. However, these methods
still lag behind average human performance of 50.0%. Our approach, Agent Q + MCTS achieves a
significant gain (76.57% relative improvement) over the base model, outperforming average human
performance on WebShop with a success rate of 50.5%.
One bottleneck for the practical deployment of the algorithm is the need for a reference model 𝜋ref
during optimization, which requires more computational resources. Instead, in our setting, we slightly
modify the algorithm using an off-policy replay buffer, which aggregates trajectory data, as well as
likelihoods of the generated actions. During the optimization step, we sample tuples of trajectories
and the corresponding likelihoods under the data generation (reference) density, which eliminates
the need for a separate reference model.
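A rough sketch of this reference-free variant is shown below, under our own buffer layout: each preference entry stores the log-probabilities assigned by the data-generating policy at collection time, so the DPO loss can later be computed without loading a separate reference model.

import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Stores step-level preference pairs together with the log-probabilities
    the data-generating (reference) policy assigned at collection time."""
    def __init__(self):
        self.entries = []

    def add(self, history, action_w, action_l, ref_logp_w, ref_logp_l):
        self.entries.append(dict(history=history, action_w=action_w,
                                 action_l=action_l, ref_logp_w=ref_logp_w,
                                 ref_logp_l=ref_logp_l))

def dpo_loss_from_buffer(policy_logp_w, policy_logp_l, entry, beta=0.1):
    # The stored reference log-probs replace a live pi_ref forward pass.
    margin = ((policy_logp_w - entry["ref_logp_w"]) -
              (policy_logp_l - entry["ref_logp_l"]))
    return -F.logsigmoid(beta * torch.as_tensor(margin))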
4. Preliminary Approach With Outcome Supervision
In this section we will outline preliminary experimental results, which build the foundation for our further experiments. We use the AgentOhana xLAM-v0.1-r model Zhang et al. (2024c),
which is a fine-tune of a pre-trained Mixtral-8x7B-Instruct-v0.1 model Jiang et al. (2024) on a mix of
agentic applications, including WebShop SFT data. We also incorporate the same agent configuration¹
specified by the AgentLite Liu et al. (2024) work to ensure a fair comparison between our fine-tuned
model and the xLAM base model performance. We evaluate all approaches on the WebShop environment Yao et al. (2022), where the agent needs to find particular products by browsing a simulated
web shop. The environment comes with a set of 12,087 pre-defined tasks (corresponding to specific
products to find), which we split into a train set of 11,000 tasks, which we use for further agent
fine-tuning and a set of 1,087 held-out tasks, which we use for zero-shot evaluation. We show success
rates (exact product match) for different approaches in Fig. 3. The base xLAM-v0.1-r model achieves
success rate of 28.6% on the test tasks. All other methods are based on outcome-based supervision only, depending on whether a particular attempt was successful or not.
¹ https://github.com/SalesforceAIResearch/xLAM
We see that further RFT
training, using a STaR-like algorithm Zelikman et al. (2022) at the trajectory level, as outlined in
Sec. 3.2.1, achieves a success rate of 31.3%, a small improvement of 2.7% over the initial
model. This is not surprising since the base model is already trained as an agent on the environment
with supervised fine-tuning on demonstrations. Our next experiment fine-tunes the base model using
the trajectory-level DPO algorithm, as outlined in Eq. 6 in Sec. 3.2.2 using successful trajectories as
preferred over failed ones. This approach also uses only outcome-level supervision, but unlike the
RFT baseline can utilize failed trajectories as well, which improves the agent performance by 9.3%
over the RFT agent, to a 40.6% success rate. We also evaluate this model with beam search for the action
generation, which can be considered a form of planning over the horizon of a single environment action
(which still consists of multiple simple actions) Rafailov et al. (2023), but it only yields a marginal
improvement over the base model. These findings match results on reasoning for math problems
Pang et al. (2024) and some recent approaches that also apply DPO to agent applications Song et al.
(2024); Xi et al. (2024).
Despite the additional reinforcement learning training, our agents are still not able to match the
average human performance on this environment. We identify that one of the core failure modes of
the DPO policy is that it executes a greedy search when looking for matches to the product query.
For example, for every search query, the WebShop environment yields a number of pages of results.
However, we find that the model nearly always greedily searches for the best matching item in the
first page of results rather than using the "[NEXT]" and "[PREV]" buttons to navigate between pages,
essentially deploying a weak exploration strategy.
5. Agent Search
As we discovered in the previous section, while training based on outcome supervision with DPO
yields meaningful improvement, the model is still not able to match human performance due to
it’s limited exploration. In this section we will explore endowing the agent with additional search
capability via MCTS.
5.1. Monte-Carlo Tree Search Over Web-Pages
The Monte Carlo Tree Search (MCTS) algorithm Kocsis and Szepesvári (2006) employed in this
work follows closely the one in Hao et al. (2023) and consists of four phases: selection, expansion,
simulation, and backpropagation. Each phase plays a critical role in balancing exploration and
exploitation while iteratively refining the policy.
We formulate the web agent execution as tree search over web pages. The state is represented as
described in Section 3.1 and consists of the summary of the agent's history and the DOM tree of
the current web page. Unlike board games such as Chess or Go Silver et al. (2017b), the complex
web-agent action space we use is open-format and variable. Instead, we use the base model as an
action-proposal distribution and sample a fixed number of possible actions at each node (web page).
Once we select and execute an action in the browser we traverse to the next web page, which together
with the updated history becomes the new node.
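As a concrete illustration of treating the base model as an action-proposal distribution, the sketch below samples K candidate composite actions at a node; generate stands in for an LLM sampling call with non-zero temperature and is a hypothetical interface, and K = 5 is an arbitrary illustrative value.

from typing import Callable, List

def propose_actions(history_prompt: str,
                    generate: Callable[[str, float], str],
                    k: int = 5,
                    temperature: float = 1.0) -> List[str]:
    """Sample K candidate actions for the current node; duplicates are dropped
    so the tree only branches on distinct proposals."""
    proposals = []
    for _ in range(k):
        candidate = generate(history_prompt, temperature).strip()
        if candidate and candidate not in proposals:
            proposals.append(candidate)
    return proposals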
5.1.1. Action Selection With AI Process Supervision
The selection phase uses the Upper Confidence Bound (UCB1) formulation of MCTS, also used by Hao
et al. (2023), to select nodes in a way that balances exploration and exploitation. With some abuse of
notation we will also denote the agent state with h_t. We consider the value function Q(h_t, a), which represents the estimated value (chance of success) of taking action a in the state h_t.
[Figure 4 contents: Agent Input: <SYSTEM PROMPT>, <EXECUTION HISTORY>, CURRENT OBSERVATION, and USER QUERY ("Book a reservation for the restaurant Fogo de Chao on OpenTable for 2 people on August 14 2024 at 7:00pm"). The LLM Actor proposes three actions: (1) "I will first select the desired date and time for the reservation, then choose the number of people and select a suitable time slot. CLICK <date selector>"; (2) "I am searching for the Terra - Eataly Silicon Valley restaurant on OpenTable. TYPE <searchbar> 'Terra - Eataly'"; (3) "I will navigate to the OpenTable homepage to search for the relevant restaurant. GOTO 'opentable.com'". The LLM Critic analyzes the proposals ("Upon analyzing the current browser state, I notice that we are on the OpenTable website. This page displays a list of restaurants with available reservation times. The relevant elements on the page are ...") and selects the most promising command: TYPE in the search bar for the search term "Terra - Eataly".]
Figure 4: The policy proposes K actions at every step during inference time search. The critic, also
initialized as the same base LLM model used by the policy, ranks the actions proposed by the policy.
This ranking is used to guide node selection after expansion and used to construct preference pairs
during policy training.
At each new node h_t we sample K proposal actions from the base model, a_t^1, ..., a_t^K. We initialize all values Q(h_t, a_t^i), i = 1, ..., K, to zero. The web-based environment does not provide
intermediate rewards to guide the search, so we incorporate AI-based critique to provide process
supervision at the step level to guide the exploration process. We use the base model to produce a
feedback score for each action by asking it to rank the generated actions by their perceived utility in
helping the agent complete the user task.
We query the feedback model for multiple iterations, each time removing the best action selected from
the previous iteration from the list, until we have a full ranking of all actions. The full AI feedback
process is demonstrated in Figure 4. After the initial selection, we select the actions to explore based
on the standard MCTS UCB1 formulation:

a_t^* = \arg\max_{a_t^1, \ldots, a_t^K}\left[Q(h_t, a) + c_{exp} \cdot \sqrt{\frac{\log N(h_t)}{1 + N(h_{t+1})}}\right]    (7)
where 𝑁 (h𝑡 ) is the visitation frequency of state h𝑡 , and 𝑐exp is an exploration constant. For each
rollout added to the tree, we start at the root node and follow the child states that maximize the
UCB1 score until we reach a leaf node. This process is repeated for each tree/prompt in the batch.
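The sketch below illustrates the two pieces of this selection step under our own simplifying assumptions: turning critic feedback into a full ranking by repeatedly asking for the best remaining action (rank_best_action is a placeholder for a prompt to the feedback LLM), and the UCB1 rule of Eq. 7 applied over the proposed actions at a node.

import math
from typing import Callable, Dict, List

def rank_actions(history: str, actions: List[str],
                 rank_best_action: Callable[[str, List[str]], str]) -> List[str]:
    """Query the critic repeatedly, removing the chosen best action each round,
    until all K proposed actions are fully ranked."""
    remaining, ranking = list(actions), []
    while remaining:
        best = rank_best_action(history, remaining)   # LLM critic picks one
        if best not in remaining:                     # guard against malformed output
            best = remaining[0]
        ranking.append(best)
        remaining.remove(best)
    return ranking

def ucb1_select(q: Dict[str, float], n: Dict[str, int], c_exp: float = 1.0) -> str:
    """UCB1 over the K sampled actions at a node (Eq. 7)."""
    n_state = sum(n.values())
    def score(action: str) -> float:
        bonus = c_exp * math.sqrt(math.log(max(n_state, 1)) / (1 + n[action]))
        return q[action] + bonus
    return max(q, key=score)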
5.1.2. Expansion and Backtracking
Based on the preceding section, we select and execute an action in the browser environment to reach
a new node (page). Beginning from the selected state node’s trace, we roll out the trajectory using
the current policy 𝜋𝜃 until a terminal state is reached. The environment returns a reward at the
end of the trajectory, 𝑅, where 𝑅 = 1 if the agent was successful and 𝑅 = 0 otherwise. We then
backpropagate this reward by updating the values of each node bottom up from the leaf node to the
root as follows:
𝑄(h𝑡 , a𝑖𝑡 )𝑁 (h𝑡 , a𝑖𝑡 ) + 𝑅
𝑄(h𝑡 , a𝑖𝑡 ) ←
𝑁 (h𝑡 , a𝑖𝑡 ) + 1
(8)
𝑖
𝑖
𝑁 (h𝑡 , a𝑡 ) ← 𝑁 (h𝑡 , a𝑡 ) + 1
Each state node tracks two values: 𝑄(h𝑡 , a𝑖𝑡 ), the average reward for passing through state h𝑡 and
10
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Algorithm 1 MCTS Guided Direct Preference Optimization
Input: 𝜋𝜃0 : initial LLM policy, 𝒟𝑇 : dataset of tasks the agent must complete in the environment, 𝑁 :
number of iterations, 𝐵: number of samples per iteration, 𝑇 : MCTS tree depth, ℬ: replay buffer,
𝜃threshold : value threshold in (10), 𝐾: number of actions to sample for MCTS
Output: 𝜋𝜃𝑁 , the trained LLM policy
for 𝑖 = 1 to 𝑁 do
𝜋ref ← 𝜋𝜃𝑖 , 𝜋𝜃𝑖 ← 𝜋𝜃𝑖−1
Sample a batch of 𝐵 tasks from 𝒟𝑇
for each task in batch do
Initialize the root node h0
for 𝑡 = 1 to 𝑇 do
Selection: Traverse tree from the root node to a leaf node using tree policy (UCB1; 7)
Trajectory Rollout: From the selected node’s trace, roll out the trajectory using
𝜋𝜃𝑖 until a terminal state is reached
Backpropagation: Backpropagate the value estimate bottom-up (8)
end for
Collect trajectories from rollouts and store them in replay buffer ℬ
end for
𝑙 𝑇 −1
Construct preference pairs 𝒟𝑃 = {(h𝑡 , a𝑤
𝑡 , a𝑡 )}𝑡=1 where h𝑡 ∼ 𝒟𝑃 . For each node at step level
𝑡, compare each pair of child nodes, and construct the pair of generated actions (a𝑤 , a𝑙 ) if the
values of taking the action, |𝑄(h𝑡 , a𝑤 ) − 𝑄(h𝑡 , a𝑙 )| > 𝜃threshold , where 𝑄(h𝑡 , a𝑤 ) and 𝑄(h𝑡 , a𝑙 ) are
computed using (10)
Optimize LLM policy 𝜋𝜃𝑖 using DPO objective in Eq. (5) with 𝒟𝑃 and 𝜋ref
end for
choosing action∑︀a𝑖𝑡 , and 𝑁 (h𝑡 , a𝑖𝑡 ), the number of times this state action pair was visited during search
𝑖
(and 𝑁 (h𝑡 ) = 𝐾
𝑖=1 𝑁 (h𝑡 , a𝑡 )). The backpropogation updates correctly maintain these values.
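A small sketch of the value bookkeeping in Eq. 8 follows, using a dictionary-backed node structure of our own rather than the paper's implementation: after a rollout returns the terminal reward R, every (state, action) pair along the selected path is updated bottom-up.

from typing import Dict, List, Tuple

class Node:
    def __init__(self):
        self.q: Dict[str, float] = {}   # Q(h_t, a^i): running average reward
        self.n: Dict[str, int] = {}     # N(h_t, a^i): visit counts

    def update(self, action: str, reward: float) -> None:
        q, n = self.q.get(action, 0.0), self.n.get(action, 0)
        # Incremental-mean form of Eq. 8.
        self.q[action] = (q * n + reward) / (n + 1)
        self.n[action] = n + 1

def backpropagate(path: List[Tuple[Node, str]], reward: float) -> None:
    """Update every (node, action) on the path from leaf to root with the
    sparse terminal reward R in {0, 1}."""
    for node, action in reversed(path):
        node.update(action, reward)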
5.2. Improving Zero-Shot Performance with Reinforcement Learning
Training large foundation models with offline Snell et al. (2022) or off-policy Chebotar et al. (2023)
reinforcement learning at scale has still remained challenging. At the same time online (on-policy)
reinforcement learning Stiennon et al. (2022); Ouyang et al. (2022) is not scalable to real interactive
environments. Instead, we follow a line of recent works, which apply the DPO algorithm Rafailov
et al. (2023, 2024) at the step level in multi-step reasoning problems in mathematical domains Xie
et al. (2024); Hwang et al. (2024); Chen et al. (2024); Lai et al. (2024b); Lu et al. (2024); Setlur
et al. (2024b); Zhang et al. (2024f). Our approach is most similar to Xie et al. (2024); Chen et al.
(2024); Zhang et al. (2024f) who also use the branching nature of tree search to produce step-level
preference pairs. We will also use this approach in our setting due to its simplicity, scalability and
prior success in smaller scale (non-interactive) reasoning applications.
We will generate a dataset of preference pairs 𝒫 = \{(h_t, a_t^w, a_t^l)\} where we make sure both actions
𝑡 , a𝑡 } where we make sure both actions
were explored. We then optimize the DPO objective in Eq. 5 on the node level. We will leverage a
theoretical result below to guide the construction of these preferences. We can make a number of
modifications to Theorem 6.1 from Setlur et al. (2024b) to incorporate the interactive nature of the
web environment dynamics to obtain the following result:
Theorem 1. Consider a policy that optimizes the objective in Eq. 3 on trajectories generated by \pi_{ref}, and assume that at each node h_t we have preferences generated according to p(a_t^w \succ a_t^l | h_t) \propto \sigma(Q(h_t, a_t^w) - Q(h_t, a_t^l)). Then the policy which optimizes the DPO objective in Eq. 5 is identical to the optimal RL policy

\pi^*(a | h_t) \propto \pi_{ref}(a | h_t) \exp\left(Q(h_t, a)/\beta\right)    (9)

Proof. The proof follows directly from the proof of Theorem 6.1 in Setlur et al. (2024b) and the control-as-inference arguments in Rafailov et al. (2024); Levine (2018).
That is, we can approximate the optimal RL policy if we generate preferences under the optimal
value function (or an approximation thereof). Since the outcome success provides limited supervision
we also incorporate process supervision through the AI feedback as outlined in Section 5.1.1. We
interpret the ranking of possible actions by the model to be driven by an implicit value function.
Similar semantics were used in Koh et al. (2024), where GPT-4 was used as a zero-shot value function,
while here we instead ask the model to reason over the given potential actions and provide rankings.
This self-rewarding approach has shown promise in the RLHF setting Yuan et al. (2024) and
we utilize it for our agent setting as well. Under this formulation, we compute the state-action value
as an average:
Q(h_t, a_t^i) = \alpha\, \tilde{Q}(h_t, a_t^i) + (1 - \alpha)\, \hat{Q}(h_t, a_t^i)    (10)

where \tilde{Q}(h_t, a_t^i) is the empirical value estimated through MCTS backpropagation and \hat{Q}(h_t, a_t^i) is a value estimate based on the ranking of the action a_t^i by the process supervision AI model. We then create preferences over pairs of actions which are above a certain value threshold, |Q(h_t, a_t^w) - Q(h_t, a_t^l)| \geq \theta_{threshold}. The full outline of our RL approach is shown in Algorithm 1.
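To make the preference construction concrete, the following sketch blends the two value estimates as in Eq. 10 and keeps action pairs whose value gap exceeds the threshold; alpha = 0.5 and threshold = 0.3 are illustrative placeholders, not values reported in the paper.

from itertools import combinations
from typing import Dict, List, Tuple

def mixed_q(q_mcts: float, q_critic: float, alpha: float = 0.5) -> float:
    """Eq. 10: blend the MCTS value estimate with the AI-feedback value estimate."""
    return alpha * q_mcts + (1.0 - alpha) * q_critic

def build_preference_pairs(node_values: Dict[str, float],
                           threshold: float = 0.3) -> List[Tuple[str, str]]:
    """Compare the explored child actions of a node pairwise and keep
    (preferred, dispreferred) pairs whose value gap exceeds the threshold."""
    pairs = []
    for a, b in combinations(node_values, 2):
        gap = node_values[a] - node_values[b]
        if abs(gap) >= threshold:
            pairs.append((a, b) if gap > 0 else (b, a))
    return pairs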
5.3. Full WebShop Results
The full range of results and baselines is shown in Figure 3. We see that equipping the agent with
search capabilities at test time significantly boosts the success rate from 28.6% to 48.4% when using MCTS
on top of the base xLAM-v0.1-r model, approaching the average human performance of 50.0%
and significantly out-performing the zero-shot performance of the DPO model trained with outcome
supervision. We further fine-tune the base model using the approach outlined in Algorithm 1, which
yields an improvement of 0.9% over the base DPO model. Using MCTS on top of the trained Agent Q
model further improves performance to 50.5% slightly out-performing the average human success
rates. We find that the ability to search at test time is a significant paradigm shift from zero-shot
agents, even with significant RL training. Furthermore, while dense, step-level supervision improves over
a purely outcome-based one, the improvement is modest on WebShop. This is because the environment
requires relatively short trajectories, and the model is capable of learning credit assignment purely from
outcome supervision. We will further explore a more complex real-world environment, which requires
longer-range credit assignment.
6. Scaling To Real World Websites
In this section we will investigate scaling the Agent Q framework to real use cases on live websites, in
particular bookings on OpenTable. We carried out initial experiments with the xLAM-v0.1-r model,
which proved too weak for the task, achieving an initial success rate of 0.0%. Instead we shifted to the
LLaMa-3 70B Instruct model, which was able to achieve some non-trivial initial success.
6.1. The OpenTable Environment
In OpenTable, the agent is tasked with booking a restaurant reservation for a user. The agent must
find a restaurant page on the OpenTable site, look for a reservation at a certain date and time, choose
seating options that align with a user's preference, and submit the user contact information to complete the task successfully.
[Figure 5 contents: USER QUERY: "Book a reservation for the restaurant Fogo de Chao on OpenTable for 2 people on August 29 2024 at 7:00pm". FINAL OBSERVATION: a screenshot of the final browser state. LLM Critic, Score 0.0: "The agent booked a reservation for the correct restaurant, but incorrect date and time."]
Figure 5: At the end of a trajectory, a GPT-4-V evaluator is called to provide feedback on the agent's
performance given the final observation and action history to determine the success score. The model
is prompted with a condensed execution history of the trajectory and the screenshot of the final state.
The success metric is a binary 0/1 value.
Since OpenTable is a live environment for which it is difficult to programmatically measure
metrics, we use a language model, GPT-4-V, to collect rewards for each trajectory, based on the
following metrics: (1) date and time set correctly, (2) party size set correctly, (3) user information
entered correctly, and (4) clicked complete reservation. The task is marked as completed if each
of the above constraints is satisfied. The outcome supervision setup is shown in Figure 5. We
experimented with using LLaMa 70B for outcome supervision as well, but discovered that vision
capabilities significantly improve the success classification accuracy (as measured by human validation).
At the time of writing no open source vision-language model of sufficient capability was available,
hence we opted to use GPT-4-V. We believe that as more open-source multi-modal models become
available we can switch to a fully self-supervised pipeline.
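A rough sketch of how such an outcome evaluator can be wired up is shown below; the vision_model callable and the prompt wording are placeholders for the GPT-4-V call and are not the exact interface or prompt used in this work.

EVAL_PROMPT = """You are grading a web agent on a restaurant booking task.
Task: {task}
Condensed execution history: {history}
Given the attached screenshot of the final state, answer each question with YES or NO:
1. Date and time set correctly?
2. Party size set correctly?
3. User information entered correctly?
4. Clicked complete reservation?
"""

def outcome_reward(task: str, history: str, screenshot: bytes, vision_model) -> float:
    """Return a binary 0/1 reward: success only if all four criteria are met."""
    answer = vision_model(EVAL_PROMPT.format(task=task, history=history),
                          image=screenshot)
    checks = [line.strip().upper().endswith("YES")
              for line in answer.splitlines()
              if line.strip() and line.strip()[0].isdigit()]
    return 1.0 if len(checks) == 4 and all(checks) else 0.0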
To generate queries for the OpenTable benchmark dataset, we programmatically generate a diverse set
of user queries by combining the restaurant name, desired date and time, and user information.
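A sketch of this kind of templated query generation is shown below; the restaurant names, dates, times, and user details are made-up examples rather than the actual benchmark contents.

import itertools
import random

RESTAURANTS = ["Cecconi's", "Fogo de Chao", "Terra - Eataly Silicon Valley"]
DATES = ["June 17 2024", "August 14 2024", "August 29 2024"]
TIMES = ["7:00pm", "7:30pm", "8:00pm"]
PARTY_SIZES = [2, 4]
USERS = [("Jane Doe", "555-0100", "jane@example.com")]

def generate_queries(n: int, seed: int = 0):
    """Combine restaurant, date, time, party size, and user info into queries."""
    random.seed(seed)
    combos = list(itertools.product(RESTAURANTS, DATES, TIMES, PARTY_SIZES, USERS))
    random.shuffle(combos)
    queries = []
    for restaurant, date, time, size, (name, phone, email) in combos[:n]:
        queries.append(
            f"Book a reservation for the restaurant {restaurant} on OpenTable "
            f"for {size} people on {date} at {time}. "
            f"Use the contact details: {name}, {phone}, {email}."
        )
    return queries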
Navigating live websites poses a wide variety of challenges. For example, if the user
specifies a restaurant in a different city than the location the browser is initialized in, the model will
have to take extra steps to find the restaurant. Further, if the exact user-requested date and time are
not available, the model may have to choose the closest available reservation slot. Lastly, if there are
preferences, such as indoor or outdoor seating options that the model is presented with, the desired
behavior is to interact with the user to determine the best course of action. OpenTable presents a
complex set of challenges for web navigation agents; the number of steps required to complete a
task is 13.9 on average, over double the WebShop average of 6.8 steps.
For the observation space of this environment, we design an intermediate state representation that
crawls the raw HTML content of a website to retrieve relevant visual components and highlights
interactive elements to the model. The agent is allowed the following actions: "CLICK [ID]", "GOTO [URL]",
"TYPE [ID] [TEXT]", "SUBMIT [ID]", "CLEAR [ID]", "SCROLL [UP/DOWN]", and "ASK USER HELP".
For OpenTable experiments, we use the LLaMA-3-70B-Instruct model as the initial policy. We find that
the superior reasoning abilities of this class of model are required for effective task completion.
Figure 6: Success rate of different approaches on OpenTable. All models unless otherwise stated
are based on LLaMA-3-70B-Instruct Touvron et al. (2023). Using DPO and RFT with MCTS further
improves performance from 18.6% to 71.8% and 84.3% respectively. We show that Agent Q in
itself achieves 81.7% and Agent Q + MCTS significantly outperforms all other techniques, with a
performance of 95.4% on OpenTable.
This capability is necessary to produce the diverse success and failure trajectories required to effectively improve the policy.
6.2. Results On OpenTable
The base xLAM-v0.1-r model achieves a success rate of 0.0%, largely from failing to follow the general
web navigation instructions used for live websites, in contrast to the simplified observation
and action space used in WebShop. We instead initialize the base policy with the LLaMa-3 70B Instruct
model, which achieves a zero-shot success rate of 18.6%. We do a single round of RFT on 600 successful
trajectories, which improves the success rate to 67.2%, already out-performing the GPT-4o model's
zero-shot performance of 62.6%. For all other baselines we adopt the RFT model
as the reference policy, due to the relatively low success rate of original LLaMa 3 70B Instruct model.
In this environment, training with outcome-supervision only DPO further improves performance by
4.6% to 71.8%, but significantly under-performs the full Agent Q pipeline, which achieves a zero-shot
success rate of 81.7%. We hypothesize that this is because OpenTable is a significantly
more challenging environment, which requires almost twice as many steps to complete as WebShop,
so the agent benefits from fine-grained supervision and credit assignment. We further ablate the
role of the intermediate AI feedback process supervision during training as outlined in Eq. 10 and
use MCTS with online Q values computed from outcome rewards only. This setting still outperforms
training with trajectory-level DPO (75.2% versus 71.8%) likely due to the more fine-grained credit
assignment that the branching tree search provides to the agent. However, zero-shot performance
is still meaningfully worse than with intermediate process-level supervision, and the full Agent Q
achieves a 6.5% higher success rate at 81.7%.
Similar to the WebShop experiment we see a step level increase in capability from allowing the
model to search at inference time, with the base RFT model achieving 84.3% success with MCTS,
outperforming the Agent Q zero-shot performance of 81.7% success. However, if we carry out
additional MCTS search using the Agent Q model as the base policy we achieve a significant 95.4%
success rate.
7. Discussion
In this work we developed algorithms for autonomous improvement of web-agents with limited human
supervision. While most prior works build frameworks around existing models without additional
training, we specifically seek to fine-tune pre-trained models for web navigation tasks based on
synthetic reasoning and search data. While we achieve significant improvement in model capabilities
on our target domain, many research questions remain.
Design of reasoning algorithms. The core challenge for our web agents is their weak reasoning
capabilities, which limit the agent's exploration and search strategy. In our approach we used process-level supervision from a separate critic model, which we prompt to rank possible agent actions. This
is in contrast to works in mathematical reasoning where PRMs are usually trained to classify the
correctness of individual steps Lightman et al. (2023), while other agent works Koh et al. (2024)
have prompted models as zero-shot value functions. Furthermore, while we spent significant effort in
training the agent policy, we maintain a frozen critic, which would likely also benefit from additional
fine-tuning. We defer exploration of these design choices to further work.
Choice of search algorithm. We used MCTS due to the approach's prior success in mathematical and code reasoning tasks. However, agent models executing MCTS on live environments might
require a significant number of risky interactions, and a different search strategy might be more suitable.
Recent works such as Lehnert et al. (2024); Gandhi et al. (2024) have even suggested directly learning
to optimally search and explore in reasoning tasks using meta-reinforcement learning. We believe
this is a promising research direction for autonomous agents, which we will pursue in further work.
Discrepancy between zero-shot vs search results. Similar to some recent works that focus on code
and reasoning, we observe a significant gap between zero-shot agent performance and the performance of
the agent equipped with search capabilities Snell et al. (2024); Brown et al. (2024). Investigating
these trade-offs at scale, and the potential effect of different search/optimization approaches, remains an open direction.
Online safety and interaction. The design of Agent Q allows for largely autonomous exploration,
self-evaluation and improvement with limited human intervention. However, the agent might make a
significant number of mistakes in its search process which might be difficult to fix/reverse, especially
for safety-critical online transactions, such as communications/email, payments, filings, etc. This limits
the scope of websites on which Agent Q can be safely deployed, and we might require additional safety
critics and human-in-the-loop training setups.
References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774, 2023.
Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan
Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable
multimodal models. arXiv preprint arXiv:2312.11805, 1, 2023.
Anthropic. Introducing the next generation of Claude, 2024.
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl:
Training in-the-wild device-control agents with autonomous reinforcement learning, 2024.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones,
Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson,
Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson,
Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile
Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova
DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El
Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph,
Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback,
2022.
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi,
Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph
of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI
Conference on Artificial Intelligence, 38(16):17682–17690, March 2024. ISSN 2159-5399. doi:
10.1609/aaai.v38i16.29720. URL http://dx.doi.org/10.1609/aaai.v38i16.29720.
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia