An index of concepts, frameworks, and methodologies in:
- Function Calling Pipeline: Understanding the entire process from pre-call to post-call stages
- Sample Construction & Fine-tuning: Building effective training datasets and optimizing models
- Deployment & Inference: Practical implementation strategies for real-world applications
- Evaluation Frameworks: Benchmarks and metrics for assessing function calling capabilities
Reproducibility is important! We prioritize methods with open-source implementations.
Please cite our survey paper if this index is helpful:
@article{wang2025comprehensive,
  title={Function Calling in Large Language Models: Industrial Practices, Challenges, and Future Direction},
  author={Wang, Maolin and Zhang, Yingyi and Peng, Cunyin and Chen, Yicheng and Zhou, Wei and Gu, Jinjie and Zhuang, Chenyi and Guo, Ruocheng and Yu, Bowen and Wang, Wanyu and Zhao, Xiangyu},
  url={https://openreview.net/pdf?id=LNxVGPedFW},
  year={2025}
}
- Challenges
- Sample Construction and Fine-Tuning
- Deployment and Inference
- Evaluation
- Industry Products
- Open Issues
Function calling capabilities in LLMs follow a three-stage workflow consisting of pre-call processing, on-call execution, and post-call validation; a minimal end-to-end sketch follows the challenge tables below.
Challenge | Description |
---|---|
Challenge 1.1: Intent Recognition | Understanding user intentions accurately from natural language queries |
Challenge 1.2: Function Redundancy | Managing redundant functions that serve similar purposes, increasing selection complexity |
Challenge | Description |
---|---|
Challenge 2.1: Missing Calls | Failure to initiate function calls when required for task completion |
Challenge 2.2: Unnecessary Calls | Triggering function calls when not required by the user's task |
Challenge 3.1: Missing/Illegal Parameters | Inadequate or inappropriate parameter extraction from user inputs |
Challenge 3.2: Function Hallucination | Mistakenly calling non-candidate or non-existent functions |
Challenge 3.3: Pronoun Resolution | Correctly interpreting contextual references and pronouns in queries |
Challenge 3.4: LLM Inherent Limitations | Performance constraints in latency and accuracy due to model architecture |
Challenge 3.5: Multi-Call Procedure | Managing complex workflows requiring multiple related function calls |
Challenge 3.6: Effective Context Management | Maintaining relevant information across multi-turn conversations |
Challenge | Description |
---|---|
Challenge 4.1: Execution Result Mismatch | Function outputs not aligning with user expectations |
Challenge 4.2: Irrelevant Information Overload | Excessive irrelevant information in function outputs |
Challenge 4.3: Mismatch Between Real-World Functions and Results | Gap between LLM-generated outputs and executable code |
Challenge 4.4: Execution Failure | Functions failing despite correct triggering and parameterization |
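To make the three stages concrete, here is a minimal, hypothetical sketch of the pre-call / on-call / post-call pipeline. The tool registry, the stubbed model call, and all names are illustrative only; the post-call check guards against the parameter and hallucination challenges listed above.

```python
import json

# Hypothetical tool registry; names and schemas are illustrative only.
TOOLS = {
    "get_weather": {
        "description": "Return current weather for a city.",
        "parameters": {"city": str},
    },
    "convert_currency": {
        "description": "Convert an amount between two currencies.",
        "parameters": {"amount": float, "from": str, "to": str},
    },
}

def pre_call(query: str) -> list[str]:
    """Pre-call: shortlist candidate functions (keyword match stands in for a retriever)."""
    return [name for name, spec in TOOLS.items()
            if any(w in spec["description"].lower() for w in query.lower().split())]

def on_call(query: str, candidates: list[str]) -> dict:
    """On-call: the LLM would pick a function and fill its arguments; stubbed here."""
    # A real system would prompt the model with the candidate schemas.
    return {"name": "get_weather", "arguments": {"city": "Paris"}}

def post_call(call: dict) -> dict:
    """Post-call: check the call names a known function and has the required parameters."""
    spec = TOOLS.get(call["name"])
    if spec is None:
        raise ValueError(f"Hallucinated function: {call['name']}")   # Challenge 3.2
    missing = set(spec["parameters"]) - set(call["arguments"])
    if missing:
        raise ValueError(f"Missing parameters: {missing}")           # Challenge 3.1
    return call

query = "What is the weather in Paris?"
validated = post_call(on_call(query, pre_call(query)))
print(json.dumps(validated, indent=2))
```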
The training process involves specialized data preparation and fine-tuning strategies to equip models with function calling capabilities while maintaining general language understanding.
Method | Description |
---|---|
Manual Construction | Human-crafted functions with precise specifications and documentation |
LLM Generation | Leveraging large language models like GPT-4, LLaMA 70B, and Qwen to automatically generate function specifications |
Web Mining | Extracting diverse function objects from web resources, with descriptions supplemented by LLMs when necessary |
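Whichever collection method is used, each sample is typically serialized into a tool schema plus a conversation trace. The example below is a hypothetical sample in a generic chat-style format; the field names and the `search_flights` schema are assumptions for illustration, not drawn from any specific dataset above.

```python
import json

# One hypothetical training sample: declared tools, a user query, the assistant's
# tool call, the tool result, and the grounded final answer.
sample = {
    "tools": [{
        "name": "search_flights",
        "description": "Search flights between two airports on a date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "format": "date"},
            },
            "required": ["origin", "destination", "date"],
        },
    }],
    "messages": [
        {"role": "user", "content": "Find me a flight from SFO to JFK on 2025-03-01."},
        {"role": "assistant", "tool_call": {
            "name": "search_flights",
            "arguments": {"origin": "SFO", "destination": "JFK", "date": "2025-03-01"},
        }},
        {"role": "tool", "content": "[{\"flight\": \"AA 16\", \"depart\": \"07:00\"}]"},
        {"role": "assistant", "content": "AA 16 departs SFO at 07:00 on 2025-03-01."},
    ],
}
print(json.dumps(sample, indent=2))
```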
Approach | Paper | Code | Description |
---|---|---|---|
Text Representation | Toolformer: Language models can teach themselves to use tools (Schick et al., 2024) | Code | Represents functions as natural language text, providing flexibility but requiring more token space |
Text Representation | ToolGen: Unified Tool Retrieval and Calling via Generation (Wang et al., 2024) | Code | Integrates tool information through generation with natural language descriptions |
Token Representation | Toolformer: Language models can teach themselves to use tools (Schick et al., 2024) | Code | Encodes functions as special tokens during training for computational efficiency |
Token Representation | ToolGen: Unified Tool Retrieval and Calling via Generation (Wang et al., 2024) | Code | Uses token representation during training while maintaining semantic richness |
Multi-turn Interaction | Sequential API Function Calling Using GraphQL Schema (Saha et al., 2024) | - | Introduces structured API schemas and response mapping for sequential function calling |
Multi-turn Interaction | Hammer: Robust Function-Calling for On-Device Language Models via Function Masking (Lin et al., 2024) | - | Specialized techniques to address naming convention sensitivity issues for on-device deployment |
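To contrast the two representations above: with text representation the tool description occupies many prompt tokens, whereas token representation collapses each tool into a single special token that the model can emit in one decoding step. The sketch below illustrates the difference with the Hugging Face tokenizer API; `gpt2` and the tool name are stand-ins, not choices made by the cited papers.

```python
from transformers import AutoTokenizer

tool_desc = "get_weather(city): returns the current weather for a city"

# Text representation: the tool appears as natural-language text in the prompt,
# which is flexible but consumes many tokens.
text_prompt = f"You can use the following tool:\n{tool_desc}\nUser: weather in Paris?"

# Token representation (Toolformer/ToolGen style): the tool becomes one special
# token, so selecting it is a single decoding step. "gpt2" is just a stand-in model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<tool_get_weather>"]})
# (A real setup would also call model.resize_token_embeddings(len(tokenizer)).)

print(len(tokenizer.tokenize(tool_desc)))        # many tokens for the text form
print(tokenizer.tokenize("<tool_get_weather>"))  # a single special token
```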
Method | Paper | Description |
---|---|---|
Supervised Fine-Tuning (SFT) | ToolGen: Unified Tool Retrieval and Calling via Generation (Wang et al., 2024) | Standard fine-tuning approach with unified retrieval and calling generation |
Supervised Fine-Tuning (SFT) | RAIT: Retrieval Augmented Instruction Tuning (Asai et al., 2023) | Retrieval-augmented approach for instruction tuning |
Supervised Fine-Tuning (SFT) | Show your work: Scratchpads for intermediate computation with language models (Nye et al., 2021) | Scratchpad-based training for step-by-step computation |
Supervised Fine-Tuning (SFT) | Giving BERT a calculator: Finding operations and arguments with reading comprehension (Andor et al., 2019) | Integrates mathematical operations with language understanding |
Supervised Fine-Tuning (SFT) | Rainier: Reinforced knowledge introspector for commonsense question answering (Liu et al., 2022) | Knowledge introspection for improved reasoning |
Supervised Fine-Tuning (SFT) | Learning to represent programs with graphs (Allamanis et al., 2018) | Program representation through graph structures |
Supervised Fine-Tuning (SFT) | A deep generative model of code syntactic structures (Barone et al., 2017) | Syntax-aware code generation models |
Supervised Fine-Tuning (SFT) | Pre-training for Abstractive Document Summarization (Liu et al., 2019) | Domain-specific pre-training for document summarization |
Supervised Fine-Tuning (SFT) | Character-level neural network for biomedical named entity recognition (Liu et al., 2017) | Character-level models for biomedical entity recognition |
Parameter-Efficient Fine-Tuning (PEFT) | Gpt4tools: Teaching large language model to use tools via self-instruction (Yang et al., 2024) | Self-instruction approach for tool utilization |
Parameter-Efficient Fine-Tuning (PEFT) | CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance (Hao et al., 2024) | Balanced approach for tool utilization without performance tradeoffs |
Parameter-Efficient Fine-Tuning (PEFT) | Toolformer: Language models can teach themselves to use tools (Schick et al., 2024) | Self-supervised learning for tool usage |
Parameter-Efficient Fine-Tuning (PEFT) | PLUG: Parameter-efficient LLMs Using Plugin Adapters (Li et al., 2023) | Plugin adapter approach for parameter efficiency |
Parameter-Efficient Fine-Tuning (PEFT) | Prompt tuning for generative multimodal pretrained models (Wei et al., 2022) | Prompt-based tuning for multimodal generation |
Reinforcement Learning & RLHF | WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021) | Web browsing capabilities enhanced through human feedback |
Reinforcement Learning & RLHF | Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis (Liang et al., 2024) | Large-scale API connectivity through reinforcement learning |
Reinforcement Learning & RLHF | MADAC: Multi-Agent Decision-Aware Conversation via Reinforcement Learning (Li et al., 2023) | Decision-aware conversation through multi-agent reinforcement learning |
Reinforcement Learning & RLHF | GopherCite: Teaching language models to support answers with verified quotes (Menick et al., 2022) | Citation verification through reinforcement learning |
Reinforcement Learning & RLHF | Emergent Abilities of Large Language Models (Kojima et al., 2022) | Studies emergent abilities through reinforcement learning approaches |
Reinforcement Learning & RLHF | Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023) | Preference optimization without explicit reward modeling |
Reinforcement Learning & RLHF | Deep reinforcement learning from human preferences (Christiano et al., 2017) | Foundational work on learning from human preferences |
Reinforcement Learning & RLHF | The Bias-Variance Trade-off in RLHF: Overfitting to Human Feedback in Large Language Models (Manduzio et al., 2023) | Analysis of overfitting risks in human feedback |
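As a concrete example of the PEFT row above, a LoRA adapter can be attached so that only a small fraction of weights is updated during function-calling fine-tuning. The sketch below uses the `peft` and `transformers` libraries; the base model name, target modules, and hyperparameters are illustrative assumptions rather than settings from any cited paper.

```python
# Minimal LoRA/PEFT sketch; values are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # stand-in base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # only a small fraction of weights train

# From here, a standard SFT loop (e.g. transformers.Trainer or trl's SFTTrainer)
# over serialized function-calling samples like the one sketched earlier would
# update just the adapter weights, helping preserve general language ability.
```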
Based on practical implementations, we emphasize that data quality (and variety) plays a more crucial role than data quantity in both data construction and fine-tuning phases, given the intricate nature of function calling tasks.
Emphasis | Description |
---|---|
Data Quality | Prioritizing dataset diversity and verification over quantity for more robust function calling capabilities |
Model Scaling | Larger models demonstrate significantly better function calling capabilities, with notable improvements above 7B parameters |
Capability Balance | Maintaining a balance between specialized function calling abilities and general language capabilities to avoid performance tradeoffs |
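As one concrete (and hypothetical) illustration of putting quality before quantity, the sketch below deduplicates samples and drops any whose tool calls reference undeclared tools or omit required parameters; it assumes the same illustrative sample schema sketched earlier.

```python
import json

def is_valid(sample: dict) -> bool:
    """Keep a sample only if every tool call names a declared tool and
    supplies all of that tool's required parameters (hypothetical schema)."""
    declared = {t["name"]: t for t in sample.get("tools", [])}
    for msg in sample.get("messages", []):
        call = msg.get("tool_call")
        if call is None:
            continue
        spec = declared.get(call["name"])
        if spec is None:
            return False                                   # hallucinated tool
        required = set(spec["parameters"].get("required", []))
        if not required <= set(call["arguments"]):
            return False                                   # missing parameters
    return True

def dedup_and_filter(samples: list[dict]) -> list[dict]:
    """Exact-duplicate removal plus validity filtering; diversity-aware
    sampling would go further, but this is the minimal version."""
    seen, kept = set(), []
    for s in samples:
        key = json.dumps(s, sort_keys=True)
        if key not in seen and is_valid(s):
            seen.add(key)
            kept.append(s)
    return kept
```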
This section explores practical deployment strategies for function-calling LLMs. In a typical workflow, queries pass through input construction, LLM processing, and format validation or execution, with memory components maintaining context throughout; a minimal sketch of this loop appears below.
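A single turn of that deployment loop might look as follows. Everything here is an illustrative assumption: the tool registry, the stubbed `call_llm`, and the sliding-window memory stand in for a real serving stack rather than reflecting any specific framework's API.

```python
import json

def build_prompt(query: str, tools: dict, memory: list[str]) -> str:
    """Input construction: tool schemas + recent memory + the current query."""
    return "\n".join([
        "Tools: " + json.dumps(tools),
        "Context: " + " | ".join(memory[-3:]),   # naive sliding-window memory
        "User: " + query,
        "Reply with one JSON object: {\"name\": ..., \"arguments\": ...}",
    ])

def call_llm(prompt: str) -> str:
    """Stand-in for the model call (a real deployment would hit an LLM API)."""
    return '{"name": "get_time", "arguments": {"timezone": "UTC"}}'

def run_turn(query: str, tools: dict, executors: dict, memory: list[str]) -> str:
    raw = call_llm(build_prompt(query, tools, memory))
    try:
        call = json.loads(raw)                                   # format validation
        result = executors[call["name"]](**call["arguments"])    # execution
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        result = f"error: {err}"                  # would trigger a retry in practice
    memory.append(f"{query} -> {result}")         # persist context for later turns
    return result

tools = {"get_time": {"timezone": "IANA timezone name"}}
executors = {"get_time": lambda timezone: f"12:00 {timezone}"}
memory: list[str] = []
print(run_turn("What time is it in UTC?", tools, executors, memory))
```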
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
ReAct | React: Synergizing reasoning and acting in language models (Yao et al., 2022) | NeurIPS | Code | Combines reasoning and acting through chain-of-thought prompts |
ToolFormer | Toolformer: Language models can teach themselves to use tools (Schick et al., 2023) | NeurIPS | Code | Enables LLMs to use external tools through self-supervised learning |
Reverse Chain | Reverse chain: A generic-rule for llms to master multi-api planning (Zhang et al., 2023) | arXiv | - | Introduces target-driven backward reasoning for controlled multi-API planning |
AVATAR | AvaTaR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval (Wu et al., 2024) | arXiv | Code | Actor-comparator architecture for tool-assisted knowledge retrieval |
DEPS | Describe, explain, plan and select: interactive planning with LLMs enables open-world multi-task agents (Wang et al., 2024) | NeurIPS | Code | Interactive planning through description-based decomposition |
LLM-MCTS | LLM-MCTS: Monte Carlo Tree Search with LLMs for reasoning tasks (Zhao et al., 2023) | arXiv | Code | Monte Carlo Tree Search approach for multi-step reasoning |
MACT | Measuring and narrowing the compositional gap in language models (Zheng et al., 2023) | arXiv | Code | Addresses compositional generalization through structured decomposition |
TACO | Taco: Towards api conversation workflows for tool augmentation (Mao et al., 2024) | arXiv | Code | Structured workflows for tool-augmented conversational agents |
PAE | Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (Zhou et al., 2024) | arXiv | Code | Multi-agent system for autonomous skill discovery and planning |
SCIAGENT | SciAgent: Tool-augmented language models for scientific reasoning (Wang et al., 2023) | arXiv | Code | Tool-augmented planning for scientific problem-solving |
Agent Laboratory | Agent laboratory: Using llm agents as research assistants (Schmidgall et al., 2025) | arXiv | - | Multi-agent architecture with specialized roles for research planning |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
AppAgent | AppAgent: Multimodal Agents as Smartphone Users (Yang et al., 2023) | arXiv | Code | Agents learn to operate smartphone applications via visual interfaces |
OS-ATLAS | OS-ATLAS: Foundation AI Agent for Desktop Operating Systems (Wang et al., 2024) | arXiv | Code | End-to-end desktop OS navigation with multimodal perception |
AndroidLab | AndroidLab: Large Language Models for Android UI Navigation (Yan et al., 2023) | arXiv | Code | Benchmarking and improving LLM-based Android UI navigation |
Ponder | Ponder: Toolkit-Aware Agent for Automating Desktop Tasks (Wu et al., 2024) | arXiv | Code | Self-reflective navigation through desktop interfaces |
OS-Genesis | OS-Genesis: Evaluating the multimodal capabilities of large language models in navigating operating systems (Li et al., 2024) | arXiv | Code | Comprehensive benchmark for evaluating OS navigation capabilities |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
Orca | Orca: Progressive learning from complex explanation traces of gpt-4 (Mukherjee et al., 2023) | arXiv | - | Learns from complex explanation traces for progressive improvement |
Orca 2 | Orca 2: Teaching small language models how to reason (Mitra et al., 2023) | arXiv | Code | Enhanced reasoning capabilities through step-by-step explanation |
Memgpt | Memgpt: Towards llms as operating systems (Chen et al., 2023) | arXiv | Code | Memory management system with hierarchical storage |
AIOS-Agent | Aios-agent: In-context fine-grained os control with large language models (Chu et al., 2024) | arXiv | Code | System-level control through fine-grained OS operations |
SpecInfer | Specinfer: Accelerating generative llm inference via speculative execution (Yan et al., 2023) | arXiv | Code | Performance optimization through speculative execution |
PEOA | PEOA: Progressive Exemplar-Oriented API-Aware Prompting (Wang et al., 2024) | arXiv | Code | Exemplar-based prompting for API-aware interactions |
LLM-Tool Compiler | Compiler-aided Generation for Tool-LLM Inference (Song et al., 2024) | arXiv | - | Compilation techniques to optimize tool operations |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
LLM-Planner | Llm-planner: Few-shot grounded planning for embodied agents with large language models (Song et al., 2023) | ICCV | Code | Environmental feedback for plan regeneration during failures |
ToolChain* | Toolchain*: Efficient action space navigation in large language models with a* search (Zhuang et al., 2023) | arXiv | Code | Employs decision trees for systematic API call management |
TPTU | Test-Time Prompt Updating for Text-to-Image Generative Models (Liang et al., 2023) | arXiv | Code | Adaptive prompt refinement based on execution feedback |
Buckets | Buckets: Efficient multi-environment learning for llm agents (Burkart et al., 2023) | arXiv | Code | Error-aware multi-environment learning framework |
AMOR | AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback (Guan et al., 2024) | arXiv | Code | FSM-based framework enabling process-level human feedback |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
ControlLLM | Controlllm: Augment language models with tools by searching on graphs (Liu et al., 2023) | arXiv | Code | Tree of Thoughts with depth-first search on tool graphs |
PLUTO | Pluto: A recipe for building adaptable autonomous llm agents (Guan et al., 2024) | arXiv | Code | Adaptable autonomous agents with tree-based planning |
Toolink | Toolink: Linking toolkit creation and using through chain-of-solving on open-source model (Qian et al., 2023) | arXiv | Code | Hierarchical task decomposition with toolkit creation |
TPTU-v2 | TPTU-v2: Boosting Test-Time Prompt Tuning for Text-to-Image Generation (Kawar et al., 2023) | arXiv | Code | Enhanced tree-based prompt optimization strategies |
α-UMi | Small llms are weak tool learners: A multi-llm agent (Shen et al., 2024) | arXiv | Code | Planning-oriented fine-tuning for small LLMs |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
COA | Chain of agents: A framework for collaborative tool utilization with language models (Chang et al., 2024) | arXiv | Code | Agent collaboration framework for specialized tool utilization |
DEER | DEER: Diverse Evolution Ensembles are Required for Large Language Model Agents (Chen et al., 2024) | arXiv | Code | Diverse evolution ensembles for LLM agent improvement |
SOAY | SOAY: Responsive and Safe Structured Editing with Dynamic Text Features (Wang et al., 2024) | arXiv | Code | Dynamic text feature adaptation for structured editing |
ProgPrompt | ProgPrompt: Generating Situated Robot Task Plans using Large Language Models (Singh et al., 2022) | arXiv | Code | Adaptive programming for situated robot task planning |
AutoTOD | Towards fully autonomous dialogue systems via interactive few-shot learning (Zhang et al., 2023) | arXiv | - | Interactive few-shot learning for dialogue system adaptation |
MATMCD | MATMCD: An Open Benchmark for Mobile Agent Testing in Minecraft with Concept Drift (Xiong et al., 2024) | arXiv | Code | Adaptive strategies for concept drift in Minecraft environments |
CC-PP | CC-PP: Chain-of-components pipeline prompting for planning with large language models (Gui et al., 2024) | arXiv | - | Component-based pipeline approach for adaptive planning |
AVT | AVT: Bridging Vision and Language with Adaptive Vision Transformers (Yang et al., 2024) | arXiv | Code | Adaptive vision transformers for multimodal planning |
K-agents | Autonomous Agents for Real-Time Decision Making: Applications in Banking (Balaji et al., 2023) | arXiv | - | Autonomous agent adaptation for financial decision making |
Agent-Pro | Agent-pro: Learning to evolve via policy-level reflection and optimization (Zhang et al., 2024) | arXiv | Code | Dynamic belief management and policy-level reflection |
Inner Thoughts | Proactive Conversational Agents with Inner Thoughts (Liu et al., 2024) | arXiv | Code | Continuous thought generation for proactive participation |
Approach | Paper | Code | Comment |
---|---|---|---|
Example demonstrations | Instance-wise prompting for few-shot transferability of large language models (Pan et al., 2024) | Code | Tailored examples for improved function understanding |
Four-shot prompting | - | - | Four in-context examples reported as an effective budget for tool usage |
Approach | Paper | Code | Comment |
---|---|---|---|
Function definitions | - | - | Including comprehensive function specifications in context |
Docstrings | - | - | Utilizing standardized documentation formats for clarity |
Chain-of-thought | Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022) | - | Step-by-step reasoning process for complex function selection |
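The prompt-construction approaches above (function definitions, docstrings, few-shot demonstrations, and a chain-of-thought instruction) can be combined into a single prompt builder. The sketch below is illustrative: `get_weather`, the example queries, and the prompt wording are assumptions, not taken from any cited paper.

```python
import inspect, json

def get_weather(city: str) -> dict:
    """Return the current weather for `city` as {"temp_c": float, "sky": str}."""
    ...

def build_tool_prompt(funcs, examples, query):
    """Combine docstring-based tool specs, few-shot demonstrations, and a
    chain-of-thought instruction into one prompt (all content illustrative)."""
    specs = [f"{f.__name__}{inspect.signature(f)}: {inspect.getdoc(f)}" for f in funcs]
    shots = [f"User: {q}\nCall: {json.dumps(c)}" for q, c in examples]
    return "\n\n".join([
        "You may call these tools:\n" + "\n".join(specs),
        "Examples:\n" + "\n".join(shots),
        "Think step by step about which tool (if any) the request needs, "
        "then output one JSON call.",
        f"User: {query}",
    ])

examples = [("Weather in Tokyo?", {"name": "get_weather", "arguments": {"city": "Tokyo"}})]
print(build_tool_prompt([get_weather], examples, "Is it raining in Oslo?"))
```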
Approach | Paper | Code | Comment |
---|---|---|---|
Ask-when-Needed | Learning to Ask: When LLMs Meet Unclear Instruction (Wang et al., 2024) | Code | On-demand clarification for tool selection |
Interactive refinement | - | - | Iterative query refinement through user interaction |
Approach | Paper | Venue | Code | Comment |
---|---|---|---|---|
Grammar Control | Grammar-Aligned Decoding (Park et al., 2024) | arXiv | Code | Constrains output using context-free grammar |
TOOL-ED | TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling Capability of LLM (Cao et al., 2024) | arXiv | - | Treats knowledge bases as callable tools for empathetic dialogue |
IBSEN | IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation (Han et al., 2024) | ACL | Code | Multi-agent coordination for controlled script generation |
Multi-agent coordination | Improving factuality and reasoning in language models through multiagent debate (Chan et al., 2023) | arXiv | Code | Collaborative refinement through structured agent debate |
Task proposal | Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (Zhou et al., 2024) | arXiv | Code | Automated task proposal and execution validation |
Experience transfer | X-TOOLS: Tool Generation and Adaptation from Existing APIs for Dialogue Agents (Patil et al., 2023) | arXiv | Code | Transfers API experience across different domains |
Function mapping plays a crucial role in deploying function calling: it transforms model outputs at the semantic level into executable commands in the physical space. Function mapping typically involves pronoun mapping, format alignment, and error checking; a minimal sketch follows the mapping tables below.
Approach | Paper | Code | Comment |
---|---|---|---|
Rule-based | Deterministic coreference resolution based on entity-centric, precision-ranked rules (Lee et al., 2013) | Code | Predefined mapping rules for contextual references |
Rule-based | End-to-end neural entity linking (Kolitsas et al., 2018) | Code | Neural approach to entity linking with rule-based components |
Knowledge reasoning | Knowledge-aware Pronoun Coreference Resolution (Zhang et al., 2019) | - | Leverages knowledge graphs for reference resolution |
LLM mapping | End-to-end Neural Coreference Resolution (Lee et al., 2017) | Code | Uses neural models for contextual mapping |
Approach | Paper | Code | Comment |
---|---|---|---|
Dictionary mapping | Syllabus: Portable Curricula for Reinforcement Learning Agents (Sullivan et al., 2024) | Code | Unified APIs and format alignment mechanisms |
Semantic matching | Improving Semantic Similarity for Low-Resource Named Entity Linking (Niu et al., 2022) | Code | Vector-based semantic similarity for linking entities |
Normalization | - | - | Format standardization for consistent representation |
Approach | Paper | Code | Comment |
---|---|---|---|
Parameter checking | - | - | Verification of parameter completeness and formatting |
Value enumeration | - | - | Validating input values against acceptable ranges |
Permission management | - | - | Ensuring appropriate access levels for function execution |
Approach | Paper | Code | Comment |
---|---|---|---|
Placeholder results | Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings (Hao et al., 2024) | Code | Generated placeholders replaced with API call results |
Placeholder results | Large language models encode clinical knowledge (Singhal et al., 2023) | - | Domain-specific placeholder generation for clinical applications |
Placeholder results | Toolformer: Language models can teach themselves to use tools (Schick et al., 2023) | Code | Self-supervised approach to result placeholder generation |
Function unpredictability | React: Synergizing reasoning and acting in language models (Yao et al., 2023) | Code | Reasoning-action interleaving to handle unpredictable outputs |
Approach | Paper | Code | Comment |
---|---|---|---|
Structure format | Gorilla: Large language model connected with massive apis (Patil et al., 2023) | Code | Structured templates for consistent output formatting |
Structure format | Prompt2model: Generating deployable models from natural language instructions (Pryzant et al., 2023) | - | Transforms natural language into structured model specifications |
Formatting | The api bank: A comprehensive benchmark for tool-augmented llms (Li et al., 2023) | Code | Standardized formatting for API responses |
Signatures | Instance-wise prompting for few-shot transferability of large language models (Pan et al., 2024) | Code | Instance-specific signature generation |
Approach | Paper | Code | Comment |
---|---|---|---|
Validation | Prompt2model: Generating deployable models from natural language instructions (Pryzant et al., 2023) | - | Validation mechanisms for generated model specifications |
Validation | T-eval: Evaluating the tool utilization capability of large language models step by step (Chen et al., 2024) | Code | Step-by-step validation of tool utilization |
Agent correction | Learning to use tools via cooperative and interactive agents (Shi et al., 2024) | Code | Specialized agents review and correct each other's actions |
Agent correction | Self-correction of large language models via cognitive psychology (Sun et al., 2024) | - | Psychological principles for improved self-correction |
Feedback | Great principles for learning to use tools with llms (Guo et al., 2024) | - | Principles for effective feedback incorporation |
Feedback | WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021) | - | Human feedback integration for improved web interactions |
Feedback | WhiteboardAgent: Autonomous Multi-Step Visual Language Reasoning via Whiteboard Interaction (Wang et al., 2024) | Code | Visual reasoning through whiteboard interaction feedback |
Approach | Paper | Code | Comment |
---|---|---|---|
Example retrieval | ClusterLLM: Large Language Models as a Guide for Text Clustering (Chen et al., 2023) | Code | Clustered example retrieval for enhanced responses |
System mapping | A neural probabilistic model for entity disambiguation using multiple resources (Agarwal et al., 2019) | - | Multi-resource entity disambiguation for system mapping |
System mapping | LLM+P: Empowering Large Language Models with Optimal Planning Proficiency (Liu et al., 2023) | - | Planning-oriented mapping for systematic responses |
System mapping | Llm+p: Empowering large language models with planning capabilities in multi-scenario human-ai collaboration (Ma et al., 2024) | - | Enhanced collaborative mapping between human inputs and AI responses |
System mapping | InstructExcel: A Benchmark for Natural Language Instructions in Excel (Mao et al., 2023) | Code | Domain-specific mapping for spreadsheet operations |
System mapping | Instruction Following Evaluation by Predicting Human Feedback (Muennighoff et al., 2023) | Code | Human feedback-based mapping evaluation |
System mapping | Instance-wise prompting for few-shot transferability of large language models (Pan et al., 2024) | Code | Instance-specific mapping mechanisms |
System mapping | Vipergpt: Visual inference via python execution for reasoning (Suris et al., 2023) | Code | Python execution-based visual reasoning and mapping |
Approach | Paper | Code | Comment |
---|---|---|---|
Hierarchical structure and storage | Memorybank: Enhancing large language models with long-term memory (Zhong et al., 2024) | Code | Hierarchical storage with Ebbinghaus-inspired updating |
Task-related symbolic memory | Zero-shot task-oriented dialogue in the wild (Xie et al., 2023) | - | Specialized memory structures for dialogue-based tasks |
Three-layered memory architecture | Longllms: Enabling language models to process long contexts by leveraging memory mechanisms (Li et al., 2024) | Code | Working, episodic, and semantic memory layers |
Persistent memory stream | Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system (Liang et al., 2023) | - | Continuous memory stream for unlimited context |
Approach | Paper | Code | Comment |
---|---|---|---|
Self-controlled memory mechanism | Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system (Liang et al., 2023) | - | Memory management through control systems |
Memory control system | Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system (Liang et al., 2023) | - | Automated memory control for extended contexts |
Memory control system | Memgpt: Towards llms as operating systems (Chen et al., 2023) | Code | Operating system-inspired memory management |
Multi-agent experience storage | Lmrl: Learning multiagent reinforcement learning framework in a collaborative agent society (Lee et al., 2024) | Code | Collaborative storage of multi-agent experiences |
Approach | Paper | Code | Comment |
---|---|---|---|
Cross-conversation memory retrieval | Memorybank: Enhancing large language models with long-term memory (Zhong et al., 2024) | Code | Retrieval mechanisms spanning multiple conversations |
LSH-based indexing mechanism | Memgpt: Towards llms as operating systems (Chen et al., 2023) | Code | Locality-sensitive hashing for efficient indexing |
Similarity-based retrieval | Synapse: Trajectory-as-exemplar prompting with memory for computer control (Zheng et al., 2023) | Code | Vector similarity for contextual memory access |
Efficient memory access | Think-in-memory: Recalling and post-thinking enable llms with long-term memory (Liu et al., 2023) | Code | Optimized access patterns for memory retrieval |
Approach | Paper | Code | Comment |
---|---|---|---|
Thought-based memory storage | Think-in-memory: Recalling and post-thinking enable llms with long-term memory (Liu et al., 2023) | Code | Stores and recalls thoughts rather than raw conversations |
Trajectory-as-exemplar framework | Synapse: Trajectory-as-exemplar prompting with memory for computer control (Zheng et al., 2023) | Code | Complete trajectories as exemplars for planning |
State abstraction mechanism | Synapse: Trajectory-as-exemplar prompting with memory for computer control (Zheng et al., 2023) | Code | Compact state representations for efficient storage |
Knowledge triplet | Memgpt: Towards llms as operating systems (Chen et al., 2023) | Code | Subject-predicate-object triplets for structured knowledge |
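As a rough illustration of the memory ideas above, the sketch below stores distilled "thoughts" rather than raw conversation turns and retrieves the most similar ones by token overlap. A production system would use embeddings, LSH indexing, or hierarchical storage as in the tables above; all names and data here are illustrative.

```python
class ThoughtMemory:
    """Toy long-term memory: capacity-bounded list of thoughts with
    overlap-based retrieval (stand-in for embedding similarity)."""

    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.thoughts: list[str] = []

    def add(self, thought: str) -> None:
        self.thoughts.append(thought)
        if len(self.thoughts) > self.capacity:      # crude eviction policy
            self.thoughts.pop(0)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = set(query.lower().split())
        def score(t: str) -> float:
            w = set(t.lower().split())
            return len(q & w) / (len(q | w) or 1)   # Jaccard overlap
        return sorted(self.thoughts, key=score, reverse=True)[:k]

mem = ThoughtMemory()
mem.add("user prefers temperatures in celsius")
mem.add("user's home airport is SFO")
print(mem.retrieve("book a flight from my home airport"))
```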
Experimental results demonstrate clear performance differences between models trained specifically for function calling and general-purpose models adapted to the task.
Metric | Description | Example Works |
---|---|---|
Recall@K | Proportion of relevant tools ranked within top K positions | COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models (Qu et al., 2024) |
NDCG@K | Normalized Discounted Cumulative Gain at K | Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning (Cheng et al., 2023) |
COMP@K | Completeness-oriented retrieval evaluation at K | COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models (Qu et al., 2024) |
Metric | Description | Example Works |
---|---|---|
Pass Rate | Proportion of successfully completed instructions | Toolllm: Facilitating large language models to master 16000+ real-world apis (Qin et al., 2023) |
Win/Success Rate | Quality evaluation including information richness, factual accuracy | NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls (Basu et al., 2024) |
Metric | Description | Example Works |
---|---|---|
T-Eval | Comprehensive assessment of planning, reasoning, retrieval, understanding | T-eval: Evaluating the tool utilization capability of large language models step by step (Chen et al., 2024) |
Metric | Description | Example Works |
---|---|---|
BLEU | Bilingual Evaluation Understudy for translation quality | Bleu: a method for automatic evaluation of machine translation (Papineni et al., 2002) |
ROUGE-L | Longest Common Subsequence based metric for text summarization | Rouge: A package for automatic evaluation of summaries (Lin, 2004) |
Exact Match | Binary assessment of complete answer correctness | Bootstrapping a neural natural language interface for databases (Bogin et al., 2019) |
F1 score | Harmonic mean of precision and recall | Attention is all you need (Vaswani et al., 2017) |
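For reference, several of the metrics above reduce to a few lines of code. The sketch below gives plain implementations of Recall@K, Exact Match, and token-level F1; the normalization choices are simplified assumptions rather than any benchmark's official scoring script.

```python
def recall_at_k(ranked_tools: list[str], relevant: set[str], k: int) -> float:
    """Recall@K: fraction of relevant tools that appear in the top-K ranking."""
    return len(set(ranked_tools[:k]) & relevant) / len(relevant)

def exact_match(pred: str, gold: str) -> bool:
    """Exact Match: binary check after light normalization."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(recall_at_k(["get_weather", "search_web", "get_time"], {"get_weather", "get_time"}, 2))  # 0.5
print(exact_match("Paris", " paris "))                      # True
print(round(token_f1("sunny in Paris today", "Paris is sunny"), 2))
```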
Name | Paper | Code | Description |
---|---|---|---|
ToolLLM | Toolllm: Facilitating large language models to master 16000+ real-world apis (Qin et al., 2023) | Code | Comprehensive benchmark for API utility |
ToolAlpaca | Toolalpaca: Generalized tool learning for language models with 3000 simulated cases (Tang et al., 2023) | Code | Generalized tool learning with simulated cases |
Gorilla | Gorilla: Large language model connected with massive apis (Patil et al., 2023) | Code | Berkeley Function Calling Leaderboard |
Name | Paper | Code | Description |
---|---|---|---|
APIBench | Gorilla: Large language model connected with massive apis (Patil et al., 2023) | Code | Platform for standardized API evaluation |
API-Bank | Api-bank: A benchmark for tool-augmented llms (Li et al., 2023) | Code | Comprehensive API interaction testing |
Name | Paper | Code | Description |
---|---|---|---|
ShortcutsBench | Shortcutsbench: A large-scale real-world benchmark for api-based agents (Shen et al., 2024) | Code | Real APIs from Apple's operating systems |
BigCodeBench | You are not alone: Large language models effectively leverage duplications in code corpus (Zhou et al., 2023) | Code | Specialized benchmark for code-related function calls |
SEAL | Seal: A benchmark for software api learning with generative ai agents (Ji et al., 2023) | Code | Software API learning benchmark |
RadABench | Radial agent benchmark: evaluating task generalization capabilities of multi-platform ai agents (Yuan et al., 2024) | Code | Cross-platform agent evaluation framework |
NoisyToolBench | Learning to Ask: When LLMs Meet Unclear Instruction (Wang et al., 2024) | Code | Evaluates performance with unclear or noisy instructions |
Mobile-Bench | Benchmarking large language models on mobile applications (Cao et al., 2024) | Code | Specialized benchmark for mobile application interactions |
Name | Paper | Code | Description |
---|---|---|---|
IN3 | In3: Instruction-following language models for interactive tasks (Qi et al., 2023) | Code | Interactive task evaluation with instruction following |
NESTFUL | NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls (Basu et al., 2024) | - | Focuses on nested sequences of API calls |
UltraTool | Pluto: A recipe for building adaptable autonomous llm agents (Guan et al., 2024) | Code | Evaluates adaptable autonomous agent capabilities |
AppWorld | AppWorld: A Benchmark for Physical Mobile App Embodied Agent (Tian et al., 2023) | Code | Physical mobile app interaction benchmark |
TheAgentCompany | The agent company: A generative agent simulation of a software company (Yuan et al., 2024) | - | Simulated software company environment for evaluation |
AgentBoard | AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents (Liu et al., 2023) | Code | Multi-turn agent evaluation platform |
TravelPlanner | Travel planner: A benchmark for real-world planning with language agents (Wang et al., 2024) | Code | Travel planning task-specific benchmark |
ChinaTravel | Travel assistant: A benchmark for chinese llm agents in the tourism domain (Xia et al., 2024) | - | Chinese language travel planning benchmark |
Name | Paper | Code | Description |
---|---|---|---|
API-BLEND | API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs (Basu et al., 2024) | Code | Multi-domain API coverage with evaluation methods |
NESTOOLS | Nestools: Crafting efficient tools across diverse scenarios (Choi et al., 2024) | Code | Comprehensive evaluation across diverse scenarios |
MTU-Bench | MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models (Wang et al., 2024) | Code | Multi-granularity tool-use evaluation |
WTU-EVAL | Web tool use evaluation: Measuring large language models' capabilities on realistic web tasks (Mishra et al., 2023) | Code | Web-based tool usage evaluation framework |
Name | Organization | Release/Paper | Description |
---|---|---|---|
ChatGPT plugins | OpenAI | Introducing ChatGPT plugins | Ecosystem of third-party extensions for specific functionalities |
Claude's tool use API | Anthropic | Claude 3 Opus technical report | Native function calling capabilities in Claude AI models |
Cohere Command | Cohere | Introducing Cohere Command Light | API for function calling and structured JSON responses |
Qwen | Alibaba | Qwen Technical Report (Yang et al., 2023) | Multi-function Chinese language models with tool usage |
DeepSeek | DeepSeek | DeepSeek: Generalized Autoregressive Pretraining for Language and Vision (Dai et al., 2024) | Generalized foundation model with capabilities across tasks |
Name | Organization | Repository | Description |
---|---|---|---|
HuggingFace Transformer Agents | Hugging Face | Code | Framework for building agents with Hugging Face models |
Semantic Kernel | Microsoft | Code | SDK for building AI applications with native tool integration |
LangChain | LangChain | Code | Framework for building applications with LLMs and tools |
WebCPM | Tsinghua University | Code | Chinese web agent framework with browsing capabilities |
Name | Developer | Repository | Description |
---|---|---|---|
Auto-GPT | Significant Gravitas | Code | Self-prompting autonomous agent system |
BabyAGI | Yohei Nakajima | Code | Task-driven autonomous agent framework |
BMTools | OpenBMB | Code | Toolset for enhancing language models with functions |
RestGPT | Microsoft | Code | Model that can interact with RESTful APIs |
xLAM | Salesforce AI Research | Code | Family of large action models tuned for agentic function calling |
Octopus-v4 | Baichuan | Octopus technical report (Hao et al., 2023) | Multi-agent system for complex task completion |
Name | Developer | Repository | Description |
---|---|---|---|
GRANITE-20B | IBM Research | Code | Large language model optimized for coding and tool use |
Mistral 7B | Mistral AI | Code | Open-weight model with tool use capabilities |
NexusRaven V2-13B | Nexusflow | Code | Function calling and multi-modality specialized model |
Gorilla | UC Berkeley | Code | Model specialized in API usage and integration |
FireFunction V1 | Fireworks AI | Model | Purpose-built for function calling capabilities |
Nous Hermes 2 | Nous Research | Model | Instruction-tuned model with enhanced tool use |
Name | Organization | Link | Description |
---|---|---|---|
AgentInstruct | Microsoft | Paper (Zeng et al., 2023) | Instruction dataset for agent training and evaluation |
AgentOhana | Duke University | Paper (Yang et al., 2024) | High-quality dataset for training multi-task agents |
Lumos | Cornell University | Paper (Guo et al., 2023) | Multi-step reasoning dataset for tool-based tasks |
- Standards Challenge: Lack of universally accepted standard for assessing quality and performance
- Latency Problems: High latency and low throughput affecting user experience
- Security Vulnerabilities: Potential for "jailbreak function" attacks and other security concerns
- Technical Costs: Integration and maintenance costs for API modifications
- System Architecture Limitations: Constraints imposed by existing system architectures
- Standardization Needs: Requirement for standardized API modification processes
- Complex Processing: Multiple steps in feedback processing introducing errors
- Learning Assessment: Difficulty in quantifying effectiveness of human feedback
- Strategy Requirements: Need for advanced algorithms to interpret unstructured feedback
- Isolation Strategy: Challenges in appropriately isolating functions for business needs
- Regulatory Compliance: Meeting specific regulatory requirements across functions
- Post-processing Solutions: Implementing effective middleware for compliance and data transformation