An index of concepts, frameworks, and methodologies in:
- Function Calling Pipeline: Understanding the entire process from pre-call to post-call stages
- Sample Construction & Fine-tuning: Building effective training datasets and optimizing models
- Deployment & Inference: Practical implementation strategies for real-world applications
- Evaluation Frameworks: Benchmarks and metrics for assessing function calling capabilities
Reproducibility is important! We prioritize methods with open-source implementations.
Please cite our survey paper if this index is helpful:
@article{wang2025comprehensive,
  title={Function Calling in Large Language Models: Industrial Practices, Challenges, and Future Direction},
  author={Wang, Maolin and Zhang, Yingyi and Peng, Cunyin and Chen, Yicheng and Zhou, Wei and Gu, Jinjie and Zhuang, Chenyi and Guo, Ruocheng and Yu, Bowen and Wang, Wanyu and Zhao, Xiangyu},
  url={https://openreview.net/pdf?id=LNxVGPedFW},
  year={2025}
}
- Challenges
- Sample Construction and Fine-Tuning
- Deployment and Inference
- Evaluation
- Industry Products
- Open Issues
Function calling capabilities in LLMs follow a three-stage workflow consisting of pre-call processing, on-call execution, and post-call validation; a minimal end-to-end sketch follows the challenge tables below.
Challenge | Description |
---|---|
Challenge 1.1: Intent Recognition | Understanding user intentions accurately from natural language queries |
Challenge 1.2: Function Redundancy | Managing redundant functions that serve similar purposes, increasing selection complexity |
Challenge | Description |
---|---|
Challenge 2.1: Missing Calls | Failure to initiate function calls when required for task completion |
Challenge 2.2: Unnecessary Calls | Triggering function calls when not required by the user's task |
Challenge 3.1: Missing/Illegal Parameters | Inadequate or inappropriate parameter extraction from user inputs |
Challenge 3.2: Function Hallucination | Mistakenly calling non-candidate or non-existent functions |
Challenge 3.3: Pronoun Resolution | Correctly interpreting contextual references and pronouns in queries |
Challenge 3.4: LLM Inherent Limitations | Performance constraints in latency and accuracy due to model architecture |
Challenge 3.5: Multi-Call Procedure | Managing complex workflows requiring multiple related function calls |
Challenge 3.6: Effective Context Management | Maintaining relevant information across multi-turn conversations |
Challenge | Description |
---|---|
Challenge 4.1: Execution Result Mismatch | Function outputs not aligning with user expectations |
Challenge 4.2: Irrelevant Information Overload | Excessive irrelevant information in function outputs |
Challenge 4.3: Mismatch Between Real-World Functions and Results | Gap between LLM-generated outputs and executable code |
Challenge 4.4: Execution Failure | Functions failing despite correct triggering and parameterization |
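To make the three stages concrete, here is a minimal, hypothetical sketch of the pre-call / on-call / post-call pipeline. The tool registry, the stubbed model call, and all names are illustrative only; the post-call check guards against the parameter and hallucination challenges listed above.

```python
import json

# Hypothetical tool registry; names and schemas are illustrative only.
TOOLS = {
    "get_weather": {
        "description": "Return current weather for a city.",
        "parameters": {"city": str},
    },
    "convert_currency": {
        "description": "Convert an amount between two currencies.",
        "parameters": {"amount": float, "from": str, "to": str},
    },
}

def pre_call(query: str) -> list[str]:
    """Pre-call: shortlist candidate functions (keyword match stands in for a retriever)."""
    return [name for name, spec in TOOLS.items()
            if any(w in spec["description"].lower() for w in query.lower().split())]

def on_call(query: str, candidates: list[str]) -> dict:
    """On-call: the LLM would pick a function and fill its arguments; stubbed here."""
    # A real system would prompt the model with the candidate schemas.
    return {"name": "get_weather", "arguments": {"city": "Paris"}}

def post_call(call: dict) -> dict:
    """Post-call: check the call names a known function and has the required parameters."""
    spec = TOOLS.get(call["name"])
    if spec is None:
        raise ValueError(f"Hallucinated function: {call['name']}")   # Challenge 3.2
    missing = set(spec["parameters"]) - set(call["arguments"])
    if missing:
        raise ValueError(f"Missing parameters: {missing}")           # Challenge 3.1
    return call

query = "What is the weather in Paris?"
validated = post_call(on_call(query, pre_call(query)))
print(json.dumps(validated, indent=2))
```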
The training process involves specialized data preparation and fine-tuning strategies to equip models with function calling capabilities while maintaining general language understanding.
Method | Description |
---|---|
Manual Construction | Human-crafted functions with precise specifications and documentation |
LLM Generation | Leveraging large language models like GPT-4, LLaMA 70B, and Qwen to automatically generate function specifications |
Web Mining | Extracting diverse function objects from web resources, with descriptions supplemented by LLMs when necessary |
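Whichever collection method is used, each sample is typically serialized into a tool schema plus a conversation trace. The example below is a hypothetical sample in a generic chat-style format; the field names and the `search_flights` schema are assumptions for illustration, not drawn from any specific dataset above.

```python
import json

# One hypothetical training sample: declared tools, a user query, the assistant's
# tool call, the tool result, and the grounded final answer.
sample = {
    "tools": [{
        "name": "search_flights",
        "description": "Search flights between two airports on a date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "format": "date"},
            },
            "required": ["origin", "destination", "date"],
        },
    }],
    "messages": [
        {"role": "user", "content": "Find me a flight from SFO to JFK on 2025-03-01."},
        {"role": "assistant", "tool_call": {
            "name": "search_flights",
            "arguments": {"origin": "SFO", "destination": "JFK", "date": "2025-03-01"},
        }},
        {"role": "tool", "content": "[{\"flight\": \"AA 16\", \"depart\": \"07:00\"}]"},
        {"role": "assistant", "content": "AA 16 departs SFO at 07:00 on 2025-03-01."},
    ],
}
print(json.dumps(sample, indent=2))
```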
Approach | Paper | Code | Description |
---|---|---|---|
Text Representation | Toolformer: Language models can teach themselves to use tools (Schick et al., 2024) | Code | Represents functions as natural language text, providing flexibility but requiring more token space |
Text Representation | ToolGen: Unified Tool Retrieval and Calling via Generation (Wang et al., 2024) | Code | Integrates tool information through generation with natural language descriptions |
Token Representation | Toolformer: Language models can teach themselves to use tools (Schick et al., 2024) | Code | Encodes functions as special tokens during training for computational efficiency |
Token Representation | ToolGen: Unified Tool Retrieval and Calling via Generation (Wang et al., 2024) | Code | Uses token representation during training while maintaining semantic richness |
Multi-turn Interaction | Sequential API Function Calling Using GraphQL Schema (Saha et al., 2024) | - | Introduces structured API schemas and response mapping for sequential function calling |
Multi-turn Interaction | Hammer: Robust Function-Calling for On-Device Language Models via Function Masking (Lin et al., 2024) | - | Specialized techniques to address naming convention sensitivity issues for on-device deployment |
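To contrast the two representations above: with text representation the tool description occupies many prompt tokens, whereas token representation collapses each tool into a single special token that the model can emit in one decoding step. The sketch below illustrates the difference with the Hugging Face tokenizer API; `gpt2` and the tool name are stand-ins, not choices made by the cited papers.

```python
from transformers import AutoTokenizer

tool_desc = "get_weather(city): returns the current weather for a city"

# Text representation: the tool appears as natural-language text in the prompt,
# which is flexible but consumes many tokens.
text_prompt = f"You can use the following tool:\n{tool_desc}\nUser: weather in Paris?"

# Token representation (Toolformer/ToolGen style): the tool becomes one special
# token, so selecting it is a single decoding step. "gpt2" is just a stand-in model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<tool_get_weather>"]})
# (A real setup would also call model.resize_token_embeddings(len(tokenizer)).)

print(len(tokenizer.tokenize(tool_desc)))        # many tokens for the text form
print(tokenizer.tokenize("<tool_get_weather>"))  # a single special token
```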
Method | Paper | Description |
---|---|---|
Supervised Fine-Tuning (SFT) | ToolGen: Unified Tool Retrieval and Calling via Generation (Wang et al., 2024) | Standard fine-tuning approach with unified retrieval and calling generation |
Supervised Fine-Tuning (SFT) | RAIT: Retrieval Augmented Instruction Tuning (Asai et al., 2023) | Retrieval-augmented approach for instruction tuning |
Supervised Fine-Tuning (SFT) | Show your work: Scratchpads for intermediate computation with language models (Nye et al., 2021) | Scratchpad-based training for step-by-step computation |
Supervised Fine-Tuning (SFT) | Giving BERT a calculator: Finding operations and arguments with reading comprehension (Andor et al., 2019) | Integrates mathematical operations with language understanding |
Supervised Fine-Tuning (SFT) | Rainier: Reinforced knowledge introspector for commonsense question answering (Liu et al., 2022) | Knowledge introspection for improved reasoning |
Supervised Fine-Tuning (SFT) | Learning to represent programs with graphs (Allamanis et al., 2018) | Program representation through graph structures |
Supervised Fine-Tuning (SFT) | A deep generative model of code syntactic structures (Barone et al., 2017) | Syntax-aware code generation models |
Supervised Fine-Tuning (SFT) | Pre-training for Abstractive Document Summarization (Liu et al., 2019) | Domain-specific pre-training for document summarization |
Supervised Fine-Tuning (SFT) | Character-level neural network for biomedical named entity recognition (Liu et al., 2017) | Character-level models for biomedical entity recognition |
Parameter-Efficient Fine-Tuning (PEFT) | Gpt4tools: Teaching large language model to use tools via self-instruction (Yang et al., 2024) | Self-instruction approach for tool utilization |
Parameter-Efficient Fine-Tuning (PEFT) | CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance (Hao et al., 2024) | Balanced approach for tool utilization without performance tradeoffs |
Parameter-Efficient Fine-Tuning (PEFT) | Toolformer: Language models can teach themselves to use tools (Schick et al., 2024) | Self-supervised learning for tool usage |
Parameter-Efficient Fine-Tuning (PEFT) | PLUG: Parameter-efficient LLMs Using Plugin Adapters (Li et al., 2023) | Plugin adapter approach for parameter efficiency |
Parameter-Efficient Fine-Tuning (PEFT) | Prompt tuning for generative multimodal pretrained models (Wei et al., 2022) | Prompt-based tuning for multimodal generation |
Reinforcement Learning & RLHF | WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021) | Web browsing capabilities enhanced through human feedback |
Reinforcement Learning & RLHF | Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis (Liang et al., 2024) | Large-scale API connectivity through reinforcement learning |
Reinforcement Learning & RLHF | MADAC: Multi-Agent Decision-Aware Conversation via Reinforcement Learning (Li et al., 2023) | Decision-aware conversation through multi-agent reinforcement learning |
Reinforcement Learning & RLHF | GopherCite: Teaching language models to support answers with verified quotes (Menick et al., 2022) | Citation verification through reinforcement learning |
Reinforcement Learning & RLHF | Emergent Abilities of Large Language Models (Kojima et al., 2022) | Studies emergent abilities through reinforcement learning approaches |
Reinforcement Learning & RLHF | Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023) | Preference optimization without explicit reward modeling |
Reinforcement Learning & RLHF | Deep reinforcement learning from human preferences (Christiano et al., 2017) | Foundational work on learning from human preferences |
Reinforcement Learning & RLHF | The Bias-Variance Trade-off in RLHF: Overfitting to Human Feedback in Large Language Models (Manduzio et al., 2023) | Analysis of overfitting risks in human feedback |
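As a concrete example of the PEFT row above, a LoRA adapter can be attached so that only a small fraction of weights is updated during function-calling fine-tuning. The sketch below uses the `peft` and `transformers` libraries; the base model name, target modules, and hyperparameters are illustrative assumptions rather than settings from any cited paper.

```python
# Minimal LoRA/PEFT sketch; values are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # stand-in base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # only a small fraction of weights train

# From here, a standard SFT loop (e.g. transformers.Trainer or trl's SFTTrainer)
# over serialized function-calling samples like the one sketched earlier would
# update just the adapter weights, helping preserve general language ability.
```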
Based on practical implementations, we emphasize that data quality (and variety) plays a more crucial role than data quantity in both data construction and fine-tuning phases, given the intricate nature of function calling tasks.
Emphasis | Description |
---|---|
Data Quality | Prioritizing dataset diversity and verification over quantity for more robust function calling capabilities |
Model Scaling | Larger models demonstrate significantly better function calling capabilities, with notable improvements above 7B parameters |
Capability Balance | Maintaining a balance between specialized function calling abilities and general language capabilities to avoid performance tradeoffs |
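As one concrete (and hypothetical) illustration of putting quality before quantity, the sketch below deduplicates samples and drops any whose tool calls reference undeclared tools or omit required parameters; it assumes the same illustrative sample schema sketched earlier.

```python
import json

def is_valid(sample: dict) -> bool:
    """Keep a sample only if every tool call names a declared tool and
    supplies all of that tool's required parameters (hypothetical schema)."""
    declared = {t["name"]: t for t in sample.get("tools", [])}
    for msg in sample.get("messages", []):
        call = msg.get("tool_call")
        if call is None:
            continue
        spec = declared.get(call["name"])
        if spec is None:
            return False                                   # hallucinated tool
        required = set(spec["parameters"].get("required", []))
        if not required <= set(call["arguments"]):
            return False                                   # missing parameters
    return True

def dedup_and_filter(samples: list[dict]) -> list[dict]:
    """Exact-duplicate removal plus validity filtering; diversity-aware
    sampling would go further, but this is the minimal version."""
    seen, kept = set(), []
    for s in samples:
        key = json.dumps(s, sort_keys=True)
        if key not in seen and is_valid(s):
            seen.add(key)
            kept.append(s)
    return kept
```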
This section explores practical deployment strategies for function-calling LLMs. In a typical workflow, queries pass through input construction, LLM processing, and format validation or execution, with memory components maintaining context throughout; a minimal sketch of this loop appears below.
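A single turn of that deployment loop might look as follows. Everything here is an illustrative assumption: the tool registry, the stubbed `call_llm`, and the sliding-window memory stand in for a real serving stack rather than reflecting any specific framework's API.

```python
import json

def build_prompt(query: str, tools: dict, memory: list[str]) -> str:
    """Input construction: tool schemas + recent memory + the current query."""
    return "\n".join([
        "Tools: " + json.dumps(tools),
        "Context: " + " | ".join(memory[-3:]),   # naive sliding-window memory
        "User: " + query,
        "Reply with one JSON object: {\"name\": ..., \"arguments\": ...}",
    ])

def call_llm(prompt: str) -> str:
    """Stand-in for the model call (a real deployment would hit an LLM API)."""
    return '{"name": "get_time", "arguments": {"timezone": "UTC"}}'

def run_turn(query: str, tools: dict, executors: dict, memory: list[str]) -> str:
    raw = call_llm(build_prompt(query, tools, memory))
    try:
        call = json.loads(raw)                                   # format validation
        result = executors[call["name"]](**call["arguments"])    # execution
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        result = f"error: {err}"                  # would trigger a retry in practice
    memory.append(f"{query} -> {result}")         # persist context for later turns
    return result

tools = {"get_time": {"timezone": "IANA timezone name"}}
executors = {"get_time": lambda timezone: f"12:00 {timezone}"}
memory: list[str] = []
print(run_turn("What time is it in UTC?", tools, executors, memory))
```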
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
ReAct | React: Synergizing reasoning and acting in language models (Yao et al., 2022) | NeurIPS | Code | Combines reasoning and acting through chain-of-thought prompts |
ToolFormer | Toolformer: Language models can teach themselves to use tools (Schick et al., 2023) | NeurIPS | Code | Enables LLMs to use external tools through self-supervised learning |
Reverse Chain | Reverse chain: A generic-rule for llms to master multi-api planning (Zhang et al., 2023) | arXiv | - | Introduces target-driven backward reasoning for controlled multi-API planning |
AVATAR | AvaTaR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval (Wu et al., 2024) | arXiv | Code | Actor-comparator architecture for tool-assisted knowledge retrieval |
DEPS | Describe, explain, plan and select: interactive planning with LLMs enables open-world multi-task agents (Wang et al., 2024) | NeurIPS | Code | Interactive planning through description-based decomposition |
LLM-MCTS | LLM-MCTS: Monte Carlo Tree Search with LLMs for reasoning tasks (Zhao et al., 2023) | arXiv | Code | Monte Carlo Tree Search approach for multi-step reasoning |
MACT | Measuring and narrowing the compositional gap in language models (Zheng et al., 2023) | arXiv | Code | Addresses compositional generalization through structured decomposition |
TACO | Taco: Towards api conversation workflows for tool augmentation (Mao et al., 2024) | arXiv | Code | Structured workflows for tool-augmented conversational agents |
PAE | Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (Zhou et al., 2024) | arXiv | Code | Multi-agent system for autonomous skill discovery and planning |
SCIAGENT | SciAgent: Tool-augmented language models for scientific reasoning (Wang et al., 2023) | arXiv | Code | Tool-augmented planning for scientific problem-solving |
Agent Laboratory | Agent laboratory: Using llm agents as research assistants (Schmidgall et al., 2025) | arXiv | - | Multi-agent architecture with specialized roles for research planning |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
AppAgent | AppAgent: Multimodal Agents as Smartphone Users (Yang et al., 2023) | arXiv | Code | Agents learn to operate smartphone applications via visual interfaces |
OS-ATLAS | OS-ATLAS: Foundation AI Agent for Desktop Operating Systems (Wang et al., 2024) | arXiv | Code | End-to-end desktop OS navigation with multimodal perception |
AndroidLab | AndroidLab: Large Language Models for Android UI Navigation (Yan et al., 2023) | arXiv | Code | Benchmarking and improving LLM-based Android UI navigation |
Ponder | Ponder: Toolkit-Aware Agent for Automating Desktop Tasks (Wu et al., 2024) | arXiv | Code | Self-reflective navigation through desktop interfaces |
OS-Genesis | OS-Genesis: Evaluating the multimodal capabilities of large language models in navigating operating systems (Li et al., 2024) | arXiv | Code | Comprehensive benchmark for evaluating OS navigation capabilities |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
Orca | Orca: Progressive learning from complex explanation traces of gpt-4 (Mukherjee et al., 2023) | arXiv | - | Learns from complex explanation traces for progressive improvement |
Orca 2 | Orca 2: Teaching small language models how to reason (Mitra et al., 2023) | arXiv | Code | Enhanced reasoning capabilities through step-by-step explanation |
Memgpt | Memgpt: Towards llms as operating systems (Chen et al., 2023) | arXiv | Code | Memory management system with hierarchical storage |
AIOS-Agent | Aios-agent: In-context fine-grained os control with large language models (Chu et al., 2024) | arXiv | Code | System-level control through fine-grained OS operations |
SpecInfer | Specinfer: Accelerating generative llm inference via speculative execution (Yan et al., 2023) | arXiv | Code | Performance optimization through speculative execution |
PEOA | PEOA: Progressive Exemplar-Oriented API-Aware Prompting (Wang et al., 2024) | arXiv | Code | Exemplar-based prompting for API-aware interactions |
LLM-Tool Compiler | Compiler-aided Generation for Tool-LLM Inference (Song et al., 2024) | arXiv | - | Compilation techniques to optimize tool operations |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
LLM-Planner | Llm-planner: Few-shot grounded planning for embodied agents with large language models (Song et al., 2023) | ICCV | Code | Environmental feedback for plan regeneration during failures |
ToolChain* | Toolchain*: Efficient action space navigation in large language models with a* search (Zhuang et al., 2023) | arXiv | Code | Employs decision trees for systematic API call management |
TPTU | Test-Time Prompt Updating for Text-to-Image Generative Models (Liang et al., 2023) | arXiv | Code | Adaptive prompt refinement based on execution feedback |
Buckets | Buckets: Efficient multi-environment learning for llm agents (Burkart et al., 2023) | arXiv | Code | Error-aware multi-environment learning framework |
AMOR | AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback (Guan et al., 2024) | arXiv | Code | FSM-based framework enabling process-level human feedback |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
ControlLLM | Controlllm: Augment language models with tools by searching on graphs (Liu et al., 2023) | arXiv | Code | Tree of Thoughts with depth-first search on tool graphs |
PLUTO | Pluto: A recipe for building adaptable autonomous llm agents (Guan et al., 2024) | arXiv | Code | Adaptable autonomous agents with tree-based planning |
Toolink | Toolink: Linking toolkit creation and using through chain-of-solving on open-source model (Qian et al., 2023) | arXiv | Code | Hierarchical task decomposition with toolkit creation |
TPTU-v2 | TPTU-v2: Boosting Test-Time Prompt Tuning for Text-to-Image Generation (Kawar et al., 2023) | arXiv | Code | Enhanced tree-based prompt optimization strategies |
α-UMi | Small llms are weak tool learners: A multi-llm agent (Shen et al., 2024) | arXiv | Code | Planning-oriented fine-tuning for small LLMs |
Name | Paper | Venue | Code | Comment |
---|---|---|---|---|
COA | Chain of agents: A framework for collaborative tool utilization with language models (Chang et al., 2024) | arXiv | Code | Agent collaboration framework for specialized tool utilization |
DEER | DEER: Diverse Evolution Ensembles are Required for Large Language Model Agents (Chen et al., 2024) | arXiv | Code | Diverse evolution ensembles for LLM agent improvement |
SOAY | SOAY: Responsive and Safe Structured Editing with Dynamic Text Features (Wang et al., 2024) | arXiv | Code | Dynamic text feature adaptation for structured editing |
ProgPrompt | ProgPrompt: Generating Situated Robot Task Plans using Large Language Models (Singh et al., 2022) | arXiv | Code | Adaptive programming for situated robot task planning |
AutoTOD | Towards fully autonomous dialogue systems via interactive few-shot learning (Zhang et al., 2023) | arXiv | - | Interactive few-shot learning for dialogue system adaptation |
MATMCD | MATMCD: An Open Benchmark for Mobile Agent Testing in Minecraft with Concept Drift (Xiong et al., 2024) | arXiv | Code | Adaptive strategies for concept drift in Minecraft environments |
CC-PP | CC-PP: Chain-of-components pipeline prompting for planning with large language models (Gui et al., 2024) | arXiv | - | Component-based pipeline approach for adaptive planning |
AVT | AVT: Bridging Vision and Language with Adaptive Vision Transformers (Yang et al., 2024) | arXiv | Code | Adaptive vision transformers for multimodal planning |
K-agents | Autonomous Agents for Real-Time Decision Making: Applications in Banking (Balaji et al., 2023) | arXiv | - | Autonomous agent adaptation for financial decision making |
Agent-Pro | Agent-pro: Learning to evolve via policy-level reflection and optimization (Zhang et al., 2024) | arXiv | Code | Dynamic belief management and policy-level reflection |
Inner Thoughts | Proactive Conversational Agents with Inner Thoughts (Liu et al., 2024) | arXiv | Code | Continuous thought generation for proactive participation |
Approach | Paper | Code | Comment |
---|---|---|---|
Example demonstrations | Instance-wise prompting for few-shot transferability of large language models (Pan et al., 2024) | Code | Tailored examples for improved function understanding |
Four-shot prompting | - | - | Four in-context examples reported as an effective budget for tool usage |
Approach | Paper | Code | Comment |
---|---|---|---|
Function definitions | - | - | Including comprehensive function specifications in context |
Docstrings | - | - | Utilizing standardized documentation formats for clarity |
Chain-of-thought | Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022) | - | Step-by-step reasoning process for complex function selection |
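The prompt-construction approaches above (function definitions, docstrings, few-shot demonstrations, and a chain-of-thought instruction) can be combined into a single prompt builder. The sketch below is illustrative: `get_weather`, the example queries, and the prompt wording are assumptions, not taken from any cited paper.

```python
import inspect, json

def get_weather(city: str) -> dict:
    """Return the current weather for `city` as {"temp_c": float, "sky": str}."""
    ...

def build_tool_prompt(funcs, examples, query):
    """Combine docstring-based tool specs, few-shot demonstrations, and a
    chain-of-thought instruction into one prompt (all content illustrative)."""
    specs = [f"{f.__name__}{inspect.signature(f)}: {inspect.getdoc(f)}" for f in funcs]
    shots = [f"User: {q}\nCall: {json.dumps(c)}" for q, c in examples]
    return "\n\n".join([
        "You may call these tools:\n" + "\n".join(specs),
        "Examples:\n" + "\n".join(shots),
        "Think step by step about which tool (if any) the request needs, "
        "then output one JSON call.",
        f"User: {query}",
    ])

examples = [("Weather in Tokyo?", {"name": "get_weather", "arguments": {"city": "Tokyo"}})]
print(build_tool_prompt([get_weather], examples, "Is it raining in Oslo?"))
```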
Approach | Paper | Code | Comment |
---|---|---|---|
Ask-when-Needed | Learning to Ask: When LLMs Meet Unclear Instruction (Wang et al., 2024) | Code | On-demand clarification for tool selection |
Interactive refinement | - | - | Iterative query refinement through user interaction |
Approach | Paper | Venue | Code | Comment |
---|---|---|---|---|
Grammar Control | Grammar-Aligned Decoding (Park et al., 2024) | arXiv | Code | Constrains output using context-free grammar |
TOOL-ED | TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling Capability of LLM (Cao et al., 2024) | arXiv | - | Treats knowledge bases as callable tools for empathetic dialogue |
IBSEN | IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation (Han et al., 2024) | ACL | Code | Multi-agent coordination for controlled script generation |
Multi-agent coordination | Improving factuality and reasoning in language models through multiagent debate (Chan et al., 2023) | arXiv | Code | Collaborative refinement through structured agent debate |
Task proposal | Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (Zhou et al., 2024) | arXiv | Code | Automated task proposal and execution validation |
Experience transfer | X-TOOLS: Tool Generation and Adaptation from Existing APIs for Dialogue Agents (Patil et al., 2023) | arXiv | Code | Transfers API experience across different domains |
Function mapping plays a crucial role in deploying function calling: it transforms model outputs at the semantic level into executable commands in the physical space. Function mapping typically involves pronoun mapping, format alignment, and error checking; a minimal sketch follows the mapping tables below.
Approach | Paper | Code | Comment |
---|---|---|---|
Rule-based | Deterministic coreference resolution based on entity-centric, precision-ranked rules (Lee et al., 2013) | Code | Predefined mapping rules for contextual references |
Rule-based | End-to-end neural entity linking (Kolitsas et al., 2018) | Code | Neural approach to entity linking with rule-based components |
Knowledge reasoning | Knowledge-aware Pronoun Coreference Resolution (Zhang et al., 2019) | - | Leverages knowledge graphs for reference resolution |
LLM mapping | End-to-end Neural Coreference Resolution (Lee et al., 2017) | Code | Uses neural models for contextual mapping |
Approach | Paper | Code | Comment |
---|---|---|---|
Dictionary mapping | Syllabus: Portable Curricula for Reinforcement Learning Agents (Sullivan et al., 2024) | Code | Unified APIs and format alignment mechanisms |
Semantic matching | Improving Semantic Similarity for Low-Resource Named Entity Linking (Niu et al., 2022) | Code | Vector-based semantic similarity for linking entities |
Normalization | - | - | Format standardization for consistent representation |
Approach | Paper | Code | Comment |
---|---|---|---|
Parameter checking | - | - | Verification of parameter completeness and formatting |
Value enumeration | - | - | Validating input values against acceptable ranges |
Permission management | - | - | Ensuring appropriate access levels for function execution |
Approach | Paper | Code | Comment |
---|---|---|---|
Placeholder results | Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings (Hao et al., 2024) | Code | Generated placeholders replaced with API call results |
Placeholder results | Large language models encode clinical knowledge (Singhal et al., 2023) | - | Domain-specific placeholder generation for clinical applications |
Placeholder results | Toolformer: Language models can teach themselves to use tools (Schick et al., 2023) | Code | Self-supervised approach to result placeholder generation |
Function unpredictability | React: Synergizing reasoning and acting in language models (Yao et al., 2023) | Code | Reasoning-action interleaving to handle unpredictable outputs |
Approach | Paper | Code | Comment |
---|---|---|---|
Structure format | Gorilla: Large language model connected with massive apis (Patil et al., 2023) | Code | Structured templates for consistent output formatting |
Structure format | Prompt2model: Generating deployable models from natural language instructions (Pryzant et al., 2023) | - | Transforms natural language into structured model specifications |
Formatting | The api bank: A comprehensive benchmark for tool-augmented llms (Li et al., 2023) | Code | Standardized formatting for API responses |
Signatures | Instance-wise prompting for few-shot transferability of large language models (Pan et al., 2024) | Code | Instance-specific signature generation |
Approach | Paper | Code | Comment |
---|---|---|---|
Validation | Prompt2model: Generating deployable models from natural language instructions (Pryzant et al., 2023) | - | Validation mechanisms for generated model specifications |
Validation | T-eval: Evaluating the tool utilization capability of large language models step by step (Chen et al., 2024) | Code | Step-by-step validation of tool utilization |
Agent correction | Learning to use tools via cooperative and interactive agents (Shi et al., 2024) | Code | Specialized agents review and correct each other's actions |
Agent correction | Self-correction of large language models via cognitive psychology (Sun et al., 2024) | - | Psychological principles for improved self-correction |
Feedback | Great principles for learning to use tools with llms (Guo et al., 2024) | - | Principles for effective feedback incorporation |
Feedback | WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021) | - | Human feedback integration for improved web interactions |
Feedback | WhiteboardAgent: Autonomous Multi-Step Visual Language Reasoning via Whiteboard Interaction (Wang et al., 2024) | Code | Visual reasoning through whiteboard interaction feedback |
Approach | Paper | Code | Comment |
---|---|---|---|
Example retrieval | ClusterLLM: Large Language Models as a Guide for Text Clustering (Chen et al., 2023) | Code | Clustered example retrieval for enhanced responses |
System mapping | A neural probabilistic model for entity disambiguation using multiple resources (Agarwal et al., 2019) | - | Multi-resource entity disambiguation for system mapping |
System mapping | LLM+P: Empowering Large Language Models with Optimal Planning Proficiency (Liu et al., 2023) | - | Planning-oriented mapping for systematic responses |
System mapping | Llm+p: Empowering large language models with planning capabilities in multi-scenario human-ai collaboration (Ma et al., 2024) | - | Enhanced collaborative mapping between human inputs and AI responses |
System mapping | InstructExcel: A Benchmark for Natural Language Instructions in Excel (Mao et al., 2023) | Code | Domain-specific mapping for spreadsheet operations |
System mapping | Instruction Following Evaluation by Predicting Human Feedback (Muennighoff et al., 2023) | Code | Human feedback-based mapping evaluation |
System mapping | Instance-wise prompting for few-shot transferability of large language models (Pan et al., 2024) | Code | Instance-specific mapping mechanisms |
System mapping | Vipergpt: Visual inference via python execution for reasoning (Suris et al., 2023) | Code | Python execution-based visual reasoning and mapping |
Approach | Paper | Code | Comment |
---|---|---|---|
Hierarchical structure and storage | Memorybank: Enhancing large language models with long-term memory (Zhong et al., 2024) | Code | Hierarchical storage with Ebbinghaus-inspired updating |
Task-related symbolic memory | Zero-shot task-oriented dialogue in the wild (Xie et al., 2023) | - | Specialized memory structures for dialogue-based tasks |
Three-layered memory architecture | Longllms: Enabling language models to process long contexts by leveraging memory mechanisms (Li et al., 2024) | Code | Working, episodic, and semantic memory layers |
Persistent memory stream | Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system (Liang et al., 2023) | - | Continuous memory stream for unlimited context |
Approach | Paper | Code | Comment |
---|---|---|---|
Self-controlled memory mechanism | Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system (Liang et al., 2023) | - | Memory management through control systems |
Memory control system | Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system (Liang et al., 2023) | - | Automated memory control for extended contexts |
Memory control system | Memgpt: Towards llms as operating systems (Chen et al., 2023) | Code | Operating system-inspired memory management |
Multi-agent experience storage | Lmrl: Learning multiagent reinforcement learning framework in a collaborative agent society (Lee et al., 2024) | Code | Collaborative storage of multi-agent experiences |
Approach | Paper | Code | Comment |
---|---|---|---|
Cross-conversation memory retrieval | Memorybank: Enhancing large language models with long-term memory (Zhong et al., 2024) | Code | Retrieval mechanisms spanning multiple conversations |
LSH-based indexing mechanism | Memgpt: Towards llms as operating systems (Chen et al., 2023) | Code | Locality-sensitive hashing for efficient indexing |
Similarity-based retrieval | Synapse: Trajectory-as-exemplar prompting with memory for computer control (Zheng et al., 2023) | Code | Vector similarity for contextual memory access |
Efficient memory access | Think-in-memory: Recalling and post-thinking enable llms with long-term memory (Liu et al., 2023) | Code | Optimized access patterns for memory retrieval |
Approach | Paper | Code | Comment |
---|---|---|---|
Thought-based memory storage | Think-in-memory: Recalling and post-thinking enable llms with long-term memory (Liu et al., 2023) | Code | Stores and recalls thoughts rather than raw conversations |
Trajectory-as-exemplar framework | Synapse: Trajectory-as-exemplar prompting with memory for computer control (Zheng et al., 2023) | Code | Complete trajectories as exemplars for planning |
State abstraction mechanism | Synapse: Trajectory-as-exemplar prompting with memory for computer control (Zheng et al., 2023) | Code | Compact state representations for efficient storage |
Knowledge triplet | Memgpt: Towards llms as operating systems (Chen et al., 2023) | Code | Subject-predicate-object triplets for structured knowledge |
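As a rough illustration of the memory ideas above, the sketch below stores distilled "thoughts" rather than raw conversation turns and retrieves the most similar ones by token overlap. A production system would use embeddings, LSH indexing, or hierarchical storage as in the tables above; all names and data here are illustrative.

```python
class ThoughtMemory:
    """Toy long-term memory: capacity-bounded list of thoughts with
    overlap-based retrieval (stand-in for embedding similarity)."""

    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.thoughts: list[str] = []

    def add(self, thought: str) -> None:
        self.thoughts.append(thought)
        if len(self.thoughts) > self.capacity:      # crude eviction policy
            self.thoughts.pop(0)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = set(query.lower().split())
        def score(t: str) -> float:
            w = set(t.lower().split())
            return len(q & w) / (len(q | w) or 1)   # Jaccard overlap
        return sorted(self.thoughts, key=score, reverse=True)[:k]

mem = ThoughtMemory()
mem.add("user prefers temperatures in celsius")
mem.add("user's home airport is SFO")
print(mem.retrieve("book a flight from my home airport"))
```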
Experimental results demonstrate clear performance differences between models trained specifically for function calling and general-purpose models adapted to the task.
Metric | Description | Example Works |
---|---|---|
Recall@K | Proportion of relevant tools ranked within top K positions | COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models (Qu et al., 2024) |
NDCG@K | Normalized Discounted Cumulative Gain at K | Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning (Cheng et al., 2023) |
COMP@K | Completeness-oriented retrieval evaluation at K | COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models (Qu et al., 2024) |
Metric | Description | Example Works |
---|---|---|
Pass Rate | Proportion of successfully completed instructions | Toolllm: Facilitating large language models to master 16000+ real-world apis (Qin et al., 2023) |
Win/Success Rate | Quality evaluation including information richness, factual accuracy | NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls (Basu et al., 2024) |
Metric | Description | Example Works |
---|---|---|
T-Eval | Comprehensive assessment of planning, reasoning, retrieval, understanding | T-eval: Evaluating the tool utilization capability of large language models step by step (Chen et al., 2024) |
Metric | Description | Example Works |
---|---|---|
BLEU | Bilingual Evaluation Understudy for translation quality | Bleu: a method for automatic evaluation of machine translation (Papineni et al., 2002) |
ROUGE-L | Longest Common Subsequence based metric for text summarization | Rouge: A package for automatic evaluation of summaries (Lin, 2004) |
Exact Match | Binary assessment of complete answer correctness | Bootstrapping a neural natural language interface for databases (Bogin et al., 2019) |
F1 score | Harmonic mean of precision and recall | Attention is all you need (Vaswani et al., 2017) |
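For reference, several of the metrics above reduce to a few lines of code. The sketch below gives plain implementations of Recall@K, Exact Match, and token-level F1; the normalization choices are simplified assumptions rather than any benchmark's official scoring script.

```python
def recall_at_k(ranked_tools: list[str], relevant: set[str], k: int) -> float:
    """Recall@K: fraction of relevant tools that appear in the top-K ranking."""
    return len(set(ranked_tools[:k]) & relevant) / len(relevant)

def exact_match(pred: str, gold: str) -> bool:
    """Exact Match: binary check after light normalization."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(recall_at_k(["get_weather", "search_web", "get_time"], {"get_weather", "get_time"}, 2))  # 0.5
print(exact_match("Paris", " paris "))                      # True
print(round(token_f1("sunny in Paris today", "Paris is sunny"), 2))
```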
Name | Paper | Code | Description |
---|---|---|---|
ToolLLM | Toolllm: Facilitating large language models to master 16000+ real-world apis (Qin et al., 2023) | Code | Comprehensive benchmark for API utility |
ToolAlpaca | Toolalpaca: Generalized tool learning for language models with 3000 simulated cases (Tang et al., 2023) | Code | Generalized tool learning with simulated cases |
Gorilla | Gorilla: Large language model connected with massive apis (Patil et al., 2023) | Code | Berkeley Function Calling Leaderboard |
Name | Paper | Code | Description |
---|---|---|---|
APIBench | Gorilla: Large language model connected with massive apis (Patil et al., 2023) | Code | Platform for standardized API evaluation |
API-Bank | Api-bank: A benchmark for tool-augmented llms (Li et al., 2023) | Code | Comprehensive API interaction testing |
Name | Paper | Code | Description |
---|---|---|---|
ShortcutsBench | Shortcutsbench: A large-scale real-world benchmark for api-based agents (Shen et al., 2024) | Code | Real APIs from Apple's operating systems |
BigCodeBench | You are not alone: Large language models effectively leverage duplications in code corpus (Zhou et al., 2023) | Code | Specialized benchmark for code-related function calls |
SEAL | Seal: A benchmark for software api learning with generative ai agents (Ji et al., 2023) | Code | Software API learning benchmark |
RadABench | Radial agent benchmark: evaluating task generalization capabilities of multi-platform ai agents (Yuan et al., 2024) | Code | Cross-platform agent evaluation framework |
NoisyToolBench | Learning to Ask: When LLMs Meet Unclear Instruction (Wang et al., 2024) | Code | Evaluates performance with unclear or noisy instructions |
Mobile-Bench | Benchmarking large language models on mobile applications (Cao et al., 2024) | Code | Specialized benchmark for mobile application interactions |
Name | Paper | Code | Description |
---|---|---|---|
IN3 | In3: Instruction-following language models for interactive tasks (Qi et al., 2023) | Code | Interactive task evaluation with instruction following |
NESTFUL | NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls (Basu et al., 2024) | - | Focuses on nested sequences of API calls |
UltraTool | Pluto: A recipe for building adaptable autonomous llm agents (Guan et al., 2024) | Code | Evaluates adaptable autonomous agent capabilities |
AppWorld | AppWorld: A Benchmark for Physical Mobile App Embodied Agent (Tian et al., 2023) | Code | Physical mobile app interaction benchmark |
TheAgentCompany | The agent company: A generative agent simulation of a software company (Yuan et al., 2024) | - | Simulated software company environment for evaluation |
AgentBoard | AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents (Liu et al., 2023) | Code | Multi-turn agent evaluation platform |
TravelPlanner | Travel planner: A benchmark for real-world planning with language agents (Wang et al., 2024) | Code | Travel planning task-specific benchmark |
ChinaTravel | Travel assistant: A benchmark for chinese llm agents in the tourism domain (Xia et al., 2024) | - | Chinese language travel planning benchmark |
Name | Paper | Code | Description |
---|---|---|---|
API-BLEND | API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs (Basu et al., 2024) | Code | Multi-domain API coverage with evaluation methods |
NESTOOLS | Nestools: Crafting efficient tools across diverse scenarios (Choi et al., 2024) | Code | Comprehensive evaluation across diverse scenarios |
MTU-Bench | MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models (Wang et al., 2024) | Code | Multi-granularity tool-use evaluation |
WTU-EVAL | Web tool use evaluation: Measuring large language models' capabilities on realistic web tasks (Mishra et al., 2023) | Code | Web-based tool usage evaluation framework |
Name | Organization | Release/Paper | Description |
---|---|---|---|
ChatGPT plugins | OpenAI | Introducing ChatGPT plugins | Ecosystem of third-party extensions for specific functionalities |
Claude's tool use API | Anthropic | Claude 3 Opus technical report | Native function calling capabilities in Claude AI models |
Cohere Command | Cohere | Introducing Cohere Command Light | API for function calling and structured JSON responses |
Qwen | Alibaba | Qwen Technical Report (Yang et al., 2023) | Multi-function Chinese language models with tool usage |
DeepSeek | DeepSeek | DeepSeek: Generalized Autoregressive Pretraining for Language and Vision (Dai et al., 2024) | Generalized foundation model with capabilities across tasks |
Name | Organization | Repository | Description |
---|---|---|---|
HuggingFace Transformer Agents | Hugging Face | Code | Framework for building agents with Hugging Face models |
Semantic Kernel | Microsoft | Code | SDK for building AI applications with native tool integration |
LangChain | LangChain | Code | Framework for building applications with LLMs and tools |
WebCPM | Tsinghua University | Code | Chinese web agent framework with browsing capabilities |
Name | Developer | Repository | Description |
---|---|---|---|
Auto-GPT | Significant Gravitas | Code | Self-prompting autonomous agent system |
BabyAGI | Yohei Nakajima | Code | Task-driven autonomous agent framework |
BMTools | OpenBMB | Code | Toolset for enhancing language models with functions |
RestGPT | Microsoft | Code | Model that can interact with RESTful APIs |
xLAM | Salesforce AI Research | Code | Family of large action models tuned for agentic function calling |
Octopus-v4 | Baichuan | Octopus technical report (Hao et al., 2023) | Multi-agent system for complex task completion |
Name | Developer | Repository | Description |
---|---|---|---|
GRANITE-20B | IBM Research | Code | Large language model optimized for coding and tool use |
Mistral 7B | Mistral AI | Code | Open-weight model with tool use capabilities |
NexusRaven V2-13B | Nexusflow | Code | Function calling and multi-modality specialized model |
Gorilla | UC Berkeley | Code | Model specialized in API usage and integration |
FireFunction V1 | Fireworks AI | Model | Purpose-built for function calling capabilities |
Nous Hermes 2 | Nous Research | Model | Instruction-tuned model with enhanced tool use |
Name | Organization | Link | Description |
---|---|---|---|
AgentInstruct | Microsoft | Paper (Zeng et al., 2023) | Instruction dataset for agent training and evaluation |
AgentOhana | Duke University | Paper (Yang et al., 2024) | High-quality dataset for training multi-task agents |
Lumos | Cornell University | Paper (Guo et al., 2023) | Multi-step reasoning dataset for tool-based tasks |
- Standards Challenge: Lack of universally accepted standard for assessing quality and performance
- Latency Problems: High latency and low throughput affecting user experience
- Security Vulnerabilities: Potential for "jailbreak function" attacks and other security concerns
- Technical Costs: Integration and maintenance costs for API modifications
- System Architecture Limitations: Constraints imposed by existing system architectures
- Standardization Needs: Requirement for standardized API modification processes
- Complex Processing: Multiple steps in feedback processing introducing errors
- Learning Assessment: Difficulty in quantifying effectiveness of human feedback
- Strategy Requirements: Need for advanced algorithms to interpret unstructured feedback
- Isolation Strategy: Challenges in appropriately isolating functions for business needs
- Regulatory Compliance: Meeting specific regulatory requirements across functions
- Post-processing Solutions: Implementing effective middleware for compliance and data transformation