Currently, expectedCall appears to be optimized for validating a single tool call. In more complex scenarios, such as agents that trigger multiple parallel tool calls or tasks that can be solved with different tools in any order, a strict one-to-one or ordered comparison causes evaluation failures even when the model's logic is correct.
Example: for the prompt "Check the weather in London and Paris", the model might call get_weather(city='London') and then get_weather(city='Paris'), or the reverse. Both orderings should be considered a "Pass."