-
Notifications
You must be signed in to change notification settings - Fork 3k
Description
Hi, I have a question about SFT (Supervised Fine-tuning) for an agent model.
Suppose I want to fine-tune an agent model that may receive two different tools: tool1 and tool2. These tools have different parameters and types in their schema definitions.
When I try to merge datasets containing different tool definitions, I get the following error:
TypeError: Couldn't cast array of type
struct<refundFee: struct<description: string, type: string>, ... , servicerId: struct<description: string, type: string>>
to
{
'refundFee': {'description': Value(dtype='string'), 'type': Value(dtype='string')},
...
'templateId': {'description': Value(dtype='string'), 'type': Value(dtype='string')}
}
From my understanding, the merge fails because the tools column's nested structure is different across datasets — e.g., one struct contains an extra field servicerId while the other does not. This causes HuggingFace Datasets (and its underlying Apache Arrow schema) to reject the merge.
My question is: why is it designed this way?
Is this strict schema matching a hard requirement of the library?
Is there a recommended way to merge datasets with different tool schemas (different parameters and types)?
For an agent model supporting multiple tools, what's the best practice for preparing/merging training data without losing flexibility?
Any guidance or design rationale would be greatly appreciated. Thanks!