
Why does dataset merge fail when tools have different parameters? #7869

@hitszxs

Description

Hi, I have a question about SFT (Supervised Fine-tuning) for an agent model.

Suppose I want to fine-tune an agent model that may receive two different tools: tool1 and tool2. These tools have different parameters and types in their schema definitions.
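
For concreteness, here is a rough sketch of what such a pair of tools might look like. The names and fields below are hypothetical, loosely modeled on the error message further down:

```python
# Hypothetical tool definitions (illustrative only).
# Note that the two `parameters` structs have different fields.
tool1 = {
    "name": "refund_order",
    "parameters": {
        "refundFee": {"description": "Amount to refund", "type": "string"},
        "templateId": {"description": "Refund template id", "type": "string"},
    },
}

tool2 = {
    "name": "transfer_to_servicer",
    "parameters": {
        "refundFee": {"description": "Amount to refund", "type": "string"},
        "servicerId": {"description": "Target servicer id", "type": "string"},
    },
}
```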

When I try to merge datasets containing different tool definitions, I get the following error:

TypeError: Couldn't cast array of type
struct<refundFee: struct<description: string, type: string>, ... , servicerId: struct<description: string, type: string>>
to
{
'refundFee': {'description': Value(dtype='string'), 'type': Value(dtype='string')},
...
'templateId': {'description': Value(dtype='string'), 'type': Value(dtype='string')}
}
From my understanding, the merge fails because the nested structure of the tools column differs across datasets: for example, one struct contains a field servicerId that the other lacks. This causes HuggingFace Datasets (and the Apache Arrow schema underneath it) to reject the merge.
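
As a minimal sketch of how this mismatch can arise (assuming the `datasets` library; file names and field values are made up for illustration), loading two JSON files whose nested tools structs differ typically fails with an error like the one above, since the schema is inferred from the first file and later batches are cast to it. The exact exception text can vary by `datasets` version:

```python
import json

from datasets import load_dataset

# Two samples whose nested `tools` structs have different fields.
sample1 = {"tools": {"refundFee": {"description": "fee", "type": "string"},
                     "templateId": {"description": "template", "type": "string"}}}
sample2 = {"tools": {"refundFee": {"description": "fee", "type": "string"},
                     "servicerId": {"description": "servicer", "type": "string"}}}

with open("tool1_data.json", "w") as f:
    json.dump([sample1], f)
with open("tool2_data.json", "w") as f:
    json.dump([sample2], f)

# The Arrow schema is inferred from the first file; casting the second file's
# `tools` struct (servicerId instead of templateId) to that schema fails with
# a "Couldn't cast array of type struct<...>" error.
ds = load_dataset("json", data_files=["tool1_data.json", "tool2_data.json"])
```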

My question is: why is it designed this way?

1. Is this strict schema matching a hard requirement of the library?
2. Is there a recommended way to merge datasets with different tool schemas (different parameters and types)? (One idea I had is sketched below.)
3. For an agent model supporting multiple tools, what's the best practice for preparing/merging training data without losing flexibility?
Any guidance or design rationale would be greatly appreciated. Thanks!
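
For question 2, one idea I had (I'm not sure it is the intended approach) is to serialize the tools column to a JSON string before merging, so every dataset shares the same flat schema, and decode it back during preprocessing. Roughly:

```python
import json

from datasets import Dataset, concatenate_datasets

# Store each dataset's tool schema as a JSON-encoded string instead of a nested dict.
ds1 = Dataset.from_list([{"tools": json.dumps(
    {"refundFee": {"description": "fee", "type": "string"},
     "templateId": {"description": "template", "type": "string"}})}])
ds2 = Dataset.from_list([{"tools": json.dumps(
    {"refundFee": {"description": "fee", "type": "string"},
     "servicerId": {"description": "servicer", "type": "string"}})}])

# Both `tools` columns are plain strings, so Arrow sees identical schemas
# and the merge succeeds.
merged = concatenate_datasets([ds1, ds2])

# Decode the schema back when formatting a training sample.
tools_of_first_row = json.loads(merged[0]["tools"])
```

Is this the recommended pattern, or does it lose something the library needs?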
