[BUG: Unable to fine-tune for a multi-modal inputs

### Python Version

```shell
3.10
```

### Pip Freeze

```shell
.
```

### Reproduction Steps

I have tried it using the api platform to upload a dataset with multimodal  inputs (contents contains a list of dicts)
[{"content": [{"image_url": "https://image_path.png", "type": "image_url"}, {"text": "Describe the image"}], "role": "user"}, {"content": [{"text": "description of the image"}], "role": "assistant"}]}

Whenever I try to upload such a dataset to the api platform, Im getting validation error. It is expecting str for content.
So How can I finetune for multimodal tasks ?

### Expected Behavior

.

### Additional Context

_No response_

### Suggested Solutions

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG: Unable to fine-tune for a multi-modal inputs #113

Python Version

Pip Freeze

Reproduction Steps

Expected Behavior

Additional Context

Suggested Solutions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG: Unable to fine-tune for a multi-modal inputs #113

Description

Python Version

Pip Freeze

Reproduction Steps

Expected Behavior

Additional Context

Suggested Solutions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions