Skip to content
This repository was archived by the owner on Apr 30, 2026. It is now read-only.

Filter out blank content when saving mixed dataset#652

Merged
mergify[bot] merged 2 commits into
instructlab:mainfrom
derekhiggins:filter-None
Jun 10, 2025
Merged

Filter out blank content when saving mixed dataset#652
mergify[bot] merged 2 commits into
instructlab:mainfrom
derekhiggins:filter-None

Conversation

@derekhiggins

Copy link
Copy Markdown
Contributor

Ensure we don't save out a mised dataset rows where content=None.

@mergify mergify Bot added testing Relates to testing ci-failure labels Jun 4, 2025
@derekhiggins

Copy link
Copy Markdown
Contributor Author

Do we also need to filter out the rows where content="" ? If so I'll update the PR

@aakankshaduggal

Copy link
Copy Markdown
Contributor

@derekhiggins thanks for taking this up. I would recommend dropping the empty content even before converting the dataset to messages format here.

@derekhiggins

Copy link
Copy Markdown
Contributor Author

@derekhiggins thanks for taking this up. I would recommend dropping the empty content even before converting the dataset to messages format here.

@aakankshaduggal I've yet to find out exactly where the null values have been introduced so I didn't want to filter them out too early, in case they are introduced after the filter.

I'm currently trying to find out where they got introduced so we can filter them in the correct place.
In the meantime I thought filtering them out where we know they definitely exist would unblock things.

if you're sure gen_train_data is the correct place then I can move the filter but looking at the example from the person reporting the bug I don't see any null values in their train....jsonl file

@aakankshaduggal

Copy link
Copy Markdown
Contributor

@derekhiggins - I am pretty positive it is not after the conversion, but let's keep both null checks to be sure.

Ensure we don't save out a mixed or train datasets
with rows where content=None.

Signed-off-by: Derek Higgins <derekh@redhat.com>
@derekhiggins

Copy link
Copy Markdown
Contributor Author

@aakankshaduggal I've found the issue, null content is being introduced by __create_auxiliary_ds here
https://github.com/instructlab/sdg/blob/55e224207/src/instructlab/sdg/datamixing.py#L471

The problem is that the record being used has 'document': None, this gets renamed to "response" and then used as the assistant content, so we end up with{'role': 'assistant', 'content': None}

@bbrowning

Copy link
Copy Markdown
Contributor

@derekhiggins Do we know why the document value is None here?

We can't use them as they cause errors in the datamixing code.

Signed-off-by: Derek Higgins <derekh@redhat.com>
@derekhiggins

Copy link
Copy Markdown
Contributor Author

@derekhiggins Do we know why the document value is None here?

It looks like the pipeline that generated them was generating various summaries from a llm, one of the summaries was null and then later renamed to "document"

I've updated the PR to filter out instances where document=null , I think this alone would be enough to deal the the error in question.

But for now I've left the other two filters in place so we can discuss to keep them or not

Comment thread tests/unit/test_datamixing.py

@aakankshaduggal aakankshaduggal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @derekhiggins! LGTM!

@mergify mergify Bot added the one-approval label Jun 10, 2025
@aakankshaduggal aakankshaduggal requested a review from a team June 10, 2025 15:05
@mergify mergify Bot merged commit e3bf8b3 into instructlab:main Jun 10, 2025
27 checks passed
@mergify mergify Bot removed the one-approval label Jun 10, 2025
@derekhiggins

Copy link
Copy Markdown
Contributor Author

@Mergifyio backport release-v0.8

@mergify

mergify Bot commented Jun 11, 2025

Copy link
Copy Markdown
Contributor

backport release-v0.8

✅ Backports have been created

Details

mergify Bot added a commit that referenced this pull request Jun 11, 2025
Filter out blank content when saving mixed dataset (backport #652)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants