Filter out blank content when saving mixed dataset#652
Conversation
|
Do we also need to filter out the rows where content="" ? If so I'll update the PR |
|
@derekhiggins thanks for taking this up. I would recommend dropping the empty content even before converting the dataset to messages format here. |
@aakankshaduggal I've yet to find out exactly where the null values have been introduced so I didn't want to filter them out too early, in case they are introduced after the filter. I'm currently trying to find out where they got introduced so we can filter them in the correct place. if you're sure gen_train_data is the correct place then I can move the filter but looking at the example from the person reporting the bug I don't see any null values in their train....jsonl file |
|
@derekhiggins - I am pretty positive it is not after the conversion, but let's keep both null checks to be sure. |
Ensure we don't save out a mixed or train datasets with rows where content=None. Signed-off-by: Derek Higgins <derekh@redhat.com>
|
@aakankshaduggal I've found the issue, null content is being introduced by __create_auxiliary_ds here The problem is that the record being used has 'document': None, this gets renamed to "response" and then used as the assistant content, so we end up with |
|
@derekhiggins Do we know why the document value is |
We can't use them as they cause errors in the datamixing code. Signed-off-by: Derek Higgins <derekh@redhat.com>
It looks like the pipeline that generated them was generating various summaries from a llm, one of the summaries was null and then later renamed to "document" I've updated the PR to filter out instances where document=null , I think this alone would be enough to deal the the error in question. But for now I've left the other two filters in place so we can discuss to keep them or not |
aakankshaduggal
left a comment
There was a problem hiding this comment.
Thanks @derekhiggins! LGTM!
|
@Mergifyio backport release-v0.8 |
✅ Backports have been createdDetails
|
Filter out blank content when saving mixed dataset (backport #652)
Ensure we don't save out a mised dataset rows where content=None.