Fix data schema in example evaluation script#21

Merged
antoine-tran merged 17 commits into main from tuan/fix_17
Jan 16, 2025

Conversation


@antoine-tran antoine-tran commented Jan 15, 2025

Why?

In the example evaluation script ("examples/evaluation/prepare_evaluation_data.py"), the processed datasets are hardcoded to a column schema with the new names "prompt" and "answer". This was done to simplify the subsequent data processing steps in LCM evaluation (sentence splitting, SONAR embedding), but it is inconsistent for LLM evaluation, which needs little data processing and can work directly with the original dataset.

This PR makes the following changes to make the evaluation script more flexible:

  • In Step 1 (preparing the JSONL dataset split), if the user specifies the "prompt" parameters (prompt_prefix, prompt_suffix), we rename the columns to "prompt" and "answer".
  • If the user does not specify these parameters, the original column names are kept.
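The conditional renaming described above can be sketched as follows. This is a minimal illustration, not the actual code from prepare_evaluation_data.py; the function name, the source/target column arguments, and the row-dict representation are all assumptions for the example.

```python
def maybe_rename_columns(rows, source_col, target_col,
                         prompt_prefix=None, prompt_suffix=None):
    """Rename columns to "prompt"/"answer" only when the user supplies
    prompt parameters; otherwise keep the original schema.

    Hypothetical sketch of the behavior described in this PR; the real
    script's signature and column handling may differ.
    """
    if prompt_prefix is None and prompt_suffix is None:
        # No prompt parameters: keep the original column names,
        # which is what LLM evaluation works with directly.
        return rows

    prefix = prompt_prefix or ""
    suffix = prompt_suffix or ""
    # Prompt parameters given: build the "prompt"/"answer" schema
    # expected by the downstream LCM evaluation steps.
    return [
        {
            "prompt": f"{prefix}{row[source_col]}{suffix}",
            "answer": row[target_col],
        }
        for row in rows
    ]
```

With prompt parameters the schema is rewritten; without them the rows pass through untouched, so both evaluation paths see the columns they expect.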

NOTE: There is a Python 3.12 compatibility issue related to stopes (facebookresearch/stopes#71), which also causes the current CI to fail. This PR was tested and passes on Python 3.11.

@antoine-tran (Contributor, Author)

Merged despite the CI failures in order to fix the issue.

@antoine-tran antoine-tran merged commit d640223 into main Jan 16, 2025
11 of 13 checks passed
LUIGIVAMPER pushed a commit to XiangningLin/large_concept_model that referenced this pull request on Nov 25, 2025:
Fix data schema in example evaluation script
