We should add Python scripts that pre-process the data according to the datasheet's instructions.
For example:
- split train/test according to the month of the year;
- binarize the sensitive attribute column (according to the datasheet and paper, this is done with <50 and >=50 years old);
- correctly load data from the parquet files (need to list `pyarrow` in the requirements.txt file);
- helper functions to label encode (or one-hot encode) categorical columns:
  - their current string encoding is incompatible with some well-known algorithms;
  - or use `dtype="category"` if using pandas, which solves the problem for some algorithms, including LightGBM.
Ideally this would all be implemented in a `baf_helper` Python package, which could also handle downloading the data.
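A rough sketch of what such helpers could look like, to make the proposal concrete. The function names are hypothetical, and the column names (`month`, `customer_age`), the output column `age_group`, and the default cutoff month are assumptions that should be checked against the datasheet before use:

```python
import pandas as pd


def load_parquet(path):
    """Load one dataset variant from a parquet file (requires pyarrow)."""
    return pd.read_parquet(path, engine="pyarrow")


def binarize_age(df, age_col="customer_age", threshold=50, out_col="age_group"):
    """Add a 0/1 column: 0 for age < threshold, 1 for age >= threshold.

    The column names are assumptions; the <50 / >=50 rule is from the datasheet.
    """
    out = df.copy()
    out[out_col] = (out[age_col] >= threshold).astype(int)
    return out


def to_categorical(df, columns):
    """Cast the given string columns to the pandas 'category' dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


def one_hot(df, columns):
    """One-hot encode the given columns for algorithms that need numeric input."""
    return pd.get_dummies(df, columns=list(columns))


def split_by_month(df, month_col="month", first_test_month=6):
    """Temporal split: months before `first_test_month` go to train, the rest to test.

    The default cutoff is an assumption; take the real value from the datasheet.
    """
    train = df[df[month_col] < first_test_month].copy()
    test = df[df[month_col] >= first_test_month].copy()
    return train, test
```

Casting to `category` before splitting keeps the category sets identical in train and test, which avoids mismatches when a value only appears in one of the two splits.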
NOTE:
- we can use this package to download files from GDrive
- we should also shorten the README and get straight to the point on how to obtain usable data for training ML models.