Add pre-processing scripts as per the Datasheet's instructions

We should add Python scripts that pre-process the data according to the datasheet's instructions.
For example:

- split train/test according to the month of the year;
- binarize the sensitive attribute column (according to the datasheet and paper, this is done with <50 and >=50 years old);
- correctly load data from the parquet files (need to list `pyarrow` in the requirements.txt file);
- helper functions to label encode (or one-hot encode) categorical columns;
  - their current string encoding is incompatible with some well-known algorithms;
  - or use `dtype="category"` if using pandas, which solves the problem for some algorithms, including LightGBM.
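A rough sketch of what these helpers could look like with pandas. The column names (`month`, `customer_age`, `payment_type`), the `first_test_month=6` cut-off, and all function names are assumptions for illustration; the real values should be taken from the datasheet.

```python
import pandas as pd


def load_parquet(path):
    """Read one parquet file (requires `pyarrow` to be installed)."""
    return pd.read_parquet(path, engine="pyarrow")


def split_by_month(df, month_col="month", first_test_month=6):
    """Rows with month < first_test_month go to train, the rest to test.

    Both the column name and the cut-off are assumed defaults.
    """
    is_test = df[month_col] >= first_test_month
    return df[~is_test].copy(), df[is_test].copy()


def binarize_age(df, age_col="customer_age", threshold=50):
    """Return a 0/1 Series: 0 for age < threshold, 1 for age >= threshold."""
    return (df[age_col] >= threshold).astype(int)


def as_category(df, columns):
    """Cast the given string columns to the pandas `category` dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


def label_encode(df, columns):
    """Replace each given column with integer codes (categories sorted)."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category").cat.codes
    return out


def one_hot_encode(df, columns):
    """Expand each given column into one indicator column per category."""
    return pd.get_dummies(df, columns=columns)
```

Usage would be along the lines of `train, test = split_by_month(load_parquet(path))`, followed by `as_category(train, cat_cols)` for LightGBM or `one_hot_encode(train, cat_cols)` for algorithms that need purely numeric input.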

Ideally, this would all be implemented in a `baf_helper` Python package, which could also handle downloading the data.

NOTE:
- we can use [this package](https://github.com/wkentaro/gdown) to download files from Google Drive;
- we should also shorten the README and get straight to the point on how to obtain usable data for training ML models.
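A minimal sketch of the download step with `gdown`. The function names are hypothetical, and the Google Drive file ID must be supplied by the caller since it is not given here.

```python
from pathlib import Path


def drive_url(file_id):
    """Build the direct-download URL form that gdown accepts."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_from_gdrive(file_id, output):
    """Download one Google Drive file to `output` and return the path."""
    import gdown  # third-party: pip install gdown

    # Make sure the target directory exists before downloading.
    Path(output).parent.mkdir(parents=True, exist_ok=True)
    return gdown.download(drive_url(file_id), output, quiet=False)
```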

Add pre-processing scripts as per the Datasheet's instructions #2
