
Add pre-processing scripts as per the Datasheet's instructions #2

@AndreFCruz

We should add Python scripts that pre-process the data according to the datasheet's instructions.
For example:

  • split the data into train and test sets according to the month;
  • binarize the sensitive attribute column (per the datasheet and paper, the two groups are <50 and >=50 years old);
  • load the data correctly from the parquet files (this requires listing pyarrow in the requirements.txt file);
  • add helper functions to label-encode (or one-hot encode) categorical columns;
    • their current string encoding is incompatible with some well-known algorithms;
    • alternatively, use dtype="category" in pandas, which solves the problem for some algorithms, including LightGBM.

Ideally, this would all be implemented in a baf_helper Python package, which could also handle downloading the data.
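
A minimal sketch of what such helpers could look like, to make the list above concrete. The function names, the column names ("month", "customer_age"), the output column "age_group", and the choice of which month starts the test set are all assumptions for illustration; the actual values should be taken from the datasheet.

```python
import pandas as pd

# Assumed column names (hypothetical); check them against the datasheet.
MONTH_COL = "month"
AGE_COL = "customer_age"


def load_parquet(path):
    """Read one parquet file into a DataFrame (requires pyarrow)."""
    return pd.read_parquet(path, engine="pyarrow")


def binarize_age(df, threshold=50, age_col=AGE_COL, out_col="age_group"):
    """Add a 0/1 column: 0 where age < threshold, 1 where age >= threshold."""
    out = df.copy()
    out[out_col] = (out[age_col] >= threshold).astype(int)
    return out


def to_category(df, columns):
    """Cast the given string columns to the pandas 'category' dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


def label_encode(df, columns):
    """Replace each given column with integer category codes."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category").cat.codes
    return out


def one_hot_encode(df, columns):
    """One-hot encode the given columns with pandas.get_dummies."""
    return pd.get_dummies(df, columns=columns)


def split_by_month(df, first_test_month, month_col=MONTH_COL):
    """Rows with month < first_test_month form train; the rest form test."""
    is_test = df[month_col] >= first_test_month
    return df[~is_test].copy(), df[is_test].copy()
```

Encoding the categorical columns before calling split_by_month keeps the categories (or integer codes) identical in the train and test sets, which would not be guaranteed if each split were encoded separately.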

NOTE:

  • we can use this package to download files from GDrive
  • we should also shorten the README and make it more direct about how to get usable data for training ML models.
