Skip to content

Refactor dataset loaders into dedicated module#357

Closed
neuralsorcerer wants to merge 2 commits intofacebookresearch:mainfrom
neuralsorcerer:refct
Closed

Refactor dataset loaders into dedicated module#357
neuralsorcerer wants to merge 2 commits intofacebookresearch:mainfrom
neuralsorcerer:refct

Conversation

@neuralsorcerer
Copy link
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings March 5, 2026 00:42
@meta-cla meta-cla bot added the cla signed label Mar 5, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the simulated dataset loader functions into a dedicated balance.datasets.loading_data module while preserving the existing balance.datasets public API via re-exports.

Changes:

  • Added balance/datasets/loading_data.py containing load_sim_data, load_cbps_data, and load_data.
  • Updated balance/datasets/__init__.py to re-export the loader functions from the new module.
  • Added a changelog entry documenting the refactor.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
balance/datasets/loading_data.py New dedicated module housing dataset loader implementations and their docstrings.
balance/datasets/init.py Re-exports loaders to preserve balance.datasets.* public API.
CHANGELOG.md Documents the refactor under “Code Quality & Refactoring”.
Comments suppressed due to low confidence (4)

balance/datasets/loading_data.py:33

  • The docstring says only sample_df has a happiness column, but the implementation also sets target_df["happiness"]. Please update the description to match the actual returned schema (or adjust the code if the intent is for only one dataframe to include the outcome).
    Version 01 returns two dataframes containing the columns gender ("Male", "Female" and nan),
    age_group ("18-24", "25-34", "35-44", "45+"), income (some numbers from a normal distribution), and id.
    The sample_df also has a column called happiness with a value from 0 to 100 that depends on the covariates.

balance/datasets/loading_data.py:48

  • The Returns: section documents Tuple[pd.DataFrame, pd.DataFrame], but the function can return (None, None) for unsupported versions (and the type annotation already reflects that). Please update the docstring return type/description to include the None case (or raise an explicit error instead of returning None).
    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: Two DataFrames containing simulated data for the target and sample of interest.
    """

balance/datasets/loading_data.py:153

  • The load_data docstring says it returns two DataFrames, but the function returns (None, None) when source is unrecognized. Please align the documented return type/behavior with the implementation (e.g., document the None case or raise a ValueError for invalid source).
    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: The first dataframe contains simulated data of the "target" and the second dataframe contains simulated data of the "sample".
    """

balance/datasets/loading_data.py:123

  • The load_cbps_data docstring says "You can view the structure ... by looking at the example below", but no example is included. Please either add the example (the previous version had one) or remove this line to avoid misleading documentation.
    """Load simulated data for CBPS comparison with R.

    The code in balance that implements CBPS attempts to mimic the code from the R package CBPS (https://cran.r-project.org/web/packages/CBPS/).

    In the help page of the CBPS function in R (i.e.: `?CBPS::CBPS`) there is a simulated dataset that is used to showcase the CBPS function.
    The output of that simulated dataset is saved in balance in order to allow for comparison of `balance` (Python) with `CBPS` (R).

    You can view the structure of the simulated dataset by looking at the example below.

@neuralsorcerer neuralsorcerer added this to the balance 0.17.0 milestone Mar 5, 2026
@neuralsorcerer neuralsorcerer requested a review from talgalili March 5, 2026 00:54
Copy link
Contributor

@talgalili talgalili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@meta-codesync
Copy link

meta-codesync bot commented Mar 5, 2026

@talgalili has imported this pull request. If you are a Meta employee, you can view this in D95347392.

@meta-codesync
Copy link

meta-codesync bot commented Mar 5, 2026

@talgalili merged this pull request in c7b2e9f.

@neuralsorcerer neuralsorcerer deleted the refct branch March 5, 2026 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants