Skip to content
Closed
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,10 @@
appear in the final output CSV. Keep columns that are not id, weight,
covariate, or outcome columns will be placed into ``ignore_columns`` during
processing but are still retained and available in the output.
- **Clarified `_prepare_input_model_matrix` argument docs**
- Updated docstrings in `balance.utils.model_matrix` with
explicit descriptions for `sample`, `target`, `variables`, and `add_na`
behavior when preparing model-matrix inputs.

## Bug Fixes

Expand Down
32 changes: 26 additions & 6 deletions balance/utils/model_matrix.py
Original file line number Diff line number Diff line change
Expand Up @@ -285,14 +285,34 @@ def _prepare_input_model_matrix(
- Add na indicator if required.

Args:
sample (pd.DataFrame | Any): This can either be a DataFrame or a Sample object. TODO: add text.
target (pd.DataFrame | Any | None, optional): This can either be a DataFrame or a Sample object.. Defaults to None.
variables (List[str] | None, optional): Defaults to None. TODO: add text.
add_na (bool, optional): Defaults to True. TODO: add text.
fix_columns_names (bool, optional): Defaults to True. If to fix the column names of the DataFrame by changing special characters to '_'.
sample (pd.DataFrame | Any): Input sample data as either a
``pandas.DataFrame`` or a ``Sample`` object from
``balance.sample_class`` (recognized via ``_isinstance_sample``).
target (pd.DataFrame | Any | None, optional): Optional target data as
either a ``DataFrame`` or a ``Sample`` object. If provided, the
model-matrix inputs are prepared from a sample/target union of
variables and rows. Defaults to None.
variables (List[str] | None, optional): Variables to use from both
inputs. If provided, `choose_variables` validates that each
requested variable exists in both sample and target (when target is
supplied), otherwise it raises ``ValueError``. For ``Sample``
inputs, this validation/inference is based on covariate names
(``sample.covars().names()``), not all raw ``._df`` columns. If
None, variables are inferred by `choose_variables`.
add_na (bool, optional): If True, add NA indicator columns before
model-matrix creation. If False, drop rows containing missing
values; this can raise ``ValueError`` if dropping rows empties the
sample or target. Defaults to True.
fix_columns_names (bool, optional): Whether to sanitize column names by
replacing non-word characters with ``_`` and making duplicate names
unique. Defaults to True.

Raises:
Exception: "Variable names cannot contain characters '[' or ']'"
ValueError: If requested ``variables`` are not present in the
provided input frame(s) (and in both sample and target when target
is supplied), if variables contain ``[`` or ``]``, or if
``add_na=False`` drops all rows from sample/target, or if
sample has zero rows.

Returns:
Dict[str, Any]: returns a dictionary containing two keys: 'all_data' and 'sample_n'.
Expand Down
Loading