Releases · lshpaner/eda_toolkit

20 Dec 20:03

lshpaner

0.0.22

4009c27

EDA Toolkit 0.0.22 Latest


Renamed conditional_histograms to grouped_distributions
Updated SciPy version requirement in README


20 Dec 18:49

lshpaner

0.0.21

29ddced

EDA Toolkit 0.0.21

Fixed dependency resolution for Python 3.11+ by updating SciPy version constraints.
Improved compatibility with Google Colab and Python 3.12 environments.
Updated package acknowledgements and metadata in __init__.py.
No API changes.


20 Dec 16:40

lshpaner

0.0.20

ad7a926

EDA Toolkit 0.0.20

What's Changed

flexible summarize_combinations sans excel by @lshpaner in #104
Groupby imputer by @Oscar-Gil-Data in #109
Refactor kde_distributions and extract density overlay logic into plot utilities by @lshpaner in #110
Introduce Enhanced Q-Q Plot Support for Distribution GOF Diagnostics by @lshpaner in #111
Add ECDF plot type to data_doctor; add SciPy to requirements by @Oscar-Gil-Data in #112
Add conditional histograms and unify figure-saving utilities by @lshpaner in #114
DataFrame memory management utility, del_inactive_dataframes by @Oscar-Gil-Data in #115

Full Changelog: 0.0.19...0.0.20

Contributors

Oscar-Gil-Data and lshpaner


20 Oct 22:29

lshpaner

0.0.19

66ceb5f

EDA Toolkit 0.0.19

Full Changelog: 0.0.18...0.0.19


31 Jul 17:24

lshpaner

0.0.18

eae01b0

EDA Toolkit 0.0.18

What's Changed

Refactor: Add Typing, Welch's t-test Option, and Improve Coverage to 80% by @lshpaner in #100
Enhance Table 1 formatting and pretty-printing by @lshpaner in #101

Full Changelog: 0.0.17...0.0.18

Contributors

lshpaner


13 Jul 21:01

lshpaner

0.0.17

bd0cc47

EDA Toolkit 0.0.17

Changed all instances of the word/term grid to subplots where applicable.
Introduced a new function, outcome_crosstab_plot, which generates crosstab-based stacked bar plots for visualizing the relationship between a binary outcome variable and multiple categorical features. The function supports:
- Normalized or raw counts
- Custom label fonts and layout control
- Flexible color customization (default, list, or column-specific dict)
- Optional value count or percentage annotations in the legend
- PNG/SVG export support
This adds a compact, interpretable way to explore variable–outcome relationships during EDA.


27 Mar 07:45

lshpaner

0.0.16

9087a4b

EDA Toolkit 0.0.16

What's Changed

Enhanced Plotting (+) Text Wrap Functionality by @lshpaner in #82
added sklearn import by @lshpaner in #83
Changed dataframe_columns name to dataframe_profiler by @Oscar-Gil-Data in #84
Improve plot_3d_pdp Function: Fix UnboundLocalError & Update Docstring by @lshpaner in #85
- Moved plot_2d_pdp and plot_3d_pdp to https://github.com/lshpaner/model_metrics
Unit Tests LS by @lshpaner in #86
Unittests ls by @lshpaner in #87
Refactor, Feature Expansion & Test Separation by @lshpaner in #88
Add include_types filter to generate_table1() by @lshpaner in #89
Improve generate_table1 Function for Enhanced Usability and Output Control by @lshpaner in #90
Improve formatting of integer columns in Table 1 summary by @lshpaner in #91
Fix Pretty Print Regression in generate_table1 by @lshpaner in #92
Improve generate_table1 Output Handling and Markdown Logic by @lshpaner in #93
Add .drop() method to TableWrapper for preserving pretty print by @lshpaner in #95

Full Changelog: 0.0.15...0.0.16

Contributors

Oscar-Gil-Data and lshpaner


28 Dec 20:52

lshpaner

0.0.15

31cf220

EDA Toolkit 0.0.15

Scatter Plot Function Updates

Avoid In-Place Modification of `exclude_combinations`

This addresses an issue where the scatter_fit_plot function modifies the exclude_combinations parameter in-place, causing errors when reused in subsequent calls.

Changes Made

Create a local copy of exclude_combinations for normalization instead of modifying the input directly:
```
exclude_combinations_normalized = {tuple(sorted(pair)) for pair in exclude_combinations}
```
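The effect of this change can be illustrated with a small sketch. The helper name `normalize_pairs` and the sample pairs are hypothetical and not taken from the library; only the set comprehension mirrors the line above.

```python
def normalize_pairs(exclude_combinations):
    """Return order-insensitive pairs without touching the caller's list."""
    # Hypothetical helper: build a new set instead of rewriting the input in place.
    return {tuple(sorted(pair)) for pair in exclude_combinations}


user_pairs = [("weight", "age"), ("height", "age")]
normalized = normalize_pairs(user_pairs)

# The caller's list is unchanged, so it can be reused in later calls as-is.
print(user_pairs)
print(("age", "weight") in normalized)
```

Because the normalized pairs live in a new set, passing the same list to several calls should behave the same way each time.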

Improve Progress Tracking and Resolve Last Plot Saving Issue

Separate Progress Bars for Grid Plot:
- Added a tqdm progress bar to track the rendering of subplots in the grid.
- Introduced a second tqdm progress bar to handle the saving step of the entire grid plot.
Fix for Last Plot Saving with save_plots="all":
- Ensured individual plots and the grid plot are saved independently without overlap or interference.
- Addressed an issue where the last individual plot was incorrectly saved or overwritten.
Accurate Updates and Feedback:
- Progress bars now provide clear updates for rendering and saving stages, avoiding any hanging or delays.

Updated `tqdm` saving logic in `scatter_fit_plot`

Refactored tqdm progress bar in scatter_fit_plot to track the overall plot-saving process, covering both individual and grid plots.
Updated tqdm progress bar description in scatter_fit_plot to use universal phrasing: "Saving scatter plot(s)."
Ensured consistency for singular or multiple plot-saving scenarios in progress tracking.


27 Dec 19:46

lshpaner

0.0.14

d330794

EDA Toolkit 0.0.14

Ensure Crosstabs Dictionary is Populated with `return_dict=True`

This resolves the issue where the stacked_crosstab_plot function fails to populate and return the crosstabs dictionary (crosstabs_dict) when return_dict=True and output="plots_only". The fix ensures that crosstabs are always generated when return_dict=True, regardless of the output parameter.

Always Generate Crosstabs with return_dict=True:
Added logic to ensure crosstabs are created and populated in crosstabs_dict whenever return_dict=True, even if the output parameter is set to "plots_only".
Separation of Crosstabs Display from Generation:
- The generation of crosstabs is now independent of the output parameter.
- Crosstabs display (print) occurs only when output includes "both" or "crosstabs_only".
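The separation described above can be sketched in plain Python. This is not the library's implementation: `crosstab_report` and `crosstab_counts` are hypothetical names, and a `Counter` over (feature, outcome) pairs stands in for a real crosstab. Only the `output` and `return_dict` values come from the notes.

```python
from collections import Counter


def crosstab_counts(rows, feature, outcome):
    """Count (feature value, outcome value) pairs; a stand-in for a crosstab."""
    return dict(Counter((row[feature], row[outcome]) for row in rows))


def crosstab_report(rows, features, outcome, output="both", return_dict=False):
    """Generate tables when they are displayed OR returned (hypothetical helper)."""
    show_tables = output in ("both", "crosstabs_only")
    crosstabs_dict = {}
    for feature in features:
        if not (show_tables or return_dict):
            continue  # nothing needs the table, so skip the work
        table = crosstab_counts(rows, feature, outcome)
        if show_tables:
            print(feature, table)  # display depends only on `output`
        if return_dict:
            crosstabs_dict[feature] = table  # collection depends only on `return_dict`
    return crosstabs_dict if return_dict else None


rows = [
    {"sex": "F", "outcome": 1},
    {"sex": "F", "outcome": 1},
    {"sex": "M", "outcome": 0},
]
result = crosstab_report(rows, ["sex"], "outcome", output="plots_only", return_dict=True)
```

With `output="plots_only"` nothing is printed, yet `result` still holds one table per feature, which is the behavior the fix describes.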

Enhancements and Fixes for `scatter_fit_plot` Function

This addresses critical issues and introduces key enhancements for the scatter_fit_plot function. These changes aim to improve usability, flexibility, and robustness of the function.

Enhancements and Fixes

1. Added `exclude_combinations` Parameter

Feature: Users can now exclude specific variable pairs from being plotted by providing a list of tuples with the combinations to omit.

2. Added `combinations` Parameter to `show_plot`

Feature: Users can now display just the list of combinations that are part of the selection process when all_vars=True.

3. Fixed Bug with Single Variable Pair Plotting

Bug: When plotting a single variable pair with show_plot="both", the function threw an AttributeError.
Fix: Single-variable pairs are now properly handled.

4. Updated Default for `show_plot` Parameter

Enhancement: Changed the default value of show_plot to "both" to prevent excessive individual plots when handling large variable sets.

5. Legend, `xlim`, `ylim` inputs were not being used; fixed.

Fix Default Title and Filename Handling in `flex_corr_matrix`

This resolves issues in the flex_corr_matrix function where:

No default title was provided when title=None, resulting in missing titles on plots.
Saved plot filenames were incorrect, leading to issues like .png.png when title was not provided.

The fix ensures that a default title ("Correlation Matrix") is used for both plot display and file saving when no title is explicitly provided. If title is explicitly set to None, the plot will have no title, but the saved filename will still use "correlation_matrix".

1. Default Filename and Title Logic:

If no title is provided, "Correlation Matrix" is used as the default for filenames and displayed titles.
If title=None is explicitly passed, no title is displayed on the plot.

2. File Saving Improvements:

File names are generated based on the title or default to "correlation_matrix" if title is not provided.
Spaces in the title are replaced with underscores, and special characters like : are removed to ensure valid filenames.
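A minimal sketch of this filename logic, assuming a hypothetical helper `title_to_filename` and a `png` extension; the exact set of characters the library strips is not stated here, so the regular expression below is an assumption.

```python
import re

DEFAULT_STEM = "correlation_matrix"  # fallback name given in the notes above


def title_to_filename(title=None, extension="png"):
    """Build a file name from a plot title (hypothetical helper)."""
    if not title:
        stem = DEFAULT_STEM
    else:
        stem = title.replace(" ", "_")
        # Assumption: keep only letters, digits, underscores, and hyphens.
        stem = re.sub(r"[^A-Za-z0-9_-]", "", stem) or DEFAULT_STEM
    # The extension is appended exactly once, avoiding names like ".png.png".
    return f"{stem}.{extension}"


print(title_to_filename("Correlation: Age vs Income"))
print(title_to_filename(None))
```

Keeping the stem and the extension separate until the final step is one way to avoid the doubled-extension problem mentioned above.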


25 Dec 00:43

lshpaner

0.0.13a

2506116

EDA Toolkit 0.0.13a

Description

This release introduces a series of updates and fixes across multiple functions in the library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.

Add `ValueError` for Insufficient Pool Size in `add_ids` and Enhance ID Deduplication

This update enhances the add_ids function by adding explicit error handling and improving the uniqueness guarantee for generated IDs. The following changes have been implemented:

Key Changes

New ValueError for Insufficient Pool Size:
Calculates the pool size ($9 \times 10^{d-1}$ for $d$-digit IDs) and compares it with the number of rows in the DataFrame.
- Behavior:
  - Throws a ValueError if n_rows > pool_size.
  - Prints a warning if n_rows approaches 90% of the pool size, suggesting an increase in digit length.
Improved ID Deduplication:
Introduced a set (unique_ids) to track generated IDs.
IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.
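The pool-size check and set-based deduplication might look roughly like the sketch below. It is a stand-in rather than the actual add_ids code: the name `generate_unique_ids`, its parameters, the warning text, and the use of `random.Random` are all assumptions.

```python
import random


def generate_unique_ids(n_rows, d=9, seed=None):
    """Draw n_rows distinct d-digit IDs (hypothetical stand-in for add_ids)."""
    pool_size = 9 * 10 ** (d - 1)  # d-digit numbers with a non-zero first digit
    if n_rows > pool_size:
        raise ValueError(
            f"{n_rows} rows need more IDs than the {pool_size} available with {d} digits."
        )
    if n_rows >= 0.9 * pool_size:  # 90% threshold as described in the notes
        print("Warning: the ID pool is nearly exhausted; consider more digits.")

    rng = random.Random(seed)
    low, high = 10 ** (d - 1), 10 ** d - 1
    unique_ids = set()
    while len(unique_ids) < n_rows:
        # A set ignores repeats, so only colliding draws cost an extra iteration.
        unique_ids.add(rng.randint(low, high))
    return [str(value) for value in unique_ids]


ids = generate_unique_ids(5, d=3, seed=42)
print(len(ids), len(set(ids)))
```

For example, with `d=1` the pool holds only 9 IDs, so requesting 10 rows raises the `ValueError` instead of looping forever.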

Benefits

Ensures robust error handling, avoiding silent failures or excessive retries caused by small digit lengths.
Guarantees unique IDs even for large DataFrames, improving reliability and scalability.

Enhance `strip_trailing_period` to Support Strings and Mixed Data Types

This enhances the strip_trailing_period function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like NaN.

Key Enhancements

Support for Strings with Trailing Periods:
- Removes trailing periods from string values, such as "123." or "test.".
Mixed Data Types:
- Handles columns containing both numeric and string values seamlessly.
Graceful Handling of NaN:
- Skips processing for NaN values, leaving them unchanged.
Robust Type Conversion:
- Converts numeric strings (e.g., "123.") back to float where applicable.
- Retains strings if conversion to float is not possible.
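The per-value rules above can be sketched as follows. `strip_period` is a hypothetical single-value helper, not the library function, which operates on DataFrame columns.

```python
import math


def strip_period(value):
    """Strip one trailing period from a single value (hypothetical helper)."""
    # Leave missing values untouched.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    if isinstance(value, str) and value.endswith("."):
        stripped = value[:-1]
        try:
            return float(stripped)  # numeric strings such as "123." become floats
        except ValueError:
            return stripped  # non-numeric strings such as "test." stay strings
    return value  # anything else passes through unchanged


mixed = ["123.", "test.", 4.5, float("nan"), "plain"]
cleaned = [strip_period(v) for v in mixed]
print(cleaned)
```

Applied element-wise to a mixed column, a function of this shape leaves numbers and NaN alone and only rewrites strings that end in a period.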

Changes in `stacked_crosstab_plot`

Remove `IPython` Dependency by Replacing `display` with `print`

This resolves an issue where the eda_toolkit library required IPython as a dependency due to the use of display(crosstab_df) in the stacked_crosstab_plot function. The dependency caused import failures in environments without IPython, especially in non-Jupyter terminal-based workflows.

Changes Made

Replaced display with print:

The line display(crosstab_df) was replaced with print(crosstab_df) to eliminate the need for IPython. This ensures compatibility across terminal and Jupyter environments without requiring additional dependencies.
Removed IPython Import:
- The `from IPython.display import display` statement was removed from the codebase.

Updated Function Behavior:

Crosstabs are displayed using print, maintaining functionality in all runtime environments.
The change ensures no loss in usability or user experience.

Root Cause and Fix

The issue arose from the reliance on IPython.display.display for rendering crosstab tables in Jupyter notebooks. Since IPython is not a core dependency of eda_toolkit, environments without IPython experienced a ModuleNotFoundError.

To address this, the display(crosstab_df) statement was replaced with print(crosstab_df), simplifying the function while maintaining compatibility with both Jupyter and terminal environments.

Testing

Jupyter Notebook:
- Crosstabs are displayed as plain text via print(), rendered neatly in notebook outputs.
Terminal Session:
- Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.

Add Environment Detection to `dataframe_columns` Function

This enhances the dataframe_columns function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal). It resolves issues where the function's styled output was incompatible with terminal environments.

Changes Made

Environment Detection:
- Added a check to determine if the function is running in a Jupyter Notebook or terminal:
```
is_notebook_env = "ipykernel" in sys.modules
```
Dynamic Output Behavior:
- Terminal Environment:
  - Returns a plain DataFrame (result_df) when running outside of a notebook or when return_df=True.
- Jupyter Notebook:
  - Retains the styled DataFrame functionality when running in a notebook and return_df=False.
Improved Compatibility:
- The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
Preserved Existing Features:
- Maintains sorting behavior via sort_cols_alpha.
- Keeps the background color styling for specific columns (unique_values_total, max_unique_value, etc.) in notebook environments.
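A rough sketch of this branching, assuming hypothetical helpers `is_notebook_env` and `select_output`; the `modules` argument exists only so the check can be exercised without a running kernel. Checking for `ipykernel` in `sys.modules` is a heuristic, not a guarantee that a notebook front end is attached.

```python
import sys


def is_notebook_env(modules=None):
    """Return True when an IPython kernel module has been imported."""
    modules = sys.modules if modules is None else modules
    return "ipykernel" in modules


def select_output(result_df, make_styled, return_df=False, modules=None):
    """Return the plain result in terminals, the styled one in notebooks."""
    if return_df or not is_notebook_env(modules):
        return result_df
    return make_styled(result_df)


# Simulate both environments by passing a stand-in module table.
terminal = select_output("plain", lambda df: "styled", modules={})
notebook = select_output("plain", lambda df: "styled", modules={"ipykernel": object()})
print(terminal, notebook)
```

Deferring the styling step behind a callable means the styled object is only built when it will actually be shown.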

Add `tqdm` Progress Bar to `dataframe_columns` Function

This enhances the dataframe_columns function by incorporating a tqdm progress bar to track the processing of each column. This is particularly useful for analyzing large DataFrames, providing real-time feedback on the function's progress.

Changes Made

Added tqdm Progress Bar:

Wrapped the column processing loop with a tqdm progress bar:

```
for col in tqdm(df.columns, desc="Processing columns"):
    ...
```

The progress bar is labeled with the description "Processing columns" for clarity.
The progress bar is non-intrusive and works seamlessly in both terminal and Jupyter Notebook environments.
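As an illustration of the same pattern outside the library, the sketch below wraps a per-column loop in `tqdm`. The function `count_unique_per_column` and the dict-of-lists input are hypothetical, and the fallback definition is there only so the example also runs where tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:  # fallback so the sketch still runs without tqdm
    def tqdm(iterable, **kwargs):
        return iterable


def count_unique_per_column(columns):
    """Count distinct values per column, reporting progress column by column."""
    summary = {}
    for name in tqdm(list(columns), desc="Processing columns"):
        summary[name] = len(set(columns[name]))
    return summary


table = {"age": [31, 31, 47], "city": ["Lima", "Oslo", "Pune"]}
summary = count_unique_per_column(table)
print(summary)
```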

`box_violin_plot` Fix Plot Display for Terminal Applications and Simplify `save_plot` Functionality

This addresses the following issues:

Removes plt.close(fig)
- Ensures plots display properly in terminal-based applications and IDEs outside Jupyter Notebooks.
- Fixes the incompatibility with non-interactive environments by leaving figures open after rendering.
Simplifies save_plot Parameter
- Converts save_plot into a boolean for simplicity and better integration with the existing show_plot parameter.
- Automatically saves plots based on the value of show_plot ("individual," "grid," or "both") when save_plot=True.

These changes improve the usability and flexibility of the plotting function across different environments.

Changes Made

Removed plt.close(fig) to allow plots to remain open in non-Jupyter environments.
Updated the save_plot parameter to be a boolean, streamlining the control logic with show_plot.
Adjusted the relevant sections of the code to implement these changes.

Updated ValueError check based on the new save_plots input:

```
# Check for valid save_plots value
if not isinstance(save_plots, bool):
    raise ValueError("`save_plots` must be a boolean value (True or False).")
```
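One way to picture how a boolean flag combines with show_plot is the sketch below. The helper `plots_to_save` is hypothetical; it only restates the rule described in this section and is not the function's actual control flow.

```python
def plots_to_save(show_plot, save_plots):
    """Return which plot kinds would be written to disk (hypothetical helper)."""
    if not isinstance(save_plots, bool):
        raise ValueError("`save_plots` must be a boolean value (True or False).")
    if show_plot not in ("individual", "grid", "both"):
        raise ValueError("`show_plot` must be 'individual', 'grid', or 'both'.")
    if not save_plots:
        return set()
    # With saving enabled, the saved plots simply follow what is shown.
    return {"individual", "grid"} if show_plot == "both" else {show_plot}


print(plots_to_save("both", True))
print(plots_to_save("grid", False))
```

With a boolean flag, callers choose what to render through show_plot and only decide separately whether those same plots are saved.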

`scatter_fit_plot`: Render Plots Before Saving

Updates the scatter_fit_plot function to render all plots (plt.show()) before saving, so that users can visually check the output before it is written to disk.

Changes

Added plt.show() to render individual and grid plots before saving.
Integrated tqdm for progress tracking while saving individual and grid plots.

Add `tqdm` Progress Bar to `save_dataframes_to_excel`

This enhances the save_dataframes_to_excel function by integrating a tqdm progress bar for improved tracking of the DataFrame saving process. Users can now visually monitor the progress of writing each DataFrame to its respective sheet in the Excel file.

Changes Made

Added a tqdm Progress Bar:
- Tracks the progress of saving DataFrames to individual sheets.
- Ensures that the user sees an incremental update as each DataFrame is written.
Updated Functionality:
- Incorporated the progress bar into the loop that writes DataFrames to sheets.
- Retained the existing formatting features (e.g., auto-fitting columns, numeric formatting, and header styles).

Add Progress Tracking and Enhance Functionality for `summarize_all_combinations`

This enhances the summarize_all_combinations function by adding user-friendly progress tracking using tqdm and addressing usability concerns. The following changes have been implemented:

Progress Tracking with tqdm
Excel File Finalization:

Addressed UserWarning messages related to close() being called on already closed files by explicitly managing file closure.
Added a final confirmation message when the Excel file is successfully saved.

Fix Plot Display Logic in `plot_2d_pdp`

This resolves an issue in the plot_2d_pdp function where all plots (grid and individual) were being displayed unnecessarily when save_plots="all". The function now adheres strictly to the plot_type parameter, showing only the intended plots. It also ensures unused plots are closed to prevent memory issues.

Changes Made:

Grid Plot Logic:
- Grid plots are only displayed if plot_type="grid" or plot_type="both".
- If save_plots="all" or save_plots="grid", plots are saved without being displayed unless specified by plot_type.
Individual Plot Logic:
- Individual plots are only displayed if plot_type="individual" or plot_type="both".
- If save_plots="all" or save_plots="individual", plots are saved but not displayed unless specified by `plo...
