Releases: lshpaner/eda_toolkit
EDA Toolkit 0.0.22
EDA Toolkit 0.0.21
- Fixed dependency resolution for Python 3.11+ by updating SciPy version constraints.
- Improved compatibility with Google Colab and Python 3.12 environments.
- Updated package acknowledgements and metadata in `__init__.py`.
- No API changes.
EDA Toolkit 0.0.20
What's Changed
- flexible summarize_combinations sans excel by @lshpaner in #104
- Groupby imputer by @Oscar-Gil-Data in #109
- Refactor `kde_distributions` and extract density overlay logic into plot utilities by @lshpaner in #110
- Introduce Enhanced Q-Q Plot Support for Distribution GOF Diagnostics by @lshpaner in #111
- Add ECDF plot type to data_doctor; add SciPy to requirements by @Oscar-Gil-Data in #112
- Add conditional histograms and unify figure-saving utilities by @lshpaner in #114
- DataFrame memory management utility, del_inactive_dataframes by @Oscar-Gil-Data in #115
Full Changelog: 0.0.19...0.0.20
EDA Toolkit 0.0.19
Full Changelog: 0.0.18...0.0.19
EDA Toolkit 0.0.18
What's Changed
- Refactor: Add Typing, Welch's t-test Option, and Improve Coverage to 80% by @lshpaner in #100
- Enhance Table 1 formatting and pretty-printing by @lshpaner in #101
Full Changelog: 0.0.17...0.0.18
EDA Toolkit 0.0.17
- Changed all instances of the word/term `grid` to `subplots` where applicable.
- Introduced a new function, `outcome_crosstab_plot`, which generates crosstab-based stacked bar plots for visualizing the relationship between a binary outcome variable and multiple categorical features. The function supports:
  - Normalized or raw counts
  - Custom label fonts and layout control
  - Flexible color customization (default, list, or column-specific dict)
  - Optional value count or percentage annotations in the legend
  - PNG/SVG export support
This adds a compact, interpretable way to explore variable–outcome relationships during EDA.
EDA Toolkit 0.0.16
What's Changed
- Enhanced Plotting (+) Text Wrap Functionality by @lshpaner in #82
- added sklearn import by @lshpaner in #83
- Changed `dataframe_columns` name to `dataframe_profiler` by @Oscar-Gil-Data in #84
- Improve `plot_3d_pdp` Function: Fix UnboundLocalError & Update Docstring by @lshpaner in #85
  - Moved `plot_2d_pdp` and `plot_3d_pdp` to https://github.com/lshpaner/model_metrics
- Unit Tests LS by @lshpaner in #86
- Unittests ls by @lshpaner in #87
- Refactor, Feature Expansion & Test Separation by @lshpaner in #88
- Add `include_types` filter to `generate_table1()` by @lshpaner in #89
- Improve generate_table1 Function for Enhanced Usability and Output Control by @lshpaner in #90
- Improve formatting of integer columns in Table 1 summary by @lshpaner in #91
- Fix Pretty Print Regression in generate_table1 by @lshpaner in #92
- Improve `generate_table1` Output Handling and Markdown Logic by @lshpaner in #93
- Add `.drop()` method to `TableWrapper` for preserving pretty print by @lshpaner in #95
Full Changelog: 0.0.15...0.0.16
EDA Toolkit 0.0.15
Scatter Plot Function Updates
Avoid In-Place Modification of exclude_combinations
This addresses an issue where the scatter_fit_plot function modifies the exclude_combinations parameter in-place, causing errors when reused in subsequent calls.
Changes Made
- Create a local copy of `exclude_combinations` for normalization instead of modifying the input directly: `exclude_combinations_normalized = {tuple(sorted(pair)) for pair in exclude_combinations}`
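As a rough illustration of why the local copy matters, the sketch below normalizes the exclusions into a new set and leaves the caller's list alone, so the same list can be passed to repeated calls. The helper name `filter_pairs` and the variable names are hypothetical and not part of eda_toolkit; only the set comprehension is taken from the note above.

```python
from itertools import combinations


def filter_pairs(variables, exclude_combinations=None):
    """Return the variable pairs to plot, skipping excluded ones.

    Hypothetical helper for illustration only; not the eda_toolkit API.
    """
    exclude_combinations = exclude_combinations or []
    # Normalize into a new local set so the caller's object is never modified.
    exclude_combinations_normalized = {
        tuple(sorted(pair)) for pair in exclude_combinations
    }
    return [
        pair
        for pair in combinations(sorted(variables), 2)
        if pair not in exclude_combinations_normalized
    ]


excluded = [("weight", "age")]
first_call = filter_pairs(["age", "height", "weight"], excluded)
second_call = filter_pairs(["age", "height", "weight"], excluded)  # reuse is safe
```

Because the input is only read, `excluded` keeps its original order and contents after both calls.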
Improve Progress Tracking and Resolve Last Plot Saving Issue
- Separate Progress Bars for Grid Plot:
  - Added a `tqdm` progress bar to track the rendering of subplots in the grid.
  - Introduced a second `tqdm` progress bar to handle the saving step of the entire grid plot.
- Fix for Last Plot Saving with `save_plots="all"`:
  - Ensured individual plots and the grid plot are saved independently without overlap or interference.
  - Addressed an issue where the last individual plot was incorrectly saved or overwritten.
- Accurate Updates and Feedback:
  - Progress bars now provide clear updates for rendering and saving stages, avoiding any hanging or delays.
Updated tqdm saving logic in scatter_fit_plot
- Refactored `tqdm` progress bar in `scatter_fit_plot` to track the overall plot-saving process, covering both individual and grid plots.
- Updated `tqdm` progress bar description in `scatter_fit_plot` to use universal phrasing: "Saving scatter plot(s)."
- Ensured consistency for singular or multiple plot-saving scenarios in progress tracking.
EDA Toolkit 0.0.14
Ensure Crosstabs Dictionary is Populated with return_dict=True
This resolves the issue where the stacked_crosstab_plot function fails to populate and return the crosstabs dictionary (crosstabs_dict) when return_dict=True and output="plots_only". The fix ensures that crosstabs are always generated when return_dict=True, regardless of the output parameter.
- Always Generate Crosstabs with `return_dict=True`:
  - Added logic to ensure crosstabs are created and populated in `crosstabs_dict` whenever `return_dict=True`, even if the output parameter is set to `"plots_only"`.
- Separation of Crosstabs Display from Generation:
  - The generation of crosstabs is now independent of the output parameter.
  - Crosstabs display (`print`) occurs only when output includes `"both"` or `"crosstabs_only"`.
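The separation between generating and displaying crosstabs can be sketched roughly as follows. This is not the `stacked_crosstab_plot` source: the function `build_crosstabs`, its signature, and the use of `collections.Counter` in place of a pandas crosstab are all assumptions made to keep the example self-contained.

```python
from collections import Counter


def build_crosstabs(rows, columns, outcome, output="both", return_dict=False):
    """Hypothetical stand-in for the crosstab step; not the eda_toolkit API."""
    crosstabs_dict = {}
    for col in columns:
        # Generation always runs, regardless of the `output` setting.
        crosstabs_dict[col] = Counter((row[col], row[outcome]) for row in rows)
        # Display is the only part gated on `output`.
        if output in ("both", "crosstabs_only"):
            print(col, dict(crosstabs_dict[col]))
    # The populated dictionary is handed back only when requested.
    return crosstabs_dict if return_dict else None


rows = [
    {"sex": "F", "outcome": 1},
    {"sex": "M", "outcome": 0},
    {"sex": "F", "outcome": 0},
]
result = build_crosstabs(
    rows, ["sex"], "outcome", output="plots_only", return_dict=True
)
```

With `output="plots_only"` nothing is printed, yet `result` still holds the counts, which is the behavior the fix describes.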
Enhancements and Fixes for scatter_fit_plot Function
This addresses critical issues and introduces key enhancements for the scatter_fit_plot function. These changes aim to improve usability, flexibility, and robustness of the function.
Enhancements and Fixes
1. Added exclude_combinations Parameter
- Feature: Users can now exclude specific variable pairs from being plotted by providing a list of tuples with the combinations to omit.
2. Added combinations Parameter to show_plot
- Feature: Users can now show just the list of combinations that are part of the selection process when `all_vars=True`.
3. Fixed Bug with Single Variable Pair Plotting
- Bug: When plotting a single variable pair with `show_plot="both"`, the function threw an `AttributeError`.
- Fix: Single-variable pairs are now properly handled.
4. Updated Default for show_plot Parameter
- Enhancement: Changed the default value of `show_plot` to `"both"` to prevent excessive individual plots when handling large variable sets.
5. Legend, xlim, ylim inputs were not being used; fixed.
Fix Default Title and Filename Handling in flex_corr_matrix
This resolves issues in the flex_corr_matrix function where:
- No default title was provided when `title=None`, resulting in missing titles on plots.
- Saved plot filenames were incorrect, leading to issues like `.png.png` when `title` was not provided.
The fix ensures that a default title ("Correlation Matrix") is used for both plot display and file saving when no title is explicitly provided. If title is explicitly set to None, the plot will have no title, but the saved filename will still use "correlation_matrix".
1. Default Filename and Title Logic:
- If no `title` is provided, `"Correlation Matrix"` is used as the default for filenames and displayed titles.
- If `title=None` is explicitly passed, no title is displayed on the plot.
2. File Saving Improvements:
- File names are generated based on the `title` or default to `"correlation_matrix"` if `title` is not provided.
- Spaces in the `title` are replaced with underscores, and special characters like `:` are removed to ensure valid filenames.
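One way to picture the title and filename rules above is the sketch below. The helper `resolve_title_and_filename`, the `_UNSET` sentinel, and the exact set of characters stripped (anything other than letters, digits, underscores, and hyphens) are assumptions for illustration; the release notes only name `:` as an example of a removed character.

```python
import re

DEFAULT_TITLE = "Correlation Matrix"
DEFAULT_STEM = "correlation_matrix"
_UNSET = object()  # sentinel: tells "argument omitted" apart from an explicit None


def resolve_title_and_filename(title=_UNSET, image_ext="png"):
    """Hypothetical helper returning (displayed_title, filename)."""
    if title is _UNSET:
        # No title passed: default title on the plot, default file stem.
        displayed, stem = DEFAULT_TITLE, DEFAULT_STEM
    elif title is None:
        # Explicit None: no title on the plot, but still a usable file stem.
        displayed, stem = None, DEFAULT_STEM
    else:
        displayed = title
        stem = title.replace(" ", "_")
        # Assumption: drop everything except word characters and hyphens.
        stem = re.sub(r"[^\w\-]", "", stem)
    # The extension is appended exactly once, avoiding names like ".png.png".
    return displayed, f"{stem}.{image_ext}"
```

For example, `resolve_title_and_filename("Heart Data: Correlations")` would give the filename `Heart_Data_Correlations.png` under these assumptions.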
EDA Toolkit 0.0.13a
Description
This release introduces a series of updates and fixes across multiple functions in the library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.
Add ValueError for Insufficient Pool Size in add_ids and Enhance ID Deduplication
This update enhances the add_ids function by adding explicit error handling and improving the uniqueness guarantee for generated IDs. The following changes have been implemented:
Key Changes
- New `ValueError` for Insufficient Pool Size:
  - Calculates the pool size ($9 \times 10^{d-1}$, where $d$ is the number of digits) and compares it with the number of rows in the DataFrame.
  - Behavior:
    - Throws a `ValueError` if `n_rows > pool_size`.
    - Prints a warning if `n_rows` approaches 90% of the pool size, suggesting an increase in digit length.
- Improved ID Deduplication:
  - Introduced a set (`unique_ids`) to track generated IDs.
  - IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
  - Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.
Benefits
- Ensures robust error handling, avoiding silent failures or excessive retries caused by small digit lengths.
- Guarantees unique IDs even for large DataFrames, improving reliability and scalability.
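The pool-size check and set-based deduplication can be sketched as below. This is an illustrative stand-in, not the `add_ids` implementation: the function name `generate_unique_ids`, its parameters, the use of `warnings.warn` rather than a printed message, and the strict "more than 90%" threshold are all assumptions.

```python
import random
import warnings


def generate_unique_ids(n_rows, d=9, seed=None):
    """Hypothetical sketch of the checks described above; not the add_ids source."""
    # d-digit numbers with a nonzero leading digit: 9 * 10^(d-1) possibilities.
    pool_size = 9 * 10 ** (d - 1)
    if n_rows > pool_size:
        raise ValueError(
            f"Cannot generate {n_rows} unique {d}-digit IDs; "
            f"pool size is {pool_size}."
        )
    if n_rows > 0.9 * pool_size:
        warnings.warn("n_rows is close to the pool size; consider more digits.")

    rng = random.Random(seed)
    unique_ids = set()  # tracks every ID issued so far
    ids = []
    while len(ids) < n_rows:
        candidate = rng.randint(10 ** (d - 1), 10 ** d - 1)
        if candidate not in unique_ids:  # only a collision triggers a redraw
            unique_ids.add(candidate)
            ids.append(candidate)
    return ids
```

With `d=1` the pool holds only 9 values, so asking for 10 rows raises `ValueError` immediately instead of looping forever.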
Enhance strip_trailing_period to Support Strings and Mixed Data Types
- This enhances the `strip_trailing_period` function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like `NaN`.
Key Enhancements
- Support for Strings with Trailing Periods:
  - Removes trailing periods from string values, such as "123." or "test.".
- Mixed Data Types:
  - Handles columns containing both numeric and string values seamlessly.
- Graceful Handling of `NaN`:
  - Skips processing for `NaN` values, leaving them unchanged.
- Robust Type Conversion:
  - Converts numeric strings (e.g., "123.") back to float where applicable.
  - Retains strings if conversion to float is not possible.
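The per-value behavior listed above might look roughly like this. The library function operates on DataFrame columns; the helper below, `strip_trailing_period_value`, is a hypothetical single-value version written with the standard library only, so details such as how `None` is treated are assumptions.

```python
import math


def strip_trailing_period_value(value):
    """Hypothetical per-value sketch; not the eda_toolkit implementation."""
    # Leave missing values untouched.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    text = str(value)
    if not text.endswith("."):
        return value  # nothing to strip
    stripped = text[:-1]  # drop the single trailing period
    try:
        return float(stripped)  # numeric strings become floats
    except ValueError:
        return stripped  # non-numeric strings stay strings


cleaned = [strip_trailing_period_value(v) for v in ["123.", "test.", 4.5, "abc"]]
```

Applied to a mixed list, numeric strings come back as floats while other strings only lose the period.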
Changes in stacked_crosstab_plot
Remove IPython Dependency by Replacing display with print
This resolves an issue where the eda_toolkit library required IPython as a dependency due to the use of display(crosstab_df) in the stacked_crosstab_plot function. The dependency caused import failures in environments without IPython, especially in non-Jupyter terminal-based workflows.
Changes Made
- Replaced `display` with `print`:
  - The line `display(crosstab_df)` was replaced with `print(crosstab_df)` to eliminate the need for `IPython`. This ensures compatibility across terminal and Jupyter environments without requiring additional dependencies.
- Removed `IPython` Import:
  - The `from IPython.display import display` import statement was removed from the codebase.
Updated Function Behavior:
- Crosstabs are displayed using print, maintaining functionality in all runtime environments.
- The change ensures no loss in usability or user experience.
Root Cause and Fix
The issue arose from the reliance on IPython.display.display for rendering crosstab tables in Jupyter notebooks. Since IPython is not a core dependency of eda_toolkit, environments without IPython experienced a ModuleNotFoundError.
To address this, the display(crosstab_df) statement was replaced with print(crosstab_df), simplifying the function while maintaining compatibility with both Jupyter and terminal environments.
Testing
- Jupyter Notebook:
  - Crosstabs are displayed as plain text via `print()`, rendered neatly in notebook outputs.
- Terminal Session:
  - Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.
Add Environment Detection to dataframe_columns Function
This enhances the dataframe_columns function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal). It resolves issues where the function's styled output was incompatible with terminal environments.
Changes Made
- Environment Detection:
  - Added a check to determine if the function is running in a Jupyter Notebook or terminal: `is_notebook_env = "ipykernel" in sys.modules`
- Dynamic Output Behavior:
  - Terminal Environment:
    - Returns a plain DataFrame (`result_df`) when running outside of a notebook or when `return_df=True`.
  - Jupyter Notebook:
    - Retains the styled DataFrame functionality when running in a notebook and `return_df=False`.
- Improved Compatibility:
  - The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
- Preserved Existing Features:
  - Maintains sorting behavior via `sort_cols_alpha`.
  - Keeps the background color styling for specific columns (`unique_values_total`, `max_unique_value`, etc.) in notebook environments.
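The branching described above can be sketched without pandas. The `"ipykernel" in sys.modules` check comes from the release notes; the helper names, the optional `modules` argument (added so the logic can be exercised outside a notebook), and the string return values standing in for a plain versus styled DataFrame are assumptions.

```python
import sys


def is_notebook_env(modules=None):
    """Check whether the ipykernel module is loaded, as in the note above."""
    modules = sys.modules if modules is None else modules
    return "ipykernel" in modules


def choose_output_mode(return_df=False, modules=None):
    """Hypothetical helper: 'plain' for a raw DataFrame, 'styled' for styled output."""
    # Plain output outside a notebook, or whenever the caller asks for it.
    if return_df or not is_notebook_env(modules):
        return "plain"
    return "styled"


fake_notebook = {"ipykernel": object()}  # simulated notebook module table
modes = (
    choose_output_mode(False, modules={}),
    choose_output_mode(False, modules=fake_notebook),
    choose_output_mode(True, modules=fake_notebook),
)
```

Only the combination of a notebook environment and `return_df=False` selects the styled path.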
Add tqdm Progress Bar to dataframe_columns Function
This enhances the dataframe_columns function by incorporating a tqdm progress bar to track the processing of each column. This is particularly useful for analyzing large DataFrames, providing real-time feedback on the function's progress.
Changes Made
- Added `tqdm` Progress Bar:
  - Wrapped the column processing loop with a `tqdm` progress bar: `for col in tqdm(df.columns, desc="Processing columns"): ...`
  - The progress bar is labeled with the description `"Processing columns"` for clarity.
  - The progress bar is non-intrusive and works seamlessly in both terminal and Jupyter Notebook environments.
box_violin_plot: Fix Plot Display for Terminal Applications and Simplify save_plot Functionality
This addresses the following issues:
- Removes `plt.close(fig)`
  - Ensures plots display properly in terminal-based applications and IDEs outside Jupyter Notebooks.
  - Fixes the incompatibility with non-interactive environments by leaving figures open after rendering.
- Simplifies `save_plot` Parameter
  - Converts `save_plot` into a `boolean` for simplicity and better integration with the existing `show_plot` parameter.
  - Automatically saves plots based on the value of `show_plot` (`"individual"`, `"grid"`, or `"both"`) when `save_plot=True`.
These changes improve the usability and flexibility of the plotting function across different environments.
Changes Made
- Removed `plt.close(fig)` to allow plots to remain open in non-Jupyter environments.
- Updated the `save_plot` parameter to be a `boolean`, streamlining the control logic with `show_plot`.
- Adjusted the relevant sections of the code to implement these changes.
- Updated `ValueError` check based on the new `save_plots` input:

      # Check for valid save_plots value
      if not isinstance(save_plots, bool):
          raise ValueError("`save_plots` must be a boolean value (True or False).")
scatter_fit_plot: Render Plots Before Saving
- Update the `scatter_fit_plot` function to render all plots (`plt.show()`) before saving, improving user experience and output quality validation.
Changes
- Added `plt.show()` to render individual and grid plots before saving.
- Integrated `tqdm` for progress tracking during saving individual plots and grid plots.
Add tqdm Progress Bar to save_dataframes_to_excel
This enhances the save_dataframes_to_excel function by integrating a tqdm progress bar for improved tracking of the DataFrame saving process. Users can now visually monitor the progress of writing each DataFrame to its respective sheet in the Excel file.
Changes Made
- Added a `tqdm` Progress Bar:
  - Tracks the progress of saving DataFrames to individual sheets.
  - Ensures that the user sees an incremental update as each DataFrame is written.
- Updated Functionality:
  - Incorporated the progress bar into the loop that writes DataFrames to sheets.
  - Retained the existing formatting features (e.g., auto-fitting columns, numeric formatting, and header styles).
Add Progress Tracking and Enhance Functionality for summarize_all_combinations
This enhances the summarize_all_combinations function by adding user-friendly progress tracking using tqdm and addressing usability concerns. The following changes have been implemented:
- Progress Tracking with `tqdm`
- Excel File Finalization:
  - Addressed `UserWarning` messages related to `close()` being called on already closed files by explicitly managing file closure.
  - Added a final confirmation message when the Excel file is successfully saved.
Fix Plot Display Logic in plot_2d_pdp
This resolves an issue in the plot_2d_pdp function where all plots (grid and individual) were being displayed unnecessarily when save_plots="all". The function now adheres strictly to the plot_type parameter, showing only the intended plots. It also ensures unused plots are closed to prevent memory issues.
Changes Made:
- Grid Plot Logic:
  - Grid plots are only displayed if `plot_type="grid"` or `plot_type="both"`.
  - If `save_plots="all"` or `save_plots="grid"`, plots are saved without being displayed unless specified by `plot_type`.
- Individual Plot Logic:
  - Individual plots are only displayed if `plot_type="individual"` or `plot_type="both"`.
  - If `save_plots="all"` or `save_plots="individual"`, plots are saved but not displayed unless specified by `plo...