Welcome to Our DataFrame Memory Optimizer


Testing
Package
Meta

This is a collaborative project developed as part of DSCI 524: Collaborative Software Development.

Project Summary

DataFrame Memory Optimizer is a lightweight Python package that reduces the memory footprint of pandas DataFrames by automatically selecting more efficient data types. It analyzes each column in a DataFrame and applies safe, transparent optimizations such as numeric downcasting and categorical conversion while preserving the original data values. This will allow users to work more efficiently with large datasets in data analysis, machine learning pipelines, and resource-constrained environments.

Example on how to use our package

To use the DataFrame Memory Optimizer, start by importing optimize_dataframe from the group_32 package. Pass any pandas DataFrame to this function, and it will return a new optimized copy while leaving the original DataFrame unchanged. The optimizer automatically applies safe numeric downcasting, converts low-cardinality text columns to categorical types, and reports columns that were intentionally left unchanged (such as identifiers or high-cardinality text). The optimized DataFrame behaves the same as the original in downstream analysis, but typically uses less memory.

import pandas as pd
from group_32.optimize_dataframe import optimize_dataframe

# Create a simple DataFrame
df = pd.DataFrame({
    "region": ["US", "CA", "US", "US"],
    "quantity": [1, 2, 3, 4],
    "price": [10.5, 12.0, 9.99, 11.25]
})

# Optimize memory usage
optimized_df = optimize_dataframe(df)

# Inspect optimized dtypes
print(optimized_df.dtypes)

Planned Functions

The package is designed around a simple, reproducible workflow: users call optimize_dataframe() once, and the function applies a consistent set of memory-saving steps while keeping the original DataFrame unchanged.

optimize_dataframe(df) The main user-facing entry point. It creates a copy of df and coordinates the full optimization pipeline by (1) downcasting numeric columns, (2) converting low-cardinality text columns to category, and (3) running a diagnostics pass to report columns that were intentionally left unchanged (e.g., IDs). Returns the optimized DataFrame.

optimize_numeric(df) Numeric optimizer used by optimize_dataframe(). It reduces memory usage by downcasting integer columns to the smallest safe integer dtype and optionally downcasting floating-point columns (e.g., float64 → float32) with minimal expected precision impact.

optimize_categorical(df, max_unique_ratio=0.5) Categorical optimizer used by optimize_dataframe(). It converts low-cardinality string/object columns to pandas category dtype based on the ratio of unique values to total rows. This is especially effective for repeated labels like status codes, regions, or categories.

analyze_special_columns(df) Diagnostics and transparency helper used by optimize_dataframe(). It identifies columns that may require special handling (e.g., high-cardinality IDs, coordinate columns, free-text fields) and reports them to help users understand optimization decisions. This function does not modify the DataFrame.

Installing the Package

You can install the DataFrame Memory Optimizer package directly from TestPyPI using pip. Run the following command in your terminal:

pip install -i https://test.pypi.org/simple/ DSCI-524-group32-df-optimizer

Developer Documentation

The following sections describe how developers, instructors, and TAs can set up the project locally, run tests, and build/deploy documentation. ## 1. Clone the repository and set Up the Development Environment This project uses conda for environment management. To set up the development environment, run:

git clone [email protected]:UBC-MDS/DSCI_524_group32_df_optimizer.git
cd DSCI_524_group32_df_optimizer
conda env create -f environment.yml
conda activate dataframe-memory-optimizer

The environment.yml file contains all required dependencies for development, testing, and documentation. ## 2. Install the Package To install the package in development mode, navigate to the project root directory and run:

pip install -e .

This allows you to make changes to the source code and have them reflected immediately without reinstalling the package. ## 3. Run Tests To run the test suite, use pytest. From the project root directory, execute:

pytest

This will discover and run all tests located in the tests/ directory. ## 4. Build and Serve Documentation The project uses quartodoc for documentation. To build and serve the documentation locally, run:

quarto render

This will generate the documentation in the docs/ directory. You can then open the generated HTML files in your web browser to view the documentation. ## 5. Deploy Documentation (Automated) Documentation deployment is fully automated via GitHub Actions.

Deployment workflow :-

Triggered automatically on pushes to the main branch
Builds documentation using Quarto
Deploys the site to GitHub Pages

The live documentation is available at: https://ubc-mds.github.io/DSCI_524_group32_df_optimizer/

How the Package Fits in the Python Ecosystem

Pandas provides low-level tools for memory optimization (e.g., pd.to_numeric(..., downcast=...), DataFrame.convert_dtypes()), but these require manual orchestration and offer limited transparency.

DataFrame Memory Optimizer combines these tools into a single, opinionated workflow with clear diagnostics, making memory optimization easier, safer, and more reproducible. ## Contributors

Mohammed Ibrahim

Roganci Fontelera

Wai Yan Lee

William Chong

Copyright

Free software distributed under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Welcome to Our DataFrame Memory Optimizer

Project Summary

Example on how to use our package

Planned Functions

Installing the Package

Developer Documentation

How the Package Fits in the Python Ecosystem

Copyright

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Welcome to Our DataFrame Memory Optimizer

Project Summary

Example on how to use our package

Planned Functions

Installing the Package

Developer Documentation

How the Package Fits in the Python Ecosystem

Copyright