
Commit 81d1f68

Authored by: alancmliu, Purityj, ralahaaqil, Eggathin
Merge Dev Branch into Main (#132)
* Add codecov into CI and add badges to `README` (#112)
  * Add Codecov token and enable run on pull request to the dev branch
  * Update badges on README
* update readme - m4 (#114)
  * update readme: reorganize developer guide
  * reword comment
  * add dummy data for data_version_diff function
  * update functions docstring and README examples
* update data_version_diff function (#117)
  * fix flake8 issues
* add tutorial for csvplus package (#113)
  * docs: add tutorial for csvplus package
  * update _quarto.yml file
  * fix docstrings
  * rename tutorial.qmd to index.qmd
  * update environment.yml
* Add an example for the `load_optimized_csv` usage in the README.md (#120)
* Fix resolve_string_value example in README (#121)
* Fix test generate report (#123)
  * removed classes to align with rest of tests
* Update changelog (#119)
* Update changelog (#125)
  * Add two README peer review fixes details to CHANGELOG
  * Add Jupyter and nbformat to docs dependencies
  * Fix formatting issue in pyproject.toml
* added retrospective and next step section in CONTRIBUTING.md file (#127)
* Update changelog (#129)
* Add emails (#131)
  * Update changelog
* Updated README.md examples, installation instruction, and environment.yml dependency (#128)

Co-authored-by: Purity jangaya <[email protected]>
Co-authored-by: Ralah Aaqil <[email protected]>
Co-authored-by: Oswin <[email protected]>
1 parent a773ebf commit 81d1f68

File tree

14 files changed: +1073, −574 lines


.github/workflows/build.yml

Lines changed: 8 additions & 3 deletions
@@ -3,7 +3,7 @@ name: ci
 on:
   push:
   pull_request:
-    branches: [main]
+    branches: [main, dev]

   workflow_dispatch:

@@ -27,9 +27,14 @@ jobs:
       - name: Run tests with coverage
         run: |
-          pytest --cov --cov-report=term --cov-branch
+          pytest --cov --cov-branch --cov-report=xml

       - name: flake8 Lint
         uses: py-actions/flake8@v2
         with:
-          max-line-length: "100"
+          max-line-length: "100"
+
+      - name: Upload coverage reports to Codecov
+        uses: codecov/codecov-action@v5
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
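Read together, the two hunks leave the tail of the test job looking roughly like this (a sketch reconstructed from the `+` lines above; the exact indentation in the repository may differ). The switch to `--cov-report=xml` is what makes the upload work: the Codecov action consumes the generated `coverage.xml` file, whereas the old `--cov-report=term` output only went to the job log.

```yaml
      - name: Run tests with coverage
        run: |
          pytest --cov --cov-branch --cov-report=xml

      - name: flake8 Lint
        uses: py-actions/flake8@v2
        with:
          max-line-length: "100"

      # New in this commit: upload coverage.xml to Codecov
      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v5
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
```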

CHANGELOG.md

Lines changed: 73 additions & 4 deletions
@@ -5,10 +5,79 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [Unreleased]
+## [3.0.0] (Milestone 4) - 2026-02-02

-- Upcoming features and fixes
+### Fixed

-## [0.1.0] - (1979-01-01)
+- PR [#114](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/pull/114) to reorganize `README.md` for clarity and usability for both users and developers to address peer review Issue [#103](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/issues/103)
+- PR [#121](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/pull/121) to fix `resolve_string_value()` example in `README.md` to address peer review Issue [#100](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/issues/100)
+- Addressed inconsistencies in test_generate_report.py (#122)
+- PR [#131](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/pull/131) to add author emails to address peer review Issue [#130](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/issues/130)

-- First release
+### Added
+
+- Retrospective and next steps to CONTRIBUTING.md (#126)
+
+## [2.0.0] (Milestone 3) - 2026-01-25
+
+### Added
+
+- Additional unit tests for improved coverage (#95, #80)
+- Flake8 linter to workflow (#83)
+- Quartodoc YAML file for documentation (#91)
+- Additional data validation and unit tests (#76)
+- Deploy and build workflow files (#68, #71)
+
+### Changed
+
+- Bump package version from 0.1.2 to 0.2.2 (#97)
+- Updated README for milestone 3 (#94)
+- Updated dependencies and deleted commented out code (#62)
+- Installed necessary dev and test dependencies (#69)
+
+### Fixed
+
+- Linter issues (#98)
+- Action version and added skip-existing option (#96)
+- Style issues and flake8 compliance (#89, #86)
+- Docstring style errors (#73)
+- Pandas version to pass all unit tests (#76)
+
+## [1.0.0] (Milestone 2) - 2026-01-17
+
+### Added
+
+- Implemented `data_version_diff` function (#46)
+- Created tests for `data_version_diff` function (#53)
+- Improved test coverage for `data_version_diff` function (#56)
+- Implemented `generate_report` function (#51)
+- Implemented `load_optimized_csv` function with tests (#48)
+- Implemented `resolve_string_value` function with unit tests (#40)
+- Initial version of environment.yml (#32)
+
+### Changed
+
+- Updated README (#55)
+- Updated docstring and function specs (#37)
+- Renamed `data-correction.py` to `data_correction.py` and updated docstrings (#34)
+
+## [0.0.1] (Milestone 1) - 2026-01-10
+
+### Added
+
+- Initial commit with project setup
+- Function stub and docstring for `data_version_diff` (#17)
+- Created `generate-report.py` with docstring (#15)
+- Function definition and docstring for `load_optimized_csv` (#14)
+- Added `resolve_string_value` function in data-correction.py (#13)
+- Package details and contributors in README (#12)
+
+### Changed
+
+- Updated code of conduct to reflect group values (#11)
+- Edited CONTRIBUTING.md (#18)
+- Added raised errors to docstring (#16)
+
+### Fixed
+
+- Address inconsistencies in function names in README.md and data-correction.py (#21)

CONTRIBUTING.md

Lines changed: 15 additions & 0 deletions
@@ -117,3 +117,18 @@ When opening a Pull Request:
 - Technical decisions are discussed during team meetings or in GitHub Issues
 - If consensus cannot be reached, the team will vote
 - Blocking issues should be raised as early as possible to avoid deadline risk
+
+## Retrospective and Next Steps
+
+Our group used the development tools introduced in DSCI 524, including Python packaging, pytest, continuous integration
+and deployment, and publishing on PyPI. The flake8 linter was used to maintain code quality.
+
+We followed a GitHub flow workflow, where we listed Issues and created a branch for each issue.
+Each pull request addresses a specific issue and requires a review from at least one other group member before merging.
+
+GitHub was our main form of organization, with Issues used to communicate, report bugs, and keep track of progress.
+For timely responses, we also used Slack as a secondary means of communication.
+
+If we were to scale up our project, we would still use Git version control and CI/CD with trunk-based development.
+External software such as Jira could be used for task management and bug reporting.
+In general, the tools used in this course are well-suited for adaptation at a larger scale.

README.md

Lines changed: 83 additions & 49 deletions
@@ -1,9 +1,10 @@
 # csvplus

-| | |
-|--------|--------|
-| Package | [![Latest PyPI Version](https://img.shields.io/pypi/v/csvplus-1.svg)](https://pypi.org/project/csvplus-1/) [![Supported Python Versions](https://img.shields.io/pypi/pyversions/csvplus-1.svg)](https://pypi.org/project/csvplus-1/) |
-| Meta | [![Code of Conduct](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT.md) |
+| | |
+| ------- | ------- |
+| CI/CD | [![CI](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/actions/workflows/build.yml/badge.svg)](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/actions/workflows/build.yml) [![codecov](https://codecov.io/github/UBC-MDS/DSCI_524_group37_csvplus/graph/badge.svg?token=zmpNtn6nI6)](https://codecov.io/github/UBC-MDS/DSCI_524_group37_csvplus) |
+| Package | [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) |
+| Meta | [![Code of Conduct](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT.md) |

 > **Note**: PyPI badges are included for completeness but may not reflect a published package.
@@ -27,21 +28,20 @@ The package is intended to support:

 This package addresses common data preprocessing and exploration tasks through the following functions:

-|Function |Description |
-|--------|--------|
-|`load_optimized_csv`|Loads a CSV file and automatically downcasts data types to minimize memory footprint.|
-|`data_version_diff`|Compare two versions of a pandas DataFrame and return a structured summary of schema, row count, missing values, numeric statistics, and data type changes.|
-|`resolve_string_value`|Consolidating spelling variations of the same data value in a column.|
-|`summary_report`|Produce a list of descriptive statistics of the data and information about missing values.|
+| Function | Description |
+| ---------------------- | ----------- |
+| `load_optimized_csv` | Loads a CSV file and automatically downcasts data types to minimize memory footprint. |
+| `data_version_diff` | Compare two versions of a pandas DataFrame and return a structured summary of schema, row count, missing values, numeric statistics, and data type changes. |
+| `resolve_string_value` | Consolidating spelling variations of the same data value in a column. |
+| `summary_report` | Produce a list of descriptive statistics of the data and information about missing values. |

 Some functions operate on **CSV files**, while others work directly on **pandas DataFrames**, allowing users to integrate `csvplus` into existing pandas-based workflows.

 Our package fits into the Python preprocessing framework. Currently, the [`pandas`](https://pandas.pydata.org/) package provides basic functionality to read CSV and produce summary statistics, and the [`pyjanitor`](https://pyjanitor-devs.github.io/pyjanitor/) package provides functions for sanitizing the column names and converting column dtype.

 `csvplus` extends these tools with automated memory optimization, dataset version comparison and high-level summaries useful for auditing and exploratory analysis

-Full API reference and examples are available at: https://ubc-mds.github.io/DSCI_524_group37_csvplus/reference/
----
+## Full API reference and examples are available at: https://ubc-mds.github.io/DSCI_524_group37_csvplus/reference/

 ## Get started
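The function table above describes `data_version_diff` as a structured comparison of two dataset versions. As a rough illustration of that idea only — not the csvplus implementation, which operates on pandas DataFrames and also reports missing values, numeric statistics, and dtype changes — here is a hypothetical stdlib-only sketch (`diff_versions` and its dict-of-lists input are invented for this example):

```python
def diff_versions(old, new):
    """Compare two dict-of-lists 'tables' (column name -> list of values).

    Returns added/removed columns and the row-count change. Illustrative
    sketch only; NOT the csvplus `data_version_diff` implementation.
    """
    old_cols, new_cols = set(old), set(new)
    n_old = len(next(iter(old.values()), []))  # rows in the old version
    n_new = len(next(iter(new.values()), []))  # rows in the new version
    return {
        "added_columns": sorted(new_cols - old_cols),
        "removed_columns": sorted(old_cols - new_cols),
        "row_count_change": n_new - n_old,
    }


old = {"id": [1, 2, 3], "value": [10, 20, 30]}
new = {"id": [1, 2, 3, 4], "value": [10, 25, 30, 40], "category": ["A", "B", None, "C"]}
print(diff_versions(old, new))
# → {'added_columns': ['category'], 'removed_columns': [], 'row_count_change': 1}
```

The real function returns a richer structure, which `display_data_version_diff` then renders for human readers.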
@@ -67,12 +67,6 @@ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https:/

 > Note: Step 3 is only required on macOS due to a known rapidfuzz build issue. On Linux or Windows, pip will install dependencies automatically.

-Or install from PyPI (once published)
-
-```bash
-pip install csvplus
-```
-
 ## Usage Examples

 ```python
@@ -81,78 +75,118 @@ from csvplus.data_version_diff import data_version_diff, display_data_version_di
 from csvplus.load_optimized_csv import load_optimized_csv
 from csvplus.data_correction import resolve_string_value
 from csvplus.generate_report import summary_report
+import tempfile
+import os

 # --- compare two DataFrame versions ---
-df_old = pd.DataFrame({"id": [1,2,3], "value": [10,20,30]})
-df_new = pd.DataFrame({"id": [1,2,3,4], "value": [10,25,30,40], "category": ["A","B",None,"C"], "amount": [100,200,300,400]})
-
-diff = data_version_diff(df_old, df_new)
-display_data_version_diff(diff)
+# Original dataset
+df_v1 = pd.DataFrame({
+    "id": [1, 2, 3],
+    "value": [10, 20, 30],
+    "status": [1, 0, 1]
+})
+
+# Updated dataset
+df_v2 = pd.DataFrame({
+    "id": [1, 2, 3, 4],
+    "value": ["10", "25", "30", "40"],
+    "category": ["A", "B", None, "C"],
+    "amount": [100, 200, 300, 400]
+})
+
+diff = data_version_diff(df_v1, df_v2)
+display_data_version_diff(diff)  # prints a human-readable summary of the comparison

 # --- resolve string value --
 df1 = pd.DataFrame({ "company": ["Google", "Gooogle", "Gogle", "Microsoft", "Microsof"]})
-resolve_string_value(df1, column="company", canonical_values=["Google", "Microsoft"],threshold=80)
-print(df)
-
-# --- load a CSV file with optimized memory usage ---
-df = load_optimized_csv("large_dataset.csv")
-print(df1.dtypes)
+resolve_string_value(df1, "company", ["Google", "Microsoft"], 80)
+print(df1)

 # --- Generate summary statistics ---
+df = pd.DataFrame({
+    'age': [25, 21, 32, None, 40],
+    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA']
+})
 numeric_stats, categorical_stats = summary_report(df)
 print(numeric_stats.head())
 print(categorical_stats.head())
+
+# --- load a CSV file with optimized memory usage ---
+sample_data = pd.DataFrame({
+    "int8_col": [1, 2, 100, -100, 5],
+    "int16_col": [1000, -1000, 30000, -30000, 500],
+    "float_col": [1.123, 2.234, 3.345, 4.456, 5.567],
+    "sparse_col": [0, 0, 0, 0, 1],  # 80% zeros -> will be sparse
+    "category_col": ["A", "A", "B", "B", "C"]  # low cardinality -> categorical
+})
+
+with tempfile.TemporaryDirectory() as tmp_dir:
+    csv_path = os.path.join(tmp_dir, "sample.csv")
+    sample_data.to_csv(csv_path, index=False)
+
+    df_optimized = load_optimized_csv(csv_path)
+    print("Optimized dtypes:")
+    print(df_optimized.dtypes)
+    # int8_col -> int8 (downcasted)
+    # int16_col -> int16 (downcasted)
+    # float_col -> float32 (downcasted)
+    # sparse_col -> Sparse[int8, 0] (sparse conversion)
+    # category_col -> category (categorical conversion)
 ```

 ## Developers

 ### Development Setup

-Create conda environment and clone the repo.
+Clone the repo, create conda environment and register the csvplus environment as a Jupyter kernel.

 ```bash
+git clone https://github.com/UBC-MDS/DSCI_524_group37_csvplus
+cd DSCI_524_group37_csvplus
+
 conda env create -f environment.yml
 conda activate csvplus

-git clone https://github.com/UBC-MDS/DSCI_524_group37_csvplus
-cd DSCI_524_group37_csvplus
+# Optional: register the environment as a Jupyter/Quarto kernel
+# (required only if kernel is not registered correctly)
+python -m ipykernel install --user --name csvplus --display-name "csvplus"
+
 ```

-### Run Tests and Coverage
+### Install csvplus package (editable mode)

-All tests are written using `pytest`. To run the full test suite and generate a coverage report execute:
+This allows you to edit the source code locally while using the package.

 ```bash
-# install coverage tools if not yet installed
-pip install pytest pytest-cov
-
-pytest --cov=csvplus --cov-report=term-missing
+pip install -e ".[docs]"
 ```

-### Install csvplus package (editable mode)
+### Run Tests and Coverage

-This allows you to edit the source code locally while using the package.
+All tests are written using `pytest`. To run the full test suite and generate a coverage report execute:

 ```bash
-pip install -e .
+pytest --cov=csvplus --cov-report=term-missing
 ```

 ### Build and Preview Documentation

 ```bash
 quartodoc build
-quarto preview
 quarto render
+quarto preview
 ```

-## Contributors
+### Deploy Documentation (automated)
+Documentation is deployed automatically by the `build-docs` job in `.github/workflows/docs-publish.yml` on a pull request (PR) aimed at the main branch.

-- Alan Liu
-- Oswin Gan
-- Purity Jangaya
-- Ralah Aaqil
+## Contributors

-## License
+- Alan Liu ([email protected])
+- Oswin Gan ([email protected])
+- Purity Jangaya ([email protected])
+- Ralah Aaqil ([email protected])

-- Copyright © 2026
+## Copyright
+- Copyright © 2026 Alan Liu, Oswin Gan, Purity Jangaya, Ralah Aaqil
 - Free software distributed under the [MIT License](./LICENSE).
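The new `resolve_string_value` usage in the README snaps spelling variants to canonical names via fuzzy matching (the package depends on rapidfuzz). As a rough stdlib-only illustration of the same idea — using `difflib`, whose similarity scores differ from rapidfuzz's, and with `snap_to_canonical` as a hypothetical name rather than the package API:

```python
import difflib

def snap_to_canonical(values, canonical, cutoff=0.8):
    """Map each string to its closest canonical value when the difflib
    similarity ratio meets the cutoff; otherwise leave it unchanged.
    Illustrative sketch only, not the csvplus implementation."""
    resolved = []
    for v in values:
        match = difflib.get_close_matches(v, canonical, n=1, cutoff=cutoff)
        resolved.append(match[0] if match else v)
    return resolved

companies = ["Google", "Gooogle", "Gogle", "Microsoft", "Microsof"]
print(snap_to_canonical(companies, ["Google", "Microsoft"]))
# → ['Google', 'Google', 'Google', 'Microsoft', 'Microsoft']
```

The package's `threshold=80` plays the same role as the `cutoff` here, just on rapidfuzz's 0–100 scale instead of difflib's 0–1 ratio.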

_quarto.yml

Lines changed: 17 additions & 2 deletions
@@ -7,10 +7,14 @@ website:
     background: primary
     search: true
     left:
-      - text: "Home"
+      # - text: "Home"
+      #   file: index.qmd
+      - text: "Tutorial"
         file: index.qmd
       - text: "Reference"
         file: reference/index.qmd
+      - text: "Contributing"
+        file: CONTRIBUTING.md

 # tell quarto to read the generated sidebar
 metadata-files:
@@ -31,6 +35,17 @@ quartodoc:
   css: reference/_styles-quartodoc.css

   sections:
+    - title: Overview
+      desc: |
+        `csvplus` is a lightweight Python package that provides **practical utilities for loading,
+        comparing, cleaning, and summarizing tabular data**. While some functions operate directly on
+        CSV files, others work with **pandas DataFrames**, making `csvplus` easy to integrate into
+        existing data analysis workflows.
+
+        It is designed for data scientists, analysts, and students who work with evolving datasets
+        and want **clear, interpretable insights** into data structure, quality, and change over time.
+      contents: []
+
     - title: Data Loading
       desc: Function for loading a CSV file and return a memory-optimized DataFrame.
       contents:
@@ -40,7 +55,7 @@ quartodoc:
       desc: Function for summarizing structural and statistical differences between two DataFrame versions.
       contents:
       - data_version_diff
-      - data_version_diff.display_data_version_diff
+      # - data_version_diff.display_data_version_diff

     - title: Data Cleaning
       desc: Function for resolving inconsistent string values to standardized names using fuzzy matching.
