
Commit 81d1f68

Authored by: alancmliu, Purityj, ralahaaqil, Eggathin
Merge Dev Branch into Main (#132)
* Add codecov into CI and add badges to `README` (#112)
  * Add Codecov token and enable run on pull request to the dev branch
  * Update badges on README
* update readme - m4 (#114)
  * update readme: reorganize developer guide
  * reword comment
  * add dummy data for data_version_diff function
  * update functions docstring and README examples
* update data_version_diff function (#117)
  * fix flake8 issues
* add tutorial for csvplus package (#113)
  * docs: add tutorial for csvplus package
  * update _quarto.yml file
  * fix docstrings
  * rename tutorial.qmd to index.qmd
  * update environment.yml
* Add an example for the `load_optimized_csv` usage in the README.md (#120)
* Fix resolve_string_value example in README (#121)
* Fix test generate report (#123)
  * removed classes to align with rest of tests
* Update changelog (#119)
* Update changelog (#125)
  * Add two README peer review fixes details to CHANGELOG
  * Add Jupyter and nbformat to docs dependencies
  * Fix formatting issue in pyproject.toml
* added retrospective and next step section in CONTRIBUTING.md file (#127)
* Update changelog (#129)
* Add emails (#131)
  * Update changelog
* Updated README.md examples, installation instruction, and environment.yml dependency (#128)

Co-authored-by: Purity jangaya <[email protected]>
Co-authored-by: Ralah Aaqil <[email protected]>
Co-authored-by: Oswin <[email protected]>
1 parent a773ebf commit 81d1f68

File tree

14 files changed: +1073, −574 lines


.github/workflows/build.yml

Lines changed: 8 additions & 3 deletions
@@ -3,7 +3,7 @@ name: ci
 on:
   push:
   pull_request:
-    branches: [main]
+    branches: [main, dev]

   workflow_dispatch:

@@ -27,9 +27,14 @@ jobs:
       - name: Run tests with coverage
         run: |
-          pytest --cov --cov-report=term --cov-branch
+          pytest --cov --cov-branch --cov-report=xml

       - name: flake8 Lint
         uses: py-actions/flake8@v2
         with:
-          max-line-length: "100"
+          max-line-length: "100"
+
+      - name: Upload coverage reports to Codecov
+        uses: codecov/codecov-action@v5
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
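Read together, the two hunks leave the tail of the test job looking roughly like this (a sketch reconstructed from the `+` lines above; the exact indentation in the repository may differ). The switch to `--cov-report=xml` is what makes the upload work: the Codecov action consumes the generated `coverage.xml` file, whereas the old `--cov-report=term` output only went to the job log.

```yaml
      - name: Run tests with coverage
        run: |
          pytest --cov --cov-branch --cov-report=xml

      - name: flake8 Lint
        uses: py-actions/flake8@v2
        with:
          max-line-length: "100"

      # New in this commit: upload coverage.xml to Codecov
      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v5
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
```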

CHANGELOG.md

Lines changed: 73 additions & 4 deletions
@@ -5,10 +5,79 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [Unreleased]
+## [3.0.0] (Milestone 4) - 2026-02-02

-- Upcoming features and fixes
+### Fixed

-## [0.1.0] - (1979-01-01)
+- PR [#114](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/pull/114) to reorganize `README.md` for clarity and usability for both users and developers to address peer review Issue [#103](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/issues/103)
+- PR [#121](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/pull/121) to fix `resolve_string_value()` example in `README.md` to address peer review Issue [#100](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/issues/100)
+- Addressed inconsistencies in test_generate_report.py (#122)
+- PR [#131](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/pull/131) to add author emails to address peer review Issue [#130](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/issues/130)

-- First release
+### Added
+
+- Retrospective and next steps to CONTRIBUTING.md (#126)
+
+## [2.0.0] (Milestone 3) - 2026-01-25
+
+### Added
+
+- Additional unit tests for improved coverage (#95, #80)
+- Flake8 linter to workflow (#83)
+- Quartodoc YAML file for documentation (#91)
+- Additional data validation and unit tests (#76)
+- Deploy and build workflow files (#68, #71)
+
+### Changed
+
+- Bump package version from 0.1.2 to 0.2.2 (#97)
+- Updated README for milestone 3 (#94)
+- Updated dependencies and deleted commented out code (#62)
+- Installed necessary dev and test dependencies (#69)
+
+### Fixed
+
+- Linter issues (#98)
+- Action version and added skip-existing option (#96)
+- Style issues and flake8 compliance (#89, #86)
+- Docstring style errors (#73)
+- Pandas version to pass all unit tests (#76)
+
+## [1.0.0] (Milestone 2) - 2026-01-17
+
+### Added
+
+- Implemented `data_version_diff` function (#46)
+- Created tests for `data_version_diff` function (#53)
+- Improved test coverage for `data_version_diff` function (#56)
+- Implemented `generate_report` function (#51)
+- Implemented `load_optimized_csv` function with tests (#48)
+- Implemented `resolve_string_value` function with unit tests (#40)
+- Initial version of environment.yml (#32)
+
+### Changed
+
+- Updated README (#55)
+- Updated docstring and function specs (#37)
+- Renamed `data-correction.py` to `data_correction.py` and updated docstrings (#34)
+
+## [0.0.1] (Milestone 1) - 2026-01-10
+
+### Added
+
+- Initial commit with project setup
+- Function stub and docstring for `data_version_diff` (#17)
+- Created `generate-report.py` with docstring (#15)
+- Function definition and docstring for `load_optimized_csv` (#14)
+- Added `resolve_string_value` function in data-correction.py (#13)
+- Package details and contributors in README (#12)
+
+### Changed
+
+- Updated code of conduct to reflect group values (#11)
+- Edited CONTRIBUTING.md (#18)
+- Added raised errors to docstring (#16)
+
+### Fixed
+
+- Address inconsistencies in function names in README.md and data-correction.py (#21)

CONTRIBUTING.md

Lines changed: 15 additions & 0 deletions
@@ -117,3 +117,18 @@ When opening a Pull Request:
 - Technical decisions are discussed during team meetings or in GitHub Issues
 - If consensus cannot be reached, the team will vote
 - Blocking issues should be raised as early as possible to avoid deadline risk
+
+## Retrospective and Next Steps
+
+Our group used the development tools introduced in DSCI 524, including Python packaging, pytest, continuous integration
+and deployment, and publishing on PyPI. The flake8 linter was used to maintain code quality.
+
+We followed a GitHub flow workflow, where we listed Issues and created a branch for each issue.
+Each pull request addresses a specific issue and requires a review from at least one other group member before merging.
+
+GitHub was our main form of organization, with Issues used to communicate, report bugs, and keep track of progress.
+For timely responses, we also used Slack as a secondary means of communication.
+
+If we were to scale up our project, we would still use Git version control and CI/CD with trunk-based development.
+External software such as Jira could be used for task management and bug reporting.
+In general, the tools used in this course are well-suited for adaptation at a larger scale.

README.md

Lines changed: 83 additions & 49 deletions
@@ -1,9 +1,10 @@
 # csvplus

-| | |
-|--------|--------|
-| Package | [![Latest PyPI Version](https://img.shields.io/pypi/v/csvplus-1.svg)](https://pypi.org/project/csvplus-1/) [![Supported Python Versions](https://img.shields.io/pypi/pyversions/csvplus-1.svg)](https://pypi.org/project/csvplus-1/) |
-| Meta | [![Code of Conduct](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT.md) |
+| | |
+| ------- | ------- |
+| CI/CD | [![CI](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/actions/workflows/build.yml/badge.svg)](https://github.com/UBC-MDS/DSCI_524_group37_csvplus/actions/workflows/build.yml) [![codecov](https://codecov.io/github/UBC-MDS/DSCI_524_group37_csvplus/graph/badge.svg?token=zmpNtn6nI6)](https://codecov.io/github/UBC-MDS/DSCI_524_group37_csvplus) |
+| Package | [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) |
+| Meta | [![Code of Conduct](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT.md) |

 > **Note**: PyPI badges are included for completeness but may not reflect a published package.
@@ -27,21 +28,20 @@ The package is intended to support:

 This package addresses common data preprocessing and exploration tasks through the following functions:

-|Function |Description |
-|--------|--------|
-|`load_optimized_csv`|Loads a CSV file and automatically downcasts data types to minimize memory footprint.|
-|`data_version_diff`|Compare two versions of a pandas DataFrame and return a structured summary of schema, row count, missing values, numeric statistics, and data type changes.|
-|`resolve_string_value`|Consolidating spelling variations of the same data value in a column.|
-|`summary_report`|Produce a list of descriptive statistics of the data and information about missing values.|
+| Function | Description |
+| ---------------------- | ----------- |
+| `load_optimized_csv` | Loads a CSV file and automatically downcasts data types to minimize memory footprint. |
+| `data_version_diff` | Compare two versions of a pandas DataFrame and return a structured summary of schema, row count, missing values, numeric statistics, and data type changes. |
+| `resolve_string_value` | Consolidating spelling variations of the same data value in a column. |
+| `summary_report` | Produce a list of descriptive statistics of the data and information about missing values. |

 Some functions operate on **CSV files**, while others work directly on **pandas DataFrames**, allowing users to integrate `csvplus` into existing pandas-based workflows.

 Our package fits into the Python preprocessing framework. Currently, the [`pandas`](https://pandas.pydata.org/) package provides basic functionality to read CSV and produce summary statistics, and the [`pyjanitor`](https://pyjanitor-devs.github.io/pyjanitor/) package provides functions for sanitizing the column names and converting column dtype.

 `csvplus` extends these tools with automated memory optimization, dataset version comparison and high-level summaries useful for auditing and exploratory analysis

-Full API reference and examples are available at: https://ubc-mds.github.io/DSCI_524_group37_csvplus/reference/
----
+## Full API reference and examples are available at: https://ubc-mds.github.io/DSCI_524_group37_csvplus/reference/

 ## Get started
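The function table above describes `data_version_diff` as a structured comparison of two dataset versions. As a rough illustration of that idea only — not the csvplus implementation, which operates on pandas DataFrames and also reports missing values, numeric statistics, and dtype changes — here is a hypothetical stdlib-only sketch (`diff_versions` and its dict-of-lists input are invented for this example):

```python
def diff_versions(old, new):
    """Compare two dict-of-lists 'tables' (column name -> list of values).

    Returns added/removed columns and the row-count change. Illustrative
    sketch only; NOT the csvplus `data_version_diff` implementation.
    """
    old_cols, new_cols = set(old), set(new)
    n_old = len(next(iter(old.values()), []))  # rows in the old version
    n_new = len(next(iter(new.values()), []))  # rows in the new version
    return {
        "added_columns": sorted(new_cols - old_cols),
        "removed_columns": sorted(old_cols - new_cols),
        "row_count_change": n_new - n_old,
    }


old = {"id": [1, 2, 3], "value": [10, 20, 30]}
new = {"id": [1, 2, 3, 4], "value": [10, 25, 30, 40], "category": ["A", "B", None, "C"]}
print(diff_versions(old, new))
# → {'added_columns': ['category'], 'removed_columns': [], 'row_count_change': 1}
```

The real function returns a richer structure, which `display_data_version_diff` then renders for human readers.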
@@ -67,12 +67,6 @@ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https:/

 > Note: Step 3 is only required on macOS due to a known rapidfuzz build issue. On Linux or Windows, pip will install dependencies automatically.

-Or install from PyPI (once published)
-
-```bash
-pip install csvplus
-```
-
 ## Usage Examples

 ```python
@@ -81,78 +75,118 @@ from csvplus.data_version_diff import data_version_diff, display_data_version_di
 from csvplus.load_optimized_csv import load_optimized_csv
 from csvplus.data_correction import resolve_string_value
 from csvplus.generate_report import summary_report
+import tempfile
+import os

 # --- compare two DataFrame versions ---
-df_old = pd.DataFrame({"id": [1,2,3], "value": [10,20,30]})
-df_new = pd.DataFrame({"id": [1,2,3,4], "value": [10,25,30,40], "category": ["A","B",None,"C"], "amount": [100,200,300,400]})
-
-diff = data_version_diff(df_old, df_new)
-display_data_version_diff(diff)
+# Original dataset
+df_v1 = pd.DataFrame({
+    "id": [1, 2, 3],
+    "value": [10, 20, 30],
+    "status": [1, 0, 1]
+})
+
+# Updated dataset
+df_v2 = pd.DataFrame({
+    "id": [1, 2, 3, 4],
+    "value": ["10", "25", "30", "40"],
+    "category": ["A", "B", None, "C"],
+    "amount": [100, 200, 300, 400]
+})
+
+diff = data_version_diff(df_v1, df_v2)
+display_data_version_diff(diff)  # prints a human-readable summary of the comparison

 # --- resolve string value --
 df1 = pd.DataFrame({ "company": ["Google", "Gooogle", "Gogle", "Microsoft", "Microsof"]})
-resolve_string_value(df1, column="company", canonical_values=["Google", "Microsoft"],threshold=80)
-print(df)
-
-# --- load a CSV file with optimized memory usage ---
-df = load_optimized_csv("large_dataset.csv")
-print(df1.dtypes)
+resolve_string_value(df1, "company", ["Google", "Microsoft"], 80)
+print(df1)

 # --- Generate summary statistics ---
+df = pd.DataFrame({
+    'age': [25, 21, 32, None, 40],
+    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA']
+})
 numeric_stats, categorical_stats = summary_report(df)
 print(numeric_stats.head())
 print(categorical_stats.head())
+
+# --- load a CSV file with optimized memory usage ---
+sample_data = pd.DataFrame({
+    "int8_col": [1, 2, 100, -100, 5],
+    "int16_col": [1000, -1000, 30000, -30000, 500],
+    "float_col": [1.123, 2.234, 3.345, 4.456, 5.567],
+    "sparse_col": [0, 0, 0, 0, 1],  # 80% zeros -> will be sparse
+    "category_col": ["A", "A", "B", "B", "C"]  # low cardinality -> categorical
+})
+
+with tempfile.TemporaryDirectory() as tmp_dir:
+    csv_path = os.path.join(tmp_dir, "sample.csv")
+    sample_data.to_csv(csv_path, index=False)
+
+    df_optimized = load_optimized_csv(csv_path)
+    print("Optimized dtypes:")
+    print(df_optimized.dtypes)
+    # int8_col -> int8 (downcasted)
+    # int16_col -> int16 (downcasted)
+    # float_col -> float32 (downcasted)
+    # sparse_col -> Sparse[int8, 0] (sparse conversion)
+    # category_col -> category (categorical conversion)
 ```

 ## Developers

 ### Development Setup

-Create conda environment and clone the repo.
+Clone the repo, create conda environment and register the csvplus environment as a Jupyter kernel.

 ```bash
+git clone https://github.com/UBC-MDS/DSCI_524_group37_csvplus
+cd DSCI_524_group37_csvplus
+
 conda env create -f environment.yml
 conda activate csvplus

-git clone https://github.com/UBC-MDS/DSCI_524_group37_csvplus
-cd DSCI_524_group37_csvplus
+# Optional: register the environment as a Jupyter/Quarto kernel
+# (required only if kernel is not registered correctly)
+python -m ipykernel install --user --name csvplus --display-name "csvplus"
+
 ```

-### Run Tests and Coverage
+### Install csvplus package (editable mode)

-All tests are written using `pytest`. To run the full test suite and generate a coverage report execute:
+This allows you to edit the source code locally while using the package.

 ```bash
-# install coverage tools if not yet installed
-pip install pytest pytest-cov
-
-pytest --cov=csvplus --cov-report=term-missing
+pip install -e ".[docs]"
 ```

-### Install csvplus package (editable mode)
+### Run Tests and Coverage

-This allows you to edit the source code locally while using the package.
+All tests are written using `pytest`. To run the full test suite and generate a coverage report execute:

 ```bash
-pip install -e .
+pytest --cov=csvplus --cov-report=term-missing
 ```

 ### Build and Preview Documentation

 ```bash
 quartodoc build
-quarto preview
 quarto render
+quarto preview
 ```

-## Contributors
+### Deploy Documentation (automated)
+Documentation is deployed automatically by the `build-docs` job in `.github/workflows/docs-publish.yml` on a pull request (PR) aimed at the main branch.

-- Alan Liu
-- Oswin Gan
-- Purity Jangaya
-- Ralah Aaqil
+## Contributors

-## License
+- Alan Liu ([email protected])
+- Oswin Gan ([email protected])
+- Purity Jangaya ([email protected])
+- Ralah Aaqil ([email protected])

-- Copyright © 2026
+## Copyright
+- Copyright © 2026 Alan Liu, Oswin Gan, Purity Jangaya, Ralah Aaqil
 - Free software distributed under the [MIT License](./LICENSE).
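The new `resolve_string_value` usage in the README snaps spelling variants to canonical names via fuzzy matching (the package depends on rapidfuzz). As a rough stdlib-only illustration of the same idea — using `difflib`, whose similarity scores differ from rapidfuzz's, and with `snap_to_canonical` as a hypothetical name rather than the package API:

```python
import difflib

def snap_to_canonical(values, canonical, cutoff=0.8):
    """Map each string to its closest canonical value when the difflib
    similarity ratio meets the cutoff; otherwise leave it unchanged.
    Illustrative sketch only, not the csvplus implementation."""
    resolved = []
    for v in values:
        match = difflib.get_close_matches(v, canonical, n=1, cutoff=cutoff)
        resolved.append(match[0] if match else v)
    return resolved

companies = ["Google", "Gooogle", "Gogle", "Microsoft", "Microsof"]
print(snap_to_canonical(companies, ["Google", "Microsoft"]))
# → ['Google', 'Google', 'Google', 'Microsoft', 'Microsoft']
```

The package's `threshold=80` plays the same role as the `cutoff` here, just on rapidfuzz's 0–100 scale instead of difflib's 0–1 ratio.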

_quarto.yml

Lines changed: 17 additions & 2 deletions
@@ -7,10 +7,14 @@ website:
     background: primary
     search: true
     left:
-      - text: "Home"
+      # - text: "Home"
+      #   file: index.qmd
+      - text: "Tutorial"
         file: index.qmd
       - text: "Reference"
         file: reference/index.qmd
+      - text: "Contributing"
+        file: CONTRIBUTING.md

 # tell quarto to read the generated sidebar
 metadata-files:
@@ -31,6 +35,17 @@ quartodoc:
   css: reference/_styles-quartodoc.css

   sections:
+    - title: Overview
+      desc: |
+        `csvplus` is a lightweight Python package that provides **practical utilities for loading,
+        comparing, cleaning, and summarizing tabular data**. While some functions operate directly on
+        CSV files, others work with **pandas DataFrames**, making `csvplus` easy to integrate into
+        existing data analysis workflows.
+
+        It is designed for data scientists, analysts, and students who work with evolving datasets
+        and want **clear, interpretable insights** into data structure, quality, and change over time.
+      contents: []
+
     - title: Data Loading
       desc: Function for loading a CSV file and return a memory-optimized DataFrame.
       contents:
@@ -40,7 +55,7 @@ quartodoc:
       desc: Function for summarizing structural and statistical differences between two DataFrame versions.
       contents:
       - data_version_diff
-      - data_version_diff.display_data_version_diff
+      # - data_version_diff.display_data_version_diff

     - title: Data Cleaning
       desc: Function for resolving inconsistent string values to standardized names using fuzzy matching.
