Update Docs

Petrinax · Petrinax · commit b23df203dbbe · 2025-07-19T23:43:59.000+05:30
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -0,0 +1,93 @@
+# System Architecture: Expense Classifier
+
+## Overview
+
+The Expense Classifier is a modular, extensible ETL pipeline for financial transaction data. It supports both CLI and programmatic use and is designed for robustness, scalability, and ease of extension.
+
+---
+
+## Architecture Diagram
+
+```mermaid
+flowchart TD
+    A["User Input (CLI/Script)"] --> B["Ingestor"]
+    B --> C["Transformer"]
+    C --> D["Paytm Lookup (optional)"]
+    D --> E["Classifier"]
+    E --> F["File Correction / Manual Correction"]
+    F --> G["Database (SQLAlchemy)"]
+    F --> H["Reporting/Export"]
+    G --> H
+    H --> I["Final Output (CSV/DB/Report)"]
+    style A fill:#f9f,stroke:#333,stroke-width:2px
+    style I fill:#bbf,stroke:#333,stroke-width:2px
+```
+
+---
+
+## Component Breakdown
+
+### 1. Ingestor
+- Loads raw bank/Paytm files (CSV/XLSX)
+- Standardizes columns and cleans data
+- Handles messy, real-world input formats
+
+### 2. Transformer
+- Extracts transaction details (mode, payee, UPI ID, etc.)
+- Adds derived columns (e.g., group, account, fiscal period)
+- Normalizes and enriches data for downstream processing
+
+### 3. Paytm Lookup (Optional)
+- Matches uncategorized transactions with Paytm UPI data
+- Enriches records for improved classification
+- Supports both file and DB sources
+
+### 4. Classifier
+- Assigns categories using a rule-based engine (keyword-driven)
+- Can be extended with AI/ML-based classification (planned)
+- Handles both expense and income categories
+
+### 5. File/Manual Correction
+- Exports uncategorized transactions for user review
+- Allows users to add new keywords/categories
+- Updates mappings for future automation
+
+### 6. Database (SQLAlchemy)
+- Stores all pipeline stages and category mappings
+- Enables persistent analytics and reporting
+- Uses Alembic migrations for schema evolution
+
+### 7. Reporting/Export
+- Outputs clean, categorized data to CSV and/or database
+- Provides hooks for dashboards and BI tools (planned)
+
+---
+
+## Data Flow & Extensibility
+
+- **Pipeline Orchestration:** Each stage is a class with a clear interface, enabling easy extension or replacement.
+- **Configurable:** Supports custom columns, banks, and enrichment steps via parameters.
+- **Extensible:** Add new data sources, transformation logic, or classification methods with minimal code changes.
+- **Persistence:** All intermediate and final data can be stored for auditability and reproducibility.
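+
+As a rough illustration of the "each stage is a class with a clear interface" idea, here is a minimal sketch. The `Stage` protocol, the `run` method, and both example stages are hypothetical names chosen for this example; the project's real classes may expose a different interface.
+
+```python
+from typing import Any, Protocol
+
+
+class Stage(Protocol):
+    """Hypothetical stage interface: take records in, return records out."""
+
+    def run(self, records: list[dict[str, Any]]) -> list[dict[str, Any]]: ...
+
+
+class LowercaseDescriptions:
+    """Example stage: normalizes the description field."""
+
+    def run(self, records):
+        return [{**r, "description": r["description"].lower()} for r in records]
+
+
+class TagGroup:
+    """Example stage: derives Income/Expense from the sign of the amount."""
+
+    def run(self, records):
+        return [
+            {**r, "group": "Income" if r["amount"] > 0 else "Expense"}
+            for r in records
+        ]
+
+
+def run_pipeline(records, stages: list[Stage]):
+    # Each stage only sees the previous stage's output, so stages can be
+    # added, removed, or swapped without touching the others.
+    for stage in stages:
+        records = stage.run(records)
+    return records
+
+
+rows = [{"description": "UPI/Zomato", "amount": -250.0}]
+result = run_pipeline(rows, [LowercaseDescriptions(), TagGroup()])
+```
+
+Replacing a stage then amounts to passing a different object with the same `run` signature.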
+
+---
+
+## Engineering Highlights
+
+- **Separation of Concerns:** Each module handles a single responsibility.
+- **Testability:** Modular design enables unit and integration testing.
+- **Performance:** Vectorized operations and batch DB writes.
+- **User-Centric:** Interactive correction and easy customization.
+- **Scalability:** Designed to handle large datasets and evolving requirements.
+
+---
+
+## Scalability, Maintainability, Extensibility
+
+- **Scalability:** Efficient pandas operations, batch processing, and DB integration support large data volumes.
+- **Maintainability:** Clear module boundaries, docstrings, and type hints.
+- **Extensibility:** Plug-and-play pipeline stages, easy to add new banks, categories, or enrichment logic.
+
+---
+
+For more, see the [main README](README.md) or [DATA_ENGINEERING.md](DATA_ENGINEERING.md).
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,62 @@
+# Contributing to Expense Classifier
+
+Thank you for your interest in contributing! Contributions that make this project better for everyone are welcome.
+
+---
+
+## Getting Started
+
+1. **Fork the repository** and clone your fork.
+2. **Set up a virtual environment:**
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   pip install -r requirements.txt
+   ```
+3. **Install in editable mode:**
+   ```bash
+   pip install -e .
+   ```
+4. **Create a new branch** for your feature or bugfix.
+
+---
+
+## Code Style & Best Practices
+
+- Follow [PEP 8](https://peps.python.org/pep-0008/) for Python code.
+- Write clear docstrings and comments.
+- Keep functions and classes small and focused.
+- Use type hints where possible.
+- Modularize code for testability and reuse.
+
+---
+
+## Submitting Issues & Pull Requests
+
+- **Issues:** Please use GitHub Issues for bugs, feature requests, or questions.
+- **Pull Requests:**
+  - Reference the issue you are addressing (if any).
+  - Describe your changes clearly.
+  - Ensure your code runs and passes all checks.
+  - Add or modify tests if relevant (test suite in progress).
+
+---
+
+## Testing (Planned)
+
+- Automated tests will use `pytest`.
+- Please include tests for new features or bugfixes when the suite is available.
+
+---
+
+## Code of Conduct
+
+- Be respectful and inclusive.
+- Provide constructive feedback.
+- Help others learn and grow.
+
+---
+
+## Contact
+
+For questions or suggestions, open an issue or contact [Piyush Upreti](mailto:piyushupreti@gmail.com).
diff --git a/DATA_ENGINEERING.md b/DATA_ENGINEERING.md
@@ -0,0 +1,108 @@
+# Data Engineering Deep Dive: Expense Classifier
+
+## ETL Pipeline Overview
+
+The pipeline is designed as a modular, extensible sequence of stages:
+
+1. **Ingestion**: Load and standardize raw bank/Paytm files (CSV/XLSX)
+2. **Transformation**: Extract and normalize transaction details
+3. **Enrichment**: Paytm lookup before classification; manual/file correction of records that remain uncategorized afterwards
+4. **Classification**: Assign categories using rule-based logic
+5. **Publishing**: Export to CSV and/or database
+
+---
+
+## 1. Ingestion
+
+- **File Support:** CSV, XLSX (bank statements, Paytm UPI)
+- **Standardization:** Renames columns, parses dates, cleans numeric fields
+- **Validation:** Drops empty rows, handles invalid/missing data
+
+```python
+from expense_classifier.ingestor import Ingestor
+
+df = Ingestor('my_statement.csv').get_data()
+```
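+
+The standardization and validation steps listed above could look roughly like the following pandas sketch. It is not the `Ingestor` implementation: the `COLUMN_MAP` headers, the day-first date format, and the `standardize` helper are assumptions made for illustration.
+
+```python
+import io
+
+import pandas as pd
+
+# Hypothetical mapping from one bank's headers to standard column names.
+COLUMN_MAP = {"Txn Date": "date", "Description": "description",
+              "Debit": "debit", "Credit": "credit"}
+
+
+def standardize(raw: pd.DataFrame) -> pd.DataFrame:
+    # Rename columns and drop rows where every field is empty.
+    df = raw.rename(columns=COLUMN_MAP).dropna(how="all").copy()
+    # Unparseable dates become NaT instead of raising.
+    df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")
+    for col in ("debit", "credit"):
+        # Strip thousands separators, then coerce blanks and junk to 0.0.
+        cleaned = df[col].astype(str).str.replace(",", "", regex=False)
+        df[col] = pd.to_numeric(cleaned, errors="coerce").fillna(0.0)
+    # Rows without a usable date are dropped.
+    return df.dropna(subset=["date"]).reset_index(drop=True)
+
+
+sample = io.StringIO(
+    'Txn Date,Description,Debit,Credit\n'
+    '01/04/2025,UPI/Zomato,"1,250.50",\n'
+    ',,,\n'
+    'not a date,Broken row,10,\n'
+    '02/04/2025,Salary,,"50,000"\n'
+)
+clean = standardize(pd.read_csv(sample, dtype=str))
+```
+
+Reading everything as strings first (`dtype=str`) keeps the cleaning rules explicit instead of relying on type inference over messy input.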
+
+---
+
+## 2. Transformation
+
+- **Feature Extraction:** Payment mode, payee, UPI ID, etc.
+- **Derived Columns:** Group (Income/Expense), Account, Fiscal Period
+- **Normalization:** Lowercases descriptions, standardizes formats
+
+```python
+from expense_classifier.transformer import Transformer
+
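+# Example values; `df` is the DataFrame returned by the Ingestor in step 1.
+bank, account_name = "SBI", "Savings Account"
+df = Transformer(df, bank, account_name).transform()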
+```
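+
+To make the derived columns concrete, here is a small vectorized sketch. The column names, the rule that any credit amount means income, and the April-to-March fiscal year are assumptions for this example and may not match what `Transformer` actually does.
+
+```python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({
+    "date": pd.to_datetime(["2025-03-31", "2025-04-01"]),
+    "description": ["UPI/Zomato/Order", "NEFT SALARY"],
+    "debit": [250.0, 0.0],
+    "credit": [0.0, 50000.0],
+})
+
+# Normalization: lowercase descriptions for case-insensitive keyword matching.
+df["description"] = df["description"].str.lower()
+
+# Group: a row with a credit amount is treated as income, otherwise expense.
+df["group"] = np.where(df["credit"] > 0, "Income", "Expense")
+
+# Fiscal period: assumes an April-March fiscal year labeled by its start year.
+start_year = df["date"].dt.year - (df["date"].dt.month < 4).astype(int)
+df["fiscal_period"] = "FY" + start_year.astype(str)
+```
+
+Every step operates on whole columns, which is the style the performance notes refer to as vectorized operations.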
+
+---
+
+## 3. Enrichment
+
+### Paytm Lookup
+- Matches uncategorized transactions with Paytm UPI data
+- Supports both file and DB sources
+
+```python
+from expense_classifier.paytm_lookup import PaytmLookup
+
+lookup = PaytmLookup(df, 'paytm.xlsx')
+df = lookup.perform_lookup()
+```
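+
+The matching step can be pictured as a left join between bank rows and Paytm rows. The sketch below assumes the two sources are matched on date and amount, which is only a guess at the join key; `PaytmLookup` may match on a UPI reference or something else entirely, and all column names here are hypothetical.
+
+```python
+import pandas as pd
+
+bank = pd.DataFrame({
+    "date": pd.to_datetime(["2025-04-01", "2025-04-02"]),
+    "amount": [250.0, 99.0],
+    "category": [None, "Groceries"],
+})
+paytm = pd.DataFrame({
+    "date": pd.to_datetime(["2025-04-01"]),
+    "amount": [250.0],
+    "paytm_payee": ["Zomato"],
+})
+
+# Left join keeps every bank row; unmatched rows get NaN for paytm_payee.
+merged = bank.merge(paytm, on=["date", "amount"], how="left")
+
+# Only uncategorized rows that found a Paytm match are enrichment candidates.
+enriched = merged[merged["category"].isna() & merged["paytm_payee"].notna()]
+```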
+
+### Manual/File Correction
+- Exports uncategorized records for user review
+- Allows user to add new keywords/categories
+- Updates mappings for future automation
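+
+One way to picture the "update mappings" step is merging reviewed rows back into a keyword map. This is a sketch under assumptions: the `apply_corrections` helper and the category-to-keywords dictionary are hypothetical, and the project stores its mappings in files or the database rather than in a literal dictionary.
+
+```python
+def apply_corrections(keyword_map, corrections):
+    """Return a new keyword map with reviewed (keyword, category) pairs added."""
+    updated = {category: set(words) for category, words in keyword_map.items()}
+    for keyword, category in corrections:
+        keyword = keyword.strip().lower()
+        if keyword:  # skip rows the reviewer left blank
+            updated.setdefault(category, set()).add(keyword)
+    return updated
+
+
+keyword_map = {"Food": {"zomato"}}
+# Hypothetical rows read back from the reviewed export file.
+corrections = [("Swiggy ", "Food"), ("irctc", "Travel"), ("", "Misc")]
+keyword_map = apply_corrections(keyword_map, corrections)
+```
+
+Because the updated map feeds the next run, a keyword added once keeps classifying matching transactions automatically.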
+
+---
+
+## 4. Classification
+
+- **Rule-Based Engine:** Keyword-driven, extensible
+- **Handles:** Both expense and income categories
+- **Extensible:** AI/ML models can be plugged in for advanced classification (planned)
+
+```python
+from expense_classifier.classifier import Classifier
+
+df = Classifier().classify(df)
+```
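+
+For readers unfamiliar with keyword-driven rules, the core idea fits in a few lines. The `KEYWORDS` mapping, the `classify` function, and the first-match-wins rule below are illustrative assumptions, not the `Classifier` internals.
+
+```python
+KEYWORDS = {  # hypothetical category -> keywords mapping
+    "Food": ["zomato", "swiggy"],
+    "Salary": ["salary", "payroll"],
+}
+
+
+def classify(description, keywords=KEYWORDS, default="Uncategorized"):
+    text = description.lower()
+    # First category with a matching keyword wins, in mapping order.
+    for category, words in keywords.items():
+        if any(word in text for word in words):
+            return category
+    return default
+```
+
+Rows that fall through to the default are the ones a manual/file correction step would hand to the user.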
+
+---
+
+## 5. Publishing
+
+- **Export:** Clean, categorized data to CSV and/or database
+- **Persistence:** All stages can be stored for auditability
+
+---
+
+## Error Handling & Validation
+
+- **Input Validation:** Checks file formats, required columns, and data types
+- **Error Logging:** Handles and logs invalid/missing data
+- **User Prompts:** Interactive correction for ambiguous cases
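+
+A minimal sketch of the input checks described above, assuming a fixed set of required column names and a hypothetical `validate_input` helper. The real checks, messages, and logger name may differ.
+
+```python
+import logging
+from pathlib import Path
+
+logger = logging.getLogger("expense_classifier.validation")  # hypothetical name
+
+SUPPORTED_SUFFIXES = {".csv", ".xlsx"}
+REQUIRED_COLUMNS = {"date", "description", "debit", "credit"}  # assumed names
+
+
+def validate_input(path, columns):
+    """Return a list of problems; an empty list means the input looks usable."""
+    problems = []
+    suffix = Path(path).suffix.lower()
+    if suffix not in SUPPORTED_SUFFIXES:
+        problems.append(f"unsupported file format: {suffix or '(none)'}")
+    missing = sorted(REQUIRED_COLUMNS - set(columns))
+    if missing:
+        problems.append("missing required columns: " + ", ".join(missing))
+    # Log every problem so a failed run leaves a trace.
+    for problem in problems:
+        logger.warning("%s: %s", path, problem)
+    return problems
+```
+
+Returning all problems at once, rather than raising on the first, lets the caller report everything that needs fixing in one pass.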
+
+---
+
+## Performance Optimizations
+
+- **Vectorized Operations:** Uses pandas for fast, efficient ETL
+- **Batch DB Writes:** Efficient storage of large datasets
+- **Progress Storage:** Optionally saves intermediate results
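+
+Batch writes can be sketched with pandas' `DataFrame.to_sql` and its `chunksize` parameter. The project uses SQLAlchemy; this example substitutes an in-memory `sqlite3` connection so it runs without extra setup, and the table name and chunk size are arbitrary choices.
+
+```python
+import sqlite3
+
+import pandas as pd
+
+df = pd.DataFrame({
+    "description": [f"txn {i}" for i in range(1200)],
+    "amount": [float(i) for i in range(1200)],
+})
+
+conn = sqlite3.connect(":memory:")
+# Rows are written in batches of 500 instead of one INSERT per row.
+df.to_sql("transactions", conn, if_exists="append", index=False, chunksize=500)
+
+count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
+```
+
+`to_sql` also accepts a SQLAlchemy engine or connection in place of the `sqlite3` connection used here.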
+
+---
+
+## Extensibility
+
+- **Add New Banks:** Extend `bank_utils` and mappings
+- **Custom Logic:** Plug in new transformation or classification modules
+- **Enrichment:** Add new data sources (e.g., other UPI providers)
+
+---
+
+For more, see [ARCHITECTURE.md](ARCHITECTURE.md) or [README.md](README.md).
diff --git a/FEATURES.md b/FEATURES.md
@@ -0,0 +1,90 @@
+# Features & Usage: Expense Classifier
+
+## Feature Overview
+
+| Category   | Feature                                      | Description                                                      |
+|------------|----------------------------------------------|------------------------------------------------------------------|
+| Core       | CLI Processing                               | One-command classification of bank statements                    |
+| Core       | Programmatic API                             | Use pipeline in Python scripts                                   |
+| Core       | Modular ETL Pipeline                         | Ingest, transform, classify, correct, and export                 |
+| Core       | Database Integration                         | SQLAlchemy ORM, persistent storage, Alembic migrations           |
+| Core       | Manual/File Correction                       | Export uncategorized records for user review and enrichment      |
+| Core       | Multi-bank Support                           | Works with multiple banks, custom columns                        |
+| Advanced   | Paytm UPI Lookup                             | Enrich uncategorized records with Paytm data                     |
+| Advanced   | Progress Storage                             | Save intermediate results for auditability                       |
+| Advanced   | Custom Categories/Keywords                   | User-driven enrichment and learning                              |
+| Advanced   | Reporting/Export                             | Clean CSVs, DB tables for analysis                               |
+| Advanced   | Error Handling                               | Validates input, handles edge cases                              |
+| Planned    | AI/ML Classification                         | Plug-in for AI-based categorization                              |
+| Planned    | Visualization/Dashboards                     | Hooks for BI tools and dashboards                                |
+| Planned    | Automated Testing                            | pytest-based test suite                                          |
+
+---
+
+## CLI Usage Examples
+
+```bash
+# Basic usage
+expense-classifier --path my_statement.csv --bank-code SBI --account "Savings Account"
+
+# With Paytm lookup and manual correction
+expense-classifier --path my_statement.csv --bank-code SBI --account "Savings Account" --paytm-lookup --paytm-file paytm.xlsx
+
+# Store results in database
+expense-classifier --path my_statement.csv --bank-code SBI --account "Savings Account" --store-in-db
+```
+
+---
+
+## Programmatic Usage Example
+
+```python
+from expense_classifier.pipeline import Pipeline
+
+pipeline = Pipeline(
+    bank_code="SBI",
+    file_path="my_statement.csv",
+    account_name="Savings Account",
+    paytm_lookup=True,
+    paytm_file_path="paytm.xlsx"
+)
+pipeline.ingest()
+pipeline.transform()
+pipeline.join_paytm()
+pipeline.categorize()
+pipeline.file_correction()
+final_df, final_table, final_path = pipeline.publish_data()
+```
+
+---
+
+## Configuration & Customization
+
+- **Bank Codes:** Supports multiple banks via the `--bank-code` flag or the `bank_code` parameter.
+- **Column Names:** Override with `--date-col`, `--credit-col`, `--debit-col`, `--desc-col`.
+- **Categories/Keywords:** Extend via file/manual correction or DB edits.
+- **Paytm Lookup:** Enable with `--paytm-lookup` and `--paytm-file`.
+- **Database:** Store results with `--store-in-db`.
+- **Progress Storage:** Enable/disable with `--store-progress`.
+
+---
+
+## Extensibility
+
+- **Pipeline Stages:** Add/replace ETL stages by extending the pipeline.
+- **Classification Logic:** Plug in AI/ML models or new rule engines.
+- **Reporting:** Integrate with BI tools or custom dashboards.
+- **Data Sources:** Add new banks, UPI providers, or enrichment sources.
+
+---
+
+## Real-World Use Cases
+
+- **Personal Finance:** Automated expense tracking and budgeting.
+- **Business Accounting:** Streamline reconciliation and reporting.
+- **Tax Preparation:** Categorize and export data for tax filing.
+- **Financial Analytics:** Feed clean data into BI tools for insights.
+
+---
+
+For more, see [README.md](README.md) or [ARCHITECTURE.md](ARCHITECTURE.md).
diff --git a/README.md b/README.md
diff --git a/VISUALIZATION.md b/VISUALIZATION.md