Commit b23df20

Update Docs
1 parent 5b668d3 commit b23df20

6 files changed

+508
-153
lines changed

ARCHITECTURE.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
1+
# System Architecture: Expense Classifier
2+
3+
## Overview
4+
5+
The Expense Classifier is a modular ETL pipeline for financial transaction data. Each stage can be extended or replaced independently, and the pipeline can be run from the CLI or called from Python code.
6+
7+
---
8+
9+
## Architecture Diagram
10+
11+
```mermaid
flowchart TD
    A["User Input (CLI/Script)"] --> B["Ingestor"]
    B --> C["Transformer"]
    C --> D["Paytm Lookup (optional)"]
    D --> E["Classifier"]
    E --> F["File Correction / Manual Correction"]
    F --> G["Database (SQLAlchemy)"]
    F --> H["Reporting/Export"]
    G --> H
    H --> I["Final Output (CSV/DB/Report)"]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style I fill:#bbf,stroke:#333,stroke-width:2px
```
25+
26+
---
27+
28+
## Component Breakdown
29+
30+
### 1. Ingestor
31+
- Loads raw bank/Paytm files (CSV/XLSX)
32+
- Standardizes columns and cleans data
33+
- Handles messy, real-world input formats
34+
35+
### 2. Transformer
36+
- Extracts transaction details (mode, payee, UPI ID, etc.)
37+
- Adds derived columns (e.g., group, account, fiscal period)
38+
- Normalizes and enriches data for downstream processing
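
As an illustration of the extraction step, the sketch below pulls a payment mode and a UPI ID out of a free-text description. The description layout, the `extract_details` helper, and the regular expression are assumptions made for this example, not the project's actual parsing rules.

```python
import re

# Hypothetical pattern: a UPI ID is assumed to look like "name@handle".
UPI_ID_RE = re.compile(r"\b[\w.\-]+@[A-Za-z]+\b")


def extract_details(description: str) -> dict:
    """Return the payment mode and UPI ID found in a description, if any."""
    text = description.strip().lower()
    # Assumed convention: the mode is the first token before "/", "-" or whitespace.
    mode = re.split(r"[/\-\s]", text, maxsplit=1)[0] or None
    match = UPI_ID_RE.search(text)
    return {"mode": mode, "upi_id": match.group(0) if match else None}


print(extract_details("UPI/DR/412345678901/Cafe Blue/cafeblue@okaxis"))
```

Real statements vary by bank, so a production version would likely need one pattern per bank code.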
39+
40+
### 3. Paytm Lookup (Optional)
41+
- Matches uncategorized transactions with Paytm UPI data
42+
- Enriches records for improved classification
43+
- Supports both file and DB sources
44+
45+
### 4. Classifier
46+
- Assigns categories using a rule-based engine (keyword-driven)
47+
- Extensible for AI/ML-based classification
48+
- Handles both expense and income categories
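
A keyword-driven rule engine can be sketched in a few lines. The mapping below and the `classify` function are hypothetical; the project's real keywords and categories live in its own mappings.

```python
from typing import Dict, Optional

# Hypothetical keyword-to-category mapping, for illustration only.
KEYWORD_CATEGORIES: Dict[str, str] = {
    "swiggy": "Food",
    "uber": "Transport",
    "salary": "Income",
}


def classify(description: str,
             mapping: Dict[str, str] = KEYWORD_CATEGORIES) -> Optional[str]:
    """Return the category of the first keyword found in the description."""
    text = description.lower()
    for keyword, category in mapping.items():
        if keyword in text:
            return category
    return None  # left uncategorized for the correction stage


print(classify("UPI/Swiggy order 1234"))
```

Returning `None` rather than a default category keeps unmatched rows visible to the correction stage.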
49+
50+
### 5. File/Manual Correction
51+
- Exports uncategorized transactions for user review
52+
- Allows users to add new keywords/categories
53+
- Updates mappings for future automation
54+
55+
### 6. Database (SQLAlchemy)
56+
- Stores all pipeline stages and category mappings
57+
- Enables persistent analytics and reporting
58+
- Uses Alembic migrations for schema evolution
59+
60+
### 7. Reporting/Export
61+
- Outputs clean, categorized data to CSV and/or database
62+
- (Planned) hooks for dashboards and BI tools
63+
64+
---
65+
66+
## Data Flow & Extensibility
67+
68+
- **Pipeline Orchestration:** Each stage is a class with a clear interface, enabling easy extension or replacement.
69+
- **Configurable:** Supports custom columns, banks, and enrichment steps via parameters.
70+
- **Extensible:** Add new data sources, transformation logic, or classification methods with minimal code changes.
71+
- **Persistence:** All intermediate and final data can be stored for auditability and reproducibility.
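
One way to read "each stage is a class with a clear interface" is sketched below. The `Stage` protocol, the two example stages, and `run_pipeline` are hypothetical names; the project's real classes operate on pandas DataFrames and may expose different methods.

```python
from typing import Any, Dict, List, Protocol

Rows = List[Dict[str, Any]]


class Stage(Protocol):
    """Hypothetical stage interface: take rows in, return rows out."""

    def run(self, rows: Rows) -> Rows: ...


class LowercaseDescriptions:
    def run(self, rows: Rows) -> Rows:
        return [{**r, "description": r["description"].lower()} for r in rows]


class TagGroup:
    def run(self, rows: Rows) -> Rows:
        # Assumed convention: positive amounts are income, the rest are expenses.
        return [{**r, "group": "Income" if r["amount"] > 0 else "Expense"}
                for r in rows]


def run_pipeline(rows: Rows, stages: List[Stage]) -> Rows:
    """Apply each stage in order; replacing a stage means editing this list."""
    for stage in stages:
        rows = stage.run(rows)
    return rows


print(run_pipeline([{"description": "SALARY", "amount": 100.0}],
                   [LowercaseDescriptions(), TagGroup()]))
```

Because the stages share only the `run` signature, a new enrichment step can be added without touching the existing ones.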
72+
73+
---
74+
75+
## Engineering Highlights
76+
77+
- **Separation of Concerns:** Each module handles a single responsibility.
78+
- **Testability:** Modular design enables unit and integration testing.
79+
- **Performance:** Vectorized operations and batch DB writes.
80+
- **User-Centric:** Interactive correction and easy customization.
81+
- **Scalability:** Designed to handle large datasets and evolving requirements.
82+
83+
---
84+
85+
## Scalability, Maintainability, Extensibility
86+
87+
- **Scalability:** Efficient pandas operations, batch processing, and DB integration support large data volumes.
88+
- **Maintainability:** Clear module boundaries, docstrings, and type hints.
89+
- **Extensibility:** Plug-and-play pipeline stages make it easy to add new banks, categories, or enrichment logic.
90+
91+
---
92+
93+
For more, see the [main README](README.md) or [DATA_ENGINEERING.md](DATA_ENGINEERING.md).

CONTRIBUTING.md

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
# Contributing to Expense Classifier
2+
3+
Thank you for your interest in contributing! Contributions that make this project better for everyone are welcome.
4+
5+
---
6+
7+
## Getting Started
8+
9+
1. **Fork the repository** and clone your fork.
10+
2. **Set up a virtual environment:**
11+
   ```bash
   python -m venv venv
   # On Windows, run venv\Scripts\activate instead
   source venv/bin/activate
   pip install -r requirements.txt
   ```
16+
3. **Install in editable mode:**
17+
   ```bash
   pip install -e .
   ```
20+
4. **Create a new branch** for your feature or bugfix.
21+
22+
---
23+
24+
## Code Style & Best Practices
25+
26+
- Follow [PEP8](https://www.python.org/dev/peps/pep-0008/) for Python code.
27+
- Write clear docstrings and comments.
28+
- Keep functions and classes small and focused.
29+
- Use type hints where possible.
30+
- Modularize code for testability and reuse.
31+
32+
---
33+
34+
## Submitting Issues & Pull Requests
35+
36+
- **Issues:** Please use GitHub Issues for bugs, feature requests, or questions.
37+
- **Pull Requests:**
38+
- Reference the issue you are addressing (if any).
39+
- Describe your changes clearly.
40+
- Ensure your code runs and passes all checks.
41+
- Add or update tests where relevant (the test suite is still in progress).
42+
43+
---
44+
45+
## Testing (Planned)
46+
47+
- Automated tests will use `pytest`.
48+
- Please include tests for new features or bugfixes when the suite is available.
49+
50+
---
51+
52+
## Code of Conduct
53+
54+
- Be respectful and inclusive.
55+
- Provide constructive feedback.
56+
- Help others learn and grow.
57+
58+
---
59+
60+
## Contact
61+
62+
For questions or suggestions, open an issue or contact [Piyush Upreti](mailto:piyushupreti@gmail.com).

DATA_ENGINEERING.md

Lines changed: 108 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,108 @@
1+
# Data Engineering Deep Dive: Expense Classifier
2+
3+
## ETL Pipeline Overview
4+
5+
The pipeline is designed as a modular, extensible sequence of stages:
6+
7+
1. **Ingestion**: Load and standardize raw bank/Paytm files (CSV/XLSX)
8+
2. **Transformation**: Extract and normalize transaction details
9+
3. **Enrichment**: Paytm lookup and manual/file correction
10+
4. **Classification**: Assign categories using rule-based logic
11+
5. **Publishing**: Export to CSV and/or database
12+
13+
---
14+
15+
## 1. Ingestion
16+
17+
- **File Support:** CSV, XLSX (bank statements, Paytm UPI)
18+
- **Standardization:** Renames columns, parses dates, cleans numeric fields
19+
- **Validation:** Drops empty rows, handles invalid/missing data
20+
21+
```python
from expense_classifier.ingestor import Ingestor

df = Ingestor('my_statement.csv').get_data()
```
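
To make the standardization and validation bullets concrete, here is a minimal sketch using only the standard library (the project itself relies on pandas). The header names in `COLUMN_MAP`, the `%d/%m/%Y` date format, and the helper names are assumptions, not the `Ingestor` internals.

```python
import csv
import io
from datetime import datetime

# Hypothetical mapping from one bank's headers to standard names.
COLUMN_MAP = {"Txn Date": "date", "Description": "description",
              "Debit": "debit", "Credit": "credit"}


def parse_amount(value: str) -> float:
    """Strip thousands separators and treat blanks as zero."""
    cleaned = (value or "").replace(",", "").strip()
    return float(cleaned) if cleaned else 0.0


def standardize(csv_text: str) -> list:
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        if not any((raw.get(k) or "").strip() for k in COLUMN_MAP):
            continue  # drop empty rows
        row = {new: raw.get(old) or "" for old, new in COLUMN_MAP.items()}
        row["date"] = datetime.strptime(row["date"].strip(), "%d/%m/%Y").date()
        row["debit"] = parse_amount(row["debit"])
        row["credit"] = parse_amount(row["credit"])
        rows.append(row)
    return rows


sample = 'Txn Date,Description,Debit,Credit\n01/04/2024,Cafe,"1,250.50",\n,,,\n'
print(standardize(sample))
```

A row with an unparseable date raises `ValueError` here; whether to skip or log such rows is a design choice this sketch leaves open.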
26+
27+
---
28+
29+
## 2. Transformation
30+
31+
- **Feature Extraction:** Payment mode, payee, UPI ID, etc.
32+
- **Derived Columns:** Group (Income/Expense), Account, Fiscal Period
33+
- **Normalization:** Lowercases descriptions, standardizes formats
34+
35+
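```python
from expense_classifier.transformer import Transformer

# `bank` and `account_name` must be defined by the caller
# (for example, the bank code and the account label used elsewhere in these docs).
df = Transformer(df, bank, account_name).transform()
```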
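
The fiscal-period derived column could be computed as in the sketch below. An April-to-March fiscal year and the `FY2024-25` label format are assumptions; the source does not state which convention the `Transformer` uses.

```python
from datetime import date

FISCAL_START_MONTH = 4  # assumption: the fiscal year starts in April


def fiscal_period(d: date) -> str:
    """Label a date with its fiscal year, e.g. 'FY2024-25'."""
    start_year = d.year if d.month >= FISCAL_START_MONTH else d.year - 1
    return f"FY{start_year}-{(start_year + 1) % 100:02d}"


print(fiscal_period(date(2024, 4, 1)))
print(fiscal_period(date(2024, 3, 31)))
```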
40+
41+
---
42+
43+
## 3. Enrichment
44+
45+
### Paytm Lookup
46+
- Matches uncategorized transactions with Paytm UPI data
47+
- Supports both file and DB sources
48+
49+
```python
from expense_classifier.paytm_lookup import PaytmLookup

lookup = PaytmLookup(df, 'paytm.xlsx')
df = lookup.perform_lookup()
```
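
The matching idea can be illustrated as follows. Joining on date and amount is an assumption made for this sketch; the source does not say which keys `PaytmLookup` actually matches on, and the field names are hypothetical.

```python
def paytm_lookup(transactions: list, paytm_rows: list) -> list:
    """Attach a Paytm payee to uncategorized rows that share date and amount."""
    # If several Paytm rows share a key, the last one wins in this sketch.
    index = {(p["date"], p["amount"]): p for p in paytm_rows}
    enriched = []
    for txn in transactions:
        match = index.get((txn["date"], txn["amount"]))
        if txn.get("category") is None and match is not None:
            txn = {**txn, "payee": match["payee"]}
        enriched.append(txn)
    return enriched


txns = [
    {"date": "2024-04-01", "amount": 250.0, "category": None},
    {"date": "2024-04-02", "amount": 99.0, "category": "Food"},
]
paytm = [{"date": "2024-04-01", "amount": 250.0, "payee": "Cafe Blue"}]
print(paytm_lookup(txns, paytm))
```

Only uncategorized rows are touched, so an enriched payee can feed a second classification pass without overwriting earlier results.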
55+
56+
### Manual/File Correction
57+
- Exports uncategorized records for user review
58+
- Allows user to add new keywords/categories
59+
- Updates mappings for future automation
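
The correction loop described above might look like the sketch below: export rows without a category, let the user fill in a keyword and category, then merge those pairs into the mapping. The CSV layout and both function names are hypothetical.

```python
import csv
import io


def export_uncategorized(rows, out) -> int:
    """Write rows lacking a category to CSV, with blank columns for the reviewer."""
    writer = csv.DictWriter(out, fieldnames=["description", "keyword", "category"])
    writer.writeheader()
    pending = [r for r in rows if not r.get("category")]
    for r in pending:
        writer.writerow({"description": r["description"],
                         "keyword": "", "category": ""})
    return len(pending)


def apply_corrections(reviewed, mapping: dict) -> dict:
    """Merge reviewer-supplied keyword/category pairs into a copy of the mapping."""
    updated = dict(mapping)
    for row in csv.DictReader(reviewed):
        keyword = (row.get("keyword") or "").strip().lower()
        category = (row.get("category") or "").strip()
        if keyword and category:
            updated[keyword] = category
    return updated


reviewed = io.StringIO(
    "description,keyword,category\nzomato order,Zomato,Food\nunknown,,\n")
print(apply_corrections(reviewed, {"uber": "Transport"}))
```

Rows the reviewer leaves blank are ignored, so they are exported again on the next run.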
60+
61+
---
62+
63+
## 4. Classification
64+
65+
- **Rule-Based Engine:** Keyword-driven, extensible
66+
- **Handles:** Both expense and income categories
67+
- **Extensible:** Plug in AI/ML models for advanced classification
68+
69+
```python
from expense_classifier.classifier import Classifier

df = Classifier().classify(df)
```
74+
75+
---
76+
77+
## 5. Publishing
78+
79+
- **Export:** Clean, categorized data to CSV and/or database
80+
- **Persistence:** All stages can be stored for auditability
81+
82+
---
83+
84+
## Error Handling & Validation
85+
86+
- **Input Validation:** Checks file formats, required columns, and data types
87+
- **Error Logging:** Handles and logs invalid/missing data
88+
- **User Prompts:** Interactive correction for ambiguous cases
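
A minimal sketch of the input checks listed above. The required column set and the helper names are assumptions; the project's actual validation may be stricter.

```python
REQUIRED_COLUMNS = {"date", "description", "debit", "credit"}  # assumed names


def validate_columns(columns) -> None:
    """Fail early, naming every required column that is absent."""
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")


def split_valid(rows):
    """Separate usable rows from rows that should be logged and skipped."""
    valid, rejected = [], []
    for row in rows:
        usable = row.get("date") and row.get("description")
        (valid if usable else rejected).append(row)
    return valid, rejected


validate_columns(["date", "description", "debit", "credit", "balance"])
valid, rejected = split_valid([
    {"date": "2024-04-01", "description": "Cafe"},
    {"date": "", "description": "No date"},
])
print(len(valid), len(rejected))
```

Returning the rejected rows instead of dropping them silently makes them available for logging or interactive correction.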
89+
90+
---
91+
92+
## Performance Optimizations
93+
94+
- **Vectorized Operations:** Uses pandas for fast, efficient ETL
95+
- **Batch DB Writes:** Efficient storage of large datasets
96+
- **Progress Storage:** Optionally saves intermediate results
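
Batch writes can be illustrated with the standard-library `sqlite3` module. This swaps in `sqlite3` for the project's SQLAlchemy layer to keep the example self-contained; the table layout and the `write_in_batches` helper are hypothetical.

```python
import sqlite3


def write_in_batches(conn, rows, batch_size=500):
    """Insert rows in fixed-size batches, one transaction per batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(date TEXT, description TEXT, amount REAL)"
    )
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", batch)
    return len(rows)


conn = sqlite3.connect(":memory:")
rows = [("2024-04-01", f"txn {i}", float(i)) for i in range(1200)]
write_in_batches(conn, rows)
print(conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0])
```

Grouping inserts this way avoids one round trip and one commit per row, which is where most of the cost of row-by-row writes tends to come from.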
97+
98+
---
99+
100+
## Extensibility
101+
102+
- **Add New Banks:** Extend `bank_utils` and mappings
103+
- **Custom Logic:** Plug in new transformation or classification modules
104+
- **Enrichment:** Add new data sources (e.g., other UPI providers)
105+
106+
---
107+
108+
For more, see [ARCHITECTURE.md](ARCHITECTURE.md) or [README.md](README.md).

FEATURES.md

Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,90 @@
1+
# Features & Usage: Expense Classifier
2+
3+
## Feature Overview
4+
5+
| Category | Feature                    | Description                                                  |
|----------|----------------------------|--------------------------------------------------------------|
| Core     | CLI Processing             | One-command classification of bank statements                |
| Core     | Programmatic API           | Use pipeline in Python scripts                               |
| Core     | Modular ETL Pipeline       | Ingest, transform, classify, correct, and export             |
| Core     | Database Integration       | SQLAlchemy ORM, persistent storage, Alembic migrations       |
| Core     | Manual/File Correction     | Export uncategorized records for user review and enrichment  |
| Core     | Multi-bank Support         | Works with multiple banks, custom columns                    |
| Advanced | Paytm UPI Lookup           | Enrich uncategorized records with Paytm data                 |
| Advanced | Progress Storage           | Save intermediate results for auditability                   |
| Advanced | Custom Categories/Keywords | User-driven enrichment and learning                          |
| Advanced | Reporting/Export           | Clean CSVs, DB tables for analysis                           |
| Advanced | Error Handling             | Validates input, handles edge cases                          |
| Planned  | AI/ML Classification       | Plug-in for AI-based categorization                          |
| Planned  | Visualization/Dashboards   | Hooks for BI tools and dashboards                            |
| Planned  | Automated Testing          | pytest-based test suite                                      |
21+
22+
---
23+
24+
## CLI Usage Examples
25+
26+
```bash
# Basic usage
expense-classifier --path my_statement.csv --bank-code SBI --account "Savings Account"

# With Paytm lookup and manual correction
expense-classifier --path my_statement.csv --bank-code SBI --account "Savings Account" --paytm-lookup --paytm-file paytm.xlsx

# Store results in database
expense-classifier --path my_statement.csv --bank-code SBI --account "Savings Account" --store-in-db
```
36+
37+
---
38+
39+
## Programmatic Usage Example
40+
41+
```python
from expense_classifier.pipeline import Pipeline

pipeline = Pipeline(
    bank_code="SBI",
    file_path="my_statement.csv",
    account_name="Savings Account",
    paytm_lookup=True,
    paytm_file_path="paytm.xlsx"
)
pipeline.ingest()
pipeline.transform()
pipeline.join_paytm()
pipeline.categorize()
pipeline.file_correction()
final_df, final_table, final_path = pipeline.publish_data()
```
58+
59+
---
60+
61+
## Configuration & Customization
62+
63+
- **Bank Codes:** Supports multiple banks via `--bank-code` or `bank_code` param.
64+
- **Column Names:** Override with `--date-col`, `--credit-col`, `--debit-col`, `--desc-col`.
65+
- **Categories/Keywords:** Extend via file/manual correction or DB edits.
66+
- **Paytm Lookup:** Enable with `--paytm-lookup` and `--paytm-file`.
67+
- **Database:** Store results with `--store-in-db`.
68+
- **Progress Storage:** Enable/disable with `--store-progress`.
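
The flags listed above could be wired up with `argparse` roughly as follows. This is a sketch, not the project's CLI module: the flag names come from this document, while the defaults, which flags are required, and treating `--store-progress` as a boolean switch are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="expense-classifier")
    parser.add_argument("--path", required=True, help="statement file (CSV/XLSX)")
    parser.add_argument("--bank-code", required=True)
    parser.add_argument("--account", required=True)
    # Column overrides; a default of None (use the bank's mapping) is assumed.
    for flag in ("--date-col", "--credit-col", "--debit-col", "--desc-col"):
        parser.add_argument(flag, default=None)
    parser.add_argument("--paytm-lookup", action="store_true")
    parser.add_argument("--paytm-file")
    parser.add_argument("--store-in-db", action="store_true")
    parser.add_argument("--store-progress", action="store_true")
    return parser


args = build_parser().parse_args([
    "--path", "my_statement.csv",
    "--bank-code", "SBI",
    "--account", "Savings Account",
    "--paytm-lookup", "--paytm-file", "paytm.xlsx",
])
print(args.bank_code, args.paytm_lookup, args.store_in_db)
```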
69+
70+
---
71+
72+
## Extensibility
73+
74+
- **Pipeline Stages:** Add/replace ETL stages by extending the pipeline.
75+
- **Classification Logic:** Plug in AI/ML models or new rule engines.
76+
- **Reporting:** Integrate with BI tools or custom dashboards.
77+
- **Data Sources:** Add new banks, UPI providers, or enrichment sources.
78+
79+
---
80+
81+
## Real-World Use Cases
82+
83+
- **Personal Finance:** Automated expense tracking and budgeting.
84+
- **Business Accounting:** Streamline reconciliation and reporting.
85+
- **Tax Preparation:** Categorize and export data for tax filing.
86+
- **Financial Analytics:** Feed clean data into BI tools for insights.
87+
88+
---
89+
90+
For more, see [README.md](README.md) or [ARCHITECTURE.md](ARCHITECTURE.md).
