A self-contained data pipeline project that ingests raw CSVs, cleans and transforms the data, runs analysis, and serves results through an interactive React dashboard.
data-pipeline-dashboard/
├── generate_data.py          # generates sample raw CSVs (run first)
├── clean_data.py             # part 1: data cleaning
├── analyze.py                # part 2: merging & analysis
├── backend/
│   ├── app.py                # FastAPI REST API
│   └── requirements.txt      # Python dependencies
├── frontend/                 # React + Vite dashboard
│   ├── src/
│   │   ├── main.jsx          # entry point
│   │   ├── App.jsx           # main dashboard component
│   │   ├── index.css         # global styles
│   │   └── components/
│   │       ├── RevenueChart.jsx
│   │       ├── TopCustomers.jsx
│   │       ├── CategoryChart.jsx
│   │       └── RegionSummary.jsx
│   ├── index.html
│   ├── vite.config.js
│   └── package.json
├── data/
│   ├── raw/                  # original CSVs
│   └── processed/            # cleaned & analysis output CSVs
├── tests/
│   └── test_clean_data.py    # pytest unit tests
└── README.md
- Python 3.9+
- Node.js 18+
- pip, npm
pip install pandas numpy fastapi uvicorn pytest

No CSVs are supplied with the project, so run this first to create realistic sample data with deliberately dirty entries:
python generate_data.py

This creates `customers.csv`, `orders.csv`, and `products.csv` in `data/raw/`.
python clean_data.py

Outputs:
- `data/processed/customers_clean.csv`
- `data/processed/orders_clean.csv`
- Cleaning report printed to stdout
python analyze.py

Outputs in `data/processed/`:
- `monthly_revenue.csv`
- `top_customers.csv`
- `category_performance.csv`
- `regional_analysis.csv`
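As a rough illustration of the kind of aggregation behind `monthly_revenue.csv`, and of the churn rule that measures the 90-day window from the latest `order_date` in the data rather than from today, here is a minimal pandas sketch. The column names (`customer_id`, `order_date`, `amount`), the function names, and the definition of churn as "no order inside the window" are assumptions made for this example; `analyze.py` may differ.

```python
import pandas as pd


def monthly_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Sum order amounts per calendar month (assumed columns: order_date, amount)."""
    month = orders["order_date"].dt.to_period("M").astype(str)
    return (
        orders.assign(month=month)
        .groupby("month", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )


def churned_customers(orders: pd.DataFrame, window_days: int = 90) -> list:
    """Return customers whose last order is older than the window.

    The window ends at the latest order_date in the dataset, not at today's
    date, so results are reproducible for a fixed dataset.
    """
    cutoff = orders["order_date"].max() - pd.Timedelta(days=window_days)
    last_order = orders.groupby("customer_id")["order_date"].max()
    return sorted(last_order[last_order < cutoff].index)


# Tiny hypothetical dataset to exercise both helpers.
sample = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3],
        "order_date": pd.to_datetime(
            ["2024-01-10", "2024-06-01", "2024-01-15", "2024-05-20"]
        ),
        "amount": [10.0, 20.0, 5.0, 7.5],
    }
)
print(monthly_revenue(sample))
print(churned_customers(sample))
```

Anchoring the cutoff to the data keeps the output stable no matter when the script is run, which fits a dataset generated from a fixed seed.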
You can override the default file paths with arguments:
python analyze.py --customers path/to/customers.csv --orders path/to/orders.csv --products path/to/products.csv --output path/to/output/

cd backend
uvicorn app:app --reload --port 8000

API endpoints:
- `GET /health`: health check
- `GET /api/revenue`: monthly revenue data
- `GET /api/top-customers`: top 10 customers
- `GET /api/categories`: category performance
- `GET /api/regions`: regional analysis
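Once the server is running, a quick way to confirm that every endpoint responds is a small standard-library client such as the sketch below. It assumes the API listens on `http://localhost:8000` (matching the `--port 8000` flag) and that each endpoint returns JSON; the helper names `build_url`, `fetch_json`, and `smoke_check` are hypothetical and not part of the project.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: matches `--port 8000`
ENDPOINTS = [
    "/health",
    "/api/revenue",
    "/api/top-customers",
    "/api/categories",
    "/api/regions",
]


def build_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """GET one endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(build_url(path, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def smoke_check(base_url: str = BASE_URL) -> None:
    """Request every endpoint once and print the type of each payload."""
    for path in ENDPOINTS:
        payload = fetch_json(path, base_url)
        print(f"{path}: OK ({type(payload).__name__})")
```

Call `smoke_check()` from a Python shell while `uvicorn` is running; a connection error usually means the backend has not started or is listening on a different port.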
cd frontend
npm install
npm run dev

Open http://localhost:5500 in your browser.
python -m pytest tests/ -v

- Revenue Trend: Recharts area chart with date-range filter (bonus)
- Top Customers: sortable table with search box (bonus)
- Category Breakdown: bar chart of revenue by category
- Region Summary: card-based KPI view
- Sample data is generated with a fixed random seed (42) for reproducibility.
- The "last 90 days" churn calculation is relative to the latest `order_date` in the dataset.
- Status normalization maps common variants (e.g., "done" → "completed", "canceled" → "cancelled"). Unrecognized statuses are kept as-is.
- For the multi-format date parser, when a date like "03-05-2024" is ambiguous, it's parsed as MM-DD-YYYY per the assignment spec.
- Missing `amount` values are filled with the median amount grouped by product; if a product has no valid amounts, the overall median is used.
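The cleaning rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `clean_data.py`: the status map contains only the two variants named above plus their canonical forms, the list of date formats is a guess apart from trying MM-DD-YYYY before DD-MM-YYYY, and the `product_id` column name is hypothetical.

```python
from datetime import datetime
from typing import Optional

import pandas as pd

# Assumption: only the variants named in the notes, plus canonical spellings.
STATUS_MAP = {
    "done": "completed",
    "completed": "completed",
    "canceled": "cancelled",
    "cancelled": "cancelled",
}

# Assumption: illustrative format list. MM-DD-YYYY comes before DD-MM-YYYY,
# so an ambiguous value such as "03-05-2024" resolves to March 5.
DATE_FORMATS = ("%Y-%m-%d", "%m-%d-%Y", "%m/%d/%Y", "%d-%m-%Y")


def normalize_status(value: str) -> str:
    """Map known variants to a canonical status; return unknown values unchanged."""
    return STATUS_MAP.get(value.strip().lower(), value)


def parse_date(text: str) -> Optional[datetime]:
    """Try each format in order and return the first successful parse, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def fill_missing_amounts(orders: pd.DataFrame) -> pd.DataFrame:
    """Fill missing amounts with the per-product median, then the overall median."""
    overall = orders["amount"].median()
    per_product = orders.groupby("product_id")["amount"].transform("median")
    filled = orders["amount"].fillna(per_product).fillna(overall)
    return orders.assign(amount=filled)
```

For example, `parse_date("25-12-2024")` cannot be read as MM-DD-YYYY, so it falls through to the DD-MM-YYYY format, while a product whose amounts are all missing receives the overall median because its per-product median is itself missing.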
- Data processing: Python, pandas, numpy
- Backend: FastAPI, uvicorn
- Frontend: React, Vite, Recharts
- Testing: pytest