A Python 3.11 data-processing pipeline that links Dubai Hills property-owner records to DLD transaction records with a target of ≥ 98% precision and recall. The pipeline runs two automated matching tiers (deterministic and fuzzy) plus a lightweight manual-review UI as the third tier.
- Three-tier matching system: Deterministic, fuzzy, and manual review
- High precision matching: ≥ 98% precision and recall target
- Automated preprocessing: Data cleaning, normalization, and composite key generation
- Fuzzy string matching: Advanced similarity scoring for non-exact matches
- Manual review interface: Streamlit-based UI for human validation
- Comprehensive reporting: QA reports and match statistics
- Modular architecture: Clean separation of concerns
- Memory-efficient processing: Handles large CSV files without loading them entirely into memory
- Column removal: Remove irrelevant columns from CSV data
- File splitting: Split large files into smaller chunks based on size limits
- RESTful API: FastAPI-based web service with automatic documentation
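The memory-efficient processing, column removal, and file splitting features above can be sketched with the standard-library `csv` module. This is an illustrative stand-in, not the pipeline's actual API: the function name, chunk size, and demo data are all assumptions.

```python
import csv
import os
import tempfile

def split_and_prune_csv(src_path, out_dir, drop_columns, rows_per_chunk=1000):
    """Stream a CSV row by row, drop unwanted columns, and write
    fixed-size chunk files -- without loading the file into memory."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths, buffer, keep = [], [], None

    def flush():
        part = os.path.join(out_dir, f"part_{len(chunk_paths) + 1:03d}.csv")
        with open(part, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=keep)
            writer.writeheader()
            writer.writerows(buffer)
        chunk_paths.append(part)
        buffer.clear()

    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        keep = [c for c in reader.fieldnames if c not in drop_columns]
        for row in reader:
            buffer.append({c: row[c] for c in keep})
            if len(buffer) >= rows_per_chunk:
                flush()
    if buffer:
        flush()  # write the final partial chunk
    return chunk_paths

# Demo on a tiny temporary CSV (chunk size 2 keeps the example visible).
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "owners.csv")
with open(src, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["project", "building", "unit_number", "internal_id"])
    w.writerows([["DH", "B1", "101", "x1"],
                 ["DH", "B1", "102", "x2"],
                 ["DH", "B2", "201", "x3"]])

parts = split_and_prune_csv(src, os.path.join(tmp, "parts"),
                            drop_columns={"internal_id"}, rows_per_chunk=2)
```

Buffering only `rows_per_chunk` rows at a time keeps peak memory flat regardless of input size, which is the property the feature list promises.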
```
repo-root/
│
├── data/
│   ├── raw/
│   │   ├── owners/20250716/Dubai Hills.xlsx
│   │   └── transactions/20250716/
│   │       ├── corrected_transactions_part_001.csv
│   │       └── …
│   ├── processed/           # pipeline output written here
│   └── review/              # parquet with human decisions
│
├── matching/
│   ├── __init__.py
│   ├── preprocess.py        # cleaning + normalisation helpers
│   ├── deterministic.py     # tier-1 exact join
│   ├── fuzzy.py             # tier-2 fuzzy scorer
│   ├── review_helpers.py    # load & merge human approvals
│   └── pipeline.py          # main pipeline orchestration
│
├── ui/
│   └── review_app.py        # Streamlit manual-review interface
│
├── tests/
│   └── test_preprocess.py   # pytest unit tests
│
├── Dockerfile
├── requirements.txt
└── README.md
```
On macOS/Linux:

```bash
chmod +x install.sh
./install.sh
```

On Windows:

```bash
install.bat
```

Manual setup:

- Clone or download the project files
- Create a virtual environment:

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment:

  On macOS/Linux:

  ```bash
  source venv/bin/activate
  ```

  On Windows:

  ```bash
  venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

After installation, activate the virtual environment:

On macOS/Linux:

```bash
source activate_env.sh
```

On Windows:

```bash
activate_env.bat
```

- Run the example pipeline:

  ```bash
  python example_matching.py
  ```

  This will:

  - Create sample data
  - Run the complete matching pipeline
  - Generate outputs in `data/processed/`
  - Display match statistics

- Use your own data:

  ```python
  from matching.pipeline import run_matching_pipeline

  results = run_matching_pipeline(
      owners_file="path/to/owners.csv",
      transactions_file="path/to/transactions.csv"
  )
  ```

- Start the review UI (if manual review is needed):

  ```bash
  streamlit run ui/review_app.py
  ```

- Start the API server:

  ```bash
  python main.py
  ```

  The server will start on http://localhost:8000
- Method: Exact join on composite keys
- Composite key: `project_clean + building_clean + unit_no`
- Validation: Area difference ≤ 1%
- Confidence: 1.00 (High bucket)
- Action: Auto-accept
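The tier-1 rules above can be sketched in a few lines. This uses plain dicts rather than the project's pandas implementation, and names like `make_key` and the simplified `normalise` are illustrative; the composite key and the 1% area check mirror the rules as stated.

```python
def normalise(text):
    """Lower-case and collapse whitespace -- a stand-in for the
    fuller cleaning in matching/preprocess.py."""
    return " ".join(str(text).lower().split())

def make_key(record):
    # Composite key: project_clean + building_clean + unit_no
    return (normalise(record["project"]),
            normalise(record["building"]),
            str(record["unit_number"]).strip())

def tier1_match(owners, transactions, max_area_diff=0.01):
    """Exact join on the composite key, validated by a <= 1% area difference."""
    txn_index = {}
    for txn in transactions:
        txn_index.setdefault(make_key(txn), []).append(txn)
    matches = []
    for owner in owners:
        for txn in txn_index.get(make_key(owner), []):
            if abs(owner["area"] - txn["area"]) / txn["area"] <= max_area_diff:
                matches.append({"owner": owner, "txn": txn, "confidence": 1.0})
    return matches

# Key fields match after normalisation; areas differ by ~0.4%, within tolerance.
owners = [{"project": "Dubai Hills", "building": "Park Heights 1",
           "unit_number": "101", "area": 120.0}]
transactions = [{"project": "dubai hills", "building": "PARK HEIGHTS 1",
                 "unit_number": "101", "area": 119.5}]
matches = tier1_match(owners, transactions)
```

Indexing transactions by key first makes the join linear in the size of both tables rather than a quadratic pairwise scan.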
- Method: Fuzzy string similarity with scoring
- Blocking: Same project
- Scoring formula:

  ```python
  building_sim = token_set_ratio(bldg_owner, bldg_txn) / 100
  unit_match = 1.0 if unit_exact else 0.0
  area_score = max(0, 1 - abs(area_pct_diff) / 0.02)
  score = 0.5 * building_sim + 0.3 * unit_match + 0.2 * area_score
  ```

- Thresholds:
  - High: score ≥ 0.90 (auto-accept)
  - Medium: 0.85 ≤ score < 0.90 (optional review)
  - Low: 0.75 ≤ score < 0.85 (manual review required)
  - Reject: score < 0.75
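The scoring and bucketing above can be exercised end to end. The real pipeline uses `token_set_ratio` (e.g. from rapidfuzz); the token-overlap similarity below is only a dependency-free stand-in, so treat it as an assumption rather than the pipeline's actual scorer.

```python
def token_set_sim(a, b):
    """Crude stand-in for token_set_ratio/100: Jaccard overlap of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def tier2_score(bldg_owner, bldg_txn, unit_exact, area_pct_diff):
    # Weighted blend from the scoring formula above.
    building_sim = token_set_sim(bldg_owner, bldg_txn)
    unit_match = 1.0 if unit_exact else 0.0
    area_score = max(0, 1 - abs(area_pct_diff) / 0.02)
    return 0.5 * building_sim + 0.3 * unit_match + 0.2 * area_score

def bucket(score):
    """Map a score to the threshold buckets listed above."""
    if score >= 0.90:
        return "high"
    if score >= 0.85:
        return "medium"
    if score >= 0.75:
        return "low"
    return "reject"

# Same building tokens, exact unit, identical area -> perfect score.
s = tier2_score("Park Heights Tower 1", "park heights tower 1", True, 0.0)
```

Note how the 0.02 divisor means any area difference beyond 2% zeroes out `area_score`, so area can only ever subtract the 0.2 weight, never veto a match outright.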
- Input: Low-confidence matches and pairs with multiple candidates
- Interface: Streamlit web application
- Actions: Approve/Reject/Skip
- Output: Human decisions merged back into pipeline
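A sketch of how the Approve/Reject/Skip decisions might be merged back. The real pipeline reads and writes parquet under data/review/; the `pair_id` field and verdict labels here are assumed shapes, not the project's actual schema.

```python
def merge_decisions(candidate_pairs, decisions):
    """Keep approved pairs, drop rejected ones, and leave skipped
    pairs in the queue for the next review round."""
    accepted, still_pending = [], []
    for pair in candidate_pairs:
        verdict = decisions.get(pair["pair_id"], "skip")
        if verdict == "approve":
            accepted.append(pair)
        elif verdict == "skip":
            still_pending.append(pair)
        # "reject" verdicts discard the pair entirely
    return accepted, still_pending

pairs = [{"pair_id": 1, "score": 0.82},
         {"pair_id": 2, "score": 0.78},
         {"pair_id": 3, "score": 0.76}]
decisions = {1: "approve", 2: "reject"}  # pair 3 was skipped
accepted, pending = merge_decisions(pairs, decisions)
```

Treating missing decisions as "skip" keeps the review loop idempotent: re-running the merge never silently accepts or rejects a pair no human has looked at.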
The pipeline generates the following outputs in `data/processed/`:

- `matches.parquet`: Final matches with confidence scores
- `owners_unmatched.csv`: Owners that couldn't be matched
- `transactions_unmatched.csv`: Transactions that couldn't be matched
- `qa_report.md`: Comprehensive quality-assurance report
Owner records should include:

- `project`: Property project name
- `building`: Building/tower name
- `unit_number`: Unit or apartment number
- `area`: Property area (sqm or sqft)
- `owner_name`: Owner name

Transaction records should include:

- `project`: Property project name
- `building`: Building/tower name
- `unit_number`: Unit or apartment number
- `area`: Property area (sqm or sqft)
- `buyer_name`: Buyer name
If your data uses different column names, update the `column_mapping` dictionaries in `matching/preprocess.py` (around lines 150 and 200).
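The mapping is presumably a plain rename dict from your source headers to the expected names. The exact keys in `matching/preprocess.py` may differ, so treat the shape below (and `apply_mapping`) as an assumption:

```python
# Hypothetical shape of a column_mapping dict: your source column
# names (left) mapped to the pipeline's expected names (right).
column_mapping = {
    "Project Name": "project",
    "Tower": "building",
    "Unit No": "unit_number",
    "Size (sqm)": "area",
    "Registered Owner": "owner_name",
}

def apply_mapping(row, mapping):
    """Rename one record's keys; unmapped columns pass through unchanged."""
    return {mapping.get(k, k): v for k, v in row.items()}

row = {"Project Name": "Dubai Hills", "Tower": "B1", "Unit No": "101",
       "Size (sqm)": "120", "Registered Owner": "A. Smith"}
renamed = apply_mapping(row, column_mapping)
```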
Run the test suite:

```bash
pytest tests/
```

Run specific tests:

```bash
pytest tests/test_preprocess.py -v
```

Once the server is running, visit:

- Interactive API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc
```python
from matching.pipeline import run_matching_pipeline

# Run the complete pipeline
results = run_matching_pipeline(
    owners_file="data/raw/owners/20250716/Dubai Hills.xlsx",
    transactions_file="data/raw/transactions/20250716/transactions.csv"
)

# Access results
print(f"Total matches: {results['data_volumes']['total_matches']}")
print(f"Match rate: {results['match_rates']['owner_match_rate']:.1%}")
```

- Start the review application:

  ```bash
  streamlit run ui/review_app.py
  ```

- Load review pairs from `data/review/pairs.parquet`
- Review each pair and make decisions
- Save decisions to `data/review/decisions_*.parquet`
```python
from matching.preprocess import preprocess_owners, preprocess_transactions
from matching.deterministic import tier1_deterministic_match
from matching.fuzzy import tier2_fuzzy_match

# Preprocess data
owners_clean = preprocess_owners(owners_df)
transactions_clean = preprocess_transactions(transactions_df)

# Run individual tiers
tier1_matches, unmatched_owners, unmatched_txns = tier1_deterministic_match(
    owners_clean, transactions_clean
)
tier2_matches, final_unmatched_owners, final_unmatched_txns = tier2_fuzzy_match(
    unmatched_owners, unmatched_txns
)
```

- Logs: Written to `logs/matching_pipeline.log`
- Metrics: Pipeline statistics in QA reports
- Progress: Real-time logging of each pipeline step
- Memory efficient: Processes large datasets in chunks
- Scalable: Modular design allows for parallel processing
- Fast: Optimized algorithms for deterministic and fuzzy matching
- Follow the existing code style (snake_case for Python)
- Add tests for new functionality
- Update documentation for any changes
- Ensure all tests pass before submitting
This project is licensed under the MIT License - see the LICENSE file for details.