A Python 3.11 data-processing pipeline that links Dubai Hills property-owner records to DLD transaction records with a target of ≥ 98% precision and recall. The pipeline runs two automated matching tiers (deterministic and fuzzy) plus a lightweight manual-review UI as the third tier.
- Three-tier matching system: Deterministic, fuzzy, and manual review
- High precision matching: ≥ 98% precision and recall target
- Automated preprocessing: Data cleaning, normalization, and composite key generation
- Fuzzy string matching: Advanced similarity scoring for non-exact matches
- Manual review interface: Streamlit-based UI for human validation
- Comprehensive reporting: QA reports and match statistics
- Modular architecture: Clean separation of concerns
- Memory-efficient processing: Handles large CSV files without loading them entirely into memory
- Column removal: Remove irrelevant columns from CSV data
- File splitting: Split large files into smaller chunks based on size limits
- RESTful API: FastAPI-based web service with automatic documentation
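The memory-efficient processing, column removal, and file splitting features above can be sketched with the standard-library `csv` module. This is an illustrative stand-in, not the pipeline's actual API: the function name, chunk size, and demo data are all assumptions.

```python
import csv
import os
import tempfile

def split_and_prune_csv(src_path, out_dir, drop_columns, rows_per_chunk=1000):
    """Stream a CSV row by row, drop unwanted columns, and write
    fixed-size chunk files -- without loading the file into memory."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths, buffer, keep = [], [], None

    def flush():
        part = os.path.join(out_dir, f"part_{len(chunk_paths) + 1:03d}.csv")
        with open(part, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=keep)
            writer.writeheader()
            writer.writerows(buffer)
        chunk_paths.append(part)
        buffer.clear()

    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        keep = [c for c in reader.fieldnames if c not in drop_columns]
        for row in reader:
            buffer.append({c: row[c] for c in keep})
            if len(buffer) >= rows_per_chunk:
                flush()
    if buffer:
        flush()  # write the final partial chunk
    return chunk_paths

# Demo on a tiny temporary CSV (chunk size 2 keeps the example visible).
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "owners.csv")
with open(src, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["project", "building", "unit_number", "internal_id"])
    w.writerows([["DH", "B1", "101", "x1"],
                 ["DH", "B1", "102", "x2"],
                 ["DH", "B2", "201", "x3"]])

parts = split_and_prune_csv(src, os.path.join(tmp, "parts"),
                            drop_columns={"internal_id"}, rows_per_chunk=2)
```

Buffering only `rows_per_chunk` rows at a time keeps peak memory flat regardless of input size, which is the property the feature list promises.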
```
repo-root/
│
├── data/
│   ├── raw/
│   │   ├── owners/20250716/Dubai Hills.xlsx
│   │   └── transactions/20250716/
│   │       ├── corrected_transactions_part_001.csv
│   │       └── …
│   ├── processed/           # pipeline output written here
│   └── review/              # parquet with human decisions
│
├── matching/
│   ├── __init__.py
│   ├── preprocess.py        # cleaning + normalisation helpers
│   ├── deterministic.py     # tier-1 exact join
│   ├── fuzzy.py             # tier-2 fuzzy scorer
│   ├── review_helpers.py    # load & merge human approvals
│   └── pipeline.py          # main pipeline orchestration
│
├── ui/
│   └── review_app.py        # Streamlit manual-review interface
│
├── tests/
│   └── test_preprocess.py   # pytest unit tests
│
├── Dockerfile
├── requirements.txt
└── README.md
```
On macOS/Linux:

```bash
chmod +x install.sh
./install.sh
```

On Windows:

```bash
install.bat
```

Manual setup:

- Clone or download the project files
- Create a virtual environment:

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment:

  On macOS/Linux:

  ```bash
  source venv/bin/activate
  ```

  On Windows:

  ```bash
  venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

After installation, activate the virtual environment:

On macOS/Linux:

```bash
source activate_env.sh
```

On Windows:

```bash
activate_env.bat
```

- Run the example pipeline:

  ```bash
  python example_matching.py
  ```

  This will:

  - Create sample data
  - Run the complete matching pipeline
  - Generate outputs in `data/processed/`
  - Display match statistics

- Use your own data:

  ```python
  from matching.pipeline import run_matching_pipeline

  results = run_matching_pipeline(
      owners_file="path/to/owners.csv",
      transactions_file="path/to/transactions.csv"
  )
  ```

- Start the review UI (if manual review is needed):

  ```bash
  streamlit run ui/review_app.py
  ```

- Start the API server:

  ```bash
  python main.py
  ```

  The server will start on http://localhost:8000
- Method: Exact join on composite keys
- Composite key: `project_clean + building_clean + unit_no`
- Validation: Area difference ≤ 1%
- Confidence: 1.00 (High bucket)
- Action: Auto-accept
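The tier-1 rules above can be sketched in a few lines. This uses plain dicts rather than the project's pandas implementation, and names like `make_key` and the simplified `normalise` are illustrative; the composite key and the 1% area check mirror the rules as stated.

```python
def normalise(text):
    """Lower-case and collapse whitespace -- a stand-in for the
    fuller cleaning in matching/preprocess.py."""
    return " ".join(str(text).lower().split())

def make_key(record):
    # Composite key: project_clean + building_clean + unit_no
    return (normalise(record["project"]),
            normalise(record["building"]),
            str(record["unit_number"]).strip())

def tier1_match(owners, transactions, max_area_diff=0.01):
    """Exact join on the composite key, validated by a <= 1% area difference."""
    txn_index = {}
    for txn in transactions:
        txn_index.setdefault(make_key(txn), []).append(txn)
    matches = []
    for owner in owners:
        for txn in txn_index.get(make_key(owner), []):
            if abs(owner["area"] - txn["area"]) / txn["area"] <= max_area_diff:
                matches.append({"owner": owner, "txn": txn, "confidence": 1.0})
    return matches

# Key fields match after normalisation; areas differ by ~0.4%, within tolerance.
owners = [{"project": "Dubai Hills", "building": "Park Heights 1",
           "unit_number": "101", "area": 120.0}]
transactions = [{"project": "dubai hills", "building": "PARK HEIGHTS 1",
                 "unit_number": "101", "area": 119.5}]
matches = tier1_match(owners, transactions)
```

Indexing transactions by key first makes the join linear in the size of both tables rather than a quadratic pairwise scan.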
- Method: Fuzzy string similarity with scoring
- Blocking: Same project
- Scoring formula:

  ```python
  building_sim = token_set_ratio(bldg_owner, bldg_txn) / 100
  unit_match = 1.0 if unit_exact else 0.0
  area_score = max(0, 1 - abs(area_pct_diff) / 0.02)
  score = 0.5 * building_sim + 0.3 * unit_match + 0.2 * area_score
  ```

- Thresholds:
  - High: score ≥ 0.90 (auto-accept)
  - Medium: 0.85 ≤ score < 0.90 (optional review)
  - Low: 0.75 ≤ score < 0.85 (manual review required)
  - Reject: score < 0.75
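The scoring and bucketing above can be exercised end to end. The real pipeline uses `token_set_ratio` (e.g. from rapidfuzz); the token-overlap similarity below is only a dependency-free stand-in, so treat it as an assumption rather than the pipeline's actual scorer.

```python
def token_set_sim(a, b):
    """Crude stand-in for token_set_ratio/100: Jaccard overlap of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def tier2_score(bldg_owner, bldg_txn, unit_exact, area_pct_diff):
    # Weighted blend from the scoring formula above.
    building_sim = token_set_sim(bldg_owner, bldg_txn)
    unit_match = 1.0 if unit_exact else 0.0
    area_score = max(0, 1 - abs(area_pct_diff) / 0.02)
    return 0.5 * building_sim + 0.3 * unit_match + 0.2 * area_score

def bucket(score):
    """Map a score to the threshold buckets listed above."""
    if score >= 0.90:
        return "high"
    if score >= 0.85:
        return "medium"
    if score >= 0.75:
        return "low"
    return "reject"

# Same building tokens, exact unit, identical area -> perfect score.
s = tier2_score("Park Heights Tower 1", "park heights tower 1", True, 0.0)
```

Note how the 0.02 divisor means any area difference beyond 2% zeroes out `area_score`, so area can only ever subtract the 0.2 weight, never veto a match outright.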
- Input: Low-confidence matches and pairs with multiple candidates
- Interface: Streamlit web application
- Actions: Approve/Reject/Skip
- Output: Human decisions merged back into pipeline
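A sketch of how the Approve/Reject/Skip decisions might be merged back. The real pipeline reads and writes parquet under data/review/; the `pair_id` field and verdict labels here are assumed shapes, not the project's actual schema.

```python
def merge_decisions(candidate_pairs, decisions):
    """Keep approved pairs, drop rejected ones, and leave skipped
    pairs in the queue for the next review round."""
    accepted, still_pending = [], []
    for pair in candidate_pairs:
        verdict = decisions.get(pair["pair_id"], "skip")
        if verdict == "approve":
            accepted.append(pair)
        elif verdict == "skip":
            still_pending.append(pair)
        # "reject" verdicts discard the pair entirely
    return accepted, still_pending

pairs = [{"pair_id": 1, "score": 0.82},
         {"pair_id": 2, "score": 0.78},
         {"pair_id": 3, "score": 0.76}]
decisions = {1: "approve", 2: "reject"}  # pair 3 was skipped
accepted, pending = merge_decisions(pairs, decisions)
```

Treating missing decisions as "skip" keeps the review loop idempotent: re-running the merge never silently accepts or rejects a pair no human has looked at.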
The pipeline generates the following outputs in `data/processed/`:

- `matches.parquet`: Final matches with confidence scores
- `owners_unmatched.csv`: Owners that couldn't be matched
- `transactions_unmatched.csv`: Transactions that couldn't be matched
- `qa_report.md`: Comprehensive quality-assurance report
Owner records should include:

- `project`: Property project name
- `building`: Building/tower name
- `unit_number`: Unit or apartment number
- `area`: Property area (sqm or sqft)
- `owner_name`: Owner name

Transaction records should include:

- `project`: Property project name
- `building`: Building/tower name
- `unit_number`: Unit or apartment number
- `area`: Property area (sqm or sqft)
- `buyer_name`: Buyer name
If your data uses different column names, update the `column_mapping` dictionaries in `matching/preprocess.py` (around lines 150 and 200).
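The mapping is presumably a plain rename dict from your source headers to the expected names. The exact keys in `matching/preprocess.py` may differ, so treat the shape below (and `apply_mapping`) as an assumption:

```python
# Hypothetical shape of a column_mapping dict: your source column
# names (left) mapped to the pipeline's expected names (right).
column_mapping = {
    "Project Name": "project",
    "Tower": "building",
    "Unit No": "unit_number",
    "Size (sqm)": "area",
    "Registered Owner": "owner_name",
}

def apply_mapping(row, mapping):
    """Rename one record's keys; unmapped columns pass through unchanged."""
    return {mapping.get(k, k): v for k, v in row.items()}

row = {"Project Name": "Dubai Hills", "Tower": "B1", "Unit No": "101",
       "Size (sqm)": "120", "Registered Owner": "A. Smith"}
renamed = apply_mapping(row, column_mapping)
```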
Run the test suite:

```bash
pytest tests/
```

Run specific tests:

```bash
pytest tests/test_preprocess.py -v
```

Once the server is running, visit:

- Interactive API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc
```python
from matching.pipeline import run_matching_pipeline

# Run the complete pipeline
results = run_matching_pipeline(
    owners_file="data/raw/owners/20250716/Dubai Hills.xlsx",
    transactions_file="data/raw/transactions/20250716/transactions.csv"
)

# Access results
print(f"Total matches: {results['data_volumes']['total_matches']}")
print(f"Match rate: {results['match_rates']['owner_match_rate']:.1%}")
```

- Start the review application:

  ```bash
  streamlit run ui/review_app.py
  ```

- Load review pairs from `data/review/pairs.parquet`
- Review each pair and make decisions
- Save decisions to `data/review/decisions_*.parquet`
```python
from matching.preprocess import preprocess_owners, preprocess_transactions
from matching.deterministic import tier1_deterministic_match
from matching.fuzzy import tier2_fuzzy_match

# Preprocess data
owners_clean = preprocess_owners(owners_df)
transactions_clean = preprocess_transactions(transactions_df)

# Run individual tiers
tier1_matches, unmatched_owners, unmatched_txns = tier1_deterministic_match(
    owners_clean, transactions_clean
)
tier2_matches, final_unmatched_owners, final_unmatched_txns = tier2_fuzzy_match(
    unmatched_owners, unmatched_txns
)
```

- Logs: Written to `logs/matching_pipeline.log`
- Metrics: Pipeline statistics in QA reports
- Progress: Real-time logging of each pipeline step
- Memory efficient: Processes large datasets in chunks
- Scalable: Modular design allows for parallel processing
- Fast: Optimized algorithms for deterministic and fuzzy matching
- Follow the existing code style (snake_case for Python)
- Add tests for new functionality
- Update documentation for any changes
- Ensure all tests pass before submitting
This project is licensed under the MIT License - see the LICENSE file for details.