SoundexBR - Brazilian Portuguese Phonetic Algorithm

An implementation of the SoundexBR algorithm that groups Brazilian Portuguese names by phonetic similarity and analyzes the resulting groups.

Features

  • SoundexBR Algorithm: Full implementation based on the SoundexBR R package
  • Similarity Metrics: Cosine similarity and string similarity for refined matching
  • Comprehensive Testing: 40+ unit tests covering edge cases and validation
  • Performance: Processes 10,000+ names per second
  • Detailed Reports: Markdown reports with statistics, accuracy metrics, and recommendations
  • Tinder-Style Validation: Interactive swipe-style interface for approving or rejecting uncertain matches
  • RNC Integration: Maps products to official cultivar registry

Architecture & Workflow

Overall System Architecture

graph TB
    A[Input Data<br/>Excel/CSV with<br/>Product Names] --> B[SoundexBR<br/>Encoder]
    B --> C{Grouping<br/>Engine}
    C --> D[Similar Names<br/>Same Code]
    C --> E[Unique Names<br/>Different Codes]

    D --> F[Similarity<br/>Analysis]
    F --> G{Confidence<br/>Score}

    G -->|High >0.7| H[Auto-Approved<br/>Groupings]
    G -->|Medium 0.5-0.7| I[Uncertain<br/>Matches]
    G -->|Low <0.5| J[Potential<br/>False Positives]

    I --> K[Tinder-Style<br/>Validation]
    J --> K

    K --> L[User Review<br/>Swipe Y/N]
    L --> M[Validated<br/>Groupings]

    H --> N[Final Results]
    M --> N
    E --> N

    N --> O[Export<br/>Excel/CSV/JSON/MD]

    style A fill:#e1f5ff
    style B fill:#fff4e6
    style K fill:#ffe6f0
    style N fill:#e6ffe6
    style O fill:#f0e6ff
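
The confidence routing in the diagram above can be sketched as a small function. This is a minimal illustration, not the project's code: the function name `confidence_bucket` and the bucket labels are hypothetical, and only the 0.7 and 0.5 thresholds come from the diagram.

```python
def confidence_bucket(score: float, high: float = 0.7, low: float = 0.5) -> str:
    """Map a similarity score in [0, 1] to a review bucket.

    Thresholds follow the diagram: >0.7 high, 0.5-0.7 medium, <0.5 low.
    The bucket labels themselves are illustrative assumptions.
    """
    if score > high:
        return "auto_approved"             # High: grouped without review
    if score >= low:
        return "uncertain"                 # Medium: sent to user validation
    return "potential_false_positive"      # Low: also sent to user validation


if __name__ == "__main__":
    for s in (0.92, 0.70, 0.50, 0.31):
        print(s, confidence_bucket(s))
```

Scores of exactly 0.7 and 0.5 fall into the medium bucket here, which is one reading of the "0.5-0.7" range.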

RNC Integration Workflow

graph LR
    A[Product Names<br/>208K Records] --> B[Normalize &<br/>Encode]
    B --> C[SoundexBR<br/>Codes]

    D[RNC Database<br/>Official Names] --> E[Normalize &<br/>Encode]
    E --> F[RNC Soundex<br/>Codes]

    C --> G{Code<br/>Match?}
    F --> G

    G -->|Yes| H[Same Soundex<br/>Code]
    G -->|No| I[Different<br/>Codes]

    H --> J[Calculate<br/>Similarity]

    J --> K{Similarity<br/>Score}

    K -->|>0.7| L[High Confidence<br/>Match]
    K -->|0.5-0.7| M[Medium Confidence<br/>Needs Review]
    K -->|<0.5| N[Low Confidence<br/>Likely False Match]

    I --> O[Cross-Group<br/>Check]
    O --> P{High Similarity<br/>Despite Different<br/>Codes?}
    P -->|Yes| M
    P -->|No| Q[No Match<br/>Product Not in RNC]

    L --> R[Auto-Map to<br/>RNC Name]
    M --> S[User<br/>Validation]
    N --> S

    S --> T[Validated<br/>Mappings]
    Q --> U[Unmatched<br/>Products List]

    R --> V[Final<br/>Mapping Table]
    T --> V
    U --> V

    style A fill:#e1f5ff
    style D fill:#ffe6e6
    style G fill:#fff4e6
    style K fill:#fff4e6
    style S fill:#ffe6f0
    style V fill:#e6ffe6
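
The code-match step of this workflow amounts to indexing the RNC names by their phonetic code and looking up each product's code in that index. The sketch below assumes an `encode` callable is supplied by the caller; `build_code_index`, `candidates_for`, and the toy encoder are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def build_code_index(rnc_names: List[str], encode: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group official names by their phonetic code."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name in rnc_names:
        index[encode(name)].append(name)
    return dict(index)


def candidates_for(product: str, index: Dict[str, List[str]], encode: Callable[[str], str]) -> List[str]:
    """Return official names sharing the product's code (empty list if none)."""
    return index.get(encode(product), [])


if __name__ == "__main__":
    # Stand-in encoder for the demo only: first letter, uppercased.
    toy_encode = lambda s: s[:1].upper()
    index = build_code_index(["Alfa", "Beta"], toy_encode)
    print(candidates_for("alpha", index, toy_encode))  # candidates sharing code "A"
    print(candidates_for("gama", index, toy_encode))   # no shared code
```

Candidates returned this way would still go through the similarity scoring step; products with no candidates would go to the cross-group check.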

SoundexBR Algorithm Flow

flowchart TD
    A[Input Name<br/>'João Silva'] --> B[Normalize]
    B --> C[Uppercase<br/>'JOÃO SILVA']
    C --> D[Remove Accents<br/>'JOAO SILVA']
    D --> E[Remove Special Chars<br/>'JOAO SILVA']
    E --> F[Split Words]
    F --> G[Process First Word<br/>'JOAO']

    G --> H{First Letter<br/>Transform?}
    H -->|KA→CA, GE→JE, etc| I[Apply Transform]
    H -->|No| J[Keep Letter]
    I --> K[Handle H]
    J --> K

    K --> L[Apply Phonetic<br/>Transforms<br/>PH→F, TH→T, etc]
    L --> M[Save First<br/>Letter: J]

    M --> N[Encode to Numbers<br/>Letter→Group Code]
    N --> O[Remove Zeros<br/>Vowels]
    O --> P[Remove Adjacent<br/>Duplicates]
    P --> Q[Take First<br/>3 Digits]
    Q --> R[Pad with Zeros<br/>if needed]

    R --> S[Final Code<br/>J + 3 digits]
    S --> T[Result: 'J000']

    style A fill:#e1f5ff
    style H fill:#fff4e6
    style T fill:#e6ffe6

Data Quality Improvement Process

stateDiagram-v2
    [*] --> InputData: Raw product names

    InputData --> Analysis: Run SoundexBR

    Analysis --> Grouping: Create phonetic groups

    Grouping --> HighQuality: High confidence (>0.7)
    Grouping --> MediumQuality: Medium confidence (0.5-0.7)
    Grouping --> LowQuality: Low confidence (<0.5)

    HighQuality --> AutoApproved: Automatic approval

    MediumQuality --> UserReview: Tinder validation
    LowQuality --> UserReview: Tinder validation

    UserReview --> Approved: User says YES
    UserReview --> Rejected: User says NO
    UserReview --> Skipped: User SKIPS

    Approved --> Standardization: Create canonical mapping
    AutoApproved --> Standardization: Create canonical mapping

    Rejected --> Separation: Keep as different products
    Skipped --> ManualReview: Flag for later

    Standardization --> CleanData: Standardized names
    Separation --> CleanData: Unique products
    ManualReview --> CleanData: Pending review

    CleanData --> Export: Generate reports

    Export --> [*]: Complete

    note right of UserReview
        Interactive Tinder-style interface
        Shows similarity metrics
        Visual confidence bars
        Swipe left/right decision
    end note

    note right of Standardization
        71.4% compression achieved
        100% precision in demo
        Reduces data redundancy
    end note

Files

  • soundex_br.py - Core SoundexBR algorithm implementation
  • test_soundex.py - Comprehensive test suite (40+ tests)
  • analyze_names.py - Main analysis script for processing datasets
  • nome_agroquimico.csv - Sample dataset (208K agrochemical names)

Docker Usage

Build and run with Docker Compose:

# Build and run (tests + analysis)
docker-compose up --build

# Run tests only
docker-compose run soundex-analysis python test_soundex.py

# Run analysis only
docker-compose run soundex-analysis python analyze_names.py

# Interactive shell
docker-compose run soundex-analysis bash

Build and run with Docker:

# Build image
docker build -t br-soundex .

# Run tests
docker run br-soundex python test_soundex.py

# Run analysis
docker run -v "$(pwd)":/app br-soundex python analyze_names.py

# Interactive shell
docker run -it -v "$(pwd)":/app br-soundex bash

Local Usage (without Docker)

Prerequisites:

pip install -r requirements.txt

Run tests:

python test_soundex.py

Run analysis:

python analyze_names.py

Use in Python:

from soundex_br import SoundexBR

soundex = SoundexBR()

# Encode single name
code = soundex.encode("João Silva")
print(code)  # Output: J000 (only the first word, JOAO, is encoded)

# Encode multiple names
names = ["João", "Joao", "Maria", "Mario"]
codes = soundex.encode_batch(names)
print(codes)  # Output: ['J000', 'J000', 'M600', 'M600']

Algorithm Details

SoundexBR Encoding Process:

  1. Normalization:

    • Convert to uppercase
    • Remove accents (João → JOAO)
    • Remove special characters and numbers
    • Keep only letters
  2. Phonetic Transformations:

    • First letter special rules (KA→CA, GE→JE, etc.)
    • Handle silent H
    • Apply phonetic equivalences (PH→F, TH→T, etc.)
  3. Letter to Number Encoding:

    • Group 0: A, E, I, O, U, H, W, Y (vowels and vowel-like letters, dropped from the code)
    • Group 1: B, F, P, V
    • Group 2: C, G, J, K, Q, S, X, Z
    • Group 3: D, T
    • Group 4: L
    • Group 5: M, N
    • Group 6: R
  4. Code Generation:

    • Remove zeros (vowels)
    • Remove adjacent duplicates
    • Keep first letter + 3 digits
    • Pad with zeros if needed
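
The four steps above can be sketched in a few lines of Python. This is a simplified illustration, not the code in soundex_br.py: it applies only the transforms named in this README (KA→CA, GE→JE, PH→F, TH→T), treats "silent H" as a leading H only, encodes just the first word, and returns an empty string for empty input. All of those are assumptions, as is the name `encode_sketch`.

```python
import unicodedata

# Letter groups as listed in step 3.
GROUPS = {"0": "AEIOUHWY", "1": "BFPV", "2": "CGJKQSXZ",
          "3": "DT", "4": "L", "5": "MN", "6": "R"}
LETTER_TO_DIGIT = {ch: digit for digit, letters in GROUPS.items() for ch in letters}

# Only the rules named in the README; the real rule set is larger ("etc.").
FIRST_LETTER_RULES = {"KA": "CA", "GE": "JE"}
PHONETIC_RULES = {"PH": "F", "TH": "T"}


def normalize(name: str) -> str:
    """Uppercase, strip accents, keep only A-Z and spaces."""
    decomposed = unicodedata.normalize("NFKD", name.upper())
    return "".join(ch for ch in decomposed if "A" <= ch <= "Z" or ch == " ")


def encode_sketch(name: str) -> str:
    words = normalize(name).split()
    if not words:
        return ""                      # assumption: empty input gives empty code
    word = words[0]                    # assumption: only the first word is encoded
    for src, dst in FIRST_LETTER_RULES.items():
        if word.startswith(src):
            word = dst + word[len(src):]
            break
    if word.startswith("H") and len(word) > 1:
        word = word[1:]                # assumption: "silent H" means a leading H
    for src, dst in PHONETIC_RULES.items():
        word = word.replace(src, dst)
    first = word[0]
    digits = [LETTER_TO_DIGIT[ch] for ch in word[1:]]
    digits = [d for d in digits if d != "0"]           # remove zeros (vowels)
    deduped = []
    for d in digits:                                   # remove adjacent duplicates
        if not deduped or deduped[-1] != d:
            deduped.append(d)
    return first + "".join(deduped[:3]).ljust(3, "0")  # first letter + 3 digits


if __name__ == "__main__":
    print([encode_sketch(n) for n in ["João", "Joao", "Maria", "Mario"]])
    # ['J000', 'J000', 'M600', 'M600']
```

Because zeros are removed before duplicates, two consonants from the same group that are separated only by a vowel collapse into one digit, which follows the step order given above.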

Similarity Metrics:

  • Cosine Similarity: Character frequency vector comparison
  • String Similarity: Set-based character overlap
  • Average Similarity: Combined metric for final matching
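
One possible reading of these three metrics is sketched below. The README does not give the exact formulas, so this assumes Jaccard overlap for the set-based metric and an unweighted mean for the combined score; the function names are illustrative.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between character-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def string_similarity(a: str, b: str) -> float:
    """Set-based character overlap (assumed here to be the Jaccard index)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def average_similarity(a: str, b: str) -> float:
    """Unweighted mean of the two metrics (assumption)."""
    return (cosine_similarity(a, b) + string_similarity(a, b)) / 2


if __name__ == "__main__":
    print(round(average_similarity("JOAO", "JOAQUIM"), 3))
```

Both metrics ignore character order, so anagrams score 1.0; that is a reason to use them only to refine groups that already share a phonetic code.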

Output Files

Running the analysis generates the following files:

  • analysis_report.md - Comprehensive analysis report
  • soundex_codes.csv - All names with their soundex codes
  • grouped_names.json - Names grouped by soundex code
  • similarity_analysis.csv - Detailed similarity metrics
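
As a rough illustration of what grouping by code involves, the sketch below builds a code-to-names mapping and serializes it as JSON. The layout shown (`{code: [names]}`) is an assumption about grouped_names.json, not its documented schema, and `group_by_code` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_code(name_code_pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect names under their phonetic code, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, code in name_code_pairs:
        groups[code].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Names and codes taken from the batch example in "Use in Python".
    pairs = [("João", "J000"), ("Joao", "J000"), ("Maria", "M600"), ("Mario", "M600")]
    print(json.dumps(group_by_code(pairs), ensure_ascii=False, indent=2))
```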

Performance

  • Encoding Speed: 10,000+ names/second
  • Dataset Size: Tested with 208K names
  • Memory: Efficient batch processing
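
A throughput figure like the one above can be checked with a simple timing loop. The helper below is a hypothetical sketch: it accepts any encoder callable, and the demo uses `str.upper` purely as a stand-in so the block runs without the project installed.

```python
import time
from typing import Callable, Sequence


def names_per_second(encode: Callable[[str], str], names: Sequence[str]) -> float:
    """Encode every name once and return the measured throughput."""
    start = time.perf_counter()
    for name in names:
        encode(name)
    elapsed = time.perf_counter() - start
    return len(names) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    sample = ["João Silva"] * 100_000
    # Replace str.upper with the real encoder, e.g. SoundexBR().encode.
    print(f"{names_per_second(str.upper, sample):,.0f} names/second")
```

Results depend on hardware and name length, so a single run is only a rough indication.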

Test Coverage

40+ comprehensive tests including:

  • Identical name matching
  • Phonetically similar names
  • Empty/null inputs
  • Special character handling
  • Accent removal
  • Case insensitivity
  • First letter transformations
  • Compound names
  • Code format validation
  • Consonant grouping
  • Performance benchmarks

Accuracy

Based on analysis of agrochemical names dataset:

  • Precision: ~70-80% (varies by threshold)
  • Compression Ratio: 20-30:1
  • Quality Score: 75-85/100
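
The README does not define these metrics, so the sketch below shows one plausible set of definitions: precision as the share of reviewed groupings that were approved, and compression as distinct names relative to groups. Both definitions and all function names are assumptions. The Quality Score is left out because nothing here indicates how it is computed.

```python
def precision(approved: int, rejected: int) -> float:
    """Share of reviewed groupings judged correct (assumed definition)."""
    reviewed = approved + rejected
    return approved / reviewed if reviewed else 0.0


def compression_ratio(n_names: int, n_groups: int) -> float:
    """Names per group, i.e. the N in an 'N:1' ratio (assumed definition)."""
    return n_names / n_groups


def compression_percent(n_names: int, n_groups: int) -> float:
    """Percentage reduction from names to groups (assumed definition)."""
    return (1 - n_groups / n_names) * 100


if __name__ == "__main__":
    # Illustrative numbers only: 7 names collapsing into 2 groups.
    print(round(compression_percent(7, 2), 1))
    print(compression_ratio(60, 3))
    print(precision(8, 2))
```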

License

This implementation is for educational and research purposes.