SoundexBR - Brazilian Portuguese Phonetic Algorithm

Comprehensive implementation of the SoundexBR algorithm for grouping and analyzing phonetically similar Brazilian Portuguese names.

Features

- SoundexBR Algorithm: Full implementation based on the SoundexBR R package
- Similarity Metrics: Cosine similarity and string similarity for refined matching
- Comprehensive Testing: 40+ unit tests covering edge cases and validation
- Performance: Processes 10,000+ names per second
- Detailed Reports: Markdown reports with statistics, accuracy metrics, and recommendations
- Tinder-Style Validation: Interactive swipe-style interface for approving or rejecting uncertain matches
- RNC Integration: Maps products to the official cultivar registry (RNC)

Architecture & Workflow

Overall System Architecture

graph TB
    A[Input Data<br/>Excel/CSV with<br/>Product Names] --> B[SoundexBR<br/>Encoder]
    B --> C{Grouping<br/>Engine}
    C --> D[Similar Names<br/>Same Code]
    C --> E[Unique Names<br/>Different Codes]

    D --> F[Similarity<br/>Analysis]
    F --> G{Confidence<br/>Score}

    G -->|High >0.7| H[Auto-Approved<br/>Groupings]
    G -->|Medium 0.5-0.7| I[Uncertain<br/>Matches]
    G -->|Low <0.5| J[Potential<br/>False Positives]

    I --> K[Tinder-Style<br/>Validation]
    J --> K

    K --> L[User Review<br/>Swipe Y/N]
    L --> M[Validated<br/>Groupings]

    H --> N[Final Results]
    M --> N
    E --> N

    N --> O[Export<br/>Excel/CSV/JSON/MD]

    style A fill:#e1f5ff
    style B fill:#fff4e6
    style K fill:#ffe6f0
    style N fill:#e6ffe6
    style O fill:#f0e6ff
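
The confidence routing in the diagram above can be sketched as a small function. This is a minimal illustration, not the project's actual code: the function name `route_by_confidence` and the bucket labels are hypothetical, and treating a score of exactly 0.7 as "medium" is an assumption read off the `>0.7` / `0.5-0.7` / `<0.5` labels.

```python
def route_by_confidence(score: float, high: float = 0.7, medium: float = 0.5) -> str:
    """Return the review bucket for one group's similarity score.

    Thresholds mirror the diagram: >0.7 high, 0.5-0.7 medium, <0.5 low.
    The bucket names are illustrative labels, not names from the project.
    """
    if score > high:
        return "auto_approved"             # goes straight to final results
    if score >= medium:
        return "uncertain"                 # sent to interactive validation
    return "potential_false_positive"      # also sent to interactive validation


if __name__ == "__main__":
    for s in (0.92, 0.70, 0.31):
        print(s, route_by_confidence(s))
```

Both the medium and low buckets end up in front of a reviewer; only the high bucket skips review.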

RNC Integration Workflow

graph LR
    A[Product Names<br/>208K Records] --> B[Normalize &<br/>Encode]
    B --> C[SoundexBR<br/>Codes]

    D[RNC Database<br/>Official Names] --> E[Normalize &<br/>Encode]
    E --> F[RNC Soundex<br/>Codes]

    C --> G{Code<br/>Match?}
    F --> G

    G -->|Yes| H[Same Soundex<br/>Code]
    G -->|No| I[Different<br/>Codes]

    H --> J[Calculate<br/>Similarity]

    J --> K{Similarity<br/>Score}

    K -->|>0.7| L[High Confidence<br/>Match]
    K -->|0.5-0.7| M[Medium Confidence<br/>Needs Review]
    K -->|<0.5| N[Low Confidence<br/>Likely False Match]

    I --> O[Cross-Group<br/>Check]
    O --> P{High Similarity<br/>Despite Different<br/>Codes?}
    P -->|Yes| M
    P -->|No| Q[No Match<br/>Product Not in RNC]

    L --> R[Auto-Map to<br/>RNC Name]
    M --> S[User<br/>Validation]
    N --> S

    S --> T[Validated<br/>Mappings]
    Q --> U[Unmatched<br/>Products List]

    R --> V[Final<br/>Mapping Table]
    T --> V
    U --> V

    style A fill:#e1f5ff
    style D fill:#ffe6e6
    style G fill:#fff4e6
    style K fill:#fff4e6
    style S fill:#ffe6f0
    style V fill:#e6ffe6
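
A rough sketch of the matching logic in the RNC workflow above, assuming the encoder and the similarity function are passed in as plain callables. The function name `map_to_rnc`, the status labels, and the choice of 0.7 as the bar for "high similarity despite different codes" are assumptions made for illustration.

```python
from collections import defaultdict


def map_to_rnc(products, rnc_names, encode, similarity, high=0.7):
    """Map each product name to (best_rnc_name_or_None, status).

    encode(name) -> phonetic code; similarity(a, b) -> float in [0, 1].
    Both are supplied by the caller; this sketch does not implement them.
    """
    # Index the registry names by their phonetic code.
    rnc_by_code = defaultdict(list)
    for name in rnc_names:
        rnc_by_code[encode(name)].append(name)

    mapping = {}
    for product in products:
        same_code = rnc_by_code.get(encode(product), [])
        # Prefer same-code candidates; otherwise do the cross-group check.
        candidates = same_code or list(rnc_names)
        if not candidates:
            mapping[product] = (None, "unmatched")
            continue
        best = max(candidates, key=lambda name: similarity(product, name))
        score = similarity(product, best)
        if same_code and score > high:
            mapping[product] = (best, "auto_mapped")
        elif same_code or score > high:
            # Same code with medium/low score, or a strong cross-group hit.
            mapping[product] = (best, "needs_review")
        else:
            mapping[product] = (None, "unmatched")
    return mapping
```

The cross-group check here compares against every registry name, which is simple but quadratic; a real run over 208K records would likely need a cheaper candidate filter.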

SoundexBR Algorithm Flow

flowchart TD
    A[Input Name<br/>'João Silva'] --> B[Normalize]
    B --> C[Uppercase<br/>'JOÃO SILVA']
    C --> D[Remove Accents<br/>'JOAO SILVA']
    D --> E[Remove Special Chars<br/>'JOAO SILVA']
    E --> F[Split Words]
    F --> G[Process First Word<br/>'JOAO']

    G --> H{First Letter<br/>Transform?}
    H -->|KA→CA, GE→JE, etc| I[Apply Transform]
    H -->|No| J[Keep Letter]
    I --> K[Handle H]
    J --> K

    K --> L[Apply Phonetic<br/>Transforms<br/>PH→F, TH→T, etc]
    L --> M[Save First<br/>Letter: J]

    M --> N[Encode to Numbers<br/>Letter→Group Code]
    N --> O[Remove Zeros<br/>Vowels]
    O --> P[Remove Adjacent<br/>Duplicates]
    P --> Q[Take First<br/>3 Digits]
    Q --> R[Pad with Zeros<br/>if needed]

    R --> S[Final Code<br/>J + 3 digits]
    S --> T[Result: 'J200']

    style A fill:#e1f5ff
    style H fill:#fff4e6
    style T fill:#e6ffe6
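
The normalization steps at the top of the flow (uppercase, strip accents, drop special characters, split words) might look like the sketch below. The helper names `normalize` and `first_word` are hypothetical, and keeping spaces until the split is an assumption made so that the "Split Words" step has something to split on.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Uppercase, strip accents, and drop everything except A-Z and spaces."""
    upper = name.upper()
    # NFD splits an accented letter into base letter + combining mark (category Mn).
    decomposed = unicodedata.normalize("NFD", upper)
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return re.sub(r"[^A-Z ]", "", no_accents).strip()


def first_word(name: str) -> str:
    """Return the first word of the normalized name, or '' for empty input."""
    parts = normalize(name).split()
    return parts[0] if parts else ""


if __name__ == "__main__":
    print(normalize("João Silva"))   # JOAO SILVA
    print(first_word("João Silva"))  # JOAO
```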

Data Quality Improvement Process

stateDiagram-v2
    [*] --> InputData: Raw product names

    InputData --> Analysis: Run SoundexBR

    Analysis --> Grouping: Create phonetic groups

    Grouping --> HighQuality: High confidence (>0.7)
    Grouping --> MediumQuality: Medium confidence (0.5-0.7)
    Grouping --> LowQuality: Low confidence (<0.5)

    HighQuality --> AutoApproved: Automatic approval

    MediumQuality --> UserReview: Tinder validation
    LowQuality --> UserReview: Tinder validation

    UserReview --> Approved: User says YES
    UserReview --> Rejected: User says NO
    UserReview --> Skipped: User SKIPS

    Approved --> Standardization: Create canonical mapping
    AutoApproved --> Standardization: Create canonical mapping

    Rejected --> Separation: Keep as different products
    Skipped --> ManualReview: Flag for later

    Standardization --> CleanData: Standardized names
    Separation --> CleanData: Unique products
    ManualReview --> CleanData: Pending review

    CleanData --> Export: Generate reports

    Export --> [*]: Complete

    note right of UserReview
        Interactive Tinder-style interface
        Shows similarity metrics
        Visual confidence bars
        Swipe left/right decision
    end note

    note right of Standardization
        71.4% compression achieved
        100% precision in demo
        Reduces data redundancy
    end note
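
One way to picture the approve / reject / skip branches of the state diagram above is a function that turns review decisions into a name mapping. Everything here is illustrative: the function name `apply_review`, the decision labels, and the choice of the first name in each group as the canonical form are assumptions, not the project's behavior.

```python
def apply_review(groups, decisions):
    """Turn review decisions into (mapping, pending).

    groups:    {code: [names]} as produced by phonetic grouping
    decisions: {code: "auto" | "yes" | "no" | "skip"}; missing codes count as "skip"
    mapping:   {original_name: canonical_name}
    pending:   codes flagged for later manual review
    """
    mapping = {}
    pending = []
    for code, names in groups.items():
        decision = decisions.get(code, "skip")
        if decision in ("auto", "yes"):
            canonical = names[0]  # assumption: first name is the canonical form
            for name in names:
                mapping[name] = canonical
        elif decision == "no":
            for name in names:
                mapping[name] = name  # keep as different products
        else:
            pending.append(code)
    return mapping, pending
```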

Files

- soundex_br.py - Core SoundexBR algorithm implementation
- test_soundex.py - Comprehensive test suite (40+ tests)
- analyze_names.py - Main analysis script for processing datasets
- nome_agroquimico.csv - Sample dataset (208K agrochemical names)

Docker Usage

Build and run with Docker Compose:

# Build and run (tests + analysis)
docker-compose up --build

# Run tests only
docker-compose run soundex-analysis python test_soundex.py

# Run analysis only
docker-compose run soundex-analysis python analyze_names.py

# Interactive shell
docker-compose run soundex-analysis bash

Build and run with Docker:

# Build image
docker build -t br-soundex .

# Run tests
docker run br-soundex python test_soundex.py

# Run analysis
docker run -v "$(pwd)":/app br-soundex python analyze_names.py

# Interactive shell
docker run -it -v "$(pwd)":/app br-soundex bash

Local Usage (without Docker)

Prerequisites:

pip install -r requirements.txt

Run tests:

python test_soundex.py

Run analysis:

python analyze_names.py

Use in Python:

from soundex_br import SoundexBR

soundex = SoundexBR()

# Encode single name
code = soundex.encode("João Silva")
print(code)  # Output: J240

# Encode multiple names
names = ["João", "Joao", "Maria", "Mario"]
codes = soundex.encode_batch(names)
print(codes)  # Output: ['J000', 'J000', 'M600', 'M600']
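
Once codes are available, grouping names by code is a one-pass dictionary build. The helper `group_by_code` below is a hypothetical convenience function, not part of `soundex_br`; the codes are hard-coded from the batch output shown above so the snippet runs without the package installed.

```python
from collections import defaultdict


def group_by_code(names, codes):
    """Group names that share the same phonetic code, preserving input order."""
    groups = defaultdict(list)
    for name, code in zip(names, codes):
        groups[code].append(name)
    return dict(groups)


names = ["João", "Joao", "Maria", "Mario"]
codes = ["J000", "J000", "M600", "M600"]  # copied from the encode_batch output above
groups = group_by_code(names, codes)
print(groups)  # {'J000': ['João', 'Joao'], 'M600': ['Maria', 'Mario']}
```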

Algorithm Details

SoundexBR Encoding Process:

Normalization:
- Convert to uppercase
- Remove accents (João → JOAO)
- Remove special characters and numbers
- Keep only letters
Phonetic Transformations:
- First letter special rules (KA→CA, GE→JE, etc.)
- Handle silent H
- Apply phonetic equivalences (PH→F, TH→T, etc.)
Letter to Number Encoding:
- Group 0: A, E, I, O, U, H, W, Y (vowels plus H, W, and Y; dropped from the code)
- Group 1: B, F, P, V
- Group 2: C, G, J, K, Q, S, X, Z
- Group 3: D, T
- Group 4: L
- Group 5: M, N
- Group 6: R
Code Generation:
- Remove zeros (Group 0 letters)
- Remove adjacent duplicates
- Keep first letter + 3 digits
- Pad with zeros if needed
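
The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code in `soundex_br.py`: the function name `encode_sketch` is hypothetical, only the rules spelled out in this section are modeled (KA→CA, GE→JE, PH→F, TH→T), all letters of the input are encoded, empty input returns an empty string, and the first letter's own group is not compared against the following digit.

```python
import unicodedata

# Letter groups from the table above; Group 0 letters are simply absent.
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODE = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def encode_sketch(name: str) -> str:
    """Encode a name as first letter + 3 digits, following the listed steps."""
    # Normalization: uppercase, strip accents, keep only A-Z.
    text = unicodedata.normalize("NFD", name.upper())
    letters = "".join(ch for ch in text if "A" <= ch <= "Z")
    if not letters:
        return ""  # assumption: empty input yields an empty code

    # Phonetic equivalences named in the text.
    letters = letters.replace("PH", "F").replace("TH", "T")
    # First-letter rules named in the text.
    if letters.startswith("KA"):
        letters = "C" + letters[1:]
    elif letters.startswith("GE"):
        letters = "J" + letters[1:]

    first, rest = letters[0], letters[1:]
    # Skipping letters without a group is the "remove zeros" step.
    digits = [CODE[ch] for ch in rest if ch in CODE]
    # Remove adjacent duplicates after the zeros are gone.
    deduped = []
    for digit in digits:
        if not deduped or deduped[-1] != digit:
            deduped.append(digit)
    # Keep three digits, padding with zeros when there are fewer.
    return first + "".join(deduped)[:3].ljust(3, "0")


if __name__ == "__main__":
    for n in ("João", "Joao", "Maria", "Mario", "Philipe", "Filipe"):
        print(n, encode_sketch(n))
```

With these rules, "João" and "Joao" both give J000 and "Maria" and "Mario" both give M600, matching the batch example in the usage section.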

Similarity Metrics:

- Cosine Similarity: Cosine of the angle between the two names' character-frequency vectors
- String Similarity: Set-based character overlap between the two names
- Average Similarity: Mean of the two metrics above, used for final matching
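
A minimal sketch of the three metrics, to make the definitions concrete. The function names are hypothetical, and two details are assumptions: the set-based overlap is computed as a Jaccard index (intersection over union), and the combined score is a plain arithmetic mean. The project's exact formulas may differ.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between character-frequency vectors of a and b."""
    fa, fb = Counter(a.upper()), Counter(b.upper())
    dot = sum(count * fb[ch] for ch, count in fa.items())
    norm_a = math.sqrt(sum(v * v for v in fa.values()))
    norm_b = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def string_similarity(a: str, b: str) -> float:
    """Set-based character overlap, assumed here to be the Jaccard index."""
    sa, sb = set(a.upper()), set(b.upper())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def average_similarity(a: str, b: str) -> float:
    """Arithmetic mean of the two metrics above."""
    return (cosine_similarity(a, b) + string_similarity(a, b)) / 2


if __name__ == "__main__":
    print(string_similarity("Maria", "Mario"))  # 0.8
    print(round(average_similarity("Maria", "Mario"), 3))
```

Both metrics ignore character order, so anagrams score 1.0; that is one reason medium-confidence pairs still go to manual review.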

Output Files

After running analysis, the following files are generated:

- analysis_report.md - Comprehensive analysis report
- soundex_codes.csv - All names with their soundex codes
- grouped_names.json - Names grouped by soundex code
- similarity_analysis.csv - Detailed similarity metrics
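
For orientation, here is how two of these files could be produced with the standard library. This is only a sketch: the function name `write_outputs` and the CSV column names `name` and `soundex_code` are assumptions, and the real files written by `analyze_names.py` may use a different layout.

```python
import csv
import json
from pathlib import Path


def write_outputs(codes_by_name, out_dir="."):
    """Write soundex_codes.csv and grouped_names.json; return the groups."""
    out = Path(out_dir)

    # One row per name with its code (column names are illustrative).
    with open(out / "soundex_codes.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["name", "soundex_code"])
        writer.writerows(codes_by_name.items())

    # Names grouped by code, kept as UTF-8 so accents stay readable.
    groups = {}
    for name, code in codes_by_name.items():
        groups.setdefault(code, []).append(name)
    with open(out / "grouped_names.json", "w", encoding="utf-8") as fh:
        json.dump(groups, fh, ensure_ascii=False, indent=2)
    return groups
```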

Performance

- Encoding Speed: 10,000+ names/second
- Dataset Size: Tested with 208K names
- Memory: Efficient batch processing
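
Throughput figures like the one above depend on hardware, so it can be useful to measure on your own machine. The helper below is a hypothetical sketch (the name `names_per_second` is not from the project); it times any callable encoder with `time.perf_counter` and reports the best of a few runs.

```python
import time


def names_per_second(encode, names, repeats=3):
    """Return the best observed throughput of encode() over the given names."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for name in names:
            encode(name)
        best = min(best, time.perf_counter() - start)
    return len(names) / best if best > 0 else float("inf")


if __name__ == "__main__":
    # str.upper stands in for a real encoder such as SoundexBR().encode.
    sample = ["João Silva", "Maria Souza"] * 5000
    print(f"{names_per_second(str.upper, sample):,.0f} names/second")
```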

Test Coverage

40+ comprehensive tests including:

- Identical name matching
- Phonetically similar names
- Empty/null inputs
- Special character handling
- Accent removal
- Case insensitivity
- First letter transformations
- Compound names
- Code format validation
- Consonant grouping
- Performance benchmarks
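
As an illustration of the "code format validation" case, a check for the "first letter + 3 digits" shape can be written with a regular expression. The helper name `is_valid_code` is hypothetical and the actual assertions in `test_soundex.py` may differ.

```python
import re

# One uppercase letter followed by exactly three digits, e.g. "J000".
CODE_PATTERN = re.compile(r"[A-Z][0-9]{3}")


def is_valid_code(code) -> bool:
    """Return True if code has the letter-plus-three-digits shape."""
    return isinstance(code, str) and CODE_PATTERN.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ("J000", "M600", "J24", "j240", ""):
        print(repr(candidate), is_valid_code(candidate))
```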

Accuracy

Based on analysis of agrochemical names dataset:

- Precision: ~70-80% (varies by threshold)
- Compression Ratio: 20-30:1
- Quality Score: 75-85/100
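
The section does not define these metrics, so the sketch below shows one plausible reading rather than the project's formulas: precision as the share of reviewed groupings that were confirmed, compression ratio as names per group, and compression percentage as the share of names eliminated by grouping. All three function names are hypothetical.

```python
def precision(confirmed: int, rejected: int) -> float:
    """Share of reviewed groupings that were confirmed as true matches."""
    reviewed = confirmed + rejected
    return confirmed / reviewed if reviewed else 0.0


def compression_ratio(n_names: int, n_groups: int) -> float:
    """Average number of original names collapsed into each group (N:1)."""
    return n_names / n_groups if n_groups else 0.0


def compression_percent(n_names: int, n_groups: int) -> float:
    """Percentage of distinct names eliminated by grouping."""
    return (1 - n_groups / n_names) * 100 if n_names else 0.0


if __name__ == "__main__":
    # Illustrative numbers only: 7 names collapsing into 2 groups.
    print(compression_ratio(7, 2))              # 3.5
    print(round(compression_percent(7, 2), 1))  # 71.4
    print(precision(8, 2))                      # 0.8
```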

License

This implementation is for educational and research purposes.
