Comprehensive implementation of the SoundexBR algorithm for grouping and analyzing Brazilian Portuguese names by phonetic similarity.
- SoundexBR Algorithm: Full implementation based on the SoundexBR R package
- Similarity Metrics: Cosine similarity and string similarity for refined matching
- Comprehensive Testing: 40+ unit tests covering edge cases and validation
- Performance: Processes 10,000+ names per second
- Detailed Reports: Markdown reports with statistics, accuracy metrics, and recommendations
- Tinder-Style Validation: Interactive interface for uncertain match approval
- RNC Integration: Maps products to official cultivar registry
graph TB
A[Input Data<br/>Excel/CSV with<br/>Product Names] --> B[SoundexBR<br/>Encoder]
B --> C{Grouping<br/>Engine}
C --> D[Similar Names<br/>Same Code]
C --> E[Unique Names<br/>Different Codes]
D --> F[Similarity<br/>Analysis]
F --> G{Confidence<br/>Score}
G -->|High >0.7| H[Auto-Approved<br/>Groupings]
G -->|Medium 0.5-0.7| I[Uncertain<br/>Matches]
G -->|Low <0.5| J[Potential<br/>False Positives]
I --> K[Tinder-Style<br/>Validation]
J --> K
K --> L[User Review<br/>Swipe Y/N]
L --> M[Validated<br/>Groupings]
H --> N[Final Results]
M --> N
E --> N
N --> O[Export<br/>Excel/CSV/JSON/MD]
style A fill:#e1f5ff
style B fill:#fff4e6
style K fill:#ffe6f0
style N fill:#e6ffe6
style O fill:#f0e6ff
graph LR
A[Product Names<br/>208K Records] --> B[Normalize &<br/>Encode]
B --> C[SoundexBR<br/>Codes]
D[RNC Database<br/>Official Names] --> E[Normalize &<br/>Encode]
E --> F[RNC Soundex<br/>Codes]
C --> G{Code<br/>Match?}
F --> G
G -->|Yes| H[Same Soundex<br/>Code]
G -->|No| I[Different<br/>Codes]
H --> J[Calculate<br/>Similarity]
J --> K{Similarity<br/>Score}
K -->|>0.7| L[High Confidence<br/>Match]
K -->|0.5-0.7| M[Medium Confidence<br/>Needs Review]
K -->|<0.5| N[Low Confidence<br/>Likely False Match]
I --> O[Cross-Group<br/>Check]
O --> P{High Similarity<br/>Despite Different<br/>Codes?}
P -->|Yes| M
P -->|No| Q[No Match<br/>Product Not in RNC]
L --> R[Auto-Map to<br/>RNC Name]
M --> S[User<br/>Validation]
N --> S
S --> T[Validated<br/>Mappings]
Q --> U[Unmatched<br/>Products List]
R --> V[Final<br/>Mapping Table]
T --> V
U --> V
style A fill:#e1f5ff
style D fill:#ffe6e6
style G fill:#fff4e6
style K fill:#fff4e6
style S fill:#ffe6f0
style V fill:#e6ffe6
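The code-matching step in the RNC flow can be sketched as a dictionary lookup: index the RNC names by phonetic code, then look up each product's code. This is a minimal sketch, not the project's implementation. The function names (`index_by_code`, `rnc_candidates`, `toy_encode`) and the sample names are hypothetical, and the toy encoder only stands in for the real SoundexBR encoder so the example runs on its own.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def index_by_code(names: Iterable[str], encode: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group names under the code returned by `encode`."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        index[encode(name)].append(name)
    return dict(index)


def rnc_candidates(
    products: Iterable[str],
    rnc_names: Iterable[str],
    encode: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Map each product to the RNC names that share its code (empty list if none)."""
    rnc_index = index_by_code(rnc_names, encode)
    return {product: rnc_index.get(encode(product), []) for product in products}


# Hypothetical stand-in encoder: uppercase first letter only.
def toy_encode(name: str) -> str:
    return name[:1].upper()


# Hypothetical sample names, for illustration only.
matches = rnc_candidates(["soja x", "trigo y"], ["Soja BRS", "Milho AG"], toy_encode)
print(matches)
```

Products with an empty candidate list would go on to the cross-group check; products with candidates would go on to similarity scoring.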
flowchart TD
A[Input Name<br/>'João Silva'] --> B[Normalize]
B --> C[Uppercase<br/>'JOÃO SILVA']
C --> D[Remove Accents<br/>'JOAO SILVA']
D --> E[Remove Special Chars<br/>'JOAOSILVA']
E --> F[Split Words]
F --> G[Process First Word<br/>'JOAO']
G --> H{First Letter<br/>Transform?}
H -->|KA→CA, GE→JE, etc| I[Apply Transform]
H -->|No| J[Keep Letter]
I --> K[Handle H]
J --> K
K --> L[Apply Phonetic<br/>Transforms<br/>PH→F, TH→T, etc]
L --> M[Save First<br/>Letter: J]
M --> N[Encode to Numbers<br/>Letter→Group Code]
N --> O[Remove Zeros<br/>Vowels]
O --> P[Remove Adjacent<br/>Duplicates]
P --> Q[Take First<br/>3 Digits]
Q --> R[Pad with Zeros<br/>if needed]
R --> S[Final Code<br/>J + 3 digits]
S --> T[Result: 'J200']
style A fill:#e1f5ff
style H fill:#fff4e6
style T fill:#e6ffe6
stateDiagram-v2
[*] --> InputData: Raw product names
InputData --> Analysis: Run SoundexBR
Analysis --> Grouping: Create phonetic groups
Grouping --> HighQuality: High confidence (>0.7)
Grouping --> MediumQuality: Medium confidence (0.5-0.7)
Grouping --> LowQuality: Low confidence (<0.5)
HighQuality --> AutoApproved: Automatic approval
MediumQuality --> UserReview: Tinder validation
LowQuality --> UserReview: Tinder validation
UserReview --> Approved: User says YES
UserReview --> Rejected: User says NO
UserReview --> Skipped: User SKIPS
Approved --> Standardization: Create canonical mapping
AutoApproved --> Standardization: Create canonical mapping
Rejected --> Separation: Keep as different products
Skipped --> ManualReview: Flag for later
Standardization --> CleanData: Standardized names
Separation --> CleanData: Unique products
ManualReview --> CleanData: Pending review
CleanData --> Export: Generate reports
Export --> [*]: Complete
note right of UserReview
Interactive Tinder-style interface
Shows similarity metrics
Visual confidence bars
Swipe left/right decision
end note
note right of Standardization
71.4% compression achieved
100% precision in demo
Reduces data redundancy
end note
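The confidence thresholds used throughout the diagrams (above 0.7, 0.5 to 0.7, below 0.5) can be sketched as a small routing function. This is an illustrative sketch: the function name `route_by_confidence` and the returned labels are hypothetical, and treating a score of exactly 0.7 as medium confidence is an assumption read from the `>0.7` label.

```python
def route_by_confidence(score: float, high: float = 0.7, low: float = 0.5) -> str:
    """Return a hypothetical routing label for a similarity score in [0, 1]."""
    if score > high:
        # High confidence: approved automatically.
        return "auto_approved"
    if score >= low:
        # Medium confidence: sent to user review.
        return "needs_review"
    # Low confidence: also sent to user review, flagged as a likely false match.
    return "likely_false_positive"


for score in (0.9, 0.7, 0.5, 0.3):
    print(score, route_by_confidence(score))
```

Both the medium and low labels lead to user review in the diagrams; keeping them distinct lets the review interface show the riskier matches differently.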
- soundex_br.py - Core SoundexBR algorithm implementation
- test_soundex.py - Comprehensive test suite (40+ tests)
- analyze_names.py - Main analysis script for processing datasets
- nome_agroquimico.csv - Sample dataset (208K agrochemical names)
# Build and run (tests + analysis)
docker-compose up --build
# Run tests only
docker-compose run soundex-analysis python test_soundex.py
# Run analysis only
docker-compose run soundex-analysis python analyze_names.py
# Interactive shell
docker-compose run soundex-analysis bash

# Build image
docker build -t br-soundex .
# Run tests
docker run br-soundex python test_soundex.py
# Run analysis
docker run -v $(pwd):/app br-soundex python analyze_names.py
# Interactive shell
docker run -it -v $(pwd):/app br-soundex bash

# Install dependencies (local setup)
pip install -r requirements.txt
# Run tests
python test_soundex.py
# Run analysis
python analyze_names.py

from soundex_br import SoundexBR
soundex = SoundexBR()
# Encode single name
code = soundex.encode("João Silva")
print(code) # Output: J240
# Encode multiple names
names = ["João", "Joao", "Maria", "Mario"]
codes = soundex.encode_batch(names)
print(codes) # Output: ['J000', 'J000', 'M600', 'M600']

- Normalization:
  - Convert to uppercase
  - Remove accents (João → JOAO)
  - Remove special characters and numbers
  - Keep only letters
- Phonetic Transformations:
  - First letter special rules (KA→CA, GE→JE, etc.)
  - Handle silent H
  - Apply phonetic equivalences (PH→F, TH→T, etc.)
- Letter to Number Encoding:
  - Group 0: A, E, I, O, U, H, W, Y (vowels and other ignored letters)
  - Group 1: B, F, P, V
  - Group 2: C, G, J, K, Q, S, X, Z
  - Group 3: D, T
  - Group 4: L
  - Group 5: M, N
  - Group 6: R
- Code Generation:
  - Remove zeros (group 0 letters)
  - Remove adjacent duplicates
  - Keep first letter + 3 digits
  - Pad with zeros if needed
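The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the code in `soundex_br.py`. It applies only the two equivalences named above (PH→F, TH→T) and skips the first-letter rules and silent-H handling. It also assumes the digits come from the letters after the first one and that an input with no letters yields an empty string. The names `normalize` and `encode_sketch` are hypothetical.

```python
import unicodedata

# Letter groups as listed above; group 0 letters are dropped from the code.
GROUPS = {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
LETTER_TO_DIGIT = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def normalize(name: str) -> str:
    """Uppercase, strip accents, and keep only the letters A-Z."""
    decomposed = unicodedata.normalize("NFKD", name.upper())
    return "".join(ch for ch in decomposed if "A" <= ch <= "Z")


def encode_sketch(name: str) -> str:
    """Simplified encoder: first letter plus three digits."""
    letters = normalize(name)
    if not letters:
        return ""  # assumption: inputs without letters give an empty code
    # Only the two equivalences named in the text; the full rule set is larger.
    letters = letters.replace("PH", "F").replace("TH", "T")
    first, rest = letters[0], letters[1:]
    digits = [LETTER_TO_DIGIT[ch] for ch in rest]
    digits = [d for d in digits if d != "0"]  # remove zeros
    deduped = []
    for d in digits:  # remove adjacent duplicates
        if not deduped or deduped[-1] != d:
            deduped.append(d)
    # Keep three digits, padding with zeros when fewer remain.
    return first + "".join(deduped)[:3].ljust(3, "0")


print([encode_sketch(n) for n in ["João", "Joao", "Maria", "Mario"]])
# ['J000', 'J000', 'M600', 'M600']
```

Under these assumptions the sketch reproduces the batch example shown earlier. Multi-word names and the remaining transformation rules are left out on purpose.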
- Cosine Similarity: Character frequency vector comparison
- String Similarity: Set-based character overlap
- Average Similarity: Combined metric for final matching
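One way to read these three metrics is sketched below. The exact formulas used by the project are not stated here, so this sketch makes its own choices: cosine similarity over character-count vectors, Jaccard overlap for the set-based metric, and a plain mean of the two for the combined score. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the character-frequency vectors of a and b."""
    counts_a, counts_b = Counter(a), Counter(b)
    dot = sum(counts_a[ch] * counts_b[ch] for ch in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def set_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the character sets (an assumed definition)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def average_similarity(a: str, b: str) -> float:
    """Unweighted mean of the two metrics (an assumed combination)."""
    return (cosine_similarity(a, b) + set_similarity(a, b)) / 2


print(round(average_similarity("GLIFOSATO", "GLYPHOSATE"), 3))
```

Both metrics ignore character order, so anagrams score as identical. That is one reason a score like this is better used to refine matches within a phonetic group than to match names on its own.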
After running the analysis, the following files are generated:
- analysis_report.md - Comprehensive analysis report
- soundex_codes.csv - All names with their soundex codes
- grouped_names.json - Names grouped by soundex code
- similarity_analysis.csv - Detailed similarity metrics
- Encoding Speed: 10,000+ names/second
- Dataset Size: Tested with 208K names
- Memory: Efficient batch processing
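Throughput depends on hardware, so it is worth measuring the encoding speed locally. The sketch below times any encoder callable. The helper `names_per_second` is hypothetical, and `str.upper` is only a stand-in so the example runs without the project installed.

```python
import time
from typing import Callable, Sequence


def names_per_second(encode: Callable[[str], str], names: Sequence[str]) -> float:
    """Return how many names per second `encode` processes on this machine."""
    start = time.perf_counter()
    for name in names:
        encode(name)
    elapsed = time.perf_counter() - start
    return len(names) / elapsed if elapsed > 0 else float("inf")


# Stand-in encoder; pass the real encoder's encode method to measure it instead.
sample = ["João", "Maria", "Mario"] * 1000
rate = names_per_second(str.upper, sample)
print(f"{rate:,.0f} names/second")
```

To measure the real encoder, pass `SoundexBR().encode` from the usage example in place of `str.upper`.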
40+ comprehensive tests including:
- Identical name matching
- Phonetically similar names
- Empty/null inputs
- Special character handling
- Accent removal
- Case insensitivity
- First letter transformations
- Compound names
- Code format validation
- Consonant grouping
- Performance benchmarks
Based on analysis of the agrochemical names dataset:
- Precision: ~70-80% (varies by threshold)
- Compression Ratio: 20-30:1
- Quality Score: 75-85/100
This implementation is for educational and research purposes.