# address-cleanser

A Python CLI tool for parsing, validating, and formatting US addresses according to USPS Publication 28 standards.
- Address Parsing: Uses the `usaddress` library for NLP-based address parsing
- Offline Validation: Validates ZIP codes, state abbreviations, and address completeness
- USPS Formatting: Formats addresses according to USPS Publication 28 standards
- Multiple Output Formats: Supports CSV, JSON, and Excel output formats (input: CSV only)
- Batch Processing: Process large CSV files with progress tracking
- Single Address Processing: Process individual addresses from command line
- Column Preservation: Preserve original CSV columns while adding cleaned address fields
- Flexible Input: Accept addresses in a single column OR auto-detect and combine separate columns
Get up and running in 3 steps:

1. Install dependencies:

   ```bash
   bash install.sh
   ```

2. Test with sample data:

   ```bash
   # If installed:
   address-cleanser batch --input out/sample_input.csv --output test_results.csv --format csv
   # If from source:
   python3 cli.py batch --input out/sample_input.csv --output test_results.csv --format csv
   ```

3. Process a single address:

   ```bash
   # If installed:
   address-cleanser single --single "123 Main Street, Austin, TX 78701"
   # If from source:
   python3 cli.py single --single "123 Main Street, Austin, TX 78701"
   ```
Expected Output Preview:

```
Original: 123 Main Street, Austin, TX 78701
Formatted: 123 MAIN ST, AUSTIN, TX, 78701
Valid: Yes
Confidence: 90.0%
```
- Python 3.8 or higher
- pip package manager
- ~50MB disk space for dependencies
Recommended (automatic):

```bash
bash install.sh
```

Manual installation:

```bash
pip install -r requirements.txt
```

Or install the packages individually:

```bash
pip install usaddress pandas openpyxl click tqdm psutil
```

Command Invocation:

- If installed as a package or via Homebrew: use `address-cleanser`
- If running from source: use `python cli.py` or `python3 cli.py`
Command Structure:

```bash
address-cleanser [OPTIONS] COMMAND [ARGS]...
```

The tool provides two main commands:

- `batch` - Process addresses from a CSV file
- `single` - Process a single address
Process addresses from a CSV file (CSV input only; output can be CSV, JSON, or Excel):

```bash
address-cleanser batch --input addresses.csv --output cleaned_addresses.csv --format csv
```

Options:

- `--input, -i`: Input CSV file path (required). Note: only CSV files are supported as input
- `--output, -o`: Output file path (required)
- `--format, -f`: Output format - csv, json, or excel (default: csv)
- `--address-column, -c`: Name of the address column in the CSV (auto-detected if not specified)
- `--address-columns, -C`: Comma-separated list of columns to combine (e.g., `"Address,City,State,Zip"`)
- `--preserve-columns, -p`: Preserve all original CSV columns in the output
- `--update-in-place`: Mirror the input structure with cleaned values (perfect for client returns)
- `--auto-combine, -a`: Auto-detect and combine separate address columns
- `--report, -r`: Generate a validation report file (optional)
- `--chunk-size`: Process addresses in chunks of this size (default: 1000)
Examples:

```bash
# Process CSV file and output to CSV
address-cleanser batch --input addresses.csv --output cleaned.csv --format csv

# Process CSV file and output to Excel with validation report
address-cleanser batch --input addresses.csv --output results.xlsx --format excel --report validation.txt

# Process with custom address column name (case-sensitive)
address-cleanser batch --input data.csv --output output.csv --address-column "Address"

# Example: if your CSV has "full_address" instead of "address"
address-cleanser batch --input data.csv --output output.csv --address-column "full_address"

# Process CSV with separate address columns and preserve original columns
address-cleanser batch --input data.csv --output cleaned.csv --preserve-columns --auto-combine

# Explicitly specify which columns to combine
address-cleanser batch --input data.csv --output cleaned.csv --preserve-columns --address-columns "Address,City,State,Zip"

# Preserve columns and output to Excel
address-cleanser batch --input client_data.csv --output cleaned.xlsx --format excel --preserve-columns --auto-combine

# Update in place - mirror client's structure with cleaned values (perfect for Fiverr/client returns)
address-cleanser batch --input client_data.csv --output cleaned.csv --update-in-place
```

Process a single address:
```bash
address-cleanser single --single "123 Main Street, Austin, TX 78701" --format json
```

Options:

- `--single, -s`: Single address to process (required)
- `--format, -f`: Output format - csv, json, or excel (default: json)
- `--output, -o`: Output file path (optional; prints to console if not specified)
Examples:

```bash
# Process single address and print to console
address-cleanser single --single "123 Main Street, Austin, TX 78701"

# Process single address and save to file
address-cleanser single --single "PO Box 123, Austin, TX 78701" --output result.json --format json
```

Input Requirements: The CLI accepts CSV files only as input. Excel files (.xlsx) are not supported for input, but you can output to Excel format. To process an Excel file, first export it to CSV format.
Address Column Formats: The tool supports two input formats:

1. Single Combined Column (default): One column with the complete address string

   ```
   address
   123 Main Street, Austin, TX 78701
   ```

2. Separate Columns: Address components in multiple columns

   ```
   Address,City,State,Zip
   123 Main Street,Austin,TX,78701
   ```

Use `--auto-combine` to auto-detect and combine, or `--address-columns "Address,City,State,Zip"` to specify the columns manually.
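Conceptually, combining separate columns just joins the component values into one address string per row. A simplified stand-in for what the combine step does, using only the standard library (the tool's real implementation may differ, e.g. in separator handling):

```python
import csv
import io

# Hypothetical sample input with separate address components.
raw = """Name,Address,City,State,Zip
John Smith,123 Main Street,Austin,TX,78701
"""

combined = []
for row in csv.DictReader(io.StringIO(raw)):
    # Join the component columns in order, skipping any empty values.
    parts = [row[col] for col in ("Address", "City", "State", "Zip") if row[col]]
    combined.append(", ".join(parts))

print(combined[0])  # 123 Main Street, Austin, TX, 78701
```

The joined string then goes through the same parse/validate/format pipeline as a single-column address.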
Column Detection:

- By default, the tool looks for a column named `address` (case-sensitive, lowercase)
- Use `--address-column` to specify a different single column name
- Use `--auto-combine` to automatically detect and combine separate address columns
- Use `--address-columns` to explicitly specify which columns to combine (comma-separated)
Column Preservation:

- Use `--preserve-columns` to keep all original CSV columns in the output
- Original columns appear first, followed by `cleaned_*` columns with parsed/cleaned data
- Perfect for maintaining client data structures while adding cleaned address fields
Update In-Place Mode:

- Use `--update-in-place` to mirror the client's exact input structure with cleaned values
- Same columns, same order - only the address data is cleaned/standardized
- Perfect for Fiverr clients or when returning data that needs to plug right back into their systems
- No extra columns added - address fields are updated with cleaned values in place
- Works with any column format: separate columns (Address, City, State, Zip) or combined (Full_Address)
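The column-name detection behind this mode can be imagined as normalization plus keyword matching. This is a hedged sketch of the idea only - the tool's actual detection logic is not shown in this document and may be more sophisticated:

```python
# Map arbitrary client column names to canonical address fields by
# normalizing (lowercase, separators to spaces) and keyword matching.
# Keyword list is illustrative, not the tool's actual table.
KEYWORDS = {
    "address": "address",
    "street": "address",
    "city": "city",
    "state": "state",
    "zip": "zip",
    "postal": "zip",
}

def detect_address_columns(columns):
    mapping = {}
    for col in columns:
        norm = col.lower().replace("_", " ").replace("-", " ")
        for keyword, field in KEYWORDS.items():
            if keyword in norm and field not in mapping:
                mapping[field] = col
                break
    return mapping

print(detect_address_columns(["Name", "Street_Address", "City_Name", "State", "Zip", "Phone"]))
# {'address': 'Street_Address', 'city': 'City_Name', 'state': 'State', 'zip': 'Zip'}
```

Non-matching columns (Name, Phone) fall outside the mapping and are passed through unchanged, which is exactly the behavior update-in-place mode promises.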
Example CSV Formats:

Single combined column (preferred):

```
address
123 Main Street, Austin, TX 78701
456 Oak Avenue, Dallas, TX 75201
```

Separate columns (also supported):

```
Name,Address,City,State,Zip,Phone
John Smith,123 Main St,Austin,TX,78701,555-0101
```

Standard Output (without --preserve-columns):
The CSV output includes the following columns:
- `original_address`: Original input address
- `street_number`: Parsed street number
- `street_name`: Parsed street name
- `street_type`: Parsed street type (ST, AVE, etc.)
- `city`: Parsed city name
- `state`: Parsed state abbreviation
- `zip_code`: Parsed ZIP code
- `unit`: Unit/apartment information
- `po_box`: PO Box number
- `formatted_address`: USPS-formatted single-line address
- `confidence_score`: Parsing confidence (0-100)
- `validation_status`: Valid or Invalid
- `issues`: List of validation issues
- `address_type`: Type of address (Street Address, PO Box, etc.)
With Column Preservation (using --preserve-columns):
- All original columns are preserved in their original order
- New `cleaned_*` columns are appended: `cleaned_original_address`, `cleaned_street_number`, `cleaned_street_name`, `cleaned_street_type`, `cleaned_city`, `cleaned_state`, `cleaned_zip_code`, `cleaned_unit`, `cleaned_po_box`, `cleaned_formatted_address`, `cleaned_confidence_score`, `cleaned_validation_status`, `cleaned_issues`, `cleaned_address_type`
This allows you to maintain your original data structure while adding cleaned address fields.
Standard Output (without --preserve-columns):
The JSON output includes:
- `results`: Array of processing results
- `summary`: Processing statistics
- `timestamp`: Processing timestamp
Each result contains:
- `original`: Original address
- `parsed`: Parsed address components
- `formatted`: USPS-formatted components
- `single_line`: Single-line formatted address
- `multi_line`: Multi-line formatted address
- `confidence`: Confidence score
- `valid`: Validation status
- `issues`: List of issues
- `address_type`: Address type
With Column Preservation (using --preserve-columns):
- `results`: Array of processing results (same as above)
- `original_data`: Array of original CSV rows with all original columns
- `summary`: Processing statistics
- `timestamp`: Processing timestamp
Standard Output (without --preserve-columns):
- Addresses sheet: Parsed address columns (Original Address, Street Number, etc.)
- Summary sheet: Processing statistics and metrics
With Column Preservation (using --preserve-columns):
- Addresses sheet: All original columns + cleaned address columns (with "Cleaned " prefix)
- Summary sheet: Processing statistics and metrics
The tool provides confidence scores (0-100) to help you assess result quality:
- 90-100%: High confidence - Results are very reliable
- 70-89%: Medium confidence - Results are generally good, minor review recommended
- 50-69%: Low confidence - Results may have issues, manual review recommended
- 0-49%: Very low confidence - Results likely have significant issues
- Valid: Address meets all validation criteria (ZIP code format, state abbreviation, completeness)
- Invalid: Address fails one or more validation checks
Review addresses with:
- Confidence scores below 70%
- Validation status "Invalid"
- Non-empty "issues" field
- Unusual address types or formats
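These review criteria are easy to apply mechanically to the CSV output. A small pandas sketch (column names follow the CSV output schema documented above; the inline DataFrame stands in for `pd.read_csv("cleaned.csv")`):

```python
import pandas as pd

# Hypothetical cleaned output; in practice load with pd.read_csv("cleaned.csv").
df = pd.DataFrame({
    "formatted_address": ["123 MAIN ST, AUSTIN, TX, 78701", "123 MAIN ST, AUSTIN, TX"],
    "confidence_score": [90.0, 65.0],
    "validation_status": ["Valid", "Invalid"],
    "issues": ["", "Missing required fields: zip_code"],
})

# Flag rows that meet any of the review criteria above.
needs_review = df[
    (df["confidence_score"] < 70)
    | (df["validation_status"] == "Invalid")
    | (df["issues"].fillna("") != "")
]
print(len(needs_review))  # 1
```

Writing `needs_review` back out with `to_csv` gives a shortlist for manual inspection.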
```json
{
  "original": "123 Main Street, Austin, TX 78701",
  "confidence": 90.0,
  "valid": true,
  "issues": [],
  "formatted_address": "123 MAIN ST, AUSTIN, TX, 78701"
}
```

✅ High confidence, valid address - ready to use
```json
{
  "original": "123 Main St Austin TX",
  "confidence": 65.0,
  "valid": false,
  "issues": ["Missing required fields: zip_code"],
  "formatted_address": "123 MAIN ST, AUSTIN, TX"
}
```

Typical processing times:

- Single address: ~0.4 seconds
- Small files (100 addresses): ~10 seconds
- Medium files (1,000 addresses): ~2 minutes
- Large files (10,000 addresses): ~20 minutes
Memory usage:

- Base memory: ~50MB
- Per 1,000 addresses: ~5MB additional
- Recommended: 100MB+ available memory for files >5,000 addresses
File size limits:

- Maximum recommended: 50,000 addresses per file
- Chunk processing: Automatically handles large files
- Memory optimization: Use smaller chunk sizes (500-1000) for very large files
Recommended chunk sizes:

- Small files (<1,000 addresses): Default (1000) is fine
- Medium files (1,000-10,000 addresses): Use 1000-2000
- Large files (>10,000 addresses): Use 500-1000
- Memory-constrained systems: Use 250-500
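The chunking idea maps directly onto pandas' own chunked CSV reader. A sketch of the concept (not the tool's actual internals), using an in-memory CSV in place of a real file path:

```python
import io
import pandas as pd

# Hypothetical CSV source; in practice pass a file path to pd.read_csv.
raw = "address\n" + "\n".join(f"{n} Main St, Austin, TX 78701" for n in range(1, 2501))

total = 0
# Read 1000 rows at a time so memory stays bounded regardless of file size.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=1000):
    # Each chunk is a regular DataFrame: process it, write results, move on.
    total += len(chunk)

print(total)  # 2500
```

Smaller `chunksize` values trade a little throughput for a smaller memory footprint, which is why the guidance above suggests 250-500 on constrained systems.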
```bash
# Process the included sample file
address-cleanser batch --input out/sample_input.csv --output results.csv --format csv --report report.txt
```

Example 1: CSV with separate address columns
```bash
# Auto-detect and combine address columns, preserve all original columns
address-cleanser batch \
  --input client_data.csv \
  --output cleaned.csv \
  --preserve-columns \
  --auto-combine
```

Example 2: Explicitly specify which columns to combine
```bash
# Combine specific columns and preserve original structure
address-cleanser batch \
  --input data.csv \
  --output cleaned.csv \
  --preserve-columns \
  --address-columns "Address,City,State,Zip"
```

Example 3: Combined address column (standard)
```bash
# Standard processing with a combined address column
address-cleanser batch \
  --input addresses.csv \
  --output cleaned.csv \
  --address-column "Full_Address"
```

Example 4: Preserve columns and export to Excel
```bash
# Maintain original structure, add cleaned fields, output to Excel
address-cleanser batch \
  --input client_data.csv \
  --output cleaned.xlsx \
  --format excel \
  --preserve-columns \
  --auto-combine
```

Example 5: Update in-place (perfect for client returns)
# Mirror client's structure with cleaned values - no extra columns
address-cleanser batch \
--input client_data.csv \
--output cleaned.csv \
--update-in-placeBefore (input with 7 columns):
Name,Address,City,State,Zip,Phone,Email
John Smith,123 main st apt#5a,Austin,TX,78701-123,555-0101,john@example.comAfter (output with same 7 columns):
Name,Address,City,State,Zip,Phone,Email
John Smith,123 MAIN ST, APT # 5A,AUSTIN,TX,78701-123,555-0101,john@example.comNote: Address, City, and State are cleaned; non-address columns (Name, Phone, Email) are preserved unchanged
The tool handles various address formats:
- Standard addresses: `123 Main Street, Austin, TX 78701`
- PO Box addresses: `PO Box 123, Austin, TX 78701`
- Apartment addresses: `123 Main St Apt 456, Austin, TX 78701`
- Addresses with directionals: `123 North Main Street, Austin, TX 78701`
- ZIP+4 codes: `123 Main St, Austin TX 78701-1234`
The tool validates:
- ZIP codes: Must be 5 digits or ZIP+4 format
- State abbreviations: Must be valid US state abbreviations
- Address completeness: Requires street number, street name, city, and state
- PO Box addresses: Have different completeness requirements
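The ZIP and state checks can be expressed compactly. A simplified standalone sketch - not the tool's actual validator, and the state set below is abbreviated to a few entries for brevity:

```python
import re

# 5 digits, optionally followed by a 4-digit ZIP+4 extension.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

# Abbreviated for illustration; a real validator lists all states and territories.
VALID_STATES = {"TX", "CA", "NY", "FL"}

def check_zip(zip_code: str) -> bool:
    return bool(ZIP_RE.match(zip_code))

def check_state(state: str) -> bool:
    return state.upper() in VALID_STATES

print(check_zip("78701"))       # True
print(check_zip("78701-1234"))  # True
print(check_zip("78701-123"))   # False
print(check_state("tx"))        # True
```

Completeness checks then simply verify that the required parsed fields (street number, street name, city, state) are non-empty, with a relaxed rule for PO Box addresses.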
Error: "No such file or directory"
- Cause: Input file doesn't exist or path is incorrect
- Solution: Check file path and ensure file exists
Error: "Address column 'address' not found"

- Cause: CSV doesn't have a column named "address" (column names are case-sensitive)
- Solution:
  - Use `--address-column` to specify a single column name (e.g., `--address-column "Address"`)
  - Use `--auto-combine` to auto-detect and combine separate address columns
  - Use `--address-columns` to explicitly specify multiple columns to combine (e.g., `--address-columns "Address,City,State,Zip"`)
- Note: Column names are case-sensitive and must match exactly
Error: "Invalid input file format"
- Cause: File is not a CSV
- Solution: Ensure input file has .csv extension
Error: "Failed to install dependencies"
- Cause: pip installation issues
- Solution: Try manual installation or check Python/pip setup
Linux/macOS:

- Ensure Python 3.8+ is installed
- Use the `python3` command instead of `python`

Windows:

- Use `python` instead of `python3`
- Ensure PATH includes the Python installation
If installation fails:

```bash
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Slow processing:
- Reduce the chunk size: `--chunk-size 500`
- Check available memory
- Close other applications
Memory errors:
- Use smaller chunk sizes
- Process files in smaller batches
- Increase system memory if possible
The tool uses a modular architecture with three main components:
- Parser (`src/parser.py`): Uses the `usaddress` library for NLP-based address parsing
- Validator (`src/validator.py`): Offline validation of address components
- Formatter (`src/formatter.py`): USPS Publication 28 standard formatting
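The parse → validate → format flow can be sketched end to end. This toy version uses a regex in place of the real `usaddress` parser, so it only handles the simple "number street, city, ST zip" shape and is not the project's actual code:

```python
import re
from typing import Optional

# Toy parser: handles only "123 Main Street, Austin, TX 78701"-style input.
PATTERN = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(-\d{4})?)$"
)
# Tiny sample of USPS suffix abbreviations, for illustration only.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}

def clean(address: str) -> Optional[str]:
    m = PATTERN.match(address.strip())
    if not m:
        # Validation failure in this toy model: shape not recognized.
        return None
    # Formatter step: uppercase and abbreviate the trailing street suffix.
    words = m["street"].upper().split()
    words[-1] = SUFFIXES.get(words[-1], words[-1])
    return f"{m['number']} {' '.join(words)}, {m['city'].upper()}, {m['state'].upper()}, {m['zip']}"

print(clean("123 Main Street, Austin, TX 78701"))
# 123 MAIN ST, AUSTIN, TX, 78701
```

The real pipeline replaces the regex with probabilistic tagging from `usaddress`, a full rule set in the validator, and the complete Publication 28 abbreviation tables in the formatter.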
The implementation uses usaddress instead of the originally planned libpostal for several reasons:
- Easier installation: No system-level dependencies required
- Better Python integration: Native Python library
- Sufficient accuracy: Meets requirements for US address parsing
- Maintenance: Actively maintained Python package
Supported:
- US addresses only (including territories)
- Standard street addresses
- PO Box addresses
- Apartment/suite addresses
- ZIP and ZIP+4 codes
Not supported:
- International addresses
- Military addresses (APO/FPO)
- Rural route addresses
- Some non-standard formats
Typical accuracy:
- Standard addresses: 90-95%
- PO Box addresses: 95-98%
- Apartment addresses: 85-90%
- Malformed addresses: 60-80%
The tool operates completely offline:
- No internet connection required
- No external API calls
- All validation uses built-in rules
- Fast processing without network delays
Run the test suite:

```bash
python -m pytest tests/ -v
```

Run tests with coverage:

```bash
python -m pytest tests/ --cov=src --cov-report=html
```

Project structure:

```
address-cleaner/
├── src/
│   ├── __init__.py
│   ├── parser.py          # Address parsing with usaddress
│   ├── validator.py       # Offline validation rules
│   ├── formatter.py       # USPS formatting standards
│   └── utils.py           # Helper functions
├── cli.py                 # CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_parser.py
│   ├── test_validator.py
│   └── test_integration.py
├── out/                   # Sample data (contains sample_input.csv)
├── logs/                  # Runtime logs
├── requirements.txt
├── install.sh
├── run.sh
├── README.md
└── LICENSE
```
The tool supports configurable logging:

```bash
# Set the log level
address-cleanser --log-level DEBUG batch --input addresses.csv --output results.csv

# Log to a file
address-cleanser --log-file logs/processing.log batch --input addresses.csv --output results.csv
```

For large files, the tool processes addresses in chunks to manage memory usage. The default chunk size is 1000 addresses, but you can adjust it:

```bash
address-cleanser batch --input large_file.csv --output results.csv --chunk-size 5000
```

The tool handles various error conditions gracefully:
- Invalid input files: Clear error messages for missing or malformed files
- Parsing errors: Fallback strategies for ambiguous addresses
- Validation errors: Detailed error messages for invalid components
- File I/O errors: Proper error handling for file operations
- US addresses only: The tool is designed specifically for US addresses
- Offline validation: No real-time USPS API validation in MVP version
- English only: Address parsing works best with English text
- USPS API integration for real-time validation
- International address support
- Geocoding capabilities
- Web API endpoint
- Docker containerization
- Update In-Place Mode: New `--update-in-place` flag for Fiverr/client work
  - Mirrors the client's exact input structure (same columns, same order)
  - Only address fields are cleaned - no extra columns added
  - Perfect for returning data that needs to plug right back into client systems
  - Works with any column format: separate (Address, City, State, Zip) or combined (Full_Address)
  - Automatically detects and intelligently maps column names (Street_Address, City_Name, etc.)
  - Preserves all non-address columns unchanged
- Fixed PyInstaller Executable Issues:
  - Fixed `ModuleNotFoundError: No module named 'mmap'` that caused crashes when writing CSV output
  - Fixed address parsing failures by including usaddress model files in executable bundles
  - Executables now work correctly with all features
- Column Preservation: Preserve original CSV columns with the `--preserve-columns` flag
- Auto-Detection: Automatically detect and combine separate address columns with `--auto-combine`
- Manual Column Combination: Explicitly specify columns to combine with `--address-columns`
- Enhanced Output: Original data structure maintained while adding cleaned address fields
- Flexible Input: Now supports both single-column addresses and separate address components
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
For issues and questions, please create an issue in the project repository.