Address Cleanser

A Python CLI tool for parsing, validating, and formatting US addresses according to USPS Publication 28 standards.

Features

Address Parsing: Uses the usaddress library for NLP-based address parsing
Offline Validation: Validates ZIP codes, state abbreviations, and address completeness
USPS Formatting: Formats addresses according to USPS Publication 28 standards
Multiple Output Formats: Supports CSV, JSON, and Excel output formats (input: CSV only)
Batch Processing: Process large CSV files with progress tracking
Single Address Processing: Process individual addresses from command line
Column Preservation: Preserve original CSV columns while adding cleaned address fields
Flexible Input: Accept addresses in a single column OR auto-detect and combine separate columns

Quick Start

Get up and running in 3 steps:

Install dependencies:
```
bash install.sh
```

Test with sample data:

# If installed: address-cleanser batch --input out/sample_input.csv --output test_results.csv --format csv
# If from source: python3 cli.py batch --input out/sample_input.csv --output test_results.csv --format csv

Process a single address:

# If installed: address-cleanser single --single "123 Main Street, Austin, TX 78701"
# If from source: python3 cli.py single --single "123 Main Street, Austin, TX 78701"

Expected Output Preview:

Original: 123 Main Street, Austin, TX 78701
Formatted: 123 MAIN ST, AUSTIN, TX, 78701
Valid: Yes
Confidence: 90.0%

Installation

Prerequisites

Python 3.8 or higher
pip package manager
~50MB disk space for dependencies

Install Dependencies

Recommended (automatic):

bash install.sh

Manual installation:

pip install -r requirements.txt

Or install manually:

pip install usaddress pandas openpyxl click tqdm psutil

Usage

Command Line Interface

Command Invocation:

If installed as a package or via Homebrew: use address-cleanser
If running from source: use python cli.py or python3 cli.py

Command Structure:

address-cleanser [OPTIONS] COMMAND [ARGS]...

The tool provides two main commands:

batch - Process addresses from a CSV file
single - Process a single address

Batch Processing

Process addresses from a CSV file (CSV input only; output can be CSV, JSON, or Excel):

address-cleanser batch --input addresses.csv --output cleaned_addresses.csv --format csv

Options:

--input, -i: Input CSV file path (required) - Note: Only CSV files are supported as input
--output, -o: Output file path (required)
--format, -f: Output format - csv, json, or excel (default: csv)
--address-column, -c: Name of the address column in CSV (auto-detected if not specified)
--address-columns, -C: Comma-separated list of columns to combine (e.g., "Address,City,State,Zip")
--preserve-columns, -p: Preserve all original CSV columns in output
--update-in-place: Mirror input structure with cleaned values (perfect for client returns)
--auto-combine, -a: Auto-detect and combine separate address columns
--report, -r: Generate validation report file (optional)
--chunk-size: Process addresses in chunks of this size (default: 1000)

Examples:

# Process CSV file and output to CSV
address-cleanser batch --input addresses.csv --output cleaned.csv --format csv

# Process CSV file and output to Excel with validation report
address-cleanser batch --input addresses.csv --output results.xlsx --format excel --report validation.txt

# Process with custom address column name (case-sensitive)
address-cleanser batch --input data.csv --output output.csv --address-column "Address"

# Example: If your CSV has "full_address" instead of "address"
address-cleanser batch --input data.csv --output output.csv --address-column "full_address"

# Process CSV with separate address columns and preserve original columns
address-cleanser batch --input data.csv --output cleaned.csv --preserve-columns --auto-combine

# Explicitly specify which columns to combine
address-cleanser batch --input data.csv --output cleaned.csv --preserve-columns --address-columns "Address,City,State,Zip"

# Preserve columns and output to Excel
address-cleanser batch --input client_data.csv --output cleaned.xlsx --format excel --preserve-columns --auto-combine

# Update in place - mirror client's structure with cleaned values (perfect for Fiverr/client returns)
address-cleanser batch --input client_data.csv --output cleaned.csv --update-in-place

Single Address Processing

Process a single address:

address-cleanser single --single "123 Main Street, Austin, TX 78701" --format json

Options:

--single, -s: Single address to process (required)
--format, -f: Output format - csv, json, or excel (default: json)
--output, -o: Output file path (optional, prints to console if not specified)

Examples:

# Process single address and print to console
address-cleanser single --single "123 Main Street, Austin, TX 78701"

# Process single address and save to file
address-cleanser single --single "PO Box 123, Austin, TX 78701" --output result.json --format json

Input Format

Input Requirements: The CLI accepts CSV files only as input. Excel files (.xlsx) are not supported for input, but you can output to Excel format. To process an Excel file, first export it to CSV format.

Address Column Formats: The tool supports two input formats:

Single Combined Column (default): One column with complete address string
```
address
123 Main Street, Austin, TX 78701
```
Separate Columns: Address components in multiple columns
```
Address,City,State,Zip
123 Main Street,Austin,TX,78701
```
Use --auto-combine to auto-detect and combine, or --address-columns "Address,City,State,Zip" to specify manually.

Column Detection:

By default, looks for a column named address (case-sensitive, lowercase)
Use --address-column to specify a different single column name
Use --auto-combine to automatically detect and combine separate address columns
Use --address-columns to explicitly specify which columns to combine (comma-separated)

Column Preservation:

Use --preserve-columns to keep all original CSV columns in the output
Original columns appear first, followed by cleaned_* columns with parsed/cleaned data
Perfect for maintaining client data structures while adding cleaned address fields

Update In-Place Mode:

Use --update-in-place to mirror the client's exact input structure with cleaned values
Same columns, same order - only the address data is cleaned/standardized
Perfect for Fiverr clients or when returning data that needs to plug right back into their systems
No extra columns added - address fields are updated with cleaned values in place
Works with any column format: separate columns (Address, City, State, Zip) or combined (Full_Address)

Example CSV Formats:

Single combined column (preferred):

address
123 Main Street, Austin, TX 78701
456 Oak Avenue, Dallas, TX 75201

Separate columns (also supported):

Name,Address,City,State,Zip,Phone
John Smith,123 Main St,Austin,TX,78701,555-0101

Output Formats

CSV Output

Standard Output (without --preserve-columns): The CSV output includes the following columns:

original_address: Original input address
street_number: Parsed street number
street_name: Parsed street name
street_type: Parsed street type (ST, AVE, etc.)
city: Parsed city name
state: Parsed state abbreviation
zip_code: Parsed ZIP code
unit: Unit/apartment information
po_box: PO Box number
formatted_address: USPS-formatted single-line address
confidence_score: Parsing confidence (0-100)
validation_status: Valid or Invalid
issues: List of validation issues
address_type: Type of address (Street Address, PO Box, etc.)

With Column Preservation (using --preserve-columns):

All original columns are preserved in their original order
New cleaned_* columns are appended:
- cleaned_original_address
- cleaned_street_number
- cleaned_street_name
- cleaned_street_type
- cleaned_city
- cleaned_state
- cleaned_zip_code
- cleaned_unit
- cleaned_po_box
- cleaned_formatted_address
- cleaned_confidence_score
- cleaned_validation_status
- cleaned_issues
- cleaned_address_type

This allows you to maintain your original data structure while adding cleaned address fields.

JSON Output

Standard Output (without --preserve-columns): The JSON output includes:

results: Array of processing results
summary: Processing statistics
timestamp: Processing timestamp

Each result contains:

original: Original address
parsed: Parsed address components
formatted: USPS-formatted components
single_line: Single-line formatted address
multi_line: Multi-line formatted address
confidence: Confidence score
valid: Validation status
issues: List of issues
address_type: Address type

With Column Preservation (using --preserve-columns):

results: Array of processing results (same as above)
original_data: Array of original CSV rows with all original columns
summary: Processing statistics
timestamp: Processing timestamp

Excel Output

Standard Output (without --preserve-columns):

Addresses sheet: Parsed address columns (Original Address, Street Number, etc.)
Summary sheet: Processing statistics and metrics

With Column Preservation (using --preserve-columns):

Addresses sheet: All original columns + cleaned address columns (with "Cleaned " prefix)
Summary sheet: Processing statistics and metrics

Understanding Results

Confidence Scores

The tool provides confidence scores (0-100) to help you assess result quality:

90-100%: High confidence - Results are very reliable
70-89%: Medium confidence - Results are generally good, minor review recommended
50-69%: Low confidence - Results may have issues, manual review recommended
0-49%: Very low confidence - Results likely have significant issues

Validation Status

Valid: Address meets all validation criteria (ZIP code format, state abbreviation, completeness)
Invalid: Address fails one or more validation checks

When Manual Review is Needed

Review addresses with:

Confidence scores below 70%
Validation status "Invalid"
Non-empty "issues" field
Unusual address types or formats

Example Interpretations

{
  "original": "123 Main Street, Austin, TX 78701",
  "confidence": 90.0,
  "valid": true,
  "issues": [],
  "formatted_address": "123 MAIN ST, AUSTIN, TX, 78701"
}

✅ High confidence, valid address - Ready to use

{
  "original": "123 Main St Austin TX",
  "confidence": 65.0,
  "valid": false,
  "issues": ["Missing required fields: zip_code"],
  "formatted_address": "123 MAIN ST, AUSTIN, TX"
}

⚠️ Medium confidence, missing ZIP - Add ZIP code if known

Performance Guidelines

Expected Processing Times

Single address: ~0.4 seconds
Small files (100 addresses): ~10 seconds
Medium files (1,000 addresses): ~2 minutes
Large files (10,000 addresses): ~20 minutes

Memory Requirements

Base memory: ~50MB
Per 1,000 addresses: ~5MB additional
Recommended: 100MB+ available memory for files >5,000 addresses

File Size Limitations

Maximum recommended: 50,000 addresses per file
Chunk processing: Automatically handles large files
Memory optimization: Use smaller chunk sizes (500-1000) for very large files

Chunk Size Recommendations

Small files (<1,000 addresses): Default (1000) is fine
Medium files (1,000-10,000 addresses): Use 1000-2000
Large files (>10,000 addresses): Use 500-1000
Memory-constrained systems: Use 250-500

Examples

Processing a Sample File

# Process the included sample file
address-cleanser batch --input out/sample_input.csv --output results.csv --format csv --report report.txt

Working with Different CSV Structures

Example 1: CSV with separate address columns

# Auto-detect and combine address columns, preserve all original columns
address-cleanser batch \
  --input client_data.csv \
  --output cleaned.csv \
  --preserve-columns \
  --auto-combine

Example 2: Explicitly specify which columns to combine

# Combine specific columns and preserve original structure
address-cleanser batch \
  --input data.csv \
  --output cleaned.csv \
  --preserve-columns \
  --address-columns "Address,City,State,Zip"

Example 3: Combined address column (standard)

# Standard processing with combined address column
address-cleanser batch \
  --input addresses.csv \
  --output cleaned.csv \
  --address-column "Full_Address"

Example 4: Preserve columns and export to Excel

# Maintain original structure, add cleaned fields, output to Excel
address-cleanser batch \
  --input client_data.csv \
  --output cleaned.xlsx \
  --format excel \
  --preserve-columns \
  --auto-combine

Example 5: Update in-place (perfect for client returns)

# Mirror client's structure with cleaned values - no extra columns
address-cleanser batch \
  --input client_data.csv \
  --output cleaned.csv \
  --update-in-place

Before (input with 7 columns):

Name,Address,City,State,Zip,Phone,Email
John Smith,123 main st apt#5a,Austin,TX,78701-123,555-0101,john@example.com

After (output with same 7 columns):

Name,Address,City,State,Zip,Phone,Email
John Smith,123 MAIN ST, APT # 5A,AUSTIN,TX,78701-123,555-0101,john@example.com

Note: Address, City, and State are cleaned; non-address columns (Name, Phone, Email) are preserved unchanged

Processing Different Address Types

The tool handles various address formats:

Standard addresses: 123 Main Street, Austin, TX 78701
PO Box addresses: PO Box 123, Austin, TX 78701
Apartment addresses: 123 Main St Apt 456, Austin, TX 78701
Addresses with directionals: 123 North Main Street, Austin, TX 78701
ZIP+4 codes: 123 Main St, Austin TX 78701-1234

Validation Examples

The tool validates:

ZIP codes: Must be 5 digits or ZIP+4 format
State abbreviations: Must be valid US state abbreviations
Address completeness: Requires street number, street name, city, and state
PO Box addresses: Have different completeness requirements

Troubleshooting

Common Errors and Solutions

Error: "No such file or directory"

Cause: Input file doesn't exist or path is incorrect
Solution: Check file path and ensure file exists

Error: "Address column 'address' not found"

Cause: CSV doesn't have a column named "address" (column names are case-sensitive)
Solution:
- Use --address-column to specify a single column name (e.g., --address-column "Address")
- Use --auto-combine to auto-detect and combine separate address columns
- Use --address-columns to explicitly specify multiple columns to combine (e.g., --address-columns "Address,City,State,Zip")
Note: Column names are case-sensitive and must match exactly

Error: "Invalid input file format"

Cause: File is not a CSV
Solution: Ensure input file has .csv extension

Error: "Failed to install dependencies"

Cause: pip installation issues
Solution: Try manual installation or check Python/pip setup

Platform-Specific Issues

Linux/macOS:

Ensure Python 3.8+ is installed
Use python3 command instead of python

Windows:

Use python instead of python3
Ensure PATH includes Python installation

Dependency Conflicts

If installation fails:

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Performance Problems

Slow processing:

Reduce chunk size: --chunk-size 500
Check available memory
Close other applications

Memory errors:

Use smaller chunk sizes
Process files in smaller batches
Increase system memory if possible

Technical Details

Architecture

The tool uses a modular architecture with three main components:

Parser (src/parser.py): Uses usaddress library for NLP-based address parsing
Validator (src/validator.py): Offline validation of address components
Formatter (src/formatter.py): USPS Publication 28 standard formatting

Why usaddress instead of libpostal?

The implementation uses usaddress instead of the originally planned libpostal for several reasons:

Easier installation: No system-level dependencies required
Better Python integration: Native Python library
Sufficient accuracy: Meets requirements for US address parsing
Maintenance: Actively maintained Python package

Limitations and Accuracy

Supported:

US addresses only (including territories)
Standard street addresses
PO Box addresses
Apartment/suite addresses
ZIP and ZIP+4 codes

Not supported:

International addresses
Military addresses (APO/FPO)
Rural route addresses
Some non-standard formats

Typical accuracy:

Standard addresses: 90-95%
PO Box addresses: 95-98%
Apartment addresses: 85-90%
Malformed addresses: 60-80%

Offline Operation

The tool operates completely offline:

No internet connection required
No external API calls
All validation uses built-in rules
Fast processing without network delays

Testing

Run the test suite:

python -m pytest tests/ -v

Run tests with coverage:

python -m pytest tests/ --cov=src --cov-report=html

Project Structure

address-cleaner/
├── src/
│   ├── __init__.py
│   ├── parser.py          # Address parsing with usaddress
│   ├── validator.py       # Offline validation rules
│   ├── formatter.py       # USPS formatting standards
│   └── utils.py           # Helper functions
├── cli.py                 # CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_parser.py
│   ├── test_validator.py
│   └── test_integration.py
├── out/                   # Sample data (contains sample_input.csv)
├── logs/                  # Runtime logs
├── requirements.txt
├── install.sh
├── run.sh
├── README.md
└── LICENSE

Configuration

Logging

The tool supports configurable logging:

# Set log level
address-cleanser --log-level DEBUG batch --input addresses.csv --output results.csv

# Log to file
address-cleanser --log-file logs/processing.log batch --input addresses.csv --output results.csv

Performance

For large files, the tool processes addresses in chunks to manage memory usage. The default chunk size is 1000 addresses, but you can adjust this:

address-cleanser batch --input large_file.csv --output results.csv --chunk-size 5000

Error Handling

The tool handles various error conditions gracefully:

Invalid input files: Clear error messages for missing or malformed files
Parsing errors: Fallback strategies for ambiguous addresses
Validation errors: Detailed error messages for invalid components
File I/O errors: Proper error handling for file operations

Limitations

US addresses only: The tool is designed specifically for US addresses
Offline validation: No real-time USPS API validation in MVP version
English only: Address parsing works best with English text

Future Enhancements

USPS API integration for real-time validation
International address support
Geocoding capabilities
Web API endpoint
Docker containerization

Recent Updates

Version 1.0.16 Features

Update In-Place Mode: New --update-in-place flag for Fiverr/client work
- Mirrors client's exact input structure (same columns, same order)
- Only address fields are cleaned - no extra columns added
- Perfect for returning data that needs to plug right back into client systems
- Works with any column format: separate (Address, City, State, Zip) or combined (Full_Address)
- Automatically detects and intelligently maps column names (Street_Address, City_Name, etc.)
- Preserves all non-address columns unchanged

Version 1.0.14 Features

Fixed PyInstaller Executable Issues:
- Fixed ModuleNotFoundError: No module named 'mmap' that caused crashes when writing CSV output
- Fixed address parsing failures by including usaddress model files in executable bundles
- Executables now work correctly with all features

Version 1.0.13 Features

Column Preservation: Preserve original CSV columns with --preserve-columns flag
Auto-Detection: Automatically detect and combine separate address columns with --auto-combine
Manual Column Combination: Explicitly specify columns to combine with --address-columns
Enhanced Output: Original data structure maintained while adding cleaned address fields
Flexible Input: Now supports both single-column addresses and separate address components

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Run the test suite
Submit a pull request

Support

For issues and questions, please create an issue in the project repository.# address-cleanser

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
.github		.github
hooks		hooks
logs		logs
out		out
packaging		packaging
scripts		scripts
src		src
test_data		test_data
tests		tests
.coveragerc		.coveragerc
.directory		.directory
.gitignore		.gitignore
API_README.md		API_README.md
BUILD_EXECUTABLE.md		BUILD_EXECUTABLE.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
EXTENSION_GUIDE.md		EXTENSION_GUIDE.md
LICENSE		LICENSE
NEXT_STEPS.md		NEXT_STEPS.md
README.md		README.md
address-cleanser.rb		address-cleanser.rb
api_server.py		api_server.py
cli.py		cli.py
install.sh		install.sh
pyproject.toml		pyproject.toml
requirements-build.txt		requirements-build.txt
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
run.sh		run.sh
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

Address Cleanser

Features

Quick Start

Installation

Prerequisites

Install Dependencies

Usage

Command Line Interface

Batch Processing

Single Address Processing

Input Format

Output Formats

CSV Output

JSON Output

Excel Output

Understanding Results

Confidence Scores

Validation Status

When Manual Review is Needed

Example Interpretations

Performance Guidelines

Expected Processing Times

Memory Requirements

File Size Limitations

Chunk Size Recommendations

Examples

Processing a Sample File

Working with Different CSV Structures

Processing Different Address Types

Validation Examples

Troubleshooting

Common Errors and Solutions

Platform-Specific Issues

Dependency Conflicts

Performance Problems

Technical Details

Architecture

Why usaddress instead of libpostal?

Limitations and Accuracy

Offline Operation

Testing

Project Structure

Configuration

Logging

Performance

Error Handling

Limitations

Future Enhancements

Recent Updates

Version 1.0.16 Features

Version 1.0.14 Features

Version 1.0.13 Features

License

Contributing

Support

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 27

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages