Skip to content

Latest commit

 

History

History
166 lines (121 loc) · 6.9 KB

File metadata and controls

166 lines (121 loc) · 6.9 KB

Usage Instructions for AI CSV Column Matching Tool

Getting Started

Prerequisites

  • Python 3.8 or higher
  • Recommended: Python 3.11
  • pip package manager
  • Recommended: pip>24.0

Installation

  1. Navigate to the project directory:

    cd ai_csv_column_matching_tool
    
  2. Create and Activate Virtual environment:

    py -3.11 -m venv .venv-csv-match 
    # or, python -m venv .venv-csv-match
    .\.venv-csv-match\Scripts\activate
    
  3. Install the required dependencies:

    pip install -r requirements.txt
    

Running the Application

Web Interface (Recommended)

The tool has a user-friendly web interface built with Streamlit.

To start the web application:

streamlit run main.py

Note:

  • On First run, it will download the embeddings models to the PC, which might take few seconds to about a minute, depending on internet speed.

Once the application is running, open your web browser and go to http://localhost:8501.

Using the Web Interface

  1. Upload Files: Click the "Browse files" buttons to upload two CSV files
  2. Set Parameters: Adjust the similarity threshold (0.0-1.0)
  3. Run AI Matching: Click the "Find AI Column Matches" button to process the files using AI
  4. Manual Matching: Use the visual interface to manually match unmatched columns:
    • "Best Potential Matches" section shows sorted recommendations (highest similarity first)
    • Select columns from the dropdowns (sorted by similarity when a column is selected)
    • The similarity score automatically calculates based on column names and sample values
    • See sample values from each column to help with matching decisions
    • Click "Add Manual Match" to pair them together with the calculated similarity score
    • Matched pairs are automatically removed from dropdowns after selection
    • Review all matches in the "All Column Matches" section
  5. Generate Output: Use one of two options:
    • "Generate Merged Output (Mapped Columns Only)": Creates output with only matched columns
    • "Generate Full Output (Mapped + Unmapped Columns)": Creates output with matched columns first, then unmapped columns from file 1 (with _filename suffix), then unmapped columns from file 2 (with _filename suffix)
  6. Download Output: Use the download buttons to save the output files:
    • For mapped columns only: file1_{first_10_chars_of_file2}mapped_only.csv and file1{first_10_chars_of_file2}_mapped_only_schema.json
    • For full output: file1_{first_10_chars_of_file2}full_columns_merged.csv and file1{first_10_chars_of_file2}_full_schema.json

How the Matching Works

The tool uses lightweight AI embeddings to compare both:

  • Column names (e.g., "Operational_costs" vs "Total_Expenses")
  • Sample values from each column (e.g., "CC101" vs "ORG101")

The system uses efficient models under 300MB that are optimized for semantic similarity detection. When similar columns are detected, the tool will merge the files accordingly.

Output Files

The tool generates two types of output files based on the selected option:

Mapped Columns Only

  1. mapped_only.csv: Contains only the matched columns with data from both input files concatenated (first 50 rows from file 1, next 50 rows from file 2)
  2. mapped_only_schema.json: Contains metadata about the matched columns

Full Output

  1. full_columns_merged.csv: Contains all data with columns ordered as: matched columns first, then unmapped columns from file 1 (with _filename suffix), then unmapped columns from file 2 (with _filename suffix)
  2. full_schema.json: Contains metadata about all columns including unmatched ones

Both schema files contain:

  • Unified column names
  • Original columns and their source files
  • Confidence scores for each column match
  • Unmatched columns from each file

Command Line Usage (Advanced)

For programmatic use, you can also use the modules directly:

from csv_matcher import CSVIngestor
from embeddings_matcher import EmbeddingsMatcher
from csv_merger import CSVMerger

# Load your CSV files
ingestor = CSVIngestor()
df1 = pd.read_csv("path/to/file1.csv")
df2 = pd.read_csv("path/to/file2.csv")
ingestor.load_dataframes(df1, df2)

# Get column information
file1_info = ingestor.get_column_info(is_file1=True)
file2_info = ingestor.get_column_info(is_file1=False)

# Perform matching
matcher = EmbeddingsMatcher()
matches = matcher.match_columns(file1_info, file2_info, threshold=0.5)

# Merge and export (mapped columns only)
merger = CSVMerger()
merged_df = merger.merge_csv_files(df1, df2, matches, include_unmatched=False, include_matched_only=True, file1_name="file1.csv", file2_name="file2.csv")
schema = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches, file1_name="file1.csv", file2_name="file2.csv")
csv_path, schema_path = merger.export_results(merged_df, schema)

# Or merge and export (full data with mapped + unmapped columns)
merged_df_full = merger.merge_csv_files(df1, df2, matches, include_unmatched=True, include_matched_only=False, file1_name="file1.csv", file2_name="file2.csv")
schema_full = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches, file1_name="file1.csv", file2_name="file2.csv")
csv_path_full, schema_path_full = merger.export_results(merged_df_full, schema_full)

Troubleshooting

Common Issues

  1. Import Errors: Make sure all dependencies are installed with pip install -r requirements.txt

  2. Memory Issues: Large CSV files may require more memory. Consider processing smaller chunks or optimizing your system's memory settings.

  3. Model Loading Issues: The first time you run the application, it needs to download the embedding model, which might take some time depending on your internet connection.

  4. Very few or No Matches Found: If no matches are found, try:

    • Lowering the similarity threshold
    • Ensuring your CSV files have meaningful column names and data
    • Checking that both files contain text or categorical data (the model works better with non-numeric column headers)

Performance Tips

  • The tool works best with columns containing text or categorical data
  • For large files, the matching process may take longer
  • Consider running on a machine with more RAM for large datasets

Example Scenario

Problem: You have two files with similar data but different column names:

  • File 1: emp_id, dept_code, total_cost
  • File 2: employee_id, department_code, actual_spend

Solution:

  1. The AI tool recognizes that emp_idemployee_id, dept_codedepartment_code, and total_costactual_spend
  2. For any unmatched columns, you can manually select them from the dropdowns and pair them using the visual interface
  3. The tool merges the files based on both AI matches and your manual selections

Support

If you encounter issues, please:

  1. Check that all dependencies are properly installed
  2. Verify your CSV files are properly formatted
  3. Ensure your system meets the prerequisites