Usage Instructions for AI CSV Column Matching Tool

Getting Started

Prerequisites

Python 3.8 or higher
Recommended: Python 3.11
pip package manager
Recommended: pip>24.0

Installation

Navigate to the project directory:
```
cd ai_csv_column_matching_tool
```

Create and Activate Virtual environment:

py -3.11 -m venv .venv-csv-match 
# or, python -m venv .venv-csv-match
.\.venv-csv-match\Scripts\activate

Install the required dependencies:
```
pip install -r requirements.txt
```

Running the Application

Web Interface (Recommended)

The tool has a user-friendly web interface built with Streamlit.

To start the web application:

streamlit run main.py

Note:

On First run, it will download the embeddings models to the PC, which might take few seconds to about a minute, depending on internet speed.

Once the application is running, open your web browser and go to http://localhost:8501.

Using the Web Interface

Upload Files: Click the "Browse files" buttons to upload two CSV files
Set Parameters: Adjust the similarity threshold (0.0-1.0)
Run AI Matching: Click the "Find AI Column Matches" button to process the files using AI
Manual Matching: Use the visual interface to manually match unmatched columns:
- "Best Potential Matches" section shows sorted recommendations (highest similarity first)
- Select columns from the dropdowns (sorted by similarity when a column is selected)
- The similarity score automatically calculates based on column names and sample values
- See sample values from each column to help with matching decisions
- Click "Add Manual Match" to pair them together with the calculated similarity score
- Matched pairs are automatically removed from dropdowns after selection
- Review all matches in the "All Column Matches" section
Generate Output: Use one of two options:
- "Generate Merged Output (Mapped Columns Only)": Creates output with only matched columns
- "Generate Full Output (Mapped + Unmapped Columns)": Creates output with matched columns first, then unmapped columns from file 1 (with _filename suffix), then unmapped columns from file 2 (with _filename suffix)
Download Output: Use the download buttons to save the output files:
- For mapped columns only: file1_{first_10_chars_of_file2}mapped_only.csv and file1{first_10_chars_of_file2}_mapped_only_schema.json
- For full output: file1_{first_10_chars_of_file2}full_columns_merged.csv and file1{first_10_chars_of_file2}_full_schema.json

How the Matching Works

The tool uses lightweight AI embeddings to compare both:

Column names (e.g., "Operational_costs" vs "Total_Expenses")
Sample values from each column (e.g., "CC101" vs "ORG101")

The system uses efficient models under 300MB that are optimized for semantic similarity detection. When similar columns are detected, the tool will merge the files accordingly.

Output Files

The tool generates two types of output files based on the selected option:

Mapped Columns Only

mapped_only.csv: Contains only the matched columns with data from both input files concatenated (first 50 rows from file 1, next 50 rows from file 2)
mapped_only_schema.json: Contains metadata about the matched columns

Full Output

full_columns_merged.csv: Contains all data with columns ordered as: matched columns first, then unmapped columns from file 1 (with _filename suffix), then unmapped columns from file 2 (with _filename suffix)
full_schema.json: Contains metadata about all columns including unmatched ones

Both schema files contain:

Unified column names
Original columns and their source files
Confidence scores for each column match
Unmatched columns from each file

Command Line Usage (Advanced)

For programmatic use, you can also use the modules directly:

from csv_matcher import CSVIngestor
from embeddings_matcher import EmbeddingsMatcher
from csv_merger import CSVMerger

# Load your CSV files
ingestor = CSVIngestor()
df1 = pd.read_csv("path/to/file1.csv")
df2 = pd.read_csv("path/to/file2.csv")
ingestor.load_dataframes(df1, df2)

# Get column information
file1_info = ingestor.get_column_info(is_file1=True)
file2_info = ingestor.get_column_info(is_file1=False)

# Perform matching
matcher = EmbeddingsMatcher()
matches = matcher.match_columns(file1_info, file2_info, threshold=0.5)

# Merge and export (mapped columns only)
merger = CSVMerger()
merged_df = merger.merge_csv_files(df1, df2, matches, include_unmatched=False, include_matched_only=True, file1_name="file1.csv", file2_name="file2.csv")
schema = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches, file1_name="file1.csv", file2_name="file2.csv")
csv_path, schema_path = merger.export_results(merged_df, schema)

# Or merge and export (full data with mapped + unmapped columns)
merged_df_full = merger.merge_csv_files(df1, df2, matches, include_unmatched=True, include_matched_only=False, file1_name="file1.csv", file2_name="file2.csv")
schema_full = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches, file1_name="file1.csv", file2_name="file2.csv")
csv_path_full, schema_path_full = merger.export_results(merged_df_full, schema_full)

Troubleshooting

Common Issues

Import Errors: Make sure all dependencies are installed with pip install -r requirements.txt
Memory Issues: Large CSV files may require more memory. Consider processing smaller chunks or optimizing your system's memory settings.
Model Loading Issues: The first time you run the application, it needs to download the embedding model, which might take some time depending on your internet connection.
Very few or No Matches Found: If no matches are found, try:
- Lowering the similarity threshold
- Ensuring your CSV files have meaningful column names and data
- Checking that both files contain text or categorical data (the model works better with non-numeric column headers)

Performance Tips

The tool works best with columns containing text or categorical data
For large files, the matching process may take longer
Consider running on a machine with more RAM for large datasets

Example Scenario

Problem: You have two files with similar data but different column names:

File 1: emp_id, dept_code, total_cost
File 2: employee_id, department_code, actual_spend

Solution:

The AI tool recognizes that emp_id ≈ employee_id, dept_code ≈ department_code, and total_cost ≈ actual_spend
For any unmatched columns, you can manually select them from the dropdowns and pair them using the visual interface
The tool merges the files based on both AI matches and your manual selections

Support

If you encounter issues, please:

Check that all dependencies are properly installed
Verify your CSV files are properly formatted
Ensure your system meets the prerequisites

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Usage Instructions for AI CSV Column Matching Tool

Getting Started

Prerequisites

Installation

Running the Application

Web Interface (Recommended)

Note:

Using the Web Interface

How the Matching Works

Output Files

Mapped Columns Only

Full Output

Command Line Usage (Advanced)

Troubleshooting

Common Issues

Performance Tips

Example Scenario

Support

FilesExpand file tree

USAGE_INSTRUCTIONS.md

Latest commit

History

USAGE_INSTRUCTIONS.md

File metadata and controls

Usage Instructions for AI CSV Column Matching Tool

Getting Started

Prerequisites

Installation

Running the Application

Web Interface (Recommended)

Note:

Using the Web Interface

How the Matching Works

Output Files

Mapped Columns Only

Full Output

Command Line Usage (Advanced)

Troubleshooting

Common Issues

Performance Tips

Example Scenario

Support