- Python 3.8 or higher
- Recommended: Python 3.11
- pip package manager
- Recommended: pip>24.0
-
Navigate to the project directory:
cd ai_csv_column_matching_tool -
Create and Activate Virtual environment:
py -3.11 -m venv .venv-csv-match # or, python -m venv .venv-csv-match .\.venv-csv-match\Scripts\activate -
Install the required dependencies:
pip install -r requirements.txt
The tool has a user-friendly web interface built with Streamlit.
To start the web application:
streamlit run main.py
- On First run, it will download the embeddings models to the PC, which might take few seconds to about a minute, depending on internet speed.
Once the application is running, open your web browser and go to http://localhost:8501.
- Upload Files: Click the "Browse files" buttons to upload two CSV files
- Set Parameters: Adjust the similarity threshold (0.0-1.0)
- Run AI Matching: Click the "Find AI Column Matches" button to process the files using AI
- Manual Matching: Use the visual interface to manually match unmatched columns:
- "Best Potential Matches" section shows sorted recommendations (highest similarity first)
- Select columns from the dropdowns (sorted by similarity when a column is selected)
- The similarity score automatically calculates based on column names and sample values
- See sample values from each column to help with matching decisions
- Click "Add Manual Match" to pair them together with the calculated similarity score
- Matched pairs are automatically removed from dropdowns after selection
- Review all matches in the "All Column Matches" section
- Generate Output: Use one of two options:
- "Generate Merged Output (Mapped Columns Only)": Creates output with only matched columns
- "Generate Full Output (Mapped + Unmapped Columns)": Creates output with matched columns first, then unmapped columns from file 1 (with _filename suffix), then unmapped columns from file 2 (with _filename suffix)
- Download Output: Use the download buttons to save the output files:
- For mapped columns only: file1_{first_10_chars_of_file2}mapped_only.csv and file1{first_10_chars_of_file2}_mapped_only_schema.json
- For full output: file1_{first_10_chars_of_file2}full_columns_merged.csv and file1{first_10_chars_of_file2}_full_schema.json
The tool uses lightweight AI embeddings to compare both:
- Column names (e.g., "Operational_costs" vs "Total_Expenses")
- Sample values from each column (e.g., "CC101" vs "ORG101")
The system uses efficient models under 300MB that are optimized for semantic similarity detection. When similar columns are detected, the tool will merge the files accordingly.
The tool generates two types of output files based on the selected option:
- mapped_only.csv: Contains only the matched columns with data from both input files concatenated (first 50 rows from file 1, next 50 rows from file 2)
- mapped_only_schema.json: Contains metadata about the matched columns
- full_columns_merged.csv: Contains all data with columns ordered as: matched columns first, then unmapped columns from file 1 (with _filename suffix), then unmapped columns from file 2 (with _filename suffix)
- full_schema.json: Contains metadata about all columns including unmatched ones
Both schema files contain:
- Unified column names
- Original columns and their source files
- Confidence scores for each column match
- Unmatched columns from each file
For programmatic use, you can also use the modules directly:
from csv_matcher import CSVIngestor
from embeddings_matcher import EmbeddingsMatcher
from csv_merger import CSVMerger
# Load your CSV files
ingestor = CSVIngestor()
df1 = pd.read_csv("path/to/file1.csv")
df2 = pd.read_csv("path/to/file2.csv")
ingestor.load_dataframes(df1, df2)
# Get column information
file1_info = ingestor.get_column_info(is_file1=True)
file2_info = ingestor.get_column_info(is_file1=False)
# Perform matching
matcher = EmbeddingsMatcher()
matches = matcher.match_columns(file1_info, file2_info, threshold=0.5)
# Merge and export (mapped columns only)
merger = CSVMerger()
merged_df = merger.merge_csv_files(df1, df2, matches, include_unmatched=False, include_matched_only=True, file1_name="file1.csv", file2_name="file2.csv")
schema = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches, file1_name="file1.csv", file2_name="file2.csv")
csv_path, schema_path = merger.export_results(merged_df, schema)
# Or merge and export (full data with mapped + unmapped columns)
merged_df_full = merger.merge_csv_files(df1, df2, matches, include_unmatched=True, include_matched_only=False, file1_name="file1.csv", file2_name="file2.csv")
schema_full = merger.create_schema(df1.columns.tolist(), df2.columns.tolist(), matches, file1_name="file1.csv", file2_name="file2.csv")
csv_path_full, schema_path_full = merger.export_results(merged_df_full, schema_full)-
Import Errors: Make sure all dependencies are installed with
pip install -r requirements.txt -
Memory Issues: Large CSV files may require more memory. Consider processing smaller chunks or optimizing your system's memory settings.
-
Model Loading Issues: The first time you run the application, it needs to download the embedding model, which might take some time depending on your internet connection.
-
Very few or No Matches Found: If no matches are found, try:
- Lowering the similarity threshold
- Ensuring your CSV files have meaningful column names and data
- Checking that both files contain text or categorical data (the model works better with non-numeric column headers)
- The tool works best with columns containing text or categorical data
- For large files, the matching process may take longer
- Consider running on a machine with more RAM for large datasets
Problem: You have two files with similar data but different column names:
- File 1:
emp_id,dept_code,total_cost - File 2:
employee_id,department_code,actual_spend
Solution:
- The AI tool recognizes that
emp_id≈employee_id,dept_code≈department_code, andtotal_cost≈actual_spend - For any unmatched columns, you can manually select them from the dropdowns and pair them using the visual interface
- The tool merges the files based on both AI matches and your manual selections
If you encounter issues, please:
- Check that all dependencies are properly installed
- Verify your CSV files are properly formatted
- Ensure your system meets the prerequisites