This project processes and analyzes customer messages from CSV files, categorizing them into predefined topics using LDA (Latent Dirichlet Allocation) and keyword-based matching. It also generates visualizations to summarize the results.
- Data Preprocessing:
  - Combines multiple CSV files containing customer messages (from the data/ folder).
  - Filters messages from customers and tokenizes text using both English and Chinese tokenizers (nltk and jieba).
  - Removes stopwords and non-essential characters.
- Topic Modeling:
  - Uses LDA to identify latent topics in the messages.
  - Matches messages to predefined categories using keyword-based scoring.
- Categorization:
  - Maps messages to one of the following categories:
    - 訂房 (Booking)
    - 優惠 (Promotion)
    - 訂餐 (Food Ordering)
    - 設施 (Facility)
    - 行程 (Itinerary)
    - 服務 (Service)
    - 附近 (Nearby)
    - 抱怨 (Complaint)
    - 其他 (Other)
- Visualization:
  - Generates pie charts for:
    - Distribution of customer issue categories.
    - Platform comparison based on unique customer counts.
  - Plots daily unique customer counts over time.
- Export:
  - Saves categorized messages to a CSV file (categorized_customer_messages.csv).
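
The keyword-based scoring mentioned under Topic Modeling can be illustrated with a minimal, standard-library-only sketch. It is not the code in Categorization.py: the keyword lists, the `score` and `categorize` helpers, and the tie-breaking rule are all assumptions made for illustration, and a regular expression stands in for the nltk and jieba tokenizers. The LDA step is not shown.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the project's real lists are not shown in this README.
CATEGORY_KEYWORDS = {
    "訂房": ["book", "booking", "room", "訂房"],
    "優惠": ["discount", "discounts", "promotion", "優惠"],
    "抱怨": ["complaint", "dirty", "noisy", "抱怨"],
}
DEFAULT_CATEGORY = "其他"


def score(text, keywords):
    """Count keyword hits: whole-word matches for ASCII keywords,
    substring matches for Chinese keywords (no word segmentation here)."""
    lowered = text.lower()
    words = Counter(re.findall(r"[a-z0-9]+", lowered))
    total = 0
    for keyword in keywords:
        if keyword.isascii():
            total += words[keyword]
        else:
            total += lowered.count(keyword)
    return total


def categorize(text):
    """Return the highest-scoring category, or the default when nothing matches.
    Ties go to the category listed first in CATEGORY_KEYWORDS."""
    scores = {name: score(text, kws) for name, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else DEFAULT_CATEGORY


if __name__ == "__main__":
    for message in ["I want to book a room", "Do you have any discounts?", "我想訂房"]:
        print(message, "->", categorize(message))
```

Falling back to 其他 (Other) when no keyword matches keeps every message in exactly one category, which is what the pie chart of issue categories needs.
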
- Prepare Data:
  - Place your CSV files in the data/ folder. Do not include real or sensitive company/customer data in the repository.
  - Each CSV should have the following columns:
    Platform,Source,Customer ID,Message Content,Send TIme
  - You may use the provided example_data.csv as a template (with dummy data).
- Install Requirements:
  - Install Python 3.7+ and the required packages:
    pip install pandas matplotlib nltk jieba gensim
- Run the Script:
  - Execute the script:
    python Categorization.py
- View Results:
  - Visualizations will be displayed and saved as PNG files.
  - The categorized results will be saved as categorized_customer_messages.csv.
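
To make the expected input format concrete, the sketch below checks for the required columns and computes the daily unique customer counts that the script plots. It uses only the standard library rather than pandas, and it is an illustration, not the project's implementation. The `daily_unique_customers` helper is hypothetical, and two things are assumed from the example data: that a Source value of `customer` marks customer messages, and that timestamps look like `2025-01-01 12:00:00`. The `agent` rows in the sample are dummy data added for the demonstration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

REQUIRED_COLUMNS = ["Platform", "Source", "Customer ID", "Message Content", "Send TIme"]

# Dummy rows in the README's column layout.
SAMPLE = """Platform,Source,Customer ID,Message Content,Send TIme
LINE,customer,12345,"I want to book a room",2025-01-01 12:00:00
LINE,customer,12345,"Is breakfast included?",2025-01-01 12:05:00
LINE,agent,12345,"Yes, it is.",2025-01-01 12:06:00
WhatsApp,customer,67890,"Do you have any discounts?",2025-01-02 13:45:00
"""


def daily_unique_customers(csv_file):
    """Map each date (YYYY-MM-DD) to the number of distinct customers who wrote that day."""
    reader = csv.DictReader(csv_file)
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    customers_by_day = defaultdict(set)
    for row in reader:
        # Assumption: only rows whose Source is "customer" are customer messages.
        if row["Source"] != "customer":
            continue
        day = datetime.strptime(row["Send TIme"], "%Y-%m-%d %H:%M:%S").date()
        customers_by_day[day].add(row["Customer ID"])

    return {day.isoformat(): len(customers_by_day[day]) for day in sorted(customers_by_day)}


if __name__ == "__main__":
    print(daily_unique_customers(io.StringIO(SAMPLE)))
```

Collecting customer IDs into a set per day means a customer who sends several messages on the same day is counted once.
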
- No real company/customer data is included in this repository.
- The data/ folder is excluded via .gitignore to prevent accidental upload of sensitive files.
- Only use example or dummy data for demonstration and sharing.
You can create a file like data/example_data.csv:
Platform,Source,Customer ID,Message Content,Send TIme
LINE,customer,12345,"I want to book a room",2025-01-01 12:00:00
WhatsApp,customer,67890,"Do you have any discounts?",2025-01-02 13:45:00

License: Apache 2.0