
Commit a7b72da

add explainability files and finish readme
1 parent 1a7f3bc commit a7b72da

2 files changed: +107 -22 lines changed


README.md

Lines changed: 52 additions & 22 deletions

# Olist Sentiment Analysis with Natural Language Processing

## About this project

This project works with Olist's Brazilian E-Commerce Public Dataset. Our goal is to use this dataset to develop a classification model that identifies whether a customer review was positive or negative. Furthermore, we want to extract explanations from the model, in this case by applying LIME and SHAP. We provide a Dockerfile and a Poetry configuration file for ease of running and reproducibility.

## Environment Setup

It is always recommended to have a separate Python environment for each project. This project uses `Python 3.11.5`. We walk you through the environment configuration with Poetry and with the highly recommended Docker image. pip and Conda failed to build the project due to unresolved dependency issues with Numba, so their use is not recommended - but feel free to try.

### Docker

We provide a Docker image that runs our training script and lets you interact with the generated files. Running the `docker build` command below builds the Python 3.11 image, installs Poetry, and runs `train.py`, which generates the `.pkl` models.

```bash
docker build -t bravium_heitor .
```

Running the `docker run` command below lets you interact with the image. The `-v` flags mount the container's folders on your machine, so the files it generates are persisted locally.

Inside the image, you may run `poetry run python explainability.py` to run LIME and SHAP and get the results. Other than that, you may play around with the files freely.

```bash
docker run -it --rm \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/explainability:/app/explainability \
  -v $(pwd)/metrics:/app/metrics \
  -v $(pwd)/model:/app/model \
  -v $(pwd)/processed_csvs:/app/processed_csvs \
  bravium_heitor
```

### Poetry

Poetry is our preferred Python package manager and we recommend its use for this project. You should have it installed locally with pipx. There are plenty of [guides](https://www.sarahglasmacher.com/how-to-set-up-poetry-for-python/) available on this topic.

With Poetry installed, just run

```bash
poetry install --no-root
```

and the environment will be fully operational. The order in which we recommend running the code is:

1. Getting the dataset from Kaggle -> `get_kaggle_dataset.py` (sketched below).
2. Following the `data_analysis.ipynb` and `data_cleaning.ipynb` notebooks.
3. Running the `train.py` and `explainability.py` scripts.

However, following that order is not strictly necessary, since we've uploaded our processed .csv files to the `processed_csvs` folder.
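
For reference, a minimal sketch of what a script like `get_kaggle_dataset.py` might do, assuming the official `kaggle` package and the `olistbr/brazilian-ecommerce` dataset slug (both are assumptions; check the actual script):

```python
# Hypothetical sketch -- the real get_kaggle_dataset.py may differ.
# Requires the `kaggle` package and an API token in ~/.kaggle/kaggle.json.
import kaggle

# Download and unzip the Olist dataset into the data/ folder
# (dataset slug "olistbr/brazilian-ecommerce" is an assumption).
kaggle.api.dataset_download_files(
    "olistbr/brazilian-ecommerce", path="data", unzip=True
)
```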

The only file that depends on having a `.pkl` model in the `model/` folder is `explainability.py`. As such, if you're unable to run the `train.py` script but still want to explore the code (or just want to access our model), you can download the pickle files [here](https://drive.google.com/drive/folders/1sQta4E4-mDGpDftF9fItM-u4O9BVzguk?usp=sharing).

## Exploratory Data Analysis (EDA)

During the EDA phase, our main goal is to understand the dataset's features and their relationships with each other. We exclude multiple files and records from the dataset, either because they are not suited for the analysis or because they have missing data, and save a much smaller sample of the dataset for the cleaning stage.
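
As a rough illustration of that filtering, a hypothetical sketch (the file and column names here are assumptions; `data_analysis.ipynb` has the real steps):

```python
# Hypothetical sketch of the filtering described above -- see
# data_analysis.ipynb for the actual files, columns and criteria.
import pandas as pd

reviews = pd.read_csv("data/olist_order_reviews_dataset.csv")

# Keep only reviews that actually contain a text comment
# (column name is an assumption).
reviews = reviews.dropna(subset=["review_comment_message"])

# Persist a smaller sample for the cleaning stage
reviews.to_csv("processed_csvs/reviews_sample.csv", index=False)
```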

## Data Cleaning / Pre-Processing

Using the .csv file resulting from our EDA, we apply essential pre-processing steps at this stage, such as removing trailing whitespace, emojis, and special characters, and stemming.
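
A minimal sketch of this kind of cleaning, assuming NLTK's Portuguese RSLP stemmer (the notebook may use different rules and libraries):

```python
# Hypothetical sketch of the pre-processing steps described above;
# data_cleaning.ipynb is the source of truth for the exact rules.
import re

import nltk
from nltk.stem import RSLPStemmer

nltk.download("rslp")  # Portuguese stemmer data
stemmer = RSLPStemmer()


def clean_review(text: str) -> str:
    text = text.strip().lower()
    # Drop emojis and special characters, keeping letters and spaces
    # (including accented Portuguese characters).
    text = re.sub(r"[^a-zà-ú\s]", " ", text)
    # Stem each remaining token
    return " ".join(stemmer.stem(tok) for tok in text.split())


print(clean_review("Produto excelente, chegou antes do prazo!!! 😊"))
```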

## NLP Model

The model is defined in `train.py`. The goal is to automatically classify reviews as positive or negative based on their text content. At this stage, we first transform the text into numerical features with TF-IDF vectorization, and then train the model.

The model is a Logistic Regression classifier, trained using GridSearchCV to find the best hyperparameters (C, penalty, class_weight). The training and test sets are split 80-20. The model is optimized for F1-score, which balances performance across classes.
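
A condensed sketch of that setup, with a hypothetical hyperparameter grid and an assumed binary 0/1 `sentiment` label column (see `train.py` for the real values):

```python
# Hypothetical sketch of the training setup described above;
# the grid values and the "sentiment" column name are assumptions.
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("processed_csvs/customer_reviews_preprocessed.csv")

# 80-20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df["comments"], df["sentiment"], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

# Grid over C, penalty and class_weight, optimized for F1
# (scoring="f1" assumes binary 0/1 labels)
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    {"C": [0.1, 1, 10], "penalty": ["l1", "l2"], "class_weight": [None, "balanced"]},
    scoring="f1",
)
grid.fit(X_train_vec, y_train)

joblib.dump(grid.best_estimator_, "model/sentiment_model.pkl")
joblib.dump(vectorizer, "model/tfidf_pipeline.pkl")
```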

## Evaluation and Explainability

To evaluate the model, we generate a classification graph showcasing precision, recall, and F1-score per class. We also save a confusion matrix (true vs. predicted labels). Both are saved as .png images under the `metrics/` directory.
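
A minimal sketch of how such figures can be produced with scikit-learn and matplotlib; the output file names are placeholders and `train.py` may do this differently:

```python
# Hypothetical sketch of the evaluation plots described above.
import joblib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Recreate the test split and load the training artifacts
# (the "sentiment" column name is an assumption).
df = pd.read_csv("processed_csvs/customer_reviews_preprocessed.csv")
_, X_test, _, y_test = train_test_split(
    df["comments"], df["sentiment"], test_size=0.2, random_state=42
)
model = joblib.load("model/sentiment_model.pkl")
vectorizer = joblib.load("model/tfidf_pipeline.pkl")

y_pred = model.predict(vectorizer.transform(X_test))

# Per-class precision / recall / F1 as a bar chart
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred)
pd.DataFrame({"precision": prec, "recall": rec, "f1-score": f1}).plot.bar()
plt.savefig("metrics/classification_report.png", dpi=300, bbox_inches="tight")
plt.close()

# Confusion matrix of true vs. predicted labels
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.savefig("metrics/confusion_matrix.png", dpi=300, bbox_inches="tight")
plt.close()
```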

For explainability, we want to know why a review was classified as positive or negative. We use two complementary tools:

### LIME (Local Interpretable Model-agnostic Explanations)

LIME works on individual predictions. For a given review, it identifies the top words that influenced the classification. In the code, the explanation is converted into a matplotlib figure and saved as `lime.png`.

### SHAP (SHapley Additive exPlanations)

SHAP provides a more general view. Instead of only explaining one prediction, it highlights the most influential words across many reviews. The `explainability.py` file loads the trained Logistic Regression model and the TF-IDF pipeline, samples reviews, and generates the LIME and SHAP visualizations in the `explainability/` folder.

explainability.py

Lines changed: 55 additions & 0 deletions
```python
import joblib
import pandas as pd
import lime.lime_text
import shap
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline


def load_model_and_pipeline():
    """Load the trained classifier and the fitted TF-IDF vectorizer."""
    model = joblib.load("model/sentiment_model.pkl")
    vectorizer = joblib.load("model/tfidf_pipeline.pkl")
    return model, vectorizer


def run_lime_example(df, model, vectorizer, sample_idx=0):
    """Explain a single review with LIME and save the figure."""
    # Chain vectorizer and model so LIME can map raw text to probabilities
    pipeline = make_pipeline(vectorizer, model)
    # Class order assumed to match model.classes_
    explainer = lime.lime_text.LimeTextExplainer(class_names=["negative", "positive"])

    sample_text = df["comments"].iloc[sample_idx]

    # Top 10 words that pushed the prediction toward either class
    exp = explainer.explain_instance(
        sample_text, pipeline.predict_proba, num_features=10
    )

    exp.as_pyplot_figure()
    plt.savefig("explainability/lime.png", dpi=300, bbox_inches="tight")
    plt.close()


def run_shap_example(df, model, vectorizer, sample_size=100):
    """Compute SHAP values for a sample of reviews and save a bar plot."""
    X_sample = df["comments"].sample(sample_size, random_state=42).tolist()
    X_transformed = vectorizer.transform(X_sample)

    explainer = shap.Explainer(model, X_transformed)
    shap_values = explainer(X_transformed)

    # Bar plot of the most influential features for the first review
    shap.plots.bar(shap_values[0], max_display=10, show=False)
    plt.savefig("explainability/shap.png", dpi=300, bbox_inches="tight")
    plt.close()


if __name__ == "__main__":
    df = pd.read_csv("processed_csvs/customer_reviews_preprocessed.csv")
    model, vectorizer = load_model_and_pipeline()

    # Run LIME on the first review
    run_lime_example(df, model, vectorizer, sample_idx=0)

    # Run SHAP on a sample of reviews
    run_shap_example(df, model, vectorizer, sample_size=100)
```
