Update chatbot/train.py and chatbot/app.py to improve model performance #200
baranwalayush wants to merge 1 commit into AOSSIE-Org:main from …
Conversation
Walkthrough

The chatbot module is enhanced with NLTK-based NLP standardization (word_tokenize and PorterStemmer) integrated into both training and inference pipelines. Training configuration is externalized into constants, epoch-level loss monitoring is added, and the dataset now returns PyTorch tensors instead of raw numpy arrays for consistency.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
chatbot/train.py (2)
84-91: Type mismatch: `bag_of_words` returns tensors, but code expects numpy arrays.

`bag_of_words` returns a `torch.tensor` (line 32), but these are collected into a list and converted to `np.array` (line 90). This creates a numpy object array of tensors. Later, `__getitem__` (line 111) calls `torch.from_numpy()` on these tensor objects, which will fail because `from_numpy` expects a numpy ndarray, not a PyTorch tensor. Either convert `bag_of_words` output to numpy before appending, or handle tensors directly.

Option 1: Convert to numpy in `bag_of_words` (consistent with original design):

```diff
 def bag_of_words(tokenized_sentence, words):
     sentence_words = [stem(word) for word in tokenized_sentence]
     bag = [1.0 if word in sentence_words else 0.0 for word in words]
-    return torch.tensor(bag, dtype=torch.float32)
+    return np.array(bag, dtype=np.float32)
```

Option 2: Stack tensors directly (avoid numpy conversion):

```diff
-X_train = np.array(X_train)
+X_train = torch.stack(X_train)
 y_train = np.array(y_train)
```

And update `__getitem__`:

```diff
 def __getitem__(self, index):
-    return torch.from_numpy(self.x_data[index]), torch.tensor(self.y_data[index])
+    return self.x_data[index], torch.tensor(self.y_data[index])
```
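As a quick illustration of the failure mode and of both fixes, a minimal standalone sketch (not code from the PR):

```python
import numpy as np
import torch

t = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float32)

# torch.from_numpy accepts only numpy ndarrays; a tensor raises TypeError.
try:
    torch.from_numpy(t)
except TypeError as err:
    print(err)  # e.g. "expected np.ndarray (got Tensor)"

# Option 1 keeps features as numpy until __getitem__ converts them:
row = np.array([1.0, 0.0, 1.0], dtype=np.float32)
print(torch.from_numpy(row))      # tensor([1., 0., 1.])

# Option 2 keeps features as tensors throughout:
print(torch.stack([t, t]).shape)  # torch.Size([2, 3])
```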
117-121: `num_workers=0` does not meet the linked issue requirement.

Issue #198 specifically requested "set DataLoader `num_workers > 0` to enable parallel data fetching" for performance improvement. The current value is `0`, which disables parallel loading.

```diff
 train_loader = DataLoader(dataset=dataset,
                           batch_size=batch_size,
                           shuffle=True,
-                          num_workers=0)
+                          num_workers=2)
```

Note: If Windows compatibility is a concern, consider using a conditional:

```python
import os
num_workers = 0 if os.name == 'nt' else 2
```
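For context, a minimal sketch of how the conditional slots into the loader setup; the `TensorDataset` here is a stand-in for the PR's own dataset class:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the PR uses its own chat dataset class.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 3, (64,)))

# Windows spawns worker processes instead of forking, which is where
# num_workers > 0 most often breaks, hence the os.name check.
num_workers = 0 if os.name == 'nt' else 2

train_loader = DataLoader(dataset=dataset,
                          batch_size=8,
                          shuffle=True,
                          num_workers=num_workers)
```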
🧹 Nitpick comments (2)
chatbot/train.py (2)
8-11: Consider guarding NLTK resource download.

Calling `nltk.download('punkt')` at module import time can cause unexpected network calls in production. Consider wrapping this in a try/except that checks if the resource already exists, or move it to a setup/initialization function.

```diff
 import nltk
 from nltk.stem.porter import PorterStemmer
-nltk.download('punkt')
+try:
+    nltk.data.find('tokenizers/punkt')
+except LookupError:
+    nltk.download('punkt')
```
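(`nltk.data.find` raises `LookupError` when the resource is missing, so the guarded version only touches the network on the first run. Note that newer NLTK releases may additionally require the `punkt_tab` resource for `word_tokenize`.)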
20-32: Code duplication with `app.py`.

The `tokenize`, `stem`, and `bag_of_words` functions are duplicated in `app.py` (lines 14-24). Consider extracting these into a shared module (e.g., `nlp_utils.py`) to ensure consistency and reduce maintenance burden.
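A minimal sketch of what such a shared module could look like; the `nlp_utils.py` name and exact bodies are illustrative, based only on the function signatures referenced in this review:

```python
# nlp_utils.py -- hypothetical shared module; both train.py and app.py
# would then `from nlp_utils import tokenize, stem, bag_of_words`.
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def tokenize(sentence):
    # Split a sentence into word/punctuation tokens.
    return word_tokenize(sentence)

def stem(word):
    # Lowercase and reduce a word to its Porter stem.
    return stemmer.stem(word.lower())

def bag_of_words(tokenized_sentence, words):
    # 1.0 for each vocabulary word present in the (stemmed) sentence.
    sentence_words = [stem(word) for word in tokenized_sentence]
    return np.array([1.0 if w in sentence_words else 0.0 for w in words],
                    dtype=np.float32)
```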
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `chatbot/app.py` (3 hunks)
- `chatbot/train.py` (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
chatbot/app.py (1)
chatbot/train.py (3)
`stem` (24-25), `tokenize` (21-22), `bag_of_words` (27-32)
chatbot/train.py (1)
chatbot/app.py (3)
`stem` (18-19), `tokenize` (15-16), `bag_of_words` (21-24)
🪛 Ruff (0.14.8)
chatbot/app.py
96-96: Possible binding to all interfaces
(S104)
96-96: Use of debug=True in Flask app detected
(S201)
🔇 Additional comments (6)
chatbot/train.py (3)
13-18: Good use of configuration constants.

Extracting `INTENTS_FILE` and `MODEL_SAVE_FILE` into constants improves maintainability. The global stemmer instance aligns with the approach in `app.py`.
132-159: Good implementation of epoch-averaged loss tracking.

The weighted loss accumulation (`loss.item() * words.size(0)`) and division by `len(dataset)` correctly compute the average loss per sample across the epoch, fulfilling the linked issue requirement.
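For reference, the pattern being praised looks roughly like this: a runnable sketch with toy stand-ins, since the PR's own dataset, model, and hyperparameters aren't shown here:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the pattern runs end to end.
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 3, (100,)))
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
num_epochs = 3

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for words, labels in train_loader:
        outputs = model(words)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Weight each batch loss by its size so the final average is
        # per-sample even when the last batch is smaller.
        epoch_loss += loss.item() * words.size(0)

    avg_loss = epoch_loss / len(dataset)
    print(f"Epoch {epoch + 1}/{num_epochs}, avg loss: {avg_loss:.4f}")
```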
161-172: LGTM!

Model saving correctly uses the `MODEL_SAVE_FILE` constant and includes all necessary metadata for inference.

chatbot/app.py (3)
8-24: LGTM - NLP preprocessing aligns with training.

The tokenization and stemming implementation correctly mirrors `train.py`, ensuring consistent preprocessing between training and inference.
47-48: Verify `torch.load` behavior with current PyTorch version.

In PyTorch 2.6+, `torch.load` defaults to `weights_only=True`, which would fail here since the saved data includes non-tensor objects (`all_words`, `tags`). Consider explicitly setting the parameter for forward compatibility.

```diff
 FILE = "data.pth"
-data = torch.load(FILE)
+data = torch.load(FILE, weights_only=False)
```

Alternatively, for better security, use `torch.load(FILE, weights_only=True, map_location=device)` and save metadata separately in a JSON file.
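A sketch of that safer split; the file names and stand-in values are illustrative, based only on the fields this review mentions:

```python
import json
import torch
import torch.nn as nn

# Stand-ins for the PR's model and vocabulary (illustrative only).
model = nn.Linear(4, 2)
all_words, tags = ["hi", "bye"], ["greeting", "goodbye"]
device = torch.device("cpu")

# Save: tensors go through torch.save; plain-Python metadata goes to JSON.
torch.save(model.state_dict(), "model.pth")
with open("metadata.json", "w") as f:
    json.dump({"all_words": all_words, "tags": tags}, f)

# Load: weights_only=True is safe here because only tensors are unpickled.
state_dict = torch.load("model.pth", weights_only=True, map_location=device)
with open("metadata.json") as f:
    meta = json.load(f)
```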
63-93: Chat endpoint logic is correct.

The inference pipeline properly uses the updated NLP functions and handles uncertain predictions gracefully. One minor consideration: returning raw exception messages (`str(e)`) in line 93 could expose internal details. In production, consider logging the full error server-side and returning a generic message to clients.
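A sketch of that pattern; the route name and the `get_response` helper are hypothetical stand-ins for the PR's endpoint:

```python
import logging
from flask import Flask, jsonify, request

app = Flask(__name__)
logger = logging.getLogger(__name__)

def get_response(message):
    # Hypothetical stand-in for the PR's model inference.
    return "..."

@app.route("/chat", methods=["POST"])
def chat():
    try:
        message = request.get_json()["message"]
        return jsonify({"response": get_response(message)})
    except Exception:
        # Full details stay in the server log; the client sees a generic message.
        logger.exception("Error handling /chat request")
        return jsonify({"error": "Internal server error"}), 500
```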
```diff
 if __name__ == '__main__':
-    app.run(host="0.0.0.0", port=5000,debug=True)
+    app.run(host="0.0.0.0", port=5000, debug=True)
```
Development server configuration flagged by static analysis.

`debug=True` and binding to `0.0.0.0` are appropriate for development but pose security risks in production:

- `debug=True` enables the interactive debugger, which can execute arbitrary code
- `0.0.0.0` exposes the service to all network interfaces
Consider using environment variables or a configuration flag:

```diff
+import os
+
 if __name__ == '__main__':
-    app.run(host="0.0.0.0", port=5000, debug=True)
+    debug_mode = os.environ.get('FLASK_DEBUG', 'false').lower() == 'true'
+    app.run(host="0.0.0.0", port=5000, debug=debug_mode)
```

For production, use a WSGI server (e.g., Gunicorn) instead of the Flask development server.
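For example, assuming the Flask instance is named `app` inside `chatbot/app.py`, something like `gunicorn -w 4 -b 127.0.0.1:5000 app:app` (run from the `chatbot/` directory) would serve it without the interactive debugger.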
🧰 Tools
🪛 ast-grep (0.40.0)
[warning] 95-95: Running flask app with host 0.0.0.0 could expose the server publicly.
Context: app.run(host="0.0.0.0", port=5000, debug=True)
Note: [CWE-668]: Exposure of Resource to Wrong Sphere [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid_app_run_with_bad_host-python)
[warning] 95-95: Detected Flask app with debug=True. Do not deploy to production with this flag enabled as it will leak sensitive information. Instead, consider using Flask configuration variables or setting 'debug' using system environment variables.
Context: app.run(host="0.0.0.0", port=5000, debug=True)
Note: [CWE-489] Active Debug Code. [REFERENCES]
- https://labs.detectify.com/2015/10/02/how-patreon-got-hacked-publicly-exposed-werkzeug-debugger/
(debug-enabled-python)
🪛 Ruff (0.14.8)
96-96: Possible binding to all interfaces
(S104)
96-96: Use of debug=True in Flask app detected
(S201)
🤖 Prompt for AI Agents
In chatbot/app.py around lines 95-96, the app is started with debug=True and
host="0.0.0.0", which is unsafe for production; change the startup to read
configuration from environment variables (e.g., FLASK_ENV or APP_DEBUG and
APP_HOST/APP_PORT), default debug to False, and only enable debug and binding to
0.0.0.0 when an explicit development flag is set; for production remove the
built-in server invocation and run under a WSGI server like Gunicorn (document
example command) so the code no longer enables the interactive debugger or
indiscriminately binds all interfaces by default.
```diff
@@ -1,23 +1,34 @@
 from operator import index
```
Remove unused import.

The `index` import from `operator` is never used in this file.

```diff
-from operator import index
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In chatbot/train.py at line 1, the file imports `index` from the standard
library `operator` module but never uses it; remove the unused import line `from
operator import index` to clean up imports and avoid lint warnings, ensuring
there are no other references to `index` elsewhere in the file before committing
the change.
Description

In this PR, I have:

- updated `chatbot/app.py` to work with the improved `chatbot/train.py`

Let me know if there is anything else I need to take care of.
Fixes #198
Summary by CodeRabbit
New Features
Refactor