For any questions or feedback:
- Email: [email protected]
- Buy me a coffee: https://www.paypal.com/donate/?hosted_button_id=5URJR262Y77BQ
This project provides a Python-based solution for two main use cases:
- Detect Dangerous Content: flag harmful or suspicious content in posts, such as the following (a minimal filtering sketch is shown after this list):
  - Dangerous keywords (e.g., "attack", "violence").
  - Suspicious URLs from predefined blacklisted domains.
  - Excessive special characters or emojis.
- Text Classification using Machine Learning:
  - Apply various machine learning algorithms to classify text data.
  - Compare two vectorization techniques: TF-IDF and Bag of Words.
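A minimal sketch of what the dangerous-content filter could look like; the keyword list, blacklisted domains, and the 30% special-character threshold below are illustrative assumptions, not the exact values used in classification_X.py:

```python
import re

# Illustrative values; the real lists are defined in classification_X.py.
DANGEROUS_KEYWORDS = {"attack", "violence", "bomb", "hacking", "danger"}
BLACKLISTED_DOMAINS = {"malicious-site.com", "phishing-example.net"}

def is_dangerous(post: str) -> bool:
    """Flag a post containing dangerous keywords, blacklisted URLs,
    or an excessive share of special characters/emojis."""
    text = post.lower()

    # 1. Dangerous keywords
    if any(keyword in text for keyword in DANGEROUS_KEYWORDS):
        return True

    # 2. Suspicious URLs pointing at blacklisted domains
    for host in re.findall(r"https?://([^/\s]+)", text):
        if any(domain in host for domain in BLACKLISTED_DOMAINS):
            return True

    # 3. Excessive special characters or emojis (30% is an arbitrary threshold)
    specials = sum(1 for ch in post if not ch.isalnum() and not ch.isspace())
    return len(post) > 0 and specials / len(post) > 0.3

print(is_dangerous("Visit https://phishing-example.net now!!!"))  # True
```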
- Content Filtering: Preprocess posts and flag those containing dangerous content or suspicious patterns.
- Text Vectorization: Transform textual data into numerical representations using (see the sketch after this list):
  - Bag of Words
  - TF-IDF
- Machine Learning Models:
  - Logistic Regression
  - K-Nearest Neighbors (KNN)
  - Support Vector Machines (SVM)
  - Naive Bayes
  - Multi-Layer Perceptron (MLP)
  - Random Forest
  - Gradient Boosting
  - XGBoost
- Model Evaluation: Automatically computes metrics such as:
  - Accuracy
  - Precision
  - Recall
  - F1-Score
- Customizable Pipeline: Add keywords, blacklisted domains, or new ML models with ease.
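To make the two vectorization options concrete, here is a brief scikit-learn sketch; the sample posts are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

posts = ["this is a harmless post", "this post plans an attack"]  # toy examples

# Bag of Words: raw term counts per document
bow = CountVectorizer()
X_bow = bow.fit_transform(posts)

# TF-IDF: term counts reweighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(posts)

print(bow.get_feature_names_out())  # shared vocabulary
print(X_bow.toarray())              # integer counts
print(X_tfidf.toarray().round(2))   # weighted values
```

Both techniques produce a document-term matrix that any of the models listed above can consume.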
```
classification_X/
├── data_x_posts.json      # Example dataset with posts
├── classification_X.py    # Main script for filtering and ML classification
├── README.md              # Project documentation
└── requirements.txt       # Python dependencies
```
```bash
git clone https://github.com/your_username/classification_X.git
cd classification_X
pip install -r requirements.txt
python classification_X.py
```
- Python Version: Python 3.7 or higher.
- Libraries: The script requires several Python libraries. Install them with `pip install -r requirements.txt`.
- Metrics for each model.
- A table summarizing the results.
- Visualizations: bar plots comparing model performance.
The project evaluates each of the algorithms listed above with two vectorization techniques (a comparison sketch follows this list):
- TF-IDF
- Bag of Words
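A rough sketch of how this comparison could be wired up with scikit-learn; the toy posts, labels, and the subset of models shown here are illustrative, while the full pipeline lives in classification_X.py:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only; the real posts come from data_x_posts.json.
texts = ["peaceful morning walk", "plans for a violent attack", "lovely cat photo",
         "bomb threat downtown", "weekend hiking trip", "hacking a bank account",
         "new recipe for dinner", "danger in the streets"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 1 = dangerous

vectorizers = {"Bag of Words": CountVectorizer(), "TF-IDF": TfidfVectorizer()}
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

for vec_name, vectorizer in vectorizers.items():
    Xtr = vectorizer.fit_transform(X_train)   # fit vocabulary on training data only
    Xte = vectorizer.transform(X_test)
    for model_name, model in models.items():
        model.fit(Xtr, y_train)
        pred = model.predict(Xte)
        acc = accuracy_score(y_test, pred)
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_test, pred, average="weighted", zero_division=0)
        print(f"{vec_name:12s} | {model_name:20s} | "
              f"acc={acc:.2f} prec={prec:.2f} rec={rec:.2f} f1={f1:.2f}")
```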
To add or change dangerous keywords, modify the contains_dangerous_keywords function in classification_X.py:
```python
dangerous_keywords = ['attack', 'violence', 'bomb', 'hacking', 'danger']
```
To add blacklisted domains, update the contains_suspicious_links function; you can also match URLs with regular expressions.
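For instance, a regex-based version of that check might look like this; the blacklisted domains and the extra raw-IP/URL-shortener patterns are illustrative assumptions, not the current implementation:

```python
import re

# Illustrative blacklist; extend it with the domains you want to block.
blacklisted_domains = ["malicious-site.com", "phishing-example.net"]

def contains_suspicious_links(post: str) -> bool:
    """Flag posts linking to blacklisted domains, raw IP addresses, or URL shorteners."""
    text = post.lower()
    hosts = re.findall(r"https?://([^/\s]+)", text)
    if any(bad in host for host in hosts for bad in blacklisted_domains):
        return True
    # Raw-IP URLs and common link shorteners are often worth flagging as well.
    return bool(re.search(r"https?://(\d{1,3}(?:\.\d{1,3}){3}|bit\.ly|tinyurl\.com)", text))
```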
To add a new ML model:
- Define the model in the relevant section of the script (see the sketch below).
- Update the evaluation pipeline to include the new model.
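Assuming the script keeps its models in a dictionary that the evaluation loop iterates over (an assumption about its internal structure), those two steps could look roughly like this:

```python
from sklearn.tree import DecisionTreeClassifier  # example model to add

# Hypothetical registry mirroring how classification_X.py might store its models.
models = {
    # ... existing models ...
}

# Step 1: define the new model.
models["Decision Tree"] = DecisionTreeClassifier(max_depth=10, random_state=42)

# Step 2: if the evaluation pipeline iterates over `models`, the new entry is
# picked up automatically and its accuracy, precision, recall, and F1-score
# are reported alongside the existing models.
```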
Contributions are welcome! If you encounter any issues or have suggestions for improvements, please:
- Open an issue on GitHub.
- Submit a pull request with detailed changes.
To further enhance the project, consider adding the following feature:
Generic Program or API with Endpoints ✅
Develop a generic program or RESTful API with endpoints to:
- Compare different machine learning models. ✅
- Switch between TF-IDF and Bag of Words vectorization techniques. ✅
- Serve predictions for new text data. ✅
Example API Endpoints:
- `GET /models`: Retrieve a list of available ML models and their metrics. ✅
- `POST /predict`: Accept a text input and return predictions from a specified model and vector type. ✅
- `GET /compare`: Generate a comparison chart of model performances for both vectorization techniques. ✅
This addition would make the project highly extensible and allow users to interact programmatically with the machine learning pipeline.
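As a starting point, here is a minimal FastAPI sketch of those endpoints; the framework choice, the MODEL_METRICS registry, and the predict_label helper are assumptions rather than existing code:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="classification_X API")

# Hypothetical registry that the training pipeline would populate with real metrics.
MODEL_METRICS = {
    "Logistic Regression": {"tfidf": {"accuracy": 0.0}, "bow": {"accuracy": 0.0}},
}

def predict_label(text: str, classifier: str, vectorizer: str) -> int:
    """Placeholder: the real version would call the trained model/vectorizer pair."""
    return 0

class PredictRequest(BaseModel):
    text: str
    classifier: str = "Logistic Regression"
    vectorizer: str = "tfidf"  # or "bow"

@app.get("/models")
def list_models():
    """GET /models: available models and their stored metrics."""
    return MODEL_METRICS

@app.post("/predict")
def predict(req: PredictRequest):
    """POST /predict: classify a text with the requested model and vectorizer."""
    if req.classifier not in MODEL_METRICS:
        raise HTTPException(status_code=404, detail="Unknown model")
    label = predict_label(req.text, req.classifier, req.vectorizer)
    return {"model": req.classifier, "vectorizer": req.vectorizer, "label": label}

@app.get("/compare")
def compare():
    """GET /compare: metrics for every model/vectorizer pair (chart built client-side)."""
    return MODEL_METRICS
```

Run it with `uvicorn api:app --reload` (assuming the sketch is saved as api.py).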
This project is licensed under the Apache License 2.0. See the LICENSE file for full details.