This project focuses on classifying text data using Natural Language Processing (NLP) techniques and machine learning models. It explores the effectiveness of TF-IDF and CountVectorizer for feature extraction and utilizes Gradient Boosting as the primary classifier. Model optimization is conducted using GridSearchCV with k-fold cross-validation.
- Preprocess and vectorize text data
- Compare feature extraction methods (TF-IDF vs CountVectorizer)
- Train and optimize a Gradient Boosting Classifier
- Evaluate model performance using cross-validation
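As a first illustration of the preprocessing objective, the sketch below lowercases, tokenizes, and removes English stopwords with nltk. The `preprocess` helper and the sample sentence are illustrative only, not code taken from this repository.

```python
# Illustrative preprocessing sketch (not the project's exact code):
# lowercase, tokenize with nltk, and drop English stopwords.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer data (older nltk releases)
nltk.download("punkt_tab", quiet=True)  # tokenizer data (newer nltk releases)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lowercase, tokenize, and remove stopwords from one document."""
    tokens = word_tokenize(text.lower())
    kept = [tok for tok in tokens if tok.isalpha() and tok not in STOP_WORDS]
    return " ".join(kept)

print(preprocess("The quick brown fox jumps over the lazy dog."))
# -> quick brown fox jumps lazy dog
```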
- Text preprocessing: tokenization, stopword removal, lowercasing
- Feature extraction using:
  - TF-IDF Vectorizer
  - CountVectorizer
- Model training using:
  - GradientBoostingClassifier
- Hyperparameter tuning using:
  - GridSearchCV
- Model evaluation using:
  - k-Fold Cross-Validation
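The sketch below shows one way the steps above can be wired together in scikit-learn: a `Pipeline` whose vectorizer step is swapped between TfidfVectorizer and CountVectorizer by GridSearchCV, with a GradientBoostingClassifier tuned over a small grid under 5-fold cross-validation. The dataset (`fetch_20newsgroups`), the grid values, and `cv=5` are illustrative assumptions rather than settings from this project.

```python
# Sketch: compare TF-IDF vs CountVectorizer and tune a GradientBoostingClassifier
# with GridSearchCV under 5-fold cross-validation. The dataset, grid values, and
# cv=5 are illustrative assumptions, not settings taken from this project.
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Placeholder corpus; substitute the project's own text data and labels.
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X, y = data.data, data.target

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),            # swapped by the grid below
    ("classifier", GradientBoostingClassifier(random_state=42)),
])

param_grid = {
    # Try both feature extraction methods inside the same search.
    "vectorizer": [TfidfVectorizer(stop_words="english"),
                   CountVectorizer(stop_words="english")],
    "classifier__n_estimators": [100, 200],
    "classifier__learning_rate": [0.05, 0.1],
    "classifier__max_depth": [2, 3],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated macro F1:", round(search.best_score_, 3))
```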
- Python 3.x
- scikit-learn
- pandas
- numpy
- matplotlib / seaborn (for visualization)
- nltk (optional for preprocessing)
Evaluation metrics used:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
- Cross-validation scores
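A minimal sketch of how these metrics can be computed with scikit-learn is shown below. It assumes `X`, `y`, and the fitted `search` object from the tuning sketch above; the split ratio and the seaborn heatmap styling are illustrative choices, not settings from this project.

```python
# Sketch of the evaluation step: accuracy, precision, recall, F1, confusion
# matrix, and k-fold cross-validation scores. Assumes X, y, and the fitted
# `search` object from the tuning sketch above; the split ratio is illustrative.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

best_model = search.best_estimator_        # tuned pipeline from GridSearchCV
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class

# Confusion matrix, visualized as a heatmap with seaborn.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()

# k-fold cross-validation scores for the tuned pipeline.
cv_scores = cross_val_score(best_model, X, y, cv=5, scoring="accuracy")
print("Cross-validation scores:", cv_scores, "mean:", cv_scores.mean())
```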
- Clone the repository:
  `git clone https://github.com/yourusername/nlp-text-classification.git`
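- Install the dependencies listed above: `pip install scikit-learn pandas numpy matplotlib seaborn nltk`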