Add TF-IDF + LinearSVC TweetEval sentiment example tuned with Optuna (with smoke test & README) #333
This example adds a short, self-contained demonstration of using Optuna to tune a TF-IDF + LinearSVC pipeline for sentiment classification on the TweetEval sentiment dataset (three labels: negative, neutral, positive).
What it does
Loads the TweetEval sentiment dataset using the datasets library.
Tunes both the TF-IDF vectorizer (feature count, n-gram range, etc.) and LinearSVC parameters (C, loss, class_weight) using Optuna.
Scores each trial by macro-F1 on the validation split; the Optuna objective returns 1 - macro-F1, which the study minimizes.
Retrains the best configuration on train + validation and prints a test report.
Files added
examples/sklearn/svm_tfidf_tweeteval_sentiment.py – main script
examples/sklearn/svm_tfidf_tweeteval_sentiment.md – short usage notes
tests/test_svm_tfidf_tweeteval.py – quick smoke test for CI
How to run
python examples/sklearn/svm_tfidf_tweeteval_sentiment.py --n-trials 20 --max-train 20000
pytest -q # optional quick test
Notes
Keeps runtime light by allowing the --max-train argument to limit samples.
Demonstrates how Optuna can search the vectorizer and SVM hyperparameters jointly in a single study.
No external dependencies beyond datasets, scikit-learn, and optuna.
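For reference, one way the --max-train cap could be wired up is sketched below. The flag names --n-trials and --max-train come from the command above; the defaults, the `parse_args` and `subsample` helpers, and the fixed seed are assumptions for illustration and may differ from the actual script.

```python
import argparse
import random


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Tune TF-IDF + LinearSVC with Optuna (sketch)."
    )
    # Defaults here are assumed, not taken from the PR.
    parser.add_argument("--n-trials", type=int, default=20,
                        help="number of Optuna trials")
    parser.add_argument("--max-train", type=int, default=None,
                        help="cap on training samples; omit to use all")
    return parser.parse_args(argv)


def subsample(texts, labels, max_train, seed=0):
    """Return at most max_train (text, label) pairs, sampled reproducibly."""
    if max_train is None or max_train >= len(texts):
        return list(texts), list(labels)
    rng = random.Random(seed)
    # Sample indices so texts and labels stay aligned; sort to keep original order.
    idx = sorted(rng.sample(range(len(texts)), max_train))
    return [texts[i] for i in idx], [labels[i] for i in idx]


args = parse_args(["--n-trials", "5", "--max-train", "3"])
texts, labels = subsample(
    ["a", "b", "c", "d", "e", "f"], [0, 1, 2, 0, 1, 2], args.max_train
)
```

Sampling with a fixed seed keeps runs comparable across trials; a stratified sample would preserve the label balance better on a skewed dataset.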