First of all, create three empty folders: ./Dataset_TLDRHQ, ./ProcessedData and ./Dataset_splitted.
Download the annotations from the official Google Drive folder and extract them into ./Dataset_TLDRHQ, resulting in a folder tree like this:
project_folder
└───Dataset_TLDRHQ
    ├───dataset-m0
    ├───dataset-m1
    ├───dataset-m2
    ├───dataset-m2021
    ├───dataset-m3
    ├───dataset-m4
    └───dataset-m6
Run the PP_cleaning.py script, which performs data cleaning (removing duplicates), splits the dataset into training, validation and test sets (further splitting the training set so that it is easier to manage) and then saves the result as .json files in ./Dataset_splitted. You get a directory tree like this:
project_folder
└───Dataset_splitted
    ├───test.json
    ├───train_1.json
    ├───train_10.json
    ├───train_2.json
    ├───train_3.json
    ├───train_4.json
    ├───train_5.json
    ├───train_6.json
    ├───train_7.json
    ├───train_8.json
    ├───train_9.json
    └───val.json
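A minimal sketch of what this cleaning and splitting step does, assuming the downloaded annotations are JSON-lines files with a document field holding the post text (the file layout, column names and split ratios below are assumptions, not the exact ones used by PP_cleaning.py):

```python
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

root = Path("./Dataset_TLDRHQ")
out = Path("./Dataset_splitted")
out.mkdir(exist_ok=True)

# Load every JSON-lines file found in the dataset-m* folders into one frame.
frames = [pd.read_json(f, lines=True) for f in root.glob("dataset-m*/*.json")]
data = pd.concat(frames, ignore_index=True)

# Data cleaning: drop duplicated posts.
data = data.drop_duplicates(subset=["document"]).reset_index(drop=True)

# Split into train / validation / test (90 / 5 / 5 here, purely illustrative).
train, rest = train_test_split(data, test_size=0.10, random_state=42)
val, test = train_test_split(rest, test_size=0.50, random_state=42)

# Split the training set into 10 chunks so that each file stays manageable.
for i, idx in enumerate(np.array_split(train.index, 10), start=1):
    train.loc[idx].to_json(out / f"train_{i}.json", orient="records")
val.to_json(out / "val.json", orient="records")
test.to_json(out / "test.json", orient="records")
```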
Run the PP_normalizing.py script, which performs sentence splitting, text normalization, tokenization, stop-word removal, lemmatization and POS tagging on the document field containing the Reddit posts, and then saves the result as .json files in ./ProcessedData. You get a directory tree like this:
project_folder
└───ProcessedData
    ├───test.json
    ├───train_1.json
    ├───train_10.json
    ├───train_2.json
    ├───train_3.json
    ├───train_4.json
    ├───train_5.json
    ├───train_6.json
    ├───train_7.json
    ├───train_8.json
    ├───train_9.json
    └───val.json
The text normalization operations performed include, in order:
- Sentence Splitting
- HTML Tags and Entities Removal
- Extra White Spaces Removal
- URLs Removal
- Emoji Removal
- User Age Processing (e.g. 25m becomes 25 male)
- Numbers Processing
- Control Characters Removal
- Case Folding
- Repeated Characters Processing (e.g. reallllly becomes really)
- Fix and Expand English Contractions
- Special Characters and Punctuation Removal
- Tokenization (Uni-Grams)
- Stop-Words and 1-Character Tokens Removal
- Lemmatization and POS Tagging
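A hedged sketch covering a subset of these operations with NLTK and plain regular expressions (the patterns and resources are illustrative; the actual PP_normalizing.py may use different tools):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Resource names differ across NLTK versions; unknown names are simply skipped.
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize(document: str) -> list:
    processed = []
    for sent in nltk.sent_tokenize(document):                              # sentence splitting
        sent = re.sub(r"<[^>]+>", " ", sent)                               # HTML tags removal
        sent = re.sub(r"https?://\S+", " ", sent)                          # URLs removal
        sent = re.sub(r"\b(\d{1,2})\s*m\b", r"\1 male", sent, flags=re.I)  # "25m" -> "25 male"
        sent = sent.lower()                                                # case folding
        sent = re.sub(r"(.)\1{2,}", r"\1\1", sent)                         # "reallllly" -> "really"
        sent = re.sub(r"[^a-z0-9\s]", " ", sent)                           # special chars / punctuation
        sent = re.sub(r"\s+", " ", sent).strip()                           # extra white spaces
        tokens = nltk.word_tokenize(sent)                                  # tokenization (uni-grams)
        tokens = [t for t in tokens if t not in stop_words and len(t) > 1] # stop-words, 1-char tokens
        lemmas = [lemmatizer.lemmatize(t) for t in tokens]                 # lemmatization
        processed.append(nltk.pos_tag(lemmas))                             # POS tagging
    return processed

print(normalize("I'm 25m and my landlord is reallllly annoying! See https://example.com <br>"))
```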
Run the notebook TS_Preprocessing for summarization.ipynb in order to:
- remove documents without a summary
- remove documents with a single sentence
- split the train dataset into smaller files
You get a directory tree like this:
project_folder
└───Processed Data For Summarization
    ├───test_0.json
    ├───test_1.json
    ├───test_2.json
    ├───train_1_0.json
    ├───train_1_1.json
    ├───train_1_2.json
    ├───train_2_0.json
    ├───train_2_1.json
    ├───train_2_2.json
    ├─── ...
    ├───train_8_0.json
    ├───train_8_1.json
    ├───train_8_2.json
    ├───train_9_0.json
    ├───val_0.json
    ├───val_1.json
    └───val_2.json
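A sketch of the filtering and splitting performed here, assuming each record keeps a summary string and a document stored as a list of processed sentences (the field names and the number of parts per file are assumptions):

```python
import pandas as pd
from pathlib import Path

src = Path("./ProcessedData")
dst = Path("./Processed Data For Summarization")
dst.mkdir(exist_ok=True)

for f in sorted(src.glob("*.json")):
    df = pd.read_json(f)
    df = df[df["summary"].fillna("").str.strip().astype(bool)]  # drop documents without a summary
    df = df[df["document"].str.len() > 1]                       # drop single-sentence documents
    # Split each input file into smaller parts (three here, as in the tree above).
    n_parts, size = 3, -(-len(df) // 3)                         # ceiling division
    for i in range(n_parts):
        part = df.iloc[i * size:(i + 1) * size]
        if not part.empty:
            part.to_json(dst / f"{f.stem}_{i}.json", orient="records")
```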
Run TS_featureMatrixGeneration.py to obtain the feature matrices (sentences x features). You get a directory tree like this:
project_folder
└───Feature Matrices
    ├───test_0.csv
    ├───test_1.csv
    ├───test_2.csv
    ├───train_1_0.csv
    ├───train_1_1.csv
    ├───train_1_2.csv
    ├───train_2_0.csv
    ├───train_2_1.csv
    ├───train_2_2.csv
    ├─── ...
    ├───train_8_0.csv
    ├───train_8_1.csv
    ├───train_8_2.csv
    ├───train_9_0.csv
    ├───val_0.csv
    ├───val_1.csv
    └───val_2.csv
Run the notebook TS_featureMatrixGeneration2.ipynb to join the chunks of the train, val and test sets into single files. You get a directory tree like this:
project_folder
└───Feature Matrices
    ├───test_0.csv
    ├───test_1.csv
    ├───test_2.csv
    ├───train_1_0.csv
    ├───train_1_1.csv
    ├───train_1_2.csv
    ├───train_2_0.csv
    ├───train_2_1.csv
    ├───train_2_2.csv
    ├─── ...
    ├───train_8_0.csv
    ├───train_8_1.csv
    ├───train_8_2.csv
    ├───train_9_0.csv
    ├───val_0.csv
    ├───val_1.csv
    ├───val_2.csv
    ├───test.csv
    ├───train.csv
    └───val.csv
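A possible way the per-chunk CSVs are concatenated into train.csv, val.csv and test.csv (a sketch; the notebook may proceed differently):

```python
import pandas as pd
from pathlib import Path

matrices = Path("./Feature Matrices")
for split in ("train", "val", "test"):
    # Collect all chunk files of this split, e.g. train_1_0.csv, train_1_1.csv, ...
    parts = sorted(matrices.glob(f"{split}_*.csv"))
    joined = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
    joined.to_csv(matrices / f"{split}.csv", index=False)
```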
The features generated by TS_featureMatrixGeneration.py are the following:
- sentence_relative_positions
- sentence_similarity_score_1_gram
- word_in_sentence_relative
- NOUN_tag_ratio
- VERB_tag_ratio
- ADJ_tag_ratio
- ADV_tag_ratio
- TF_ISF
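For illustration, one common way to compute two of these features for a single document, with the document given as a list of tokenized sentences (the exact formulas used in TS_featureMatrixGeneration.py may differ):

```python
import numpy as np

def sentence_relative_positions(n_sentences: int) -> np.ndarray:
    """Position of each sentence in the document, scaled to [0, 1]."""
    return np.arange(n_sentences) / max(n_sentences - 1, 1)

def tf_isf(doc_tokens: list) -> np.ndarray:
    """Mean TF-ISF per sentence: term frequency weighted by the inverse of
    the number of sentences that contain the term (log-scaled)."""
    n = len(doc_tokens)
    sentence_sets = [set(s) for s in doc_tokens]
    scores = []
    for sent in doc_tokens:
        if not sent:
            scores.append(0.0)
            continue
        vals = []
        for term in sent:
            tf = sent.count(term) / len(sent)
            sf = sum(term in s for s in sentence_sets)
            vals.append(tf * np.log(n / sf))
        scores.append(float(np.mean(vals)))
    return np.array(scores)

doc = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
print(sentence_relative_positions(len(doc)))   # [0.  0.5 1. ]
print(tf_isf(doc))
```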
Run the notebook TS_featureMatrixUndersampling.ipynb in order to perform CUR undersampling on both the train and validation data sets. You get a directory tree like this:
project_folder
└───Undersampled Data
    ├───trainAndValMinorityClass.csv
    └───trainAndValMajorityClassUndersampled.csv
The majority and minority classes are saved separately because CUR undersampling is applied only to the majority class.
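One way to approximate a CUR-based selection is to keep majority-class rows sampled according to their SVD leverage scores; the sketch below follows that idea, with the label column name, the rank and the number of rows to keep all being assumptions rather than the notebook's actual choices:

```python
import numpy as np
import pandas as pd
from pathlib import Path

def cur_row_selection(X: np.ndarray, n_keep: int, rank: int = 5, seed: int = 42) -> np.ndarray:
    """Pick n_keep row indices with probability proportional to their leverage scores."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    leverage = (U[:, :rank] ** 2).sum(axis=1)
    return rng.choice(len(X), size=n_keep, replace=False, p=leverage / leverage.sum())

out = Path("./Undersampled Data")
out.mkdir(exist_ok=True)

# The real notebook combines train.csv and val.csv; train.csv alone keeps the sketch short.
data = pd.read_csv("./Feature Matrices/train.csv")
majority, minority = data[data["label"] == 0], data[data["label"] == 1]

keep = cur_row_selection(majority.drop(columns="label").to_numpy(), n_keep=2 * len(minority))
majority.iloc[keep].to_csv(out / "trainAndValMajorityClassUndersampled.csv", index=False)
minority.to_csv(out / "trainAndValMinorityClass.csv", index=False)
```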
Run the notebook TS_featureMatrixAnalysis.ipynb to perform ENN (Edited Nearest Neighbours) undersampling. You get a directory tree like this:
project_folder
└───Undersampled Data
    ├───trainAndValUndersampledENN3.csv
    ├───trainAndValMinorityClass.csv
    └───trainAndValMajorityClassUndersampled.csv
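ENN undersampling is available off the shelf in imbalanced-learn; a sketch under the same label-column assumption as above:

```python
import pandas as pd
from imblearn.under_sampling import EditedNearestNeighbours

minority = pd.read_csv("./Undersampled Data/trainAndValMinorityClass.csv")
majority = pd.read_csv("./Undersampled Data/trainAndValMajorityClassUndersampled.csv")
data = pd.concat([majority, minority], ignore_index=True)

X, y = data.drop(columns="label"), data["label"]      # "label" column name is an assumption
enn = EditedNearestNeighbours(n_neighbors=3)          # k = 3, matching the "ENN3" file name
X_res, y_res = enn.fit_resample(X, y)

resampled = pd.DataFrame(X_res, columns=X.columns).assign(label=list(y_res))
resampled.to_csv("./Undersampled Data/trainAndValUndersampledENN3.csv", index=False)
```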
Run the notebook TS_featureMatrixAnalysis.ipynb to perform a RandomizedSearchCV over the following models:
- RandomForestClassifier
- LogisticRegression
- HistGradientBoostingClassifier
with a few possible parameter configurations.
Then, evaluate the resulting best model on the test set with respect to:
- ROC curve
- Recall
- Precision
- Accuracy
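A sketch of this model-selection and evaluation step with scikit-learn (the parameter grids, file names and the label column are illustrative):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV

train = pd.read_csv("./Undersampled Data/trainAndValUndersampledENN3.csv")
test = pd.read_csv("./Feature Matrices/test.csv")
X_train, y_train = train.drop(columns="label"), train["label"]   # "label" name is an assumption
X_test, y_test = test.drop(columns="label"), test["label"]

searches = {
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}),
    "lr": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0, 10.0]}),
    "hgb": (HistGradientBoostingClassifier(), {"learning_rate": [0.05, 0.1], "max_iter": [100, 300]}),
}

best_model, best_score = None, -1.0
for name, (model, params) in searches.items():
    search = RandomizedSearchCV(model, params, n_iter=4, cv=3, scoring="roc_auc", random_state=42)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_model, best_score = search.best_estimator_, search.best_score_

# Evaluate the best model on the held-out test set.
pred = best_model.predict(X_test)
proba = best_model.predict_proba(X_test)[:, 1]
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))
```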
Run the notebook TS_featureMatrixAnalysis.ipynb to perform MMR (Maximal Marginal Relevance) and obtain an extractive summary for each document in the test set.
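MMR selects sentences that are relevant to the document while penalizing redundancy with the sentences already selected. A minimal TF-IDF based sketch (the similarity measure and the lambda value are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_summary(sentences: list, n_select: int = 3, lam: float = 0.7) -> list:
    # Relevance = similarity of each sentence to the whole document;
    # redundancy = similarity to sentences already picked.
    tfidf = TfidfVectorizer().fit_transform(sentences + [" ".join(sentences)])
    sent_vecs, doc_vec = tfidf[:-1], tfidf[-1]
    relevance = cosine_similarity(sent_vecs, doc_vec).ravel()
    pairwise = cosine_similarity(sent_vecs)

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < n_select:
        def mmr_score(i):
            redundancy = max(pairwise[i][j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]

print(mmr_summary([
    "my landlord raised the rent again",
    "the rent increase was not in the contract",
    "i love my cat",
], n_select=2))
```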
Run the notebook TS_featureMatrixAnalysis.ipynb to measure the quality of the summaries by means of:
- Rouge1
- Rouge2
- RougeL
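A small example of how such scores can be computed, here with the rouge-score package (the notebook may use a different ROUGE implementation):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "landlord raised rent without notice"           # gold TLDR (illustrative)
generated = "my landlord raised the rent again"             # extractive summary (illustrative)
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")
```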
Run the TM_Preprocessing for topic modeling.ipynb notebook to process the dataset and extract only the data that is useful for topic modeling. The output is saved here:
project_folder
└───processed_dataset
    └───test.json
Run the TM_topic_modeling.ipynb notebook, which performs LDA (with a grid search over the hyper-parameters) and LSA. The notebook saves 9 CSV files, 3 for LSA and 6 for LDA (one set for the UMass coherence measure and one for CV), containing the document-topic matrix, the topic-term matrix and a table with the top terms per topic. You get a directory tree like this:
project_folder
└───Results_topic_modeling
    ├───lda_doc_topic.csv
    ├───lda_doc_topic_CV.csv
    ├───lda_top_terms.csv
    ├───lda_top_terms_CV.csv
    ├───lda_topic_term.csv
    ├───lda_topic_term_cv.csv
    ├───lsa_doc_topic.csv
    ├───lsa_top_terms.csv
    └───lsa_topic_term.csv
It also saves images showing the number of words per document and a word cloud in
project_folder
└───Images
It also saves the hyper-parameter grid-search results for the UMass and CV coherence measures in
project_folder
└───Hyperparameters
    ├───tuning.csv
    └───tuning_CV.csv
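A hedged sketch of the LDA/LSA modeling step using gensim on a toy corpus (the library choice, the hyper-parameter grid and the toy texts are assumptions; only the coherence-driven model selection idea is shown, and saving the CSV files is omitted):

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel, LsiModel

texts = [["rent", "landlord", "contract"], ["cat", "vet", "food"], ["rent", "deposit", "landlord"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Grid search over the number of topics, keeping the LDA model with the best UMass coherence.
best_lda, best_umass = None, float("-inf")
for num_topics in (2, 3):
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, random_state=42)
    umass = CoherenceModel(model=lda, corpus=corpus, coherence="u_mass").get_coherence()
    if umass > best_umass:
        best_lda, best_umass = lda, umass

# CV coherence needs the tokenized texts rather than the bag-of-words corpus.
cv = CoherenceModel(model=best_lda, texts=texts, dictionary=dictionary, coherence="c_v").get_coherence()

lsa = LsiModel(corpus, num_topics=2, id2word=dictionary)

print(best_lda.print_topics())                                  # topic-term weights (top terms per topic)
print([best_lda.get_document_topics(bow) for bow in corpus])    # document-topic distributions
print(lsa.print_topics())
```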