Extractive Text Summarization (TS) and Topic Modeling (TM) over Reddit Posts

TLDRHQ: Data and Text Pre-processing (PP)

Step 0. Prepare Folders

First of all, create three empty folders: ./Dataset_TLDRHQ, ./ProcessedData and ./Dataset_splitted.

Step 1. Download and extract the dataset

Download the dataset from the official Google Drive folder and extract it into ./Dataset_TLDRHQ, resulting in a folder tree like this:

project_folder
└───Dataset_TLDRHQ
    ├───dataset-m0
    ├───dataset-m1
    ├───dataset-m2
    ├───dataset-m2021
    ├───dataset-m3
    ├───dataset-m4
    └───dataset-m6

Step 2. Perform data cleaning and splitting of the dataset

Run the PP_cleaning.py script, which performs data cleaning (removing duplicates), splits the dataset into training, validation and test sets (partitioning the training set into smaller chunks so that it is easier to manage), and saves the result as .json files in ./Dataset_splitted. You get a directory tree like this:

project_folder
└───Dataset_splitted
    ├───test.json
    ├───train_1.json
    ├───train_10.json
    ├───train_2.json
    ├───train_3.json
    ├───train_4.json
    ├───train_5.json
    ├───train_6.json
    ├───train_7.json
    ├───train_8.json
    ├───train_9.json
    └───val.json
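
For reference, a minimal sketch of this kind of cleaning and splitting, assuming the data fits in a pandas DataFrame; the file paths, column names and split proportions below are illustrative assumptions, not the script's exact logic:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load one raw file (illustrative path; TLDRHQ ships as JSON-lines files).
df = pd.read_json("Dataset_TLDRHQ/dataset-m1/train.json", lines=True)

# Data cleaning: remove duplicate posts (assumed "document" column).
df = df.drop_duplicates(subset=["document"])

# Split into training, validation and test sets (illustrative proportions).
train, test = train_test_split(df, test_size=0.1, random_state=42)
train, val = train_test_split(train, test_size=0.1, random_state=42)

# Save the training set in 10 chunks so it is easier to manage.
for i, idx in enumerate(np.array_split(train.index, 10), start=1):
    train.loc[idx].to_json(f"Dataset_splitted/train_{i}.json", orient="records")
val.to_json("Dataset_splitted/val.json", orient="records")
test.to_json("Dataset_splitted/test.json", orient="records")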

Step 3. Perform text pre-processing on the dataset

Run the PP_normalizing.py script, which performs sentence splitting, text normalization, tokenization, stop-word removal, lemmatization and POS tagging on the document variable, which contains the Reddit posts, and saves the result as .json files in ./ProcessedData. You get a directory tree like this:

project_folder
└───ProcessedData
    ├───test.json
    ├───train_1.json
    ├───train_10.json
    ├───train_2.json
    ├───train_3.json
    ├───train_4.json
    ├───train_5.json
    ├───train_6.json
    ├───train_7.json
    ├───train_8.json
    ├───train_9.json
    └───val.json

The text normalization operations performed are, in order:

  • Sentence splitting
  • HTML tag and entity removal
  • Extra whitespace removal
  • URL removal
  • Emoji removal
  • User age processing (e.g. 25m becomes 25 male)
  • Number processing
  • Control character removal
  • Case folding
  • Repeated character processing (e.g. reallllly becomes really)
  • Fixing and expanding English contractions
  • Special character and punctuation removal
  • Tokenization (uni-grams)
  • Stop-word and 1-character token removal
  • Lemmatization and POS tagging
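
As an illustration, here is a minimal sketch of a few of these steps (URL removal, case folding, repeated-character squashing, tokenization, stop-word removal and lemmatization) using regular expressions and NLTK; the actual implementation in PP_normalizing.py may differ:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def normalize(text):
    text = re.sub(r"https?://\S+", " ", text)         # URL removal
    text = text.lower()                               # case folding
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)        # reallllly -> really
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # special characters and punctuation
    tokens = nltk.word_tokenize(text)                 # tokenization (uni-grams)
    sw = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in sw and len(t) > 1]
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization

print(normalize("I reallllly liked this post! See https://example.com"))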

Text Summarization task (TS)

Step 0. Split and clean 'ProcessedData' for easy management

Run the notebook TS_Preprocessing for summarization.ipynb in order to:

  • remove documents without a summary
  • remove documents with a single sentence
  • split the training set into smaller chunks

You get a directory tree like this:

project_folder
└───Processed Data For Summarization
    ├───test_0.json
    ├───test_1.json
    ├───test_2.json
    ├───train_1_0.json
    ├───train_1_1.json
    ├───train_1_2.json
    ├───train_2_0.json
    ├───train_2_1.json
    ├───train_2_2.json
    ├───  ...
    ├───train_8_0.json
    ├───train_8_1.json
    ├───train_8_2.json
    ├───train_9_0.json
    ├───val_0.json
    ├───val_1.json
    └───val_2.json
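
A minimal sketch of this filtering step, assuming each record has a summary field and a document field holding the list of pre-processed sentences (the field names and chunk count are assumptions):

import numpy as np
import pandas as pd

df = pd.read_json("ProcessedData/train_1.json")

df = df[df["summary"].str.len() > 0]      # remove documents without a summary
df = df[df["document"].str.len() > 1]     # remove documents with a single sentence

# Split the chunk further into 3 smaller files.
for i, idx in enumerate(np.array_split(df.index, 3)):
    df.loc[idx].to_json(f"Processed Data For Summarization/train_1_{i}.json", orient="records")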

Step 1. Create a feature matrix for each of the JSON files in 'Processed Data For Summarization'

Run TS_featureMatrixGeneration.py to obtain the feature matrices (sentences × features). You get a directory tree like this:

project_folder
└───Feature Matrices
    ├───test_0.csv
    ├───test_1.csv
    ├───test_2.csv
    ├───train_1_0.csv
    ├───train_1_1.csv
    ├───train_1_2.csv
    ├───train_2_0.csv
    ├───train_2_1.csv
    ├───train_2_2.csv
    ├───  ...
    ├───train_8_0.csv
    ├───train_8_1.csv
    ├───train_8_2.csv
    ├───train_9_0.csv
    ├───val_0.csv
    ├───val_1.csv
    └───val_2.csv

Run the notebook TS_featureMatrixGeneration2.ipynb to concatenate the chunks into single train, val and test datasets. You get a directory tree like this:

project_folder
└───Feature Matrices
   ├───test_0.csv
   ├───test_1.csv
   ├───test_2.csv
   ├───train_1_0.csv
   ├───train_1_1.csv
   ├───train_1_2.csv
   ├───train_2_0.csv
   ├───train_2_1.csv
   ├───train_2_2.csv
   ├───  ...
   ├───train_8_0.csv
   ├───train_8_1.csv
   ├───train_8_2.csv
   ├───train_9_0.csv
   ├───val_0.csv
   ├───val_1.csv
   ├───val_2.csv
   ├───test.csv
   ├───train.csv
   └───val.csv

Features generated at this step are the following:

  • sentence_relative_positions
  • sentence_similarity_score_1_gram
  • word_in_sentence_relative
  • NOUN_tag_ratio
  • VERB_tag_ratio
  • ADJ_tag_ratio
  • ADV_tag_ratio
  • TF_ISF
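
For illustration, a simplified sketch of two of these features, sentence relative position and TF-ISF; the exact formulas used in TS_featureMatrixGeneration.py may differ:

import math
from collections import Counter

def sentence_relative_positions(sentences):
    # Position of each sentence in the document, scaled to [0, 1].
    n = len(sentences)
    return [i / (n - 1) if n > 1 else 0.0 for i in range(n)]

def tf_isf(sentences):
    # sentences: list of token lists for one document.
    # TF-ISF: term frequency * inverse sentence frequency, averaged per sentence.
    n = len(sentences)
    sent_freq = Counter(tok for sent in sentences for tok in set(sent))
    scores = []
    for sent in sentences:
        tf = Counter(sent)
        score = sum(tf[t] * math.log(n / sent_freq[t]) for t in tf) / max(len(sent), 1)
        scores.append(score)
    return scores

doc = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
print(sentence_relative_positions(doc))   # [0.0, 0.5, 1.0]
print(tf_isf(doc))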

Step 2. Perform CUR undersampling

Run the notebook TS_featureMatrixUndersampling.ipynb to perform CUR undersampling on both the train and validation sets. You get a directory tree like this:

project_folder
└───Undersampled Data
    ├───trainAndValMinorityClass.csv
    └───trainAndValMajorityClassUndersampled.csv

The majority and minority classes are saved separately because CUR undersampling is applied only to the majority class.
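
A rough sketch of CUR-style row selection, where majority-class rows are sampled according to leverage scores computed from a truncated SVD of the feature matrix; the notebook's actual procedure may differ:

import numpy as np

def cur_undersample(X_majority, n_keep, rank=5, seed=42):
    # Leverage scores from the top-`rank` left singular vectors of the (centered) matrix.
    U, _, _ = np.linalg.svd(X_majority - X_majority.mean(axis=0), full_matrices=False)
    leverage = (U[:, :rank] ** 2).sum(axis=1)
    probs = leverage / leverage.sum()
    # Keep rows sampled with probability proportional to their leverage score.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_majority), size=n_keep, replace=False, p=probs)
    return X_majority[idx], idx

X_majority = np.random.rand(1000, 8)              # 8 features, as listed above
X_kept, kept_idx = cur_undersample(X_majority, n_keep=200)
print(X_kept.shape)                               # (200, 8)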

Step 3. Perform EditedNearestNeighbours (ENN) undersampling

Run the notebook TS_featureMatrixAnalysis.ipynb to perform ENN undersampling. You get a directory tree like this:

project_folder
└───Undersampled Data
    ├───trainAndValUndersampledENN3.csv
    ├───trainAndValMinorityClass.csv
    └───trainAndValMajorityClassUndersampled.csv
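
A sketch of ENN undersampling with the imbalanced-learn package; the "label" column name is an assumption about the CSV schema:

import pandas as pd
from imblearn.under_sampling import EditedNearestNeighbours

majority = pd.read_csv("Undersampled Data/trainAndValMajorityClassUndersampled.csv")
minority = pd.read_csv("Undersampled Data/trainAndValMinorityClass.csv")
data = pd.concat([majority, minority], ignore_index=True)

X = data.drop(columns=["label"])                  # hypothetical target column
y = data["label"]

enn = EditedNearestNeighbours(n_neighbors=3)      # 3 neighbours, matching the ENN3 file name
X_res, y_res = enn.fit_resample(X, y)

pd.concat([X_res, y_res], axis=1).to_csv(
    "Undersampled Data/trainAndValUndersampledENN3.csv", index=False)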

Step 4. Machine Learning model selection and evaluation

Run the notebook TS_featureMatrixAnalysis.ipynb to perform a RandomizedSearchCV over the following models:

  • RandomForestClassifier
  • LogisticRegression
  • HistGradientBoostingClassifier

Each model is searched over a few candidate parameter configurations.

Then, evaluate the resulting best model on the test set with respect to:

  • ROC curve
  • Recall
  • Precision
  • Accuracy
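
For reference, a minimal sketch of this selection and evaluation, shown here only for RandomForestClassifier, with an illustrative parameter grid and a synthetic stand-in dataset; the notebook's actual grids and data loading differ:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the undersampled feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(100, 500), "max_depth": randint(3, 20)},
    n_iter=10, cv=3, scoring="recall", random_state=42)
search.fit(X_train, y_train)

best = search.best_estimator_
y_pred = best.predict(X_test)
print("ROC AUC:", roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))
print("Recall:", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))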

Step 5. Perform Maximal Marginal Relevance (MMR) selection

Run the notebook TS_featureMatrixAnalysis.ipynb to perform MMR selection and obtain an extractive summary for each document in the test set.
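
A simplified sketch of MMR over sentence feature vectors: sentences are picked greedily so that they are relevant (high model score) but not redundant with the ones already selected; the vector construction and the lambda trade-off used in the notebook may differ:

import numpy as np

def mmr_select(sent_vectors, relevance, k=3, lam=0.7):
    # Cosine similarity between every pair of sentence vectors.
    norms = np.linalg.norm(sent_vectors, axis=1)
    sims = np.inner(sent_vectors, sent_vectors) / np.outer(norms, norms)
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max(sims[i, j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

sent_vectors = np.random.rand(6, 8)   # 6 sentences x 8 features
relevance = np.random.rand(6)         # e.g. classifier probability of being a summary sentence
print(mmr_select(sent_vectors, relevance, k=3))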

Step 6. Summary Evaluation

Run the notebook TS_featureMatrixAnalysis.ipynb to measure summary quality by means of:

  • Rouge1
  • Rouge2
  • RougeL
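
A sketch of this evaluation using the rouge-score package (the reference and candidate strings here are placeholders):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the tl;dr written by the post author"       # placeholder gold summary
candidate = "the extractive summary selected with mmr"   # placeholder system summary
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(name, round(result.fmeasure, 3))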

Topic Modeling task (TM)

Step 0. Perform preprocessing

Run the TM_Preprocessing for topic modeling.ipynb notebook to process the dataset and keep only the data needed for topic modeling. The output is saved here:

project_folder
└───processed_dataset
    └───test.json

Step 1. Perform topic modeling on the test set

Run the TM_topic_modeling.ipynb notebook, which performs LDA (with a grid search over the hyper-parameters) and LSA. The notebook saves 9 CSV files, 3 for LSA and 6 for LDA (one set for each of the UMass and CV coherence measures), containing: the document-topic matrix, the topic-term matrix and a table with the top terms per topic.

project_folder
└───Results_topic_modeling
    ├───lda_doc_topic.csv
    ├───lda_doc_topic_CV.csv
    ├───lda_top_terms.csv
    ├───lda_top_terms_CV.csv
    ├───lda_topic_term.csv
    ├───lda_topic_term_cv.csv
    ├───lsa_doc_topic.csv
    ├───lsa_top_terms.csv
    ├───lsa_topic_term.csv

It also saves images showing the number of words per document and a word cloud in:

project_folder
└───Images

The hyper-parameter grid search results for UMass and CV coherence are saved in:

project_folder
└───Hyperparameters
    ├───tuning.csv
    └───tuning_CV.csv
