First of all, create three empty folders: ./Dataset_TLDRHQ, ./ProcessedData and ./Dataset_splitted.
Download the annotations from the official Google Drive folder and extract them into ./Dataset_TLDRHQ, resulting in a folder tree like this:
project_folder
└───Dataset_TLDRHQ
    ├───dataset-m0
    ├───dataset-m1
    ├───dataset-m2
    ├───dataset-m2021
    ├───dataset-m3
    ├───dataset-m4
    └───dataset-m6
Run the PP_cleaning.py script, which performs data cleaning (removing duplicates), splits the dataset into training, validation and test sets (further splitting the training set so that it is easier to manage) and then saves the result as .json files in ./Dataset_splitted. You get a directory tree like this:
project_folder
└───Dataset_splitted
    ├───test.json
    ├───train_1.json
    ├───train_10.json
    ├───train_2.json
    ├───train_3.json
    ├───train_4.json
    ├───train_5.json
    ├───train_6.json
    ├───train_7.json
    ├───train_8.json
    ├───train_9.json
    └───val.json
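A minimal sketch of what this cleaning and splitting step does, assuming the downloaded annotations are JSON-lines files with a document field holding the post text (the file layout, column names and split ratios below are assumptions, not the exact ones used by PP_cleaning.py):

```python
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

root = Path("./Dataset_TLDRHQ")
out = Path("./Dataset_splitted")
out.mkdir(exist_ok=True)

# Load every JSON-lines file found in the dataset-m* folders into one frame.
frames = [pd.read_json(f, lines=True) for f in root.glob("dataset-m*/*.json")]
data = pd.concat(frames, ignore_index=True)

# Data cleaning: drop duplicated posts.
data = data.drop_duplicates(subset=["document"]).reset_index(drop=True)

# Split into train / validation / test (90 / 5 / 5 here, purely illustrative).
train, rest = train_test_split(data, test_size=0.10, random_state=42)
val, test = train_test_split(rest, test_size=0.50, random_state=42)

# Split the training set into 10 chunks so that each file stays manageable.
for i, idx in enumerate(np.array_split(train.index, 10), start=1):
    train.loc[idx].to_json(out / f"train_{i}.json", orient="records")
val.to_json(out / "val.json", orient="records")
test.to_json(out / "test.json", orient="records")
```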
Run the PP_normalizing.py script, which performs sentence splitting, text normalization, tokenization, stop-word removal, lemmatization and POS tagging on the document field containing the Reddit posts, and then saves the result as .json files in ./ProcessedData. You get a directory tree like this:
project_folder
└───ProcessedData
    ├───test.json
    ├───train_1.json
    ├───train_10.json
    ├───train_2.json
    ├───train_3.json
    ├───train_4.json
    ├───train_5.json
    ├───train_6.json
    ├───train_7.json
    ├───train_8.json
    ├───train_9.json
    └───val.json
The text normalization operations performed include, in order:
- Sentence Splitting
- HTML Tags and Entities Removal
- Extra White Spaces Removal
- URLs Removal
- Emoji Removal
- User Age Processing (e.g. 25m becomes 25 male)
- Numbers Processing
- Control Characters Removal
- Case Folding
- Repeated Characters Processing (e.g. reallllly becomes really)
- Fix and Expand English Contractions
- Special Characters and Punctuation Removal
- Tokenization (Uni-Grams)
- Stop-Words and 1-Character Tokens Removal
- Lemmatization and POS Tagging
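A hedged sketch covering a subset of these operations with NLTK and plain regular expressions (the patterns and resources are illustrative; the actual PP_normalizing.py may use different tools):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Resource names differ across NLTK versions; unknown names are simply skipped.
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize(document: str) -> list:
    processed = []
    for sent in nltk.sent_tokenize(document):                              # sentence splitting
        sent = re.sub(r"<[^>]+>", " ", sent)                               # HTML tags removal
        sent = re.sub(r"https?://\S+", " ", sent)                          # URLs removal
        sent = re.sub(r"\b(\d{1,2})\s*m\b", r"\1 male", sent, flags=re.I)  # "25m" -> "25 male"
        sent = sent.lower()                                                # case folding
        sent = re.sub(r"(.)\1{2,}", r"\1\1", sent)                         # "reallllly" -> "really"
        sent = re.sub(r"[^a-z0-9\s]", " ", sent)                           # special chars / punctuation
        sent = re.sub(r"\s+", " ", sent).strip()                           # extra white spaces
        tokens = nltk.word_tokenize(sent)                                  # tokenization (uni-grams)
        tokens = [t for t in tokens if t not in stop_words and len(t) > 1] # stop-words, 1-char tokens
        lemmas = [lemmatizer.lemmatize(t) for t in tokens]                 # lemmatization
        processed.append(nltk.pos_tag(lemmas))                             # POS tagging
    return processed

print(normalize("I'm 25m and my landlord is reallllly annoying! See https://example.com <br>"))
```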
Run the notebook TS_Preprocessing for summarization.ipynb in order to:
- remove documents without a summary
- remove documents with a single sentence
- split the train dataset into smaller files
You get a directory tree like this:
project_folder
└───Processed Data For Summarization
    ├───test_0.json
    ├───test_1.json
    ├───test_2.json
    ├───train_1_0.json
    ├───train_1_1.json
    ├───train_1_2.json
    ├───train_2_0.json
    ├───train_2_1.json
    ├───train_2_2.json
    ├─── ...
    ├───train_8_0.json
    ├───train_8_1.json
    ├───train_8_2.json
    ├───train_9_0.json
    ├───val_0.json
    ├───val_1.json
    └───val_2.json
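A sketch of the filtering and splitting performed here, assuming each record keeps a summary string and a document stored as a list of processed sentences (the field names and the number of parts per file are assumptions):

```python
import pandas as pd
from pathlib import Path

src = Path("./ProcessedData")
dst = Path("./Processed Data For Summarization")
dst.mkdir(exist_ok=True)

for f in sorted(src.glob("*.json")):
    df = pd.read_json(f)
    df = df[df["summary"].fillna("").str.strip().astype(bool)]  # drop documents without a summary
    df = df[df["document"].str.len() > 1]                       # drop single-sentence documents
    # Split each input file into smaller parts (three here, as in the tree above).
    n_parts, size = 3, -(-len(df) // 3)                         # ceiling division
    for i in range(n_parts):
        part = df.iloc[i * size:(i + 1) * size]
        if not part.empty:
            part.to_json(dst / f"{f.stem}_{i}.json", orient="records")
```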
Run TS_featureMatrixGeneration.py to obtain the feature matrices (sentences x features). You get a directory tree like this:
project_folder
└───Feature Matrices
    ├───test_0.csv
    ├───test_1.csv
    ├───test_2.csv
    ├───train_1_0.csv
    ├───train_1_1.csv
    ├───train_1_2.csv
    ├───train_2_0.csv
    ├───train_2_1.csv
    ├───train_2_2.csv
    ├─── ...
    ├───train_8_0.csv
    ├───train_8_1.csv
    ├───train_8_2.csv
    ├───train_9_0.csv
    ├───val_0.csv
    ├───val_1.csv
    └───val_2.csv
Run the notebook TS_featureMatrixGeneration2.ipynb to join the chunks of the train, val and test sets into single files. You get a directory tree like this:
project_folder
└───Feature Matrices
    ├───test_0.csv
    ├───test_1.csv
    ├───test_2.csv
    ├───train_1_0.csv
    ├───train_1_1.csv
    ├───train_1_2.csv
    ├───train_2_0.csv
    ├───train_2_1.csv
    ├───train_2_2.csv
    ├─── ...
    ├───train_8_0.csv
    ├───train_8_1.csv
    ├───train_8_2.csv
    ├───train_9_0.csv
    ├───val_0.csv
    ├───val_1.csv
    ├───val_2.csv
    ├───test.csv
    ├───train.csv
    └───val.csv
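A possible way the per-chunk CSVs are concatenated into train.csv, val.csv and test.csv (a sketch; the notebook may proceed differently):

```python
import pandas as pd
from pathlib import Path

matrices = Path("./Feature Matrices")
for split in ("train", "val", "test"):
    # Collect all chunk files of this split, e.g. train_1_0.csv, train_1_1.csv, ...
    parts = sorted(matrices.glob(f"{split}_*.csv"))
    joined = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
    joined.to_csv(matrices / f"{split}.csv", index=False)
```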
The features generated by TS_featureMatrixGeneration.py are the following:
- sentence_relative_positions
- sentence_similarity_score_1_gram
- word_in_sentence_relative
- NOUN_tag_ratio
- VERB_tag_ratio
- ADJ_tag_ratio
- ADV_tag_ratio
- TF_ISF
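For illustration, one common way to compute two of these features for a single document, with the document given as a list of tokenized sentences (the exact formulas used in TS_featureMatrixGeneration.py may differ):

```python
import numpy as np

def sentence_relative_positions(n_sentences: int) -> np.ndarray:
    """Position of each sentence in the document, scaled to [0, 1]."""
    return np.arange(n_sentences) / max(n_sentences - 1, 1)

def tf_isf(doc_tokens: list) -> np.ndarray:
    """Mean TF-ISF per sentence: term frequency weighted by the inverse of
    the number of sentences that contain the term (log-scaled)."""
    n = len(doc_tokens)
    sentence_sets = [set(s) for s in doc_tokens]
    scores = []
    for sent in doc_tokens:
        if not sent:
            scores.append(0.0)
            continue
        vals = []
        for term in sent:
            tf = sent.count(term) / len(sent)
            sf = sum(term in s for s in sentence_sets)
            vals.append(tf * np.log(n / sf))
        scores.append(float(np.mean(vals)))
    return np.array(scores)

doc = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
print(sentence_relative_positions(len(doc)))   # [0.  0.5 1. ]
print(tf_isf(doc))
```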
Run the notebook TS_featureMatrixUndersampling.ipynb in order to perform CUR undersampling on both the train and validation data sets. You get a directory tree like this:
project_folder
└───Undersampled Data
    ├───trainAndValMinorityClass.csv
    └───trainAndValMajorityClassUndersampled.csv
The majority and minority classes are saved separately because CUR undersampling is applied only to the majority class.
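One way to approximate a CUR-based selection is to keep majority-class rows sampled according to their SVD leverage scores; the sketch below follows that idea, with the label column name, the rank and the number of rows to keep all being assumptions rather than the notebook's actual choices:

```python
import numpy as np
import pandas as pd
from pathlib import Path

def cur_row_selection(X: np.ndarray, n_keep: int, rank: int = 5, seed: int = 42) -> np.ndarray:
    """Pick n_keep row indices with probability proportional to their leverage scores."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    leverage = (U[:, :rank] ** 2).sum(axis=1)
    return rng.choice(len(X), size=n_keep, replace=False, p=leverage / leverage.sum())

out = Path("./Undersampled Data")
out.mkdir(exist_ok=True)

# The real notebook combines train.csv and val.csv; train.csv alone keeps the sketch short.
data = pd.read_csv("./Feature Matrices/train.csv")
majority, minority = data[data["label"] == 0], data[data["label"] == 1]

keep = cur_row_selection(majority.drop(columns="label").to_numpy(), n_keep=2 * len(minority))
majority.iloc[keep].to_csv(out / "trainAndValMajorityClassUndersampled.csv", index=False)
minority.to_csv(out / "trainAndValMinorityClass.csv", index=False)
```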
Run the notebook TS_featureMatrixAnalysis.ipynb to perform ENN (Edited Nearest Neighbours) undersampling. You get a directory tree like this:
project_folder
└───Undersampled Data
    ├───trainAndValUndersampledENN3.csv
    ├───trainAndValMinorityClass.csv
    └───trainAndValMajorityClassUndersampled.csv
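ENN undersampling is available off the shelf in imbalanced-learn; a sketch under the same label-column assumption as above:

```python
import pandas as pd
from imblearn.under_sampling import EditedNearestNeighbours

minority = pd.read_csv("./Undersampled Data/trainAndValMinorityClass.csv")
majority = pd.read_csv("./Undersampled Data/trainAndValMajorityClassUndersampled.csv")
data = pd.concat([majority, minority], ignore_index=True)

X, y = data.drop(columns="label"), data["label"]      # "label" column name is an assumption
enn = EditedNearestNeighbours(n_neighbors=3)          # k = 3, matching the "ENN3" file name
X_res, y_res = enn.fit_resample(X, y)

resampled = pd.DataFrame(X_res, columns=X.columns).assign(label=list(y_res))
resampled.to_csv("./Undersampled Data/trainAndValUndersampledENN3.csv", index=False)
```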
Run the notebook TS_featureMatrixAnalysis.ipynb to perform a RandomizedSearchCV over the following models:
- RandomForestClassifier
- LogisticRegression
- HistGradientBoostingClassifier
with a few possible parameter configurations.
Then, evaluate the resulting best model on the test set with respect to:
- ROC curve
- Recall
- Precision
- Accuracy
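A sketch of this model-selection and evaluation step with scikit-learn (the parameter grids, file names and the label column are illustrative):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV

train = pd.read_csv("./Undersampled Data/trainAndValUndersampledENN3.csv")
test = pd.read_csv("./Feature Matrices/test.csv")
X_train, y_train = train.drop(columns="label"), train["label"]   # "label" name is an assumption
X_test, y_test = test.drop(columns="label"), test["label"]

searches = {
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}),
    "lr": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0, 10.0]}),
    "hgb": (HistGradientBoostingClassifier(), {"learning_rate": [0.05, 0.1], "max_iter": [100, 300]}),
}

best_model, best_score = None, -1.0
for name, (model, params) in searches.items():
    search = RandomizedSearchCV(model, params, n_iter=4, cv=3, scoring="roc_auc", random_state=42)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_model, best_score = search.best_estimator_, search.best_score_

# Evaluate the best model on the held-out test set.
pred = best_model.predict(X_test)
proba = best_model.predict_proba(X_test)[:, 1]
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))
```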
Run the notebook TS_featureMatrixAnalysis.ipynb to perform MMR (Maximal Marginal Relevance) and obtain an extractive summary for each document in the test set.
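MMR selects sentences that are relevant to the document while penalizing redundancy with the sentences already selected. A minimal TF-IDF based sketch (the similarity measure and the lambda value are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_summary(sentences: list, n_select: int = 3, lam: float = 0.7) -> list:
    # Relevance = similarity of each sentence to the whole document;
    # redundancy = similarity to sentences already picked.
    tfidf = TfidfVectorizer().fit_transform(sentences + [" ".join(sentences)])
    sent_vecs, doc_vec = tfidf[:-1], tfidf[-1]
    relevance = cosine_similarity(sent_vecs, doc_vec).ravel()
    pairwise = cosine_similarity(sent_vecs)

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < n_select:
        def mmr_score(i):
            redundancy = max(pairwise[i][j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]

print(mmr_summary([
    "my landlord raised the rent again",
    "the rent increase was not in the contract",
    "i love my cat",
], n_select=2))
```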
Run the notebook TS_featureMatrixAnalysis.ipynb to measure the quality of the summaries by means of:
- Rouge1
- Rouge2
- RougeL
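A small example of how such scores can be computed, here with the rouge-score package (the notebook may use a different ROUGE implementation):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "landlord raised rent without notice"           # gold TLDR (illustrative)
generated = "my landlord raised the rent again"             # extractive summary (illustrative)
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")
```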
Run the TM_Preprocessing for topic modeling.ipynb notebook to process the dataset and extract only the data that is useful for topic modeling. The output is saved here:
project_folder
└───processed_dataset
    └───test.json
Run the TM_topic_modeling.ipynb notebook, which performs LDA (with a grid search over the hyper-parameters) and LSA. The notebook saves 9 CSV files, 3 for LSA and 6 for LDA (one set for the UMass coherence measure and one for CV), containing the document-topic matrix, the topic-term matrix and a table with the top terms per topic. You get a directory tree like this:
project_folder
└───Results_topic_modeling
    ├───lda_doc_topic.csv
    ├───lda_doc_topic_CV.csv
    ├───lda_top_terms.csv
    ├───lda_top_terms_CV.csv
    ├───lda_topic_term.csv
    ├───lda_topic_term_cv.csv
    ├───lsa_doc_topic.csv
    ├───lsa_top_terms.csv
    └───lsa_topic_term.csv
It also saves images showing the number of words per document and a word cloud in
project_folder
└───Images
It also saves the hyper-parameter grid-search results for the UMass and CV coherence measures in
project_folder
└───Hyperparameters
    ├───tuning.csv
    └───tuning_CV.csv
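A hedged sketch of the LDA/LSA modeling step using gensim on a toy corpus (the library choice, the hyper-parameter grid and the toy texts are assumptions; only the coherence-driven model selection idea is shown, and saving the CSV files is omitted):

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel, LsiModel

texts = [["rent", "landlord", "contract"], ["cat", "vet", "food"], ["rent", "deposit", "landlord"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Grid search over the number of topics, keeping the LDA model with the best UMass coherence.
best_lda, best_umass = None, float("-inf")
for num_topics in (2, 3):
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, random_state=42)
    umass = CoherenceModel(model=lda, corpus=corpus, coherence="u_mass").get_coherence()
    if umass > best_umass:
        best_lda, best_umass = lda, umass

# CV coherence needs the tokenized texts rather than the bag-of-words corpus.
cv = CoherenceModel(model=best_lda, texts=texts, dictionary=dictionary, coherence="c_v").get_coherence()

lsa = LsiModel(corpus, num_topics=2, id2word=dictionary)

print(best_lda.print_topics())                                  # topic-term weights (top terms per topic)
print([best_lda.get_document_topics(bow) for bow in corpus])    # document-topic distributions
print(lsa.print_topics())
```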