This application allows you to explore potential topics in a document.
It provides for:
- Text extraction from a PDF document
- Exploratory analysis of the document corpus
- Visualization of topic and document relationships
Topic generation is accomplished using Latent Dirichlet Allocation (LDA), an unsupervised machine learning technique.
The Home page contains the page selection menu in an expandable sidebar and a short description of the application.
The File Selection page allows you to load a PDF file and set parameters for pre-processing. This cleans the document text, converting it into a form more suitable for topic modelling.
Select a PDF file: Streamlit file selection widget. Files displayed in the widget are restricted to PDFs. Note that if you switch to another page and then return to the File Selection page, the widget will no longer display the selected file. As long as the Currently Selected File entry shows a file name, a file is loaded and available for pre-processing. Ideally, load and process the selected file before switching to another page.
The selected file is parsed using SpaCy Layout (a wrapper around IBM's Docling module) to extract text spans (paragraphs) from the document. Because of the complex internal structure of a PDF file, this may take several minutes. See the PDF Association for more information.
First Page to Process: The page at which you wish pre-processing to start. This should be the first page following the front matter.
Last Page to Process: The page at which you wish pre-processing to end. This should be the last page before the end matter.
When the Process File button is pressed, the input file is processed using the parameters set by the user and a cleaned JSON file of text spans is created. This file is used in Text Exploration and Topic Visualization. The text processing steps undertaken are:
- Remove URLs
- Remove HTML
- Remove bibliographic citations
- Split the text into tokens
- Filter the text for allowed parts of speech (proper nouns, nouns, verbs, adjectives, adverbs)
- Convert text to lower case
- Delete punctuation
- Lemmatize the text
- Delete stop words
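The steps above can be sketched in plain Python. This is a simplified illustration using regular expressions and a toy stop-word list; the app's actual pipeline uses SpaCy, which also handles part-of-speech filtering and lemmatization.

```python
import re

# Toy stop-word list for illustration; the app relies on SpaCy's stop words.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "at"}

def clean_text(text: str) -> list[str]:
    """Apply a simplified version of the pre-processing steps to one text span."""
    text = re.sub(r"https?://\S+", " ", text)         # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)              # remove HTML tags
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", " ", text)  # remove numeric citations like [12]
    tokens = re.findall(r"[A-Za-z]+", text)           # tokenize, dropping punctuation and digits
    tokens = [t.lower() for t in tokens]              # convert to lower case
    return [t for t in tokens if t not in STOP_WORDS] # delete stop words

print(clean_text("See <b>the</b> results [12] at https://example.org and Table 3."))
# → ['see', 'results', 'table']
```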
These text exploration techniques provide you with information about the basic structure and content of the manuscript you are indexing.
JSON File: A drop-down list containing all the JSON files found in the current working directory. The drop-down defaults to the first file found.
Provides basic information about the document structure, displaying characters per document and words per document. A document in this case is equivalent to a paragraph in the manuscript. Note that pre-processing can dramatically reduce a document's length from its unprocessed size.
Bins: The number of bins used to display manuscript structure. The range is 1 - 100 with a default of 50.
Common Words: The most common words found in the corpus of documents. The number of words displayed has a range of 1 - 100, with a default of 30.
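For illustration, the most common words in a corpus of pre-processed documents can be tallied with Python's `collections.Counter`. The token lists below are invented, and this is only a sketch of the kind of counting involved, not the app's code.

```python
from collections import Counter

# Each "document" is one pre-processed paragraph, stored as a list of tokens.
docs = [
    ["topic", "model", "corpus"],
    ["topic", "document", "corpus"],
    ["topic", "word", "frequency"],
]

# Count every token across all documents, then take the most common ones.
counts = Counter(token for doc in docs for token in doc)
print(counts.most_common(2))  # → [('topic', 3), ('corpus', 2)]
```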
An n-gram is a sequence of n adjacent words in a document, taken in order. If you compare the n-grams created by SpaCy with those generated by the scikit-learn LDA module during topic creation, you will find that SpaCy usually produces n-grams that are easier to interpret.
N-Grams: The type of n-gram to display. An order-2 n-gram is known as a bigram, and an order-3 n-gram as a trigram. The range is 2 - 5, with a default of 2.
Number of N-Grams: The total number of n-grams to be displayed. These are the most common n-grams in the corpus of documents. The number of n-grams displayed can range from 1 - 100, with a default of 40.
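Extracting and counting n-grams from a token list can be sketched with a few lines of stdlib Python. This is an illustrative helper, not the SpaCy-based implementation the app uses.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all order-n n-grams (tuples of n adjacent tokens) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["latent", "dirichlet", "allocation", "topic", "model"]
bigrams = ngrams(tokens, 2)
print(bigrams[0])  # → ('latent', 'dirichlet')

# Most common bigrams across a (toy) corpus of token lists:
corpus = [tokens, ["topic", "model", "evaluation"]]
top = Counter(bg for doc in corpus for bg in ngrams(doc, 2)).most_common(1)
print(top)  # → [(('topic', 'model'), 2)]
```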
Named Entity: A dropdown list containing a selection of SpaCy named entities. The list defaults to GPE.
Number of Entities: The number of the selected named entity returned. This value has a range of 1 - 100, with a default of 20.
This page visually displays potential topics found by scikit-learn, using the Python module pyLDAvis and custom visualizations.
Scikit-learn uses the Latent Dirichlet Allocation (LDA) model to derive topics from a corpus of documents. Its LatentDirichletAllocation API takes a large number of parameters, some of which are exposed here for the user to adjust. Since phrases are more informative when interpreting latent topics, model generation is restricted to bigrams and trigrams in topic formation.
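The setup described above can be sketched with scikit-learn. The corpus is invented, and the mapping of the page's widgets to parameters (Chunk Size to `batch_size`, Data Passes to `max_iter`) is an assumption about how the app wires them, not taken from its source.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "latent dirichlet allocation topic model",
    "topic model corpus document",
    "document corpus word frequency",
    "latent topic word distribution",
]

# Restrict the vocabulary to bigrams and trigrams, mirroring the app's choice.
vectorizer = CountVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,             # "Number of Topics"
    max_iter=10,                # roughly "Data Passes" (assumed mapping)
    learning_method="online",   # batch_size only applies to online learning
    batch_size=100,             # roughly "Chunk Size" (assumed mapping)
    random_state=0,
)
doc_topics = lda.fit_transform(X)
print(doc_topics.shape)  # → (4, 2): one topic distribution per document
```

Each row of `doc_topics` sums to 1, giving the per-document topic mixture that the visualizations below draw on.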
I recommend creating a visualization with the default parameters to see a rough approximation, then tweaking until you produce what you believe is the correct number of topics.
JSON File: A drop-down list containing all the JSON files found in the current working directory. The drop-down defaults to the first file found.
Number of Topics: The number of requested latent topics to be extracted from the training corpus. Default is 10, range is 1 - 30.
Number of Terms: The number of terms to display in the barcharts of the pyLDAvis visualization. Default is 15, range is 1 - 30.
Chunk Size: Number of documents to be used in each training chunk. Default is 100, range is 10 - 500.
Data Passes: Number of passes through the corpus during training. Default is 10, range is 1 - 50.
LDA Visualizations
A dropdown list allowing for the selection of one of the following visualization methods.
topic map: a 2-dimensional pyLDAvis visualization of the topics generated by LatentDirichletAllocation.
topic similarity: a Plotly heatmap of the relationship between topics. The darker the color of the square, the more closely related the topics.
topic barchart: a grid of Plotly bar charts with a chart for each topic showing the top 5 terms in the topic.
topic clouds: a Matplotlib grid of wordclouds with a cloud for each topic showing the top 10 words in each topic.
topic sunburst: a Plotly sunburst chart showing the relative importance of both topics and most common words within a topic.
topic treemap: a Plotly treemap chart showing the relative importance of both topics and most common words within a topic.
document topics: a Plotly bar chart showing the corpus of documents grouped by the dominant topic within each document.
documents: a Plotly 2-dimensional scatter plot of the corpus of documents, color-coded by the dominant topic in each document. Dimensionality reduction to 2-D is performed using scikit-learn's t-SNE.
3-D document topics: a Plotly 3-dimensional scatter plot of the corpus of documents, color-coded by the dominant topic in each document. Dimensionality reduction to 3-D is performed using scikit-learn's t-SNE.
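The t-SNE projection behind these two scatter plots can be sketched as follows. The document-topic matrix here is random stand-in data, not real LDA output, and the Plotly plotting itself is omitted.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the document-topic matrix produced by LDA (20 documents, 10 topics).
doc_topics = rng.random((20, 10))

# Reduce each document's topic mixture to a 2-D point; perplexity must be
# smaller than the number of samples. Use n_components=3 for the 3-D plot.
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(doc_topics)
print(coords.shape)  # → (20, 2): one (x, y) point per document

# Color each point by the document's dominant topic, as the plots do.
dominant_topic = doc_topics.argmax(axis=1)
```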
cluster map: a Plotly cluster map showing the relationship between topics. Allows for the examination of higher-order groupings of topics.