Model fitting
-------------

This example demonstrates basic model fitting.
Prerequisite: your documents should be preprocessed (cleaned, lemmatized or
stemmed, and stop words removed).

.. code-block:: python

    import bitermplus as btm
    import pandas as pd

    # Importing data
    df = pd.read_csv(
        'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
    texts = df['texts'].str.strip().tolist()

    # Vectorizing documents, obtaining the vocabulary and biterms.
    # Internally, btm.get_words_freqs uses sklearn's CountVectorizer,
    # so you can pass any of its parameters, e.g. for stop word removal:
    stop_words = ["word1", "word2", "word3"]
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
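    # Generating biterms (word pairs) from the vectorized documents
    biterms = btm.get_biterms(docs_vec)

    # Initializing and fitting the model. This is a reconstructed sketch:
    # the hyperparameter values (T, M, alpha, beta, iterations) are
    # illustrative, not prescriptive
    model = btm.BTM(
        X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
    model.fit(biterms, iterations=20)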

Inference
---------

Calculate the document-topic probability matrix (make an inference):

.. code-block:: python

    p_zd = model.transform(docs_vec)

For inference on new documents, vectorize them using the vocabulary from the training set:

.. code-block:: python
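
    # A minimal sketch: `new_texts` is assumed to hold the new,
    # already preprocessed documents
    new_docs_vec = btm.get_vectorized_docs(new_texts, vocabulary)
    p_zd_new = model.transform(new_docs_vec)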

Calculating metrics
-------------------

Calculate perplexity using the document-topic probability matrix (``p_zd``)
obtained at the inference step:

.. code-block:: python
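
    # A sketch of the metrics calls; the last argument of btm.perplexity
    # (here 8) must match the number of topics the model was fitted with
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
    coherence = btm.coherence(model.matrix_topics_words_, X, M=20)

    # Alternatively, use the model attributes
    perplexity = model.perplexity_
    coherence = model.coherence_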

Visualizing results
-------------------

Visualize results using the `tmplot <https://pypi.org/project/tmplot/>`_ package:

.. code-block:: python
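
    # A minimal sketch: tmplot builds an interactive report
    # from a fitted model and the source documents
    import tmplot as tmp
    tmp.report(model=model, docs=texts)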

Filtering stable topics
-----------------------

Topic models suffer from instability across runs [1]_ [2]_ [3]_.
The ``tmplot`` package provides a method for selecting stable topics using
distance metrics such as Kullback-Leibler divergence (symmetric and
non-symmetric), Hellinger distance, Jeffrey's divergence, Jensen-Shannon
divergence, Jaccard index, Bhattacharyya distance, and total variation distance.

.. code-block:: python
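
    # A reconstructed sketch of the stable-topics workflow; the file paths,
    # reference-model choice, and threshold are illustrative. Check the
    # tmplot documentation for exact signatures.
    import glob
    import pickle
    import numpy as np
    import tmplot as tmp

    # Load several models trained on the same data with different seeds
    models = []
    for fn in sorted(glob.glob('results/model[0-9].pkl')):
        with open(fn, 'rb') as file:
            models.append(pickle.load(file))

    # Pick a reference model to match topics against
    ref = np.random.randint(len(models))

    # Find the closest topics across models, then keep only the stable ones
    closest_topics, dist = tmp.get_closest_topics(models, ref=ref)
    stable_topics, stable_dist = tmp.get_stable_topics(
        closest_topics, dist, ref=ref, thres=0.7)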

Model loading and saving
------------------------

Models support `pickle <https://docs.python.org/3/library/pickle.html>`_
serialization (since v0.5.3):

.. code-block:: python
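
    # A minimal sketch: a standard pickle round-trip (file name is illustrative)
    import pickle

    # Saving the fitted model
    with open('model.pkl', 'wb') as file:
        pickle.dump(model, file)

    # Loading the model back
    with open('model.pkl', 'rb') as file:
        model = pickle.load(file)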