
Commit d9a89f7

docs: improved explanations and wordings

1 parent 8cb208e

5 files changed: +28 -40 lines

docs/source/benchmarks.rst

Lines changed: 3 additions & 5 deletions
@@ -1,11 +1,9 @@
 Benchmarks
 ----------

-In this section, the results of a series of benchmarks done on *SearchSnippets* dataset
-are presented. Sixteen models were trained with different iterations number
-(from 10 to 2000) and default model parameters. Topics number was set to 8.
-Semantic topic coherence (``u_mass``) and perplexity were
-calculated for each model.
+Benchmark results on the *SearchSnippets* dataset.
+Sixteen models were trained with varying iterations (10-2000) using default parameters and 8 topics.
+Metrics calculated: semantic coherence (``u_mass``) and perplexity.

 .. image:: _static/perplexity.svg
    :alt: Perplexity
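The benchmark script itself is not part of this commit. For context, a minimal sketch of such a loop using the ``bitermplus`` API from the tutorial (the iteration counts and priors below are illustrative, not the exact sixteen configurations):

.. code-block:: python

    import bitermplus as btm
    import pandas as pd

    df = pd.read_csv('dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
    texts = df['texts'].str.strip().tolist()

    X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    biterms = btm.get_biterms(docs_vec)

    scores = []
    for iters in (10, 100, 500, 2000):  # illustrative subset of the 10-2000 range
        model = btm.BTM(X, vocabulary, T=8, M=20, alpha=50/8, beta=0.01)
        model.fit(biterms, iterations=iters)
        p_zd = model.transform(docs_vec)
        # u_mass-style semantic coherence (per topic) and corpus perplexity
        coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
        scores.append((iters, perplexity, coherence.mean()))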

docs/source/index.rst

Lines changed: 5 additions & 5 deletions
@@ -1,12 +1,12 @@
 bitermplus
 ==========

-*Bitermplus* implements `Biterm topic model
+**Bitermplus** implements the `Biterm Topic Model (BTM)
 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf>`_
-for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi
-Cheng. Actually, it is a cythonized version of `BTM
-<https://github.com/xiaohuiyan/BTM>`_. This package is also capable of computing
-*perplexity* and *semantic coherence* metrics.
+for short text analysis, developed by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi
+Cheng. This is a high-performance Cython implementation of the original `BTM
+<https://github.com/xiaohuiyan/BTM>`_ with OpenMP parallelization. The package includes
+comprehensive evaluation metrics including *perplexity* and *semantic coherence*.

 .. toctree::
    :maxdepth: 2

docs/source/install.rst

Lines changed: 4 additions & 5 deletions
@@ -1,17 +1,16 @@
-Setup
------
+Installation
+------------

 Linux and Windows
 ~~~~~~~~~~~~~~~~~

-There should be no issues with installing *bitermplus* under these OSes.
-You can install the package directly from PyPi.
+Install *bitermplus* directly from PyPI:

 .. code-block:: bash

     pip install bitermplus

-Or from this repo:
+For the latest development version:

 .. code-block:: bash

docs/source/tutorial.rst

Lines changed: 14 additions & 23 deletions
@@ -4,9 +4,8 @@ Tutorial
 Model fitting
 -------------

-Here is a simple example of model fitting.
-It is supposed that you have already gone through the preprocessing
-stage: cleaned, lemmatized or stemmed your documents, and removed stop words.
+This example demonstrates basic model fitting.
+Prerequisite: your documents should be preprocessed (cleaned, lemmatized/stemmed, and stop words removed).

 .. code-block:: python

@@ -19,10 +18,9 @@ stage: cleaned, lemmatized or stemmed your documents, and removed stop words.
         'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
     texts = df['texts'].str.strip().tolist()

-    # Vectorizing documents, obtaining full vocabulary and biterms
-    # Internally, btm.get_words_freqs uses CountVectorizer from sklearn
-    # You can pass any of its arguments to btm.get_words_freqs
-    # For example, you can remove stop words:
+    # Vectorize documents and extract biterms
+    # Uses sklearn's CountVectorizer internally - accepts its parameters
+    # Example: stop word removal
     stop_words = ["word1", "word2", "word3"]
     X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
     docs_vec = btm.get_vectorized_docs(texts, vocabulary)
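The hunk ends just before the fitting step. The tutorial code that follows (unchanged by this commit) continues roughly as below, based on the package README; the seed value is illustrative:

.. code-block:: python

    # Extract biterms and fit the model
    biterms = btm.get_biterms(docs_vec)
    model = btm.BTM(X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
    model.fit(biterms, iterations=20)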
@@ -37,14 +35,13 @@ stage: cleaned, lemmatized or stemmed your documents, and removed stop words.
 Inference
 ---------

-Now, we will calculate documents vs topics probability matrix (make an inference).
+Calculate document-topic probability matrix (inference):

 .. code-block:: python

     p_zd = model.transform(docs_vec)

-If you need to make an inference on a new dataset, you should
-vectorize it using your vocabulary from the training set:
+For inference on new documents, vectorize using the training vocabulary:

 .. code-block:: python

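The new-dataset code block is truncated by the hunk boundary; the pattern it describes looks like this (``new_texts`` is a hypothetical list of preprocessed documents):

.. code-block:: python

    # Reuse the training vocabulary so word ids stay consistent
    new_docs_vec = btm.get_vectorized_docs(new_texts, vocabulary)
    p_zd_new = model.transform(new_docs_vec)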
@@ -55,8 +52,7 @@ vectorize it using your vocabulary from the training set:
 Calculating metrics
 -------------------

-To calculate perplexity, we must provide documents vs topics probability matrix
-(``p_zd``) that we calculated at the previous step.
+Calculate perplexity using the document-topic probability matrix (``p_zd``) from inference:

 .. code-block:: python

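The metric calls themselves sit outside the hunk. For reference, a sketch of the calls as documented in the package README:

.. code-block:: python

    # Perplexity takes the topics-words matrix, the document-topic
    # matrix p_zd, the term-document matrix X, and the topics number
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
    coherence = btm.coherence(model.matrix_topics_words_, X, M=20)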
@@ -70,8 +66,7 @@ To calculate perplexity, we must provide documents vs topics probability matrix
 Visualizing results
 -------------------

-For results visualization, we will use `tmplot
-<https://pypi.org/project/tmplot/>`_ package.
+Visualize results using the `tmplot <https://pypi.org/project/tmplot/>`_ package:

 .. code-block:: python

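The visualization call is also outside the hunk; ``tmplot`` exposes an interactive report, e.g.:

.. code-block:: python

    import tmplot as tmp

    # Interactive report: topics, top words, and document previews
    tmp.report(model=model, docs=texts)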
@@ -83,12 +78,10 @@ For results visualization, we will use `tmplot
 Filtering stable topics
 -----------------------

-Unsupervised topic models (such as LDA) are subject to topic instability [1]_
-[2]_ [3]_. There is a special method in ``tmplot`` package for selecting stable
-topics. It uses various distance metrics such as Kullback-Leibler divergence
-(symmetric and non-symmetric), Hellinger distance, Jeffrey's divergence,
-Jensen-Shannon divergence, Jaccard index, Bhattacharyya distance, Total
-variation distance.
+Topic models suffer from instability across runs [1]_ [2]_ [3]_.
+The ``tmplot`` package provides methods to identify stable topics using
+distance metrics: Kullback-Leibler divergence, Hellinger distance, Jeffrey's divergence,
+Jensen-Shannon divergence, Jaccard index, Bhattacharyya distance, and Total variation distance.

 .. code-block:: python

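The selection code sits outside the hunk. A hypothetical sketch, assuming several independently fitted models and ``tmplot``'s ``get_closest_topics``/``get_stable_topics`` helpers (argument names and shapes here are indicative only; consult the tmplot docs):

.. code-block:: python

    # Hypothetical: `models` is a list of independently fitted BTM models
    matrices = [m.matrix_topics_words_ for m in models]
    closest, dist = tmp.get_closest_topics(*matrices, method="sklb")  # symmetric KL
    stable, stable_dist = tmp.get_stable_topics(closest, dist=dist)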
@@ -123,9 +116,7 @@ variation distance.
 Model loading and saving
 ------------------------

-Support for model serializing with `pickle
-<https://docs.python.org/3/library/pickle.html>`_ was implemented in v0.5.3.
-Here is how you can save and load a model:
+Models support `pickle <https://docs.python.org/3/library/pickle.html>`_ serialization (since v0.5.3):

 .. code-block:: python

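The code inside that block (unchanged by this commit) is standard pickle usage:

.. code-block:: python

    import pickle

    # Save the fitted model
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)

    # Load it back later
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)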

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -116,7 +116,7 @@ markers = [
 # Code formatting
 [tool.black]
 line-length = 88
-target-version = ["py38", "py39", "py310", "py311", "py312"]
+target-version = ["py39", "py310", "py311", "py312", "py313"]
 include = '\.pyi?$'
 extend-exclude = '''
 /(

@@ -146,7 +146,7 @@ src_paths = ["src", "tests"]

 # Type checking
 [tool.mypy]
-python_version = "3.8"
+python_version = "3.9"
 warn_return_any = true
 warn_unused_configs = true
 disallow_untyped_defs = true
