Model fitting
-------------

This example demonstrates basic model fitting.
Prerequisite: your documents should be preprocessed (cleaned, lemmatized or
stemmed, and stop words removed).

.. code-block:: python

    import bitermplus as btm
    import pandas as pd

    # Importing data
    df = pd.read_csv(
        'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
    texts = df['texts'].str.strip().tolist()

    # Vectorizing documents, obtaining the vocabulary and biterms.
    # Internally, btm.get_words_freqs uses sklearn's CountVectorizer,
    # so you can pass any of its parameters, e.g. for stop word removal:
    stop_words = ["word1", "word2", "word3"]
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
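    # Generating biterms (word pairs) from the vectorized documents
    biterms = btm.get_biterms(docs_vec)

    # Initializing and fitting the model. This is a reconstructed sketch:
    # the hyperparameter values (T, M, alpha, beta, iterations) are
    # illustrative, not prescriptive
    model = btm.BTM(
        X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
    model.fit(biterms, iterations=20)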

Inference
---------

Calculate the document-topic probability matrix (make an inference):

.. code-block:: python

    p_zd = model.transform(docs_vec)

For inference on new documents, vectorize them using the vocabulary from the training set:

.. code-block:: python
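
    # A minimal sketch: `new_texts` is assumed to hold the new,
    # already preprocessed documents
    new_docs_vec = btm.get_vectorized_docs(new_texts, vocabulary)
    p_zd_new = model.transform(new_docs_vec)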

Calculating metrics
-------------------

Calculate perplexity using the document-topic probability matrix (``p_zd``)
obtained at the inference step:

.. code-block:: python
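
    # A sketch of the metrics calls; the last argument of btm.perplexity
    # (here 8) must match the number of topics the model was fitted with
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
    coherence = btm.coherence(model.matrix_topics_words_, X, M=20)

    # Alternatively, use the model attributes
    perplexity = model.perplexity_
    coherence = model.coherence_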

Visualizing results
-------------------

Visualize results using the `tmplot <https://pypi.org/project/tmplot/>`_ package:

.. code-block:: python
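
    # A minimal sketch: tmplot builds an interactive report
    # from a fitted model and the source documents
    import tmplot as tmp
    tmp.report(model=model, docs=texts)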

Filtering stable topics
-----------------------

Topic models suffer from instability across runs [1]_ [2]_ [3]_.
The ``tmplot`` package provides a method for selecting stable topics using
distance metrics such as Kullback-Leibler divergence (symmetric and
non-symmetric), Hellinger distance, Jeffrey's divergence, Jensen-Shannon
divergence, Jaccard index, Bhattacharyya distance, and total variation distance.

.. code-block:: python
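
    # A reconstructed sketch of the stable-topics workflow; the file paths,
    # reference-model choice, and threshold are illustrative. Check the
    # tmplot documentation for exact signatures.
    import glob
    import pickle
    import numpy as np
    import tmplot as tmp

    # Load several models trained on the same data with different seeds
    models = []
    for fn in sorted(glob.glob('results/model[0-9].pkl')):
        with open(fn, 'rb') as file:
            models.append(pickle.load(file))

    # Pick a reference model to match topics against
    ref = np.random.randint(len(models))

    # Find the closest topics across models, then keep only the stable ones
    closest_topics, dist = tmp.get_closest_topics(models, ref=ref)
    stable_topics, stable_dist = tmp.get_stable_topics(
        closest_topics, dist, ref=ref, thres=0.7)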

Model loading and saving
------------------------

Models support `pickle <https://docs.python.org/3/library/pickle.html>`_
serialization (since v0.5.3):

.. code-block:: python
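
    # A minimal sketch: a standard pickle round-trip (file name is illustrative)
    import pickle

    # Saving the fitted model
    with open('model.pkl', 'wb') as file:
        pickle.dump(model, file)

    # Loading the model back
    with open('model.pkl', 'rb') as file:
        model = pickle.load(file)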