
Commit 8cb208e

docs: rewrite all docstrings with better examples and explanations
- Rewrote all class and method docstrings with detailed explanations
- Added comprehensive examples and parameter guidance
- Documented new epsilon parameter across all APIs
- Added autodoc and viewcode extensions to Sphinx configuration
- Updated sklearn_api.rst with epsilon documentation
- Improved utility function docs with practical examples
1 parent: faf47df

5 files changed: +388, -119 lines


docs/source/conf.py

Lines changed: 2 additions & 0 deletions
@@ -27,8 +27,10 @@
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
 extensions = [
+    "sphinx.ext.autodoc",
     "sphinx.ext.autosummary",
     "sphinx.ext.napoleon",
+    "sphinx.ext.viewcode",
 ]
 
 # Add any paths that contain templates here, relative to this directory.
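
For reference, the extensions list in docs/source/conf.py after this change, with editorial comments (not part of the commit) noting what each extension contributes:

extensions = [
    "sphinx.ext.autodoc",      # pull API documentation from the rewritten docstrings
    "sphinx.ext.autosummary",  # generate summary tables for documented objects
    "sphinx.ext.napoleon",     # parse NumPy-style sections (Parameters, Examples, ...)
    "sphinx.ext.viewcode",     # add [source] links from the docs pages to the code
]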

docs/source/sklearn_api.rst

Lines changed: 6 additions & 1 deletion
@@ -113,6 +113,9 @@ Parameters
 **vectorizer_params** : dict, default=None
     Parameters for the internal CountVectorizer.
 
+**epsilon** : float, default=1e-10
+    Small numerical constant to prevent division by zero and improve numerical stability.
+
 Topic Analysis Methods
 ~~~~~~~~~~~~~~~~~~~~~~
 
@@ -282,6 +285,7 @@ Parameter Selection
 - **alpha**: Higher values (1.0+) create more evenly distributed topics
 - **beta**: Keep small (0.01-0.1) for focused topics
 - **max_iter**: 100-200 usually sufficient for convergence
+- **epsilon**: Default (1e-10) works well; increase for extreme numerical stability, decrease for higher precision
 
 Performance Optimization
 ~~~~~~~~~~~~~~~~~~~~~~~~
@@ -344,7 +348,8 @@ Converting from Original API
     coherence_window=20,
     alpha=50/8,
     beta=0.01,
-    max_iter=600
+    max_iter=600,
+    epsilon=1e-10  # Numerical stability parameter
 )
 p_zd = model.fit_transform(texts)
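
To make the epsilon guidance above concrete, here is a minimal sketch of a migrated call that raises epsilon from its default; the corpus is a placeholder and the parameter values follow the Parameter Selection notes:

import bitermplus as btm

# Placeholder corpus; any iterable of short documents works.
texts = [
    "numerical stability in probability models",
    "epsilon guards against division by zero",
]

model = btm.BTMClassifier(
    n_topics=2,
    alpha=50 / 2,      # 50/n_topics, the documented recommendation
    beta=0.01,         # small beta keeps topics focused
    max_iter=200,      # 100-200 is usually sufficient per the docs
    epsilon=1e-8,      # raised from the 1e-10 default for extra numerical headroom
    random_state=42,
)
p_zd = model.fit_transform(texts)  # documents-by-topics probability matrix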

src/bitermplus/_api.py

Lines changed: 86 additions & 21 deletions
@@ -14,57 +14,122 @@
 
 
 class BTMClassifier(BaseEstimator, TransformerMixin):
-    """Sklearn-style Biterm Topic Model classifier.
+    """Sklearn-compatible Biterm Topic Model for short text analysis.
 
     This class provides a scikit-learn compatible interface for the Biterm Topic Model,
-    making it easy to integrate into existing ML pipelines and use familiar methods
-    like fit() and transform().
+    designed specifically for short text analysis such as tweets, reviews, and messages.
+    Unlike traditional topic models like LDA, BTM extracts biterms (word pairs) from
+    the entire corpus to overcome data sparsity issues in short texts.
+
+    The BTMClassifier automatically handles text preprocessing, vectorization, biterm
+    generation, model training, and inference, making topic modeling as simple as
+    calling fit() and transform().
 
     Parameters
     ----------
     n_topics : int, default=8
-        Number of topics to extract.
+        Number of topics to extract from the corpus.
     alpha : float, default=None
-        Dirichlet prior parameter for topic distribution.
-        If None, uses 50/n_topics as recommended.
+        Dirichlet prior parameter for topic distribution. Controls topic sparsity
+        in documents. Higher values create more uniform topic distributions.
+        If None, uses 50/n_topics as recommended in the original paper.
     beta : float, default=0.01
-        Dirichlet prior parameter for word distribution.
+        Dirichlet prior parameter for word distribution within topics. Controls
+        topic-word sparsity. Lower values create more focused topics.
     max_iter : int, default=600
-        Maximum number of iterations for model training.
+        Maximum number of Gibbs sampling iterations for model training.
+        More iterations generally improve convergence but increase training time.
     random_state : int, default=None
-        Random seed for reproducible results.
+        Random seed for reproducible results. Set to an integer for consistent
+        results across runs.
     window_size : int, default=15
-        Window size for biterm generation.
+        Window size for biterm generation. Biterms are extracted from word pairs
+        within this window distance in each document.
     has_background : bool, default=False
-        Whether to use background topic for frequent words.
+        Whether to use a background topic to model highly frequent words that
+        appear across many topics (e.g., stop words).
     coherence_window : int, default=20
-        Number of top words for coherence calculation.
+        Number of top words used for coherence calculation. This affects the
+        semantic coherence metric computation.
     vectorizer_params : dict, default=None
-        Parameters to pass to CountVectorizer for preprocessing.
+        Additional parameters to pass to the internal CountVectorizer for text
+        preprocessing. Common options include min_df, max_df, stop_words, etc.
     epsilon : float, default=1e-10
-        Small constant to prevent numerical issues (division by zero, etc.).
+        Small numerical constant to prevent division by zero and improve
+        numerical stability in probability calculations.
 
     Attributes
     ----------
     model_ : BTM
-        The fitted BTM model instance.
-    vocabulary_ : np.ndarray
-        Vocabulary learned from training data.
-    feature_names_out_ : np.ndarray
+        The fitted BTM model instance containing learned parameters.
+    vocabulary_ : numpy.ndarray
+        Vocabulary learned from training data (words corresponding to features).
+    feature_names_out_ : numpy.ndarray
         Alias for vocabulary_ for sklearn compatibility.
     n_features_in_ : int
-        Number of features (vocabulary size).
+        Number of features (vocabulary size) after preprocessing.
     vectorizer_ : CountVectorizer
-        The fitted vectorizer used for preprocessing.
+        The fitted vectorizer used for text preprocessing.
+
+    Methods
+    -------
+    fit(X, y=None)
+        Fit the BTM model to documents.
+    transform(X, infer_type='sum_b')
+        Transform documents to topic probability distributions.
+    fit_transform(X, y=None, infer_type='sum_b')
+        Fit model and transform documents in one step.
+    get_topic_words(topic_id=None, n_words=10)
+        Get top words for topics.
+    get_document_topics(X, threshold=0.1)
+        Get dominant topics for documents.
+    score(X, y=None)
+        Return mean coherence score across topics.
 
     Examples
     --------
+    Basic usage:
+
     >>> import bitermplus as btm
-    >>> texts = ["machine learning is great", "I love natural language processing"]
+    >>> texts = [
+    ...     "machine learning algorithms are powerful",
+    ...     "deep learning neural networks process data",
+    ...     "natural language processing understands text"
+    ... ]
     >>> model = btm.BTMClassifier(n_topics=2, random_state=42)
     >>> model.fit(texts)
+    BTMClassifier(n_topics=2, random_state=42)
     >>> doc_topics = model.transform(texts)
+    >>> print(f"Shape: {doc_topics.shape}")
+    Shape: (3, 2)
+
+    Getting topic words:
+
     >>> topic_words = model.get_topic_words(n_words=5)
+    >>> for topic_id, words in topic_words.items():
+    ...     print(f"Topic {topic_id}: {', '.join(words)}")
+
+    Using with sklearn pipelines:
+
+    >>> from sklearn.pipeline import Pipeline
+    >>> from sklearn.preprocessing import FunctionTransformer
+    >>> pipeline = Pipeline([
+    ...     ('preprocess', FunctionTransformer(lambda x: [s.lower() for s in x])),
+    ...     ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
+    ... ])
+    >>> topics = pipeline.fit_transform(texts)
+
+    References
+    ----------
+    Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for
+    short texts. In Proceedings of the 22nd international conference on World
+    Wide Web (pp. 1445-1456).
+
+    See Also
+    --------
+    BTM : Low-level BTM implementation
+    get_words_freqs : Extract word frequencies from documents
+    get_biterms : Generate biterms from vectorized documents
     """
 
     def __init__(
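
A short usage sketch exercising the methods documented in the new Methods section; signatures are taken from the docstring above, and the threshold value is illustrative:

import bitermplus as btm

texts = [
    "machine learning algorithms are powerful",
    "deep learning neural networks process data",
    "natural language processing understands text",
]

model = btm.BTMClassifier(n_topics=2, random_state=42)
p_zd = model.fit_transform(texts)  # shape (n_documents, n_topics)

top_words = model.get_topic_words(n_words=5)                # top words per topic
dominant = model.get_document_topics(texts, threshold=0.1)  # topics above threshold per document
coherence = model.score(texts)                              # mean semantic coherence across topics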
