Releases: MaartenGr/BERTopic
v0.8.1
Highlights:
- Improved models:
  - For English documents the default is now: `"paraphrase-MiniLM-L6-v2"`
  - For non-English or multilingual documents the default is now: `"paraphrase-multilingual-MiniLM-L12-v2"`
  - Both models not only show great performance but are also much faster!
- Add interactive visualizations to the `plotting` API documentation
For even better performance, please use the following models:
- English: `"paraphrase-mpnet-base-v2"`
- Non-English or multilingual: `"paraphrase-multilingual-mpnet-base-v2"`
Fixes:
- Improved unit testing for more stability
- Set `transformers` version for Flair
Major Release v0.8
Mainly a visualization update to improve understanding of the topic model.
Features
- Additional visualizations (combined in the sketch after this list):
  - Topic Hierarchy: `topic_model.visualize_hierarchy()`
  - Topic Similarity Heatmap: `topic_model.visualize_heatmap()`
  - Topic Representation Barchart: `topic_model.visualize_barchart()`
  - Term Score Decline: `topic_model.visualize_term_rank()`
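A minimal sketch of calling these after fitting a model, assuming `docs` is a list of strings:

```python
from bertopic import BERTopic

# fit a model first; `docs` (a list of strings) is assumed to exist
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# each method returns a Plotly figure that can be shown or saved
topic_model.visualize_hierarchy().show()
topic_model.visualize_heatmap().show()
topic_model.visualize_barchart().show()
topic_model.visualize_term_rank().show()
```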
Improvements
- Created the `bertopic.plotting` library to easily extend visualizations
- Improved automatic topic reduction by using HDBSCAN to detect similar topics
- Sort topic ids by their frequency: -1 is the outlier class and typically contains the most documents; after that, 0 is the largest topic, 1 the second largest, etc.
- Update MkDocs with the new visualizations
Major Release v0.7
The two main features are (semi-)supervised topic modeling and several backends to use instead of Flair and SentenceTransformers!
Highlights:
- (semi-)supervised topic modeling by leveraging supervised options in UMAP: `model.fit(docs, y=target_classes)` (see the sketch after this list)
- Backends:
  - Added Spacy, Gensim, USE (TFHub)
  - Use a different backend for document embeddings and word embeddings
  - Create your own backends with `bertopic.backend.BaseEmbedder`
  - Click here for an overview of all new backends
- Calculate and visualize topics per class (see the sketch after this list):
  - Calculate: `topics_per_class = topic_model.topics_per_class(docs, topics, classes)`
  - Visualize: `topic_model.visualize_topics_per_class(topics_per_class)`
- Several tutorials were updated and added.
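A minimal sketch of the (semi-)supervised workflow and the per-class calculations above, assuming `docs` is a list of documents and `classes` a list of labels of the same length:

```python
from bertopic import BERTopic

# assumptions: `docs` is a list of documents and `classes` a list of labels
# of the same length; for semi-supervised modeling, unlabeled documents
# can be given the label -1
topic_model = BERTopic()

# (semi-)supervised topic modeling: labels are passed to UMAP through `y`
topics, probs = topic_model.fit_transform(docs, y=classes)

# calculate and visualize topics per class
topics_per_class = topic_model.topics_per_class(docs, topics, classes)
topic_model.visualize_topics_per_class(topics_per_class).show()
```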
Fixes:
- Fixed issues with the Torch requirement
- Prevent saving the term frequency matrix in the `CTFIDF` class
- Fixed DTM not working when reducing topics (#96)
- Moved visualization dependencies to base BERTopic: `pip install bertopic[visualization]` becomes `pip install bertopic`
- Allow precomputed embeddings in `bertopic.find_topics()` (#79):

```python
model = BERTopic(embedding_model=my_embedding_model)
model.fit(docs, my_precomputed_embeddings)
model.find_topics(search_term)
```
Major Release v0.6
Highlights:
- DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation: `model.topics_over_time(docs, timestamps, global_tuning=True)` (see the sketch after this list)
- DTM: Option to evolve topics based on the t-1 c-TF-IDF representation, which results in evolving topics over time
  - Only uses topics at t-1 and skips evolution if there is a gap: `model.topics_over_time(docs, timestamps, evolution_tuning=True)`
- DTM: Function to visualize topics over time: `model.visualize_topics_over_time(topics_over_time)`
- DTM: Added binning of timestamps: `model.topics_over_time(docs, timestamps, nr_bins=10)`
- Added a function to get general information about topics (id, frequency, name, etc.): `get_topic_info()`
- Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
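A minimal sketch of the DTM workflow, assuming `docs` is a list of documents and `timestamps` a list of dates (or ints) of the same length:

```python
from bertopic import BERTopic

# assumptions: `docs` is a list of documents and `timestamps` a list of
# dates (or ints) of the same length
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# bin the timestamps into 10 periods and apply global tuning
topics_over_time = topic_model.topics_over_time(docs, timestamps,
                                                nr_bins=10, global_tuning=True)

# returns a Plotly figure
topic_model.visualize_topics_over_time(topics_over_time).show()
```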
Major Release v0.5
Features
- Added `Flair` to allow for more (custom) token/document embeddings
- Option to use custom UMAP, HDBSCAN, and CountVectorizer models (see the sketch after this list)
- Added a `low_memory` parameter to reduce memory during computation
- Improved verbosity (shows a progress bar)
- Improved testing
- Use the newest version of `sentence-transformers` as it speeds up encoding significantly
- Return the figure of `visualize_topics()`
- Expose all parameters with a single function: `get_params()`
- Option to disable the saving of `embedding_model`, which should reduce BERTopic's size significantly
- Added an FAQ page
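A minimal sketch of plugging in custom sub-models; `docs` is assumed to exist and the parameter names follow BERTopic's constructor:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# custom sub-models replace BERTopic's defaults
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    low_memory=True,  # reduce memory usage during computation
)
topics, probs = topic_model.fit_transform(docs)  # `docs` is assumed to exist
```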
Fixes
- To simplify the API, the parameters `stop_words` and `n_neighbors` were removed. These can still be used when a custom UMAP or CountVectorizer is used.
- Set `calculate_probabilities` to False as a default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage, so it is better to leave it off by default and only turn it on manually.
Fix embedding parameter
Fixed the parameter `embedding_model` not working properly when `language` had been set. If you are using an older version of BERTopic, please set `language` to False when you want to set `embedding_model`.
Language fix
There was an issue with selecting the correct language model. This is now fixed with this small PyPI update.
Major Release
Highlights:
- Visualize Topics similar to LDAvis
- Added option to reduce topics after training
- Added option to update the topic representation after training
- Added option to search topics using a search term (a sketch of these post-training options follows this list)
- Significantly improved the stability of generating clusters
- Finetune the topic words by selecting the most coherent words with the highest c-TF-IDF values
- More extensive tutorials in the documentation
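A minimal sketch of these post-training options; `docs` (a list of documents) is assumed to exist and the exact signatures may differ slightly across BERTopic versions:

```python
from bertopic import BERTopic

# `docs` (a list of documents) is assumed to exist
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)

# LDAvis-style intertopic distance map (Plotly figure)
topic_model.visualize_topics().show()

# reduce the number of topics after training
new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=30)

# update the topic representation after training, e.g. with a new n-gram range
topic_model.update_topics(docs, new_topics, n_gram_range=(1, 2))

# search topics using a search term
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)
```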
Notable Changes:
- Option to select language instead of sentence-transformers models to minimize the complexity of using BERTopic
- Improved logging (remove duplicates)
- Check if BERTopic is fitted
- Added TF-IDF as an embedder instead of transformer models (see tutorial)
- NumPy for Python 3.6 will be dropped and was therefore removed from the workflow
- Preprocess text before passing it through c-TF-IDF
- Merged `get_topics_freq()` with `get_topic_freq()`
Fixes:
- Fix error handling of topic probabilities
BugFix Topic Reduction
Fixed a bug in the topic reduction method that reduced the number of topics, but not to the `nr_topics` defined in the class. Since this was, to a certain extent, breaking the topic reduction method, a new release was necessary.
Custom Embeddings
Added the option to use custom embeddings, or embeddings that you generated beforehand with whatever package you'd like to use. This allows users to further customize BERTopic to their liking.
NOTE: I cannot guarantee that using your own embeddings will result in better performance. It is likely to swing both ways depending on the embeddings you are using. For example, if you use poorly trained W2V embeddings, then it is likely to result in poor topic generation. Thus, it is up to the user to experiment with the embeddings that best serve their purposes.
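A minimal sketch of passing precomputed embeddings, using sentence-transformers as the example package; `docs` (a list of documents) is assumed to exist:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# generate embeddings beforehand with any package you like;
# sentence-transformers is shown here, and `docs` is assumed to exist
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs)

# pass the precomputed embeddings directly to BERTopic
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```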