Releases: MaartenGr/BERTopic
v0.8.1
Highlights:
- Improved models:
  - For English documents the default is now: `"paraphrase-MiniLM-L6-v2"`
  - For non-English or multilingual documents the default is now: `"paraphrase-multilingual-MiniLM-L12-v2"`
  - Both models not only show great performance but are also much faster!
- Add interactive visualizations to the `plotting` API documentation
For even better performance, please use the following models:
- English: `"paraphrase-mpnet-base-v2"`
- Non-English or multilingual: `"paraphrase-multilingual-mpnet-base-v2"`
Fixes:
- Improved unit testing for more stability
- Set `transformers` version for Flair
Major Release v0.8
Mainly a visualization update to improve understanding of the topic model.
Features
- Additional visualizations (combined in the sketch after this list):
  - Topic Hierarchy: `topic_model.visualize_hierarchy()`
  - Topic Similarity Heatmap: `topic_model.visualize_heatmap()`
  - Topic Representation Barchart: `topic_model.visualize_barchart()`
  - Term Score Decline: `topic_model.visualize_term_rank()`
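A minimal sketch of calling these after fitting a model, assuming `docs` is a list of strings:

```python
from bertopic import BERTopic

# fit a model first; `docs` (a list of strings) is assumed to exist
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# each method returns a Plotly figure that can be shown or saved
topic_model.visualize_hierarchy().show()
topic_model.visualize_heatmap().show()
topic_model.visualize_barchart().show()
topic_model.visualize_term_rank().show()
```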
Improvements
- Created the `bertopic.plotting` library to easily extend visualizations
- Improved automatic topic reduction by using HDBSCAN to detect similar topics
- Sort topic ids by their frequency: -1 is the outlier class and typically contains the most documents; after that, 0 is the largest topic, 1 the second largest, etc.
- Update MkDocs with the new visualizations
Major Release v0.7
The two main features are (semi-)supervised topic modeling and several backends to use instead of Flair and SentenceTransformers!
Highlights:
- (semi-)supervised topic modeling by leveraging supervised options in UMAP: `model.fit(docs, y=target_classes)` (see the sketch after this list)
- Backends:
  - Added Spacy, Gensim, USE (TFHub)
  - Use a different backend for document embeddings and word embeddings
  - Create your own backends with `bertopic.backend.BaseEmbedder`
  - Click here for an overview of all new backends
- Calculate and visualize topics per class (see the sketch after this list):
  - Calculate: `topics_per_class = topic_model.topics_per_class(docs, topics, classes)`
  - Visualize: `topic_model.visualize_topics_per_class(topics_per_class)`
- Several tutorials were updated and added.
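A minimal sketch of the (semi-)supervised workflow and the per-class calculations above, assuming `docs` is a list of documents and `classes` a list of labels of the same length:

```python
from bertopic import BERTopic

# assumptions: `docs` is a list of documents and `classes` a list of labels
# of the same length; for semi-supervised modeling, unlabeled documents
# can be given the label -1
topic_model = BERTopic()

# (semi-)supervised topic modeling: labels are passed to UMAP through `y`
topics, probs = topic_model.fit_transform(docs, y=classes)

# calculate and visualize topics per class
topics_per_class = topic_model.topics_per_class(docs, topics, classes)
topic_model.visualize_topics_per_class(topics_per_class).show()
```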
Fixes:
- Fixed issues with the Torch requirement
- Prevent saving the term frequency matrix in the `CTFIDF` class
- Fixed DTM not working when reducing topics (#96)
- Moved visualization dependencies to base BERTopic: `pip install bertopic[visualization]` becomes `pip install bertopic`
- Allow precomputed embeddings in `bertopic.find_topics()` (#79):

```python
model = BERTopic(embedding_model=my_embedding_model)
model.fit(docs, my_precomputed_embeddings)
model.find_topics(search_term)
```
Major Release v0.6
Highlights:
- DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation: `model.topics_over_time(docs, timestamps, global_tuning=True)` (see the sketch after this list)
- DTM: Option to evolve topics based on the t-1 c-TF-IDF representation, which results in evolving topics over time
  - Only uses topics at t-1 and skips evolution if there is a gap: `model.topics_over_time(docs, timestamps, evolution_tuning=True)`
- DTM: Function to visualize topics over time: `model.visualize_topics_over_time(topics_over_time)`
- DTM: Added binning of timestamps: `model.topics_over_time(docs, timestamps, nr_bins=10)`
- Added a function to get general information about topics (id, frequency, name, etc.): `get_topic_info()`
- Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
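A minimal sketch of the DTM workflow, assuming `docs` is a list of documents and `timestamps` a list of dates (or ints) of the same length:

```python
from bertopic import BERTopic

# assumptions: `docs` is a list of documents and `timestamps` a list of
# dates (or ints) of the same length
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# bin the timestamps into 10 periods and apply global tuning
topics_over_time = topic_model.topics_over_time(docs, timestamps,
                                                nr_bins=10, global_tuning=True)

# returns a Plotly figure
topic_model.visualize_topics_over_time(topics_over_time).show()
```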
Major Release v0.5
Features
- Added `Flair` to allow for more (custom) token/document embeddings
- Option to use custom UMAP, HDBSCAN, and CountVectorizer models (see the sketch after this list)
- Added a `low_memory` parameter to reduce memory during computation
- Improved verbosity (shows a progress bar)
- Improved testing
- Use the newest version of `sentence-transformers` as it speeds up encoding significantly
- Return the figure of `visualize_topics()`
- Expose all parameters with a single function: `get_params()`
- Option to disable the saving of `embedding_model`, which should reduce BERTopic's size significantly
- Added an FAQ page
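A minimal sketch of plugging in custom sub-models; `docs` is assumed to exist and the parameter names follow BERTopic's constructor:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# custom sub-models replace BERTopic's defaults
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    low_memory=True,  # reduce memory usage during computation
)
topics, probs = topic_model.fit_transform(docs)  # `docs` is assumed to exist
```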
Fixes
- To simplify the API, the parameters `stop_words` and `n_neighbors` were removed. These can still be used when a custom UMAP or CountVectorizer is used.
- Set `calculate_probabilities` to False as a default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage, so it is better to leave it off by default and only turn it on manually.
Fix embedding parameter
Fixed the parameter `embedding_model` not working properly when `language` had been set. If you are using an older version of BERTopic, please set `language` to False when you want to set `embedding_model`.
Language fix
There was an issue with selecting the correct language model. This is now fixed with this small PyPI update.
Major Release
Highlights:
- Visualize Topics similar to LDAvis
- Added option to reduce topics after training
- Added option to update the topic representation after training
- Added option to search topics using a search term (a sketch of these post-training options follows this list)
- Significantly improved the stability of generating clusters
- Finetune the topic words by selecting the most coherent words with the highest c-TF-IDF values
- More extensive tutorials in the documentation
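A minimal sketch of these post-training options; `docs` (a list of documents) is assumed to exist and the exact signatures may differ slightly across BERTopic versions:

```python
from bertopic import BERTopic

# `docs` (a list of documents) is assumed to exist
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)

# LDAvis-style intertopic distance map (Plotly figure)
topic_model.visualize_topics().show()

# reduce the number of topics after training
new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=30)

# update the topic representation after training, e.g. with a new n-gram range
topic_model.update_topics(docs, new_topics, n_gram_range=(1, 2))

# search topics using a search term
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)
```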
Notable Changes:
- Option to select language instead of sentence-transformers models to minimize the complexity of using BERTopic
- Improved logging (remove duplicates)
- Check if BERTopic is fitted
- Added TF-IDF as an embedder instead of transformer models (see tutorial)
- NumPy for Python 3.6 will be dropped and was therefore removed from the workflow
- Preprocess text before passing it through c-TF-IDF
- Merged `get_topics_freq()` with `get_topic_freq()`
Fixes:
- Fix error handling of topic probabilities
BugFix Topic Reduction
Fixed a bug in the topic reduction method that reduced the number of topics, but not to the `nr_topics` defined in the class. Since this was, to a certain extent, breaking the topic reduction method, a new release was necessary.
Custom Embeddings
Added the option to use custom embeddings, or embeddings that you generated beforehand with whatever package you'd like to use. This allows users to further customize BERTopic to their liking.
NOTE: I cannot guarantee that using your own embeddings will result in better performance. It is likely to swing both ways depending on the embeddings you are using. For example, if you use poorly trained W2V embeddings, then it is likely to result in poor topic generation. Thus, it is up to the user to experiment with the embeddings that best serve their purposes.
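A minimal sketch of passing precomputed embeddings, using sentence-transformers as the example package; `docs` (a list of documents) is assumed to exist:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# generate embeddings beforehand with any package you like;
# sentence-transformers is shown here, and `docs` is assumed to exist
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs)

# pass the precomputed embeddings directly to BERTopic
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```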