Commit 109c88e

Merge branch 'release-4.1.0'
2 parents b4f64a9 + 1bb426a commit 109c88e

88 files changed: +8577 -922 lines changed

.github/workflows/tests.yml (+2 -1)

@@ -39,7 +39,8 @@ jobs:
       #
       - name: Update sbt
         run: |
-          echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list
           curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add
           sudo apt-get update -y
           sudo apt-get install -y sbt

CHANGELOG.md (+120)

@@ -1,6 +1,126 @@
 Changes
 =======
 
+## Unreleased
+
+## 4.1.0, 2021-08-15
+
+Gensim 4.1 brings two major new functionalities:
+
+* [Ensemble LDA](https://radimrehurek.com/gensim/auto_examples/tutorials/run_ensemblelda.html) for robust training, selection and comparison of LDA models.
+* [FastSS module](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/similarities/fastss.pyx) for super fast Levenshtein "fuzzy search" queries. Used e.g. for ["soft term similarity"](https://github.com/RaRe-Technologies/gensim/pull/3146) calculations.
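
As an aside for readers unfamiliar with the term: Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version with a brute-force search on top. It is not the FastSS algorithm and not Gensim's API; the names `levenshtein` and `fuzzy_search` are hypothetical and only illustrate what a "fuzzy search" query returns.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_search(query, vocabulary, max_dist=1):
    """Hypothetical brute-force search: all words within max_dist edits of query."""
    return sorted(word for word in vocabulary if levenshtein(query, word) <= max_dist)


print(fuzzy_search('war', ['war', 'ward', 'peace', 'wars', 'bar']))
# ['bar', 'war', 'ward', 'wars']
```

A brute-force scan like this compares the query against every vocabulary entry, which is the cost that an index-based approach such as FastSS is meant to avoid.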
+
+There are several minor changes that are **not** backwards compatible with previous versions of Gensim.
+The affected functionality is relatively rarely used and unlikely to affect most users, so we have opted not to require a major version bump.
+Nevertheless, we describe them below.
+
+### Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods
+
+We now handle both ``positive`` and ``negative`` keyword parameters consistently.
+They may now be either:
+
+1. A string, in which case the value is reinterpreted as a list of one element (the string value)
+2. A vector, in which case the value is reinterpreted as a list of one element (the vector)
+3. A list of strings
+4. A list of vectors
+
+So you can now simply do:
+
+```python
+model.most_similar(positive='war', negative='peace')
+```
+
+instead of the slightly more involved
+
+```python
+model.most_similar(positive=['war'], negative=['peace'])
+```
+
+Both invocations remain correct, so you can use whichever is most convenient.
+If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.
+
+```python
+model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])
+```
+
+then you will need to specify the lists explicitly in gensim 4.1.
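
The four accepted input shapes listed above can be pictured as a small normalization step. This is a minimal sketch under stated assumptions: `ensure_list` is a hypothetical helper written for illustration, a "vector" is taken to be any flat sequence of numbers, and nothing here is claimed to match Gensim's internal code.

```python
from numbers import Number


def ensure_list(value):
    """Hypothetical helper: wrap a single key or a single vector in a list."""
    if value is None:
        return []
    if isinstance(value, str):
        # Case 1: one string becomes a one-element list, not a list of characters.
        return [value]
    if len(value) > 0 and all(isinstance(x, Number) for x in value):
        # Case 2: a flat sequence of numbers is treated as one vector.
        return [value]
    # Cases 3 and 4: already a list of strings or a list of vectors.
    return list(value)


print(ensure_list('war'))             # ['war']
print(ensure_list(['war', 'peace']))  # ['war', 'peace']
print(list('war'))                    # ['w', 'a', 'r'], the old character-wise reading
```

The last line shows why the character-wise interpretation has to be spelled out as an explicit list when that is what the caller wants.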
+### Removed obsolete `steps` parameter from doc2vec
+
+With the newer version, do this:
+
+```python
+model.infer_vector(..., epochs=123)
+```
+
+instead of this:
+
+```python
+model.infer_vector(..., steps=123)
+```
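
Code that still passes the old keyword in many places could bridge the change with a thin wrapper while call sites are migrated. The function below is a hypothetical shim, not part of Gensim; it assumes only that `model.infer_vector` accepts the document words as its first argument plus an `epochs` keyword, as in the example above.

```python
import warnings


def infer_vector_compat(model, doc_words, epochs=None, steps=None, **kwargs):
    """Hypothetical shim: accept the legacy `steps` keyword and forward it as `epochs`."""
    if steps is not None:
        if epochs is not None:
            raise TypeError("pass either `epochs` or the legacy `steps`, not both")
        warnings.warn("`steps` is obsolete; use `epochs`", DeprecationWarning, stacklevel=2)
        epochs = steps
    return model.infer_vector(doc_words, epochs=epochs, **kwargs)
```

Calling `infer_vector_compat(model, words, steps=123)` then behaves like `model.infer_vector(words, epochs=123)` and emits a `DeprecationWarning` that points at the caller.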
+
+Plus a large number of smaller improvements and fixes, as usual.
+
+**⚠️ If migrating from old Gensim 3.x, read the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) first.**
+
+### :+1: New features
+
+* [#3169](https://github.com/RaRe-Technologies/gensim/pull/3169): Implement `shrink_windows` argument for Word2Vec, by [@M-Demay](https://github.com/M-Demay)
+* [#3163](https://github.com/RaRe-Technologies/gensim/pull/3163): Optimize word mover distance (WMD) computation, by [@flowlight0](https://github.com/flowlight0)
+* [#3157](https://github.com/RaRe-Technologies/gensim/pull/3157): New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by [@Witiko](https://github.com/Witiko)
+* [#3153](https://github.com/RaRe-Technologies/gensim/pull/3153): Vectorize word2vec.predict_output_word for speed, by [@M-Demay](https://github.com/M-Demay)
+* [#3146](https://github.com/RaRe-Technologies/gensim/pull/3146): Use FastSS for fast kNN over Levenshtein distance, by [@Witiko](https://github.com/Witiko)
+* [#3128](https://github.com/RaRe-Technologies/gensim/pull/3128): Materialize and copy the corpus passed to SoftCosineSimilarity, by [@Witiko](https://github.com/Witiko)
+* [#3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [@robguinness](https://github.com/robguinness)
+* [#3091](https://github.com/RaRe-Technologies/gensim/pull/3091): LsiModel: Only log top words that actually exist in the dictionary, by [@kmurphy4](https://github.com/kmurphy4)
+* [#2980](https://github.com/RaRe-Technologies/gensim/pull/2980): Added EnsembleLda for stable LDA topics, by [@sezanzeb](https://github.com/sezanzeb)
+* [#2978](https://github.com/RaRe-Technologies/gensim/pull/2978): Optimize performance of Author-Topic model, by [@horpto](https://github.com/horpto)
+* [#3000](https://github.com/RaRe-Technologies/gensim/pull/3000): Tidy up KeyedVectors.most_similar() API, by [@simonwiles](https://github.com/simonwiles)
+
+### :books: Tutorials and docs
+
+* [#3155](https://github.com/RaRe-Technologies/gensim/pull/3155): Correct parameter name in documentation of fasttext.py, by [@bizzyvinci](https://github.com/bizzyvinci)
+* [#3148](https://github.com/RaRe-Technologies/gensim/pull/3148): Fix broken link to mycorpus.txt in documentation, by [@rohit901](https://github.com/rohit901)
+* [#3142](https://github.com/RaRe-Technologies/gensim/pull/3142): Use more permanent pdf link and update code link, by [@dymil](https://github.com/dymil)
+* [#3141](https://github.com/RaRe-Technologies/gensim/pull/3141): Update link for online LDA paper, by [@dymil](https://github.com/dymil)
+* [#3133](https://github.com/RaRe-Technologies/gensim/pull/3133): Update link to Hoffman paper (online VB LDA), by [@jonaschn](https://github.com/jonaschn)
+* [#3129](https://github.com/RaRe-Technologies/gensim/pull/3129): [MRG] Add bronze sponsor: TechTarget, by [@piskvorky](https://github.com/piskvorky)
+* [#3126](https://github.com/RaRe-Technologies/gensim/pull/3126): Fix typos in make_wiki_online.py and make_wikicorpus.py, by [@nicolasassi](https://github.com/nicolasassi)
+* [#3125](https://github.com/RaRe-Technologies/gensim/pull/3125): Improve & unify docs for dirichlet priors, by [@jonaschn](https://github.com/jonaschn)
+* [#3123](https://github.com/RaRe-Technologies/gensim/pull/3123): Fix hyperlink for doc2vec tutorial, by [@AdityaSoni19031997](https://github.com/AdityaSoni19031997)
+* [#3121](https://github.com/RaRe-Technologies/gensim/pull/3121): [MRG] Add bronze sponsor: eaccidents.com, by [@piskvorky](https://github.com/piskvorky)
+* [#3120](https://github.com/RaRe-Technologies/gensim/pull/3120): Fix URL for ldamodel.py, by [@jonaschn](https://github.com/jonaschn)
+* [#3118](https://github.com/RaRe-Technologies/gensim/pull/3118): Fix URL in doc string, by [@jonaschn](https://github.com/jonaschn)
+* [#3107](https://github.com/RaRe-Technologies/gensim/pull/3107): Draw attention to sponsoring in README, by [@piskvorky](https://github.com/piskvorky)
+* [#3105](https://github.com/RaRe-Technologies/gensim/pull/3105): Fix documentation links: Travis to Github Actions, by [@piskvorky](https://github.com/piskvorky)
+* [#3057](https://github.com/RaRe-Technologies/gensim/pull/3057): Clarify doc comment in LdaModel.inference(), by [@yocen](https://github.com/yocen)
+* [#2964](https://github.com/RaRe-Technologies/gensim/pull/2964): Document that preprocessing.strip_punctuation is limited to ASCII, by [@sciatro](https://github.com/sciatro)
+
+
+### :red_circle: Bug fixes
+
+* [#3178](https://github.com/RaRe-Technologies/gensim/pull/3178): Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by [@Witiko](https://github.com/Witiko)
+* [#3174](https://github.com/RaRe-Technologies/gensim/pull/3174): Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by [@emgucv](https://github.com/emgucv)
+* [#3136](https://github.com/RaRe-Technologies/gensim/pull/3136): Fix indexing error in word2vec_inner.pyx, by [@bluekura](https://github.com/bluekura)
+* [#3131](https://github.com/RaRe-Technologies/gensim/pull/3131): Add missing import to NMF docs and models/__init__.py, by [@properGrammar](https://github.com/properGrammar)
+* [#3116](https://github.com/RaRe-Technologies/gensim/pull/3116): Fix bug where saved Phrases model did not load its connector_words, by [@aloknayak29](https://github.com/aloknayak29)
+* [#2830](https://github.com/RaRe-Technologies/gensim/pull/2830): Fixed KeyError in coherence model, by [@pietrotrope](https://github.com/pietrotrope)
+
+
+### :warning: Removed functionality & deprecations
+
+* [#3176](https://github.com/RaRe-Technologies/gensim/pull/3176): Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by [@rock420](https://github.com/rock420)
+* [#2965](https://github.com/RaRe-Technologies/gensim/pull/2965): Remove strip_punctuation2 alias of strip_punctuation, by [@sciatro](https://github.com/sciatro)
+* [#3180](https://github.com/RaRe-Technologies/gensim/pull/3180): Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by [@rock420](https://github.com/rock420)
+
+### 🔮 Testing, CI, housekeeping
+
+* [#3156](https://github.com/RaRe-Technologies/gensim/pull/3156): Update Numpy minimum version to 1.17.0, by [@PrimozGodec](https://github.com/PrimozGodec)
+* [#3143](https://github.com/RaRe-Technologies/gensim/pull/3143): replace _mul function with explicit casts, by [@mpenkov](https://github.com/mpenkov)
+* [#2952](https://github.com/RaRe-Technologies/gensim/pull/2952): Allow newer versions of the Morfessor module for the tests, by [@pabs3](https://github.com/pabs3)
+* [#2965](https://github.com/RaRe-Technologies/gensim/pull/2965): Remove strip_punctuation2 alias of strip_punctuation, by [@sciatro](https://github.com/sciatro)
+
+
+
 ## 4.0.1, 2021-04-01
 
 Bugfix release to address issues with Wheels on Windows:

README.md (+11 -14)

@@ -19,12 +19,8 @@ and *similarity retrieval* with large corpora. Target audience is the
 *natural language processing* (NLP) and *information retrieval* (IR)
 community.
 
-<!--
-## :pizza: Hacktoberfest 2019 :beer:
+## ⚠️ Please [sponsor Gensim](https://github.com/sponsors/piskvorky) to help sustain this open source project ❤️
 
-We are accepting PRs for Hacktoberfest!
-See [here](HACKTOBERFEST.md) for details.
--->
 
 Features
 --------
@@ -57,10 +53,10 @@ scientific computing. You must have them installed prior to installing
 gensim.
 
 It is also recommended you install a fast BLAS library before installing
-NumPy. This is optional, but using an optimized BLAS such as [ATLAS] or
+NumPy. This is optional, but using an optimized BLAS such as MKL, [ATLAS] or
 [OpenBLAS] is known to improve performance by as much as an order of
-magnitude. On OS X, NumPy picks up the BLAS that comes with it
-automatically, so you don’t need to do anything special.
+magnitude. On OSX, NumPy picks up its vecLib BLAS automatically,
+so you don’t need to do anything special.
 
 Install the latest version of gensim:
 
@@ -77,7 +73,8 @@ package:
 
 For alternative modes of installation, see the [documentation].
 
-Gensim is being [continuously tested](https://travis-ci.org/RaRe-Technologies/gensim) under Python 3.6, 3.7 and 3.8.
+Gensim is being [continuously tested](http://radimrehurek.com/gensim/#testing) under all
+[supported Python versions](https://github.com/RaRe-Technologies/gensim/wiki/Gensim-And-Compatibility).
 Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.
 
 How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?
@@ -110,9 +107,12 @@ Documentation
 Support
 -------
 
-Ask open-ended or research questions on the [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim).
+For commercial support, please see [Gensim sponsorship](https://github.com/sponsors/piskvorky).
+
+Ask open-ended questions on the public [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim).
+
+Raise bugs on [Github](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md) but please **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or fail to provide the requested details will be closed without inspection.
 
-Raise bugs on [Github](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md) but **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or fail to follow the issue template will be closed without inspection.
 
 ---------
 
@@ -162,15 +162,12 @@ BibTeX entry:
 
 [citing gensim in academic papers and theses]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C
 
-[Travis CI for automated testing]: https://travis-ci.org/RaRe-Technologies/gensim
 [design goals]: http://radimrehurek.com/gensim/about.html
 [RaRe Technologies]: http://rare-technologies.com/wp-content/uploads/2016/02/rare_image_only.png%20=10x20
 [rare\_tech]: //rare-technologies.com
 [Talentpair]: https://avatars3.githubusercontent.com/u/8418395?v=3&s=100
 [citing gensim in academic papers and theses]: https://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC
 
-
-
 [documentation and Jupyter Notebook tutorials]: https://github.com/RaRe-Technologies/gensim/#documentation
 [Vector Space Model]: http://en.wikipedia.org/wiki/Vector_space_model
 [unsupervised document analysis]: http://en.wikipedia.org/wiki/Latent_semantic_indexing

@@ -0,0 +1,178 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "from gensim.models import EnsembleLda, LdaMulticore\n",
+    "from gensim.models.ensemblelda import rank_masking\n",
+    "from gensim.corpora import OpinosisCorpus\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Enable the ensemble logger to show what it is currently doing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elda_logger = logging.getLogger(EnsembleLda.__module__)\n",
+    "elda_logger.setLevel(logging.INFO)\n",
+    "elda_logger.addHandler(logging.StreamHandler())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def pretty_print_topics():\n",
+    "    # note that the words are stemmed so they appear chopped off\n",
+    "    for t in elda.print_topics(num_words=7):\n",
+    "        print('-', t[1].replace('*',' ').replace('\"','').replace(' +',','), '\\n')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Experiments on the Opinosis Dataset\n",
+    "\n",
+    "Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Since it's so small, the results are rather unstable.\n",
+    "\n",
+    "[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, _Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions [online],_ Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348. Available from: https://kavita-ganesan.com/opinosis/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preparing the corpus\n",
+    "\n",
+    "First, download the Opinosis dataset. On Linux, for example, it can be done like this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!mkdir ~/opinosis\n",
+    "!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip\n",
+    "!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "path = os.path.expanduser('~/opinosis/')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Corpus and id2word mapping can be created using the OpinosisCorpus class provided in the package.\n",
+    "It preprocesses the data using the PorterStemmer and stopwords from the nltk package.\n",
+    "\n",
+    "The constructor argument is the path to the folder into which the zip file was extracted. That folder contains a 'summaries-gold' subfolder."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "opinosis = OpinosisCorpus(path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**parameters**\n",
+    "\n",
+    "**topic_model_class**: 'ldamulticore' is highly recommended for EnsembleLda. **ensemble_workers** and **distance_workers** reduce the time needed to train the models, as does the **masking_method** rank_masking. ldamulticore is not able to fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3 to get 95-100% CPU usage on my i5 3470.\n",
+    "\n",
+    "Since the corpus is so small, a high **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories, however, some of them are quite similar. For example there are 3 categories about the batteries of portable products. There are also multiple categories about cars. So I chose 20 for **num_topics**, which is smaller than the number of categories."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elda = EnsembleLda(\n",
+    "    corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,\n",
+    "    passes=20, iterations=100, ensemble_workers=3, distance_workers=4,\n",
+    "    topic_model_class='ldamulticore', masking_method=rank_masking,\n",
+    ")\n",
+    "pretty_print_topics()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The default for **min_samples** would be 64, half the number of models, and the default for **eps** would be 0.1. Adjust both until you find a sweet spot that fits your needs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elda.recluster(min_samples=55, eps=0.14)\n",
+    "pretty_print_topics()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}