Commit 355ecc6

Merge branch 'release-3.6.0'
2 parents 2ee7fac + 35d1b5b commit 355ecc6

67 files changed (+49249, -10005 lines)

CHANGELOG.md (+102 lines)
@@ -1,5 +1,107 @@
Changes
===========
## 3.6.0, 2018-09-20

### :star2: New features
* File-based training for `*2Vec` models (__[@persiyanov](https://github.com/persiyanov)__, [#2127](https://github.com/RaRe-Technologies/gensim/pull/2127) & [#2078](https://github.com/RaRe-Technologies/gensim/pull/2078) & [#2048](https://github.com/RaRe-Technologies/gensim/pull/2048))

New training mode for `*2Vec` models (word2vec, doc2vec, fasttext) that allows model training to scale linearly with the number of cores (full GIL elimination). The result of our Google Summer of Code 2018 project by Dmitry Persiyanov.

**Benchmark**
- Dataset: full English Wikipedia
- Cloud: GCE
- CPU: Intel(R) Xeon(R) CPU @ 2.30GHz 32 cores
- BLAS: libblas3 (3.7.1-3ubuntu2)


| Model | Queue-based version [sec] | File-based version [sec] | speed up | Accuracy (queue-based) | Accuracy (file-based) |
|-------|------------|--------------------|----------|----------------|-----------------------|
| Word2Vec | 9230 | **2437** | **3.79x** | 0.754 (± 0.003) | 0.750 (± 0.001) |
| Doc2Vec | 18264 | **2889** | **6.32x** | 0.721 (± 0.002) | 0.683 (± 0.003) |
| FastText | 16361 | **10625** | **1.54x** | 0.642 (± 0.002) | 0.660 (± 0.001) |

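As a quick sanity check on the benchmark table, the "speed up" column can be reproduced by dividing the queue-based time by the file-based time. The sketch below only reuses the timings listed in the table; the `timings` dictionary and the `speedup` helper are illustrative names and are not part of gensim.

```python
# Timings in seconds, copied from the benchmark table:
# (queue-based version, file-based version).
timings = {
    "Word2Vec": (9230, 2437),
    "Doc2Vec": (18264, 2889),
    "FastText": (16361, 10625),
}


def speedup(queue_sec, file_sec):
    """Return how many times faster the file-based run is."""
    return queue_sec / file_sec


# Round to two decimals, matching the precision used in the table.
speedups = {model: round(speedup(q, f), 2) for model, (q, f) in timings.items()}

for model, ratio in speedups.items():
    print(f"{model}: {ratio:.2f}x")
```

The computed ratios agree with the 3.79x, 6.32x and 1.54x figures reported in the table.
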
Usage:

```python
import gensim.downloader as api
from multiprocessing import cpu_count
from gensim.utils import save_as_line_sentence
from gensim.test.utils import get_tmpfile
from gensim.models import Word2Vec, Doc2Vec, FastText


# Convert any corpus to the needed format: 1 document per line, words delimited by " "
corpus = api.load("text8")
corpus_fname = get_tmpfile("text8-file-sentence.txt")
save_as_line_sentence(corpus, corpus_fname)

# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()

# Train models using all cores
w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)

```
[Read notebook tutorial with full description.](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb)


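The file format expected by `corpus_file` is described in the usage comment as one document per line, with words delimited by a single space. The following is a minimal standard-library sketch of that layout. The helpers `write_line_sentence` and `read_line_sentence` are hypothetical stand-ins written for illustration; they are not gensim's `save_as_line_sentence` implementation, which may handle encoding and edge cases differently.

```python
import os
import tempfile


def write_line_sentence(corpus, path):
    """Write one document per line, tokens separated by a single space.

    Hypothetical helper that only illustrates the on-disk layout.
    """
    with open(path, "w", encoding="utf-8") as fout:
        for tokens in corpus:
            fout.write(" ".join(tokens) + "\n")


def read_line_sentence(path):
    """Read the file back into a list of token lists."""
    with open(path, encoding="utf-8") as fin:
        return [line.split() for line in fin]


# Toy corpus: each inner list is one tokenized document.
docs = [["human", "interface", "computer"], ["graph", "minors", "survey"]]

path = os.path.join(tempfile.mkdtemp(), "toy-corpus.txt")
write_line_sentence(docs, path)
round_trip = read_line_sentence(path)
```

A file written this way has the same shape as the one passed to `corpus_file=` in the usage example; for real corpora, `gensim.utils.save_as_line_sentence` is the function the changelog points to.
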
### :+1: Improvements

* Add scikit-learn wrapper for `FastText` (__[@mcemilg](https://github.com/mcemilg)__, [#2178](https://github.com/RaRe-Technologies/gensim/pull/2178))
* Add multiprocessing support for `BM25` (__[@Shiki-H](https://github.com/Shiki-H)__, [#2146](https://github.com/RaRe-Technologies/gensim/pull/2146))
* Add `name_only` option for downloader api (__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#2143](https://github.com/RaRe-Technologies/gensim/pull/2143))
* Make `word2vec2tensor` script compatible with `python3` (__[@vsocrates](https://github.com/vsocrates)__, [#2147](https://github.com/RaRe-Technologies/gensim/pull/2147))
* Add custom filter for `Wikicorpus` (__[@mattilyra](https://github.com/mattilyra)__, [#2089](https://github.com/RaRe-Technologies/gensim/pull/2089))
* Make `similarity_matrix` support non-contiguous dictionaries (__[@Witiko](https://github.com/Witiko)__, [#2047](https://github.com/RaRe-Technologies/gensim/pull/2047))


### :red_circle: Bug fixes

* Fix memory consumption in `AuthorTopicModel` (__[@philipphager](https://github.com/philipphager)__, [#2122](https://github.com/RaRe-Technologies/gensim/pull/2122))
* Correctly process empty documents in `AuthorTopicModel` (__[@probinso](https://github.com/probinso)__, [#2133](https://github.com/RaRe-Technologies/gensim/pull/2133))
* Fix ZeroDivisionError `keywords` issue with short input (__[@LShostenko](https://github.com/LShostenko)__, [#2154](https://github.com/RaRe-Technologies/gensim/pull/2154))
* Fix `min_count` handling in phrases detection using `npmi_scorer` (__[@lopusz](https://github.com/lopusz)__, [#2072](https://github.com/RaRe-Technologies/gensim/pull/2072))
* Remove duplicate count from `Phraser` log message (__[@robguinness](https://github.com/robguinness)__, [#2151](https://github.com/RaRe-Technologies/gensim/pull/2151))
* Replace `np.integer` -> `np.int` in `AuthorTopicModel` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2145](https://github.com/RaRe-Technologies/gensim/pull/2145))


### :books: Tutorial and doc improvements

* Update docstring with new analogy evaluation method (__[@akutuzov](https://github.com/akutuzov)__, [#2130](https://github.com/RaRe-Technologies/gensim/pull/2130))
* Improve `prune_at` parameter description for `gensim.corpora.Dictionary` (__[@yxonic](https://github.com/yxonic)__, [#2128](https://github.com/RaRe-Technologies/gensim/pull/2128))
* Fix `default` -> `auto` prior parameter in documentation for lda-related models (__[@Laubeee](https://github.com/Laubeee)__, [#2156](https://github.com/RaRe-Technologies/gensim/pull/2156))
* Use heading instead of bold style in `gensim.models.translation_matrix` (__[@nzw0301](https://github.com/nzw0301)__, [#2164](https://github.com/RaRe-Technologies/gensim/pull/2164))
* Fix quote of vocabulary from `gensim.models.Word2Vec` (__[@nzw0301](https://github.com/nzw0301)__, [#2161](https://github.com/RaRe-Technologies/gensim/pull/2161))
* Replace deprecated parameters with new in docstring of `gensim.models.Doc2Vec` (__[@xuhdev](https://github.com/xuhdev)__, [#2165](https://github.com/RaRe-Technologies/gensim/pull/2165))
* Fix formula in Mallet documentation (__[@Laubeee](https://github.com/Laubeee)__, [#2186](https://github.com/RaRe-Technologies/gensim/pull/2186))
* Fix minor semantic issue in docs for `Phrases` (__[@RunHorst](https://github.com/RunHorst)__, [#2148](https://github.com/RaRe-Technologies/gensim/pull/2148))
* Fix typo in documentation (__[@KenjiOhtsuka](https://github.com/KenjiOhtsuka)__, [#2157](https://github.com/RaRe-Technologies/gensim/pull/2157))
* Additional documentation fixes (__[@piskvorky](https://github.com/piskvorky)__, [#2121](https://github.com/RaRe-Technologies/gensim/pull/2121))

### :warning: Deprecations (will be removed in the next major release)

* Remove
  - `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
  - `gensim.examples`
  - `gensim.nosy`
  - `gensim.scripts.word2vec_standalone`
  - `gensim.scripts.make_wiki_lemma`
  - `gensim.scripts.make_wiki_online`
  - `gensim.scripts.make_wiki_online_lemma`
  - `gensim.scripts.make_wiki_online_nodebug`
  - `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation)
  - "deprecated" functions and attributes

* Move
  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
  - `gensim.summarization` ➡ `gensim.models.summarization`
  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`


## 3.5.0, 2018-07-06

This release comprises a glorious 38 pull requests from 28 contributors. Most of the effort went into improving the documentation, hence the release code name "Docs 💬"!
