Commit df13670

Fix merge conflict

2 parents 67b1a17 + d692db4

23 files changed: +7794 −163 lines

.travis.yml (+1 −3)

@@ -2,10 +2,7 @@ sudo: false
 dist: trusty
 language: python
 python:
-  - "2.6"
   - "2.7"
-  - "3.3"
-  - "3.4"
   - "3.5"
   - "3.6"
 before_install:
@@ -21,5 +18,6 @@ install:
   - pip install annoy
   - pip install testfixtures
   - pip install unittest2
+  - pip install Morfessor==2.0.2a4
   - python setup.py install
 script: python setup.py test

CHANGELOG.md (+7)

@@ -3,6 +3,13 @@ Changes
 
 Unreleased:
 
+1.0.0RC2, 2017-02-16
+
+* Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja, [#1137](https://github.com/RaRe-Technologies/gensim/pull/1137))
+* Remove direct access to properties moved to KeyedVectors (@tmylk, [#1147](https://github.com/RaRe-Technologies/gensim/pull/1147))
+* Remove support for Python 2.6, 3.3 and 3.4 (@tmylk, [#1145](https://github.com/RaRe-Technologies/gensim/pull/1145))
+* Write UTF-8 byte strings in tensorboard conversion (@tmylk, [#1144](https://github.com/RaRe-Technologies/gensim/pull/1144))
+* Make top_topics and sparse2full compatible with numpy 1.12 strict int indexing (@tmylk, [#1146](https://github.com/RaRe-Technologies/gensim/pull/1146))
 
 1.0.0RC1, 2017-01-31
appveyor.yml (+12 −5)

@@ -14,21 +14,28 @@ environment:
 
 matrix:
   - PYTHON: "C:\\Python27"
-    PYTHON_VERSION: "2.7.8"
+    PYTHON_VERSION: "2.7.12"
     PYTHON_ARCH: "32"
 
   - PYTHON: "C:\\Python27-x64"
-    PYTHON_VERSION: "2.7.8"
+    PYTHON_VERSION: "2.7.12"
     PYTHON_ARCH: "64"
 
   - PYTHON: "C:\\Python35"
-    PYTHON_VERSION: "3.5.0"
+    PYTHON_VERSION: "3.5.2"
     PYTHON_ARCH: "32"
 
   - PYTHON: "C:\\Python35-x64"
-    PYTHON_VERSION: "3.5.0"
+    PYTHON_VERSION: "3.5.2"
     PYTHON_ARCH: "64"
+
+  - PYTHON: "C:\\Python36"
+    PYTHON_VERSION: "3.6.0"
+    PYTHON_ARCH: "32"
 
+  - PYTHON: "C:\\Python36-x64"
+    PYTHON_VERSION: "3.6.0"
+    PYTHON_ARCH: "64"
 
 
 install:
@@ -59,7 +66,7 @@ test_script:
 # installed library.
 - "mkdir empty_folder"
 - "cd empty_folder"
-- "pip install pyemd testfixtures unittest2"
+- "pip install pyemd testfixtures unittest2 Morfessor==2.0.2a4"
 
 - "python -c \"import nose; nose.main()\" -s -v gensim"
 # Move back to the project folder

docs/notebooks/Varembed.ipynb (+163, new file)

@@ -0,0 +1,163 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# VarEmbed Tutorial\n",
    "\n",
    "VarEmbed is a word embedding model that incorporates morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, VarEmbed combines morphological and distributional information in a unified probabilistic framework and thus yields improvements on intrinsic word similarity evaluations. See the original paper, [arXiv:1608.01056](https://arxiv.org/abs/1608.01056), accepted at [EMNLP 2016](http://www.emnlp2016.net/accepted-papers.html).\n",
    "\n",
    "VarEmbed is now integrated into [Gensim](http://radimrehurek.com/gensim/), so already trained VarEmbed models can be loaded into gensim, with additional functionality on top of the word vectors already present in gensim.\n",
    "\n",
    "# This Tutorial\n",
    "\n",
    "In this tutorial you will learn how to train, load and evaluate a VarEmbed model on your data.\n",
    "\n",
    "# Train Model\n",
    "\n",
    "The authors provide the code to train a VarEmbed model in the repository [MorphologicalPriorsForWordEmbeddings](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings); you'll need to use that code if you want to train a model.\n",
    "\n",
    "# Load VarEmbed Model\n",
    "\n",
    "Once you have a trained VarEmbed model, you can load the VarEmbed word vectors directly into Gensim.<br>\n",
    "For that, provide the path to the word vectors pickle file generated after you train the model and run the script to [package varembed embeddings](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings/blob/master/package_embeddings.py) provided in the [varembed source code repository](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings).\n",
    "\n",
    "We'll use a VarEmbed model trained on the [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor) as the vocabulary, which is already available in gensim.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from gensim.models.wrappers import varembed\n",
    "\n",
    "vector_file = '../../gensim/test/test_data/varembed_leecorpus_vectors.pkl'\n",
    "model = varembed.VarEmbed.load_varembed_format(vectors=vector_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This loads a VarEmbed model into Gensim. If you also want morphemes added into the VarEmbed vectors, provide the path to the trained Morfessor model binary as an additional argument. This parameter is optional; when it is not provided, only the VarEmbed vectors are loaded, without morphemes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "morfessor_file = '../../gensim/test/test_data/varembed_leecorpus_morfessor.bin'\n",
    "model_with_morphemes = varembed.VarEmbed.load_varembed_format(vectors=vector_file, morfessor_model=morfessor_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once loaded, the model supports any of the KeyedVectors functionality already provided in gensim, such as `most_similar` and `similarity`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u'launch', 0.2694973647594452),\n",
       " (u'again', 0.2564533054828644),\n",
       " (u'gun', 0.2521245777606964),\n",
       " (u'response', 0.24817466735839844),\n",
       " (u'swimming', 0.23348823189735413),\n",
       " (u'bombings', 0.23146548867225647),\n",
       " (u'transformed', 0.2289058119058609),\n",
       " (u'used', 0.2224646955728531),\n",
       " (u'weeks,', 0.21905183792114258),\n",
       " (u'scheduled', 0.2170265018939972)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar('government')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.022313305789051038"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.similarity('peace', 'grim')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "In this tutorial, we learned how to load already trained VarEmbed vectors into gensim and how to use and evaluate them. That's it!\n",
    "\n",
    "# Resources\n",
    "\n",
    "* [Varembed Source Code](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings)\n",
    "* [Gensim](http://radimrehurek.com/gensim/)\n",
    "* [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor)\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

docs/notebooks/annoytutorial.ipynb (+3 −1)

@@ -177,7 +177,9 @@
 "\n",
 "**This speedup factor is by no means constant** and will vary greatly from run to run and is particular to this data set, BLAS setup, Annoy parameters (as tree size increases, the speedup factor decreases), machine specifications, among other factors.\n",
 "\n",
-">**Note**: Initialization time for the annoy indexer was not included in the times. The optimal knn algorithm for you to use will depend on how many queries you need to make and the size of the corpus. If you are making very few similarity queries, the time taken to initialize the annoy indexer will be longer than the time it would take the brute force method to retrieve results. If you are making many queries however, the time it takes to initialize the annoy indexer will be made up for by the incredibly fast retrieval times for queries once the indexer has been initialized"
+">**Note**: Initialization time for the annoy indexer was not included in the times. The optimal knn algorithm for you to use will depend on how many queries you need to make and the size of the corpus. If you are making very few similarity queries, the time taken to initialize the annoy indexer will be longer than the time it would take the brute force method to retrieve results. If you are making many queries however, the time it takes to initialize the annoy indexer will be made up for by the incredibly fast retrieval times for queries once the indexer has been initialized.\n",
+"\n",
+">**Note**: Gensim's 'most_similar' method uses numpy operations in the form of a dot product, whereas Annoy's method does not. If numpy on your machine is using one of the BLAS libraries like ATLAS or LAPACK, it will run on multiple cores (but only if your machine has multicore support). Check the [SciPy Cookbook](http://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html) for more details."
 ]
 },
 {

docs/notebooks/doc2vec-IMDB.ipynb (+7 −1)

@@ -13,7 +13,13 @@
 "source": [
 "TODO: section on introduction & motivation\n",
 "\n",
-"TODO: prerequisites + dependencies (statsmodels, patsy, ?)"
+"TODO: prerequisites + dependencies (statsmodels, patsy, ?)\n",
+"\n",
+"### Requirements\n",
+"The following dependencies are needed for this tutorial:\n",
+" - testfixtures\n",
+" - statsmodels\n",
+" "
 ]
 },
 {

docs/src/apiref.rst (+1)

@@ -45,6 +45,7 @@ Modules:
     models/wrappers/dtmmodel
     models/wrappers/ldavowpalwabbit.rst
     models/wrappers/wordrank
+    models/wrappers/varembed
     similarities/docsim
     similarities/index
     topic_coherence/aggregation

docs/src/conf.py (+1 −1)

@@ -54,7 +54,7 @@
 # The short X.Y version.
 version = '1.0'
 # The full version, including alpha/beta/rc tags.
-release = '1.0.0rc1'
+release = '1.0.0rc2'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

docs/src/models/wrappers/varembed.rst (+9, new file)

@@ -0,0 +1,9 @@
+:mod:`models.wrappers.varembed` -- VarEmbed Word Embeddings
+================================================================================================
+
+.. automodule:: gensim.models.wrappers.varembed
+    :synopsis: VarEmbed Word Embeddings
+    :members:
+    :inherited-members:
+    :undoc-members:
+    :show-inheritance:

gensim/matutils.py (+3)

@@ -206,6 +206,9 @@ def sparse2full(doc, length):
 
     """
     result = np.zeros(length, dtype=np.float32)  # fill with zeroes (default value)
+    # convert indices to int as numpy 1.12 no longer indexes by floats
+    doc = ((int(id_), float(val_)) for (id_, val_) in doc)
+
     doc = dict(doc)
     # overwrite some of the zeroes with explicit values
     result[list(doc)] = list(itervalues(doc))

gensim/models/ldamodel.py (+1 −1)

@@ -862,7 +862,7 @@ def top_topics(self, corpus, num_words=20):
         for m in top_words[1:]:
             # m_docs is v_m^(t)
             m_docs = doc_word_list[m]
-            m_index = np.where(top_words == m)[0]
+            m_index = np.where(top_words == m)[0][0]
 
             # Sum of top words l=1..m
             # i.e., all words ranked higher than the current word m
