GSoC 2018 project ideas
A list of ideas for Google Summer of Code 2018 of new functionality and projects in Gensim, topic modelling for humans.
Potential mentors:
First of all, please have a look at the Gensim roadmap 2018, which outlines our main targets for this year.
You may suggest any NLP-related project that you believe would be a valuable addition to Gensim, but please take our priorities into account.
Below you will find the directions we would be especially happy to see in Gensim.
Difficulty: Medium
Background: Gensim already contains a large number of models, so we now want to focus on quality, and documentation is the key part of that: even a great model will go unused if it is poorly documented. For this reason, we want to significantly improve our documentation.
To do:
- [WIP] Docstrings for everything in Gensim
- New "beginner tutorial chain" (persistent on the site and in the repository)
- User guides for all functionality (sphinx-gallery)
- New documentation website
- New structure of documentation
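To illustrate the docstring target, here is a hypothetical example of a function documented in the numpy docstring style (the function itself is made up for illustration, not part of Gensim's API):

```python
def tokenize(text, lowercase=True):
    """Split a document into tokens.

    Hypothetical example showing the numpy docstring style targeted
    for all Gensim functions.

    Parameters
    ----------
    text : str
        Input document.
    lowercase : bool, optional
        If True, lowercase all tokens before returning them.

    Returns
    -------
    list of str
        Tokens found in `text`.

    Examples
    --------
    >>> tokenize("Hello World")
    ['hello', 'world']

    """
    tokens = text.split()
    return [token.lower() for token in tokens] if lowercase else tokens
```

Docstrings in this format are picked up by Sphinx (with the napoleon or numpydoc extension) and rendered into the API reference automatically, which is why the style matters for the new documentation website.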
Resources:
- Numpy docstring style
- Numpy docstring style from sphinx
- Sphinx-Gallery
- Gensim documentation project
- Daniele Procida: "How documentation works, and how to make it work for your project", video, PyCon 2017
- Daniele Procida: "What nobody tells you about documentation", blogpost
Difficulty: Medium
Background: A package for working with sparse matrices, built on top of SciPy, optimized with Cython, memory-efficient and fast. It would improve on and replace SciPy's recently deprecated sparsetools package.
It should also include faster (Cythonized) conversions between the gensim streamed corpus format and various other formats (scipy.sparse, NumPy, ...), similar to matutils, and must support fast multiplication (and other operations) on sparse matrices.
To do: Develop a new package (or an alternative approach) to replace SciPy's sparse routines in Gensim. It should significantly improve the performance of sparse matrix multiplication.
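For context, a pure-Python sketch of the kind of conversion and multiplication that the new package would speed up with Cython (the helper name `corpus_to_csc` is hypothetical; Gensim's own matutils.corpus2csc plays a similar role):

```python
from scipy import sparse

def corpus_to_csc(corpus, num_terms):
    """Convert a gensim-style streamed corpus (an iterable of documents,
    each a list of (term_id, weight) pairs) into a scipy.sparse CSC matrix.

    Hypothetical reference implementation; the GSoC project would provide
    a much faster Cython version of conversions like this.
    """
    data, rows, cols = [], [], []
    num_docs = 0
    for doc_no, doc in enumerate(corpus):
        num_docs = doc_no + 1
        for term_id, weight in doc:
            rows.append(term_id)
            cols.append(doc_no)
            data.append(weight)
    return sparse.csc_matrix((data, (rows, cols)), shape=(num_terms, num_docs))

# Two toy documents over a 3-term vocabulary.
corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0), (2, 1.0)]]
mat = corpus_to_csc(corpus, num_terms=3)
gram = mat.T @ mat  # sparse-sparse multiplication: document similarity matrix
```

The slow part in practice is exactly the Python-level loop over documents and the sparse-sparse products, which is why the proposal asks for Cythonized replacements.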
Resources:
Difficulty: Hard
Background: Non-negative matrix factorization is an algorithm similar to Latent Semantic Analysis/Latent Dirichlet Allocation. It falls into matrix factorization methods and can be phrased as an online learning algorithm.
To do: Based on the existing online, parallel implementation in libmf, implement NNMF in Python/Cython in Gensim and evaluate it. Must support multiple cores on the same machine.
The implementation must accept data in streamed format (a sequence of document vectors). It may use NumPy/SciPy as building blocks, pushing as much number crunching as possible into low-level (ideally BLAS) routines.
We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.
Evaluation can use the Lee corpus of human similarity judgments included in gensim or evaluate in some other way.
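As a rough illustration of the streamed setting (not the libmf algorithm itself), here is a toy single-pass-per-document NMF sketch with NumPy: the term-topic matrix W is kept in memory, while each document's topic vector h is solved on the fly with multiplicative updates, so only one document is held at a time. All names and update rules here are illustrative assumptions:

```python
import numpy as np

def online_nmf(doc_stream, num_terms, num_topics, passes=5, lr=0.05, seed=0):
    """Toy streamed NMF sketch (hypothetical, for illustration only).

    `doc_stream` is a re-iterable of documents, each a list of
    (term_id, weight) pairs, as in gensim's streamed corpus format.
    Returns a non-negative W of shape (num_terms, num_topics).
    """
    rng = np.random.default_rng(seed)
    W = np.abs(rng.standard_normal((num_terms, num_topics))) * 0.1
    for _ in range(passes):
        for doc in doc_stream:
            # Densify just this one document.
            v = np.zeros(num_terms)
            for term_id, weight in doc:
                v[term_id] = weight
            # Inner solve for the document's topic vector h
            # via multiplicative updates (keeps h non-negative).
            h = np.abs(rng.standard_normal(num_topics)) * 0.1
            for _ in range(20):
                h *= (W.T @ v) / (W.T @ (W @ h) + 1e-9)
            # SGD-style update of W, projected back onto the
            # non-negative orthant.
            W += lr * np.outer(v - W @ h, h)
            np.maximum(W, 0, out=W)
    return W

corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0), (2, 1.0)]]
W = online_nmf(corpus, num_terms=3, num_topics=2)
```

A production version would replace the Python loops with Cython/BLAS and add proper convergence checks, but the structure shows why the algorithm parallelizes well: documents are processed independently given the current W.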
Resources:
Difficulty: Medium
Background: Gensim is, at heart, about similarity. So far we have worked with vector representations of words, sentences, texts, etc., but our real target is a well-defined similarity function. In fact, representing a document matters less to us than measuring the distance between documents. This is a similarity learning problem, and we think neural networks can help here.
To do: Develop a universal neural network for this task, covering the full process from collecting data to publishing benchmark results. The idea: feed two documents (as lists of tokens) into the network and receive a single number as output, the distance between the documents. Must support streaming and multiple cores, with a gensim-like API.
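One common shape for such a model is a siamese network: both documents pass through the same encoder, and the distance is computed between the two encodings. A minimal, untrained NumPy sketch of the forward pass (all weights and names here are illustrative assumptions, not a proposed architecture):

```python
import numpy as np

def siamese_distance(doc_a, doc_b, embeddings, W):
    """Forward pass of a toy siamese similarity model (hypothetical sketch).

    Each document (a list of token ids) is embedded, averaged, projected
    through a shared weight matrix W, and compared by Euclidean distance.
    A real model would learn `embeddings` and `W` from labeled pairs.
    """
    def encode(doc):
        vec = embeddings[doc].mean(axis=0)  # average the token embeddings
        return np.tanh(W @ vec)             # shared projection layer
    return float(np.linalg.norm(encode(doc_a) - encode(doc_b)))

rng = np.random.default_rng(42)
embeddings = rng.standard_normal((100, 16))  # toy 100-token vocabulary
W = rng.standard_normal((8, 16)) * 0.1       # shared encoder weights
d_same = siamese_distance([1, 2, 3], [1, 2, 3], embeddings, W)
d_diff = siamese_distance([1, 2, 3], [7, 8, 9], embeddings, W)
```

Because the two documents share one encoder, identical inputs always yield distance zero, and training only has to shape a single encoder rather than a pairwise function, which is what makes the streamed, gensim-like API feasible.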
Resources:
If you'd like to work on any of the topics below, or have your own ideas, get in touch at student-projects@rare-technologies.com.