GSoC 2018 project ideas

Radim Řehůřek edited this page Jan 20, 2018 · 18 revisions

Intro

A list of ideas for new functionality and projects in Gensim, topic modelling for humans, for Google Summer of Code 2018.

Potential mentors:

First of all, please have a look at Gensim's road-map 2018, which describes our main targets for this year.

You can suggest any project related to NLP and machine learning that you believe would be a successful addition to Gensim, but taking our wishes into account will improve your chances of being accepted.

Below are the directions we would be very happy to see pursued in Gensim.

Documentation

Difficulty: Medium; requires excellent UX skills and native-level English

Background: We already have a large number of models, so we now want to pay more attention to model quality, with documentation and model discovery being the main concerns. If we have a great model that users don't know how (or when) to use, they won't use it! For this reason, we want to significantly improve our documentation.

To do:

  • [already WIP] Consistent docstrings for all methods in Gensim
  • New "beginner tutorial chain"
  • User-guides for all models and use-case pipelines (sphinx-gallery)
  • New slick project website
  • Improved UX: a logical structure for all documentation, intuitive navigation, discovery
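To illustrate what "consistent docstrings" means in practice, here is a sketch of a function documented in the numpy docstring style. The function name and body are hypothetical placeholders; only the docstring format is the point.

```python
def train(corpus, num_topics=100, passes=1):
    """Train a topic model on a streamed corpus.

    Parameters
    ----------
    corpus : iterable of list of (int, float)
        A streamed Gensim-style corpus: each document is a list of
        (token_id, weight) tuples.
    num_topics : int, optional
        Number of latent topics to learn.
    passes : int, optional
        Number of full passes over the corpus.

    Returns
    -------
    dict
        Placeholder summary of the trained model.

    """
    # Hypothetical body -- the docstring above is what matters here.
    return {"num_topics": num_topics, "passes": passes}
```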

Resources:

Performance

SparseTools package

Difficulty: Medium; requires excellent C and optimization skills

Background: A package for working with sparse matrices: optimized with Cython, memory-efficient, and fast. It would improve on or replace scipy's recently deprecated sparsetools package, which is single-threaded, memory-hungry (it makes unnecessary copies), and too slow.

Should also include fast (Cythonized) transformations between the "Gensim streamed corpus" format and various other formats (scipy.sparse, numpy, ...), similar to our existing matutils module. Must support fast operations on sparse matrices, such as multiplying a sparse CSC matrix by a dense matrix, multiplying CSC by a random matrix, etc.

To do: Develop a new package to replace scipy.sparse in Gensim. It should significantly increase the performance of sparse multiplications by using multi-threading.
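As a pure-scipy baseline for the functionality the new package would reimplement in Cython, here is a sketch of the streamed-corpus-to-CSC conversion plus a sparse-times-dense multiplication. The standalone `corpus2csc` below is illustrative (it mirrors the idea of the converter in gensim.matutils, not its exact signature).

```python
import numpy as np
from scipy import sparse

def corpus2csc(corpus, num_terms):
    """Convert a streamed Gensim-style corpus (iterable of lists of
    (term_id, weight) tuples) into a scipy.sparse CSC matrix with one
    column per document."""
    data, indices, indptr = [], [], [0]
    for doc in corpus:
        for term_id, weight in doc:
            indices.append(term_id)
            data.append(weight)
        indptr.append(len(indices))
    return sparse.csc_matrix(
        (data, indices, indptr), shape=(num_terms, len(indptr) - 1)
    )

# Example: 3 documents over a 4-term vocabulary.
corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)], [(0, 1.0), (3, 4.0)]]
X = corpus2csc(corpus, num_terms=4)     # 4 x 3 sparse matrix
Y = X.T @ np.random.rand(4, 2)          # sparse times dense -> 3 x 2 dense
```

The project's goal is to make exactly this kind of multiplication multi-threaded and copy-free.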

Resources:

Multiple-stream API

Difficulty: Medium/Hard

Background: On powerful machines (>10 cores) we lose the roughly linear scalability of performance. The reason is an I/O bottleneck: the corpus is read too slowly, so increasing the number of workers yields no performance boost (a good example is described here). To fix this, the corpus should be read by multiple threads, and this must be supported by all Gensim models (word2vec, LDA, LSI, etc.). This is an engineering-heavy task requiring serious programming skill.

To do:

  • Analyze the bottleneck: how and why it occurs
  • Develop a multi-threaded corpus reader
  • Integrate it with all models

Resources:

Models

Online NNMF

Difficulty: Hard

Background: Non-negative matrix factorization is an algorithm similar to Latent Semantic Analysis/Latent Dirichlet Allocation. It belongs to the family of matrix factorization methods and can be phrased as an online learning algorithm.

To do: Based on the existing online parallel implementation in libmf, implement NNMF in Python/Cython in Gensim and evaluate it. Must support multiple cores on the same machine.

The implementation must accept data in streamed format (a sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching into low-level (ideally BLAS) routines as possible.
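For reference, the factorization being asked for can be stated in a few lines of NumPy. The sketch below is plain batch NMF via multiplicative updates (Lee & Seung), useful as a correctness baseline only; it is not the streamed, parallel algorithm the project targets.

```python
import numpy as np

def nmf(V, num_topics, iterations=200, eps=1e-9):
    """Factorize a non-negative matrix V (docs x terms) as V ~= W @ H
    with W, H >= 0, using multiplicative updates. Batch baseline, not
    the online/parallel version this project asks for."""
    rng = np.random.default_rng(0)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, num_topics))
    H = rng.random((num_topics, n_terms))
    for _ in range(iterations):
        # Multiplicative updates preserve non-negativity.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((20, 30)))
W, H = nmf(V, num_topics=5)
err = np.linalg.norm(V - W @ H)  # reconstruction error
```

The online version would instead consume V one mini-batch of documents at a time, updating H incrementally and solving for each batch's rows of W on the fly.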

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgments included in Gensim, or proceed in some other way.

Resources:

Neural networks (similarity learning, state-of-the-art language models)

Difficulty: Medium

Background: Gensim is, in general, about similarity: so far we have worked with vector representations of words, sentences, documents, etc. But our real target is a well-defined similarity function. In fact, the vector representation of a document matters less to us than the distances between documents. This is a similarity learning problem, and we think neural networks can help.

To do: Develop a universal neural network for this task (the full process, from collecting data to publishing benchmark results). The idea: pass two documents (as lists of tokens) into the network and receive a single number as output, the distance between the documents. Must support streaming and multicore, with a Gensim-like API.
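The interface being described (two token lists in, one distance out) can be sketched without any deep-learning framework. The version below is purely illustrative: token "embeddings" are deterministic hash-seeded random vectors, and the documents are compared by cosine distance of averaged vectors. A trained siamese network would replace both the embeddings and the averaging with learned encoders; everything here is an assumption about the API shape, not the proposed model.

```python
import hashlib
import numpy as np

DIM = 64

def embed(token, dim=DIM):
    """Deterministic pseudo-random vector for a token -- a stand-in
    for learned embeddings, used only to make the sketch runnable."""
    seed = int(hashlib.md5(token.encode("utf8")).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def distance(doc1, doc2):
    """Take two documents as lists of tokens, return one number:
    the cosine distance between their averaged token vectors."""
    v1 = np.mean([embed(t) for t in doc1], axis=0)
    v2 = np.mean([embed(t) for t in doc2], axis=0)
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return 1.0 - cos

d_same = distance(["topic", "model"], ["topic", "model"])
d_diff = distance(["topic", "model"], ["sparse", "matrix"])
```

Identical documents get distance ~0; unrelated ones get a larger distance. The project would learn the encoder end-to-end so that the output correlates with human similarity judgments.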

Resources:

If you'd like to work on any of the topics above, or have your own ideas, get in touch at student-projects@rare-technologies.com.