Application redesign aspects: Reuse preprocessed data + Split phases of Training #22

@jeroarenas

Right now, training a model involves three steps:

  1. Preprocessing of training dataset
  2. Training of the model itself
  3. Generation of the TMmodel object for topic model editing

Current problems are:

  1. We should copy the stopword and equivalence lists to the TMfolder, because the lists can be edited further ...
  2. It is not possible to reuse an already preprocessed dataset. This is inefficient, particularly when we want to train several models with different LDA settings.
  3. sparkLDA fails with large datasets. The model is trained, but sparsification requires moving the whole theta matrix to the driver node, which can run out of memory ... This could be solved by splitting phases 2 and 3 above into two separate commands .... Again, the question of whether sparkLDA is convenient should be answered before taking any decisions on this.
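
One possible way to address problems 1 and 2 together is sketched below: the preprocessed corpus is cached inside the TMfolder under a name derived from the word lists, and the lists themselves are stored next to it. This is only an illustration, not the application's code. All names (`settings_fingerprint`, `preprocess`, `load_or_preprocess`, the file names) are hypothetical, the tokenization is a toy stand-in for the real preprocessing, and it assumes one dataset per TMfolder, so the cache key covers the word lists only.

```python
import hashlib
import json
from pathlib import Path


def settings_fingerprint(stopwords, equivalences):
    """Return a stable short hash of the word lists that drive preprocessing."""
    payload = json.dumps(
        {
            "stopwords": sorted(stopwords),
            "equivalences": sorted(equivalences.items()),
        },
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def preprocess(docs, stopwords, equivalences):
    """Toy stand-in for step 1: lowercase, map equivalences, drop stopwords."""
    corpus = []
    for doc in docs:
        tokens = [equivalences.get(tok, tok) for tok in doc.lower().split()]
        corpus.append([tok for tok in tokens if tok not in stopwords])
    return corpus


def load_or_preprocess(docs, stopwords, equivalences, tm_folder):
    """Return (corpus, reused). Preprocess only if no cached corpus matches."""
    tm_folder = Path(tm_folder)
    tm_folder.mkdir(parents=True, exist_ok=True)
    key = settings_fingerprint(stopwords, equivalences)
    corpus_file = tm_folder / f"corpus_{key}.json"

    if corpus_file.exists():
        # Same word lists as a previous run: skip step 1 entirely.
        return json.loads(corpus_file.read_text(encoding="utf-8")), True

    corpus = preprocess(docs, stopwords, equivalences)
    # Keep a copy of the lists in the TMfolder, so later edits to the
    # originals do not change what this corpus was built with.
    lists_file = tm_folder / f"wordlists_{key}.json"
    lists_file.write_text(
        json.dumps(
            {"stopwords": sorted(stopwords), "equivalences": equivalences},
            ensure_ascii=False,
        ),
        encoding="utf-8",
    )
    corpus_file.write_text(json.dumps(corpus, ensure_ascii=False), encoding="utf-8")
    return corpus, False
```

With something along these lines, training several models with different LDA settings would call `load_or_preprocess` each time but pay for preprocessing only once, and editing a stopword or equivalence list would produce a new cache key instead of silently reusing stale data.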

Metadata

Labels

Design Aspects: Imply rethinking of the structure of the application
High Priority: Issues that need to be prioritized for next release
