Application redesign aspects: Reuse preprocessed data + Split phases of Training #22

Training a model currently involves three steps:
1. Preprocessing of the training dataset
2. Training of the model itself
3. Generation of the TMmodel object used for topic model editing

Current problems are:
1. We should copy the stopwords and equivalences to the TMfolder, because the lists can be further edited ...
2. It is not possible to reuse an already preprocessed dataset. This is inefficient, particularly if we want to train several models that differ only in their LDA settings.
3. sparkLDA fails with large datasets. The model is trained, but sparsification requires moving the whole theta matrix to the driver node, which can run out of memory ... This could be solved if we split phases 2 and 3 above into two separate commands .... Again, the question of whether using sparkLDA is advisable should be answered before taking any decision on this.
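
As a starting point for discussion, here is a minimal sketch of how problems 2 and 3 might be approached by letting each phase hand its output to the next one through disk. It is not the current implementation: the function names (`fingerprint`, `preprocess`, `sparsify_row`, `sparsify_stream`), the cache file layout, and the `0.01` threshold are all hypothetical, and it assumes that sparsification means dropping small entries of each theta row and renormalizing what is left.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(docs, stopwords, equivalences):
    """Hash the dataset and word lists so a cached corpus is tied to its inputs."""
    payload = json.dumps(
        {"docs": docs, "stopwords": sorted(stopwords), "equivalences": equivalences},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def preprocess(docs, stopwords, equivalences, cache_dir):
    """Phase 1: return (cleaned_docs, reused), loading a stored result if present."""
    cache_dir = Path(cache_dir)
    key = fingerprint(docs, stopwords, equivalences)
    cache_file = cache_dir / f"corpus_{key}.json"
    if cache_file.exists():
        # Same dataset and same lists: reuse instead of preprocessing again.
        return json.loads(cache_file.read_text(encoding="utf-8")), True
    cleaned = [
        [equivalences.get(tok, tok) for tok in doc.lower().split() if tok not in stopwords]
        for doc in docs
    ]
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(cleaned), encoding="utf-8")
    return cleaned, False


def sparsify_row(row, threshold=0.01):
    """Keep entries >= threshold and renormalize; returns {topic_index: weight}."""
    kept = {k: w for k, w in enumerate(row) if w >= threshold}
    total = sum(kept.values())
    return {k: w / total for k, w in kept.items()} if total else {}


def sparsify_stream(rows, threshold=0.01):
    """Phase 3 helper: process theta one row at a time, never the full matrix."""
    for row in rows:
        yield sparsify_row(row, threshold)
```

Because `sparsify_stream` only ever holds one row, the same per-row function could in principle be applied to the stored output of phase 2 in a separate command, rather than after collecting the whole theta matrix in one place. Whether that fits the sparkLDA code path is an open question.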
