Open
Labels
- Design Aspects: Imply rethinking of the structure of the application
- High Priority: Issues that need to be prioritized for next release
Description
Right now, training a model involves three steps:
- Preprocessing of the training dataset
- Training of the model itself
- Generation of the TMmodel object for topic model editing
Current problems are:
- The stopword and equivalence lists are not copied to the TMfolder. They should be, because the lists can be further edited ...
- It is not possible to reuse an already preprocessed dataset. This is inefficient, particularly when we want to train several models that differ only in their LDA settings.
- sparkLDA fails with large datasets. The model is trained, but sparsification requires moving the whole theta matrix to the driver node, which can run out of memory ... This could be solved if we split phases 2 and 3 above into two separate commands .... Again, the question of whether sparkLDA is convenient should be answered before taking any decisions on this.
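To make the discussion concrete, here is a minimal sketch of how the three phases could be exposed as separate commands. It is not the project's code: the command names (`preprocess`, `train`, `tmmodel`), the `--ntopics` option, the file names, and the `source:target` format for equivalences are all assumptions made for illustration. The trainer and the TMmodel builder are placeholders. The point is only the layout: a preprocessed corpus is cached under a hash of its inputs so it can be reused, and the word lists are copied into each model folder.

```python
import argparse
import hashlib
import json
import shutil
from pathlib import Path


def fingerprint(*paths: Path) -> str:
    """Hash file contents so a cached corpus can be matched to its exact inputs."""
    digest = hashlib.sha256()
    for path in paths:
        digest.update(path.read_bytes())  # a real tool would hash large files in chunks
    return digest.hexdigest()[:16]


def preprocess(dataset: Path, stopwords: Path, equivalences: Path, cache: Path) -> Path:
    """Phase 1: clean the corpus once; later runs with the same inputs reuse it."""
    out = cache / fingerprint(dataset, stopwords, equivalences)
    if (out / "corpus.txt").exists():
        return out  # already preprocessed with these inputs
    out.mkdir(parents=True, exist_ok=True)
    stop = set(stopwords.read_text().split())
    # Assumed format: one "source:target" pair per line.
    equiv = dict(line.split(":", 1)
                 for line in equivalences.read_text().splitlines() if ":" in line)
    cleaned = []
    for doc in dataset.read_text().splitlines():
        tokens = [equiv.get(tok, tok) for tok in doc.lower().split()]
        cleaned.append(" ".join(tok for tok in tokens if tok not in stop))
    # Keep the lists next to the corpus they produced.
    shutil.copy(stopwords, out / "stopwords.txt")
    shutil.copy(equivalences, out / "equivalences.txt")
    (out / "corpus.txt").write_text("\n".join(cleaned))
    return out


def train(corpus_dir: Path, tm_folder: Path, settings: dict) -> Path:
    """Phase 2 (placeholder trainer): record settings and snapshot the word lists."""
    tm_folder.mkdir(parents=True, exist_ok=True)
    for name in ("stopwords.txt", "equivalences.txt"):
        shutil.copy(corpus_dir / name, tm_folder / name)
    config = {"corpus": str(corpus_dir), **settings}
    (tm_folder / "train_config.json").write_text(json.dumps(config))
    return tm_folder


def build_tmmodel(tm_folder: Path) -> Path:
    """Phase 3 (placeholder): would build the TMmodel from the trainer's saved output."""
    marker = tm_folder / "tmmodel_built.txt"
    marker.write_text("built from " + str(tm_folder / "train_config.json"))
    return marker


def main(argv=None) -> Path:
    parser = argparse.ArgumentParser(prog="tmtool")  # hypothetical tool name
    sub = parser.add_subparsers(dest="command", required=True)

    pre = sub.add_parser("preprocess")
    for arg in ("dataset", "stopwords", "equivalences", "cache"):
        pre.add_argument(arg, type=Path)

    tr = sub.add_parser("train")
    tr.add_argument("corpus_dir", type=Path)
    tr.add_argument("tm_folder", type=Path)
    tr.add_argument("--ntopics", type=int, default=25)

    tm = sub.add_parser("tmmodel")
    tm.add_argument("tm_folder", type=Path)

    args = parser.parse_args(argv)
    if args.command == "preprocess":
        return preprocess(args.dataset, args.stopwords, args.equivalences, args.cache)
    if args.command == "train":
        return train(args.corpus_dir, args.tm_folder, {"ntopics": args.ntopics})
    return build_tmmodel(args.tm_folder)
```

With this layout, `main(["preprocess", ...])` runs once, several `main(["train", ...])` calls with different settings share the same cached corpus, and `main(["tmmodel", ...])` can be run later and separately, for example once the sparsification memory question has been settled.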