
Conversation

@a-h a-h commented Jan 26, 2015

When feeding a bag-of-words representation to a KMeans clustering algorithm, memory consumption can be quite large. Using a sparse vector can significantly reduce the amount of RAM required to store the vectors, at the cost of some CPU performance.
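The memory saving comes from storing only the non-zero entries. As a minimal sketch (illustrative names only, not Encog's actual API), a sparse vector can be backed by a sorted map from index to value, so a mostly-zero bag-of-words vector costs memory proportional to its non-zero count rather than the vocabulary size:

```java
import java.util.TreeMap;

// Minimal sketch of a sparse vector: only non-zero entries are stored,
// so a mostly-zero bag-of-words vector costs O(non-zeros) memory
// instead of O(vocabulary size). Class and method names are illustrative.
public class SparseVector {
    private final TreeMap<Integer, Double> entries = new TreeMap<>();
    private final int size; // logical length of the vector

    public SparseVector(int size) {
        this.size = size;
    }

    public void set(int index, double value) {
        if (value == 0.0) {
            entries.remove(index); // keep the map free of explicit zeros
        } else {
            entries.put(index, value);
        }
    }

    public double get(int index) {
        // Absent entries are implicitly zero.
        return entries.getOrDefault(index, 0.0);
    }

    public int size() {
        return size;
    }

    public int nonZeroCount() {
        return entries.size();
    }
}
```

The CPU cost mentioned above shows up in the map lookups: `get` is O(log n) in the number of non-zero entries, versus O(1) array indexing in a dense representation.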

I've created a graph of the RAM consumption of my application, which uses the KMeans algorithm to cluster vectors containing TF-IDF data from 500 books chosen at random from the Gutenberg library.

https://docs.google.com/spreadsheets/d/1sRxGfRWOrBFBVkJHILZ6y_IiKkFUDlT0TVake-9WUzE/edit?usp=sharing

I've created a new type called SparseMLData and updated the KMeans algorithm to support it, since it was coded to work only with BasicMLData. BasicMLData could potentially be updated to support a choice of sparse or array-backed storage, depending on parameters passed to it.
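The main computation KMeans needs from the data type is a distance between each vector and a (typically dense) centroid. A hedged sketch of how that can be done without densifying the sparse vector — start from the centroid's squared norm and correct only the terms where the sparse vector is non-zero (illustrative code, not the actual Encog implementation):

```java
import java.util.Map;

// Sketch of the distance KMeans needs, adapted for sparse input:
// squared Euclidean distance between a sparse vector (map of non-zero
// entries) and a dense centroid, in O(non-zeros + centroid length) time.
public class SparseDistance {
    public static double squaredEuclidean(Map<Integer, Double> sparse, double[] centroid) {
        // Start with the centroid's full squared norm (the x = 0 case)...
        double sum = 0.0;
        for (double c : centroid) {
            sum += c * c;
        }
        // ...then correct each term where the sparse vector is non-zero:
        // replace c^2 with (x - c)^2, i.e. add x^2 - 2*x*c.
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            double x = e.getValue();
            double c = centroid[e.getKey()];
            sum += x * x - 2.0 * x * c;
        }
        return sum;
    }
}
```

Since centroids are averages of many documents they are usually dense anyway, so keeping them as plain arrays while only the data points are sparse is a reasonable split.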

I've added unit tests for SparseMLData with near-100% code coverage. I use NCrunch, which makes that easier, hence a couple of NCrunch artefacts that will help anyone else who uses the tool get up and running faster. I've also tagged the unit tests that read/write CSV on the file system with an "Integration" marker, so that they can be excluded from multi-threaded execution.

Your comments would be welcome.

Jeff - Not really on topic, but I've read two of your books which have led me to being able to suggest this contribution, so thanks!

a-h added 3 commits January 26, 2015 12:21
The SparseMLData type is much more memory efficient for sparse arrays,
e.g. bag of words.
Also set some more tests with an "Integration" category.  The
integration tests cannot run in parallel because they rely on the file
system.