
Conversation

@a-h a-h commented Jan 26, 2015

When feeding a bag-of-words representation to a KMeans clustering algorithm, memory consumption can be quite large. Using a sparse vector can significantly reduce the amount of RAM required to store the vectors, at the cost of some CPU performance.
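The memory saving comes from storing only the non-zero entries. As a minimal sketch (illustrative names only, not Encog's actual API), a sparse vector can be backed by a sorted map from index to value, so a mostly-zero bag-of-words vector costs memory proportional to its non-zero count rather than the vocabulary size:

```java
import java.util.TreeMap;

// Minimal sketch of a sparse vector: only non-zero entries are stored,
// so a mostly-zero bag-of-words vector costs O(non-zeros) memory
// instead of O(vocabulary size). Class and method names are illustrative.
public class SparseVector {
    private final TreeMap<Integer, Double> entries = new TreeMap<>();
    private final int size; // logical length of the vector

    public SparseVector(int size) {
        this.size = size;
    }

    public void set(int index, double value) {
        if (value == 0.0) {
            entries.remove(index); // keep the map free of explicit zeros
        } else {
            entries.put(index, value);
        }
    }

    public double get(int index) {
        // Absent entries are implicitly zero.
        return entries.getOrDefault(index, 0.0);
    }

    public int size() {
        return size;
    }

    public int nonZeroCount() {
        return entries.size();
    }
}
```

The CPU cost mentioned above shows up in the map lookups: `get` is O(log n) in the number of non-zero entries, versus O(1) array indexing in a dense representation.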

I've created a graph of the RAM consumption of my application, which uses the KMeans algorithm to cluster vectors containing TF-IDF data from 500 books chosen at random from the Gutenberg library.

https://docs.google.com/spreadsheets/d/1sRxGfRWOrBFBVkJHILZ6y_IiKkFUDlT0TVake-9WUzE/edit?usp=sharing

I've created a new type called SparseMLData and updated the KMeans algorithm to support it, since it was coded to work only with BasicMLData. BasicMLData could potentially be updated to support a choice of sparse or array-backed storage, depending on parameters passed to it.
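The main computation KMeans needs from the data type is a distance between each vector and a (typically dense) centroid. A hedged sketch of how that can be done without densifying the sparse vector — start from the centroid's squared norm and correct only the terms where the sparse vector is non-zero (illustrative code, not the actual Encog implementation):

```java
import java.util.Map;

// Sketch of the distance KMeans needs, adapted for sparse input:
// squared Euclidean distance between a sparse vector (map of non-zero
// entries) and a dense centroid, in O(non-zeros + centroid length) time.
public class SparseDistance {
    public static double squaredEuclidean(Map<Integer, Double> sparse, double[] centroid) {
        // Start with the centroid's full squared norm (the x = 0 case)...
        double sum = 0.0;
        for (double c : centroid) {
            sum += c * c;
        }
        // ...then correct each term where the sparse vector is non-zero:
        // replace c^2 with (x - c)^2, i.e. add x^2 - 2*x*c.
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            double x = e.getValue();
            double c = centroid[e.getKey()];
            sum += x * x - 2.0 * x * c;
        }
        return sum;
    }
}
```

Since centroids are averages of many documents they are usually dense anyway, so keeping them as plain arrays while only the data points are sparse is a reasonable split.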

I've added unit tests for SparseMLData with near-100% code coverage. I use NCrunch, which makes that easier, hence a couple of NCrunch artefacts that will help anyone else who uses the tool get up and running faster. I've also tagged the unit tests that read/write CSV on the file system with an "Integration" marker, so that they can be excluded from multi-threaded execution.

Your comments would be welcome.

Jeff - Not really on topic, but I've read two of your books which have led me to being able to suggest this contribution, so thanks!

a-h added 3 commits January 26, 2015 12:21
The SparseMLData type is much more memory efficient for sparse arrays,
e.g. bag of words.
Also set some more tests with an "Integration" category.  The
integration tests cannot run in parallel because they rely on the file
system.