Sparse Vector #65

Open: a-h wants to merge 3 commits into master

a-h commented Jan 26, 2015

When a bag of words is fed to a KMeans clustering algorithm, memory consumption can be quite large. Using a sparse vector can significantly reduce the amount of RAM required to store the vectors, at the cost of some CPU performance.
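To illustrate the trade-off, here is a minimal sketch of a sparse vector that stores only its non-zero entries in a hash map. It is written in Python purely for illustration and is not the code in this pull request; the class name `SparseVector`, the method `stored_count`, and the vocabulary size are all hypothetical.

```python
class SparseVector:
    """Fixed-length vector that keeps only its non-zero entries in memory."""

    def __init__(self, size):
        self.size = size
        self._values = {}  # maps index -> non-zero value

    def _check(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)

    def __getitem__(self, index):
        self._check(index)
        # Any index that was never stored is implicitly zero.
        return self._values.get(index, 0.0)

    def __setitem__(self, index, value):
        self._check(index)
        if value == 0.0:
            # Writing a zero removes the entry instead of storing it.
            self._values.pop(index, None)
        else:
            self._values[index] = float(value)

    def __len__(self):
        return self.size

    def stored_count(self):
        """Number of entries that actually occupy memory."""
        return len(self._values)


# Hypothetical bag-of-words vector over a 50,000-word vocabulary.
doc = SparseVector(50_000)
doc[17] = 0.42
doc[4093] = 1.3
doc[17] = 0.0  # resetting to zero frees the slot again
```

Each read goes through a hash lookup rather than a direct array index, which is where the extra CPU cost comes from; memory use grows with the number of non-zero entries rather than with the vocabulary size.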

I've created a graph of the RAM consumption of my application, which uses the KMeans algorithm to cluster vectors containing TF-IDF data from 500 books chosen at random from the Gutenberg library.
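For readers unfamiliar with TF-IDF, the sketch below shows why such vectors are sparse. The description above does not say which TF-IDF variant the application uses, so this assumes one common form: term frequency as count divided by document length, and inverse document frequency as the natural log of N divided by the number of documents containing the term. The function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in counts.items()
            # A term found in every document has idf = log(1) = 0, so it is omitted.
            if doc_freq[term] < n
        })
    return vectors


# Two tiny made-up "books".
books = [["whale", "sea", "sea"], ["sea", "garden"]]
vectors = tf_idf(books)
```

Each document only carries weights for the terms it actually contains, so with a vocabulary drawn from hundreds of books most positions in any one document's vector are zero.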

https://docs.google.com/spreadsheets/d/1sRxGfRWOrBFBVkJHILZ6y_IiKkFUDlT0TVake-9WUzE/edit?usp=sharing
[Graph: RAM consumption of the application]

I've created a new type called SparseMLData and updated the KMeans algorithm to support it, since KMeans was coded to work only with BasicMLData. I guess that BasicMLData could instead be updated to support a choice of sparse or array storage, depending on the parameters passed to it.
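As a rough illustration of what supporting sparse data in KMeans involves, the distance step can be computed over the non-zero entries only. This is a Python sketch under stated assumptions, not the change made to Encog: vectors are represented as plain `{index: value}` dicts, Euclidean distance is assumed, and the names `sparse_euclidean` and `nearest_centroid` are hypothetical.

```python
import math


def sparse_euclidean(a, b):
    """Euclidean distance between two sparse vectors given as {index: value} dicts."""
    total = 0.0
    # Only indices that are non-zero in at least one vector can contribute.
    for index in a.keys() | b.keys():
        diff = a.get(index, 0.0) - b.get(index, 0.0)
        total += diff * diff
    return math.sqrt(total)


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector (the KMeans assignment step)."""
    return min(
        range(len(centroids)),
        key=lambda i: sparse_euclidean(vector, centroids[i]),
    )


# Made-up data: one document vector and two candidate centroids.
document = {0: 1.0, 7: 2.0}
centroids = [{0: 1.0, 7: 1.5}, {9: 10.0}]
assigned = nearest_centroid(document, centroids)
```

The loop cost depends on how many entries are non-zero rather than on the full vector length, which is what makes the sparse representation workable inside the clustering loop.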

I've added unit tests for SparseMLData with near-100% code coverage. I use NCrunch, which makes that easier, so the commits include a couple of NCrunch artefacts that will help anyone else who uses the tool get up and running faster. I've also tagged the unit tests that read or write CSV on the file system with "Integration" markers, so that they can be excluded from multi-threaded execution.

Your comments would be welcome.

Jeff - Not really on topic, but I've read two of your books, which are what enabled me to suggest this contribution, so thanks!

a-h added 3 commits January 26, 2015 12:21
The SparseMLData type is much more memory efficient for sparse arrays,
e.g. bag of words.
Also set some more tests with an "Integration" category.  The
integration tests cannot run in parallel because they rely on the file
system.