When feeding a bag-of-words representation to a KMeans clustering algorithm, memory consumption can be quite large. Using a sparse vector significantly reduces the RAM required to store the vectors, at the cost of some CPU performance.
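To put rough numbers on the difference (these figures are illustrative assumptions, not measurements from the graph linked below):

```csharp
using System;

// Illustrative arithmetic only - hypothetical vocabulary and document counts.
// Dense bag-of-words: one 8-byte double per vocabulary word, per document.
const int vocabularySize = 50000;
const int documents = 500;
long denseBytes = (long)vocabularySize * documents * sizeof(double);   // ~200 MB

// Sparse: only the nonzero TF-IDF entries are stored (plus per-entry overhead).
const int avgNonZeroTermsPerBook = 5000;   // hypothetical
long sparseEntries = (long)avgNonZeroTermsPerBook * documents;         // 2.5 M entries

Console.WriteLine($"Dense: ~{denseBytes / (1024 * 1024)} MB; sparse entries: {sparseEntries:N0}");
```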
I've created a graph of the RAM consumption of my application, which uses the KMeans algorithm to cluster vectors of TF-IDF data from 500 books chosen at random from the Gutenberg library:
https://docs.google.com/spreadsheets/d/1sRxGfRWOrBFBVkJHILZ6y_IiKkFUDlT0TVake-9WUzE/edit?usp=sharing
I've created a new type called SparseMLData and updated the KMeans algorithm to support it, since it was coded to work only with BasicMLData. BasicMLData itself could perhaps be updated to offer a choice of sparse or array-backed storage, depending on parameters passed to it.
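For context, here is a minimal sketch of the storage idea (illustrative only; the class name and members are invented for the example and are not the actual SparseMLData implementation in this PR): only the nonzero entries are kept in a dictionary, and any other index reads as 0.0.

```csharp
using System.Collections.Generic;

// Illustrative sketch only - not the actual SparseMLData implementation.
// Stores just the nonzero entries of a logically dense vector.
public class SparseVectorSketch
{
    private readonly Dictionary<int, double> _nonZero = new Dictionary<int, double>();

    public SparseVectorSketch(int count)
    {
        Count = count;
    }

    // Logical length of the vector (e.g. the vocabulary size).
    public int Count { get; }

    public double this[int index]
    {
        get
        {
            // Any index not present in the dictionary is an implicit zero.
            double value;
            return _nonZero.TryGetValue(index, out value) ? value : 0.0;
        }
        set
        {
            // Don't spend a dictionary entry on an explicit zero.
            if (value == 0.0)
                _nonZero.Remove(index);
            else
                _nonZero[index] = value;
        }
    }
}
```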
I've added unit tests for SparseMLData with near-100% code coverage. I use NCrunch, which makes that easier, hence a couple of NCrunch artefacts that should help anyone else using the tool get up and running faster. I've also tagged the unit tests that touch the file system to read/write CSV as "Integration", so that they can be excluded from multi-threaded execution.
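For anyone reproducing that setup, the tagging is just a standard test-category attribute. The sketch below assumes MSTest and uses a made-up test name; NUnit's equivalent would be `[Category("Integration")]`.

```csharp
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class SparseMLDataCsvTests
{
    // Hypothetical test name for illustration. The "Integration" category lets a
    // runner such as NCrunch be configured to keep file-system tests out of
    // multi-threaded execution.
    [TestMethod]
    [TestCategory("Integration")]
    public void ReadWriteCsv_RoundTripsSparseData()
    {
        // ... reads and writes a CSV file on disk ...
    }
}
```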
Your comments would be welcome.
Jeff - not really on topic, but I've read two of your books, and they're what enabled me to make this contribution, so thanks!