multithreading center init #4
The K-means++ initialization is inherently serial, which is why you see only one core doing the work: each major stage of the algorithm requires the previous stages to complete (each iteration of https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L183). However, the code inside that loop can be parallelized, especially the loop you mention at https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187

One thing I have tried before that works well, but is not in this code, is to apply the idea of the triangle inequality here. This avoids distance computations on subsequent iterations and makes the initialization much faster for large k. I may have this code lying around somewhere. Alternatively, there are other methods based on K-means++ that supposedly work faster in parallel; see here: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

As for the max_dist variable, I think you're right that it's unused. It's probably left over from some older profiling code.
From BaylorCS/baylorml#2 (comment) (@kno10)

It may well be feasible to use random initialization. A simpler, yet effective, strategy for huge data is to initialize using k-means on a sample (if I'm not mistaken, k-means|| is similar to running k-means++ on a distributed sample?).

In ELKI (not baylorml), on a data set of 38 million 2d coordinates with k=100, run "on a single core and not on a dedicated benchmarking host":

- SortMeans with random initialization, seed 0: 420 s, 41 iterations, varsum 4.471356921340852E8
- SortMeans initialized with 5it,km++,1% (the parameters used for the "sample k-means" combination in ELKI), seed 0: 624 s, 55 iterations, varsum 3.538176164640555E8

So the main benefit of k-means++ is that it is less likely to get stuck in a bad local minimum. And, unfortunately, reality is much more complex: the performance here is not bad, but also not substantially better. For more reliable conclusions, this would need to be run on many data sets, with different ks, and many more seeds, on dedicated systems. We need much larger experiments to even be able to tell what is "usually" better. Overall, a lucky or bad random seed seems to be the largest factor.
If you want to experiment with different seeds, I have contributed a …
Migrated from BaylorCS/baylorml#2 (@BenjaminHorn)
I have a fairly big dataset (100M × 10), and as I calculated, it would take around 8 hours to initialise the centers with init_centers_kmeanspp_v2. After some tests I realised:

- that only one core does the work
- that most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187
I have to admit I don't know much about multithreaded programming, but I think the loop could be split across a number of threads to make it run in parallel. Those parallel functions would all have to read from the same dist2 array and from x, though. Maybe that is why a cluster loop takes 5-6 s, and it can't be run in parallel to speed things up?
Before I start digging into the topic, I just wanted to ask your opinion.
One other thing: why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary? As far as I can see, max_dist isn't used anywhere.