sparse matrices for BIG distance matrices #16

Open

l-ramirez-lopez opened this issue Sep 24, 2020 · 6 comments

Comments

@l-ramirez-lopez
Owner

When working with a very large number of observations, the conventional computation of distance matrices can become problematic in terms of memory. Sparse matrices seem like a simple and efficient way to approach this problem.
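A quick back-of-the-envelope comparison of a dense distance matrix versus one that only stores a fixed number of neighbors per observation (the numbers below are only illustrative):

```r
## memory for a dense n x n distance matrix of doubles (8 bytes per entry)
n <- 80000
n^2 * 8 / 1e9          # ~51 GB

## if only the k nearest neighbors of each observation are stored
k <- 400
n * k * 8 / 1e9        # ~0.26 GB for the distance values alone
```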

@jsanderman
Collaborator

Hi Leo,

I'd love to hear more about your thinking here, since this is the biggest computational bottleneck in our current use of MBL. I wouldn't consider spectral data sparse in the conventional sense of a sparse matrix, but admittedly my understanding of sparse matrices is very limited.

Cheers, Jon

@philipp-baumann
Collaborator

Hi Leo,
have you considered using arma for this purpose?
Cheers,
Philipp

@l-ramirez-lopez
Owner Author

l-ramirez-lopez commented Sep 25, 2020

Hi Jon, Hi Philipp,
Spectral data is indeed not sparse, but the spectral distance matrices can be seen as sparse matrices.
I think that the heaviest memory burden in mbl comes from the distance matrices. I did not consider this an issue when I started the development of the package, since the spectral libraries were relatively small at that time (max. 20,000 observations, e.g. LUCAS). However, for libraries with more than 30,000 observations memory handling becomes relevant. For example, for a data set with 80,000 observations we might need (at least) 51 GB of memory to allocate its distance matrix.
My thinking is that we do not need all the data in that matrix; the only data we need is the data corresponding to the neighbors of each observation, and a "missing" value can then be assigned to the non-neighbor observations. In this case, the distances and the neighbor identification must be computed iteratively for each observation. For example, if we retain the information for only 400 neighbors, we can reduce the memory requirements to at most 0.5 GB (i.e. 200 times less memory). Computational time might suffer a bit; this I still need to check. Concerning the storage of the distance data, there are several ways in which it could be done efficiently.
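For illustration, a rough sketch of the idea (using the Matrix package for the sparse storage; the names and numbers are placeholders, not the actual resemble code):

```r
library(Matrix)

k <- 5                                    # number of neighbors to retain
x <- matrix(rnorm(100 * 10), nrow = 100)  # toy "spectra": 100 obs x 10 variables

rows <- cols <- vals <- vector("list", nrow(x))
for (i in seq_len(nrow(x))) {
  ## distances from observation i to all observations, computed one row at a
  ## time so the full n x n matrix is never allocated
  d_i <- sqrt(colSums((t(x) - x[i, ])^2))
  nn <- order(d_i)[2:(k + 1)]             # k nearest neighbors, excluding itself
  rows[[i]] <- rep(i, k)
  cols[[i]] <- nn
  vals[[i]] <- d_i[nn]
}

## sparse storage: non-neighbor entries are simply not stored
d_sparse <- sparseMatrix(i = unlist(rows), j = unlist(cols), x = unlist(vals),
                         dims = c(nrow(x), nrow(x)))

format(object.size(d_sparse), units = "Kb")
```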
I still need to find time to implement all this, but it is one of the key developments/improvements that I think resemble requires.
@philipp: I think that all of the above can be easily implemented with the current dependencies of resemble.
Cheers,
Leo

@philipp-baumann
Collaborator

philipp-baumann commented Sep 25, 2020

Hi Leo, Hi Jon,
Cool. data.table would be neat for memory and speed. Indeed, 0.5 GB is nothing to worry about, except maybe for some parallel overhead. My machines have 128 GB and 1 TB of memory, which is in the typical range for high-performance systems. A prediction service could, for example, be based on a virtual machine with decent RAM. Looking forward to the next amendments. Happy to contribute if there is anything you think I can help out with.
Cheers,
Philipp

@philipp-baumann
Collaborator

philipp-baumann commented Sep 25, 2020

Maybe another option would be to use {disk.frame} when random-access memory is scarce. This would mean that the distance data is written out to disk and chunked.
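Roughly what I mean, sketched in plain base R ({disk.frame} would manage the chunk bookkeeping itself; the function and file names here are made up):

```r
## sketch: compute the distance matrix in row-chunks and write each chunk to
## disk, so the full n x n matrix never sits in memory at once

## Euclidean distances between the rows of a and the rows of b
cross_dist <- function(a, b) {
  d2 <- outer(rowSums(a^2), rowSums(b^2), `+`) - 2 * tcrossprod(a, b)
  d2[d2 < 0] <- 0  # guard against tiny negative values from rounding
  sqrt(d2)
}

x <- matrix(rnorm(1000 * 10), nrow = 1000)   # toy data set
chunk_size <- 250
chunks <- split(seq_len(nrow(x)), ceiling(seq_len(nrow(x)) / chunk_size))

dir.create("dist_chunks", showWarnings = FALSE)
for (ch in seq_along(chunks)) {
  idx <- chunks[[ch]]
  ## only one chunk x n block of distances is in memory at any time
  d_chunk <- cross_dist(x[idx, , drop = FALSE], x)
  saveRDS(d_chunk, file.path("dist_chunks", sprintf("chunk_%03d.rds", ch)))
}
```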

@AlexandreWadoux
Collaborator

Leonardo is right that we can use sparse matrices for this. For example, if you only want 200 neighbours from a dataset of, say, 30,000 points, why store the distances between all 30,000 points?
Computing the distances is fast, but the size of the resulting matrix is a problem.
One possible solution is:
1- Create a sparse matrix of size N*N and assign NA values.
2- Run a hierarchical clustering on the points. This should be fast.
3- Keep the branches in memory, but store the clusters somewhere.
4- Select a suitable number of clusters, so that you have a sufficiently large number of points in each cluster. The number of points in each cluster must be much larger than the number of neighbours, but small enough to save memory.
5- Compute the distances only to the points within the selected cluster, and put these distances into the sparse matrix.

I guess it would dramatically reduce the size of the distance matrix.
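A rough sketch of these steps (toy sizes, with the Matrix package assumed for the sparse storage; for a really large N the dist()/hclust() step itself would of course also need a more memory-friendly variant):

```r
library(Matrix)

n <- 2000
x <- matrix(rnorm(n * 10), nrow = n)          # toy data set

## 2-4. hierarchical clustering, then cut the tree into clusters that are much
##      larger than the neighbourhood size but still small enough to be cheap
hc <- hclust(dist(x), method = "ward.D2")
cl <- cutree(hc, k = 20)

## 1 & 5. fill a sparse n x n matrix with the within-cluster distances only
rows <- cols <- vals <- vector("list", 20)
for (g in seq_len(20)) {
  idx <- which(cl == g)
  d_g <- as.matrix(dist(x[idx, , drop = FALSE]))
  rows[[g]] <- rep(idx, times = length(idx))   # column-major order of d_g
  cols[[g]] <- rep(idx, each = length(idx))
  vals[[g]] <- as.vector(d_g)
}
d_sparse <- sparseMatrix(i = unlist(rows), j = unlist(cols),
                         x = unlist(vals), dims = c(n, n))

format(object.size(d_sparse), units = "Mb")
```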

Alexandre
