
cuSPARSE dense matrix and sparse matrix multiplication resulting in a dense matrix #211

zhoeujei opened this issue Aug 22, 2024 · 5 comments

@zhoeujei commented Aug 22, 2024

Why doesn't cuSPARSE support dense-matrix × sparse-matrix multiplication that produces a dense matrix? Many application scenarios require this. Please consider adding support.


@essex-edwards (Contributor)

The SpMM routine can do this. The API is phrased as dense = sparse * dense, but you can achieve dense = dense * sparse by transposing everything: C = B * A is equivalent to C^T = A^T * B^T, which SpMM computes directly.
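For example, here is a minimal sketch of that trick (illustrative function and variable names; float values, 32-bit indices, and row-major dense device buffers are assumed). The key observation is that a row-major m x k buffer is bit-identical to a column-major k x m matrix, so B^T and C^T can be described as CUSPARSE_ORDER_COL views without moving any data:

```c
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: C = B * A with A sparse (CSR, k x n), B (m x k) and C (m x n)
// dense row-major, computed as C^T = A^T * B^T in a single SpMM call.
// Error checking omitted for brevity.
void dense_times_sparse(cusparseHandle_t handle,
                        int m, int k, int n, int nnz,
                        const int   *dA_rowOffsets,  // CSR arrays of A (k x n)
                        const int   *dA_colInd,
                        const float *dA_values,
                        const float *dB,             // row-major m x k
                        float       *dC)             // row-major m x n
{
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matBt, matCt;
    float alpha = 1.0f, beta = 0.0f;

    // A as a k x n CSR matrix; SpMM will apply op(A) = A^T (n x k).
    cusparseCreateCsr(&matA, k, n, nnz,
                      (void *)dA_rowOffsets, (void *)dA_colInd, (void *)dA_values,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

    // Reinterpret the row-major buffers as column-major transposes: no copy.
    cusparseCreateDnMat(&matBt, k, m, k, (void *)dB, CUDA_R_32F, CUSPARSE_ORDER_COL); // B^T
    cusparseCreateDnMat(&matCt, n, m, n, (void *)dC, CUDA_R_32F, CUSPARSE_ORDER_COL); // C^T

    // C^T (n x m) = A^T (n x k) * B^T (k x m)
    size_t bufferSize = 0;
    void *dBuffer = NULL;
    cusparseSpMM_bufferSize(handle,
                            CUSPARSE_OPERATION_TRANSPOSE,      // op(A) = A^T
                            CUSPARSE_OPERATION_NON_TRANSPOSE,  // matBt already holds B^T
                            &alpha, matA, matBt, &beta, matCt,
                            CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSpMM(handle,
                 CUSPARSE_OPERATION_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, matBt, &beta, matCt,
                 CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);

    cudaFree(dBuffer);
    cusparseDestroyDnMat(matCt);
    cusparseDestroyDnMat(matBt);
    cusparseDestroySpMat(matA);
}
```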

@zhoeujei (Author)

It is indeed possible to achieve this through transposition, but it reduces efficiency. Please consider adding direct support.

@essex-edwards (Contributor)

Thanks for the suggestion. Do you have any benchmarks, particular call sequences, or matrices that seem unexpectedly slow? If you can share those, it would help us optimize for your use case.

@zhoeujei (Author)

Yes. I set both the input matrix dimensions and the output matrix dimensions to 1024x1024. With the parameter CUSPARSE_OPERATION_TRANSPOSE, cusparseSpMM runs about twice as slow as with CUSPARSE_OPERATION_NON_TRANSPOSE. Therefore, I would like to ask whether NVIDIA provides a library for dense matrix * sparse matrix = dense matrix operations.
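Roughly, the comparison looks like this (a minimal sketch, not the exact benchmark code; square 1024x1024 matrices are assumed so both op settings are dimensionally valid, reusing the handle, descriptors, and buffer from the sketch above):

```c
#include <stdio.h>

// Time one SpMM variant with CUDA events, averaging over repeated launches.
// opA is CUSPARSE_OPERATION_TRANSPOSE or CUSPARSE_OPERATION_NON_TRANSPOSE.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Warm-up launch so one-time setup cost is excluded from the measurement.
cusparseSpMM(handle, opA, CUSPARSE_OPERATION_NON_TRANSPOSE,
             &alpha, matA, matBt, &beta, matCt,
             CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);

cudaEventRecord(start);
for (int i = 0; i < 100; ++i)      // average over 100 launches
    cusparseSpMM(handle, opA, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, matBt, &beta, matCt,
                 CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("opA=%d: %.3f ms per SpMM\n", (int)opA, ms / 100.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```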

@essex-edwards (Contributor)

Okay. If I understand correctly, you are using SpMM to compute C = A^T B, where A is a CSR matrix and B and C are dense matrices, and you observe that this is slower than C = AB. This is expected. The performance loss is not due to the lack of a specialized API; it is due to the data layout of A^T. When A is stored in CSR format, the entries of A^T are effectively stored column-by-column, which is not an algorithmically convenient order. We can, of course, try to make it faster, but a significant performance gap is probably unavoidable. If possible, try storing A^T instead of A and using opA=NON_TRANSPOSE (equivalently, store A in CSC format). In that case, the data of A^T is arranged row-by-row, and you should see faster performance.
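A sketch of that suggestion, reusing the dense descriptors from the earlier sketch (in practice a fresh cusparseSpMM_bufferSize query should be done for the new call). The CSR arrays of A^T (n x k) are exactly the CSC arrays of A, so this is the same data as A stored in CSC format; if A only exists in CSR form, a one-time cusparseCsr2cscEx2 conversion can produce these arrays:

```c
// Store A^T directly in CSR so the kernel walks the data in its natural
// row-by-row order, then call SpMM with opA = NON_TRANSPOSE.
cusparseSpMatDescr_t matAT;
cusparseCreateCsr(&matAT, n, k, nnz,
                  (void *)dAT_rowOffsets,   // == CSC column offsets of A
                  (void *)dAT_colInd,       // == CSC row indices of A
                  (void *)dAT_values,
                  CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                  CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

// C^T (n x m) = A^T (n x k) * B^T (k x m), no transpose op needed:
cusparseSpMM(handle,
             CUSPARSE_OPERATION_NON_TRANSPOSE,   // data is already A^T
             CUSPARSE_OPERATION_NON_TRANSPOSE,
             &alpha, matAT, matBt, &beta, matCt,
             CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);
```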
