Skip to content

CUDA implementation of mixed-precision block QR decomposition

License

Notifications You must be signed in to change notification settings

jaidonlybbert/MixedPrecisionBlockQR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mixed-Precision Block QR Decomposition

A rectangular matrix A can be factored into a product of an orthogonal matrix Q, and an upper triangular matrix R; A = QR.

- Van Loan Golub. Matrix Computations, Fourth Edition.

Overview

The QR decomposition can be used as a non-linear least squares matrix solver. It is numerically stable, and well-suited for parallelism. These properties make it useful for processing matrix data from cameras and sensor arrays, such as for SLAM, background subtraction, radio communications, object encoding, point cloud visualization, and many more applications.

This project was started as a University of Washington graduate school project in collaboration with Amazon Lab126. The objective, a fast and correct parallel Block QR decomposition algorithm using half-precision FP16 matrix-matrix multiplies. The Mixed-precision Block QR algorithm is well-suited for large, wide matrices. Other QR algorithms are better suited for small or tall-and-skinny matrices.

Test Data

Our test data is derived from the Euroc-MAV dataset to emulate using the QR decomposition to perform a non-linear least squares optimization for robot pose estimation and bundle adjustment for SLAM applications. The implementation works for arbitrarily sized matrices. The dimensions of the largest matrices in our test data were on the order of 2000 x 2000.

What this project is

  • 100% free to use for anything
  • A reference for future researchers implementing a parallel QR algorithm
  • A way to test FP16 error performance for the Block QR algorithm using TensorCores
  • A low-level C++ implementation
    • Very few libraries (just CUDA)
    • Adaptable to alternate parallel processors

What this project isn't

  • In development
    • Our results are published here, with instructions on installing and running the project for yourself. I will respond to issues and PRs about bugs or unclear instructions, but won't be adding features.
  • The fastest parallel implementation of QR in CUDA
    • Use CuSolver if you just need a CUDA solver, it uses a more robust implementation of QR by the NVIDIA developers themselves
  • Memory optimized
    • We met our performance targets for the research project, but many many things could be done more efficiently, especially with the HtoD memory traffic

Dependencies

  • NVIDIA GPU (compute capability >7.5)
  • CUDA Development Kit 12.1
  • Python 3.11
  • CMAKE

Linux

  • gcc
  • make or Ninja

Windows

  • Microsoft Visual Studio 2022

Installation

  • Clone repo
git clone https://github.com/jaidonlybbert/MixedPrecisionBlockQR.git
  • Run CMAKE
mkdir build
cd build
cmake ..
cmake --build .

Executables can be found in the /build directory. On Windows, a Visual Studio solution (.sln) is built into the build directory.

After running the tests, the Python script /Cuda/performance/runtime.py can be run to parse the test log results and generate graphs.

Our Test Results (work in progress)

Error

Execution Time

Related Projects

References

  • Zhang, S., Baharlouei, E., & Wu, P. (2020). High Accuracy Matrix Computations on Neural Engines: A Study of QR Factorization and its Applications. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 17–28. https://doi.org/10.1145/3369583.3392685 | PDF
  • Bouwmeester, H., Mathias Jacquelin, Langou, J., & Yves, R. (2011). Tiled QR factorization algorithms. arXiv.org. https://arxiv.org/pdf/1104.4475.pdf
  • L. Minah Yang, Alyson Fox, and Geoffrey Sanders. Rounding error analysis of mixed precision block householder qr algorithms. SIAM Journal on Scientific Computing, 43(3):A1723–A1753, 2021.
  • Van Loan Golub. Matrix Computations, Fourth Edition, pages 235–240, 246–251. The Johns Hopkins University Press, 2013.
  • Bhaskar Dasgupta. Applied Mathematical Methods, chapters 10-12. Pearson, 1986.

About

CUDA implementation of mixed-precision block QR decomposition

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •