Is there a practical example of use of imfilter in GPUs? #209
Comments
There isn't now, but it's definitely on the agenda: https://julialang.org/jsoc/gsoc/images/#gpu_support_for_many_algorithms_medium That said, no one has claimed it yet, though I have my fingers crossed for this GSoC period. If you're interested in tackling it, we'd love to have this!
Thanks for the input. We would certainly love to help, but we are newcomers to Julia (I moved all our software from C++ to Python a few years ago; today I would have tried Julia instead). My plan is to use Julia in specific areas where the advantage clearly outweighs the overhead of adding a second language to the software system, and image processing on GPUs would be one of them, if the software were mature enough. From what I can see, this may still take a while. One possibility I will think about is trying a toy example and evaluating how hard it would be to extend it to the full problem. Thanks again.
It's hard to say how hard it would be (Julia is exceptionally nice for GPU programming), but it is probably not the best project for a Julia beginner. Let's leave this open and you'll know it's done when it gets closed. Certainly don't expect anything in the next couple of months, but I'm hopeful it won't be too bad and that someone will find the time to tackle it. |
Maybe relevant: DynamicGrids.jl can run direct convolutions on GPU using KernelAbstractions.jl. A simple direct windowing implementation is pretty easy to write as a toy example, but something faster, or other algorithms for larger window sizes, might be more difficult. Unfortunately the code is pretty abstract, as it handles running arbitrary functions over arbitrary neighborhood shapes, but maybe it gives an idea of what is involved. In my limited experience, really fast GPU code does seem to require knowledge of GPU blocks, threads, and shared memory to optimize, which I don't have yet either, but I'm interested to see what you come up with. Even a toy example on GPU is still pretty fast.
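For reference, the direct windowing approach mentioned above can be sketched in a few lines of NumPy (Python used here for brevity; the double loop is exactly what a GPU kernel would parallelise, one thread per output pixel). `direct_convolve2d` is a hypothetical name, not anything from DynamicGrids.jl or ImageFiltering.jl:

```python
import numpy as np

def direct_convolve2d(image, kernel):
    # Naive direct convolution: slide the flipped kernel window over
    # every pixel; O(H*W*kh*kw) work, fine for small kernels.
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))   # zero boundary
    flipped = kernel[::-1, ::-1]                   # true convolution flips
    out = np.empty(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
box = np.ones((3, 3)) / 9.0
blurred = direct_convolve2d(img, box)
```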
Hi again. I have run a simple exercise, convolving an image on CPU/GPU using FFTs. You can find it here: https://github.com/jjgomezcadenas/JImages/blob/main/nb/ConvolvingLenaGPUFFT.ipynb Several remarkable points:
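The CPU/GPU FFT exercise boils down to the convolution theorem: three transforms and a pointwise product. A minimal NumPy sketch of that idea (helper names are made up; on a GPU the `np.fft` calls would map onto cuFFT via CUDA.jl or CuPy):

```python
import numpy as np

def psf_to_otf(psf, shape):
    # zero-pad the kernel to the image size and move its centre to (0, 0)
    pad = np.zeros(shape)
    kh, kw = psf.shape
    pad[:kh, :kw] = psf
    return np.fft.fft2(np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1)))

def fft_convolve2d(image, kernel):
    # convolution theorem: ifft(fft(image) * fft(kernel)); boundary is circular
    otf = psf_to_otf(kernel, image.shape)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * otf))

img = np.full((8, 8), 3.0)       # constant test image
box = np.ones((3, 3)) / 9.0      # normalised kernel: output should stay constant
out = fft_convolve2d(img, box)
```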
I thought @timholy may be interested in the exercise. If there is any obvious recipe that I can apply to improve the algorithm I will be glad to try. I am also amazed how fast the imfilter algorithm runs (again, I would be happy to learn any trick you care to share). Thanks |
Have you tried a much larger kernel with a larger image? These sizes are not favourable for FFTs or GPUs. I think you could also compare to…
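A back-of-the-envelope cost model shows why these sizes matter: direct filtering costs about N²k² multiply-adds, while the FFT route costs roughly a constant times N²·log N regardless of kernel size, so FFTs only pay off past a crossover kernel size. A rough sketch (the constant 5 is an assumption for illustration, not a measured value):

```python
import math

def direct_ops(n, k):
    # multiply-adds for an n x n image with a k x k kernel, direct method
    return n * n * k * k

def fft_ops(n, c=5.0):
    # rough model: three 2D FFTs at ~c * n^2 * log2(n^2), plus the product
    return 3 * c * n * n * math.log2(n * n) + n * n

# ratio > 1 means the FFT route is (modelled as) cheaper than direct
n = 512
ratios = {k: direct_ops(n, k) / fft_ops(n) for k in (3, 7, 15, 31)}
```

Under this model a 3×3 kernel strongly favours the direct method, and the FFT only wins somewhere past k ≈ 15 on a 512×512 image.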
I will try. The bottom line, in any case, is that it should be very simple to add GPU support to imfilter, at least in the FFT case, which, as you say, could be very useful out of the box for large image/kernel processing. Our problem, however, goes in the opposite direction: we need to deconvolve many small images. Each electron in our detector is reconstructed as a stack of about 100 images, each image a square of about 300 pixels on a side (the exact image size is variable). Kernels are also small and of variable size. For each image we need to apply an iterative deconvolution procedure (Richardson-Lucy), which takes about 100 loops to converge, and each loop itself involves 2 convolutions. This repeats for many events. I suspect one should find a way to send the full stack of 100 images simultaneously to the GPU. Thanks!
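For concreteness, here is a minimal batched Richardson-Lucy sketch in NumPy: the two convolutions per iteration are done in Fourier space, and the whole stack is processed at once, which is also how one would keep a GPU busy. All names are hypothetical; this is neither the scikit-image nor the imfilter implementation:

```python
import numpy as np

def psf_to_otf(psf, shape):
    # zero-pad the PSF to the image size and centre its origin at (0, 0)
    pad = np.zeros(shape)
    kh, kw = psf.shape
    pad[:kh, :kw] = psf
    return np.fft.fft2(np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1)))

def fft_conv(x, otf):
    # circular convolution over the last two axes; batches over the rest
    return np.real(np.fft.ifft2(np.fft.fft2(x) * otf))

def richardson_lucy(stack, psf, n_iter=100):
    otf = psf_to_otf(psf, stack.shape[-2:])
    otf_flip = np.conj(otf)       # flipped PSF <-> conjugate OTF (real PSF)
    est = np.full_like(stack, 0.5)
    for _ in range(n_iter):
        blurred = fft_conv(est, otf)               # convolution 1
        ratio = stack / np.maximum(blurred, 1e-12)
        est = est * fft_conv(ratio, otf_flip)      # convolution 2
    return est

# toy stack: two images, each a single point source blurred by a box PSF
truth = np.zeros((2, 16, 16))
truth[:, 8, 8] = 1.0
psf = np.ones((3, 3)) / 9.0
observed = fft_conv(truth, psf_to_otf(psf, (16, 16)))
restored = richardson_lucy(observed, psf, n_iter=100)
```

Because the FFTs run over the last two axes, the per-event stack of ~100 images is one batched call; swapping the arrays for GPU arrays would move all the heavy work onto the device in one shot.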
In that notebook, it's not obvious whether you're accounting for compilation time. Are those times you show after running it a second time?
Yes, yes, I ran it several times; the times you see in the notebook are after a few runs. The separation trick is nice indeed, and should account for at least a good portion of the huge difference with my naive FFT implementation. The fact that the GPU accelerates my toy calculation by a factor of 50 also suggests that simply using CUDA FFTs in imfilter (which should be easy, I guess) would result in extra gains, even for "smallish" kernels and images. I have looked at the imfilter code to try to figure out if I could introduce these, but it is sort of convoluted itself... ;)!
To clarify, the factorization trick is only used for FIR filtering, not FFT filtering. Did you force FFT filtering with imfilter for comparison? Since this is mostly about "why is my alternative implementation slow?", I confess I don't really have the bandwidth to help you figure that out. Try profiling your implementation and viewing it with ProfileView or equivalent to see if you can figure out what's going on; doing the same for imfilter may also be instructive.
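The factorization trick Tim refers to exploits that a Gaussian-like kernel is (approximately) an outer product of two 1D kernels, so one k×k pass becomes two length-k passes: O(HW·2k) work instead of O(HW·k²). A NumPy sketch verifying the two routes agree (helper names are made up for illustration):

```python
import numpy as np

def conv1d_along(img, k, axis):
    # 'same'-size 1D convolution along one axis, zero boundary
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, img)

def separable_filter(img, k):
    # factorised: a row pass then a column pass, O(H*W*2k)
    return conv1d_along(conv1d_along(img, k, axis=1), k, axis=0)

def dense_filter(img, k):
    # unfactorised: one pass with the full 2D kernel outer(k, k), O(H*W*k^2)
    K = np.outer(k, k)[::-1, ::-1]
    kh, kw = K.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * K)
    return out

rng = np.random.default_rng(1)
img = rng.random((6, 6))
k = np.array([1.0, 2.0, 1.0]) / 4.0   # 1D binomial (near-Gaussian) factor
```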
Hi Tim, thanks a lot for the input. Actually I am not that interested in my alternative implementation, which I wrote just to compare GPU vs CPU; that exercise does show an improvement using the GPU. What I would like to do, as originally stated, is to adapt imfilter to the GPU. Alas, the code seems too complex for me to mess with.
I'm interested in adding GPU FFT to DynamicGrids.jl at some stage. It would be good if we could share a base package somehow, eventually... The DG convolution code is even more convoluted - the price of generality. Here it seems that imfilter just calls imfilter until all the arguments are clarified, so you should be able to find the method signature that GPU FFT can plug into. This is a very common coding style with multiple dispatch in Julia.
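That "imfilter calls imfilter" pattern is a funnel: each method normalises one argument and re-dispatches until a single concrete method does the work, and that concrete method is where a GPU path would plug in. A loose single-dispatch analogue in Python (names and the trivial "filter" are made up; Julia's multiple dispatch is far more general):

```python
from functools import singledispatch
import numpy as np

@singledispatch
def smooth(img, kernel):
    # fallback method: coerce anything array-like to ndarray, re-dispatch
    return smooth(np.asarray(img, dtype=float), kernel)

@smooth.register(np.ndarray)
def _(img, kernel):
    # the concrete method at the bottom of the funnel; a GPU-array method
    # would be registered at exactly this level (trivial scaling stands
    # in for the real filtering work)
    return img * kernel

# a plain list funnels through the fallback into the ndarray method
result = smooth([[1, 2], [3, 4]], 2.0)
```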
Hi, just wanted to check if there is any news on this. Is there any easy way to use imfilter on the GPU?
There is not yet that I know of, but we want to move in that direction.
Hello!
I am interested in using Julia/imfilter for a deconvolution problem in particle physics. In short, our experiment (http://next.ific.uv.es/next/) takes "electronic movies" of electrons propagating in dense xenon gas. We divide those pictures into a stack of about 100 2D images. Each image is "blurred" by the diffusion of the ionisation electrons on their way to our sensors, but the diffusion can be described very well (by an approximately Gaussian kernel), and thus we can deconvolve our pictures and reconstruct the track using a classical Richardson-Lucy deconvolution. If curious, see:
Boosting background suppression in the NEXT experiment through Richardson-Lucy deconvolution, arXiv:2102.11931 [physics.ins-det]
The algorithm (and all our software) is implemented in Python (+numpy, scipy, etc). The RL algorithm is implemented in scikit-image, one of the libraries of the ecosystem, and is pretty fast, but not fast enough for our problem. Typically we take about 2.5 seconds per image, so about 30 seconds per event... and we need to process zillions of events. Currently we solve the problem by brute force, running in large clusters, one event per node. Clearly not optimal.
The problem appears to be well suited for GPUs, since it involves loops and more loops of convolutions. I have implemented RL in Julia using imfilter (it was surprisingly easy; the interface and docs are very clever, congrats!) and I have checked that it runs pretty fast on the CPU (about 0.5 seconds on my Mac, in multi-threaded mode). Now I would like to port the algorithm to GPUs. The docs indicate that one could in principle call imfilter with CudaLibs() as the resource, but I have failed to find a single example of anyone trying this.
So, my question (apologies for the long intro): can one run imfilter with CudaLibs() out of the box today? If yes, is there a practical example, a test, or a tutorial one can look at? If not, are there plans to implement it?
Thanks a lot, and congrats on a really clever library.