General Discussion #50
With respect to command line handling for
I hope you can approve of this. |
@mn416 Upon checking in, I happened to notice the branch. Looks like you're enabling VPM and DMA. Now, that is cool. |
Hi @wimrijnders,
Yes, sounds good.
The new version should be completely backwards compatible with the current stable version (when it works, almost there). However, the new DMA and VPM functions can break the dereferencing operator when they are used. I will have to document this, but basically programs that use explicit VPM/DMA should not use the dereferencing operator (we can add a compiler check to make sure this constraint is met). |
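For illustration, this is the style of kernel being referred to: loads and stores expressed through the dereferencing operator, as in the existing GCD and Tri examples. A minimal sketch (the kernel itself is made up for the example); per the comment above, this style is the one that should not be combined with the new explicit VPM/DMA calls:

```cpp
#include "QPULib.h"

// Illustrative kernel: reads a 16-element vector via the dereferencing
// operator, doubles it, and writes it back the same way.
void doubleIt(Ptr<Int> p)
{
  Int x = *p;      // implicit load through the dereferencing operator
  *p = x + x;      // implicit store
}

int main()
{
  auto k = compile(doubleIt);     // construct the kernel
  SharedArray<int> a(16);         // memory shared between ARM and GPU
  for (int i = 0; i < 16; i++) a[i] = i;
  k(&a);                          // run on the QPU (or emulator)
  return 0;
}
```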
By the way, I had to disable the platform detection code because it said my Pi was not a Pi :). No doubt I am on a very old kernel, but I guess there are a lot of Pi's out there on old kernels. I wonder if a better detection method would simply be to check if the VideoCore header files are present? |
OK. Thanks for easing my mind. Good to hear.
Do you know off the top of your head which Pi you have?
Please run this and see if it returns anything. In the meantime, I'll investigate if this is version dependent. EDIT: Yes, it is version dependent. Only the later versions of Pi support this. 😞 Perhaps it's dependent on the distro version only; in any case, this needs fixing because it should always work, not just on the newfangled Pis. |
The 'correct' way to do it is to determine the hardware revision number (see the sketch after this comment). Hardware revision numbers are totally unique per Pi version. So, what I can do is use the firmware check as above as a fallback. If you don't mind, please give me your distro version as well, to see how old it is.
Can you check if [1] And this implies that I can do away with the |
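A minimal sketch of the revision-number approach mentioned above, assuming the usual Raspbian `/proc/cpuinfo` layout with a `Revision : xxxxxx` line; mapping the raw code to a concrete Pi model is left out:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Returns the raw hardware revision code from /proc/cpuinfo
// (the hex value after "Revision"), or "" if not found.
std::string piRevision()
{
  std::ifstream cpuinfo("/proc/cpuinfo");
  std::string line;
  while (std::getline(cpuinfo, line)) {
    if (line.compare(0, 8, "Revision") == 0) {
      std::size_t colon = line.find(':');
      if (colon == std::string::npos) break;
      std::size_t start = line.find_first_not_of(" \t", colon + 1);
      return (start == std::string::npos) ? "" : line.substr(start);
    }
  }
  return "";
}

int main()
{
  std::cout << "Hardware revision: " << piRevision() << "\n";
  return 0;
}
```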
You mean the Also, theoretically you could be running on a BCM platform which is not a Pi......actually that would be no problem at all. Never mind. |
It's a Pi 1, Model B
|
Ouch. If this doesn't work, your distro is old. Never mind. See instead if The relevant lines for the Pi 2:
You should have something similar. |
@mn416 Have a fix ready for EDIT: Never mind, I created the PR. Please check if both platform detection scripts work for you now. |
Hi @wimrijnders, The new explicit VPM/DMA is finally in |
Yay! Was looking a bit forward to it. Will see if I can test it.
Thanks for pushing through. I understand you don't have much time for pet projects; your effort is appreciated. This also enables me to push further PRs. |
Aside, I'm trying hard to not buy a Pi Zero right now. You should imagine the pull I feel every time I pass the electronics outlet. I sort of want to get the collection complete: got 1 2 3, zero should be on the list. Never imagined computers could be collectibles, like Pokemon cards! Skipping the Compute Modules though, even if it's cutesy small. You need an expensive peripheral setup to program it at all. Perhaps one day (you only need to get the dev-kit once I would think). |
This looks relevant and interesting to me: Introduction to compute shaders. I'll take the time soon to read it in detail. Still don't understand the term 'shader' though. I see no difference with 'kernel' in this project. |
@mn416 Any chance of setting up a I ask for two reasons:
Perhaps this is something I could attempt myself? |
@mn416 New DMA example: working in QPU as well as emulation mode. Great work! Now, is it possible to overlap DMA read/writes while computing internally? My intuition says 'of course', but it is not apparent from the example. I would appreciate if you also make an example to (explicitly) show how to overlap DMA with computation. Just a thought. EDIT: Also appreciate the commenting in the example. It is just verbose enough to let the code make sense. EDIT2: Now, is it also possible to overlap DMA with direct memory access? Or would you regard this as an 'exercise for the reader', i.e. me? |
Hi @wimrijnders,
Yep, that is indeed possible.
By "direct memory access", I assume you mean the dereferencing operators. Unfortunately not. The compiler would need to be cleverer for that.
Yes, I'm hoping to do this. I'd like a compelling example of these new features. Maybe matrix multiplication, or a perceptron / multi-layer perceptron. Still deciding... I think Rot3D is not a good example of the new features, because it's basically memory bound (not a good compute to memory ratio). That's why we don't see a great speedup over the ARM in this example. One of the main features still missing from the library is the ability to use sub-word operations. For example, you can split a 32-bit word into 4 bytes and do 64-way vector operations -- that certainly improves the compute to memory ratio. Low-precision arithmetic is becoming very popular in neural nets. |
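As a plain C++ illustration of the sub-word idea (this is generic SWAR bit-twiddling, not a QPULib feature; on the QPU the hardware's packed 8/16-bit operations would do this work instead):

```cpp
#include <cstdint>
#include <cstdio>

// SWAR byte-wise add: treats a 32-bit word as 4 independent unsigned
// bytes and adds them lane-wise, with no carry spilling between bytes
// (overflow within a byte wraps around).
uint32_t addBytes(uint32_t a, uint32_t b)
{
  const uint32_t H = 0x80808080u;           // high bit of each byte
  uint32_t low = (a & ~H) + (b & ~H);       // add the low 7 bits per byte
  return low ^ ((a ^ b) & H);               // restore the high bits
}

int main()
{
  // Byte lanes: {0x01,0x02,0x03,0xFF} + {0x10,0x20,0x30,0x02}
  uint32_t r = addBytes(0x010203FFu, 0x10203002u);
  printf("%08x\n", r);                      // prints 11223301 (0xFF+0x02 wraps)
  return 0;
}
```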
I'm considering implementing a Discrete Fourier transform (not FFT) for VideoCore. It's actually a reason why I came here in the first place. Plenty of calculation required there, perhaps it would be a better example. I understand your concerns about
I just mean the 'old' style of loading data from/to main memory. I see no technical reason why this couldn't be combined with DMA. But you're suggesting that the Lib-code can't handle this right now, correct?
In the reference doc, e.g. page 57 onwards, I read that you can also use 8- and 16-bit elements; that would certainly allow you to pack more data into the calculation at the price of precision. Not sure how this lowered precision would be useful though, I'm too much of an exact thinker to appreciate it. |
Another example I would like to see running is a Mandelbrot set calculation. That should be really effective on the QPU's, since basically you're looping over a limited set of values. My understanding of the DSL is just too limited to be able to implement this (and DFT), hoping to get up to par soon. |
@mn416 Trying to chart the memories available to a QPU and beyond. This is what I've got so far: Does this look OK to you? Any obvious things I've got wrong? Questions on this:
Thanks. |
Excellent idea. This would be a nice example and should be reasonably straightforward. Each pixel is computed independently, so I imagine it will look a bit like the GCD example. You shouldn't need DMA for this example, the |
The diagram seems ok, but is it not better to leave this level of detail to the manual? My understanding is that only 4KB of the VPM is available to the QPUs for general use. Also the regfiles are 32x16x4 = 2KB in size. Almost the size of the VPM, but crucially the VPM is shared, so data can be loaded once (expensive) and used many times (cheap). Not sure about s,t,r,b. From experimentation, I believe the receive FIFO is 4 elements deep, but I may be missing a setting that makes it 8 elements deep. |
Thanks for answering. There are details which I'm still struggling to understand. I hope you can enlighten me.
I would love to agree with you, but I found that the overview diagram is incomplete. There are elements not drawn in there which are mentioned in the text. Notably:
My diagram is an attempt to get them in view, for my better understanding. In addition, I want to know the actual sizes of the memory elements, mostly not mentioned in the document. I don't want to draw the whole thing, just the memory parts that are relevant to a QPU.
Yes, page 53:
So 12KB - 8KB (reserved) = 4KB. From reading, this is because capacity is reserved for automatic execution of various shader types. I would like to point out that it's possible to disable the special shaders (fragment, vertex, coordinate) and run user programs only, see page 89. My hope (note emphasis) is that disabling the special shaders will free more capacity in the VPM for general use.
For the 32: Page 17:
Now, the mapped registers are shown in table 14 on page 37. If you look at any specific register definition, e.g. It follows in my thinking that the general-purpose registers are also 32-bits, otherwise the mapping is wonky. While it may be possible, I really cannot imagine a memory scheme where half of the addresses are for 64-byte registers and the other half are for 4-byte registers. Perhaps I'm missing something here? If you have further documentation for this, please share. (Even then, the 64B is wrong. It should be 32x4 = 128B)
There is; it depends on whether threads are enabled or not. This is the reason that I asked #41. Page 40:
So, if you can guarantee that the kernel running is not multi-threaded, you can use all 8 elements of the FIFO. |
As you can see, I have a lot to learn about QPU's. I really hope you don't mind if I discuss the hardware stuff with you. Addendum:
This bothers me; it should have been exactly specified in the reference docs. It's not the only thing that is vague. Page 39, "QPU Interface", says that there are 8 slots in the receive FIFO for color data. Color data is then defined as 32-bits (RGBA8888), meaning that the FIFO would only be half a 16-vector in size. This can't be right. The logical assumption to make is that a slot contains a 16-vector of color data. But I'm struggling to find proof of this in the document. I keep re-reading this part; I find the language confusing and ambiguous. On regfile elements, "Thread Control" p. 20:
Two things about this:
On VPM size, page 53:
So yes, 4KB if the window can't be changed. I get the impression that the 12KB can't ever be accessed fully; it's something that I'll just have to accept. |
What about registers This is my current hypothesis on how the QPU works as a 16-way SIMD device, perhaps you can confirm:
Does this make sense? Hoping for corrections or confirmation. |
I'm glad you agree. I'm itching to make this, or at least give it a start. |
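A very rough sketch of what one 16-pixel strip of the Mandelbrot escape-time loop might look like in the DSL, using only constructs that appear in the existing examples (For, Where, index(), toFloat(), the dereferencing operator). The parameter names, the literal initialisers and the fixed-iteration structure are assumptions, not tested code:

```cpp
#include "QPULib.h"

// One 16-pixel strip of a Mandelbrot row. 'reStart' is the real coordinate
// of the first pixel, 'step' the per-pixel spacing, 'im' the (shared)
// imaginary coordinate of the row, and 'result' receives one iteration
// count per pixel.
void mandelStrip(Float reStart, Float step, Float im, Int maxIter, Ptr<Int> result)
{
  Float lane = toFloat(index());        // 0..15, one value per lane
  Float cRe  = reStart + lane * step;   // per-lane real coordinate
  Float zRe = 0.0, zIm = 0.0, t, prod;
  Int count = 0;
  Int active = 1;                       // 1 while this lane still iterates

  For (Int i = 0, i < maxIter, i = i+1)
    Where (active == 1)
      t    = zRe*zRe - zIm*zIm + cRe;   // real part of z*z + c
      prod = zRe*zIm;
      zIm  = prod + prod + im;          // 2*zRe*zIm + c_im
      zRe  = t;
      count = count + 1;
    End
    Where (zRe*zRe + zIm*zIm > 4.0)     // escape radius 2
      active = 0;
    End
  End

  *result = count;                      // 16 consecutive counts
}
```

Lanes that have escaped are simply masked off with `active` and keep spinning until `maxIter`, which is easier to express than an early exit and costs little when all pixels of a strip escape at similar depths.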
@mn416 I've been thinking about a good showcase for small-value integers, something you mentioned previously that you want to implement. I think something with cellular automata would be suitable. These usually deal with small values only. It would be nice, however, to be able to show every step while running. I've just spent some time in the garden ruminating about how to do this. Something like your I realize that this is long-term thinking. |
Excellent idea. If we pick Game of Life then we just need 1 bit per cell and can probably just do bit-wise operations on 32-bit values to implement the state transition function, i.e. treat a 16-word vector as a 512-bit vector. This also sounds like a good example to demonstrate the new DMA features. |
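To make the bit-wise idea concrete, a plain C++ sketch of one row update with 1 bit per cell; a QPU version would apply the same logic to each 32-bit word of the 16-word vector, and would take the edge bits from neighbouring words rather than wrapping them around as done here:

```cpp
#include <cstdint>

// Full adder over bit-planes: per bit position, adds three 1-bit inputs
// and produces a sum plane and a carry plane.
static inline void fullAdd(uint32_t a, uint32_t b, uint32_t c,
                           uint32_t &sum, uint32_t &carry)
{
  uint32_t t = a ^ b;
  sum   = t ^ c;
  carry = (a & b) | (c & t);
}

// One Game of Life step for a 32-cell row, 1 bit per cell. 'up', 'cur'
// and 'down' are the rows above, at and below the row being updated.
uint32_t lifeRowStep(uint32_t up, uint32_t cur, uint32_t down)
{
  auto left  = [](uint32_t r) { return (r << 1) | (r >> 31); };
  auto right = [](uint32_t r) { return (r >> 1) | (r << 31); };

  // The eight neighbour bit-planes of every cell in the row.
  uint32_t n0 = left(up),   n1 = up,   n2 = right(up);
  uint32_t n3 = left(cur),             n4 = right(cur);
  uint32_t n5 = left(down), n6 = down, n7 = right(down);

  // Sum the eight 1-bit planes into a 4-bit neighbour count per cell.
  uint32_t sa, ca, sb, cb, sc, cc;
  fullAdd(n0, n1, n2, sa, ca);
  fullAdd(n3, n4, n5, sb, cb);
  fullAdd(n6, n7, 0,  sc, cc);

  uint32_t s1, c1, s2, c2;
  fullAdd(sa, sb, sc, s1, c1);   // weight-1 planes
  fullAdd(ca, cb, cc, s2, c2);   // weight-2 planes

  uint32_t bit0   = s1;          // count bit 0
  uint32_t bit1   = c1 ^ s2;     // count bit 1
  uint32_t carry1 = c1 & s2;
  uint32_t bit2   = c2 ^ carry1; // count bit 2
  uint32_t bit3   = c2 & carry1; // count bit 3 (only set for 8 neighbours)

  // Alive next step iff neighbours == 3, or alive now and neighbours == 2.
  uint32_t eq3 = bit0 & bit1 & ~bit2 & ~bit3;
  uint32_t eq2 = ~bit0 & bit1 & ~bit2 & ~bit3;
  return eq3 | (cur & eq2);
}
```

The point of the adder-tree formulation is that the per-cell branch ("alive with 2 or 3 neighbours") becomes two bit masks, so all 32 cells of a word are updated with a handful of ALU operations.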
Yes. But Game of Life is so boring..... Ooh, is 1 bit also possible? I thought 8-bit was the minimum. I suppose the 1-bit handling can be implemented within the kernel. And then just for kicks make a giant game of life board! |
I'm currently thinking over two things:
Please note, 'just thinking' and a bit of research. Not going to attempt these any time soon. |
Wrt graphic viewer (this is all still speculative): I was considering Instead, I've been looking at what a Pi can offer out of the box wrt GUI programming. A good candidate appears to be So I imagine a Python graphic front-end which can interface with a C++ back-end, and which can display the result - think of A Python <-> C++ binding is doable, I've done it before. You wouldn't happen to have Python experience, would you? (Note that this is still all vaporware - just thinking out loud) [1] I know this because I just upgraded to the latest version of Qt Creator |
No, you can't. The FIFO is actually 8-deep, but it is used for both requesting and receiving, so you can stack up to 8/2 = 4 requests to the FIFO even if a kernel is single-threaded. |
Thanks @Terminus-IMRC for answering. However: The documentation makes a clear distinction between request and receive FIFO. VideoCore reference documentation, page 39:
Although I must say that elsewhere the text is open to interpretation. Also, it wouldn't be the first time I detected inconsistencies in the document. In this case, I would say that experience trumps whatever is written in the documentation. So I'll seriously keep your comment in mind. EDIT: It doesn't make sense that a FIFO could be bidirectional, by definition. Also, assuming it's a single FIFO, the input/output length depends on how you use it. E.g. you might not read anything into the QPU and output 8 result vectors. EDIT2: Removed brainfart in previous EDIT. |
Never mind. I see your point. I think you're talking about data only.
You're talking about |
I'd like to get #66 merged soon, yes. Just need to sort out #52 first, which could be done simply by conditionally including/excluding |
Sorry, bad wording by me... It seems that there are 2 FIFOs and 8 entries on a TMU, and 4 are used for the request FIFO and the other 4 for the receive FIFO. |
Been examining the emulator code to understand how memory reads and writes work. @mn416 @Terminus-IMRC is the following correct? Very much a helicopter view. EDIT: The following is the case for gather/receive calls. Direct reads also go through the VPM. Read data:
Write data:
|
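For reference, this is roughly the user-facing side of the flow described above, in the prefetch style of the Rot3D example (a simplified sketch, not a drop-in kernel): gather issues the request, the work in between hides the latency, and receive blocks until the data arrives.

```cpp
#include "QPULib.h"

// Scale an array by 2 using the prefetch pattern: request the next
// 16-element chunk (gather) before waiting for the current one (receive),
// so the memory latency overlaps with the arithmetic.
void scale(Int n, Ptr<Float> p)
{
  Int inc = 16;
  Ptr<Float> q = p + index();
  gather(q);                       // request the first chunk

  Float x;
  For (Int i = 0, i < n, i = i+inc)
    gather(q+inc);                 // request the next chunk early
    receive(x);                    // wait for the current chunk
    store(x*2.0, q);               // write the result back
    q = q+inc;
  End
  receive(x);                      // drain the final outstanding request
}
```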
@mn416 I might have an actually useful application for I've been thinking about it. This transform can be parallelized amazingly well, much better than FFT. Will get back on this after my vacation. |
I don't know much about that domain, but more QPULib applications/examples will definitely make me happy :) |
Hi there, currently on my way to France for vacation. I believe that, because the Goertzel transform is well parallelizable, it should be possible to obtain the full effect of 12x16 SIMD concurrency. This will be a killer application for the QPU, IMHO. Also to note, it can be made compatible with Fourier. I'm truly excited about this. But it will have to wait till I get back from camping :-), August 12. |
Please note that the Goertzel transform is actually a Goertzel filter, used to search for a specific frequency in a signal. When it comes to QPU-FFT, consider taking a look at http://www.aholme.co.uk/GPU_FFT/Main.htm |
@mn416 Hereby checking in, showing a sign of life.. |
Hi @wimrijnders, Glad to hear it. Unfortunately, I've also been too busy recently to make any further progress on the development branch. |
@mn416 Yeah, that makes two of us. That's OK, the project won't run away any time soon and I'm still interested in progressing the state of the art. We'll get back here eventually. Good luck with whatever you're doing! |
Is there a way to estimate the performance of a compiled kernel? By calling:
in emulation mode also, I can at least see how many target instructions there are, but I'm unsure how this correlates to code execution time. |
Hi @robiwano, Not at present. It should be straightforward to extend the emulator to count the number of instructions executed. Of course, this will not account for the memory access cost. Matt |
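Until then, a crude alternative is to wall-clock the kernel invocation from the host; a sketch assuming a kernel and SharedArray set up as in the examples (it measures the whole call, launch overhead included, and says nothing about instruction counts):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper: 'k' is a kernel built with compile(...) and 'a' a
// SharedArray that has already been initialised, as in the QPULib examples.
template <typename Kernel, typename Array>
void timeKernel(Kernel &k, Array &a)
{
  auto t0 = std::chrono::steady_clock::now();
  k(&a);                                          // run on QPU or emulator
  auto t1 = std::chrono::steady_clock::now();
  double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
  printf("kernel took %.1f us\n", us);
}
```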
I now have a working complex MAC function (complex values are interleaved floats, re/im):
plus I added the spin-to-completion functionality of GPU_FFT to avoid the mailbox overhead, and with it the kernel is a lot faster than the reference code. However I would like to be able to process say 4 batches of 512 complex MACs accumulating to a single 512 complex accumulator, and I have no idea how to express that with QPULib :) |
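To pin down what is being asked for, a scalar C++ reference of the batched, interleaved complex MAC (names are illustrative; this is only the arithmetic, not a QPULib formulation):

```cpp
#include <cstddef>

// Complex multiply-accumulate over interleaved float data (re0, im0, re1,
// im1, ...): acc[i] += a[i] * b[i] for each of 'bins' complex bins, summed
// over 'batches' batches laid out back to back in a and b.
// E.g. bins = 512, batches = 4 matches the case described above.
void complexMacBatched(const float *a, const float *b, float *acc,
                       std::size_t bins, std::size_t batches)
{
  for (std::size_t batch = 0; batch < batches; batch++) {
    const float *pa = a + 2 * bins * batch;
    const float *pb = b + 2 * bins * batch;
    for (std::size_t i = 0; i < bins; i++) {
      float ar = pa[2*i], ai = pa[2*i + 1];
      float br = pb[2*i], bi = pb[2*i + 1];
      acc[2*i]     += ar*br - ai*bi;   // real part
      acc[2*i + 1] += ar*bi + ai*br;   // imaginary part
    }
  }
}
```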
Is it possible to have |
It's really not an alternative, as it only computes single bins from the DFT, albeit efficiently. Nonetheless, it is a very relevant and useful algorithm. |
Hi there, great to see some discussion here.
I can answer this in several layers, I will stick to this one: I did not state that the Goertzel should replace the DFT, I stated that Goertzel can be parallelized much better than it. I hope you see the nuanced difference. I realize fully that the Goertzel would not be a direct replacement for FFT, but when you're dealing with a limited number of frequencies it's a better alternative. This will of course not stop me from wanting to do my utmost to get Goertzel into competitive shape. I'm actually looking to make some form of progressive benchmarking for both, in the spirit of how the docs are set up in this project. Also, a great finger exercise for getting to grips with GPU programming... and indeed useful for my work, where we use Goertzel massively. EDIT: OK, scrolling back I can definitely see how I might have implied it. I don't remember my line of thinking then any more; right now my above comment holds. |
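For context, the single-bin Goertzel filter in scalar form; each bin's recurrence only touches its own state, which is what makes one-bin-per-lane parallelization so natural. Textbook version, nothing QPU-specific:

```cpp
#include <cmath>
#include <cstddef>

// Power of frequency bin k in N samples x[0..N-1], via the Goertzel
// recurrence. Each bin uses only its own s1/s2 state, so different bins
// (k values) can be evaluated fully in parallel, e.g. one per SIMD lane.
float goertzelPower(const float *x, std::size_t N, float k)
{
  const float PI = 3.14159265358979f;
  float w     = 2.0f * PI * k / float(N);
  float coeff = 2.0f * std::cos(w);
  float s1 = 0.0f, s2 = 0.0f;

  for (std::size_t n = 0; n < N; n++) {
    float s = x[n] + coeff * s1 - s2;
    s2 = s1;
    s1 = s;
  }
  return s1*s1 + s2*s2 - coeff*s1*s2;   // |X(k)|^2
}
```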
Another issue I'd like input on: I plan to use both GPU_FFT and QPULib in a project for an RPi Zero. But I see potential collision problems, mainly due to the mailbox, so I'd like to extract the handling of the mailbox into a separate repository, which I can then use from both GPU_FFT and QPULib. |
This is an issue for discussing general things.