-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Way forward #1
Comments
I forgot to mention that SPI_3WIRE is missing support in the driver: https://github.com/notro/spi-bcm2708/wiki#bidirectional-mode |
actually (and again as per official documentation) you can only do DMA transfers of 0-65535 bytes (16 bits) - at least that is the way I read it. As for clocks: I have not had any issues with the dynamic clocks, but then I have to admit I only used my "preferred" frequencies of 4/8/10MHz as the goal... As for the clocks there is also somewhere else (i believe with the UART docu) some mentioning of how the chip does clock spreading. In the end the clock might be slightly asymetrical with clock dividers that are not a power of two, but that is typically NOT an issue, as lots of devices can handle that assymetry... As for pages: So I wonder what you have really implemented - say transfer these 192000 (= 320_200_3) dma-mapped bytes (which would mean that these would be guaranteed to be dma-mapped in sequential order or not) |
See what I have found and measured so far - the wiki contains all of it. |
The more I read of LoSSI the more I come to the conclusion that with the code now it can not work. |
Some more thought on this gives me the impression that the "read" operations are not implemented at all (as you said: read would require knowledge of the last command and then we would need to read the Fifo differently) But in the end: as far as I understood you are not using it for any of your tft drivers (you use the workarround 8_9bit = 9_8bit, which is actually more efficient memory wise) and those seem the only ones that use this mode. Is this a correct assumption? |
Next thing I will do is enable the realtime priority with the bcm2835 driver and put up the differences on the wiki including screenshots... |
There have been now a lot of fixes already applied - some of which I can not really test due to missing HW.
Please look if you find any obvious issues... |
Actually I have found a quick workaround for my issues by implementing polling for situations where we run the interrupt which then wakes up the worker thread for requests that run <20us. To put it in perspective:
This does not solve the issue of long term transfer delays, but at still solves part of the issue. |
We could probably modify the code to use DMA for anything that lies beyond this. |
reducing the time to 10us will mean an interrupt getting scheduled and that means a penalty of 36us. |
In the meantime I have figured out a test for most use-cases (CS-polarity and 3Wire) |
Martin I'll pick up this discussion later in the week. |
Note that there is a major issue with latencies of interrupts - need to document this. |
We are aware of interrupt latency causing lirc driver issues when network traffic is high (e.g. playing a high bitrate video from kodi over network makes remote control unreliably). From a hardware point of view we can assign the fiq to a core. We can assign the system (i.e. GPU) interrupts to a core. The local timers and mailboxes are core specific. The USB interrupt comes from a FIQ, and when there is non-trivial work required, a (fake) system IRQ is triggered. We believe work done from this point harms latency of other interrupts. @P33M is going to move the "fake" IRQ used by the USB system to use a core specific mailbox interrupt (on a core different to the normal system IRQs) which will hopefully significantly improve interrupt latency. |
@popcornmix: why not handle parts of the USB interrupt via RTOS - from what I have looked at, it looks as if the VideoCore has a fully vectored interrupt handler, so calling the specific interrupts directly. Attached some measurements from the SPI driver with several pieces of the code insturmented to pull GPIOs high/low. You see that: the interrupt takes from:
So if you look at this then we spend less than 15% of the whole time in interrupt in the real interrupt handler routine. Note also that the IRQ "shutdown" takes longer than transferring 2 bytes over SPI at 8MHz. We might see the same thing also with the USB irq. Note that this measurement has been taken on a RPI2 with the foundation kernel. |
Another observation:
and then there is sometimes an interrupt that does only enter do_irq but does not call armctl_*mask_irq (so the mask/unmask pin toggle does not happen) You might be able to identify which those are, but I would guess that the 2 irqs that hapen back2back are related to USB. |
Note: here the patch with the instrumentation: |
A few of those patches have been in the spi tree - which shall go into linux-next, so we are moving forward... |
Status is right now: The only thing I have not been able to handle is an SD-card. I still found one "awkward" situation in case I disable one of the mcp2515s in the Device-tree I fear that this is a "generic" issue that I need to bring up on the SPI list. |
the following commits: |
next step:
|
Glad to see the progress. |
I guess that DMA will be the 3rd in the list now - to start assuming that we have spi_message.is_dma_mapped=true and not relying on can_dma functionality... |
I don't understand. By default fbtft enables is_dma_mapped for pixel data transfers: https://github.com/torvalds/linux/blob/master/drivers/staging/fbtft/fbtft-io.c#L27 |
With can_dma you can define if the framework should try to do dma_mapping/unmapping for you - return 1 to request mapping, 0 for no mapping. This only applies if the tx_dma is not set. Essentially it is just code to decide if it is worth the effort or not - dma might have limitations and also setup times may be expensive compared to the gain. So in our case it would make sense to implement this for transfers >64 bytes, as then we can avoid interrupts (or maybe after 48 bytes if I interpret the DC-register and there RPANIC and TPANIC correctly). I guess I should check when the irq really starts to run... But as you are using dma_mapped already this is actually not important... |
see the patch used for upstream is as of now available as: https://github.com/msperl/spi-bcm2835/tree/patch_the_upstream_use_transfer_one |
Sorry, should have been more specific. |
Have you seen this in https://github.com/raspberrypi/linux/blob/rpi-3.18.y/drivers/dma/bcm2708-dmaengine.c #define MAX_LITE_TRANSFER 32768
#define MAX_NORMAL_TRANSFER 1073741824 This indicates that the Pi is capable of VERY large DMA transfers (1GB). |
PR raspberrypi/linux#909 puts in place the necessary pieces to get DMA channels from DT. |
Well - the is_dma_mapped is the easier portion if the driver supports it. So for the initial things I really like to get the "simple" dma working before I address the "complex" version... I guess you saw my email with the patch for upstream, which hopefully might go into 4.1. So if you can run some independent tests and send a "Tested-by:" it would be helpfull to see that working with you collection of displays - you should be able to run a lot of those displays now concurrently... |
As for the transfers: that 1GB size is surprizing, but I guess it is mostly an artificial limit imposed by the 1+1+1+1G split for the different caching zones. |
Actually lots of things have been merged into spi/for-next. |
The following patches are scheduled for 4.1: So lots of improvements... We might want to push those into the foundation kernel... |
You can't avoid that. That's the api used for dma transfers. And you get alot of help from the SPI core to use it. I don't think you will get away in mainline with setting up dma CB's directly in the driver. |
AFAICT there is no cap on how large a SPI DMA transfer can be. If the dma engine driver detects a Lite channel, it chops up the transfer in 64k-1 chunks. Seems the SPI driver doesn't have to care about that. From bcm2835-dma.c patch: + if (c->ch >= 8) /* LITE channel */
+ max_size = MAX_LITE_TRANSFER;
+ else
+ max_size = MAX_NORMAL_TRANSFER;
+
+ /*
+ * Store the length of the SG list in d->frames
+ * taking care to account for splitting up transfers
+ * too large for a LITE channel
+ */ |
actually the spi framework has 2 DMA options:
There is no support (yet) in the framework that would translate is_dma_mapped=1 cases to "normal" I guess that at some point the is_dma_mapped case may become depreciated - when there is a translation to scatter/gather... So right now you have to implement each one of those separately or you tell the system that it needs to remap the areas every time even though they are mapped (which is a waste of CPU resources, if the driver has already mapped it). And as your drivers currently implement is_dma_mapped, this works immediately... The big question that remains is: how long does it take to map something? |
Looking at your patch quickly I would actually recommend that you do NOT split on 63335 for the LITE channels! Because transfers of 65535 bytes to SPI-FIFO via DMA means that you actually transfer 16384 (4-byte)words to the fifo where the 16384th word contains a 0x00 as the final byte. So you should actually separate it on 32768 if above 65535 for the lite channel... Martin |
Would (64k - 4) cover it? |
well - as the minimum DMA transfer is 4 bytes anyway that is not the real issue. It might actually be possible that the RX-DMA is stalling if it is not a minimum of 4 bytes in the SPI case, as the SPI may not triggeri the DREQ - but it should if DLEN is set correctly.... |
that is actually one of the reasons why I want to implement the is_dma_mapped event first. As for 65532 - it probably would work, but IMO if you have something like this then try to use a power of 2, as there may be other cases that require some sort of alignment - that way you play it save - I would go for at least SZ_64K-SZ_4K to give you a page size of "margin" |
From datasheet:
So (64k - 16) would cover that. But you're right, there are probably more allignment issues here. Let's see what the DMA people has to say about your comment. |
From spec:
This suggests 65536 is allowed (even though technically it requires 17 bits). Have you tried it? |
The missing xxx_ignore make the Lite dma's unpractical from an spi perspective, as they are typical activities that can happen - especially when pushing data to a tft or similar - lots of data but only one direction... The 17bit feature may be again one of those errata in the document (like we have seen so many in spi alone + the hw-strangeness we have seen in border cases with regards to native-io) - I would play it by the book and use 32k and if you find time then use 64k for a specific use-case to test it. Anyway: I will need to review how the transfers work when using dma-engine - for example: does it support "null" as the pointers and will map it to ignore... |
The lack of SRC_IGNORE can be worked around. Just allocate a small zeroed buffer and set source to non-increment. |
that is true... Also if I look at the extracted information from the GPU code shared some time ago with regards to the registers for DMA7: https://github.com/msperl/rpi-registers/blob/master/md/Region_DMA7.md All this would indicate to me that the text on page 63 4.5 point 3 is containing a off-by-1 error, so 65535 is "correct"... On the other hand - the fact that we fetch words and store words as well means that Arguing that case explaining why this is the case would be quite complex though - also it would need to get confirmed... Setting it to 32K seems more sensible and avoids all sorts of questions, mentioning that there is a discrepancy in the documentation about 64K and 16 bit limits would be welcome though. But in the end I wonder how many requests do we have that have are >32K with scatter-gather? |
See raspberrypi/linux#930 for the request to merge the existing patches into the foundation kernels... |
The transfer length/width field for lite channels can be resolved easily.
|
I asked the author of the DMA hardware. 65535 is the maximum transfer length of a LITE channel. |
thanks - the documentation was right there then! |
@notro: after some more reading on dmaengine I now understand why we need that "scatter/gather" patch to dmaengine that you have started the process to upstream. So for now (that the upstream patches have made it into rpi-4.0.y) I will need to develop the dma implementation on the foundation kernels (as we have something working there - it seems) - when that works then we may be merging both together into upstream as an implementation and client using the implementation and by that avoiding the mmc discussions for now. |
But there is one other concern that comes to my mind with DMA that I am slightly unsure and that could result in issues if not handled correctly: Say we have a transfer set up like this (using the direct bcm2835 control-blocks):
This results in an effective requested range of [0x00000F01:0x0000FFF] But as we have 32 bit as minimum transfer length on the DMA this means that we effectively would write to: [0x00000F01:0x0001000] So the DMA be writing to two 4k-pages: 0x00000000 and 0x00001000 even though in principle we would should only write to 0x00000000. This could mean we overwrite some structures that are outside of our scope and could crash the kernel or some independent user-space if the page 0x00001000 is not assigned to the same process. For typical kernel usage it might be less of an issue as the structures would be 4 byte aligned anyway - at least in the majority of cases. But especially with data that comes from user-space it could become an issue if the start address is not aligned on 4 byte addresses (or 16 in the case of wide transfers) That is all "ifs" for now:
I just want to be cautious, so I mention those possible cases here if they have not been addressed... |
Note that DMA deals in bus addresses and there is no concept of pages being assigned to user processes. DMA does not go through ARM MMU so will not generate page faults. Obviously you need to take care when setting up DMA to not access memory you shouldn't. DMA also handles byte-enables on SDRAM reads and writes, so even if it is set to use 128-bit accesses, you can still have byte aligned start address, destination address and length. Remember again that DMA deals with bus addresses that bypass arm MMU and arm caches, so care needs to be taken to ensure cache coherency. Byte-enables don't exist on peripheral addresses, so in general peripheral addresses need to be 32-bit wide, non-incrementing, with a 32-bit aligned address and length. |
So from what I read in your statement is that we should be safe for all cases - we always only write the amount of bytes we are supposed to write even when the start/end is not 32 bit aligned - in the case of 13 bytes we also only write 13 bytes regardless of if we read 32 or 256 bit wide or not. I am slightly doubtful if this works as expected with a 3 byte transfer to the FIFO registers, that always expect a 32-bit write I guess we should confirm those assumptions even though I have not seen any issues with my old dma solution - but I guess I did forget some issues and I have not tested all those corner-cases and did not have lots of devices connected... Worsted case if it does not work as expected then we will need to handle dma-transfers to and from a device where the length is not a multiple of 4 in a special manner say by doing the last transfer to a bounce buffer and copy the 1 to 3 bytes to the "correct" location. As for cache coherency: the dma_(un)map_single as well as dma_alloc_coherent should take care of those caching issues and sync the CPU-caches if needed. |
The byte enables and unaligned data comment referred to the SDRAM end of the DMA transfer (often both ends are SDRAM when using DMA more copying/blitting). I did say this didn't apply to peripheral addresses, which should still be 32-bit aligned in length. I would expect using DMA to write 3 bytes to a 32-bit peripheral register would be undefined. It possibly has the behaviour you want, but it's not clear if those 3 bytes should end up in bits 23:0 or 31:8 of the peripheral register, or if it the write won't occur at all as the transfer is not complete. |
DMA is byte aligned on 8-bit SPI, which is also proven by your DMA version of spi-bcm2708
|
I'm picking up from notro/rpi-firmware#18
I don't follow you here, are you talking about the dma mask?
Why only one page per transfer? Is it stated somewhere that the SPI controller has this limitation? Or the DMA controller?
I have tried 64kB and it worked.
spi-bcm2835 is patched so it can be built on ARCH_BCM2708.
But I don't see any point in enabling in it kconfig before it has some advantage over spi-bcm2708 apart from removing a message from the kernel log.
https://github.com/raspberrypi/linux/tree/rpi-3.18.y/arch/arm/boot/dts
But I think @popcornmix is reluctant to do that, because there might be users of the 9-bit support in spi-bcm2708.
He said so when I checked to see if there was interest in using the DMA driver raspberrypi/linux#567
I agree it's best to get it included upstream first. When it is, Dom won't hesitate in pulling it into raspberry/linux.
Yes I saw that comment.
My brief testing showed that the FIFO is larger than 120 bytes: https://github.com/notro/spi-bcm2708/wiki#fifo-size
I have a display that uses this, and that was why I addded support for it in spi-bcm2708. The problem is that LoSSI is not generic 9-bit operation.
Sending certain values (MIPI DCS read register) triggers a 8/24/32 bit read in hardware, so I don't think this is acceptable. Hence I haven't made a patch for spi-bcm2835.
I didn't "get" this when I made the patch, or else I would have dropped it. It's mentioned in the datasheet (spi-bcm2708 wiki).
I work around this in my driver by sending 8x9-bit as 9x8-bit. Since register zero is a no-op and pixel data is on at least 8-byte boundaries, I get away with this.
The Auxiliary peripherals SPI controller is also available on the header now. I think it's better to implement other bit lengths (1-32) in a driver for this instead. It can't do DMA however.
The client driver would have to be optimized as well, wouldn't it? Putting write and read in the same SPI message?
I have seen requests for more than two CS lines from people having a SPI touchdisplay occupying the entire SPI bus.
There is also this issue with the clock divider limitation: https://github.com/notro/spi-bcm2708/wiki#spi-clock-divider
Since you have a logical analyzer, you would be able to check if there are any limitaions at all on the CDIV register.
DMA
Now that my FBTFT drivers are accepted in mainline (staging), they can be pulled into raspberrypi/linux. When that is done I need a DMA SPI driver there as well. Then I can stop releasing my prebuilt kernels for FBTFT, relieving me of that maintanence burden.
I found this patch that implements can_dma in a very clean and readable way: torvalds/linux@f62cacc
Combined with Gellert's slave_sg implementation and your spi-bcm2708, it shouldn't be to hard to get working.
The text was updated successfully, but these errors were encountered: