Way forward #1

notro · 2015-02-26T12:23:45Z

I'm picking up from notro/rpi-firmware#18

You are right: for all practical purposes it is not possible to run than 32k (0xffff seems stupid) and would require multiple adjacent pages in physical space - possible, but hard...

I don't follow you here, are you talking about the dma mask?

So it seems you will have to have multiple transfers in a spi message one per page.

Why only one page per transfer? Is it stated somewhere that the SPI controller has this limitation? Or the DMA controller?
I have tried 64kB and it worked.

That trick with device-tree is nice - I did not think of that - but under some circumstances it would be helpfull to load multiple devices. But then, maybe we could go the other way: implement a compatibility in the old driver and ask the foundation to start using spi-bcm2835 in the dt overlays they provide.

spi-bcm2835 is patched so it can be built on ARCH_BCM2708.
But I don't see any point in enabling in it kconfig before it has some advantage over spi-bcm2708 apart from removing a message from the kernel log.

do you have an idea where the foundation hides their dt-overlay sources, so that I can provide a merge request to prepare the move to spi-bcm2835?

https://github.com/raspberrypi/linux/tree/rpi-3.18.y/arch/arm/boot/dts
But I think @popcornmix is reluctant to do that, because there might be users of the 9-bit support in spi-bcm2708.
He said so when I checked to see if there was interest in using the DMA driver raspberrypi/linux#567

The big advantage of the spi-bcm2835 is that it ist already upstream, so we get the official review from kernel and have less of a fight when we incorporate it officially...

I agree it's best to get it included upstream first. When it is, Dom won't hesitate in pulling it into raspberry/linux.

So what are my low-hanging fruit:
Latency between CS-low and the first byte sent (3us) due to the comment saying that it is not possible as the first irq triggers immediately (which is not true)

Yes I saw that comment.

transfer as many bytes in/out as the chipset says it can handle - not stopping after 12 to 16 bytes while we could feed it more. This would reduce the number of interrupts

My brief testing showed that the FIFO is larger than 120 bytes: https://github.com/notro/spi-bcm2708/wiki#fifo-size

add lossi/9bit mode (so that it becomes easier to move to the new driver in all situations) - I hear you have a device that supports that so maybe you could test it?

I have a display that uses this, and that was why I addded support for it in spi-bcm2708. The problem is that LoSSI is not generic 9-bit operation.
Sending certain values (MIPI DCS read register) triggers a 8/24/32 bit read in hardware, so I don't think this is acceptable. Hence I haven't made a patch for spi-bcm2835.
I didn't "get" this when I made the patch, or else I would have dropped it. It's mentioned in the datasheet (spi-bcm2708 wiki).
I work around this in my driver by sending 8x9-bit as 9x8-bit. Since register zero is a no-op and pixel data is on at least 8-byte boundaries, I get away with this.

The Auxiliary peripherals SPI controller is also available on the header now. I think it's better to implement other bit lengths (1-32) in a driver for this instead. It can't do DMA however.

move multiple transfers into the interrupt handler without having to wake up a thread and the corresponding late cues, which will speed up transfers that are write then read (so your touch controller)

The client driver would have to be optimized as well, wouldn't it? Putting write and read in the same SPI message?

fix CS polarity issues - this may mean any gpio can do CS!

I have seen requests for more than two CS lines from people having a SPI touchdisplay occupying the entire SPI bus.

There is also this issue with the clock divider limitation: https://github.com/notro/spi-bcm2708/wiki#spi-clock-divider
Since you have a logical analyzer, you would be able to check if there are any limitaions at all on the CDIV register.

DMA
Now that my FBTFT drivers are accepted in mainline (staging), they can be pulled into raspberrypi/linux. When that is done I need a DMA SPI driver there as well. Then I can stop releasing my prebuilt kernels for FBTFT, relieving me of that maintanence burden.
I found this patch that implements can_dma in a very clean and readable way: torvalds/linux@f62cacc
Combined with Gellert's slave_sg implementation and your spi-bcm2708, it shouldn't be to hard to get working.

The text was updated successfully, but these errors were encountered:

notro · 2015-02-26T15:06:22Z

I forgot to mention that SPI_3WIRE is missing support in the driver: https://github.com/notro/spi-bcm2708/wiki#bidirectional-mode
(datasheet: 10.2.2 Bidirectional mode, REN bit in SPI_CS)

msperl · 2015-02-26T16:25:14Z

actually (and again as per official documentation) you can only do DMA transfers of 0-65535 bytes (16 bits) - at least that is the way I read it.
But if I remember it IS possible to transfer bigger "chunks", but that is not officially documented (which I have mostly used for my delays).

As for clocks: I have not had any issues with the dynamic clocks, but then I have to admit I only used my "preferred" frequencies of 4/8/10MHz as the goal...

As for the clocks there is also somewhere else (i believe with the UART docu) some mentioning of how the chip does clock spreading.

In the end the clock might be slightly asymetrical with clock dividers that are not a power of two, but that is typically NOT an issue, as lots of devices can handle that assymetry...

As for pages:
Obviously the addresses that are "linear" in linux space are not necessarily linear in DMA/VideoCore space (but most often they are). So either some sort of scatter/gather process is needed (based on page boundries) or your driver is submitting a SPI_message with lots of distinct SPI_transfers.

So I wonder what you have really implemented - say transfer these 192000 (= 320_200_3) dma-mapped bytes (which would mean that these would be guaranteed to be dma-mapped in sequential order or not)

msperl · 2015-02-26T16:46:34Z

See what I have found and measured so far - the wiki contains all of it.
These scheduling delays may be a major headache, but we have shaved of about 3us already...

msperl · 2015-02-26T20:57:30Z

The more I read of LoSSI the more I come to the conclusion that with the code now it can not work.
The question is: is it needed by anyone?

msperl · 2015-02-27T06:17:26Z

Some more thought on this gives me the impression that the "read" operations are not implemented at all (as you said: read would require knowledge of the last command and then we would need to read the Fifo differently)

But in the end: as far as I understood you are not using it for any of your tft drivers (you use the workarround 8_9bit = 9_8bit, which is actually more efficient memory wise) and those seem the only ones that use this mode.

Is this a correct assumption?

msperl · 2015-02-27T06:19:04Z

Next thing I will do is enable the realtime priority with the bcm2835 driver and put up the differences on the wiki including screenshots...

msperl · 2015-02-27T15:12:06Z

There have been now a lot of fixes already applied - some of which I can not really test due to missing HW.
See the other open issues...
Now I can progress to other optimizations, which are:

minimal transfers (multiple) done via interrupt alone
Transfers >64 bytes trigger DMA if is_dma_mapped (@notro: I guess this would be of most interrest to you)

Please look if you find any obvious issues...

msperl · 2015-02-28T11:42:25Z

Actually I have found a quick workaround for my issues by implementing polling for situations where we run the interrupt which then wakes up the worker thread for requests that run <20us.

To put it in perspective:

spi-bcm2835 - original takes (at least with this example) 358us to complete a single CAN-message-receive (with 58us for the time interrupt to CS down).
the fully patched version with polling for up to 20us runs in 160us to complete a single CAN-message-receive (with 64us for the time interrupt to CS down).

This does not solve the issue of long term transfer delays, but at still solves part of the issue.

msperl · 2015-02-28T11:46:24Z

We could probably modify the code to use DMA for anything that lies beyond this.
This would mean minimal overheads/penalties for most cases.
The only thing that is open to question is: how long should we poll/is 20us too much (it is geared towards my use-case, where I need to transfer 15 bytes)?

msperl · 2015-02-28T11:51:20Z

reducing the time to 10us will mean an interrupt getting scheduled and that means a penalty of 36us.
The data has been added and I will include the graphs in the wiki page...

msperl · 2015-02-28T16:41:47Z

In the meantime I have figured out a test for most use-cases (CS-polarity and 3Wire)
this only leaves LoSSI open.

notro · 2015-03-02T14:57:56Z

Martin I'll pick up this discussion later in the week.
I prepared and sent a patchset this weekend, so I'll have to spend some time on the replies to that. The kernel merge window closed a week ago, so I guess most people have been able to catch their breath.

msperl · 2015-03-15T19:17:09Z

Note that there is a major issue with latencies of interrupts - need to document this.
Obviously it is more performant with a RPI2, but still far from perfect.

popcornmix · 2015-03-16T14:24:46Z

We are aware of interrupt latency causing lirc driver issues when network traffic is high (e.g. playing a high bitrate video from kodi over network makes remote control unreliably).

From a hardware point of view we can assign the fiq to a core. We can assign the system (i.e. GPU) interrupts to a core. The local timers and mailboxes are core specific.

The USB interrupt comes from a FIQ, and when there is non-trivial work required, a (fake) system IRQ is triggered. We believe work done from this point harms latency of other interrupts.

@P33M is going to move the "fake" IRQ used by the USB system to use a core specific mailbox interrupt (on a core different to the normal system IRQs) which will hopefully significantly improve interrupt latency.

msperl · 2015-03-16T15:30:02Z

@popcornmix: why not handle parts of the USB interrupt via RTOS - from what I have looked at, it looks as if the VideoCore has a fully vectored interrupt handler, so calling the specific interrupts directly.
maybe that way irq handling could get improved and the corresponding latencies reduced...

Attached some measurements from the SPI driver with several pieces of the code insturmented to pull GPIOs high/low.

You see that: the interrupt takes from:
A1 to E2 or about 5.65us
of that we spend:

0.65us to get into handle_irq (A1-B1)
- I have not figured out the entry point yet to identify the REAL interrupt handler, so
2.00us to call the "correct" SPI interrupt handler (B1-C1)
** here we spend 0.20us in armctrl_mask_irq
0.80us to handle the interrupt itself (C1-D1)
2.20us from end of interrupt till the exit of handle_irq (E1-E2)
- here we spend another 0.20us in armctrl_unmask_irq.
there is probably some more of the cleanup stuff that has not been identified

So if you look at this then we spend less than 15% of the whole time in interrupt in the real interrupt handler routine. Note also that the IRQ "shutdown" takes longer than transferring 2 bytes over SPI at 8MHz.

We might see the same thing also with the USB irq.

Note that this measurement has been taken on a RPI2 with the foundation kernel.
I am looking into measuring the same on a ModelB+ with the upstream stock kernel.

msperl · 2015-03-16T15:39:50Z

Another observation:
In general there are 2 interrupts that happen every 8000.9us on a VERY regular basis.

the first one takes 22.45us
the second takes 18.15us
time between end of first and the start of the second is 104.55us

and then there is sometimes an interrupt that does only enter do_irq but does not call armctl_*mask_irq (so the mask/unmask pin toggle does not happen)
but this one is NOT happening all the time, but still in regular intervals.

You might be able to identify which those are, but I would guess that the 2 irqs that hapen back2back are related to USB.

msperl · 2015-03-16T16:26:54Z

Note: here the patch with the instrumentation:
raspberrypi/linux@rpi-3.18.y...msperl:rpi-3.18.y-irq_instrumented
obviously you also need a fast logic-analyzer to get those measurements...

msperl · 2015-03-23T19:03:29Z

A few of those patches have been in the spi tree - which shall go into linux-next, so we are moving forward...

msperl · 2015-03-23T19:41:20Z

Status is right now:
I can run 4 distinct devices on my RPI now using arbitrary GPIOs as CS.
2x mcp2515
1x fb_st7735r
1x enc28j60

The only thing I have not been able to handle is an SD-card.

I still found one "awkward" situation in case I disable one of the mcp2515s in the Device-tree
then the CS for that stays low and does not get initialized so this "disabled" mcp2515 still feels the need to answer.
The problem is: setup does not get called for the disabled device, so the gpio does not get set as it should...

I fear that this is a "generic" issue that I need to bring up on the SPI list.

msperl · 2015-03-24T18:06:40Z

the following commits:
ddb9376c836548263e6744bf7636181589b06d16 - fix all checkpath --strict messages
066aa208a57ff914dd56540830f69b6a685fd700 - allow arbitrary GPIO to act as SPI-chip_select
253e15ed751a96c57aeebe3bee5ca633eba8dd87 - fill/drain SPI-fifo as much as possible
are now in: //git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
and there in branch: for-next

msperl · 2015-03-24T18:10:29Z

next step:

avoid initial irq to transmit data - this reduces delays by 4us
implement polling for short transfers - an interrupt takes about 8-12us from start to end, so that seems an acceptable time to poll...

notro · 2015-03-24T19:28:03Z

Glad to see the progress.

msperl · 2015-03-24T21:52:53Z

I guess that DMA will be the 3rd in the list now - to start assuming that we have spi_message.is_dma_mapped=true and not relying on can_dma functionality...
@notro: Maybe you can have your fbtft driver generate such spi_messages? Is there something like this in place already - maybe with a DT or a module-parameter?

notro · 2015-03-25T16:00:27Z

to start assuming that we have spi_message.is_dma_mapped=true and not relying on can_dma functionality...

I don't understand.

By default fbtft enables is_dma_mapped for pixel data transfers: https://github.com/torvalds/linux/blob/master/drivers/staging/fbtft/fbtft-io.c#L27

msperl · 2015-03-26T08:06:26Z

With can_dma you can define if the framework should try to do dma_mapping/unmapping for you - return 1 to request mapping, 0 for no mapping.

This only applies if the tx_dma is not set.

Essentially it is just code to decide if it is worth the effort or not - dma might have limitations and also setup times may be expensive compared to the gain.

So in our case it would make sense to implement this for transfers >64 bytes, as then we can avoid interrupts (or maybe after 48 bytes if I interpret the DC-register and there RPANIC and TPANIC correctly).

I guess I should check when the irq really starts to run...

But as you are using dma_mapped already this is actually not important...

msperl · 2015-03-26T10:31:51Z

see the patch used for upstream is as of now available as: https://github.com/msperl/spi-bcm2835/tree/patch_the_upstream_use_transfer_one
it contains a function that is not available in 3.18, but basic test shows that the driver still works as expected.
Further testing would be needed to see if there are any error-conditions that we miss because of this.
Hopefully this version will make it upstream...

notro · 2015-03-26T12:00:00Z

Sorry, should have been more specific.
I understand that can_dma is part of dma mapping, but I didn't understand what is_dma_mapped had to do with anything.
The spi core doesn't care about is_dma_mapped, and the driver doesn't have to either, because core handles dma mapping.
This makes it much easier to write client drivers who wants dma, but has vmalloc'ed memory for instance. The spi core breaks it up into pages and makes sure everything is ok for dma transfer.
It even breaks up buffers that are too large for dma.

notro · 2015-03-26T12:03:52Z

Have you seen this in https://github.com/raspberrypi/linux/blob/rpi-3.18.y/drivers/dma/bcm2708-dmaengine.c

#define MAX_LITE_TRANSFER 32768
#define MAX_NORMAL_TRANSFER 1073741824

This indicates that the Pi is capable of VERY large DMA transfers (1GB).
And if you look at bcm2835_dma_prep_slave_sg() you can see that it breaks up transfers that is too large.
So, is there a limit on the spi peripheral maybe?

notro · 2015-03-26T12:11:03Z

PR raspberrypi/linux#909 puts in place the necessary pieces to get DMA channels from DT.
I also have a patch for drivers/dma/bcm2835-dma.c that will eventually be sent upstream. Need some more testing first.

msperl · 2015-03-26T16:43:31Z

Well - the is_dma_mapped is the easier portion if the driver supports it.
getting the scatter/gather lists and transfering them is still up to the driver to pass that list on to the dmaengine.

So for the initial things I really like to get the "simple" dma working before I address the "complex" version...
And before that I want to get those "simple" things that mean huge improvements in place...

I guess you saw my email with the patch for upstream, which hopefully might go into 4.1.

So if you can run some independent tests and send a "Tested-by:" it would be helpfull to see that working with you collection of displays - you should be able to run a lot of those displays now concurrently...

msperl · 2015-03-26T16:46:16Z

As for the transfers: that 1GB size is surprizing, but I guess it is mostly an artificial limit imposed by the 1+1+1+1G split for the different caching zones.
BTW: I am also getting an USB-soundcard to see if I can reproduce those "clicks" when running Xclock...

msperl · 2015-04-12T20:29:30Z

Actually lots of things have been merged into spi/for-next.
So these may get into 4.1, which should improve the overall situation.
Now (when the latest) build is done I will continue with the dma portion - I come have to realize that as soon as the transfer exceeds 48 bytes (or maybe 64?) it becomes beneficial to run in dma mode as we want to avoid interrupts.
For now I will only handle is_dma_mapped transfers - those scatter/gather things need to get addressed separately.

msperl · 2015-04-13T15:46:57Z

The following patches are scheduled for 4.1:
Martin Sperl (9):
spi: bcm2835: fix all checkpath --strict messages
spi: bcm2835: fill/drain SPI-fifo as much as possible during interrupt
spi: bcm2835: clock divider can be a multiple of 2
spi: bcm2835: enable support of 3-wire mode
spi: bcm2835: move to the transfer_one driver model
spi: bcm2835: fix code formatting issue
spi: bcm2835: fill FIFO before enabling interrupts to reduce interrupts/message
spi: bcm2835: transform native-cs to gpio-cs on first spi_setup
spi: bcm2835: enabling polling mode for transfers shorter than 30us

So lots of improvements...

We might want to push those into the foundation kernel...

notro · 2015-04-15T13:06:02Z

those scatter/gather things need to get addressed separately.

You can't avoid that. That's the api used for dma transfers. And you get alot of help from the SPI core to use it. I don't think you will get away in mainline with setting up dma CB's directly in the driver.
I sent the necessary bits earlier today: [PATCH] dmaengine: bcm2835: Add slave dma support

notro · 2015-04-15T13:20:11Z

AFAICT there is no cap on how large a SPI DMA transfer can be. If the dma engine driver detects a Lite channel, it chops up the transfer in 64k-1 chunks. Seems the SPI driver doesn't have to care about that.

From bcm2835-dma.c patch:

+   if (c->ch >= 8) /* LITE channel */
+       max_size = MAX_LITE_TRANSFER;
+   else
+       max_size = MAX_NORMAL_TRANSFER;
+
+   /*
+    * Store the length of the SG list in d->frames
+    * taking care to account for splitting up transfers
+    * too large for a LITE channel
+    */

msperl · 2015-04-15T13:20:57Z

actually the spi framework has 2 DMA options:

spi_message.is_dma_mapped=1 and spi_transfer.tx_dma and spi_transfer.rx_dma set for all - this exists for quite a few years now)
the can_dma method with scatter-gather support

There is no support (yet) in the framework that would translate is_dma_mapped=1 cases to "normal"
scatter/gather type of requests.

I guess that at some point the is_dma_mapped case may become depreciated - when there is a translation to scatter/gather...

So right now you have to implement each one of those separately or you tell the system that it needs to remap the areas every time even though they are mapped (which is a waste of CPU resources, if the driver has already mapped it).

And as your drivers currently implement is_dma_mapped, this works immediately...
ScatterGather may become interresting for big transfers that are NOT dma_mapped.

The big question that remains is: how long does it take to map something?
But we will find that out and then can make a decission at what point we should restart

msperl · 2015-04-15T14:24:52Z

Looking at your patch quickly I would actually recommend that you do NOT split on 63335 for the LITE channels!

Because transfers of 65535 bytes to SPI-FIFO via DMA means that you actually transfer 16384 (4-byte)words to the fifo where the 16384th word contains a 0x00 as the final byte.
This gets counted as a byte to transfer!

So you should actually separate it on 32768 if above 65535 for the lite channel...

Martin

notro · 2015-04-15T14:52:58Z

Would (64k - 4) cover it?
Do you know if there is a minimum DMA transfer size? I have experienced that a 1 byte SPI DMA transfer hangs the DMA spi-bcm2708.

msperl · 2015-04-15T15:06:26Z

well - as the minimum DMA transfer is 4 bytes anyway that is not the real issue.
length of the transfer is controlled via DLEN=1 and then you push in 1 word into the fifo.
afterwards you have to CLEAR_RX/CLEAR_TX!

It might actually be possible that the RX-DMA is stalling if it is not a minimum of 4 bytes in the SPI case, as the SPI may not triggeri the DREQ - but it should if DLEN is set correctly....

msperl · 2015-04-15T15:09:52Z

that is actually one of the reasons why I want to implement the is_dma_mapped event first.

As for 65532 - it probably would work, but IMO if you have something like this then try to use a power of 2, as there may be other cases that require some sort of alignment - that way you play it save - I would go for at least SZ_64K-SZ_4K to give you a page size of "margin"

notro · 2015-04-15T15:16:25Z

From datasheet:

DMA channel 0 and 15 are fitted with an external 128 bit 8 word read FIFO.

So (64k - 16) would cover that. But you're right, there are probably more allignment issues here. Let's see what the DMA people has to say about your comment.

popcornmix · 2015-04-15T15:50:10Z

From spec:

DMA LITE Engines
Several of the DMA engines are of the LITE design. This is a reduced specification engine designed to save space. The engine behaves in the same way as a normal DMA engine except for the following differences.

The internal data structure is 128 bits instead of 256 bits. This means that if you do a 128 bit wide read burst of more than 1 beat, the DMA input register will be full and the read bus will be stalled. The normal DMA engine can accept a read burst of 2 without stalling. If you do a narrow 32 bit read burst from the peripherals then the lite engine can cope with a burst of 4 as opposed to a burst of 8 for the normal engine. Note that stalling the read bus will potentially reduce the overall system performance, and may possible cause a system lockup if you end up with a conflict where the DMA cannot free the read bus as the read stall has prevented it writing out its data due to some circular system relationship.

The Lite engine does not support 2D transfers. The TDMODE, S_STRIDE, D_STRIDE and YLENGTH registers will all be removed. Setting these registers will have no effect.

The DMA length register is now 16 bits, limiting the maximum transferrable length to 65536 bytes.

Source ignore (SRC_IGNORE) and destination ignore (DEST_IGNORE) modes are removed. The Lite engine will have about half the bandwidth of a normal DMA engine, and are intended for low bandwith peripheral servicing.

This suggests 65536 is allowed (even though technically it requires 17 bits). Have you tried it?

msperl · 2015-04-15T17:24:13Z

The missing xxx_ignore make the Lite dma's unpractical from an spi perspective, as they are typical activities that can happen - especially when pushing data to a tft or similar - lots of data but only one direction...

The 17bit feature may be again one of those errata in the document (like we have seen so many in spi alone + the hw-strangeness we have seen in border cases with regards to native-io) - I would play it by the book and use 32k and if you find time then use 64k for a specific use-case to test it.

Anyway: I will need to review how the transfers work when using dma-engine - for example: does it support "null" as the pointers and will map it to ignore...

popcornmix · 2015-04-15T17:27:51Z

The lack of SRC_IGNORE can be worked around. Just allocate a small zeroed buffer and set source to non-increment.

msperl · 2015-04-15T20:05:55Z

that is true...
As for 7-15_TXFR_LEN the datasheet explicitly states on page 57:
bits 15:0 XLENGTH Length of transfer, in bytes. Updated by the DMA engine as the transfer progresses.
65536 could also mean 0 DMA length...
But in the end I actually trust the register-map more than the text arround it in those "extra" sections.

Also if I look at the extracted information from the GPU code shared some time ago with regards to the registers for DMA7: https://github.com/msperl/rpi-registers/blob/master/md/Region_DMA7.md
then that info also explicitly states bits 0-15 only and also the corresponding masks for setting and clearing.

All this would indicate to me that the text on page 63 4.5 point 3 is containing a off-by-1 error, so 65535 is "correct"...

On the other hand - the fact that we fetch words and store words as well means that
for len=65532 we would transfer 16383 words
but for len=65533, 65534, 65535 we would transfer 16384 words.
So in the end we would be transferring 64K bytes if we set len= 64K-1.

Arguing that case explaining why this is the case would be quite complex though - also it would need to get confirmed...

Setting it to 32K seems more sensible and avoids all sorts of questions, mentioning that there is a discrepancy in the documentation about 64K and 16 bit limits would be welcome though.
Someone brave may try to run 65535=65536 at some point and come up with a better solution.

But in the end I wonder how many requests do we have that have are >32K with scatter-gather?
If it is really scatter/gather then blocks would typically come from user-space or page-cache and then
you would have 4K blocks for scatter/gather - unless we have huge-pages enabled (but I do not know if that is enabled for ARM/Raspberry PI). So the likleyhood of this happening would be low.
Normal spi drivers on the contrary would either have small messages or they would run a single huge memory block from cma that already comes dma_mapped so they would not use scatter/gather...

msperl · 2015-04-15T20:08:27Z

See raspberrypi/linux#930 for the request to merge the existing patches into the foundation kernels...

P33M · 2015-04-15T22:46:10Z

The transfer length/width field for lite channels can be resolved easily.

Allocate a buffer of length 64k (or 128k) and fill it with a pattern (e.g. 0x55)
Allocate a second buffer of identical length, fill it with a different pattern
Create a transfer on a Lite channel of max length from buffer 1 to buffer 2, see how many bytes get written.
Varying source/destination widths may produce different results - test both 128-bit write and 32-bit write cases.

popcornmix · 2015-04-16T10:24:31Z

I asked the author of the DMA hardware. 65535 is the maximum transfer length of a LITE channel.
65536 will be treated as zero which is undefined (it will actually do one transfer then stop)

msperl · 2015-04-16T11:21:44Z

thanks - the documentation was right there then!
Recommendation on the distribution list is to use either 0x8000 or 0xf000.

msperl · 2015-04-17T07:55:54Z

@notro: after some more reading on dmaengine I now understand why we need that "scatter/gather" patch to dmaengine that you have started the process to upstream.

So for now (that the upstream patches have made it into rpi-4.0.y) I will need to develop the dma implementation on the foundation kernels (as we have something working there - it seems) - when that works then we may be merging both together into upstream as an implementation and client using the implementation and by that avoiding the mmc discussions for now.
Ideally your patch would still go into 4.1.

msperl · 2015-04-17T08:22:48Z

But there is one other concern that comes to my mind with DMA that I am slightly unsure and that could result in issues if not handled correctly:

Say we have a transfer set up like this (using the direct bcm2835 control-blocks):

DEST_AD = 0x00000F01
TXFR_LEN = 0xFF

This results in an effective requested range of [0x00000F01:0x0000FFF]

But as we have 32 bit as minimum transfer length on the DMA this means that we effectively would write to: [0x00000F01:0x0001000]

So the DMA be writing to two 4k-pages: 0x00000000 and 0x00001000 even though in principle we would should only write to 0x00000000.

This could mean we overwrite some structures that are outside of our scope and could crash the kernel or some independent user-space if the page 0x00001000 is not assigned to the same process.

For typical kernel usage it might be less of an issue as the structures would be 4 byte aligned anyway - at least in the majority of cases. But especially with data that comes from user-space it could become an issue if the start address is not aligned on 4 byte addresses (or 16 in the case of wide transfers)

That is all "ifs" for now:

I have not read the code if there are some sort of alignment requirements inside the dma-engine
nor do I know if the HW would write to the second page at all if it is not really necessary (as in the constructed situation above)

I just want to be cautious, so I mention those possible cases here if they have not been addressed...

popcornmix · 2015-04-17T11:17:31Z

Note that DMA deals in bus addresses and there is no concept of pages being assigned to user processes. DMA does not go through ARM MMU so will not generate page faults.

Obviously you need to take care when setting up DMA to not access memory you shouldn't.

DMA also handles byte-enables on SDRAM reads and writes, so even if it is set to use 128-bit accesses, you can still have byte aligned start address, destination address and length. Remember again that DMA deals with bus addresses that bypass arm MMU and arm caches, so care needs to be taken to ensure cache coherency.

Byte-enables don't exist on peripheral addresses, so in general peripheral addresses need to be 32-bit wide, non-incrementing, with a 32-bit aligned address and length.

msperl · 2015-04-17T14:22:17Z

So from what I read in your statement is that we should be safe for all cases - we always only write the amount of bytes we are supposed to write even when the start/end is not 32 bit aligned - in the case of 13 bytes we also only write 13 bytes regardless of if we read 32 or 256 bit wide or not. I am slightly doubtful if this works as expected with a 3 byte transfer to the FIFO registers, that always expect a 32-bit write

I guess we should confirm those assumptions even though I have not seen any issues with my old dma solution - but I guess I did forget some issues and I have not tested all those corner-cases and did not have lots of devices connected...

Worsted case if it does not work as expected then we will need to handle dma-transfers to and from a device where the length is not a multiple of 4 in a special manner say by doing the last transfer to a bounce buffer and copy the 1 to 3 bytes to the "correct" location.

As for cache coherency: the dma_(un)map_single as well as dma_alloc_coherent should take care of those caching issues and sync the CPU-caches if needed.

popcornmix · 2015-04-17T14:29:29Z

The byte enables and unaligned data comment referred to the SDRAM end of the DMA transfer (often both ends are SDRAM when using DMA more copying/blitting).

I did say this didn't apply to peripheral addresses, which should still be 32-bit aligned in length. I would expect using DMA to write 3 bytes to a 32-bit peripheral register would be undefined. It possibly has the behaviour you want, but it's not clear if those 3 bytes should end up in bits 23:0 or 31:8 of the peripheral register, or if it the write won't occur at all as the transfer is not complete.

notro · 2015-04-17T16:43:48Z

DMA is byte aligned on 8-bit SPI, which is also proven by your DMA version of spi-bcm2708

LEN_LONG
Enable Long data word in Lossi mode if DMA_LEN is set
0= writing to the FIFO will write a single byte
1= wrirng to the FIFO will write a 32 bit word

msperl self-assigned this Feb 27, 2015

msperl added the question label Feb 27, 2015

Way forward #1

Way forward #1

Comments

notro commented Feb 26, 2015

notro commented Feb 26, 2015

msperl commented Feb 26, 2015

msperl commented Feb 26, 2015

msperl commented Feb 26, 2015

msperl commented Feb 27, 2015

msperl commented Feb 27, 2015

msperl commented Feb 27, 2015

msperl commented Feb 28, 2015

msperl commented Feb 28, 2015

msperl commented Feb 28, 2015

msperl commented Feb 28, 2015

notro commented Mar 2, 2015

msperl commented Mar 15, 2015

popcornmix commented Mar 16, 2015

msperl commented Mar 16, 2015

msperl commented Mar 16, 2015

msperl commented Mar 16, 2015

msperl commented Mar 23, 2015

msperl commented Mar 23, 2015

msperl commented Mar 24, 2015

msperl commented Mar 24, 2015

notro commented Mar 24, 2015

msperl commented Mar 24, 2015

notro commented Mar 25, 2015

msperl commented Mar 26, 2015

msperl commented Mar 26, 2015

notro commented Mar 26, 2015

notro commented Mar 26, 2015

notro commented Mar 26, 2015

msperl commented Mar 26, 2015

msperl commented Mar 26, 2015

msperl commented Apr 12, 2015

msperl commented Apr 13, 2015

notro commented Apr 15, 2015

notro commented Apr 15, 2015

msperl commented Apr 15, 2015

msperl commented Apr 15, 2015

notro commented Apr 15, 2015

msperl commented Apr 15, 2015

msperl commented Apr 15, 2015

notro commented Apr 15, 2015

popcornmix commented Apr 15, 2015

msperl commented Apr 15, 2015

popcornmix commented Apr 15, 2015

msperl commented Apr 15, 2015

msperl commented Apr 15, 2015

P33M commented Apr 15, 2015

popcornmix commented Apr 16, 2015

msperl commented Apr 16, 2015

msperl commented Apr 17, 2015

msperl commented Apr 17, 2015

popcornmix commented Apr 17, 2015

msperl commented Apr 17, 2015

popcornmix commented Apr 17, 2015

notro commented Apr 17, 2015