Investigate accelerating Pounder profile configuration #155

ryan-summers · 2020-10-21T06:55:12Z

@jordens has indicated that there is a desire to configure Pounder asynchronously to avoid CPU overhead. Ideally, Pounder could be updated at the 500 kHz sampling rate of Stabilizer.

This task involves investigative work into asynchronously sending Pounder configurations over the QSPI interface.

Specifically, the interest is to send up to two channel configurations (phase, amplitude, and frequency) at a rate of 500 kHz. The CPU overhead for each configuration should be minimal.

Current investigations:

It is possible to configure multiple Pounder registers in a single SPI transaction, so once the transaction has been set up, the data can either be written directly to the QSPI FIFO or transferred by DMA.
Measure the amount of time required to configure a transfer in the QSPI peripheral
Measure the amount of time required to configure a transfer using DMA
Determine if QSPI can operate in "endless transaction" mode

ryan-summers · 2020-10-22T07:46:26Z

I have confirmed that the QSPI can operate without needing the instruction/address phase (e.g. data writes to FIFO only) and have confirmed that we can complete an "endless" transaction where CS is held low and we write data into the QSPI FIFO whenever available.

ryan-summers · 2020-10-22T07:56:38Z

When writing data directly to the QSPI FIFO, 14 bytes took 626 cycles to write, which comes out to a total of ~1.565 microseconds for the operation.
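
As a rough check of that figure against the 500 kHz target, here is a minimal sketch of the arithmetic. The 400 MHz CPU clock is an assumption inferred from 626 cycles taking ~1.565 microseconds; `CPU_HZ`, `UPDATE_HZ`, `cycles_to_ns`, and `cycle_budget` are hypothetical names used only for illustration.

```rust
/// Assumed CPU clock in Hz (inferred: 626 cycles / 1.565 us = 400 MHz).
const CPU_HZ: u64 = 400_000_000;
/// Target update rate in Hz, taken from the issue description.
const UPDATE_HZ: u64 = 500_000;

/// Convert a cycle count to whole nanoseconds at the assumed CPU clock.
fn cycles_to_ns(cycles: u64) -> u64 {
    cycles * 1_000_000_000 / CPU_HZ
}

/// Number of CPU cycles available in one update period.
fn cycle_budget() -> u64 {
    CPU_HZ / UPDATE_HZ
}
```

Under these assumptions `cycles_to_ns(626)` gives 1565 ns against a budget of 800 cycles (2000 ns) per update, so the FIFO write alone would consume roughly three quarters of each period.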

jordens · 2020-10-22T08:12:45Z

That's more than an order of magnitude slower than I expected. Surely there is room for improvement, no? No need to do checks and no need to drain it.

ryan-summers · 2020-10-22T08:34:28Z

I agree - there's a lot of room for optimization here, I'm just trying to understand where the timing is coming from.

The writes to the FIFO occur over the AHB3 bus, which is currently clocked at 200 MHz.

The source below compiles to the assembly that follows:

pub fn write_stream(&mut self, addr: u8, data: &[u8]) {
    unsafe {
        ptr::write_volatile(&self.rb.dr as *const _ as *mut u8, addr);
        for byte in data {
            ptr::write_volatile(&self.rb.dr as *const _ as *mut u8, *byte);
        }
    }
}

│   0x8022e7c <stm32h7xx_hal::qspi::Qspi::write_stream>     push    {r4, r6, r7, lr}                                                                                                                              │
│   0x8022e7e <stm32h7xx_hal::qspi::Qspi::write_stream+2>   add     r7, sp, #8                                                                                                                                    │
│   0x8022e80 <stm32h7xx_hal::qspi::Qspi::write_stream+4>   movw    r0, #20512      ; 0x5020                                                                                                                      │
│   0x8022e84 <stm32h7xx_hal::qspi::Qspi::write_stream+8>   movt    r0, #20992      ; 0x5200                                                                                                                      │
│B+>0x8022e88 <stm32h7xx_hal::qspi::Qspi::write_stream+12>  strb    r1, [r0, #0]                                                                                                                                  │
│   0x8022e8a <stm32h7xx_hal::qspi::Qspi::write_stream+14>  cbz     r3, 0x8022ec8 <stm32h7xx_hal::qspi::Qspi::write_stream+76>                                                                                    │
│   0x8022e8c <stm32h7xx_hal::qspi::Qspi::write_stream+16>  ands.w  lr, r3, #3                                                                                                                                    │
│   0x8022e90 <stm32h7xx_hal::qspi::Qspi::write_stream+20>  mov     r1, r2                                                                                                                                        │
│   0x8022e92 <stm32h7xx_hal::qspi::Qspi::write_stream+22>  sub.w   r12, r3, #1                                                                                                                                   │
│   0x8022e96 <stm32h7xx_hal::qspi::Qspi::write_stream+26>  ittt    ne                                                                                                                                            │
│   0x8022e98 <stm32h7xx_hal::qspi::Qspi::write_stream+28>  ldrbne.w        r4, [r1], #1                                                                                                                          │
│   0x8022e9c <stm32h7xx_hal::qspi::Qspi::write_stream+32>  strbne  r4, [r0, #0]                                                                                                                                  │
│   0x8022e9e <stm32h7xx_hal::qspi::Qspi::write_stream+34>  cmpne.w lr, #1                                                                                                                                        │
│   0x8022ea2 <stm32h7xx_hal::qspi::Qspi::write_stream+38>  bne.n   0x8022eca <stm32h7xx_hal::qspi::Qspi::write_stream+78>                                                                                        │
│   0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40>  cmp.w   r12, #3                                                                                                                                       │
│   0x8022ea8 <stm32h7xx_hal::qspi::Qspi::write_stream+44>  it      cc                                                                                                                                            │
│   0x8022eaa <stm32h7xx_hal::qspi::Qspi::write_stream+46>  popcc   {r4, r6, r7, pc}                                                                                                                              │
│   0x8022eac <stm32h7xx_hal::qspi::Qspi::write_stream+48>  add     r2, r3                                                                                                                                        │
│   0x8022eae <stm32h7xx_hal::qspi::Qspi::write_stream+50>  subs    r1, #4                                                                                                                                        │
│   0x8022eb0 <stm32h7xx_hal::qspi::Qspi::write_stream+52>  ldrb.w  r3, [r1, #4]!                                                                                                                                 │
│   0x8022eb4 <stm32h7xx_hal::qspi::Qspi::write_stream+56>  strb    r3, [r0, #0]                                                                                                                                  │
│   0x8022eb6 <stm32h7xx_hal::qspi::Qspi::write_stream+58>  ldrb    r3, [r1, #1]                                                                                                                                  │
│   0x8022eb8 <stm32h7xx_hal::qspi::Qspi::write_stream+60>  strb    r3, [r0, #0]                                                                                                                                  │
│   0x8022eba <stm32h7xx_hal::qspi::Qspi::write_stream+62>  ldrb    r3, [r1, #2]                                                                                                                                  │
│   0x8022ebc <stm32h7xx_hal::qspi::Qspi::write_stream+64>  strb    r3, [r0, #0]                                                                                                                                  │
│   0x8022ebe <stm32h7xx_hal::qspi::Qspi::write_stream+66>  ldrb    r3, [r1, #3]                                                                                                                                  │
│   0x8022ec0 <stm32h7xx_hal::qspi::Qspi::write_stream+68>  strb    r3, [r0, #0]                                                                                                                                  │
│   0x8022ec2 <stm32h7xx_hal::qspi::Qspi::write_stream+70>  adds    r3, r1, #4                                                                                                                                    │
│   0x8022ec4 <stm32h7xx_hal::qspi::Qspi::write_stream+72>  cmp     r3, r2                                                                                                                                        │
│   0x8022ec6 <stm32h7xx_hal::qspi::Qspi::write_stream+74>  bne.n   0x8022eb0 <stm32h7xx_hal::qspi::Qspi::write_stream+52>                                                                                        │
│   0x8022ec8 <stm32h7xx_hal::qspi::Qspi::write_stream+76>  pop     {r4, r6, r7, pc}                                                                                                                              │
│   0x8022eca <stm32h7xx_hal::qspi::Qspi::write_stream+78>  ldrb    r1, [r2, #1]                                                                                                                                  │
│   0x8022ecc <stm32h7xx_hal::qspi::Qspi::write_stream+80>  cmp.w   lr, #2                                                                                                                                        │
│   0x8022ed0 <stm32h7xx_hal::qspi::Qspi::write_stream+84>  strb    r1, [r0, #0]                                                                                                                                  │
│   0x8022ed2 <stm32h7xx_hal::qspi::Qspi::write_stream+86>  bne.n   0x8022ed8 <stm32h7xx_hal::qspi::Qspi::write_stream+92>                                                                                        │
│   0x8022ed4 <stm32h7xx_hal::qspi::Qspi::write_stream+88>  adds    r1, r2, #2                                                                                                                                    │
│   0x8022ed6 <stm32h7xx_hal::qspi::Qspi::write_stream+90>  b.n     0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40>                                                                                        │
│   0x8022ed8 <stm32h7xx_hal::qspi::Qspi::write_stream+92>  ldrb    r1, [r2, #2]                                                                                                                                  │
│   0x8022eda <stm32h7xx_hal::qspi::Qspi::write_stream+94>  strb    r1, [r0, #0]                                                                                                                                  │
│   0x8022edc <stm32h7xx_hal::qspi::Qspi::write_stream+96>  adds    r1, r2, #3                                                                                                                                    │
│   0x8022ede <stm32h7xx_hal::qspi::Qspi::write_stream+98>  b.n     0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40>

jordens · 2020-10-22T08:58:22Z

Well, let's be pragmatic. My guess is that the low-hanging fruit is word-size access (letting the FIFO do the gearbox if it can) and fully unrolling the transfer (i.e. pack/serialize all writes into one fixed-size array, leaving the addresses in the array constant, then do the transfer on the array). This seems to be a rather high opt-level already, judging from the way it tries to distinguish the different remaining slice sizes.
But then I'd stop and call it good enough. DMA would not look much different assuming a similar API.
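
A minimal sketch of what word-size, fixed-length access could look like, under stated assumptions: `write_words` is a hypothetical helper, not part of `stm32h7xx_hal`, and `dr` stands in for a pointer to the QSPI data register. Whether the FIFO accepts 32-bit writes in the expected byte order needs to be checked against the reference manual.

```rust
use core::ptr;

/// Write a pre-serialized, fixed-size transfer to a FIFO data register
/// using 32-bit accesses.
///
/// `dr` is a hypothetical pointer to the 32-bit data register. Because the
/// array length `N` is known at compile time, the compiler is free to
/// unroll the loop completely.
///
/// # Safety
/// `dr` must be valid for volatile 32-bit writes.
unsafe fn write_words<const N: usize>(dr: *mut u32, words: &[u32; N]) {
    for &word in words {
        // Volatile so that every store reaches the register.
        unsafe { ptr::write_volatile(dr, word) };
    }
}
```

No status checks and no draining are performed here, so the caller has to guarantee that the FIFO has room for `N` words.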

ryan-summers · 2020-10-22T09:41:38Z

When encoding a single channel into four 32-bit words (a [u32; 4] array, which includes an extraneous write to the CSR as padding, since the CSR + AMP + PHASE + FREQ writes only take 14 bytes), the whole transaction takes 109 clock cycles, which comes out to ~272 nanoseconds.
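
For illustration, a sketch of how such a [u32; 4] could be assembled. Everything here is an assumption rather than the code that was measured: the register addresses are taken from the AD9959 datasheet and should be verified, the padding CSR write is placed at the end, register data is sent MSB first, and the FIFO is assumed to shift out the least significant byte of each word first. `serialize_profile` is a hypothetical name.

```rust
// Register addresses assumed from the AD9959 datasheet; verify before use.
const CSR: u8 = 0x00; // channel select register, 1 data byte
const CFTW0: u8 = 0x04; // frequency tuning word, 4 data bytes
const CPOW0: u8 = 0x05; // phase offset word, 2 data bytes
const ACR: u8 = 0x06; // amplitude control register, 3 data bytes

/// Pack one channel update into four 32-bit words (16 bytes):
/// 14 bytes of real writes plus a repeated 2-byte CSR write as padding.
fn serialize_profile(csr: u8, ftw: u32, pow: u16, acr: u32) -> [u32; 4] {
    let f = ftw.to_be_bytes();
    let p = pow.to_be_bytes();
    let a = acr.to_be_bytes(); // only the lower 24 bits are sent
    let bytes: [u8; 16] = [
        CSR, csr,
        CFTW0, f[0], f[1], f[2], f[3],
        CPOW0, p[0], p[1],
        ACR, a[1], a[2], a[3],
        CSR, csr, // extraneous write that pads the transfer to 16 bytes
    ];
    let mut words = [0u32; 4];
    for (word, chunk) in words.iter_mut().zip(bytes.chunks_exact(4)) {
        // Assumption: the byte that goes out first sits in the low byte.
        *word = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    words
}
```

Since the register addresses sit at fixed positions, a caller could also keep one array around and only overwrite the data bytes on each update.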

ryan-summers · 2020-10-22T10:44:52Z

When the floating-point math required for the frequency/amplitude/phase calculations is simplified to 32-bit, it looks like the QSPI stage plus calculation and serialization takes approximately 1.6 µs.
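
To make the kind of conversion concrete, here is a hedged sketch in 32-bit floating point. The field widths (32-bit frequency tuning word, 14-bit phase offset word, 10-bit amplitude scale factor) are assumptions based on the AD9959 datasheet, and the function names are hypothetical; this is not the code that was timed.

```rust
/// Frequency tuning word: f_out / f_sys scaled to 2^32 (assumed 32-bit FTW).
/// Out-of-range results saturate because of Rust's float-to-int cast rules.
fn frequency_to_ftw(f_out_hz: f32, f_sys_hz: f32) -> u32 {
    ((f_out_hz / f_sys_hz) * 4_294_967_296.0_f32) as u32
}

/// Phase offset word: turns scaled to 2^14 (assumed 14-bit POW).
/// Phases of one turn or more wrap; negative inputs saturate to zero.
fn phase_to_pow(turns: f32) -> u16 {
    ((turns * 16_384.0) as u32 & 0x3FFF) as u16
}

/// Amplitude scale factor: 0.0..=1.0 mapped to 0..=1023 (assumed 10-bit ASF).
fn amplitude_to_asf(amplitude: f32) -> u16 {
    (amplitude.clamp(0.0, 1.0) * 1023.0) as u16
}
```

Each function is a single multiply plus a cast, so if the values fed in were already expressed in register units these steps could be dropped entirely.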

We may be able to shave some of this time off further by optimizing the calculations as well.

jordens · 2020-10-22T11:36:04Z

There is no need for those conversions. They can be accounted for in filter coefficients elsewhere.

jordens · 2020-10-22T11:51:05Z

Still, 109 cycles sounds like a lot for what is 4 loads and 4 stores and a handful of instructions to get source and target addresses into registers. But in any case optimizing this would not affect the API.

ryan-summers · 2020-10-26T15:17:48Z

Closing this - the investigation is complete and it looks like we can hit the necessary throughput. Additional development on this is tracked overall by #147 and will be broken out into separate tickets.

ryan-summers added the enhancement label Oct 21, 2020

ryan-summers self-assigned this Oct 21, 2020

ryan-summers closed this as completed Oct 26, 2020
