Investigate accelerating Pounder profile configuration #155

Closed · 3 of 4 tasks
ryan-summers opened this issue Oct 21, 2020 · 10 comments
Assignees: ryan-summers
Labels: enhancement (New feature or request)

Comments

ryan-summers (Member) commented Oct 21, 2020

@jordens has indicated that there is a desire to configure Pounder asynchronously to minimize CPU overhead. Ideally, Pounder could be updated at the 500 kHz sampling rate of Stabilizer.

This task involves investigative work into asynchronously sending Pounder configurations over the QSPI interface.

Specifically, the interest is to send up to two channel configurations (phase, amplitude, and frequency) at a rate of 500 kHz. The CPU overhead for each configuration should be minimal.

Current investigations:

  • It is possible to configure multiple Pounder registers in a single SPI transaction, so once the transaction has been set up, the data can either be written to the QSPI FIFO directly or transferred via DMA (a sketch of such a transaction follows this list).
  • Measure the amount of time required to configure a transfer in the QSPI peripheral
  • Measure the amount of time required to configure a transfer using DMA
  • Determine if QSPI can operate in "endless transaction" mode
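As a concrete reference for the first item, here is a minimal sketch of how one channel's profile could be serialized into a single transaction, assuming the AD9959 register map used by Pounder (CSR at 0x00, FTW at 0x04, POW at 0x05, ACR at 0x06). The function name and field handling are illustrative, not the actual Stabilizer implementation:

fn serialize_profile(channel_mask: u8, ftw: u32, pow: u16, acr: u32) -> [u8; 14] {
    let mut buf = [0u8; 14];
    // CSR (0x00): select the channel(s) the following writes apply to.
    buf[0] = 0x00;
    buf[1] = channel_mask;
    // FTW (0x04): 32-bit frequency tuning word, MSB first.
    buf[2] = 0x04;
    buf[3..7].copy_from_slice(&ftw.to_be_bytes());
    // POW (0x05): 14-bit phase offset word in a 16-bit register.
    buf[7] = 0x05;
    buf[8..10].copy_from_slice(&pow.to_be_bytes());
    // ACR (0x06): 24-bit amplitude control register (top byte of the u32 dropped).
    buf[10] = 0x06;
    buf[11..14].copy_from_slice(&acr.to_be_bytes()[1..]);
    buf
}

The total is 2 + 5 + 3 + 4 = 14 bytes, matching the transaction size measured below.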
ryan-summers added the "enhancement" label Oct 21, 2020
ryan-summers self-assigned this Oct 21, 2020

ryan-summers (Member Author) commented:

I have confirmed that the QSPI can operate without the instruction/address phases (i.e. data writes to the FIFO only), and that we can run an "endless" transaction where CS is held low and data is written into the QSPI FIFO whenever it becomes available.

ryan-summers (Member Author) commented:

When writing data directly to the QSPI FIFO, 14 bytes took 626 cycles to write, which comes out to a total of ~1.565 microseconds for the operation.
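(For the arithmetic: assuming the cycle count is taken against the 400 MHz core clock, which is what the quoted figure implies, 626 cycles / 400 MHz ≈ 1.565 µs, or roughly 45 cycles per byte pushed into the FIFO.)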

jordens (Member) commented Oct 22, 2020

That's more than an order of magnitude slower than I expected. Surely there is room for improvement, no? No need to do checks and no need to drain it.

ryan-summers (Member Author) commented:

I agree - there's a lot of room for optimization here; I'm just trying to understand where the time is going.

The writes to the FIFO occur over the AHB3 bus, which is currently clocked at 200 MHz.

The following source compiles to the assembly shown below:

pub fn write_stream(&mut self, addr: u8, data: &[u8]) {
    unsafe {
        // Byte-wide volatile writes into the QSPI data register (FIFO):
        // first the target register address, then the payload bytes.
        ptr::write_volatile(&self.rb.dr as *const _ as *mut u8, addr);
        for byte in data {
            ptr::write_volatile(&self.rb.dr as *const _ as *mut u8, *byte);
        }
    }
}
0x8022e7c <stm32h7xx_hal::qspi::Qspi::write_stream>     push    {r4, r6, r7, lr}
0x8022e7e <stm32h7xx_hal::qspi::Qspi::write_stream+2>   add     r7, sp, #8
0x8022e80 <stm32h7xx_hal::qspi::Qspi::write_stream+4>   movw    r0, #20512      ; 0x5020
0x8022e84 <stm32h7xx_hal::qspi::Qspi::write_stream+8>   movt    r0, #20992      ; 0x5200
0x8022e88 <stm32h7xx_hal::qspi::Qspi::write_stream+12>  strb    r1, [r0, #0]
0x8022e8a <stm32h7xx_hal::qspi::Qspi::write_stream+14>  cbz     r3, 0x8022ec8 <stm32h7xx_hal::qspi::Qspi::write_stream+76>
0x8022e8c <stm32h7xx_hal::qspi::Qspi::write_stream+16>  ands.w  lr, r3, #3
0x8022e90 <stm32h7xx_hal::qspi::Qspi::write_stream+20>  mov     r1, r2
0x8022e92 <stm32h7xx_hal::qspi::Qspi::write_stream+22>  sub.w   r12, r3, #1
0x8022e96 <stm32h7xx_hal::qspi::Qspi::write_stream+26>  ittt    ne
0x8022e98 <stm32h7xx_hal::qspi::Qspi::write_stream+28>  ldrbne.w        r4, [r1], #1
0x8022e9c <stm32h7xx_hal::qspi::Qspi::write_stream+32>  strbne  r4, [r0, #0]
0x8022e9e <stm32h7xx_hal::qspi::Qspi::write_stream+34>  cmpne.w lr, #1
0x8022ea2 <stm32h7xx_hal::qspi::Qspi::write_stream+38>  bne.n   0x8022eca <stm32h7xx_hal::qspi::Qspi::write_stream+78>
0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40>  cmp.w   r12, #3
0x8022ea8 <stm32h7xx_hal::qspi::Qspi::write_stream+44>  it      cc
0x8022eaa <stm32h7xx_hal::qspi::Qspi::write_stream+46>  popcc   {r4, r6, r7, pc}
0x8022eac <stm32h7xx_hal::qspi::Qspi::write_stream+48>  add     r2, r3
0x8022eae <stm32h7xx_hal::qspi::Qspi::write_stream+50>  subs    r1, #4
0x8022eb0 <stm32h7xx_hal::qspi::Qspi::write_stream+52>  ldrb.w  r3, [r1, #4]!
0x8022eb4 <stm32h7xx_hal::qspi::Qspi::write_stream+56>  strb    r3, [r0, #0]
0x8022eb6 <stm32h7xx_hal::qspi::Qspi::write_stream+58>  ldrb    r3, [r1, #1]
0x8022eb8 <stm32h7xx_hal::qspi::Qspi::write_stream+60>  strb    r3, [r0, #0]
0x8022eba <stm32h7xx_hal::qspi::Qspi::write_stream+62>  ldrb    r3, [r1, #2]
0x8022ebc <stm32h7xx_hal::qspi::Qspi::write_stream+64>  strb    r3, [r0, #0]
0x8022ebe <stm32h7xx_hal::qspi::Qspi::write_stream+66>  ldrb    r3, [r1, #3]
0x8022ec0 <stm32h7xx_hal::qspi::Qspi::write_stream+68>  strb    r3, [r0, #0]
0x8022ec2 <stm32h7xx_hal::qspi::Qspi::write_stream+70>  adds    r3, r1, #4
0x8022ec4 <stm32h7xx_hal::qspi::Qspi::write_stream+72>  cmp     r3, r2
0x8022ec6 <stm32h7xx_hal::qspi::Qspi::write_stream+74>  bne.n   0x8022eb0 <stm32h7xx_hal::qspi::Qspi::write_stream+52>
0x8022ec8 <stm32h7xx_hal::qspi::Qspi::write_stream+76>  pop     {r4, r6, r7, pc}
0x8022eca <stm32h7xx_hal::qspi::Qspi::write_stream+78>  ldrb    r1, [r2, #1]
0x8022ecc <stm32h7xx_hal::qspi::Qspi::write_stream+80>  cmp.w   lr, #2
0x8022ed0 <stm32h7xx_hal::qspi::Qspi::write_stream+84>  strb    r1, [r0, #0]
0x8022ed2 <stm32h7xx_hal::qspi::Qspi::write_stream+86>  bne.n   0x8022ed8 <stm32h7xx_hal::qspi::Qspi::write_stream+92>
0x8022ed4 <stm32h7xx_hal::qspi::Qspi::write_stream+88>  adds    r1, r2, #2
0x8022ed6 <stm32h7xx_hal::qspi::Qspi::write_stream+90>  b.n     0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40>
0x8022ed8 <stm32h7xx_hal::qspi::Qspi::write_stream+92>  ldrb    r1, [r2, #2]
0x8022eda <stm32h7xx_hal::qspi::Qspi::write_stream+94>  strb    r1, [r0, #0]
0x8022edc <stm32h7xx_hal::qspi::Qspi::write_stream+96>  adds    r1, r2, #3
0x8022ede <stm32h7xx_hal::qspi::Qspi::write_stream+98>  b.n     0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40>

jordens (Member) commented Oct 22, 2020

Well, let's be pragmatic. My guess is that the low-hanging fruit is word-size access (letting the FIFO do the gearbox, if it can) and fully unrolling the transfer, i.e. pack/serialize all writes into one fixed-size array, keeping the register addresses in the array constant, and then do the transfer from that array. Judging by the way it tries to distinguish the different remaining slice sizes, this already looks like a rather high opt-level.
But then I'd stop and call it good enough. DMA would not look much different assuming a similar API.
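
For illustration, a minimal sketch of that word-size, fully unrolled variant, assuming the QUADSPI data register accepts 32-bit accesses and the FIFO drains them byte by byte; the `dr` pointer stands in for `&self.rb.dr` from the HAL wrapper above:

use core::ptr;

// Unrolled word-size FIFO writes: the caller serializes everything into a
// fixed [u32; 4] up front, so the hot path is just four volatile word stores
// with no loop or slice-length bookkeeping.
pub fn write_profile_words(dr: *mut u32, words: &[u32; 4]) {
    unsafe {
        ptr::write_volatile(dr, words[0]);
        ptr::write_volatile(dr, words[1]);
        ptr::write_volatile(dr, words[2]);
        ptr::write_volatile(dr, words[3]);
    }
}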

ryan-summers (Member Author) commented:

When encoding a single channel into four 32-bit words (a [u32; 4] array, which includes an additional, extraneous write to the CSR, since the CSR + amplitude + phase + frequency writes only take 14 bytes), the whole transaction takes 109 clock cycles, which comes out to ~272 nanoseconds.
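
For reference, a sketch of that encoding, reusing the 14-byte serialization from the opening comment and padding it with a second (redundant) CSR write to fill the last word. The byte order within each word is an assumption here and depends on how the QSPI FIFO drains word-size writes:

// Pad the 14-byte profile (CSR + FTW + POW + ACR writes) to 16 bytes with an
// extra CSR write, then pack into four words for the unrolled FIFO stores.
fn pack_profile_words(bytes: &[u8; 14], channel_mask: u8) -> [u32; 4] {
    let mut padded = [0u8; 16];
    padded[..14].copy_from_slice(bytes);
    // Redundant CSR write (address 0x00 + channel mask) used purely as padding.
    padded[14] = 0x00;
    padded[15] = channel_mask;

    let mut words = [0u32; 4];
    for (word, chunk) in words.iter_mut().zip(padded.chunks_exact(4)) {
        // Little-endian packing shown here (first buffer byte in the low byte);
        // swap to big-endian if the peripheral shifts the high byte out first.
        *word = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    words
}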

ryan-summers (Member Author) commented:

When the floating-point math required for the frequency/amplitude/phase calculations is simplified to 32-bit, it looks like the QSPI stage plus calculation and serialization takes approximately 1.6 µs.

We may be able to shave some of this time off further by optimizing the calculations as well.
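
One hypothetical direction for that, sketched below with single-precision math only. The names, the 500 MHz system clock, and the scalings are assumptions for illustration (AD9959: 32-bit FTW, 14-bit POW, 10-bit amplitude scale factor), not the current Stabilizer code:

// Assumed AD9959 system clock; adjust to the actual Pounder reference/PLL setup.
const SYSCLK_HZ: f32 = 500_000_000.0;

// 32-bit frequency tuning word: ftw = f_out / f_sys * 2^32.
fn frequency_to_ftw(freq_hz: f32) -> u32 {
    ((freq_hz / SYSCLK_HZ) * 4_294_967_296.0) as u32
}

// 14-bit phase offset word from a phase given in turns, wrapping at 1.0.
fn phase_to_pow(turns: f32) -> u16 {
    ((turns * 16384.0) as u32 & 0x3FFF) as u16
}

// 10-bit amplitude scale factor from a full-scale fraction in [0.0, 1.0].
fn amplitude_to_asf(full_scale: f32) -> u16 {
    (full_scale * 1023.0) as u16 & 0x3FF
}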

jordens (Member) commented Oct 22, 2020

There is no need for those conversions. They can be accounted for in filter coefficients elsewhere.

jordens (Member) commented Oct 22, 2020

Still, 109 cycles sounds like a lot for what is four loads, four stores, and a handful of instructions to get the source and target addresses into registers. But in any case, optimizing this would not affect the API.

ryan-summers (Member Author) commented:

Closing this - the investigation is complete and it looks like we can hit the necessary throughput. Additional development on this is tracked overall by #147 and will be broken out into separate tickets.
