Investigate accelerating Pounder profile configuration #155
Comments
I have confirmed that the QSPI can operate without the instruction/address phase (i.e. data writes to the FIFO only), and that we can complete an "endless" transaction where CS is held low and we write data into the QSPI FIFO whenever space is available.
When writing data directly to the QSPI FIFO, 14 bytes took 626 cycles to write, which comes out to a total of ~1.565 µs for the operation.
That's more than an order of magnitude slower than I expected. Surely there is room for improvement, no? No need to do checks and no need to drain it.
I agree, there's a lot of room for optimization here. I'm just trying to understand where the timing is coming from. The writes to the FIFO occur over the AHB3 bus, which is currently clocked at 200 MHz. The source below compiles to the assembly that follows it:
pub fn write_stream(&mut self, addr: u8, data: &[u8]) {
    unsafe {
        ptr::write_volatile(&self.rb.dr as *const _ as *mut u8, addr);
        for byte in data {
            ptr::write_volatile(&self.rb.dr as *const _ as *mut u8, *byte);
        }
    }
}
│ 0x8022e7c <stm32h7xx_hal::qspi::Qspi::write_stream> push {r4, r6, r7, lr} │
│ 0x8022e7e <stm32h7xx_hal::qspi::Qspi::write_stream+2> add r7, sp, #8 │
│ 0x8022e80 <stm32h7xx_hal::qspi::Qspi::write_stream+4> movw r0, #20512 ; 0x5020 │
│ 0x8022e84 <stm32h7xx_hal::qspi::Qspi::write_stream+8> movt r0, #20992 ; 0x5200 │
│B+>0x8022e88 <stm32h7xx_hal::qspi::Qspi::write_stream+12> strb r1, [r0, #0] │
│ 0x8022e8a <stm32h7xx_hal::qspi::Qspi::write_stream+14> cbz r3, 0x8022ec8 <stm32h7xx_hal::qspi::Qspi::write_stream+76> │
│ 0x8022e8c <stm32h7xx_hal::qspi::Qspi::write_stream+16> ands.w lr, r3, #3 │
│ 0x8022e90 <stm32h7xx_hal::qspi::Qspi::write_stream+20> mov r1, r2 │
│ 0x8022e92 <stm32h7xx_hal::qspi::Qspi::write_stream+22> sub.w r12, r3, #1 │
│ 0x8022e96 <stm32h7xx_hal::qspi::Qspi::write_stream+26> ittt ne │
│ 0x8022e98 <stm32h7xx_hal::qspi::Qspi::write_stream+28> ldrbne.w r4, [r1], #1 │
│ 0x8022e9c <stm32h7xx_hal::qspi::Qspi::write_stream+32> strbne r4, [r0, #0] │
│ 0x8022e9e <stm32h7xx_hal::qspi::Qspi::write_stream+34> cmpne.w lr, #1 │
│ 0x8022ea2 <stm32h7xx_hal::qspi::Qspi::write_stream+38> bne.n 0x8022eca <stm32h7xx_hal::qspi::Qspi::write_stream+78> │
│ 0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40> cmp.w r12, #3 │
│ 0x8022ea8 <stm32h7xx_hal::qspi::Qspi::write_stream+44> it cc │
│ 0x8022eaa <stm32h7xx_hal::qspi::Qspi::write_stream+46> popcc {r4, r6, r7, pc} │
│ 0x8022eac <stm32h7xx_hal::qspi::Qspi::write_stream+48> add r2, r3 │
│ 0x8022eae <stm32h7xx_hal::qspi::Qspi::write_stream+50> subs r1, #4 │
│ 0x8022eb0 <stm32h7xx_hal::qspi::Qspi::write_stream+52> ldrb.w r3, [r1, #4]! │
│ 0x8022eb4 <stm32h7xx_hal::qspi::Qspi::write_stream+56> strb r3, [r0, #0] │
│ 0x8022eb6 <stm32h7xx_hal::qspi::Qspi::write_stream+58> ldrb r3, [r1, #1] │
│ 0x8022eb8 <stm32h7xx_hal::qspi::Qspi::write_stream+60> strb r3, [r0, #0] │
│ 0x8022eba <stm32h7xx_hal::qspi::Qspi::write_stream+62> ldrb r3, [r1, #2] │
│ 0x8022ebc <stm32h7xx_hal::qspi::Qspi::write_stream+64> strb r3, [r0, #0] │
│ 0x8022ebe <stm32h7xx_hal::qspi::Qspi::write_stream+66> ldrb r3, [r1, #3] │
│ 0x8022ec0 <stm32h7xx_hal::qspi::Qspi::write_stream+68> strb r3, [r0, #0] │
│ 0x8022ec2 <stm32h7xx_hal::qspi::Qspi::write_stream+70> adds r3, r1, #4 │
│ 0x8022ec4 <stm32h7xx_hal::qspi::Qspi::write_stream+72> cmp r3, r2 │
│ 0x8022ec6 <stm32h7xx_hal::qspi::Qspi::write_stream+74> bne.n 0x8022eb0 <stm32h7xx_hal::qspi::Qspi::write_stream+52> │
│ 0x8022ec8 <stm32h7xx_hal::qspi::Qspi::write_stream+76> pop {r4, r6, r7, pc} │
│ 0x8022eca <stm32h7xx_hal::qspi::Qspi::write_stream+78> ldrb r1, [r2, #1] │
│ 0x8022ecc <stm32h7xx_hal::qspi::Qspi::write_stream+80> cmp.w lr, #2 │
│ 0x8022ed0 <stm32h7xx_hal::qspi::Qspi::write_stream+84> strb r1, [r0, #0] │
│ 0x8022ed2 <stm32h7xx_hal::qspi::Qspi::write_stream+86> bne.n 0x8022ed8 <stm32h7xx_hal::qspi::Qspi::write_stream+92> │
│ 0x8022ed4 <stm32h7xx_hal::qspi::Qspi::write_stream+88> adds r1, r2, #2 │
│ 0x8022ed6 <stm32h7xx_hal::qspi::Qspi::write_stream+90> b.n 0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40> │
│ 0x8022ed8 <stm32h7xx_hal::qspi::Qspi::write_stream+92> ldrb r1, [r2, #2] │
│ 0x8022eda <stm32h7xx_hal::qspi::Qspi::write_stream+94> strb r1, [r0, #0] │
│ 0x8022edc <stm32h7xx_hal::qspi::Qspi::write_stream+96> adds r1, r2, #3 │
│ 0x8022ede <stm32h7xx_hal::qspi::Qspi::write_stream+98> b.n 0x8022ea4 <stm32h7xx_hal::qspi::Qspi::write_stream+40> │
Well, let's be pragmatic. My guess is that the low-hanging fruit are word-sized accesses (letting the FIFO do the gearboxing if it can) and fully unrolling the transfer (i.e. pack/serialize all writes into one fixed-size array, leaving the addresses in the array constant, then do the transfer on the array). This seems to be a rather high opt-level already, judging from the way the compiler distinguishes the different remaining slice sizes.
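The packing idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the register address constants, the function names `pack_profile` and `write_words`, and the byte order in which the FIFO drains a 32-bit word (assumed least significant byte first) are all hypothetical and would need checking against the DDS datasheet and the QSPI reference manual.

```rust
// Hypothetical register addresses (placeholders, not taken from this thread).
const CSR: u8 = 0x00;
const FREQ: u8 = 0x04;
const PHASE: u8 = 0x05;
const AMP: u8 = 0x06;

/// Serialize one channel update into four 32-bit words (16 bytes).
/// The 14 payload bytes are padded with a repeated CSR write.
pub fn pack_profile(channel_select: u8, ftw: u32, pow: u16, acr: [u8; 3]) -> [u32; 4] {
    let f = ftw.to_be_bytes();
    let p = pow.to_be_bytes();
    let bytes: [u8; 16] = [
        CSR, channel_select,
        FREQ, f[0], f[1], f[2], f[3],
        PHASE, p[0], p[1],
        AMP, acr[0], acr[1], acr[2],
        CSR, channel_select, // extraneous write used as padding
    ];
    let mut words = [0u32; 4];
    // Assumption: the FIFO sends the least significant byte of each word first.
    for (w, chunk) in words.iter_mut().zip(bytes.chunks_exact(4)) {
        *w = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    words
}

/// Push the packed words to a FIFO data register using word-sized volatile stores.
///
/// # Safety
/// `dr` must be valid for volatile 32-bit writes.
pub unsafe fn write_words(dr: *mut u32, words: &[u32; 4]) {
    for &w in words {
        unsafe { core::ptr::write_volatile(dr, w) };
    }
}
```

Because the array length is fixed, the compiler is free to unroll the loop into four plain stores, and the address bytes can stay constant in a pre-built template so that only the data bytes change per update.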
When encoding a single channel into four 32-bit words (a [u32; 4] array, which includes an additional, extraneous write to the CSR, since the CSR + AMP + PHASE + FREQ writes only take 14 bytes), the whole transaction takes 109 clock cycles, which comes out to ~272 nanoseconds.
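For reference, the cycle counts quoted in this thread convert to time as sketched below. The 400 MHz core clock is an assumption inferred from the figures themselves (626 cycles ≈ 1.565 µs); it is not stated explicitly here, and the helper name `cycles_to_ns` is made up for illustration.

```rust
/// Core clock assumed from the thread's own numbers (626 cycles ≈ 1.565 µs).
const CORE_CLOCK_HZ: u64 = 400_000_000;

/// Convert a CPU cycle count to nanoseconds at the assumed core clock.
pub fn cycles_to_ns(cycles: u64) -> f64 {
    cycles as f64 * 1e9 / CORE_CLOCK_HZ as f64
}

/// Ratio between two cycle counts, e.g. byte-wise versus word-wise writes.
pub fn speedup(before_cycles: u64, after_cycles: u64) -> f64 {
    before_cycles as f64 / after_cycles as f64
}
```

Under that assumption, 626 cycles is 1565 ns and 109 cycles is 272.5 ns, a ratio of about 5.7. Note that the two measurements are not byte-for-byte comparable: the first wrote 14 bytes one at a time, the second wrote 16 bytes as four words.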
When the floating point math required for the frequency/amplitude/phase calculations is simplified to 32-bit, it looks like the QSPI stage plus calculation and serialization takes approximately 1.6 µs. We may be able to shave off more of this time by optimizing the calculations as well.
There is no need for those conversions. They can be accounted for in filter coefficients elsewhere.
Still, 109 cycles sounds like a lot for what is 4 loads and 4 stores and a handful of instructions to get source and target addresses into registers. But in any case, optimizing this would not affect the API.
Closing this - the investigation is complete and it looks like we can hit the necessary throughput. Additional development on this is tracked overall by #147 and will be broken out into separate tickets.
@jordens has indicated that there is a desire to configure Pounder asynchronously to avoid CPU overhead. Ideally, Pounder could be updated at the 500 kHz sampling rate of Stabilizer.
This task involves investigative work into asynchronously sending Pounder configurations over the QSPI interface.
Specifically, the interest is to send up to two channel configurations (phase, amplitude, and frequency) at a rate of 500 kHz. The CPU overhead for each configuration should be minimal.
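To put the 500 kHz target in terms of CPU cycles, a rough budget can be worked out as below. The 400 MHz core clock is again an assumption inferred from the cycle/time figures in the comments, and the function names are hypothetical.

```rust
/// Assumed core clock (inferred from the measurements in the comments).
const CORE_CLOCK_HZ: u32 = 400_000_000;
/// Stabilizer sampling rate named in the issue description.
const SAMPLE_RATE_HZ: u32 = 500_000;

/// CPU cycles available between two consecutive samples.
pub const fn cycles_per_sample() -> u32 {
    CORE_CLOCK_HZ / SAMPLE_RATE_HZ
}

/// Fraction of the per-sample cycle budget spent on QSPI writes.
pub fn budget_fraction(cycles_per_channel: u32, channels: u32) -> f64 {
    (cycles_per_channel * channels) as f64 / cycles_per_sample() as f64
}
```

With these numbers there are 800 cycles per sample, so two channel updates at the measured 109 cycles each would use 218 cycles, or roughly 27% of the budget, before any calculation or serialization cost is counted.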
Current investigations: