Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Alternate Manchester SWO implementation using DMA #1960

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

ssimek
Copy link

@ssimek ssimek commented Oct 13, 2024

Fast DMA-based SWO Manchester decoding

Detailed description

The current interrupt-based SWO decoding suffers (IMO) from several issues

  • it is limited to ~100 kHz (~50 kbit/s raw, i.e. ~25 kbit/s decoded SWO console) signal
    • while this is usually sufficient for text-based debug output, it gets worse when more data needs to be sent out/graphed/etc
    • it also unnecessarily impacts performance of the tested code, because the MCU has to wait for the ITM to be available
  • high frequency hangs the probe due to an overload of interrupt requests
    • this happens easily when starting with new hardware before the init code is debugged

This PR contains a full reimplementation of the SWO Manchester decoder using DMA that makes capture a lot more efficient:

  1. all edge times of the signal are captured using a timer
  2. DMA is used to record the timings into a circular buffer
  3. the buffer is periodically processed in batches, transformig the edge stream into a byte stream for sending in another circular buffer, resulting in effective processing time per sample on the order of several clock cycles
  4. the output buffer is processed in a lower-priority ISR as time permits

The result is the ability to process SWO signal up to ~3 MHz with the probe being more resilient against higher frequency signals (it just fails to decode it properly). Continuous streaming at these speeds obviously makes USB a bottleneck (1.5 Mbit = ~180 kB/s, with the USB controller seemingly being able to handle up to about 70 kB/s without double buffering). But for reasonably intermittent flows, one big benefit is the lightened load on the target.

Your checklist for this pull request

  • I've read the Code of Conduct
  • I've read the guidelines for contributing to this repository
  • It builds for hardware native (see Building the firmware)
    • it does, but it doesn't fit, neither did the original one, at least not with GCC 13.3 I'm using - it fits with LTO, but then the GDB server doesn't work for some reason. I was developing it with a few targets removed so it fit
  • It builds as BMDA (see Building the BMDA)
  • I've tested it to the best of my ability
  • My commit messages provide a useful short description of what the commits do

Closing issues

None that I'm aware of

Copy link
Member

@dragonmux dragonmux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't got too far into reviewing the DMA code as the review notes we have are already going to be a fair bit of work and with them resolved reviewing the rest of the implementation will be easier, so we're submitting this as an early review.

@@ -58,7 +58,6 @@ bool swo_itm_decoding = false;
uint8_t *swo_buffer;
uint16_t swo_buffer_read_index = 0U;
uint16_t swo_buffer_write_index = 0U;
_Atomic uint16_t swo_buffer_bytes_available = 0U;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might have seemed redundant, but please consider what happens when the two indicies are equal to each other after the buffer fills up - this particular bug has happened when a target is particularly bursty at higher frequencies, and results in a full buffer of data getting dropped when just checking the indexes aren't equal. This is why this variable was introduced.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have. It's not about redundancy - other than the (not insignificant) performance penalty, maintaining the extra variable introduces a higher chance of getting things wrong. When swo_buffer_bytes_available crossed the size of the buffer, it wasn't handled either - it just sent the buffered data twice. Just using the indexes on the other hand tends to "drop" the full buffer. The stream is corrupted in both cases, but I'd argue sending less data and recovering faster is preferable to sending more bad data.

src/platforms/common/stm32/swo_uart.c Outdated Show resolved Hide resolved
@@ -130,13 +130,11 @@ void swo_uart_deinit(void)
{
/* Disable the UART and halt DMA for it, grabbing the number of bytes left in the buffer as we do */
usart_disable(SWO_UART);
const uint16_t space_remaining = dma_get_number_of_data(SWO_DMA_BUS, SWO_DMA_CHAN);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TRM didn't make it clear if disabling the channel reset the value read to the initial value programmed - this will need testing particularly on a GD32F1-based probe to ensure this doesn't cause an unforced error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue it can not, it would violate a basic principle of DMA - disabling the channel is not doing anything, not even stopping pending operations, it just masks future requests. Re-enabling it has to continue where it left off.

I have moved it below jsut to make sure transfers are not going to continue for the few cycles after the readout, but I can keep it where it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include these changes in the Meson build system too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to, hopefully it works

// a proper break in the pulse sequence
//#define SWO_ADVANCED_RECOVERY 1

#define FORCE_INLINE inline __attribute__((always_inline))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not do this. Let the compiler choose by just saying inline. This actively pesimises the inlining vs code size balance and should only be done when absolutely required. Let the compiler do its job otherwise please.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found another way to do this, but I cannot agree with the general sentiment :) I also prefer to not use force inlining, but in this case the compiler chose not to inline even when inlining made the code several times faster and smaller. There is a reason this attribute exists.

It might be unfortunate, but writing this kind of extreme performance code is an endless loop of making the compiler generate the assembly one would write on their own (while still being portable to other CPU architectures, maybe with less optimal performance). At the same time, it is also true the compiler is better at ordering the instructions to make better use of various CPU pipeline stages which makes it generally faster than handwritten assembly, not to mention more maintainable. But the optimization heuristics are simply not able to understand the intent/requirements in all cases.

Comment on lines 111 to 113
uint16_t rx, t, q;
uint8_t s;
int32_t b;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use expressive variable names - what are these, what do they mean?! We are not limited on variable name length, so please use the name to describe the purpose.

It might seem a little harsh, but the idea is that you should be describing to the reader, who might be yourself in 6 months, what the intent of all this is and what these mean/do so it can be maintained and adjusted if there are bugs found. That's just not practical with single-letter variable names which is why this is a clang-tidy lint. Every attempt to read this code and figure out what it does becomes a reverse engineering exercise otherwise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of a matter of opinion, and these "variables" are very clearly documented inside the main processing function, this is just a stash to store the state in the meantime. I'd argue the C-style long variables make code very hard to read because there is an abysmal signal to noise ratio.

Anyway, changing this, I don't wish to force my will on the project :)

Comment on lines 234 to 264
if ((*USB_EP_REG(SWO_ENDPOINT) & USB_EP_TX_STAT) != USB_EP_TX_STAT_VALID)
{
swo_send_buffer(usbdev, SWO_ENDPOINT);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The braces are not necessary, please drop them. Please also run this file through clang-format.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been run through clang-format. Also, I'm very hesitant to see code without braces - remember goto fail;?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format is now really applied :) I incorrectly presumed make clang-format does it since it changed a ton of files, but it goes only two levels deep.

src/platforms/common/stm32/swo_manchester_dma.c Outdated Show resolved Hide resolved
src/platforms/common/stm32/swo_manchester_dma.c Outdated Show resolved Hide resolved
src/platforms/common/stm32/swo_manchester_dma.c Outdated Show resolved Hide resolved
@dragonmux
Copy link
Member

dragonmux commented Oct 15, 2024

Note: We are aware of the breakage on the lint pass from GitHub's deployment of Ubuntu 24.04 LTS, we will get this fixed in the mean time.

Edit: This has now been fixed, when you rebase this PR on main next, that fix will automatically get pulled in and used.

@ssimek ssimek force-pushed the feature/fast-dma-traceswo branch 4 times, most recently from 0a26202 to 7b87080 Compare October 18, 2024 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants