-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Math: FIR: Optimize filter core function for HiFi5 #9833
base: main
Are you sure you want to change the base?
Conversation
src/math/fir_hifi5.c
Outdated
@@ -198,21 +188,14 @@ void fir_32x16_2x(struct fir_state_32x16 *fir, ae_int32 x0, ae_int32 x1, | |||
|
|||
/* Write samples to delay */ | |||
AE_S32_L_XC(x0, fir->rwp, -sizeof(int32_t)); | |||
dp = (ae_f32x2 *)fir->rwp; | |||
dp = (ae_int32x2 *)fir->rwp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to change, I was trying ae_int32x4 pointer but it can't be used here due to data align issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth looking at the assembly to debug the lower than expected perf gain.
There's some nops but looks quite well packed instructions to slots for the filtering loop in this disassembly. There's two of ae_mula2q32x16.fir.h instructions while C code has one. I'm guessing (and hoping) the 2nd with many nops is after completing the loops (-1).
|
ae_f32x2 d1; | ||
int i; | ||
ae_int32 *dp = fir->rwp; | ||
ae_int16x4 *coefp = (ae_int16x4 *)fir->coef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is fir->coef defined with SIMD width alignment in the structure definition ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No there isn't. The code assumes the coefficients are without align. The coefficient loads are accessed via a 64 bit align register. So, there's likely one additional load happening in beginning of FIR calculate, then single 64 bit load per 16x4 set.
Unfortunately this FIR blob is not friendly for 64 bits align. There's five 32 bits words before the coefficients vector: https://github.com/thesofproject/sof/blob/main/src/include/user/fir.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got you. Can you memcpy the coefs from
struct sof_fir_coef_data {
int16_t length; /* Number of FIR taps */
int16_t out_shift; /* Amount of right shifts at output */
/* reserved */
uint32_t reserved[4];
int16_t coef[]; /* FIR coefficients */
} __attribute__((packed));
to a locally SIMD aligned ptr for alignment and check the MCPS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocating and copying to int16_t coefficients vector to 16 bytes aligned is not helping MCPS. Also after that using the non-aligning load for coefficients is not improving.
I tried to remove totally coefficients load from loop and use same zero coefficients for all MACs. It had only small impact. Same also from removing the MAC instruction, only some tenth of MCPS.
The circular loads of data samples seem to dominate the MCPS load. If I remove everything from FIR function after those data loads only, I get 3 MCPS, so the samples data loads consume that 7 MCPS and the remaining code is same as in HiFi4 version. Maybe the memory can't keep up with the loop.
With unmodified code and with --mem_model 11.82 MCPS run I get this statistics:
Cache Configuration:
ICache: 32768 bytes (32KB), 2-way set associative, 64-byte line
DCache: 98304 bytes (96KB), 3-way set associative, 64-byte line, writeback
Events Number Number
per 100
instrs
Committed instructions 14434798 ( 100.00 )
Instruction fetches 12316053 ( 85.32 )
Uncached 48 ( 0.00 )
ICache fetches 12316005 ( 85.32 )
ICache misses 10798 ( 0.07 ) 0.09% of ICache fetches
Loop buffer hits 8388078 ( 58.11 ) 68.11% of Instruction fetches
Loop Buffer ICache hits 8388078 ( 58.11 ) 68.11% of ICache fetches
Taken branches 640825 ( 4.44 )
Exceptions 46682 ( 0.32 )
WindowUnderflow 23341 ( 0.16 )
WindowOverflow 23341 ( 0.16 )
Loads 15485723 ( 107.28 )
Uncached 8 ( 0.00 )
DCache loads 15485715 ( 107.28 )
DCache load misses 404 ( 0.00 ) 0.00% of DCache loads
Stores 1373651 ( 9.52 )
DCache stores 1373651 ( 9.52 )
DCache store misses 1781 ( 0.01 ) 0.13% of DCache stores
DCache castouts 1860 ( 0.01 )
PIF transfers (16 bytes each) 65260 ( 0.45 )
IFetch reads 40720 ( 0.28 )
Data reads 2212 ( 0.02 )
Data writes 7440 ( 0.05 )
ICache prefetches 7596 ( 0.05 )
DCache prefetches 7292 ( 0.05 )
ICache prefetch hits 630 ( 0.00 ) 5.83% of ICache allocates
DCache prefetch hits 1634 ( 0.01 ) 74.78% of DCache allocates
Cycles: total = 19799426
Summed | Summed
CPI CPI |% Cycle % Cycle
Committed instructions 14434798 ( 1.0000 1.0000 | 72.91 72.91 )
Taken branches 2010942 ( 0.1393 1.1393 | 10.16 83.06 )
Pipeline interlocks 429648 ( 0.0298 1.1691 | 2.17 85.23 )
ICache misses 77260 ( 0.0054 1.1744 | 0.39 85.62 )
DCache misses 23170 ( 0.0016 1.1760 | 0.12 85.74 )
DCache bank conflicts 2256418 ( 0.1563 1.3324 | 11.40 97.14 )
Exceptions 326774 ( 0.0226 1.3550 | 1.65 98.79 )
Uncached ifetches 284 ( 0.0000 1.3550 | 0.00 98.79 )
Uncached loads 56 ( 0.0000 1.3550 | 0.00 98.79 )
Sync replays 7308 ( 0.0005 1.3555 | 0.04 98.82 )
Special instructions 21178 ( 0.0015 1.3570 | 0.11 98.93 )
Loop overhead 144340 ( 0.0100 1.3670 | 0.73 99.66 )
Store buffer stalls 66729 ( 0.0046 1.3716 | 0.34 100.00 )
Non-TIE global stalls 514 ( 0.0000 1.3716 | 0.00 100.00 )
Reset 7 ( 0.0000 1.3716 | 0.00 100.00 )
I have also tried __builtin_prefetch(); for the data but I could not get improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocating and copying to int16_t coefficients vector to 16 bytes aligned is not helping MCPS. Also after that using the non-aligning load for coefficients is not improving.
Can you try cache line size aligned ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried both 64 bytes (size seen above) FIR data and coefficient align. Unfortunately it does not help. The MCPS remains the same.
There is no need for different names for the cores to handle different optimized versions. It avoids unnecessary changes when support for new HiFi versions is added in components those use the FIR functions. Signed-off-by: Seppo Ingalsuo <[email protected]>
The struct member is not implemented and not used so it can be removed as cleanup. Signed-off-by: Seppo Ingalsuo <[email protected]>
0bed7c1
to
848066f
Compare
This patch contains only minimal changes to build this with HiFi5 toolchain. Changed SOF_USE_HIFI() / SOF_USE_MIN_HIFI() macros usage and included hifi5 header. Signed-off-by: Seppo Ingalsuo <[email protected]>
This patch optimizes the function fir_32x16_2x_hifi5(). - The (4x) quad-MAC with AE_MULAFD32X16X2_FIR_HH() and AE_MULAFD32X16X2_FIR_HL() is replaced with a 8x MAC intrinsic AE_MULA2Q32X16_FIR_H(). - Since the 8x MAC is not supporting fractions, a shift left by one is added to adjust the format to Q17.47. - The output sample single saturation and round is replaced with instruction that rounds two 64 bit accumulators. WIP - Currently the MCPS saving with FIR EQ and TDFB components seems much smaller, only 0.2 MCPS. Signed-off-by: Seppo Ingalsuo <[email protected]>
848066f
to
e286577
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, whats the latest savings ?
It's only 0.2 MCPS for stereo equalizer, and same for 2 mic beamformer. Both tests used a quite long filter. |
int *rshift) | ||
{ | ||
*lshift = (fir->out_shift < 0) ? -fir->out_shift : 0; | ||
*rshift = (fir->out_shift > 0) ? fir->out_shift : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do *rshift = fir->out_shift + *lshift;
to avoid a branch, but probably readability is more important here :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not "hot" code, so at least I prefer readability.
No description provided.