Math: FIR: Optimize filter core function for HiFi5 #9833

singalsu · 2025-02-14T18:20:36Z

No description provided.

singalsu · 2025-02-17T08:48:19Z

src/math/fir_hifi5.c

@@ -198,21 +188,14 @@ void fir_32x16_2x(struct fir_state_32x16 *fir, ae_int32 x0, ae_int32 x1,

 	/* Write samples to delay */
 	AE_S32_L_XC(x0, fir->rwp, -sizeof(int32_t));
-	dp = (ae_f32x2 *)fir->rwp;
+	dp = (ae_int32x2 *)fir->rwp;


No need to change, I was trying ae_int32x4 pointer but it can't be used here due to data align issue.

lgirdwood

Probably worth looking at the assembly to debug the lower than expected perf gain.

src/math/fir_hifi5.c

singalsu · 2025-02-17T14:33:46Z

Probably worth looking at the assembly to debug the lower than expected perf gain.

There's some nops but looks quite well packed instructions to slots for the filtering loop in this disassembly. There's two of ae_mula2q32x16.fir.h instructions while C code has one. I'm guessing (and hoping) the 2nd with many nops is after completing the loops (-1).

0003c8c8 <fir_32x16_2x>:
   3c8c8:	004136                                    	entry	a1, 32
   3c8cb:	712c50b140085e87bd12c8160220192f 	{ l32i	a9, a2, 16; or	a7, a4, a4; ae_movi	aed4, 0; ae_movi	aed5, 0 }
   3c8db:	c90b0cac023e                           	{ l32i	a3, a2, 0; or	a6, a3, a3 }
   3c8e1:	3248                                    	l32i.n	a4, a2, 12
   3c8e3:	94e40a75fcae                           	{ movi	a10, -4; srai	a8, a9, 2 }
   3c8e9:	2edfe201180e                           	{ beqz	a8, 3c97d <fir_32x16_2x+0xb5>; addi	a8, a8, -1 }
   3c8ef:	ba0b80342fae033f                  	{ ae_s32.l.xc	aed0, a3, a10; nop; ae_mov	aed0, aed1 }
   3c8f7:	c90d8cad123e                           	{ s32i	a3, a2, 0; or	a11, a3, a3 }
   3c8fd:	ba0bd41328a60b3f                  	{ ae_s32.l.xc	aed0, a11, a10; nop; nop }
   3c905:	02b9                                    	s32i.n	a11, a2, 0
   3c907:	148000110a914cbf                  	{ blti.w15	a9, 4, 3c958 <fir_32x16_2x+0x90>; movi	a2, 8 }
   3c90f:	17c7e562192033bf                  	{ ae_l32x2.xc	aed3, a3, a2; ae_la64.pp	u0, a4 }
   3c917:	18c08166192013bf                  	{ ae_l32x2.xc	aed1, a3, a2; ae_la16x4.ip	aed0, u0, a4 }
   3c91f:	d1b80108155e                           	{ nop; nop }
   3c925:	0020f0                                    	nop
   3c928:	163ac4e1008c1c3f                  	{ loopnez.w15	a8, 3c948 <fir_32x16_2x+0x80>; ae_l32x2.xc	aed2, a3, a2 }
   3c930:	0fac43010a21a7aa011810600022034f 	{ ae_l32x2.xc	aed1, a3, a2; ae_la16x4.ip	aed0, u0, a4; nop; ae_mula2q32x16.fir.h	aed5, aed4, aed3, aed1, aed2, aed0 }
   3c940:	ba0a0c542f3e03bf                  	{ ae_l32x2.xc	aed2, a3, a2; nop; ae_mov	aed3, aed2 }

0003c948 <fir_32x16_2x+0x80>:
   3c948:	0faa54c10a21b1bb0810586000c2090f 	{ nop; nop; nop; ae_mula2q32x16.fir.h	aed5, aed4, aed3, aed1, aed2, aed0 }

0003c958 <fir_32x16_2x+0x90>:
   3c958:	251b                                    	addi.n	a2, a5, 1
   3c95a:	052274                                    	ae_slaa64s	aed0, aed5, a2
   3c95d:	053234                                    	ae_slaa64s	aed1, aed4, a2
   3c960:	75aa54c0701c11bb0810586000c2090f 	{ nop; nop; nop; ae_round32x2f48sasym	aed0, aed0, aed1 }
   3c970:	ba0bd4132800873f                  	{ ae_s32.h.i	aed0, a7, 0; nop; nop }
   3c978:	4a0064                                    	ae_s32.l.i	aed0, a6, 0
   3c97b:	f01d                                    	retw.n

0003c97d <fir_32x16_2x+0xb5>:
   3c97d:	4a0064                                    	ae_s32.l.i	aed0, a6, 0
   3c980:	4a1074                                    	ae_s32.l.i	aed1, a7, 0
   3c983:	f01d                                    	retw.n

lgirdwood · 2025-02-17T15:13:03Z

src/math/fir_hifi5.c

+	ae_f32x2 d1;
+	int i;
+	ae_int32 *dp = fir->rwp;
+	ae_int16x4 *coefp = (ae_int16x4 *)fir->coef;


is fir->coef defined with SIMD width alignment in the structure definition ?

No there isn't. The code assumes the coefficients are without align. The coefficient loads are accessed via a 64 bit align register. So, there's likely one additional load happening in beginning of FIR calculate, then single 64 bit load per 16x4 set.

Unfortunately this FIR blob is not friendly for 64 bits align. There's five 32 bits words before the coefficients vector: https://github.com/thesofproject/sof/blob/main/src/include/user/fir.h

Got you. Can you memcpy the coefs from

struct sof_fir_coef_data { int16_t length; /* Number of FIR taps */ int16_t out_shift; /* Amount of right shifts at output */ /* reserved */ uint32_t reserved[4]; int16_t coef[]; /* FIR coefficients */ } __attribute__((packed));

to a locally SIMD aligned ptr for alignment and check the MCPS

Allocating and copying to int16_t coefficients vector to 16 bytes aligned is not helping MCPS. Also after that using the non-aligning load for coefficients is not improving.

I tried to remove totally coefficients load from loop and use same zero coefficients for all MACs. It had only small impact. Same also from removing the MAC instruction, only some tenth of MCPS.

The circular loads of data samples seem to dominate the MCPS load. If I remove everything from FIR function after those data loads only, I get 3 MCPS, so the samples data loads consume that 7 MCPS and the remaining code is same as in HiFi4 version. Maybe the memory can't keep up with the loop.

With unmodified code and with --mem_model 11.82 MCPS run I get this statistics:

Cache Configuration: ICache: 32768 bytes (32KB), 2-way set associative, 64-byte line DCache: 98304 bytes (96KB), 3-way set associative, 64-byte line, writeback Events Number Number per 100 instrs Committed instructions 14434798 ( 100.00 ) Instruction fetches 12316053 ( 85.32 ) Uncached 48 ( 0.00 ) ICache fetches 12316005 ( 85.32 ) ICache misses 10798 ( 0.07 ) 0.09% of ICache fetches Loop buffer hits 8388078 ( 58.11 ) 68.11% of Instruction fetches Loop Buffer ICache hits 8388078 ( 58.11 ) 68.11% of ICache fetches Taken branches 640825 ( 4.44 ) Exceptions 46682 ( 0.32 ) WindowUnderflow 23341 ( 0.16 ) WindowOverflow 23341 ( 0.16 ) Loads 15485723 ( 107.28 ) Uncached 8 ( 0.00 ) DCache loads 15485715 ( 107.28 ) DCache load misses 404 ( 0.00 ) 0.00% of DCache loads Stores 1373651 ( 9.52 ) DCache stores 1373651 ( 9.52 ) DCache store misses 1781 ( 0.01 ) 0.13% of DCache stores DCache castouts 1860 ( 0.01 ) PIF transfers (16 bytes each) 65260 ( 0.45 ) IFetch reads 40720 ( 0.28 ) Data reads 2212 ( 0.02 ) Data writes 7440 ( 0.05 ) ICache prefetches 7596 ( 0.05 ) DCache prefetches 7292 ( 0.05 ) ICache prefetch hits 630 ( 0.00 ) 5.83% of ICache allocates DCache prefetch hits 1634 ( 0.01 ) 74.78% of DCache allocates Cycles: total = 19799426 Summed | Summed CPI CPI |% Cycle % Cycle Committed instructions 14434798 ( 1.0000 1.0000 | 72.91 72.91 ) Taken branches 2010942 ( 0.1393 1.1393 | 10.16 83.06 ) Pipeline interlocks 429648 ( 0.0298 1.1691 | 2.17 85.23 ) ICache misses 77260 ( 0.0054 1.1744 | 0.39 85.62 ) DCache misses 23170 ( 0.0016 1.1760 | 0.12 85.74 ) DCache bank conflicts 2256418 ( 0.1563 1.3324 | 11.40 97.14 ) Exceptions 326774 ( 0.0226 1.3550 | 1.65 98.79 ) Uncached ifetches 284 ( 0.0000 1.3550 | 0.00 98.79 ) Uncached loads 56 ( 0.0000 1.3550 | 0.00 98.79 ) Sync replays 7308 ( 0.0005 1.3555 | 0.04 98.82 ) Special instructions 21178 ( 0.0015 1.3570 | 0.11 98.93 ) Loop overhead 144340 ( 0.0100 1.3670 | 0.73 99.66 ) Store buffer stalls 66729 ( 0.0046 1.3716 | 0.34 100.00 ) Non-TIE global stalls 514 ( 0.0000 1.3716 | 0.00 100.00 ) Reset 7 ( 0.0000 1.3716 | 0.00 100.00 )

I have also tried __builtin_prefetch(); for the data but I could not get improvement.

Allocating and copying to int16_t coefficients vector to 16 bytes aligned is not helping MCPS. Also after that using the non-aligning load for coefficients is not improving.

Can you try cache line size aligned ?

I tried both 64 bytes (size seen above) FIR data and coefficient align. Unfortunately it does not help. The MCPS remains the same.

There is no need for different names for the cores to handle different optimized versions. It avoids unnecessary changes when support for new HiFi versions is added in components those use the FIR functions. Signed-off-by: Seppo Ingalsuo <[email protected]>

The struct member is not implemented and not used so it can be removed as cleanup. Signed-off-by: Seppo Ingalsuo <[email protected]>

This patch contains only minimal changes to build this with HiFi5 toolchain. Changed SOF_USE_HIFI() / SOF_USE_MIN_HIFI() macros usage and included hifi5 header. Signed-off-by: Seppo Ingalsuo <[email protected]>

This patch optimizes the function fir_32x16_2x_hifi5(). - The (4x) quad-MAC with AE_MULAFD32X16X2_FIR_HH() and AE_MULAFD32X16X2_FIR_HL() is replaced with a 8x MAC intrinsic AE_MULA2Q32X16_FIR_H(). - Since the 8x MAC is not supporting fractions, a shift left by one is added to adjust the format to Q17.47. - The output sample single saturation and round is replaced with instruction that rounds two 64 bit accumulators. WIP - Currently the MCPS saving with FIR EQ and TDFB components seems much smaller, only 0.2 MCPS. Signed-off-by: Seppo Ingalsuo <[email protected]>

lgirdwood

LGTM, whats the latest savings ?

singalsu · 2025-02-20T08:57:26Z

LGTM, whats the latest savings ?

It's only 0.2 MCPS for stereo equalizer, and same for 2 mic beamformer. Both tests used a quite long filter.

lyakh · 2025-02-24T07:56:10Z

src/math/fir_hifi5.c

+		      int *rshift)
+{
+	*lshift = (fir->out_shift < 0) ? -fir->out_shift : 0;
+	*rshift = (fir->out_shift > 0) ? fir->out_shift : 0;


You could do *rshift = fir->out_shift + *lshift; to avoid a branch, but probably readability is more important here :-)

It's not "hot" code, so at least I prefer readability.

singalsu commented Feb 17, 2025

View reviewed changes

lgirdwood reviewed Feb 17, 2025

View reviewed changes

src/math/fir_hifi5.c Show resolved Hide resolved

lgirdwood reviewed Feb 17, 2025

View reviewed changes

singalsu added 2 commits February 18, 2025 18:46

Math: FIR: Remove in_shift from struct fir_state_32x16

94c4221

The struct member is not implemented and not used so it can be removed as cleanup. Signed-off-by: Seppo Ingalsuo <[email protected]>

singalsu force-pushed the math_fir_hifi5 branch from 0bed7c1 to 848066f Compare February 18, 2025 16:58

singalsu added 2 commits February 18, 2025 19:04

Math: FIR: Copy HiFi5 filter functions from HiFi3 version

23c0437

This patch contains only minimal changes to build this with HiFi5 toolchain. Changed SOF_USE_HIFI() / SOF_USE_MIN_HIFI() macros usage and included hifi5 header. Signed-off-by: Seppo Ingalsuo <[email protected]>

singalsu force-pushed the math_fir_hifi5 branch from 848066f to e286577 Compare February 18, 2025 17:04

lgirdwood reviewed Feb 19, 2025

View reviewed changes

singalsu marked this pull request as ready for review February 20, 2025 08:57

singalsu requested review from kv2019i, lyakh, iuliana-prodan, dbaluta, abonislawski, plbossart, mmaka1 and lbetlej as code owners February 20, 2025 08:57

lgirdwood approved these changes Feb 21, 2025

View reviewed changes

lyakh reviewed Feb 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Math: FIR: Optimize filter core function for HiFi5 #9833

Math: FIR: Optimize filter core function for HiFi5 #9833

singalsu commented Feb 14, 2025

singalsu Feb 17, 2025

lgirdwood left a comment

singalsu commented Feb 17, 2025

lgirdwood Feb 17, 2025

singalsu Feb 17, 2025

lgirdwood Feb 18, 2025

singalsu Feb 18, 2025

lgirdwood Feb 18, 2025

singalsu Feb 19, 2025

lgirdwood left a comment

singalsu commented Feb 20, 2025

lyakh Feb 24, 2025

singalsu Feb 24, 2025

Math: FIR: Optimize filter core function for HiFi5 #9833

Are you sure you want to change the base?

Math: FIR: Optimize filter core function for HiFi5 #9833

Conversation

singalsu commented Feb 14, 2025

Choose a reason for hiding this comment

lgirdwood left a comment

Choose a reason for hiding this comment

singalsu commented Feb 17, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lgirdwood left a comment

Choose a reason for hiding this comment

singalsu commented Feb 20, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment