Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Math: FIR: Optimize filter core function for HiFi5 #9833

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

singalsu
Copy link
Collaborator

No description provided.

@@ -198,21 +188,14 @@ void fir_32x16_2x(struct fir_state_32x16 *fir, ae_int32 x0, ae_int32 x1,

/* Write samples to delay */
AE_S32_L_XC(x0, fir->rwp, -sizeof(int32_t));
dp = (ae_f32x2 *)fir->rwp;
dp = (ae_int32x2 *)fir->rwp;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change, I was trying ae_int32x4 pointer but it can't be used here due to data align issue.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth looking at the assembly to debug the lower than expected perf gain.

@singalsu
Copy link
Collaborator Author

Probably worth looking at the assembly to debug the lower than expected perf gain.

There's some nops but looks quite well packed instructions to slots for the filtering loop in this disassembly. There's two of ae_mula2q32x16.fir.h instructions while C code has one. I'm guessing (and hoping) the 2nd with many nops is after completing the loops (-1).

0003c8c8 <fir_32x16_2x>:
   3c8c8:	004136                                    	entry	a1, 32
   3c8cb:	712c50b140085e87bd12c8160220192f 	{ l32i	a9, a2, 16; or	a7, a4, a4; ae_movi	aed4, 0; ae_movi	aed5, 0 }
   3c8db:	c90b0cac023e                           	{ l32i	a3, a2, 0; or	a6, a3, a3 }
   3c8e1:	3248                                    	l32i.n	a4, a2, 12
   3c8e3:	94e40a75fcae                           	{ movi	a10, -4; srai	a8, a9, 2 }
   3c8e9:	2edfe201180e                           	{ beqz	a8, 3c97d <fir_32x16_2x+0xb5>; addi	a8, a8, -1 }
   3c8ef:	ba0b80342fae033f                  	{ ae_s32.l.xc	aed0, a3, a10; nop; ae_mov	aed0, aed1 }
   3c8f7:	c90d8cad123e                           	{ s32i	a3, a2, 0; or	a11, a3, a3 }
   3c8fd:	ba0bd41328a60b3f                  	{ ae_s32.l.xc	aed0, a11, a10; nop; nop }
   3c905:	02b9                                    	s32i.n	a11, a2, 0
   3c907:	148000110a914cbf                  	{ blti.w15	a9, 4, 3c958 <fir_32x16_2x+0x90>; movi	a2, 8 }
   3c90f:	17c7e562192033bf                  	{ ae_l32x2.xc	aed3, a3, a2; ae_la64.pp	u0, a4 }
   3c917:	18c08166192013bf                  	{ ae_l32x2.xc	aed1, a3, a2; ae_la16x4.ip	aed0, u0, a4 }
   3c91f:	d1b80108155e                           	{ nop; nop }
   3c925:	0020f0                                    	nop
   3c928:	163ac4e1008c1c3f                  	{ loopnez.w15	a8, 3c948 <fir_32x16_2x+0x80>; ae_l32x2.xc	aed2, a3, a2 }
   3c930:	0fac43010a21a7aa011810600022034f 	{ ae_l32x2.xc	aed1, a3, a2; ae_la16x4.ip	aed0, u0, a4; nop; ae_mula2q32x16.fir.h	aed5, aed4, aed3, aed1, aed2, aed0 }
   3c940:	ba0a0c542f3e03bf                  	{ ae_l32x2.xc	aed2, a3, a2; nop; ae_mov	aed3, aed2 }

0003c948 <fir_32x16_2x+0x80>:
   3c948:	0faa54c10a21b1bb0810586000c2090f 	{ nop; nop; nop; ae_mula2q32x16.fir.h	aed5, aed4, aed3, aed1, aed2, aed0 }

0003c958 <fir_32x16_2x+0x90>:
   3c958:	251b                                    	addi.n	a2, a5, 1
   3c95a:	052274                                    	ae_slaa64s	aed0, aed5, a2
   3c95d:	053234                                    	ae_slaa64s	aed1, aed4, a2
   3c960:	75aa54c0701c11bb0810586000c2090f 	{ nop; nop; nop; ae_round32x2f48sasym	aed0, aed0, aed1 }
   3c970:	ba0bd4132800873f                  	{ ae_s32.h.i	aed0, a7, 0; nop; nop }
   3c978:	4a0064                                    	ae_s32.l.i	aed0, a6, 0
   3c97b:	f01d                                    	retw.n

0003c97d <fir_32x16_2x+0xb5>:
   3c97d:	4a0064                                    	ae_s32.l.i	aed0, a6, 0
   3c980:	4a1074                                    	ae_s32.l.i	aed1, a7, 0
   3c983:	f01d                                    	retw.n

ae_f32x2 d1;
int i;
ae_int32 *dp = fir->rwp;
ae_int16x4 *coefp = (ae_int16x4 *)fir->coef;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is fir->coef defined with SIMD width alignment in the structure definition ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No there isn't. The code assumes the coefficients are without align. The coefficient loads are accessed via a 64 bit align register. So, there's likely one additional load happening in beginning of FIR calculate, then single 64 bit load per 16x4 set.

Unfortunately this FIR blob is not friendly for 64 bits align. There's five 32 bits words before the coefficients vector: https://github.com/thesofproject/sof/blob/main/src/include/user/fir.h

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you. Can you memcpy the coefs from

struct sof_fir_coef_data {
	int16_t length; /* Number of FIR taps */
	int16_t out_shift; /* Amount of right shifts at output */

	/* reserved */
	uint32_t reserved[4];

	int16_t coef[]; /* FIR coefficients */
} __attribute__((packed));

to a locally SIMD aligned ptr for alignment and check the MCPS

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocating and copying to int16_t coefficients vector to 16 bytes aligned is not helping MCPS. Also after that using the non-aligning load for coefficients is not improving.

I tried to remove totally coefficients load from loop and use same zero coefficients for all MACs. It had only small impact. Same also from removing the MAC instruction, only some tenth of MCPS.

The circular loads of data samples seem to dominate the MCPS load. If I remove everything from FIR function after those data loads only, I get 3 MCPS, so the samples data loads consume that 7 MCPS and the remaining code is same as in HiFi4 version. Maybe the memory can't keep up with the loop.

With unmodified code and with --mem_model 11.82 MCPS run I get this statistics:

Cache Configuration:
  ICache: 32768 bytes (32KB), 2-way set associative, 64-byte line
  DCache: 98304 bytes (96KB), 3-way set associative, 64-byte line, writeback

Events                             Number  Number
                                           per 100
                                           instrs 

Committed instructions          14434798 ( 100.00 )
Instruction fetches             12316053 (  85.32 )
   Uncached                           48 (   0.00 )
   ICache fetches               12316005 (  85.32 )
      ICache misses                10798 (   0.07 )  0.09% of ICache fetches
   Loop buffer hits              8388078 (  58.11 ) 68.11% of Instruction fetches
      Loop Buffer ICache hits    8388078 (  58.11 ) 68.11% of ICache fetches
Taken branches                    640825 (   4.44 )
Exceptions                         46682 (   0.32 )
   WindowUnderflow                 23341 (   0.16 )
   WindowOverflow                  23341 (   0.16 )
Loads                           15485723 ( 107.28 )
   Uncached                            8 (   0.00 )
   DCache loads                 15485715 ( 107.28 )
      DCache load misses             404 (   0.00 )  0.00% of DCache loads
Stores                           1373651 (   9.52 )
   DCache stores                 1373651 (   9.52 )
      DCache store misses           1781 (   0.01 )  0.13% of DCache stores
DCache castouts                     1860 (   0.01 )
PIF transfers (16 bytes each)      65260 (   0.45 )
   IFetch reads                    40720 (   0.28 )
   Data reads                       2212 (   0.02 )
   Data writes                      7440 (   0.05 )
   ICache prefetches                7596 (   0.05 )
   DCache prefetches                7292 (   0.05 )
ICache prefetch hits                 630 (   0.00 )  5.83% of ICache allocates
DCache prefetch hits                1634 (   0.01 ) 74.78% of DCache allocates

Cycles: total = 19799426
                                                  Summed |          Summed
                                          CPI      CPI   |% Cycle  % Cycle
Committed instructions        14434798 ( 1.0000   1.0000 |  72.91    72.91 )
Taken branches                 2010942 ( 0.1393   1.1393 |  10.16    83.06 )
Pipeline interlocks             429648 ( 0.0298   1.1691 |   2.17    85.23 )
ICache misses                    77260 ( 0.0054   1.1744 |   0.39    85.62 )
DCache misses                    23170 ( 0.0016   1.1760 |   0.12    85.74 )
DCache bank conflicts          2256418 ( 0.1563   1.3324 |  11.40    97.14 )
Exceptions                      326774 ( 0.0226   1.3550 |   1.65    98.79 )
Uncached ifetches                  284 ( 0.0000   1.3550 |   0.00    98.79 )
Uncached loads                      56 ( 0.0000   1.3550 |   0.00    98.79 )
Sync replays                      7308 ( 0.0005   1.3555 |   0.04    98.82 )
Special instructions             21178 ( 0.0015   1.3570 |   0.11    98.93 )
Loop overhead                   144340 ( 0.0100   1.3670 |   0.73    99.66 )
Store buffer stalls              66729 ( 0.0046   1.3716 |   0.34   100.00 )
Non-TIE global stalls              514 ( 0.0000   1.3716 |   0.00   100.00 )
Reset                                7 ( 0.0000   1.3716 |   0.00   100.00 )

I have also tried __builtin_prefetch(); for the data but I could not get improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocating and copying to int16_t coefficients vector to 16 bytes aligned is not helping MCPS. Also after that using the non-aligning load for coefficients is not improving.

Can you try cache line size aligned ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried both 64 bytes (size seen above) FIR data and coefficient align. Unfortunately it does not help. The MCPS remains the same.

There is no need for different names for the cores to handle
different optimized versions. It avoids unnecessary changes when
support for new HiFi versions is added in components those use the
FIR functions.

Signed-off-by: Seppo Ingalsuo <[email protected]>
The struct member is not implemented and not used so it
can be removed as cleanup.

Signed-off-by: Seppo Ingalsuo <[email protected]>
This patch contains only minimal changes to build this with
HiFi5 toolchain. Changed SOF_USE_HIFI() / SOF_USE_MIN_HIFI()
macros usage and included hifi5 header.

Signed-off-by: Seppo Ingalsuo <[email protected]>
This patch optimizes the function fir_32x16_2x_hifi5().

- The (4x) quad-MAC with AE_MULAFD32X16X2_FIR_HH() and
  AE_MULAFD32X16X2_FIR_HL() is replaced with a 8x MAC intrinsic
  AE_MULA2Q32X16_FIR_H().
- Since the 8x MAC is not supporting fractions, a shift left by
  one is added to adjust the format to Q17.47.
- The output sample single saturation and round is replaced with
  instruction that rounds two 64 bit accumulators.

WIP - Currently the MCPS saving with FIR EQ and TDFB components
seems much smaller, only 0.2 MCPS.

Signed-off-by: Seppo Ingalsuo <[email protected]>
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, whats the latest savings ?

@singalsu
Copy link
Collaborator Author

LGTM, whats the latest savings ?

It's only 0.2 MCPS for stereo equalizer, and same for 2 mic beamformer. Both tests used a quite long filter.

@singalsu singalsu marked this pull request as ready for review February 20, 2025 08:57
int *rshift)
{
*lshift = (fir->out_shift < 0) ? -fir->out_shift : 0;
*rshift = (fir->out_shift > 0) ? fir->out_shift : 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do *rshift = fir->out_shift + *lshift; to avoid a branch, but probably readability is more important here :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not "hot" code, so at least I prefer readability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants