Further optimize `intermediates_to_table_indices` #1457

andyleiserson · 2024-11-26T00:33:18Z

`intermediates_to_table_indices` works as follows:

1. It calls `bits_to_table_indices`, which takes three `u128`s, each containing the value of one of three intermediates for 128 multiplications, and returns four `u128`s containing a table index in each nibble.
2. It then reorders those nibbles into bytes as its output. (Originally, the table lookup was done here, but additional optimization moved the table lookup elsewhere.)

It appears that `bits_to_table_indices` compiles to <200 instructions (fully unrolled, with no loops or branches), while the rearranging of nibbles into bytes compiles to >1000 instructions (again fully unrolled, with no loops or branches). Implementing a single transpose-like operation that covers both steps, going directly from the three `u128` inputs to one index per byte, would probably be more efficient.
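To make the idea concrete, here is a minimal sketch of what a fused operation could look like. It is not the repository's code: the function names, the bit order within each index (intermediate 0 as the lowest bit), and the output order (byte `i` holds the index for multiplication `i`) are all assumptions made for illustration. `fused_reference` is a plain per-bit loop that serves as a specification, and `fused_swar` processes eight multiplications at a time using a multiply-and-mask bit spread, skipping the intermediate nibble representation entirely.

```rust
// Hypothetical sketch: names, bit order, and output layout are assumptions,
// not taken from the actual implementation.

const LANES: u64 = 0x0101_0101_0101_0101; // a 1 in the low bit of every byte
const SELECT: u64 = 0x8040_2010_0804_0201; // byte k keeps only bit k
const ROUND: u64 = 0x7f7f_7f7f_7f7f_7f7f; // pushes any nonzero byte up to bit 7

/// Spread the 8 bits of `x` so that byte k of the result is 1 if bit k of `x`
/// is set, and 0 otherwise.
fn spread_bits(x: u8) -> u64 {
    // Replicate x into every byte, then keep a different bit in each byte.
    let picked = (x as u64 * LANES) & SELECT;
    // Each byte is now 0 or a single bit; adding 0x7f sets bit 7 only in the
    // nonzero bytes and never carries into the neighboring byte.
    ((picked + ROUND) >> 7) & LANES
}

/// Reference version: one table index per multiplication, built bit by bit.
/// Assumes bit i of z[j] is intermediate j of multiplication i.
fn fused_reference(z: [u128; 3]) -> [u8; 128] {
    let mut out = [0u8; 128];
    for (i, slot) in out.iter_mut().enumerate() {
        let b0 = ((z[0] >> i) & 1) as u8;
        let b1 = ((z[1] >> i) & 1) as u8;
        let b2 = ((z[2] >> i) & 1) as u8;
        *slot = b0 | (b1 << 1) | (b2 << 2);
    }
    out
}

/// Fused version: 8 multiplications per step, producing bytes directly.
fn fused_swar(z: [u128; 3]) -> [u8; 128] {
    let mut out = [0u8; 128];
    for (j, chunk) in out.chunks_exact_mut(8).enumerate() {
        // Take byte j of each intermediate (truncating cast keeps the low 8 bits).
        let s0 = spread_bits((z[0] >> (8 * j)) as u8);
        let s1 = spread_bits((z[1] >> (8 * j)) as u8);
        let s2 = spread_bits((z[2] >> (8 * j)) as u8);
        // Combine into one 3-bit index per byte; little-endian order puts the
        // index for multiplication 8*j + k into chunk[k].
        let indices = s0 | (s1 << 1) | (s2 << 2);
        chunk.copy_from_slice(&indices.to_le_bytes());
    }
    out
}
```

Whether something along these lines actually beats the current two-step code would need to be confirmed by inspecting the generated assembly and benchmarking; a SIMD transpose may well do better than this scalar version.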


andyleiserson added the `performance` label ("This affects protocol performance") on Nov 26, 2024
