Version 2.14.03
Adds a mini-block hierarchy level between blocks and sub-blocks.
Separates the unit of work for OpenMP threads from the cache-block size:
- Blocks, as before, are units-of-work for top-level OpenMP threads. Blocks are evaluated in parallel in each region.
- Mini-blocks are evaluated sequentially within each block and are typically sized for L2 caches.
By default, mini-blocks are the same size as blocks, so most users will see no difference.
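
The block/mini-block split described above can be pictured with a small 1-D sketch. This is only an illustration under assumptions, not the library's actual code: the names TileSizes and sweep_region are hypothetical, the "region" is reduced to a flat array, and the sub-block level is left out. Blocks are handed to OpenMP threads in parallel, each block walks its mini-blocks sequentially, and a mini-block size of 0 is used here to stand in for the default of "same size as the block".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes for a 1-D region (names are illustrative only).
struct TileSizes {
    std::size_t block;       // unit of work for one top-level OpenMP thread
    std::size_t mini_block;  // cache-sized tile; 0 means "same size as block"
};

// Adds 1 to every cell of the region and returns how many mini-blocks
// were visited. Sub-blocks are omitted to keep the sketch short.
std::size_t sweep_region(std::vector<int>& cells, TileSizes sz) {
    assert(sz.block > 0);
    const std::size_t n = cells.size();
    // Default: a mini-block covers the whole block.
    const std::size_t mb =
        (sz.mini_block == 0) ? sz.block : std::min(sz.mini_block, sz.block);
    const std::size_t num_blocks = (n + sz.block - 1) / sz.block;
    std::size_t mini_blocks_visited = 0;

    // Blocks are independent here, so they may run in parallel.
    // Without OpenMP enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for reduction(+ : mini_blocks_visited)
    for (std::ptrdiff_t b = 0; b < static_cast<std::ptrdiff_t>(num_blocks); ++b) {
        const std::size_t b_begin = static_cast<std::size_t>(b) * sz.block;
        const std::size_t b_end = std::min(b_begin + sz.block, n);

        // Mini-blocks are evaluated one after another within the block.
        for (std::size_t m = b_begin; m < b_end; m += mb) {
            const std::size_t m_end = std::min(m + mb, b_end);
            for (std::size_t i = m; i < m_end; ++i) {
                cells[i] += 1;
            }
            ++mini_blocks_visited;
        }
    }
    return mini_blocks_visited;
}
```

With a 10-cell region, a block size of 4 and a mini-block size of 2, this sketch visits five mini-blocks across three blocks; with the mini-block size left at 0 it visits one mini-block per block, which mirrors the "no difference by default" behavior.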
Temporal blocking can be applied to both blocks and mini-blocks. By default, using '-bt' sets both.
Also removes loop-grouping parameters because they have not shown performance gains and are confusing to users.