-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization of nowrshmsk #3875
Conversation
679ecbd
to
fb77965
Compare
fb77965
to
6d69930
Compare
This PR seems to reduce the Quality-of-Results for yosys 0.32:
PR #3875:
|
Good find! The resource increase seems to be caused by the simplification of I'll see if I can avoid changing BTW please don't use unpacked arrays within packed structs; this is not allowed by the LRM. The following (correct) typedef is functionally equivalent:
|
If you look at the reports in the attached zip, you can see that your solution is actually smaller right up to running So I see three possibilities:
Sadly I am somewhat out of time for the next 2-3 weeks (I will still try to work on it, just less). |
Yeah. I found that the real (current) optimization was always the simplification of the AST_CASE using stride, seemingly the removal of multiplication of index by stride will in certain scenarios only do harm ATM (I guess some optimization pass removes the multiplication in any case). As a pragmatic measure I've disabled the removal of the multiplication for now, and at the same time I generalized the AST_CASE simplification to handle variable slices on the form dst[i*stride +: width] = src
If you like, you can probably use this PR together with #3877 right away for testing of the SoC design. When you continue working on optimization of the shift approach, the results from |
With disabled mult optimization (last commit):
I can confirm that this solves the QoR regression on the given example. And I am absolutely going to use this + PR #3877 as a golden reference for the other approach in #3873 and #3880. |
a2e6956
to
8a925ce
Compare
@povik I did a cleanup by squashing the commits. As a torture test I've also run This resulted in two failures:
Running To confuse things further, the failure summary of Do you have any idea what can be wrong here, either in the generated optimized CASE, or in the equivalence test? |
The only hypothesis I've been able to come up with regarding the failure for Given that |
2292984
to
df8e732
Compare
int start_bit = source_offset + i; | ||
int end_bit = std::min(start_bit+result_width,source_width) - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the earlier code here was confused since start_bit+result_width
had the source_offset
included but was compared to source_width
which had not. So this led to incorrect behavior on wires with non-zero source_offset
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw shouldn't we rename all those source_
variables? It's the target of the assignment, so it's confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the earlier code here was confused since
start_bit+result_width
had thesource_offset
included but was compared tosource_width
which had not. So this led to incorrect behavior on wires with non-zerosource_offset
, right?
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw shouldn't we rename all those
source_
variables? It's the target of the assignment, so it's confusing.
I agree! I've been thinking about changing that, however I didn't dare to change the "source_" naming which originates with source_width
in the shift and mask code.
Should I change all source_offset
and source_width
to e.g. dst_offset
and dst_width
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, go ahead. I would suggest wire_offset
and wire_width
, but really anything will be an improvement it seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, changed to wire_offset
and wire_width
now.
df8e732
to
34873af
Compare
frontends/ast/simplify.cc
Outdated
if (start_bit%bitno_div != 0 || (stride == 0 && start_bit != 0)) | ||
continue; | ||
|
||
AstNode *cond = new AstNode(AST_COND, mkconst_int(start_bit, case_sign_hint, case_width_hint)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here by forcing the sign and width of shift_expr
on the start_bit
constant in the case, can we make it overflow? Either way what's the purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can overflow. Not considering this can lead to incorrect behavior, as I found here: #3875 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so we consider the overflow in figuring out the reachable values of shift_expr
, but shouldn't we then take the overflown value for the bit-select on the target wire?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do, by checking whether it is possible to generate the current index by a possibly overflown shift_expr (start_bit%gcd(stride, shift_mod) == 0
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I see now. So we are not relying on mkconst_int(start_bit, case_sign_hint, case_width_hint)
wrapping around the constant since at that point it should never be outside the [min_offset
,max_offset
] range. If that's the case I suggest we have it as mkconst_int(start_bit, true)
so it's clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would unfortunately make tests/verilog/dynamic_range_lhs.v
fail again.
The width is required, since the bit pattern for a negative number depends on the width.
E.g. here is the diff between the working and non-working generated CASE in the first failure in the test above, generated with ../../yosys -DALT=1 -DSPAN=1 -DLEFT=-3 -DRIGHT=0 -p "read_verilog -dump_vlog2 ./dynamic_range_lhs.v"
:
--- case-ok.txt 2023-11-20 21:24:03.486296307 +0100
+++ case-not-ok.txt 2023-11-20 21:25:05.460245527 +0100
@@ -46,15 +46,15 @@
1:
begin
case ((sel1)*(sel2))
- 2'b 00:
+ 0:
out_u[0:0] = data;
endcase
case (/** AST_TO_SIGNED **/)
- 2'b 10:
+ -2:
out_s[-2:-2] = data;
- 2'b 11:
+ -1:
out_s[-1:-1] = data;
- 2'b 00:
+ 0:
out_s[0:0] = data;
endcase
end
34873af
to
b811e48
Compare
All tests pass now with a forced The possible overflow of In the end I had to add an internal IMO this is ready to be merged now, with a possible change of |
No worries! |
frontends/ast/simplify.cc
Outdated
long long shift_mod = 1ll << (case_width_hint - case_sign_hint); | ||
// std::gcd requires C++17 | ||
// long long bitno_div = std::gcd(stride, shift_mod); | ||
long long bitno_div = gcd((long long)stride, shift_mod); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this undoes much of the optimisation making bitno_div
only be power-of-2, not that I don't see why it's done to address the overflown cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's not ideal, although power-of-2 will perfectly optimize common (most?) cases (e.g. gcd(8, 256) = 8
), and partially optimize others (e.g. gcd(12, 256) = 4
).
Since we know the value of stride
, it might be possible to identify cases where the index expression cannot possibly overflow, and fully optimize those cases. I'll look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about now? 😅
All tests still pass with a forced use_case_method = true
, and the results from running the tests in shift_issue3.zip
above are:
Chip area for module '\reg_demux_nowrshmsk_mdim_arr': 19213.891200
Chip area for module '\reg_demux_nowrshmsk_typedef': 19115.271000
Chip area for module '\reg_demux_pow2': 5513.961600
Chip area for module '\reg_demux': 21857.661000
As far as I can tell, the PR now only brings improvements, without any regressions.
1cf4925
to
7fec839
Compare
edcd997
to
be87ff0
Compare
@daglem Sorry for the protracted review. I should get to it soon. |
09026cc
to
d38a50d
Compare
@@ -321,6 +321,9 @@ namespace AST | |||
static AstNode *mkconst_str(const std::vector<RTLIL::State> &v); | |||
static AstNode *mkconst_str(const std::string &str); | |||
|
|||
// helper function to create an AST node for a temporary register | |||
AstNode *mktemp_logic(const std::string &name, AstNode *mod, bool nosync, int range_left, int range_right, bool is_signed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be static
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't be static, as long as it's encoding the filename and location in wire->str
.
Is that a problem? If so, I guess I could make a separate (non-static) function to generate the name, and pass the result of that into mktemp_logic
, but I'm not sure I see the benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, missed that. No, it's not a problem per se
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I've also thought about enhancing the function later, adding a parameter to decide whether the name should be local (no name decoration) or global (name decorated as it is now).
I believe the function could then be used to replace most of the current code to instantiate temporaries, for which I could make a PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@povik As promised did some more tests with shift_issue3.zip (including reg_demux_noshift in the Makefile). I applied #3877 to reproduce increased chip area when an unconditional temporary was introduced for the rhs.
Without any changes (no temporary in this case):
Chip area for module '\reg_demux_noshift': 18689.907600
Forcing use of the temporary, which is currently assigned by AST_ASSIGN_EQ in the current block, instantiating a (* nosync *)
register:
Chip area for module '\reg_demux_noshift': 18815.101200
Forcing use of the temporary using AST_ASSIGN at the module level instead, instantiating a wire, since it could possibly be argued that this should use less resources than a register in this case (it did the contrary):
Chip area for module '\reg_demux_noshift': 18896.749200
As it stands for this test case, I'm hesitant to make any further changes.
Perhaps someone wants to investigate why introducing a temporary makes any difference at all (e.g. should it have been completely optimized away in simplify.cc
/ genrtlil.cc
in this case, since it's assigned from a simple 117 bit wire?), however this is outside the scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I experimented with this, and as far as I can tell the difference in area is due to the chaotic nature of the synthesis flow, not because of any systematic effect of introducing the temporary. When that's the case I think we want to make the temporary non-conditional, just so we spare a bit of control flow in the congested simplify.cc
code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done.
d38a50d
to
1e7b3fb
Compare
This makes tests/verilog/dynamic_range_lhs.v pass, after ensuring that nowrshmsk is actually tested. Stride is extracted from indexing of two-dimensional packed arrays and variable slices on the form dst[i*stride +: width] = src, and is used to optimize the generated CASE block. Also uses less confusing variable names for indexing of lhs wires.
Avoid repeating complex rvalue expressions for each condition.
1e7b3fb
to
1a2b475
Compare
Thanks! Let's wait for CI to finish then merge. |
Generation of minimal AST_CASE for variable slices on the form dst[i*stride +: width] = src.
Corrections of corner cases, now correctly tested by tests/verilog/dynamic_range_lhs.v