The general intuition is that pushes become syncs; visibility edges
become lwsyncs; and execution edges become either control-isyncs, or
data or address dependencies. There are some subtleties though.
Ignoring read-modify-writes, the algorithm is as follows (steps 1-4 are
sketched in code after the notes below):
1. Take the transitive closure of the visibility edges, and take the
transitive closure of the execution edges.
2. Discard any edges involving a nop.
3. For all triples of the form i -vo-> push -xo/vo-> i', create a push
edge i -push-> i'. (These exist in the algorithm only, not in the
theory.)
XXX: I think that this isn't right. We can't necessarily find all such
edges, since they might be in other functions or whatever.
4. Discard any edges involving a push.
5. Insert synchronization into the code such that:
a. Every push edge is cut by a sync.
b. Every W->W and R->W visibility edge is cut by a sync or lwsync.
c. Every R->R edge is cut by a sync, lwsync,
control-isync, or data/address dependency.
d. Every R->W execution edge is cut by a sync, lwsync,
control-isync, data/address dependency, or control dependency.
Doing so is interesting, particularly when edges cross from one
basic block to another.
Also, the execution stuff is a good deal trickier than that
because of edges across loops. (And there are /always/ edges
across loops, because the function will get called again!)
Note that:
a. R->R visibility edges have no more force than execution (other
than transitive implications).
b. W->R visibility and execution edges, and W->W execution edges,
are discarded (after transitive closure).
c. Pushes without both a visibility predecessor and a
visibility/execution successor are discarded.
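As promised above, here is a toy, self-contained sketch of steps 1-4 (not
the real pass): instructions are just integer ids with a kind, and step 5
(actually placing syncs/lwsyncs/dependencies to cut the edges) and all
basic-block structure are ignored.

// Toy sketch of steps 1-4; all names here are made up for illustration.
#include <cstdio>
#include <set>
#include <tuple>
#include <vector>

enum class Kind { Read, Write, Push, Nop };
enum class Edge { Vo, Xo, Push };        // visibility, execution, derived push
using E = std::tuple<Edge, int, int>;    // (edge kind, src id, dst id)

// Step 1: transitive closure of one edge kind.
static void closeUnder(std::set<E>& es, Edge k) {
    for (bool changed = true; changed;) {
        changed = false;
        std::set<E> snapshot = es;
        for (const auto& [k1, a, b] : snapshot)
            for (const auto& [k2, c, d] : snapshot)
                if (k1 == k && k2 == k && b == c)
                    changed |= es.insert({k, a, d}).second;
    }
}

int main() {
    // i0: write, i1: push, i2: read, with i0 -vo-> i1 -xo-> i2.
    std::vector<Kind> kind = {Kind::Write, Kind::Push, Kind::Read};
    std::set<E> es = {{Edge::Vo, 0, 1}, {Edge::Xo, 1, 2}};

    closeUnder(es, Edge::Vo);                            // step 1
    closeUnder(es, Edge::Xo);

    std::set<E> filtered;                                // step 2: drop nops
    for (const auto& [k, a, b] : es)
        if (kind[a] != Kind::Nop && kind[b] != Kind::Nop)
            filtered.insert({k, a, b});

    std::set<E> result = filtered;                       // step 3: i -vo-> push -vo/xo-> i'
    for (const auto& [k1, a, p] : filtered)
        for (const auto& [k2, p2, b] : filtered)
            if (k1 == Edge::Vo && p == p2 && kind[p] == Kind::Push)
                result.insert({Edge::Push, a, b});

    for (const auto& [k, a, b] : result)                 // step 4: drop push nodes
        if (kind[a] != Kind::Push && kind[b] != Kind::Push)
            std::printf("edge kind=%d: i%d -> i%d\n", static_cast<int>(k), a, b);
    // Prints the single derived edge: kind=2 (Push), i0 -> i2.
}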
--
Maybe we can do a bit more filtering of operations here based on
whether they are atomic or not.
I think that (see the sketch below):
* Non-atomic reads can't be the source of an edge
* Non-atomic writes can't be the destination of a visibility edge
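Written out, the filter would be something like this; a sketch only, and
the Op flags are hypothetical:

// Direct transcription of the two bullets above.
struct Op { bool isRead, isWrite, isAtomic; };

bool edgeIsUseful(const Op& src, const Op& dst, bool isVisibility) {
    if (src.isRead && !src.isAtomic) return false;                  // non-atomic read source
    if (isVisibility && dst.isWrite && !dst.isAtomic) return false; // non-atomic write vis. dest
    return true;
}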
--
Pre and post edges: we will just add the stuff and then try to notice it.
* False; we don't actually do it that way.
Honestly, maybe that's what we should do for push also? Although it won't
actually be hard to do push right, it's maybe not worth the state space
blowup.
* But we do do it that way for push.
Should be a little careful because probably a common idiom will be:
L(push, PUSH); VEDGE(pre, push); XEDGE(push, post);
and we should just generate a single sync in place and not do dumb things.
It will be common because there is a macro for it, so...
Will actually need special handling of pre/push to posts even just for
correctness!
--
I sort of dislike inserting all these barriers so early. I feel like
it might inhibit llvm more than we want.
--
But since we *do* insert all the barriers early, we actually can
implement pre and post by making dummy actions immediately before and
after actions with pre/post constraints. The only tricky thing is that
we need to make sure we don't try any funny business with dependencies
for xo edges.
* DONE: handle this fine
------------------------------------------------------
Things about the paper:
-- we need pushes for two threads writing to locations to get ordering:
maybe we should require co+vo to be acyclic? is this important?
-- ISA2+lwsync+data+addr: our semantics don't allow us to generate this
code and be correct; vo+xo+xo would allow the forbidden behavior;
vo+vo+xo works, as does push+xo+xo;
push+xo+xo is maaaybe what we want on ARM, but vo+vo+xo is
definitely better on x86
--------------------------------------------
I think we are going to want to be able to adjust where labels get
"bound": that is, relative to loops, the function call, etc. This
seems important for doing consume style stuff. The thing that matches
the theory would be to handle it at the label declaration, but it
seems cleaner for code implementation to do it when specifying edges.
We *definitely* want to be able to bind inside a function. Loops are more
complicated. What about gotos, etc.?
Actually maybe not definitely. If rcu_read_unlock() has a barrier then
maybe we don't give a shit anymore.
*** Idea: allow explicitly specifying a binding location for a
label. That binding location then needs to dominate all occurrences of
the label as well as anything with edges to/from that label. Then,
when looking at paths, we can ignore paths that go through the binding
site, I think. Not totally sure how to properly handle differing
binding sites between src/dst (technically they will probably always
differ, even if trivially...): maybe ignore the least common ancestor
dominator.
Probably also want a way to specify binding something at the "most
narrow possible binding" (least common dominator?).
--
OK the binding setup where it is anything that doesn't pass through
the binding site is janky as shit when we try to express it in the
formalism... Maybe it works but we need to make it so that the
binding site dominates the use sites.
OH WAIT THAT IS HOW WE DEFINE IT NOW COOL.
--- How elaborate of real dependencies do we want to try to preserve?
And how do we preserve them? Detecting and preserving straightforward
deps should be eaaaaasy.
We probably need to be able to handle linux RCU list style things,
which means handling bitcasts and GEPs.
DONE *** We are definitely definitely going to need a notion of path dependence
for detecting dependency. A super common example of RCU is searching a
linked list, so...
- Wee.
DONE * I should run some benchmarks to see how much faster it is with RCU.
- Like twice as fast
-----
** Combining isync and dependencies might actually be quite profitable,
since all addresses need to be resolved, right?
----
DONE *** We need to adjust our notion of cost to take loops into account!
We will basically preferentially put things into loops instead of
taking them out, which is totally backwards. Dunno if it should be
part of capacity or a separate score. Maybe increase the cost by a
factor of 4^n where n is the loop nesting count? (4 because a loop
will probably have 1/2 the capacity of its parent; 4^n makes a loop
have twice the cost of its parent).
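As a one-line sketch of that weighting (the function name is hypothetical;
the 4^n factor is just the heuristic suggested above):

#include <cstdint>

// Hypothetical cost weighting: scale a barrier's base cost by 4^loopDepth.
// (Each loop level roughly halves the available capacity, so 4^n makes a
// loop placement cost twice as much as its parent, per the note above.)
uint64_t loopAdjustedCost(uint64_t baseCost, unsigned loopDepth) {
    return baseCost << (2 * loopDepth);   // 4^n == 2^(2n)
}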
---
*** We could track which functions always have barriers (dmbs, isyncs)
and factor that information in when doing analysis.
---
Hilariously, calling mprotect
--
In C++: 1.10.28: "An implementation should ensure that the last value (in modification order) assigned by an atomic or
synchronization operation will become visible to all other threads in a finite period of time."
This precludes (I think) transformations like (assuming mo_relaxed):
x = 1; while (y); --> while (y); x = 1;
We don't really have a story for this sort of thing. I was originally
thinking that I wanted to use W->R execution order edges to prevent
this, but I am less sure, now. Maybe it should just be a side condition
in general that things "eventually" execute, "eventually" become visible.
I don't think that this side condition is much more painful than the
other, it saves us annotations, and I *think* it already fits our compiler
strategy (since we are turning all our accesses into mo_relaxed).
(RMW related things are complicating this analysis.)
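For concreteness, here is the precluded transformation written out as a
sketch (mo_relaxed atomics; function names are made up):

#include <atomic>

std::atomic<int> x{0}, y{1};

// Original: the store is issued before the (possibly unbounded) spin,
// so it should become visible "in a finite period of time".
void original() {
    x.store(1, std::memory_order_relaxed);
    while (y.load(std::memory_order_relaxed)) { /* spin */ }
}

// The transformation in question: sinking the store below the loop can
// delay its visibility forever if the loop never terminates, which
// 1.10.28 discourages ("should", not "shall").
void transformed() {
    while (y.load(std::memory_order_relaxed)) { /* spin */ }
    x.store(1, std::memory_order_relaxed);
}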
--
Do we have a theory story for using weak cmpxchg?
Maybe:
weak_cmpxchg(p, ev, nv) = RMW(p, x. if x == ev && !spurious() then nv else x)
Does this work? Is ordering stuff reasonable?
Oh, it also doesn't quite work, since it needs to output whether it succeeded, which can't be
divined from the value.
Actually, does /strong/ cmpxchg work right in failure cases??
The RW semantics in RMC /always/ does a write, but that won't actually
be the case, right? Does this cause trouble?
cmpxchg(p, ev, nv) = RMW(p, x. if x == ev then nv else x)
Implementing CAS with RMW can result in rf edges that don't actually
correspond with real rf edges in the hardware. I'm not totally sure
what all implications this has. The most obvious one is that it can
establish visibility in places where it might not make sense. I
/think/ that is fixed on ARM by making sure that all xo edges /out/ of
a CAS are dmbs (and not something dep based). On POWER it is a little worse:
all vo edges into and xo edges out of a CAS need to be *sync* (not lwsync).
This can be fixed by adding a new CAS directly to the language; like
RMW, it will evaluate into RWs and speculations; two rules: one for
success, one for failure.
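For reference, the RMW modeling above written as a runnable sketch; the
rmw helper is hypothetical and is itself built on a CAS loop only so the
sketch executes, which is not how the semantics defines it:

#include <atomic>
#include <functional>

// Hypothetical stand-in for RMC's RMW(p, f): atomically replace *p with
// f(*p) and return the old value.
int rmw(std::atomic<int>& p, const std::function<int(int)>& f) {
    int old = p.load(std::memory_order_relaxed);
    while (!p.compare_exchange_weak(old, f(old), std::memory_order_relaxed)) {
        // compare_exchange_weak reloads `old` on failure.
    }
    return old;
}

// cmpxchg(p, ev, nv) = RMW(p, x. if x == ev then nv else x)
// Note the issues discussed above: in the model this always "writes"
// (possibly writing back the old value), and for a weak cmpxchg the old
// value alone cannot tell you whether a spurious failure occurred.
bool strong_cmpxchg(std::atomic<int>& p, int ev, int nv) {
    int old = rmw(p, [=](int x) { return x == ev ? nv : x; });
    return old == ev;
}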
---
DONE *** Add some factor so that one lwsync is better than two, even if the
cost is the same. Reduces code size, maybe inhibits optimizations
less in certain cases?
*** Maybe we should be more directly merging
things when all the edges out of a node are cut.
---
We /do/ need to do value obfuscation or some such on data-dep
things. I suspected this but was hoping it wasn't true. It is going to
be annoying and I worry it will hurt optimizations noticeably.
I /think/ that we can maybe get away with one obfuscation per chain,
at the source value?
That's not true: we could get screwed on a comparison from a derived
value.
* Maybe obfuscate every time a value in the chain has multiple uses.
* Maybe do some checking of how the values are used? Seems
tricky. May need to follow far.
These obfuscating copies /will/ actually cause extra register copies,
etc, I think. Yeah, they definitely do!
*** Could we run another pass that removes all of our obfuscating
copies? Or do they keep the backend stuff honest as well?
It does seem necessary, at least to keep the POWER backend on -O3
honest.
Aha! It is not done on a post-IR level, though! The POWER backend
on -O3 runs some extra IR level optimizations.
* hideAllUses - well, only need to hide N-1, really
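One standard way to implement an obfuscating copy is an empty inline asm
that launders the value through a register; this is a generic trick and
only a guess at the shape of what the pass emits:

#include <cstdint>

// Generic "opaque copy": the empty asm claims to read and write v, so the
// optimizer must treat the result as a fresh value it can't see through.
// This is the kind of obfuscating copy discussed above, and it does tend
// to cost an extra register move.
static inline std::uintptr_t hide(std::uintptr_t v) {
    asm volatile("" : "+r"(v));
    return v;
}

// Example use: keep an address dependency from a load alive through to
// the dependent access, so it can't be broken by value speculation/CSE.
int load_via_dep(int* const* pp) {
    int* p = *pp;  // source load of the dependency chain
    p = reinterpret_cast<int*>(hide(reinterpret_cast<std::uintptr_t>(p)));
    return *p;     // dependent load
}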
---
*** Need to think more about the atomic/non-atomic distinction and
optimizations. *Not* everything that is labeled needs to be atomic
(e.g. reads from an RCU-protected pointer, the data in a standard MP
example) and some things that need to be atomic don't need to be
labeled (esp. when using pre/post). Maybe the thing to do is to
imitate C11 more and have rmc_store/rmc_load, maybe even _Rmc/Rmc<T>?
* Done
------
We should think about ARMv8's acquire/release operations.
Acquire might actually be C++ acquire, basically.
Release is way stronger than C++ release and also even C++ SC:
a released write becomes globally visible; kind of an integrated
write-push, I suppose. x86 also has integrated write-pushes
(xchg or even just mov; MFENCE, since all writes are globally
visible).
-
It seems like dmb ld probably can't be used as a visibility edge from
R->W, though the situation is complicated!
-----
Running optimizations early can cause us some trouble, since inlining
might cause collision between label names, which could cause bad
codegen. Maybe we need a uniquifying pass that runs early or
something... but that's not great.
Actually, wait. What is the relationship between inlining and cross
function call edges? I think this can cause us problems:
void f() { L(a, FOO); VEDGE(a, a); }
void g() { f(); f_caller(); f(); }
could compile to
void f() { FOO; dmb; }
void g() { FOO; f_caller(); dmb; FOO; dmb; }
which is wrong.
... maybe make __attribute__((noinline)) the way we indicate that we
want outside binding...
RMC_BIND_OUTSIDE
* DONE
* UNDONE
*** Maybe in some cases we can then remove the noinline after the rmc
pass.
Actually, inlining isn't that much of a problem necessarily. We set
noduplicate on the rmc functions, which prevents inlining unless there
is exactly one local callsite and it can be inlined there. So we can
get rid of the noinline thing and make BIND_OUTSIDE the default...
* DONE
- XXX: do we need to go mark some of the functions noinline?
-
Maybe some of these problems can be addressed by explicit declaration
of labels (and then using the labels' *addresses* as the tags for edge
registration)...
Binding outside could be represented by globally declared labels,
although that will pose plenty of its own problems.
Then maaaaybe we can pass labels to functions.
Or maybe we *don't* do the label passing thing but instead, when we
want that sort of thing, we make a global label part of the API and
other functions declare edges to it.
- We say it only works inside a compilation unit
- We ditch the noinline hack and do the bind outside with globals
- When we look at global labels, we need to find all the functions
they are used in
- XXX: /might/ be fucked
- I think the above things are bad ideas
- We settled on allowing scoped *edge* binding, not label binding
(which matches the theory better)
- And this thing can work across files
*** support edges to/from "argument pass"/"function return"
- Can model by passing labels to functions
- RMC doesn't directly support passing labels but it is easy to do by
passing "unit susp"s instead:
- reify-label l = susp (l # Nop)
- relabel rl l' = l' # force rl
* DONE
-----
Wait again. What is the relationship between /recursion/ and cross
call edges? Consider:
int f(int x, int y) {
VEDGE(a, b);
if (x) { L(a, ...); }
if (something) { f(0, 1); }
if (y) { L(b, ...); }
}
Honestly, most reasonable compilation things seem like they'll just
work fine, but...
I guess we could add edges from all function calls that could possibly
invoke us recursively... but that sounds annoying.
* I think for now we will not support this???
* XXX: I HAVE NOTES FOR A PLAN ABOUT THIS SOMEWHERE
-----
*** Treat CAS operations as having a built-in control dependency??
*** Could cut down on how many dummy copies we have when doing control
dep stuff. Could LLVM merge two relaxed loads? That could cause us
trouble. (But mark it volatile?)
*** But actually maybe we need *more* copies. Maybe we need to
hideAllUses of the load to avoid other data getting pulled in...
-----
We could actually use regular store/load notation for RMC atomics in C
*if* we are willing to rewrite all the seq_cst atomics in a module to
monotonic... We could maybe turn it on per module?
(We can probably turn things on conditionally by declaring magic
extern functions that the pass detects...)
-------
C++ seq cst fences:
* early drafts said that fences synchronized-with the next fence in
the SC order
* change in language intended to *strengthen* SC fences, it seems?
well, it also weakens them
* The reason that SC fences can't be used to recover SC behavior is
that they don't push out everything they have seen. In acq/rel
IRIW (sketched below), it is exactly because the fences don't
synchronize-with each other that we don't get SC
* Even if we added the synchronizes-with thing, though, I think
relaxed IRIW would not be SC, since even if we observed the write
we aren't synchronizing with it.
* ACTUALLY: With the synchronizes-with thing, I think IRIW where
the readers are relaxed would be ruled out and the writers are
release, since the fence would handle the visibility stuff.
IN GENERAL, WE HAVE STRONGER TRANSITIVE VISIBILITY PROPERTIES.
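For reference, the IRIW shape under discussion, as a sketch: release
writers, acquire readers, with a seq_cst fence between the reads; the
interesting non-SC outcome is r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void w1() { x.store(1, std::memory_order_release); }
void w2() { y.store(1, std::memory_order_release); }

void reader1() {
    r1 = x.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_acquire);
}

void reader2() {
    r3 = y.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_acquire);
}

// The outcome above means the readers disagree on the order of the two
// independent writes. Per the discussion above, C++11's SC-fence rules
// don't rule this out when the writes aren't seq_cst, because the fences
// don't synchronize-with each other.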
--------
Optimizations that can break SC (partial list, obvs):
* CSE (r1 = *p; r2 = *q; r3 = *p; -- CSEing *p)
* Loop hoisting
* Dead store elimination (*p = 1; *q = 2; *p = 3; -- eliminating first store)
* Rematerializing loads
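As a sketch of the CSE item (assuming another thread may store to *p
between the two loads):

// CSE folding *p: r3 becomes a copy of r1, so a store to *p by another
// thread between the two loads can no longer be observed here, which can
// break SC reasoning about interleavings.
int cse_example(int* p, int* q) {
    int r1 = *p;
    int r2 = *q;
    int r3 = *p;   // after CSE: int r3 = r1;
    return r1 + r2 + r3;
}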
-----
I want a better story for using atomics in an SC way.
The problem is that putting pushes in is stronger than we need and
requires worse code. x86 shouldn't need to MFENCE on SC reads, ARMv8
should be able to use its new instructions.
Originally I had thought that I could manage something like this by
adding a notion of "push edges" which could be optimized based on the
source/sink (so on x86, only a R->W push edge would require an
MFENCE). This is complicated by transitivity issues, but might still
be workable. (NO: probably not workable)
New idea: can we *model* them as either:
* lock; memory access; unlock
* RMW (with vpost after?) - bet not - see "read don't modify write"
- skeptical of the first also - same reason I think
- THIS DOES NOT WORK AT ALL :( :( :( same reason
-
Actual plan is mostly to just add a new sort of "sc" operation with
appropriate coherence rules.
- The example for why pushes don't work:
T1: x = 1 -push-> r1 = y
T2: y = 1 -push-> r2 = x
Assuming that y is our SC location, this is stronger than we want. If
r1 = 0, then the T1 push must be before the T2 push (or else we
couldn't do that read), which means we must have r2 = 1 (so the
r1 = r2 = 0 outcome is forbidden). This is stronger than just SC on SC
locations would imply.
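The store-buffering shape in that example, rendered as a sketch (the
pushes live only in the comments; the outcome of interest is
r1 == 0 && r2 == 0):

#include <atomic>

std::atomic<int> x{0}, y{0};   // y is the "SC location" in the discussion
int r1, r2;

void t1() {
    x.store(1, std::memory_order_relaxed);
    // x = 1 -push-> r1 = y: the push makes x = 1 globally visible
    // before the read of y executes.
    r1 = y.load(std::memory_order_relaxed);
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    // y = 1 -push-> r2 = x
    r2 = x.load(std::memory_order_relaxed);
}

// With both pushes, r1 == 0 && r2 == 0 is forbidden (classic SB).
// "SC only on y" would not forbid it, which is why compiling SC accesses
// as pushes is stronger than intended.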
--------
We are actually a little weaker than mo_relaxed:
- We don't support the CoRR property (when we plug po in for hb),
which says that program-order reads from the same location respect
coherence order
- But we /do/ support the CoRW rule through the kind of weird
RWP-PO/WWP-READ rules...
There is a nice priority based model of the system in which we have
one sort of priority, which is (po|conflicting u vt|conflicting)+,
where |conflicting means "restricted to conflicting accesses". If we
change it to be (po|sameloc u vt|sameloc)+, then we support CoRR.
I think we should just add the CoRR rule.
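For reference, the CoRR shape (a standard coherence litmus test, sketched
here):

#include <atomic>

std::atomic<int> x{0};
int r1, r2;

void writer() { x.store(1, std::memory_order_relaxed); }

void reader() {
    r1 = x.load(std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
    // CoRR forbids r1 == 1 && r2 == 0: two program-order reads of the
    // same location must respect coherence order. Per the note above,
    // plugging po in for hb does not give this in base RMC, hence the
    // proposal to add the rule.
}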
Joe suggests that maybe we could, as an optimization, compile
R -xo-> R edges using the same location to just take advantage of the
hardware's coherence order. I *think* this might be workable;
definitely it doesn't work to establish any edges via transitivity,
but it might be fine.
The optimization would be tricky, though, since it is important to
make sure that the location is actually the same, even though it could
vary over loops and stuff.
I think it's not worth it.
-
OK, I have at least one example of a case where it could matter:
bool somebody_lock(mutex *lock) {
XEDGE(lock, post);
while (lock->locked) {}
return L(lock, lock->locked.compare_and_swap(0, 1));
}
Then you could have a case where, whenever threads take the lock,
they check some flag and maybe do something. Then, some thread T1
might set the flag and wake up T2, and T2 wants to make sure to do that
processing. So T2 does somebody_lock: T1's lock of the mutex is vt to
it, so if the while loop observes an unlock, it is because somebody
unlocked it. So if we have RR, then the only way that the function can
fail is if some other thread actually *has* taken the lock again. So
then we can know, for example, that once the lock *is* actually taken,
the work will have already been done.
I'm not sure how practical this is.
-------
I think the reason isb might be slow is because it is really
more for changing the instruction stream?
---
How to intuitively model execution order on POWER/ARM? Maybe:
* A write has "executed" when it is committed
* A read has "executed" when it has satisfied the read
using the write it will commit with??
-------
Look into whether this is going to wreck me:
11:06 < bluss> llvm will still do optimizatoin across functions even when
forbidden to inline
11:06 < sully> wait really
11:06 < sully> bluss: do you have an example of what it can do?
11:07 < Quxxy> sully: I know it'll specialise even inline(never) functions
11:07 < bluss> sully: specialize the noinline function so that it only handles
the cases it's actually called with
- I stopped using noinline so should be fine...
-------
See drc_mutex_lock/drc_mutex_unlock in lock.cpp for a discussion of
how to satisfy the drf execution requirement.
-------
I think it is Right and Proper to have a different definition of data
race for proving SC and defining HCF for plain accesses.
First, note that the execution post edge is really important in the SC
proof. If we just make it VT, the theorem is super false. Consider:
T1: x.xchg(1); r1 = y.xadd(0);
T2: y.xchg(1); r2 = x.xadd(0);
This does not have data races by the VT definition but it does have
data races if you require xo-post edges. That's right, because it
admits the non-SC behavior. The things can just get reordered.
Adding execution edges prevents that (with straightforward trace order
reasoning).
-
Consider another variant on store buffering. In this example, x and y
are plain accesses and the mutexes are bs mutexes that need edges
drawn explicitly.
T1:
lock xl; -x-> x = 1; -v-> unlock xl;
lock yl; -x-> r1 = y; -v-> unlock yl;
T2:
lock yl; -x-> y = 1; -v-> unlock yl;
lock xl; -x-> r2 = x; -v-> unlock xl;
Here, all of the accesses to plain variables are protected by locks
and have appropriate constraints so that they can't be accessed
"concurrently" and will see anything done in a previous section of
that lock.
-
And another, that is important:
T1:
setup listA -v-> *post
pa = listA
setup listB -v-> *post
pb = listB
T2:
rb = listB -x-> uses of rb
ra = listA -x-> uses of ra
Here, this allows rb=listB, ra=initA, a non-SC behavior.
But that is totally fine!
The values you get when reading out of the list will still be totally
defined by which writes the ra and rb reads read-from.
-
I think I will call programs that satisfy the synchronized-before,
SC-enforcing definition "strongly data-race-free" and programs that
satisfy the visible-to based definition "weakly data-race-free".
-----------------------------------
On integrity with -p 2 -c 2, at least:
The fastest ms_queue I could produce has redundant barriers.
1. compiler does not actually produce the barriers I expect it to; I
   expected things to be sunk down more.
   - WAIT, false?
   - Totally false - the get_tail -xo-> get_next edge needs that lwsync
     in order to order between loops
2. the thing I think it ought to have emitted was slightly slower than
   the best
   - actually no it is bimodally fucked also
!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!
The slow runs generally use waaaaay more memory.
Though the extra time *is* in userland. But CREAM, sooo.
-
the 'if (tail != this->tail_) continue;' check does seem to actually help
- some of the bimodality I've seen feels reaaaaaly fishy
- gets into "runs" of fast or slow things
-------------------------------------------
Notes on the RMC execution model:
In all models:
- Threads "initiate" (init) actions in order
- Some mechanism (differing between models) allows for multiple
actions per thread to be initiated (but unexecuted)
- Initiated actions can be executed out-of-order subject to "executable"
-- RMCv1
- Initiated actions replaced by an 'identifier'
- The RHS of a bind can be stepped if the LHS is initiated and the
bound variable isn't free in the "transaction"
- When an action is executed, its identifier is replaced by "ret v"
- A speculation rule allows free variables to be eliminated by
substituting in for it and adding an equality constraint to the
term.
- N.B: speculations are tracked in the *expr*
-- RMCv2
- When actions are initiated, they also have a value speculated
- Initiated actions are immediately replaced by the speculated value
- When an action is executed, the speculated value must match
- N.B: speculations are tracked in the *history*
- RMW mildly annoying: Need to know the speculated value *before* the
value being written is ready (since we need to compute it), so we
pre-speculate a value and track it in the term until we actually
speculate the RW
- N.B: In RMCv2, the thread semantics are *super* untethered from the
semantics of the store!
-- RMCv2.5
- Like RMCv2, but:
- Threads may nondeterministically "promise" to perform a write that
has not yet been initiated
- Needs to show an alternate trace where it initiates that write with
no undischarged read speculations
- For a write to execute, there must be an outstanding promise
- XXX: might be some problems: not executable? RWs wrong?
----------------------------------------------
Even less formal execution model:
- Actions may be executed in any order except as constrained by
execution order
- This includes when there is a data or a control dependency: reads
can have a value speculatively returned before the read is actually
executed
- BUT: A write may only be executed if the same write could have been
executed without any outstanding speculated reads.
----------------------------------------------
Compilation shortcomings:
- If we do something like this:
VEDGE(a, b); VEDGE(b, c);
L(a, ...);
if (...) L(b, rmc_noop());
L(c, ...);
The semantics only generate an "a -vo-> c" edge when b gets
executed. But our compilation strategy takes the transitive closure
of all the edges in a flow-insensitive way, which will result in
inserting an unconditional lwsync and not one in the branch.
- We don't use binding sites when doing sync/lwsync cuts. See
examples/crit_edge.c for a case where this comes up.
*** FIXED
---------------------------------------------------------------------
Push edges have some seemingly fiddly bits, but it is actually fine.
Given 'i1 -pu-> t' and 'i2 -pu-> t', the semantics will generate two
pushes, p1 and p2, with i1 -vo-> p1 and i2 -vo-> p2. But this is fine,
since one of the pushes will be to after, and so both i1/i2 will be vo
before that one.
---------------------------------------------------------------------
When busy-waiting on a flag on ARMv7, it actually seems to be *better*
in some cases to execute barriers in the loop:
In my qspinlock tests on ARMv7:
XEDGE(ready_wait, lock);
while (!L(ready_wait, me.ready)) delay();
is slower than:
XEDGE(ready_wait, lock);
XEDGE(ready_wait, ready_wait);
which is slower than:
XEDGE(ready_wait, post);
The fastest one will insert a barrier immediately after the read,
*before* the branch. The middle one, in this case, will put a barrier
inside the loop and then one well after it.
I've observed stuff like this with seqlocks also: busy-waiting stuff
is *faster* when you use a post edge.
---
Adding push edges between everything gives us SC.
Does adding visibility edges give us TSO!?
Nope! See 2+2W.
That's interesting and I think Brooks had something to say related to
this and his models.
--