-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
500k query size onwards starts to hang at ipa::protocol::sort::apply_sort #758
Comments
Tried to repro this on our AWS servers but unable to do so. Even 5M query successfully completed in 19hours! Next option is to keep 2 of our servers and 1 of helpers and see if we can repro it
|
Tried 10s of runs with our internal cluster but didnt repro |
Successful run: apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=11.3ms time.idle=10.9µs
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=306ms time.idle=7.38µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=27.1ms time.idle=712ms
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=27.6ms time.idle=132ms
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=66.0ms time.idle=844ms
apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: close time.busy=66.1ms time.idle=844ms
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=649ms time.idle=743ms
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=636ms time.idle=378ms
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=1.59s time.idle=1.12s
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: close time.busy=1.59s time.idle=1.12spply_sort{gate=5665}: ipa::protocol::sort::apply_sort: close time.busy=786s time.idle=528s
Unsuccessful run apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=5.53s time.idle=11.2µs
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=154s time.idle=8.02µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=320s time.idle=328s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=312s time.idle=200s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=786s time.idle=528s
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: close time.busy=786s time.idle=528s applying sort permutation to sidecar data gets stuck |
view from another helper apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=316s time.idle=50.4s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=14.9s time.idle=590s
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=314s time.idle=127s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=152s time.idle=14.1µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=781s time.idle=177s
looks like it gets stuck at the same step (second shuffle) |
H2 logs indicate something different happening on it apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=358s time.idle=8.78s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=181s time.idle=14.4µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=16.4s time.idle=588s
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=6.53s time.idle=24.7µs
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=378s time.idle=394s
a it is clearly pass the second shuffle, but understandably gets stuck at third shuffle (because other helpers did not finish the second step). Second shuffle does not require H2 to communicate, it explains why it finishes it while others don't |
our updated logging indicates that it stalls in reshare to H2 phase 2023-08-02T17:04:22.239807Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-02T17:04:22.239835Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:04:22.239847Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-02T17:04:25.926799Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: close time.busy=3.69s time.idle=11.0µs
2023-08-02T17:04:25.927999Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=3.69s time.idle=9.74µs
2023-08-02T17:04:25.928022Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:04:25.941257Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-02T17:04:25.960223Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-02T17:04:25.960240Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:04:25.960253Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-02T17:06:18.058877Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: close time.busy=112s time.idle=9.15µs
2023-08-02T17:06:18.102140Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=112s time.idle=12.3µs
2023-08-02T17:06:18.102178Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:06:18.116398Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-02T17:16:27.777301Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: close time.busy=230s time.idle=380s
2023-08-02T17:16:27.802013Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=230s time.idle=380s
2023-08-02T17:16:27.802048Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:16:27.816262Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}:reshare{to=H3}: ipa::protocol::basics::reshare: new
2023-08-02T17:21:13.043025Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}:reshare{to=H3}: ipa::protocol::basics::reshare: close time.busy=222s time.idle=63.1s
2023-08-02T17:21:13.140881Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=222s time.idle=63.1s
2023-08-02T17:21:13.155081Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: close time.busy=564s time.idle=443s |
not sure if it is related, but I see connection reset on the RC side
|
We had another stalled run, also accompanied with a connection reset on the client (RC) side. Confirmed that stall occurs at reshare to H2 (not Zzure and not AWS machine) Here is how successful (1000 inputs) run looks like in H1 logs. I added the following code to trace reshare let right_step = ctx.gate().as_ref().contains("after_convert_all_bits");
ctx.try_join(
zip(repeat(ctx.set_total_records(self.len())), self.iter())
.enumerate()
.map(|(i, (c, x))| async move {
let r = x.reshare(c, RecordId::from(i), to_helper).await;
if to_helper == Role::H2 && right_step {
match &r {
Ok(_) => tracing::info!("reshare({i}) succeeded"),
Err(e) => tracing::error!("reshare({i}) failed: {e:?}"),
}
};
r
}),
)
.await 2023-08-03T10:08:39.204196Z INFO ipa_query{sz=1000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-03T10:08:39.623487Z INFO ipa_query{sz=1000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: reshare(0) succeeded
2023-08-03T10:08:39.623515Z INFO ipa_query{sz=1000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: reshare(1) succeeded
2023-08-03T10:08:39.623523Z INFO ipa_query{sz=1000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: reshare(2) succeeded for 400k run it stalls 23-08-03T13:17:38.030666Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-03T13:17:38.030696Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-03T13:17:38.030713Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-03T13:17:41.794800Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: close time.busy=3.76s time.idle=11.1µs
2023-08-03T13:17:41.796031Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=3.77s time.idle=15.2µs
2023-08-03T13:17:41.796057Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-03T13:17:41.809185Z INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-03T13:17:41.828201Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-03T13:17:41.828217Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-03T13:17:41.828228Z INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-03T13:19:07.316767Z WARN ipa::helpers::gateway: [timeout]: waiting to receive messages: |
We found the root cause - Azure Firewall is configured to drop inactive TCP sessions after 4 mins. from: https://learn.microsoft.com/en-us/azure/firewall/long-running-sessions
This explains why this issue is reproducible with at least one Azure VM and on the large inputs - there is a higher chance for helpers to open another connection (concurrent shuffle for sorting binary and arithmetic shares) and not push data through it for long time. #742 contributes to this problem too. There are a couple of potential mitigations I see
@richajaindce just completed a 400k run with |
See private-attribution#758 for details
Good find. For reference, our QUIC client stack sends a keep-alive at the QUIC layer whenever a request is outstanding. That is for this exact reason. Firewalls and NATs. SO_KEEPALIVE is similar in function, but crude. SO_KEEPALIVE can't tell if you have an outstanding request or not. You could end up with zombie connections if you aren't careful, which is why people generally don't use it. |
See private-attribution#758 for details
See private-attribution#758 for details
Synced up with @martinthomson - HTTP2 does have keep alive mechanism and it is better suited for our needs. I remain wary about migrating Hyper to Tokio runtime as stalls on the reactor will lead to missed pings and connection drop on firewalls. However it shouldn't be the case when timeout is set to be sufficiently large. |
for the record, it is not just Azure behavior, AWS has similar timeout defined as 350 seconds for their implementation of NAT
That could explain RST packets I am seeing in tcp dump. Interestingly while Azure FAQ claims RST packets sent to both sides, AWS only sends it to the instance behind the NAT
I am still suspecting Azure to be the culprit because AWS -> AWS -> Linode configuration worked fine. However it could be possible that having Azure instance makes the whole ring a lot slower, so we are actually timing out on AWS side |
https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/ Article mentions pretty much our use case
|
We also tried with aws, linode and bare metal machine but we could finish a 1M run even when it took 9 hours. Not sure if the ring being slower should be an issue |
This article resonates so well with exactly what we saw. Azure keeps a stateful NAT table with public IPs for no great reasons.. lol |
we saw a few more failed runs with 1M records and SO_KEEPALIVE enabled and set to 30 seconds. I suspected that keep-alive packets are causing issues now because they are not acknowledged in time. Increasing the interval to 3 mins let us finish the 1M record run, but it does feel like an uphill battle now. Until we significantly improve IPA performance, issues like this will keep arising. Another thought - I looked at tcp dump and it turns out Hyper is opening 10-20 connections between each pair of helpers. With poor throughput IPA has currently, this seems redundant and increases our chances for observing connection resets. Maybe we should limit number of connections per host to a single digit number |
Closing this for now |
2023-07-11T09:55:27.225980Z INFO ipa_query{query_config=IpaQueryConfig { per_user_credit_cap: 5, max_breakdown_key: 20, attribution_window_seconds: None, num_multi_bits: 3, plaintext_match_keys: false } query_size=500000}:apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new
This consistently stalled at the same point for 2 runs so maybe this is a pattern.
We didn't see this issue before July so suspecting it might be due to new changes.
The text was updated successfully, but these errors were encountered: