Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

500k query size onwards starts to hang at ipa::protocol::sort::apply_sort #758

Closed
richajaindce opened this issue Jul 17, 2023 · 17 comments
Closed

Comments

@richajaindce
Copy link
Contributor

2023-07-11T09:55:27.225980Z INFO ipa_query{query_config=IpaQueryConfig { per_user_credit_cap: 5, max_breakdown_key: 20, attribution_window_seconds: None, num_multi_bits: 3, plaintext_match_keys: false } query_size=500000}:apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new

This consistently stalled at the same point for 2 runs so maybe this is a pattern.

We didn't see this issue before July so suspecting it might be due to new changes.

@richajaindce
Copy link
Contributor Author

richajaindce commented Jul 17, 2023

Tried to repro this on our AWS servers but unable to do so. Even 5M query successfully completed in 19hours! Next option is to keep 2 of our servers and 1 of helpers and see if we can repro it

  1. 500k ran succeeded. Trying 100k and then will try 1M

@richajaindce richajaindce changed the title 500k query size onwards starts to hang 500k query size onwards starts to hang at ipa::protocol::sort::apply_sort Jul 17, 2023
@richajaindce
Copy link
Contributor Author

richajaindce commented Jul 27, 2023

Tried 10s of runs with our internal cluster but didnt repro
Tried with DTC, Akamai and Meta for 500k, 1M but it passes too.
Now will retry with real helpers and also with Meta, Akamai and Cybernetica

@akoshelev
Copy link
Collaborator

Successful run:

apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=11.3ms time.idle=10.9µs
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=306ms time.idle=7.38µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=27.1ms time.idle=712ms
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=27.6ms time.idle=132ms
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=66.0ms time.idle=844ms
apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: close time.busy=66.1ms time.idle=844ms
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=649ms time.idle=743ms
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=636ms time.idle=378ms
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=1.59s time.idle=1.12s
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: close time.busy=1.59s time.idle=1.12spply_sort{gate=5665}: ipa::protocol::sort::apply_sort: close time.busy=786s time.idle=528s

Unsuccessful run

apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=5.53s time.idle=11.2µs
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=154s time.idle=8.02µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=320s time.idle=328s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=312s time.idle=200s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=786s time.idle=528s
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: close time.busy=786s time.idle=528s

applying sort permutation to sidecar data gets stuck

@akoshelev
Copy link
Collaborator

view from another helper

apply_sort{gate=5453}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}: ipa::protocol::sort::apply_sort: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=316s time.idle=50.4s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=14.9s time.idle=590s
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=314s time.idle=127s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=152s time.idle=14.1µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=781s time.idle=177s

looks like it gets stuck at the same step (second shuffle)

@akoshelev
Copy link
Collaborator

akoshelev commented Aug 1, 2023

H2 logs indicate something different happening on it

apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=358s time.idle=8.78s
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=181s time.idle=14.4µs
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=16.4s time.idle=588s
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=6.53s time.idle=24.7µs
apply_sort{gate=5453}:shuffle_shares{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: new
apply_sort{gate=5665}:shuffle_shares{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}:shuffle_once{gate=binary_validator/malicious_protocol/apply_sort_permutation/shuffle_inputs}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=378s time.idle=394s
a

it is clearly pass the second shuffle, but understandably gets stuck at third shuffle (because other helpers did not finish the second step).

Second shuffle does not require H2 to communicate, it explains why it finishes it while others don't

@akoshelev
Copy link
Collaborator

our updated logging indicates that it stalls in reshare to H2 phase

2023-08-02T17:04:22.239807Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-02T17:04:22.239835Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:04:22.239847Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-02T17:04:25.926799Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: close time.busy=3.69s time.idle=11.0µs
2023-08-02T17:04:25.927999Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=3.69s time.idle=9.74µs
2023-08-02T17:04:25.928022Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:04:25.941257Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-02T17:04:25.960223Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-02T17:04:25.960240Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:04:25.960253Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-02T17:06:18.058877Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: close time.busy=112s time.idle=9.15µs
2023-08-02T17:06:18.102140Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=112s time.idle=12.3µs
2023-08-02T17:06:18.102178Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:06:18.116398Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-02T17:16:27.777301Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: close time.busy=230s time.idle=380s
2023-08-02T17:16:27.802013Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=230s time.idle=380s
2023-08-02T17:16:27.802048Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-02T17:16:27.816262Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}:reshare{to=H3}: ipa::protocol::basics::reshare: new
2023-08-02T17:21:13.043025Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}:reshare{to=H3}: ipa::protocol::basics::reshare: close time.busy=222s time.idle=63.1s
2023-08-02T17:21:13.140881Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H3}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=222s time.idle=63.1s
2023-08-02T17:21:13.155081Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: close time.busy=564s time.idle=443s

@akoshelev
Copy link
Collaborator

not sure if it is related, but I see connection reset on the RC side

     Running `target/debug/report_collector --network /home/ec2-user/SyntheticTestConfigs/0208Network.toml --input-file /home/ec2-user/SyntheticTestConfigs/events/events-400000.csv --output-file output.txt malicious-ipa --max-breakdown-key 20`
2023-08-02T14:01:17.152310Z  INFO ipa::cli::verbosity: Logging setup at level info
2023-08-02T14:18:35.662657Z  INFO ipa::cli::playbook::ipa: Starting query after finishing encryption
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: HyperPassthrough(hyper::Error(Io, Kind(ConnectionReset)))', /home/ec2-user/workspace/alex-ipa/src/cli/playbook/ipa.rs:122:14

@akoshelev
Copy link
Collaborator

We had another stalled run, also accompanied with a connection reset on the client (RC) side. Confirmed that stall occurs at reshare to H2 (not Zzure and not AWS machine)

Here is how successful (1000 inputs) run looks like in H1 logs. I added the following code to trace reshare

        let right_step = ctx.gate().as_ref().contains("after_convert_all_bits");
        ctx.try_join(
            zip(repeat(ctx.set_total_records(self.len())), self.iter())
                .enumerate()
                .map(|(i, (c, x))| async move {
                    let r = x.reshare(c, RecordId::from(i), to_helper).await;
                    if to_helper == Role::H2 && right_step {
                        match &r {
                            Ok(_) => tracing::info!("reshare({i}) succeeded"),
                            Err(e) => tracing::error!("reshare({i}) failed: {e:?}"),
                        }
                    };

                    r
                }),
        )
        .await
2023-08-03T10:08:39.204196Z  INFO ipa_query{sz=1000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-03T10:08:39.623487Z  INFO ipa_query{sz=1000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: reshare(0) succeeded
2023-08-03T10:08:39.623515Z  INFO ipa_query{sz=1000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: reshare(1) succeeded
2023-08-03T10:08:39.623523Z  INFO ipa_query{sz=1000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: reshare(2) succeeded

for 400k run it stalls

23-08-03T13:17:38.030666Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-03T13:17:38.030696Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-03T13:17:38.030713Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-03T13:17:41.794800Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: close time.busy=3.76s time.idle=11.1µs
2023-08-03T13:17:41.796031Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: close time.busy=3.77s time.idle=15.2µs
2023-08-03T13:17:41.796057Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-03T13:17:41.809185Z  INFO ipa_query{sz=400000}:apply_sort{gate=after_convert_all_bits/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H2}:reshare{to=H2}: ipa::protocol::basics::reshare: new
2023-08-03T13:17:41.828201Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}: ipa::protocol::sort::apply_sort: new
2023-08-03T13:17:41.828217Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}: ipa::protocol::sort::apply_sort::shuffle: new
2023-08-03T13:17:41.828228Z  INFO ipa_query{sz=400000}:apply_sort{gate=binary_validator/malicious_protocol/apply_sort_permutation}:shuffle_once{to=H1}:reshare{to=H1}: ipa::protocol::basics::reshare: new
2023-08-03T13:19:07.316767Z  WARN ipa::helpers::gateway: [timeout]: waiting to receive messages:

@akoshelev
Copy link
Collaborator

We found the root cause - Azure Firewall is configured to drop inactive TCP sessions after 4 mins.

from: https://learn.microsoft.com/en-us/azure/firewall/long-running-sessions

An idle timer is in place to recycle idle sessions. The default value is four minutes for east-west connections and can't be changed. Applications that maintain keepalives don't idle out.

This explains why this issue is reproducible with at least one Azure VM and on the large inputs - there is a higher chance for helpers to open another connection (concurrent shuffle for sorting binary and arithmetic shares) and not push data through it for long time. #742 contributes to this problem too.

There are a couple of potential mitigations I see

  • Limit number of connections per host. This will effectively enforce multiplexing of all communications through a single TCP/TLS session.
  • Enable SO_KEEPALIVE.

@richajaindce just completed a 400k run with SO_KEEPALIVE set to 30 secs, so the latter mitigation strategy works. Given #742 I am not 100% confident that IPA process won't sleep for more than 4 mins awaiting data from other helpers. So my plan is to enable keep-alive until IPA performance profile improves.

akoshelev added a commit to akoshelev/raw-ipa that referenced this issue Aug 9, 2023
@martinthomson
Copy link
Member

Good find. For reference, our QUIC client stack sends a keep-alive at the QUIC layer whenever a request is outstanding. That is for this exact reason. Firewalls and NATs. SO_KEEPALIVE is similar in function, but crude. SO_KEEPALIVE can't tell if you have an outstanding request or not. You could end up with zombie connections if you aren't careful, which is why people generally don't use it.

akoshelev added a commit to akoshelev/raw-ipa that referenced this issue Aug 9, 2023
akoshelev added a commit to akoshelev/raw-ipa that referenced this issue Aug 9, 2023
@akoshelev
Copy link
Collaborator

Synced up with @martinthomson - HTTP2 does have keep alive mechanism and it is better suited for our needs. I remain wary about migrating Hyper to Tokio runtime as stalls on the reactor will lead to missed pings and connection drop on firewalls. However it shouldn't be the case when timeout is set to be sufficiently large.

@akoshelev
Copy link
Collaborator

for the record, it is not just Azure behavior, AWS has similar timeout defined as 350 seconds for their implementation of NAT

That could explain RST packets I am seeing in tcp dump. Interestingly while Azure FAQ claims RST packets sent to both sides, AWS only sends it to the instance behind the NAT

When a connection has an Idle Timeout (four minutes of no activity), Azure Firewall gracefully terminates the connection by sending a TCP RST packet.

When this happens with NAT Gateways, NLBs, and interface VPC Endpoints, the transmitting end of the connection receives a TCP RST packet back from the appliance which timed out. However, the remote end won’t receive any TCP packets. Therefore, it still may consider the connection to be in an established state, even though it’s no longer usable, until it also times out.

I am still suspecting Azure to be the culprit because AWS -> AWS -> Linode configuration worked fine. However it could be possible that having Azure instance makes the whole ring a lot slower, so we are actually timing out on AWS side

@akoshelev
Copy link
Collaborator

https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/

Article mentions pretty much our use case

  • Azure instance has public IP but still uses SNAT for no obvious reason
  • Packets are blackholed, instead of explicit RST

@richajaindce
Copy link
Contributor Author

for the record, it is not just Azure behavior, AWS has similar timeout defined as 350 seconds for their implementation of NAT

That could explain RST packets I am seeing in tcp dump. Interestingly while Azure FAQ claims RST packets sent to both sides, AWS only sends it to the instance behind the NAT

When a connection has an Idle Timeout (four minutes of no activity), Azure Firewall gracefully terminates the connection by sending a TCP RST packet.

When this happens with NAT Gateways, NLBs, and interface VPC Endpoints, the transmitting end of the connection receives a TCP RST packet back from the appliance which timed out. However, the remote end won’t receive any TCP packets. Therefore, it still may consider the connection to be in an established state, even though it’s no longer usable, until it also times out.

I am still suspecting Azure to be the culprit because AWS -> AWS -> Linode configuration worked fine. However it could be possible that having Azure instance makes the whole ring a lot slower, so we are actually timing out on AWS side

We also tried with aws, linode and bare metal machine but we could finish a 1M run even when it took 9 hours. Not sure if the ring being slower should be an issue

@richajaindce
Copy link
Contributor Author

https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/

Article mentions pretty much our use case

  • Azure instance has public IP but still uses SNAT for no obvious reason
  • Packets are blackholed, instead of explicit RST

This article resonates so well with exactly what we saw. Azure keeps a stateful NAT table with public IPs for no great reasons.. lol

@akoshelev
Copy link
Collaborator

we saw a few more failed runs with 1M records and SO_KEEPALIVE enabled and set to 30 seconds. I suspected that keep-alive packets are causing issues now because they are not acknowledged in time. Increasing the interval to 3 mins let us finish the 1M record run, but it does feel like an uphill battle now. Until we significantly improve IPA performance, issues like this will keep arising.

Another thought - I looked at tcp dump and it turns out Hyper is opening 10-20 connections between each pair of helpers. With poor throughput IPA has currently, this seems redundant and increases our chances for observing connection resets. Maybe we should limit number of connections per host to a single digit number

@richajaindce
Copy link
Contributor Author

Closing this for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants