Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix concurrency issues #993

Merged
merged 16 commits into from
Nov 25, 2024
Merged

Fix concurrency issues #993

merged 16 commits into from
Nov 25, 2024

Conversation

markspanbroek
Copy link
Member

Switches nim-ethers to http subscriptions instead of websocket subscriptions. This changes the timing of subscriptions and unearths some concurrency issues. This PR fixes those issues.

Includes:

  • fixes to nim-ethers: Fix http subscription polling nim-ethers#94
  • transactions are confirmed with .confirm(1), previously we used .confirm(0), which didn't work
  • many fixes to tests that had hidden assumptions about timing

This is split off from the work in #988

To work around this issue when subscriptions are
inactive for more than 5 minutes:
NomicFoundation/hardhat#2053

Use 100 millisecond polling; default polling interval
of 4 seconds is too close to the 5 second timeout for
`check eventually`.
confirm(0) doesn't wait at all, confirm(1) waits
for the transaction to be mined
includes fixes for http polling and .confirm()
allow for a bit more time to withdraw funds
to ensure that the transaction has been processed
before continuing with the test
there were two logic errors in this test:
- a slot is freed anyway at the end of the contract
- when starting the request takes a long time, the
  first slot can already be freed because there were
  too many missing proofs
currentTime() doesn't always correctly reflect
the time of the next transaction
otherwise the windows runner in the CI won't
be able to start the request before it expires
allow for a bit more time for a request to
be submitted
@markspanbroek markspanbroek force-pushed the fix-concurrency-issues branch 3 times, most recently from d239164 to ec962a1 Compare November 14, 2024 06:54
windows ci is so slow, it can take up to 40 seconds
just to submit a storage request to hardhat
reason: with the increased period length of 90 seconds, it
can take longer to wait for a stable challenge at the
beginning of a period.
apparently it takes windows 2-3 seconds to
resolve "localhost" to 127.0.0.1 for every
json-rpc connection that we make 🤦
Copy link
Contributor

@benbierens benbierens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm expecting this to make our tests even slower. But it does seem correct.

@veaceslavdoina
Copy link
Contributor

I've performed some tests to check if slowness on Windows can be related to IPv6.

Results

Branch PRC Endpoint Run status make testAll IPv6
fix-concurrency-issues 127.0.0.1 66m21.431s 🟢
localhost 149m10.858s 🟢
localhost 153m0.365s 🟢
localhost 152m45.421s 🟢
localhost 57m50.468s 🔵
localhost 58m8.511s 🔵
localhost 58m4.198s 🔵
127.0.0.1 56m48.298s 🔵

🟢 - IPv6 is enabled or primary
🔵 - IPv6 is disabled or secondary

details

Server

Cloud Region Instance
Hetzner Germany, Nuremberg CPX41 - 8 vCPU, 16 GB RAM, 240 GB SSD

Commands

# Clone
git clone https://github.com/codex-storage/nim-codex.git repos/nim-codex && cd repos/nim-codex

# Branch
git switch -C fix-concurrency-issues origin/fix-concurrency-issues

# Update
time make -j 8 update # 7m55.096s

# Build
time make -j 8 # 8m36.045s

# Tests
time make -j8 testAll # 66m21.431s

# --> localhost
find tests -type f -name "*.nim" -exec sed -i 's/127.0.0.1:8545/localhost:8545/g' {} \;
grep -r 'localhost:8545' tests

# --> 127.0.0.1
find tests -type f -name "*.nim" -exec sed -i 's/localhost:8545/127.0.0.1:8545/g' {} \;
grep -r '127.0.0.1:8545' tests

Disable IPv6

Guidance for configuring IPv6 in Windows for advanced users

:: Prefer IPv4 over IPv6 in prefix policies
netsh interface ipv6 show prefixpolicies
netsh interface ipv6 set prefix ::/96 60 3
netsh interface ipv6 set prefix ::ffff:0:0/96 55 4

:: Permanently
:: Prefer IPv4 over IPv6
:: reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /t REG_DWORD /d 0x20 /f

:: Disable IPv6
:: reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /t REG_DWORD /d 0xFF /f

:: restart
:: shutdown -r -t 0

:: reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents
:: check
ping localhost -n 1

:: Pinging WIN-IP4MNH4JT01 [127.0.0.1] with 32 bytes of data:
:: Reply from 127.0.0.1: bytes=32 time<1ms TTL=128

Enable IPv6

:: Prefer IPv6 over IPv4 in prefix policies
netsh interface ipv6 show prefixpolicies
netsh interface ipv6 set prefix ::/96 1 3 store=persistent
netsh interface ipv6 set prefix ::ffff:0:0/96 35 4 store=persistent

:: delete the key
:: reg delete HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /f

:: restart
:: shutdown -r -t 0

:: reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents
:: check
ping localhost -n 1

:: Pinging WIN-IP4MNH4JT01 [::1] with 32 bytes of data:
:: Reply from ::1: time<1ms

RPC request time

# localhost - 0m0.326s
time curl -m 5 -X POST \
  -s http://localhost:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'

# 127.0.0.1 - 0m0.116s
time curl -m 5 -X POST \
  -s http://127.0.0.1:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'

We can fix that for CI, but it will require to update the docs and perform one more step, locally as well.

      - name: Prefer IPv4 over IPv6 in prefix policies
        if: matrix.os == 'windows'
        run: |
          netsh interface ipv6 set prefix ::/96 60 3
          netsh interface ipv6 set prefix ::ffff:0:0/96 55 4

@markspanbroek
Copy link
Member Author

I've performed some tests to check if slowness on Windows can be related to IPv6.

Thanks @veaceslavdoina, interesting to see that it is indeed IPv6 related, and not just DNS.

We can fix that for CI, but it will require to update the docs and perform one more step, locally as well.

I prefer to keep the windows environment on the CI as it is, because it reflects how a windows machine works out of the box. To ensure that users of Codex do not run into this issue, I think I'd rather update the docs and our CLI defaults to replace localhost with 127.0.0.1 everywhere.

@emizzle
Copy link
Contributor

emizzle commented Nov 21, 2024

I prefer to keep the windows environment on the CI as it is, because it reflects how a windows machine works out of the box.

I agree that this is useful, but it means that tests take 3x as long, from sub-1h to 2.5h. In my opinion, that is a substantial tradeoff. Perhaps we could run a single test using default Windows settings, to cover the default Windows settings case and not slowdown the rest of the Windows tests?

Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, well done mark 👏

Comment on lines +32 to +33
nodes = 3,
tolerance = 1).get
Copy link
Contributor

@emizzle emizzle Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 I know this is probably faster but having 5 nodes has picked up on some issues in the past (around erasure coding).

@veaceslavdoina
Copy link
Contributor

veaceslavdoina commented Nov 21, 2024

I agree that this is useful, but it means that tests take 3x as long, from sub-1h to 2.5h.

It should be fast now, as we are using 127.0.0.1 instead of localhost.

@emizzle
Copy link
Contributor

emizzle commented Nov 21, 2024

It should be fast now, as we are using 127.0.0.1 instead of localhost.

Still, integration tests on the last commit are taking 15-22 mins longer than macos and linux, respectively. (53m on Linux, 1h on macos, 1h15m on Windows). According to your data, disabling IPv6 would save around 10 minutes.

@veaceslavdoina
Copy link
Contributor

veaceslavdoina commented Nov 21, 2024

I had same impression and did a test on GitHub Actions. Branch was based on the PR branch, but it was deleted already.

Just a note, the tests above were performed on a Cloud VM, not GitHub Actions.

VM: 8 vCPU, 16 GB RAM, 240 GB SSD
GA: 4 vCPU, 16 GB RAM,  14 GB SSD
Screenshot 2024-11-21 at 08 25 01

Just created a new branch again and run a new test, let's see the results.

Will check the speed of all tests in a single run as well.

@markspanbroek
Copy link
Member Author

Still, integration tests on the last commit are taking 15-22 mins longer than macos and linux, respectively. (53m on Linux, 1h on macos, 1h15m on Windows). According to your data, disabling IPv6 would save around 10 minutes.

The slow connection to localhost is solved in this PR by ensuring that we specify 127.0.0.1 instead. The tests are slower to run because we now do polling instead of websockets and because we now have a proof period of 90 seconds instead of 60 seconds.

The good news is that the windows test runs are now no longer the bottleneck. So we could split the CI test runs on Mac and Linux just like we did for Windows and probably reduce the overall CI time to about 30 minutes.

@veaceslavdoina
Copy link
Contributor

Btw, CI run on Linux fails now

error: rustc 1.78.0 is not supported by the following packages: [User]
  [email protected] requires rustc 1.79.0 [User]
  [email protected] requires rustc 1.79.0 [User]
  [email protected] requires rustc 1.79.0 [User]
  [email protected] requires rustc 1.79.0 [User]
Either upgrade rustc or select compatible dependency versions with [User]
`cargo update <name>@<current-ver> --precise <compatible-ver>` [User]
where `<compatible-ver>` is the latest version supporting rustc 1.78.0 [User]

rust_version:
description: "Rust version"
default: "1.78.0"

@veaceslavdoina
Copy link
Contributor

veaceslavdoina commented Nov 21, 2024

The good news is that the windows test runs are now no longer the bottleneck. So we could split the CI test runs on Mac and Linux just like we did for Windows and probably reduce the overall CI time to about 30 minutes.

Probably just in some cases, because as it shown on my screenshot, integration tests on Windows are still 1h 21m and all Linux is 1h 23m.

For some reason, in some cases all tests on Linux and macOS might take 1h 43m, like in your latest run, but it is an exception rather than a rule.

@veaceslavdoina
Copy link
Contributor

CI configuration

  • Prefer IPv4 over IPv6 in prefix policies was set in CI
  • Large runners were used to run testAll and only for Windows

Results

OS Runner Tests Duration Unittest Contracts Integrations Tools
🐧 General All 1h 26m 15m 23s 5m 54s 55m 28s 6m 59s
🍏 General All 1h 40m 21m 15s 9m 37s 57m 43s 6m 1s
🪟 General All 2h 6m 29m 48s 12m 28s 1h 2m 16s 15m 17s
🪟 8 vCPU All 1h 41m 18m 29s 8m 2s 56m 52s 9m 14s
🪟 16 vCPU All 1h 32m 13m 8s 5m 56s 59m 17s 6m 11s
🪟 32 vCPU All 1h 23m 10m 59s 4m 48s 54m 22s 5m 8s
🪟 General Unittest 35m 25s 29m 41s - - -
🪟 General Contracts 19m 51s - 13m 16s - -
🪟 General Integrations 1h 15m - - 1h 8m 34s -
🪟 General Tools 22m 56s - - - 16m 10s
details

https://github.com/codex-storage/nim-codex/actions/runs/11952751417

Screenshot 2024-11-21 at 17 38 46

Conclussions

  1. Windows integration tests duration is between 54m 22s(32 vCPU) and 1h 8m 34s (4 vCPU and whole workflow run is 1h 15m )
  2. Longest run (excluding Windows General / All) is macOS - 1h 40m

As Mark mentioned, it make sense to try to split tests for Linux and mcOS like we do it for Windows.

Next steps

Performed tests does not answer the question if IPv4/IPv6 settings improve run time on Windows with the IP over hostname we already using for RPC calls.

Now we are performing next set of runs to check

  • Does IPv4/IPv6 improve the run time on Windows
  • Minimal run time we can get if split Linux/macOS tests

@veaceslavdoina
Copy link
Contributor

Tests

# Tests 🐧 🍏 🪟 IPv6 on Windows
1 integration 59m 32s 1h 2m 1h 19m 🟢
2 integration 57m 52s 1h 10m 1h 16m 🟢
3 integration 1h 4m 1h 16m 1h 16m 🟢
4 integration 1h 2m 1h 17m 1h 20m 🔵
5 integration 1h 1m 1h 22m 1h 20m 🔵
6 integration 59m 24s 1h 12m 1h 20m 🔵

🟢 - IPv6 is primary
🔵 - IPv6 is secondary

Conclusion

  1. When we split Linux/macOS/Windows tests, max workflow duration is around 1h 20m. Probably, it make sense to check that option, to minimize total workflow duration, because sometimes we have slow run on at least on macOS.
    Split also might help to re-run just a specific set of tests in case of an intermittent failure.

  2. When we are using 127.0.0.1 as RPC endpoint over hostname localhost, setting Prefer IPv4 over IPv6 in prefix policies in Windows, does not improve tests run duration.

Based on the Usage limits, we can run concurrently 500 jobs and 50 macOS jobs.

Next steps

I will create a PR with CI split for Linux and macOS.

@emizzle
Copy link
Contributor

emizzle commented Nov 25, 2024

@markspanbroek let's merge this so we can try it in a release?

@markspanbroek markspanbroek added this pull request to the merge queue Nov 25, 2024
Merged via the queue into master with commit 29433ba Nov 25, 2024
17 of 25 checks passed
@markspanbroek markspanbroek deleted the fix-concurrency-issues branch November 25, 2024 12:44
veaceslavdoina pushed a commit that referenced this pull request Nov 26, 2024
* Use http subscriptions instead of websocket for tests

To work around this issue when subscriptions are
inactive for more than 5 minutes:
NomicFoundation/hardhat#2053

Use 100 millisecond polling; default polling interval
of 4 seconds is too close to the 5 second timeout for
`check eventually`.

* use .confirm(1) instead of confirm(0)

confirm(0) doesn't wait at all, confirm(1) waits
for the transaction to be mined

* speed up partial payout integration test

* update nim-ethers to version 0.10.0

includes fixes for http polling and .confirm()

* fix timing of marketplace tests

allow for a bit more time to withdraw funds

* use .confirm(1) in marketplace tests

to ensure that the transaction has been processed
before continuing with the test

* fix timing issue in validation unit test

* fix proof integration test

there were two logic errors in this test:
- a slot is freed anyway at the end of the contract
- when starting the request takes a long time, the
  first slot can already be freed because there were
  too many missing proofs

* fix intermittent error in contract tests

currentTime() doesn't always correctly reflect
the time of the next transaction

* reduce number of slots in integration test

otherwise the windows runner in the CI won't
be able to start the request before it expires

* fix timing in purchasing test

allow for a bit more time for a request to
be submitted

* fix timing of request submission in test

windows ci is so slow, it can take up to 40 seconds
just to submit a storage request to hardhat

* increase proof period to 90 seconds

* adjust timing of integration tests

reason: with the increased period length of 90 seconds, it
can take longer to wait for a stable challenge at the
beginning of a period.

* increase CI timeout to 2 hours

* Fix slow builds on windows

apparently it takes windows 2-3 seconds to
resolve "localhost" to 127.0.0.1 for every
json-rpc connection that we make 🤦
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants