More tolerance for network errors pushing build products to primary builders #380
Comments
20231207 run log for kjohnson3:
Let's first try to figure out what's going on between kjohnson3 and nebbiolo1. Communications between machines located on the internal network at DFCI have been flawless so far, so it's kind of surprising that kjohnson3 would not be able to communicate with nebbiolo1 reliably.
On our side, we could probably try to improve the situation by configuring kjohnson3 like kunpeng2 by using
With this mode the machine doesn't send back the build products at all. This means rsync will no longer be needed on kjohnson3 and the machine will no longer need to use SSH keys to access the central node. Instead the central node will be in charge of retrieving the build products from kjohnson3, by calling rsync at regular intervals (e.g. every hour) like we do right now to retrieve the build products from kunpeng2.
This should be a lot more robust to network instabilities because whatever one call to rsync fails to retrieve will be retrieved by a later call once the network is back. It won't solve the
But let's wait and hear what the DFCI IT folks have to say about this first.
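To make the pull idea concrete, here is a minimal Python sketch of a central node calling rsync against a builder at regular intervals. It is not taken from the BBS code: the function names, the use of rsync over SSH with `-av`, and the paths in the usage note are all assumptions for illustration. The point it shows is that each round is independent, so a round that fails during a network outage is simply made up by the next one.

```python
import subprocess
import time


def build_pull_command(user, host, remote_dir, local_dir):
    """Build an rsync command that copies remote_dir's contents into local_dir.

    Hypothetical helper: the trailing slash on the source makes rsync copy
    the directory's contents rather than the directory itself.
    """
    src = f"{user}@{host}:{remote_dir.rstrip('/')}/"
    return ["rsync", "-av", src, local_dir]


def pull_periodically(cmd, rounds, interval=3600.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run cmd `rounds` times, sleeping `interval` seconds between rounds.

    A failed round is not retried immediately; files it missed are picked
    up by a later round, since rsync only transfers what is absent or
    changed on the receiving side. Returns the exit code of every round.
    """
    exit_codes = []
    for i in range(rounds):
        exit_codes.append(run(cmd).returncode)
        if i < rounds - 1:
            sleep(interval)
    return exit_codes
```

A call such as `pull_periodically(build_pull_command("biocbuild", "kjohnson3", "/some/products-out", "/some/products-in/kjohnson3"), rounds=24)` would pull once an hour for a day; both paths here are placeholders, not the real layout.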
Not all build products are being sent from kjohnson3 to nebbiolo1. Checking the tail of the install-push.log:
If I run
/usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install
I am able to push the remaining products.
On nebbiolo1, we see errors like the following in the postrun.log when this error happens:
I haven't looked at the code that performs the push, but maybe it should wait a little longer to give the network disturbance a chance to resolve, then try again, and send a notification if it still fails to rsync all products after X attempts.
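As a rough illustration of that suggestion (not the actual BBS push code, which I haven't read), a retry wrapper in Python could look like the sketch below. The function name, the default of 5 attempts, the 30-second base delay with doubling, and the `notify` callback are all assumptions; `cmd` would be the rsync command as an argument list.

```python
import subprocess
import time


def push_with_retries(cmd, max_attempts=5, base_delay=30.0,
                      run=subprocess.run, sleep=time.sleep, notify=print):
    """Run cmd until it exits with 0 or max_attempts is reached.

    Waits base_delay, then 2*base_delay, 4*base_delay, ... between
    attempts so a short network disturbance has time to clear.
    Calls notify(message) once if every attempt failed.
    Returns True on success, False otherwise.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # Exponential backoff: no sleep after the final attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    notify(f"rsync push failed after {max_attempts} attempts "
           f"(last exit code {result.returncode})")
    return False
```

Because rsync skips files that already arrived intact, rerunning the same command after a partial transfer only sends what is still missing, which is why a plain retry is enough here.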