More tolerance for network errors pushing build products to primary builders #380

Open
jwokaty opened this issue Dec 7, 2023 · 2 comments

Comments

@jwokaty
Collaborator

jwokaty commented Dec 7, 2023

Not all build products are being sent from kjohnson3 to nebbiolo1. Checking the tail of the install-push.log:

ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------
2023-12-04 18:45:55 -0500 (Mon, 04 Dec 2023)
nb_jobs_completed_since_last_push: 10
push command: /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install

ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------
LAST PUSH!
2023-12-04 18:45:55 -0500 (Mon, 04 Dec 2023)
nb_jobs_completed_since_last_push: 2
push command: /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install

ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------

If I run the same command manually, /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install, I am able to push the remaining products.

On nebbiolo1, we see errors like the following in postrun.log when the push fails:

BBS> [make_all_LeafReports] Current working dir '/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/report'
BBS> [make_all_LeafReports] Creating report package subfolders and populating them with index.html files ... OK
BBS> [make_node_LeafReports] Node kjohnson3: BEGIN ...
Traceback (most recent call last):
  File "/home/biocbuild/BBS/BBS-report.py", line 2200, in <module>
    make_all_LeafReports(allpkgs, allpkgs_inner_rev_deps,
  File "/home/biocbuild/BBS/BBS-report.py", line 1867, in make_all_LeafReports
    make_node_LeafReports(allpkgs, node, long_link)
  File "/home/biocbuild/BBS/BBS-report.py", line 1758, in make_node_LeafReports
    make_LeafReport(leafreport_ref, allpkgs, long_link)
  File "/home/biocbuild/BBS/BBS-report.py", line 1732, in make_LeafReport
    write_Summary_asHTML(out, node_hostname, pkg, node_id, stage)
  File "/home/biocbuild/BBS/BBS-report.py", line 1382, in write_Summary_asHTML
    shutil.copyfile(filepath, dest)
  File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install/ADaCGH2.install-summary.dcf'

I haven't looked at the code that performs the push, but maybe it should wait a little longer for the network disturbance to resolve, then try again, and send a notification if it still fails to rsync all products after X attempts.
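To make the suggestion concrete, here is a minimal sketch of a retry wrapper with exponential backoff. It is not taken from BBS: the function name `push_with_retries`, the default delays, and the `notify` hook are all hypothetical, and the real push code may be structured quite differently.

```python
import subprocess
import time


def push_with_retries(cmd, max_attempts=5, initial_delay=30.0,
                      run=subprocess.run, sleep=time.sleep, notify=print):
    """Run a push command, retrying on a non-zero exit status.

    Hypothetical helper, not part of BBS. `cmd` is an argument list such as
    the rsync command shown in install-push.log. Returns True on success and
    False once `max_attempts` attempts have all failed.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # Give the network time to recover, doubling the wait each time.
            sleep(delay)
            delay *= 2
    # All attempts failed: report it instead of failing silently.
    notify(f"push failed after {max_attempts} attempts: {cmd}")
    return False


# Possible usage (argument values copied from the log above):
# push_with_retries([
#     "/usr/bin/rsync", "--rsh", "ssh -F /Users/biocbuild/.ssh/config", "-av",
#     "/Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/",
#     "biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/"
#     "bioc-mac-arm64/products-in/kjohnson3/install",
# ])
```

Because rsync only transfers what is missing on the receiving side, re-running the same command after a failure should be safe.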

@jwokaty
Collaborator Author

jwokaty commented Dec 11, 2023

20231207 run log for kjohnson3:

BBS> ==============================================================
BBS>   (Re)make BBS_CENTRAL_BASEURL/products-in/kjohnson3/... OK
BBS> [STAGE2] STARTING STAGE2 at Thu Dec  7 23:16:08 2023
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1257, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1303, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1252, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1012, in _send_output
    self.send(msg)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 952, in send
    self.connect()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 923, in connect
    self.sock = self._create_connection(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 843, in create_connection
    raise err
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 831, in create_connection
    sock.connect(sa)
OSError: [Errno 51] Network is unreachable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/biocbuild/BBS/BBS-run.py", line 811, in <module>
    STAGE2()
  File "/Users/biocbuild/BBS/BBS-run.py", line 423, in STAGE2
    waitForTargetRepoToBeReady()
  File "/Users/biocbuild/BBS/BBS-run.py", line 218, in waitForTargetRepoToBeReady
    f = urllib.request.urlopen(PACKAGES_url)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1375, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 51] Network is unreachable>

@hpages
Contributor

hpages commented Dec 11, 2023

Let's first try to figure out what's going on between kjohnson3 and nebbiolo1. Communication between machines located on the internal network at DFCI has been flawless so far, so it's kind of surprising that kjohnson3 would not be able to communicate with nebbiolo1 reliably.

On our side, we could probably try to improve the situation by configuring kjohnson3 like kunpeng2 by using export BBS_PRODUCT_TRANSMISSION_MODE="none".

With this mode the machine doesn't send back the build products at all. This means rsync will no longer be needed on kjohnson3 and the machine will no longer need to use SSH keys to access the central node. Instead the central node will be in charge of retrieving the build products from kjohnson3, by calling rsync at regular intervals (e.g. every hour) like we do right now to retrieve the build products from kunpeng2.

This should be a lot more robust to network instabilities because what can't be retrieved by a call to rsync will be retrieved by a later call to rsync when the network is back.
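As a rough illustration of why periodic pulls are self-healing, here is a sketch of the idea in Python. The function `pull_periodically` is hypothetical and not how the central node is actually set up (in practice this would more likely be a cron entry); the pull command in the usage comment is only an assumption built by reversing the paths from the push command above.

```python
import subprocess
import time


def pull_periodically(cmd, rounds, interval=3600.0,
                      run=subprocess.run, sleep=time.sleep):
    """Call `cmd` `rounds` times, `interval` seconds apart.

    Hypothetical helper, not part of BBS. A failed call is simply ignored:
    the next call transfers whatever is still missing. Returns the number
    of calls that exited with status 0.
    """
    succeeded = 0
    for i in range(rounds):
        if run(cmd).returncode == 0:
            succeeded += 1
        if i < rounds - 1:
            sleep(interval)
    return succeeded


# Possible usage on the central node (command is an assumption):
# pull_periodically([
#     "rsync", "-av",
#     "biocbuild@kjohnson3:/Users/biocbuild/bbs-3.19-bioc-mac-arm64/"
#     "products-out/install/",
#     "/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/"
#     "products-in/kjohnson3/install",
# ], rounds=24)
```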

It won't solve the waitForTargetRepoToBeReady() error that occurred on Dec 7 at the beginning of STAGE2, but it will be a start.
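For that error, one option would be to retry the urllib.request.urlopen() call when the network is unreachable. A minimal sketch, assuming nothing about how waitForTargetRepoToBeReady() is actually written; the helper name `open_url_with_retries` and its defaults are hypothetical:

```python
import time
import urllib.error
import urllib.request


def open_url_with_retries(url, max_attempts=10, delay=60.0,
                          opener=urllib.request.urlopen, sleep=time.sleep):
    """Open `url`, retrying when the connection attempt fails.

    Hypothetical helper, not part of BBS. urllib.error.URLError is a
    subclass of OSError, so catching OSError covers both the wrapped
    "Network is unreachable" error and lower-level socket errors.
    Note that HTTP error responses (urllib.error.HTTPError) are also
    subclasses of URLError and would be retried as well.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return opener(url)
        except OSError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last error.
                raise
            sleep(delay)


# Possible usage inside the wait loop:
# f = open_url_with_retries(PACKAGES_url)
```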

But let's wait and hear what the DFCI IT folks have to say about this first.
