
Connectivity problems with the RETORCH agent #144

Closed
augustocristian opened this issue Aug 16, 2024 · 7 comments

@augustocristian
Contributor

augustocristian commented Aug 16, 2024

For the past two weeks, we've been experiencing problems with all the pipelines in the CI system. I've been researching the problem, and the first insights point to connectivity issues with the agent.
I think I've found the root cause: it seems that the virtual machine is not detecting all the installed memory, preventing the containers from running in parallel (I've noticed a significant slowdown when attempting to perform tasks manually during execution).
I observed this with the htop command: the VM starts increasing its use of swap space instead of the available memory:

I've attempted to restore the only thing that changed from the previous executions: the Docker version. I created a script to remove the current version and manually install an older one (26.0.1), but the problem remains.
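A downgrade script along those lines might look like the sketch below, written as a dry run so nothing is removed by accident. It assumes an apt-based Ubuntu with the official Docker repository; the pinned `TARGET` version string is an assumption, so check `apt-cache madison docker-ce` on the VM for the real candidates:

```shell
#!/bin/sh
# Dry-run sketch of a Docker downgrade on an apt-based distro (Ubuntu).
# The TARGET version string is an assumption for 26.0.1; list the real
# candidates on the VM with: apt-cache madison docker-ce
TARGET="5:26.0.1-1~ubuntu.22.04~jammy"

# Printed as "echo ..." so nothing is actually removed here; drop the
# echo prefixes to apply the downgrade for real.
echo sudo apt-get remove -y docker-ce docker-ce-cli containerd.io
echo sudo apt-get install -y "docker-ce=$TARGET" "docker-ce-cli=$TARGET"
echo sudo apt-mark hold docker-ce docker-ce-cli  # keep the pinned version
```

Holding the packages afterwards prevents an unattended upgrade from silently reinstalling the newer version.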
The /proc/meminfo output is aligned with htop:
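A quick way to confirm what the kernel actually detected is to read the standard /proc/meminfo fields directly (a minimal sketch; the field names are the usual ones on any Linux guest):

```shell
#!/bin/sh
# Print the memory the kernel detected and the swap situation.
# If MemTotal is far below what the hypervisor assigned, the guest
# is not seeing its dynamic memory and will spill into swap.
awk '/^(MemTotal|MemAvailable|SwapTotal|SwapFree):/ {print $1, $2, $3}' /proc/meminfo
```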
Could there be some issue or misconfiguration with the hypervisor? @javiertuya

@javiertuya
Contributor

@augustocristian I have checked the host settings; no changes were made during this year (the only exception is the disk size extension of the retorch VM several months ago). The current configuration of this VM seems to be the same as the other VMs:

  • Dynamic memory, from 512 to 32768 MB, with a 20% memory buffer
  • 8 virtual processors
  • All integration services offered to the VM

Currently, the host is not under memory pressure: there are 78 GB of free unused host memory. The VM is taking 4098 MB.

I'm doing some tests now...

@javiertuya
Contributor

@augustocristian I re-ran one of the failed PRs (#139) and observed from the host:

  • Job run 2: memory starts at 4098 MB and does not change; the tjobs start failing and I cancel the run.
  • I try to shut down the VM, but it is not possible: the host says indefinitely that it is shutting down the VM, but it never does.
  • I hard-disconnect the VM and start it up again. From the first second, it gets 4 GB.
  • Job run 3: when the Jenkins agent gets connected, I run the PR again. The assigned memory starts rising just after Stage 0 begins. After a 3-minute run, it reaches 32 GB. All tjobs finish successfully, and the VM returns part of the memory to the host; it is now using 26 GB.
  • Job run 4: as in run 3, after a few minutes the assigned memory goes up to 32 GB. When stage 1 is finishing, the VM returns memory down to 10 GB; when stage 2 starts, it goes up to 20 GB. All tjobs finish successfully, and the VM is now using 20 GB.
  • I try to shut down the VM and everything is OK: the VM starts releasing memory down to 4 GB and shuts down.
  • Finally, I start it up again.

From this scenario:

  • It seems that the VM is able to successfully request memory from the host and release it.
  • The initial situation, where the VM is stuck at 4 GB, unable to request more memory and unable to shut down, is a degraded state.
  • We should check whether the VM is using the right kernel for the integration services.
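The kernel check in the last point can be done from inside the guest; a sketch, assuming the standard Hyper-V guest driver names (`hv_balloon` is the module that implements dynamic memory):

```shell
#!/bin/sh
# Verify the guest kernel and the Hyper-V integration modules.
uname -r   # an Azure-tuned kernel, e.g. 6.5.0-1025-azure, is expected here

# hv_balloon handles dynamic memory; hv_utils covers shutdown/heartbeat.
lsmod | grep -E '^hv_(balloon|utils)' \
  || echo "Hyper-V integration modules not loaded"
```

If the modules are missing, the host can neither inflate the guest's memory nor complete a clean shutdown request, which matches both symptoms above.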

@augustocristian
Contributor Author

@javiertuya thanks for everything. I checked this morning and reinstalled the kernel recommended by Microsoft:
https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/supported-ubuntu-virtual-machines-on-hyper-v
The kernel version currently in use is 6.5.0-1025-azure.
I'll continue monitoring over the next few days; the VM never reported a "degraded" state to me, at least over SSH (maybe it did on the local terminal).

@javiertuya
Contributor

@augustocristian If the VM did not have this kernel installed, that is the most probable cause:

  • What I don't understand is that you say you installed this kernel in the morning, but when I ran the tests it was afternoon, and the VM was stuck at 4 GB. Did you reboot the VM after the install?
  • The "degraded" state is a conclusion from the tests, not a status. If the VM is suddenly unable to request memory, that means it is somehow degraded.

@augustocristian
Contributor Author

I rebooted the VM several times and conducted multiple tests using the original kernel, the Azure kernel, and other kernels recommended by the community to address the memory allocation issues. None of these solutions worked. When I start modifying the RETORCH tool, I plan to include this debugging information to prevent similar issues in the future (currently, it only logs the Docker and Compose versions).
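A hypothetical debug-info snapshot along those lines, extending the Docker/Compose version logging with the kernel and memory state (the commands are standard; the output format is only an illustration, not the actual RETORCH log format):

```shell
#!/bin/sh
# Hypothetical CI debug snapshot: kernel, container tooling, and the
# memory/swap state, so a stuck dynamic-memory situation shows up in logs.
echo "kernel : $(uname -r)"
docker --version 2>/dev/null || echo "docker : not found"
docker compose version 2>/dev/null || echo "compose: not found"
grep -E '^(MemTotal|SwapTotal|SwapFree):' /proc/meminfo
```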
Thank you, sincerely

@javiertuya
Contributor

@augustocristian So I guess that in the afternoon the VM was not yet using the Azure kernel, right? This is the official kernel supporting the host integration services, and one of the most important of those services is dynamic memory management.

Tell me if re-running all the failing updates succeeds so the merges can be performed.

@augustocristian
Contributor Author

All the branches (leaving aside the Jupyter ones) are passing, @javiertuya. I'm waiting to include the changes from #145 to update #143 and check which changes should be included to solve the problem.
