
Connectivity problems with the RETORCH agent #144

Closed
augustocristian opened this issue Aug 16, 2024 · 7 comments

@augustocristian
Contributor

augustocristian commented Aug 16, 2024

For the past two weeks, we've been experiencing problems with all the pipelines in the CI system. I've been researching the problem, and the first insights point to connectivity issues with the agent.
I think I've found the root cause: it seems that the virtual machine is not detecting all the installed memory, preventing the containers from running in parallel (I've noticed a significant slowdown when attempting to perform tasks manually during execution).
I observed this with the htop command: the VM starts increasing its use of swap space instead of the available memory:

I've attempted to restore the only thing that changed from the previous executions: the Docker version. I created a script to remove the current version and manually install an older one (26.0.1), but the problem remains.
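A downgrade script along those lines might look like the sketch below, written as a dry run so nothing is removed by accident. It assumes an apt-based Ubuntu with the official Docker repository; the pinned `TARGET` version string is an assumption, so check `apt-cache madison docker-ce` on the VM for the real candidates:

```shell
#!/bin/sh
# Dry-run sketch of a Docker downgrade on an apt-based distro (Ubuntu).
# The TARGET version string is an assumption for 26.0.1; list the real
# candidates on the VM with: apt-cache madison docker-ce
TARGET="5:26.0.1-1~ubuntu.22.04~jammy"

# Printed as "echo ..." so nothing is actually removed here; drop the
# echo prefixes to apply the downgrade for real.
echo sudo apt-get remove -y docker-ce docker-ce-cli containerd.io
echo sudo apt-get install -y "docker-ce=$TARGET" "docker-ce-cli=$TARGET"
echo sudo apt-mark hold docker-ce docker-ce-cli  # keep the pinned version
```

Holding the packages afterwards prevents an unattended upgrade from silently reinstalling the newer version.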
The /proc/meminfo output is aligned with htop:
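A quick way to confirm what the kernel actually detected is to read the standard /proc/meminfo fields directly (a minimal sketch; the field names are the usual ones on any Linux guest):

```shell
#!/bin/sh
# Print the memory the kernel detected and the swap situation.
# If MemTotal is far below what the hypervisor assigned, the guest
# is not seeing its dynamic memory and will spill into swap.
awk '/^(MemTotal|MemAvailable|SwapTotal|SwapFree):/ {print $1, $2, $3}' /proc/meminfo
```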
Could there be some issue or misconfiguration with the hypervisor? @javiertuya

@javiertuya
Contributor

@augustocristian I have checked the host settings; no changes were made during this year (the only exception is the disk size extension of the retorch VM several months ago). The current configuration of this VM seems to be the same as the other VMs:

  • Dynamic memory, from 512 to 32768 MB, with a 20% memory buffer
  • 8 virtual processors
  • All integration services offered to the VM

Currently, the host is not under memory pressure: there are 78 GB of free unused host memory. The VM is taking 4098 MB.

I'm doing some tests now...

@javiertuya
Contributor

@augustocristian I re-ran one of the failed PRs (#139) and observed from the host:

  • Job run 2: memory starts at 4098 MB and does not change; the tjobs start failing and I cancel the run.
  • I try to shut down the VM, but it is not possible: the host says indefinitely that it is shutting down the VM, but it never does.
  • I hard-disconnect the VM and start it up again. From the first second, it gets 4 GB.
  • Job run 3: when the Jenkins agent gets connected, I run the PR again. The assigned memory starts rising just after Stage 0 begins. After a 3-minute run, it reaches 32 GB. All tjobs finish successfully, and the VM returns part of the memory to the host; it is now using 26 GB.
  • Job run 4: as in run 3, after a few minutes the assigned memory goes up to 32 GB. When stage 1 is finishing, the VM returns memory down to 10 GB; when stage 2 starts, it goes up to 20 GB. All tjobs finish successfully, and the VM is now using 20 GB.
  • I try to shut down the VM and everything is OK: the VM starts releasing memory down to 4 GB and shuts down.
  • Finally, I start it up again.

From this scenario:

  • It seems that the VM is able to successfully request memory from the host and release it.
  • The initial situation, where the VM is stuck at 4 GB, unable to request more memory and unable to shut down, is a degraded state.
  • We should check whether the VM is using the right kernel for the integration services.
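The kernel check in the last point can be done from inside the guest; a sketch, assuming the standard Hyper-V guest driver names (`hv_balloon` is the module that implements dynamic memory):

```shell
#!/bin/sh
# Verify the guest kernel and the Hyper-V integration modules.
uname -r   # an Azure-tuned kernel, e.g. 6.5.0-1025-azure, is expected here

# hv_balloon handles dynamic memory; hv_utils covers shutdown/heartbeat.
lsmod | grep -E '^hv_(balloon|utils)' \
  || echo "Hyper-V integration modules not loaded"
```

If the modules are missing, the host can neither inflate the guest's memory nor complete a clean shutdown request, which matches both symptoms above.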

@augustocristian
Contributor Author

@javiertuya thanks for everything. I checked this morning and reinstalled the kernel recommended by Microsoft:
https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/supported-ubuntu-virtual-machines-on-hyper-v
The kernel version currently in use is 6.5.0-1025-azure.
I'll continue monitoring over the next few days; the VM never reported a "degraded" state to me, at least over SSH (maybe it did on the local terminal).

@javiertuya
Contributor

@augustocristian If the VM did not have this kernel installed, that is the most probable cause:

  • What I don't understand is that you say you installed this kernel in the morning, but when I ran the tests it was afternoon, and the VM was stuck at 4 GB. Did you reboot the VM after the install?
  • The "degraded" state is a conclusion from the tests, not a status. If the VM is suddenly unable to request memory, that means it is somehow degraded.

@augustocristian
Contributor Author

I rebooted the VM several times and conducted multiple tests using the original kernel, the Azure kernel, and other kernels recommended by the community to address the memory allocation issues. None of these solutions worked. When I start modifying the RETORCH tool, I plan to include this debugging information to prevent similar issues in the future (currently, it only logs the Docker and Compose versions).
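A hypothetical debug-info snapshot along those lines, extending the Docker/Compose version logging with the kernel and memory state (the commands are standard; the output format is only an illustration, not the actual RETORCH log format):

```shell
#!/bin/sh
# Hypothetical CI debug snapshot: kernel, container tooling, and the
# memory/swap state, so a stuck dynamic-memory situation shows up in logs.
echo "kernel : $(uname -r)"
docker --version 2>/dev/null || echo "docker : not found"
docker compose version 2>/dev/null || echo "compose: not found"
grep -E '^(MemTotal|SwapTotal|SwapFree):' /proc/meminfo
```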
Thank you, sincerely

@javiertuya
Contributor

@augustocristian So I guess that in the afternoon the VM was not yet using the Azure kernel, right? This is the official kernel supporting the host integration services, and one of the most important of those services is dynamic memory management.

Tell me if re-running all the failing updates succeeds so the merges can be performed.

@augustocristian
Contributor Author

All the branches (leaving aside the Jupyter ones) are passing, @javiertuya. I'm waiting to include the changes from #145 to update #143 and check which changes should be included to solve the problem.
