
Resolve port conflict when starting multiple PMM agents #2573

Closed
wants to merge 12 commits

Conversation

GuanqunYang193
Contributor

In our production setting, we often deploy multiple instances of the PMM client on the same machine. However, we occasionally run into port conflicts that are not resolved automatically until we restart the PMM client. The cause of the problem is as follows:

  • When the PMM client starts agents, it checks whether a port is unoccupied and then uses the selected port to launch the agent. However, when we start multiple PMM clients simultaneously (in our experiments, reproducing this problem consistently requires 5 or more concurrent instances), there is a race condition on the ports: different clients can simultaneously consider the same port available (see the sketch after this list).

  • When the PMM client fails to start the agent process, it keeps retrying with the same parameters, so the process never starts successfully.
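
Roughly, the race looks like this. The snippet below is a simplified sketch, not the actual PMM code; the port number and function names are made up for illustration:

```go
// Simplified sketch of the check-then-use race: a client "checks" a port by
// listening and immediately closing, then launches an exporter that binds the
// same port later. Between the close and the exporter's bind, another client
// running the same check can grab the port.
package main

import (
	"fmt"
	"net"
)

// portIsFree reports whether the port can currently be bound.
// The result is only valid at the instant of the check.
func portIsFree(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	_ = l.Close() // the port is released here, before the agent actually uses it
	return true
}

func main() {
	const port = 42000 // hypothetical port inside the configured range

	if portIsFree(port) {
		// Window of opportunity: a second client running the same check can
		// also see the port as free and start its exporter first. The exporter
		// started here then fails with "address already in use".
		fmt.Println("port looks free, starting exporter on", port)
	}
}
```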

Therefore, I submitted this PR to address the issue:

  • Implemented a limit on the number of retries for the process. When the retries exceed this limit, the process requests new parameters from the supervisor (a rough sketch follows this list).
  • When the supervisor detects that the process needs new parameters, it selects a new port and passes the updated parameters to the process.
  • To test this mechanism, I also submitted a new test that uses "nc" to simulate port conflicts.
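
To make the first two points concrete, here is a rough sketch of the retry limit. The names (`process`, `start`, `maxStartRetries`, the `newParams` channel) are hypothetical and not the actual code in this PR:

```go
// Sketch of the retry limit: after maxStartRetries failed starts, the process
// stops retrying with the same parameters and waits for the supervisor to
// hand it a new port.
package main

import (
	"errors"
	"fmt"
)

const maxStartRetries = 3 // hypothetical limit on starts with the same params

type process struct {
	port      int
	newParams chan int // the supervisor sends a replacement port here
}

// start pretends to launch an exporter; here it always fails as if the port were busy.
func (p *process) start() error {
	return errors.New("bind: address already in use")
}

// run retries start() and, once the limit is hit, asks for new parameters.
func (p *process) run() {
	retries := 0
	for {
		if err := p.start(); err == nil {
			return
		}
		retries++
		if retries >= maxStartRetries {
			fmt.Println("giving up on port", p.port, "- requesting new params")
			p.port = <-p.newParams // supervisor picks a free port and replies
			return                 // shortened for the sketch; real code keeps going
		}
	}
}

func main() {
	p := &process{port: 42000, newParams: make(chan int, 1)}
	p.newParams <- 42001 // stand-in for the supervisor choosing a new port
	p.run()
	fmt.Println("would retry with port", p.port)
}
```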

@GuanqunYang193 GuanqunYang193 requested a review from a team as a code owner October 24, 2023 23:09
@GuanqunYang193 GuanqunYang193 requested review from idoqo and artemgavrilov and removed request for a team October 24, 2023 23:09
@it-percona-cla

it-percona-cla commented Oct 24, 2023

CLA assistant check
All committers have signed the CLA.

@BupycHuk
Member

Hi @GuanqunYang193, could you tell us why you run multiple PMM clients on the same machine? You can configure each PMM client to use a different port range so they don't conflict with each other.

@GuanqunYang193
Contributor Author

> Hi @GuanqunYang193, could you tell us why you run multiple PMM clients on the same machine? You can configure each PMM client to use a different port range so they don't conflict with each other.

Thanks for the quick reply! We have multiple database instances deployed with Docker on a single machine, and each database has its own PMM agent, because:

  1. We don't want to tightly couple the client to the PMM server it reports to, which would make our instances and PMM clients hard to migrate.
  2. If a single PMM client on the host were responsible for all instances, an issue like a crash or outage would result in the loss of metrics for every instance.
  3. We can easily limit the resources of each PMM client, which allows for greater flexibility and scalability when adding or modifying database instances.

@GuanqunYang193
Contributor Author

> Hi @GuanqunYang193, could you tell us why you run multiple PMM clients on the same machine? You can configure each PMM client to use a different port range so they don't conflict with each other.

Using different port ranges would also be relatively complex in this shared environment. The number of available ports on shared machines is limited, and we regularly deploy new instances and remove old ones, which means we would have to manage the ports much like memory management!
We believe the PMM client should be able to dynamically select and retry new ports within its port range, rather than getting stuck indefinitely when the initial process startup fails.

@BupycHuk
Member

Hi @GuanqunYang193,

> Using different port ranges would also be relatively complex in this shared environment. The number of available ports on shared machines is limited, and we regularly deploy new instances and remove old ones, which means we would have to manage the ports much like memory management!

Even with this feature implemented, you would still have to manage the ports on which pmm-agent provides its API.

> We believe the PMM client should be able to dynamically select and retry new ports within its port range, rather than getting stuck indefinitely when the initial process startup fails.

Although I agree with this statement, I believe it should be implemented another way:
When the supervisor starts a process and the process fails because of a busy port, instead of trying to pass new params into the existing object, the process should completely stop and be destroyed, and then the supervisor should create a new Process object and start it. We have status changes, so we can add one more status for the complete-failure case and use it as a trigger on the supervisor side.
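
Something along these lines (just a rough sketch with invented names, not the real supervisor code):

```go
// Sketch of the suggested flow: the process reports a terminal "failed"
// status instead of mutating its own params, and the supervisor reacts by
// destroying it and building a brand-new process on a different port.
package main

import "fmt"

type status int

const (
	statusStarting status = iota
	statusRunning
	statusFailed // new terminal status for the busy-port case
)

type proc struct {
	port    int
	changes chan status
}

func newProc(port int) *proc {
	return &proc{port: port, changes: make(chan status, 1)}
}

// start simulates a bind failure and reports the terminal status.
func (p *proc) start() {
	p.changes <- statusFailed
}

// supervise watches status changes and recreates the process on failure.
func supervise(p *proc, nextFreePort func() int) *proc {
	p.start()
	if st := <-p.changes; st == statusFailed {
		// Destroy the old process and start a new one with a new port.
		replacement := newProc(nextFreePort())
		fmt.Println("recreating process on port", replacement.port)
		return replacement
	}
	return p
}

func main() {
	next := func() int { return 42001 } // stand-in for the supervisor's port registry
	p := supervise(newProc(42000), next)
	fmt.Println("now supervising process on port", p.port)
}
```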

@GuanqunYang193
Contributor Author

GuanqunYang193 commented Jan 19, 2024

> When the supervisor starts a process and the process fails because of a busy port, instead of trying to pass new params into the existing object, the process should completely stop and be destroyed, and then the supervisor should create a new Process object and start it. We have status changes, so we can add one more status for the complete-failure case and use it as a trigger on the supervisor side.

Thanks for suggesting another option!
The idea of adding a failure status is reasonable, and I will change how this feature is implemented.
