Idea: to reduce cluster breaking due to pipeline/processing large files, add a quick check to see cache node burden & I/O traffic before final file transfer from /scratch to /hot #116
Alfredo-Enrique started this conversation in Ideas
Replies: 1 comment
-
Possible commands we could use:
-
One of the recent reasons our cluster keeps going down is the burden our cache nodes suffer when reading/writing large files created by our pipelines.
A good fix could be that when a heavy-load pipeline is finishing up, and the UCLA_CDS flag = True, we use a simple command to check the traffic on the cache node before transferring the final files from `/scratch` to `/hot`. If the traffic is higher than a certain threshold, we sleep for 20 min and check again. We can be conservative with the threshold too.

Taka mentioned we have 6 cache nodes with about 4 GB/s of total I/O. While the I/O is dynamic and can vary, for this purpose let's assume it is divided equally, giving roughly 0.66 GB/s per cache node. Even so, we can take a conservative approach and say that if the burden on the cache node is less than 0.50 GB/s of I/O, we proceed with the transfer from `/scratch/` to `/hot/`. We could also easily take the size of the final file into account before making this sleep call if we want more dynamic handling.

Since this is the last step of the pipelines, we don't have to worry about Nextflow parallelization, as it's only transferring the final file.
This approach should take only a few lines of code, and means that we don't have to keep worrying about bringing down the cluster.
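The check-sleep-retry loop above could be sketched roughly like this. The `probe` callable is hypothetical: how the per-node I/O rate would actually be measured (e.g. parsing `/proc/diskstats` or querying site monitoring) is an open question, so it is injected as a parameter here. The 0.50 GB/s threshold and 20 min sleep come from the numbers in the proposal.

```python
import time

# From the proposal: 6 cache nodes sharing ~4 GB/s of total I/O,
# i.e. roughly 0.66 GB/s each; proceed only below a conservative 0.50 GB/s.
IO_THRESHOLD_GBPS = 0.50
SLEEP_SECONDS = 20 * 60  # re-check every 20 minutes

def wait_for_low_cache_load(probe, threshold=IO_THRESHOLD_GBPS,
                            sleep_seconds=SLEEP_SECONDS, max_checks=48):
    """Block until the cache node's I/O load (GB/s) drops below `threshold`.

    `probe` is a hypothetical callable returning the current I/O rate in
    GB/s for the target cache node; measuring it is left to the cluster's
    monitoring tooling.
    """
    for _ in range(max_checks):
        if probe() < threshold:
            return True   # safe to transfer the final file /scratch -> /hot
        time.sleep(sleep_seconds)
    return False          # gave up after max_checks; caller decides what to do

# Example with a fake probe that reports gradually decreasing load:
readings = iter([0.62, 0.55, 0.41])
ok = wait_for_low_cache_load(lambda: next(readings), sleep_seconds=0)
```

`max_checks` caps the total wait so a permanently busy node can't stall the pipeline forever; the caller can then log a warning or transfer anyway.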