Idea: to reduce cluster breaking due to pipeline/processing large files, add a quick check to see cache node burden & I/O traffic before final file transfer from /scratch to /hot #116
Alfredo-Enrique started this conversation in Ideas
Replies: 1 comment
-
Possible commands we could use:
-
One of the recent reasons our cluster keeps going down is the burden our cache nodes suffer when reading/writing large files created by our pipelines.
A good fix could be that when a heavy-load pipeline is finishing up, and the UCLA_CDS flag = True, we use a simple command to check the traffic on the cache node before transferring the final files from `/scratch` to `/hot`. If the traffic is higher than a certain threshold, we sleep for 20 min and check again. We can be conservative with the threshold too.

Taka mentioned we have 6 cache nodes with about 4 GB/s of total I/O. While the I/O is dynamic and can vary, for this purpose let's assume it is divided equally, giving roughly 0.66 GB/s per cache node. Even so, we can take a conservative approach and say that if the burden on the cache node is less than 0.50 GB/s of I/O, we proceed with the transfer from `/scratch/` to `/hot/`. We could also easily take the size of the final file into account before making this sleep call if we want more dynamic handling.

Since this is the last step of the pipelines, we don't have to worry about Nextflow parallelization, as it's only transferring the final file.
This approach should take only a few lines of code, and means that we don't have to keep worrying about bringing down the cluster.
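The check-sleep-retry loop above could be sketched roughly like this. The `probe` callable is hypothetical: how the per-node I/O rate would actually be measured (e.g. parsing `/proc/diskstats` or querying site monitoring) is an open question, so it is injected as a parameter here. The 0.50 GB/s threshold and 20 min sleep come from the numbers in the proposal.

```python
import time

# From the proposal: 6 cache nodes sharing ~4 GB/s of total I/O,
# i.e. roughly 0.66 GB/s each; proceed only below a conservative 0.50 GB/s.
IO_THRESHOLD_GBPS = 0.50
SLEEP_SECONDS = 20 * 60  # re-check every 20 minutes

def wait_for_low_cache_load(probe, threshold=IO_THRESHOLD_GBPS,
                            sleep_seconds=SLEEP_SECONDS, max_checks=48):
    """Block until the cache node's I/O load (GB/s) drops below `threshold`.

    `probe` is a hypothetical callable returning the current I/O rate in
    GB/s for the target cache node; measuring it is left to the cluster's
    monitoring tooling.
    """
    for _ in range(max_checks):
        if probe() < threshold:
            return True   # safe to transfer the final file /scratch -> /hot
        time.sleep(sleep_seconds)
    return False          # gave up after max_checks; caller decides what to do

# Example with a fake probe that reports gradually decreasing load:
readings = iter([0.62, 0.55, 0.41])
ok = wait_for_low_cache_load(lambda: next(readings), sleep_seconds=0)
```

`max_checks` caps the total wait so a permanently busy node can't stall the pipeline forever; the caller can then log a warning or transfer anyway.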