Collecting outputs from running in parallel is very slow #19
Comments
On 2019-02-14 Benjamin Krikler (bkrikler) wrote: @vmilosev you are one of the people who have mentioned this to me.
On 2019-03-29 Benjamin Krikler (bkrikler) wrote: mentioned in merge request gitlab:!43
On 2019-04-08 Benjamin Krikler (bkrikler) wrote: I've just noticed / remembered that the collector for cut flows uses code very similar to what the old binned-dataframe stage was doing, and since that stage was sped up quite a bit by the rewrite, it would be worth adopting the same approach for the cut-flow stage collector. It's not obvious that this is a bottleneck at this point, but it could still be helpful. We also need to increase the testing of the cut-flow collector, since it is currently poorly covered.
On 2019-04-17 Benjamin Krikler (bkrikler) wrote: mentioned in merge request gitlab:!44
According to @vukasinmilosevic, the above two merge requests do seem to have solved the slowness he was seeing. I'm going to close this for now, but we'll keep an eye on the situation and can re-open it if it becomes an issue again.
I'm finding that the collecting of outputs can be pretty slow. For ~100 datasets, it can take 5+ minutes to merge all the dataframes.
Ah poop. That's still better than it was when this issue was opened, but it's obviously not good enough, so I'll reopen this. @rob-tay, can you give any extra details? Which systems are you using, what sorts of steps are you doing with the data, what size dataframes (number of bins) / cut-flows (total number of cuts) are you using, etc.?
I have found a similar issue when running over a large dataset in |
The most recent insight into this is that the poor performance relates to the number of bins being used. @asnaylor, correct me if I'm wrong, but for your comment above, were you actually binning your values? I plan to implement a parallelised merge step which takes advantage of multiple cores or a batch cluster, if one is available. This should improve things considerably. However, it's awkward to achieve with the AlphaTwirl backend, so it's waiting on us adding Parsl support and / or Coffea executors.
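For illustration, here is a minimal sketch of what such a parallelised merge step could look like, assuming the per-task outputs are binned pandas dataframes; the function names and the tree-reduction strategy are my own, not anything currently in fast-carpenter:

```python
# Hypothetical sketch: instead of folding N partial results into one
# accumulator serially, merge them pairwise in parallel rounds (a tree
# reduction), so the collect step can use several cores at once.
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def merge_pair(left, right):
    # Sum the contents of matching bins from two binned dataframes.
    combined = pd.concat([left, right])
    return combined.groupby(level=list(range(combined.index.nlevels))).sum()


def tree_merge(results, max_workers=4):
    # Reduce a list of partial dataframes to a single one in ~log2(N) rounds.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        while len(results) > 1:
            lefts, rights = results[::2], results[1::2]
            merged = list(pool.map(merge_pair, lefts, rights))
            if len(results) % 2:  # carry an unpaired leftover into the next round
                merged.append(results[-1])
            results = merged
    return results[0]
```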
Yeah @benkrikler, I'm pretty sure I didn't actually bin the values.
BTW: with boost-histogram now being relatively mature, we could try changing the intermediate format. That said, we probably shouldn't fixate on a particular solution. Instead, we can go the route of uproot, where you can choose between multiple implementations. This way we can try new things more frequently and make it easier to contribute new implementations. All that these implementations would need to provide is a way to
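To make that concrete, here's a small illustration (with made-up binning and data, not fast-carpenter's actual interfaces) of why boost-histogram is attractive as an intermediate format: partial histograms from parallel tasks merge by plain addition of bin contents, which is cheap compared with re-aligning dataframes:

```python
import boost_histogram as bh
import numpy as np

# Illustrative binning; in practice this would come from the stage config.
axis = bh.axis.Regular(50, 0.0, 500.0)

# Each parallel task fills its own histogram from its chunk of events...
partials = []
for seed in range(3):
    rng = np.random.default_rng(seed)
    h = bh.Histogram(axis)
    h.fill(rng.exponential(100.0, size=10_000))
    partials.append(h)

# ...and the collect step reduces them with "+": an O(n_bins) array sum,
# independent of how many events were processed.
total = sum(partials[1:], partials[0])
print(total.sum())  # total entries landing inside the binning range
```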
Imported from gitlab issue 19
It's been noticed by several people that when a large number of events and files are processed in a parallel mode (batch system, local multiprocessing, etc.), the individual tasks run quickly but the final collecting step can take a long time. It would be good to understand why this is and accelerate that step as much as possible; certainly merging many pandas dataframes shouldn't take too long.
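As a hedged illustration of where that time can go (the dataframes below are stand-ins for the real per-task outputs), folding ~100 dataframes together one at a time re-allocates the accumulator at every step, whereas a single concat followed by a groupby-sum does the whole merge in one pass:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = [
    pd.DataFrame({"n": rng.integers(0, 100, size=1000)},
                 index=pd.RangeIndex(1000, name="bin"))
    for _ in range(100)
]

# Slow pattern: fold the results in serially, copying the accumulator each time.
total_slow = frames[0]
for df in frames[1:]:
    total_slow = total_slow.add(df, fill_value=0)

# Faster pattern: concatenate once, then sum duplicate bins in one groupby.
total_fast = pd.concat(frames).groupby(level="bin").sum()

assert (total_slow["n"] == total_fast["n"]).all()
```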