Multiprocessing isolation #22

Open
dborowiec10 opened this issue Apr 30, 2021 · 0 comments

@KnowingNothing
I have a question regarding the call path of the parallel_evaluate routines in FlexTensor.
When tuning for an "llvm" target, it makes sense that the "parallel" option should be set to at most the number of cores, so that there is a form of isolation on the hardware and the performance measurements stay accurate, with each measurement running on a separate core.
In the case of a "cuda" target, where the candidate schedules are executed on the GPU, this isn't as straightforward.
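
Just to make the "llvm" case concrete, this is roughly what I mean by capping parallelism at the core count (a small sketch of my own, not FlexTensor code; `requested_parallel` is just a stand-in for whatever value I would pass in):

```python
import os

# My own sketch: for an "llvm" target, cap the number of parallel
# measurement processes at the core count so each candidate schedule
# is measured on its own core.
requested_parallel = 8                # stand-in for the value actually requested
num_cores = os.cpu_count() or 1
parallel = min(requested_parallel, num_cores)
print(f"using parallel = {parallel} on a {num_cores}-core machine")
```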

When looking at the number of schedules executing simultaneously on the GPU, I observed several processes present on the device at once during a single measurement trial (confirmed by inspecting nvidia-smi). This is somewhat concerning: multiple processes may compete for GPU resources at the same time during measurement, which effectively makes the measurement inaccurate, since the candidates can be scheduled in a different order internally by NVIDIA's CUDA driver scheduler.
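
For reference, this is roughly how I watched the device during a trial (a small helper of my own that just shells out to nvidia-smi, nothing FlexTensor-specific):

```python
import subprocess
import time

# Poll the compute processes registered on the GPU while a tuning run is
# active; seeing several PIDs at once is the behaviour described above.
for _ in range(10):
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    procs = [line for line in out.splitlines() if line]
    print(f"{len(procs)} compute process(es) on the device")
    for line in procs:
        print("  " + line)
    time.sleep(1)
```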

FlexTensor seems to reuse part of TVM's RPC pipeline for evaluating modules, namely "func.time_evaluator", which goes through the RPCTimeEvaluator module in basically the same fashion as AutoTVM. The main difference is that in AutoTVM the LocalRunner (a locally spawned RPC server and tracker) keeps the "parallel" option hardcoded to 1, so as to maintain isolation between two separate candidate schedules being measured on the device. I thought it might be something to do with the torch.multiprocessing module, but that appears to be a simple wrapper over Python's multiprocessing that allows a portion of memory to be shared between several processes and, on the face of it, has nothing to do with isolation when it comes to GPU execution.
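
To be explicit about which call I mean, the measurement boils down to the standard local time_evaluator pattern below (a self-contained sketch for illustration, not FlexTensor's actual code; written against the TVM te/schedule API current as of this issue):

```python
import numpy as np
import tvm
from tvm import te

# A trivial CUDA kernel, built and timed the way a measurement worker would.
n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))

func = tvm.build(s, [A, B], target="cuda")
ctx = tvm.gpu(0)  # tvm.cuda(0) on newer releases
a = tvm.nd.array(np.random.uniform(size=n).astype("float32"), ctx)
b = tvm.nd.array(np.zeros(n, dtype="float32"), ctx)

# func.time_evaluator is the call that goes through RPCTimeEvaluator; if
# several worker processes reach it at the same time, they all have kernels
# resident on the same GPU, which is the interleaving I am worried about.
evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print("mean time: %g s" % evaluator(a, b).mean)
```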

For "cuda" targets, would you recommend to stick with parallel = 1 to ensure process isolation or maybe I'm missing something obvious that ensures the isolation is maintained despite multiple processes executing on the device simultaneously?
I've tested this with AutoTVM and Ansor and they both seem to isolate each candidate and as such, you can only see a single process being registered in nvidia-smi at one time.

Any chance you could clarify the above?

Many thanks! :)
