@KnowingNothing
I have a question regarding the call path of the parallel_evaluate routines in FlexTensor.
When tuning for an "llvm" target, it makes sense to set the "parallel" option to at most the number of cores, so that the hardware provides a degree of isolation and the performance measurements stay accurate, with each measurement running on its own core.
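For context, this is roughly how I invoke the tuner for the "llvm" case. The exact keyword arguments of flextensor.scheduler.schedule are my reading of the code, so treat this as a sketch rather than the canonical call; "parallel" is the option this issue is about:

```python
import os

# Sketch only: the exact signature of flextensor.scheduler.schedule is my
# assumption here; "parallel" is the option discussed in this issue.
from flextensor.scheduler import schedule

task_key = "gemm_1024x1024x1024"  # placeholder for a previously registered task key

# Cap the number of concurrently measured candidates at the core count so
# each measurement runs on its own otherwise-idle core.
n_parallel = min(8, os.cpu_count() or 1)

result = schedule(
    task_key,
    op_trial=100,
    method="searching",
    parallel=n_parallel,   # <= number of cores for the "llvm" target
)
```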
In the case of the "cuda" target, where the candidate schedules are executed on the GPU, this isn't as straightforward.
When I looked at how many schedules execute on the GPU simultaneously, I observed several processes present on the device during a single measurement trial (confirmed by inspecting nvidia-smi). This is somewhat concerning: multiple processes can compete for GPU resources at the same time during measurement, which makes the measurement inaccurate, since NVIDIA's CUDA driver scheduler may internally schedule the candidates in a different order.
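To make that concrete, this is roughly how I watched the device while a trial was running. The --query-compute-apps flag is standard nvidia-smi; the polling loop itself is just my own helper:

```python
import subprocess
import time

def count_gpu_compute_processes() -> int:
    """Count processes currently holding a CUDA context, via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return 0 if not out else len(out.splitlines())

# Poll while a measurement trial is running; with parallel > 1 I see several
# PIDs at once, i.e. candidates sharing the GPU during timing.
for _ in range(30):
    print("compute processes on GPU:", count_gpu_compute_processes())
    time.sleep(1)
```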
FlexTensor seems to reuse part of TVM's RPC pipeline for evaluating modules, namely "func.time_evaluator", which goes through the RPCTimeEvaluator module in essentially the same way as AutoTVM does. The main difference is that in AutoTVM, the LocalRunner (a locally spawned RPC server and tracker) keeps the "parallel" option hardcoded to 1, so that two candidate schedules are never measured on the device at the same time. I thought it might have something to do with the torch.multiprocessing module, but that appears to be a thin wrapper over Python's multiprocessing that lets several processes share a portion of memory and, on the face of it, has nothing to do with isolation of GPU execution.
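For reference, this is the measurement path I mean on the TVM side; the APIs below are the usual TVM/AutoTVM ones, and the numbers are only illustrative:

```python
import tvm
from tvm import autotvm

# The timing primitive FlexTensor also ends up calling: time_evaluator
# routes through RPCTimeEvaluator on the server side.
dev = tvm.cuda(0)  # tvm.gpu(0) on older TVM versions
# `func` would be a built tvm.runtime.Module; its timing call looks like:
#   timer = func.time_evaluator(func.entry_name, dev, number=10, repeat=3)
#   cost = timer(*args).mean

# AutoTVM's equivalent: LocalRunner spawns a local RPC tracker/server and
# measures candidates one at a time on the device.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3, timeout=10),
)
```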
For "cuda" targets, would you recommend to stick with parallel = 1 to ensure process isolation or maybe I'm missing something obvious that ensures the isolation is maintained despite multiple processes executing on the device simultaneously?
I've tested this with AutoTVM and Ansor, and both seem to isolate each candidate: only a single process shows up in nvidia-smi at any one time.
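For comparison, the Ansor setup I used is along these lines (standard auto_scheduler APIs; the trial count and log file name are arbitrary):

```python
from tvm import auto_scheduler

# Ansor times candidates through its runner; with the plain LocalRunner each
# candidate is measured in turn, matching the single GPU process I observe
# in nvidia-smi during measurement.
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=64,
    runner=auto_scheduler.LocalRunner(repeat=3, min_repeat_ms=300, timeout=10),
    measure_callbacks=[auto_scheduler.RecordToFile("ansor_log.json")],
)
# task.tune(tune_option)   # `task` being an auto_scheduler.SearchTask
```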
Any chance you could clarify the above?
Many thanks! :)