Clarification to zeInit() description #232

Open
HoppeMateusz opened this issue Sep 29, 2023 · 5 comments
Labels
API: Core, documentation, needs discussion

Comments

@HoppeMateusz

The current description allows calling zeInit() multiple times with different environment variables.
https://spec.oneapi.io/level-zero/latest/core/api.html#zeinit

The application may call this function multiple times with different flags or environment variables enabled.

It should be stated that calling zeInit() again with the same flags but different environment variables has no effect on the driver, because the driver is initialized only once:

Only one instance of each driver will be initialized per process.

It should also be stated that the spec-defined environment variables (https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables) are honored only at the first initialization.
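The requested semantics can be sketched with a small mock. This is not the Level Zero driver or its API: `mock_init`, `mock_driver_t`, `MOCK_FLAG_GPU_ONLY`, and the accessors are hypothetical names, and the `env_affinity_mask` parameter stands in for reading an environment variable such as ZE_AFFINITY_MASK. The sketch is single-threaded and only illustrates that the first call captures its inputs and later calls change nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MOCK_FLAG_GPU_ONLY (1u << 0) /* hypothetical flag value */

/* Hypothetical stand-in for per-process driver state. */
typedef struct {
    bool     initialized;
    uint32_t flags;             /* flags captured by the first call */
    char     affinity_mask[32]; /* environment value captured by the first call */
} mock_driver_t;

static mock_driver_t g_driver; /* zero-initialized: not yet set up */

/* Mock init entry point (assumption, not zeInit itself). */
int mock_init(uint32_t flags, const char *env_affinity_mask)
{
    assert(env_affinity_mask != NULL);
    if (g_driver.initialized) {
        return 0; /* later calls succeed but are ignored */
    }
    g_driver.flags = flags;
    strncpy(g_driver.affinity_mask, env_affinity_mask,
            sizeof g_driver.affinity_mask - 1);
    g_driver.affinity_mask[sizeof g_driver.affinity_mask - 1] = '\0';
    g_driver.initialized = true;
    return 0;
}

/* Accessors for what the first call captured. */
const char *mock_affinity_mask(void) { return g_driver.affinity_mask; }
uint32_t    mock_flags(void)         { return g_driver.flags; }
```

Calling `mock_init(MOCK_FLAG_GPU_ONLY, "0")` and then `mock_init(MOCK_FLAG_GPU_ONLY, "1")` leaves the captured mask at "0", which is the behavior the proposed wording would document.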

@wdamon-intel added the documentation and API: Core labels on Oct 2, 2023
@jandres742

Thanks @HoppeMateusz. Customers have requested that this behavior be changed so that multiple calls to zeInit actually work.

Of course, for the L0 GPU driver that would imply a complex code refactoring. But putting implementation details aside, I think the first question to answer here is:

What is the best behavior for customers:

  • that only the first zeInit is valid, and environment variables are read only then?
  • or should the spec be relaxed to allow multiple calls to zeInit, possibly with changed environment variable values?

@MichalMrozek

It is not even possible to refactor the code; you would need to change the whole specification.
If you allow multiple zeInit calls with different variables, you can have a scenario where, within one process, one library calls zeInit to use the GPU, then another library calls zeInit to use the VPU only, and this second call would invalidate all in-flight submissions made after the first zeInit call. You would need to update all entry points to reflect that.

The reason why you have single initialization is to have single point in time where you set up the driver and all associated classes.
If you allow this step to happen multiple times, you create gigantic overhead, because you need to introduce many checks on things that were immutable to see whether they have changed. This would sacrifice a lot of L0 efficiency and would create a horribly complex driver implementation that would not be maintainable in the long run.

The only way to really move forward and keep efficiency is to update the spec to say that only the first initialization is valid and subsequent ones do not update anything.

@jandres742

jandres742 commented Oct 16, 2023

Thanks @MichalMrozek. The problem here is this:

The reason why you have single initialization is to have single point in time where you set up the driver and all associated classes.

In a multi-library application, there is no single point in time at which to call zeInit. Imagine an HPC application with the following libraries:

  • SYCL
  • MKL
  • Communication Libraries or MPI for internode communication
  • Library for intranode communication, like libfabric
  • Library for profiling

Each of these may call zeInit(), each with different requirements. For instance, the profiling tool may need tools and tracing, but if the zeInit from SYCL comes first, then tools and tracing may not be usable. Or you have the communication libraries or MPI using multiple ranks (processes), where some use the CPU and others the GPU, each initializing L0 differently. So the single point of entry actually becomes a race condition, decided by which library loads first.
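The ordering problem described above can be illustrated with a minimal first-call-wins mock. All names here (`mock_driver_t`, `mock_init`, `tracing_enabled_after_load_order`) are hypothetical, and the `want_tracing` argument is an assumed stand-in for whatever flag or environment variable a profiling library would rely on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state under first-call-wins semantics. */
typedef struct {
    bool initialized;
    bool tracing;
} mock_driver_t;

/* Only the first call configures the driver; later calls are ignored. */
static void mock_init(mock_driver_t *d, bool want_tracing)
{
    assert(d != NULL);
    if (d->initialized) {
        return;
    }
    d->initialized = true;
    d->tracing = want_tracing;
}

/* Simulates a profiler (wants tracing) and a runtime such as SYCL
   (does not ask for tracing) loading in either order. */
bool tracing_enabled_after_load_order(bool profiler_loads_first)
{
    mock_driver_t d = { false, false };
    if (profiler_loads_first) {
        mock_init(&d, true);  /* profiler */
        mock_init(&d, false); /* runtime  */
    } else {
        mock_init(&d, false); /* runtime  */
        mock_init(&d, true);  /* profiler: request is silently dropped */
    }
    return d.tracing;
}
```

In this model the same two libraries produce different process-wide state depending only on load order, which is the concern raised in this comment.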

As you say, fully supporting that mode would add enormous overhead, so maybe something in the middle could be provided. Maybe zeInit could allow incremental initialization (e.g., if zeInit has initialized a GPU, then a later call can initialize a CPU but not remove the GPU), or maybe we can find other alternatives.
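A minimal sketch of the incremental idea, assuming device classes are tracked as a bitmask that can only grow. The names `mock_init_incremental`, `MOCK_DEVICE_GPU`, and `MOCK_DEVICE_CPU` are hypothetical and do not correspond to Level Zero flags; the sketch ignores threading and the cost concerns raised elsewhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device-class bits (not Level Zero flags). */
#define MOCK_DEVICE_GPU (1u << 0)
#define MOCK_DEVICE_CPU (1u << 1)
#define MOCK_DEVICE_ALL (MOCK_DEVICE_GPU | MOCK_DEVICE_CPU)

static uint32_t g_active_devices; /* grows monotonically, never shrinks */

/* Adds the requested device classes to the active set and returns the
   classes that were newly brought up by this call. */
uint32_t mock_init_incremental(uint32_t requested)
{
    assert((requested & ~MOCK_DEVICE_ALL) == 0); /* only known bits allowed */
    uint32_t added = requested & ~g_active_devices;
    /* A real driver would initialize the `added` classes here. */
    g_active_devices |= requested;
    return added;
}

uint32_t mock_active_devices(void) { return g_active_devices; }
```

A first call with `MOCK_DEVICE_GPU` followed by a call with `MOCK_DEVICE_CPU` leaves both bits set, and repeating either call adds nothing.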

@MichalMrozek

That's why zeInit shouldn't have any parameters and always expose all devices.

Incremental initialization is the same problem. It has enormous overhead because you cannot assume that the portions of the driver that are already initialized will not change in the future; having to assume that they may grow at any point in time is exactly where the additional overhead comes from.

If you need to add some capabilities in the middle like tracing, this should be via new APIs, not via zeInit which is already heavily overloaded.

@jandres742

Thanks @MichalMrozek. I agree with this:

this should be via new APIs, not via zeInit which is already heavily overloaded.

I think that instead of relying on environment variables and flags passed to zeInit, we can have explicit APIs, so that each component initializes what it needs. zeInit would handle only general initialization, and everything else would be handled by additional APIs.
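One possible shape for that split, as a sketch only: a parameterless, idempotent general init plus a separate per-feature enable call. None of these functions exist in Level Zero; `mock_init`, `mock_enable_feature`, `mock_feature_enabled`, and the `MOCK_FEATURE_*` bits are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical optional capabilities a component may request. */
#define MOCK_FEATURE_TRACING (1u << 0)
#define MOCK_FEATURE_SYSMAN  (1u << 1)

static bool     g_initialized;
static uint32_t g_features;

/* General initialization: no flags, safe to call any number of times. */
int mock_init(void)
{
    g_initialized = true;
    return 0;
}

/* Explicit opt-in by the component that needs the feature.
   Returns -1 if general initialization has not happened yet. */
int mock_enable_feature(uint32_t feature)
{
    assert(feature != 0u);
    if (!g_initialized) {
        return -1;
    }
    g_features |= feature;
    return 0;
}

bool mock_feature_enabled(uint32_t feature)
{
    return (g_features & feature) == feature;
}
```

With this shape a profiling library could call `mock_init()` and then `mock_enable_feature(MOCK_FEATURE_TRACING)` regardless of whether another library initialized first, so load order no longer decides which capabilities are available.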

4 participants