zesDeviceProcessesGetState is returning 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE) #809

Open
jketreno opened this issue Feb 8, 2025 · 7 comments
Labels
bug in queue L0 Sysman Issue related to L0 Sysman


jketreno commented Feb 8, 2025

I'm writing a small ze-top-like utility to monitor the B580. It looks like zesDeviceProcessesGetState should be able to report the processes using the GPU. However, it always returns ZE_RESULT_ERROR_UNSUPPORTED_FEATURE. That return code is documented for other APIs, but it doesn't appear in the list of valid return codes for zesDeviceProcessesGetState.

I have a valid device handle, which I'm using to call zesDeviceEnumEngineGroups to get usage info from the engines, and that's working well.

I've tried running as sudo in case there was a permissions issue, but that didn't help.

#define _MAX_PROCESSES 2048
processCount = _MAX_PROCESSES;
zes_process_state_t allProcesses[_MAX_PROCESSES];
ret = zesDeviceProcessesGetState(hSysmanHandle, &processCount, allProcesses);
if (ret != ZE_RESULT_SUCCESS && ret != ZE_RESULT_ERROR_INVALID_SIZE) {
    fprintf(stderr, "Unable to get process information (ret count %u): %08X (%s)\n", processCount, ret, ze_error_to_str(ret));
}
...

The above outputs:

Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
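
For readers who want to turn the hex value in that message into a name, here is a minimal sketch of a lookup helper. The function name `result_name` is hypothetical (the `ze_error_to_str` helper used above is not shown in this thread), and the numeric values are as I recall them from `ze_api.h`, so please check them against your installed header.

```c
#include <stdint.h>

/* Map a few Level Zero result codes to their names.
 * Assumption: values recalled from ze_api.h; verify against the local header.
 * Only a small subset is listed here. */
const char *result_name(uint32_t code)
{
    switch (code) {
    case 0x00000000u: return "ZE_RESULT_SUCCESS";
    case 0x78000001u: return "ZE_RESULT_ERROR_UNINITIALIZED";
    case 0x78000002u: return "ZE_RESULT_ERROR_UNSUPPORTED_VERSION";
    case 0x78000003u: return "ZE_RESULT_ERROR_UNSUPPORTED_FEATURE";
    case 0x78000004u: return "ZE_RESULT_ERROR_INVALID_ARGUMENT";
    case 0x78000008u: return "ZE_RESULT_ERROR_INVALID_SIZE";
    default:          return "(unknown result code)";
    }
}
```

Calling `result_name(0x78000003u)` yields the same name that appears in the log line above.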

I've tried setting processCount to 0 to have it tell me how many process items to use, but that has the same error code returned.
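
The "ask for the count, then fetch" approach mentioned above can be sketched as follows. This is only an illustration of the pattern, not Level Zero code: `proc_entry_t`, `query_fn`, `fetch_all` and `fake_query` are hypothetical stand-ins so the example compiles without the Level Zero headers. In a real tool the callback would wrap zesDeviceProcessesGetState.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins so the sketch compiles without Level Zero headers. */
typedef struct { uint32_t pid; uint64_t mem_size; } proc_entry_t;
typedef int (*query_fn)(void *ctx, uint32_t *count, proc_entry_t *out);

/* Two-call pattern: ask for the count first, then fetch that many entries.
 * Returns 0 on success (caller frees *out), -1 on allocation failure,
 * or the first non-zero status reported by the query. */
int fetch_all(query_fn query, void *ctx, proc_entry_t **out, uint32_t *count)
{
    *out = NULL;
    *count = 0;
    int rc = query(ctx, count, NULL);      /* first call: count only */
    if (rc != 0 || *count == 0)
        return rc;
    *out = calloc(*count, sizeof **out);
    if (*out == NULL)
        return -1;
    rc = query(ctx, count, *out);          /* second call: fill the buffer */
    if (rc != 0) {
        free(*out);
        *out = NULL;
        *count = 0;
    }
    return rc;
}

/* Fake backend for demonstration only: reports *(uint32_t *)ctx processes. */
int fake_query(void *ctx, uint32_t *count, proc_entry_t *out)
{
    uint32_t n = *(uint32_t *)ctx;
    if (out == NULL) {
        *count = n;
        return 0;
    }
    if (*count > n)
        *count = n;
    for (uint32_t i = 0; i < *count; i++) {
        out[i].pid = 1000 + i;
        out[i].mem_size = 4096;
    }
    return 0;
}
```

The process list can change between the two calls, so a real implementation may want to retry when the second call reports that the buffer was too small. A fixed 2048-entry buffer, as in the snippet above, avoids that race at the cost of stack space.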

I'm using libze-intel-gpu1 version 24.52.32224.5-1~24.10~ppa2 and libze1 version 1.19.2.0-1076~24.10.

Thanks,
James

JablonskiMateusz added the "L0 Sysman" (Issue related to L0 Sysman) label Feb 10, 2025
@jketreno
Author

Adding additional context; it looks like the device handle I was using was for the integrated Intel UHD 770:

Output while UHD 770 is running a workload, and I monitor the UHD 770:

Device 0: 868080A7-0400-0000-0002-000000000000
 BDF: 0000:0000:0002:0000
 PCI ID: 8086:A780
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) UHD Graphics 770
 Vendor Name: Intel(R) Corporation
 Driver Version: 7209A40C3CFCD5142354A9F
 Type: GPU
 Is integrated with host: Yes
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: No
Device 0: 7 engines found.
 Engine 0:
  Type: ZES_ENGINE_GROUP_RENDER_SINGLE
  Sub-device: No
 Engine 1:
  Type: ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE
  Sub-device: No
 Engine 2:
  Type: ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE
  Sub-device: No
 Engine 3:
  Type: ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE
  Sub-device: No
 Engine 4:
  Type: ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE
  Sub-device: No
 Engine 5:
  Type: ZES_ENGINE_GROUP_COPY_SINGLE
  Sub-device: No
 Engine 6:
  Type: ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE
  Sub-device: No
INFO: No temperature sensors to monitor.
Monitoring 7 engines.
ZES_ENGINE_GROUP_RENDER_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: N/A
ZES_ENGINE_GROUP_COPY_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE: N/A
Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
ZES_ENGINE_GROUP_RENDER_SINGLE: 98%
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: 0%
ZES_ENGINE_GROUP_COPY_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE: 0%
Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
...
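
The percentages above come from engine activity counters. As I understand the Sysman engine API, utilization is derived from two snapshots of an active-time counter and a timestamp, which would also explain why the first refresh can only show N/A. A minimal sketch, assuming a hypothetical `engine_sample_t` that mirrors the two counters (`activeTime` and `timestamp`) of `zes_engine_stats_t`:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the two counters of zes_engine_stats_t. */
typedef struct { uint64_t active_time; uint64_t timestamp; } engine_sample_t;

/* Utilization in percent between two snapshots, or -1.0 when it cannot be
 * computed yet (no previous sample, or no time elapsed between samples). */
double engine_utilization(const engine_sample_t *prev, const engine_sample_t *cur)
{
    if (prev == NULL || cur->timestamp <= prev->timestamp)
        return -1.0;
    uint64_t busy = cur->active_time - prev->active_time;
    uint64_t span = cur->timestamp - prev->timestamp;
    return 100.0 * (double)busy / (double)span;
}
```

A caller would keep the previous sample per engine, print N/A while the function returns a negative value, and print the percentage otherwise.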

I had mistakenly assumed the B580 would expose engine groups, so I took the presence of engine groups to mean I was monitoring the B580. In fact, zesDeviceProcessesGetState works correctly on the B580; it is the UHD 770 on which it fails.
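
One way to avoid this kind of mix-up is to check the PCI IDs of the device being monitored. In the outputs in this thread, the first four bytes of the device UUID look like the little-endian vendor and device IDs (86 80 80 A7 for the UHD 770, 86 80 0B E2 for the B580). That layout is only inferred from these dumps, not taken from a specification, and `pci_id_t` and `pci_id_from_uuid` are hypothetical names. A minimal sketch under that assumption:

```c
#include <stdint.h>

typedef struct { uint16_t vendor_id; uint16_t device_id; } pci_id_t;

/* Read little-endian vendor and device IDs from the first four UUID bytes.
 * Assumption: this layout is inferred from the dumps in this thread. */
pci_id_t pci_id_from_uuid(const uint8_t uuid[16])
{
    pci_id_t id;
    id.vendor_id = (uint16_t)(uuid[0] | (uuid[1] << 8));
    id.device_id = (uint16_t)(uuid[2] | (uuid[3] << 8));
    return id;
}
```

The device properties also report vendorId and deviceId directly, which is the more robust source when they are available.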

When I run the workload on the B580 and monitor it, zesDeviceProcessesGetState shows activity on engine type ZES_ENGINE_TYPE_FLAG_COMPUTE, but zesDeviceEnumEngineGroups does not return any engine groups for the B580. Is there another way to track compute utilization on the B580, or is a kernel parameter required to turn that on in the Xe driver?

Output while running workload on B580 and monitor its usage:

Device 0: 86800BE2-0000-0000-0300-000000000000
 BDF: 0000:0003:0000:0000
 PCI ID: 8086:E20B
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) Graphics [0xe20b]
 Vendor Name: Intel(R) Corporation
 Driver Version: 977D4CB66F62C239FD56D33
 Type: GPU
 Is integrated with host: No
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: Yes
Device 0: 0 engines found.
INFO: No temperature sensors to monitor.
INFO: No engines to monitor.
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
...
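
The FLAGS column above is a decoded bitfield. Here is a minimal sketch of such a decoder; `format_engine_flags` is a hypothetical name, and the bit positions are as I recall them from `zes_engine_type_flag_t` in `zes_api.h`, so verify them against your header before relying on them.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: bit values recalled from zes_engine_type_flag_t; verify locally. */
static const struct { uint32_t bit; const char *name; } engine_flags[] = {
    { 1u << 0, "OTHER" }, { 1u << 1, "COMPUTE" }, { 1u << 2, "3D" },
    { 1u << 3, "MEDIA" }, { 1u << 4, "DMA" },     { 1u << 5, "RENDER" },
};

/* Write a '|'-separated list of the set flags into buf.
 * The result is an empty string when no known flag is set. */
void format_engine_flags(uint32_t engines, char *buf, size_t len)
{
    if (len == 0)
        return;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof engine_flags / sizeof engine_flags[0]; i++) {
        if (!(engines & engine_flags[i].bit))
            continue;
        size_t used = strlen(buf);
        snprintf(buf + used, len - used, "%s%s",
                 used ? "|" : "", engine_flags[i].name);
    }
}
```

An empty result corresponds to a row with nothing after "FLAGS:", meaning the process entry carried no engine bits.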

One oddity: when the workload runs on the integrated GPU (i915), querying the B580 for process stats still shows the process that is using the i915 driver, but with no engine flags set:

Output while UHD 770 is running a workload, and I monitor the B580:

Device 0: 86800BE2-0000-0000-0300-000000000000
 BDF: 0000:0003:0000:0000
 PCI ID: 8086:E20B
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) Graphics [0xe20b]
 Vendor Name: Intel(R) Corporation
 Driver Version: 977D4CB66F62C239FD56D33
 Type: GPU
 Is integrated with host: No
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: Yes
Device 0: 0 engines found.
INFO: No temperature sensors to monitor.
INFO: No engines to monitor.
       23724 python chat.py                 MEM: 3420160              SHR: 0                    FLAGS:
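
Since the two GPUs here are driven by different kernel drivers (i915 and Xe), it can help to confirm which driver is bound to a given PCI address before interpreting the results. A minimal sketch that reads the standard Linux sysfs `driver` symlink; the helper names are hypothetical, and the BDF values printed above (for example 0000:0003:0000:0000) would be passed as domain 0, bus 3, device 0, function 0.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build the sysfs path of the driver symlink for a PCI address. */
int pci_driver_link_path(unsigned domain, unsigned bus, unsigned dev, unsigned fn,
                         char *buf, size_t len)
{
    return snprintf(buf, len, "/sys/bus/pci/devices/%04x:%02x:%02x.%x/driver",
                    domain, bus, dev, fn);
}

/* Copy the bound kernel driver name (for example "i915" or "xe") into name.
 * Returns 0 on success, -1 if the symlink cannot be read. */
int pci_bound_driver(unsigned domain, unsigned bus, unsigned dev, unsigned fn,
                     char *name, size_t len)
{
    char path[64], target[256];
    pci_driver_link_path(domain, bus, dev, fn, path, sizeof path);
    ssize_t n = readlink(path, target, sizeof target - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    const char *base = strrchr(target, '/');   /* keep only the last component */
    snprintf(name, len, "%s", base ? base + 1 : target);
    return 0;
}
```

Only the path construction is deterministic; what `pci_bound_driver` returns depends on the machine it runs on.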

@saik-intel
Contributor

@jketreno We will look into this internally and update you.

@saik-intel
Contributor

When I run the workload on the B580 and monitor it, zesDeviceProcessesGetState shows activity on engine type ZES_ENGINE_TYPE_FLAG_COMPUTE, but zesDeviceEnumEngineGroups does not return any engine groups for the B580. Is there another way to track compute utilization on the B580, or is a kernel parameter required to turn that on in the Xe driver?

[Sai] The Xe driver patch is under upstream review and waiting to be merged; it will be merged once it is ready. Regarding the other issue you raised for the UHD 770, we can see it working, as shown in the log below:

root@DUT6051BMGSVC:/home/gta/level_zero/bin# export ZELLO_SYSMAN_USE_ZESINIT=1; export ZES_ENABLE_SYSMAN=1; export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/gta/level_zero/libs/:/home/gta/level_zero/latest_loader/:/home/gta/level_zero/bin/;
root@DUT6051BMGSVC:/home/gta/level_zero/bin# ./zello_sysman -g
ZES_ENABLE_SYSMAN environment variable Set
Sysman Initialization done via zesInit
---- Global Operations tests ----
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = Intel(R) Corporation
properties.modelName = Intel(R) UHD Graphics 770
properties.vendorName = Intel(R) Corporation
properties.driverVersion = BABE9C47939376BE4C71D06
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 42880
properties.core.flags = 1
properties.core.coreClockRate = 1650
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 7
properties.core.numEUsPerSubslice = 16
properties.core.numSubslicesPerSlice = 2
properties.core.numSlices = 1
properties.core.timerResolution = 52
properties.core.timestampValidBits = 36
properties.core.kernelTimestampValidBits = 32
properties.core.uuid =
134 128 128 167 4 0 0 0 0 2 0 0 0 0 0 0
properties.core.name = Intel(R) UHD Graphics 770
reset status: 0
repair0
---- Global Operations tests ----
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = Intel(R) Corporation
properties.modelName = Intel(R) Arc(TM) B580 Graphics
properties.vendorName = Intel(R) Corporation
properties.driverVersion = BABE9C47939376BE4C71D06
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 57867
properties.core.flags = 8
properties.core.coreClockRate = 2850
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 8
properties.core.numEUsPerSubslice = 8
properties.core.numSubslicesPerSlice = 4
properties.core.numSlices = 5
properties.core.timerResolution = 52
properties.core.timestampValidBits = 64
properties.core.kernelTimestampValidBits = 64
properties.core.uuid =
134 128 11 226 0 0 0 0 3 0 0 0 0 0 0 0
properties.core.name = Intel(R) Arc(TM) B580 Graphics
reset status: 0
repair0

@eero-t

eero-t commented Feb 12, 2025

This looks like the relevant kernel patch series, but it's for the Xe KMD tree, not upstream: https://patchwork.freedesktop.org/series/144408/

@jketreno
Author

[...] Regarding the other issue you raised for the UHD 770, we can see it working, as shown in the log below:

root@DUT6051BMGSVC:/home/gta/level_zero/bin#
export ZELLO_SYSMAN_USE_ZESINIT=1;
export ZES_ENABLE_SYSMAN=1;
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/gta/level_zero/libs/:/home/gta/level_zero/latest_loader/:/home/gta/level_zero/bin/;
root@DUT6051BMGSVC:/home/gta/level_zero/bin# ./zello_sysman -g
...

Reproduction of UHD 770 failure

Running Ubuntu Oracular (24.10) with the linux-intel kernel and all other packages updated to latest versions as of 2025-02-20.

Find version of libze-intel-gpu1 on system

$ uname -a
Linux battle-linux 6.11.0-1006-intel #6-Ubuntu SMP PREEMPT_DYNAMIC Thu Jan  9 18:18:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
$ apt-cache policy libze-intel-gpu1
libze-intel-gpu1:
  Installed: 24.52.32224.14-1~24.10~ppa2
  Candidate: 24.52.32224.14-1~24.10~ppa2
  Version table:
 *** 24.52.32224.14-1~24.10~ppa2 500
        500 https://ppa.launchpadcontent.net/kobuk-team/intel-graphics/ubuntu oracular/main amd64 Packages
        100 /var/lib/dpkg/status
     24.35.30872.24-1 500
        500 http://us.archive.ubuntu.com/ubuntu oracular/universe amd64 Packages
$ apt-cache policy libze-dev
libze-dev:
  Installed: 1.19.2.0-1076~24.10
  Candidate: 1.19.2.0-1076~24.10
  Version table:
 *** 1.19.2.0-1076~24.10 500
        500 https://ppa.launchpadcontent.net/kobuk-team/intel-graphics/ubuntu oracular/main amd64 Packages
        100 /var/lib/dpkg/status
     1.17.42-1 500
        500 http://us.archive.ubuntu.com/ubuntu oracular/universe amd64 Packages

Get compute-runtime source matching the version of libze-intel-gpu1

git clone https://github.com/intel/compute-runtime.git
cd compute-runtime
git tag  | grep 24.52.32224.14
git checkout 24.52.32224.14
cd level_zero/tools/test/black_box_tests/

Build zello_sysman

g++ -O2 -Wall -o zello_sysman  zello_sysman.cpp -lze_loader -locloc

Test

export ZELLO_SYSMAN_USE_ZESINIT=1
export ZES_ENABLE_SYSMAN=1
./zello_sysman -g

Output:

ZES_ENABLE_SYSMAN environment variable Set
Sysman Initialization done via zesInit
...
[...deleted 0xe20b output...]
...
 ----  Global Operations tests ---- 
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = unknown
properties.modelName = Intel(R) UHD Graphics 770
properties.vendorName = Intel(R) Corporation
properties.driverVersion = 7209A40C3CFCD5142354A9F
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 42880
properties.core.flags = 1
properties.core.coreClockRate = 1650
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 7
properties.core.numEUsPerSubslice = 16
properties.core.numSubslicesPerSlice = 2
properties.core.numSlices = 1
properties.core.timerResolution = 52
properties.core.timestampValidBits = 36
properties.core.kernelTimestampValidBits = 32
properties.core.uuid = 
134 128 128 167 4 0 0 0 0 2 0 0 0 0 0 0 
properties.core.name = Intel(R) UHD Graphics 770
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesDeviceProcessesGetState(device, &count, nullptr): testSysmanGlobalOperations: 1433
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesDeviceProcessesGetState(device, &count, processes.data()): testSysmanGlobalOperations: 1435
reset status: 0
repair0
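
Because both calls report ZE_RESULT_ERROR_UNSUPPORTED_FEATURE on this device, a monitoring tool may prefer to probe the feature once and then stop polling it, rather than logging the same error on every refresh. A minimal sketch of that idea; `feature_state_t` and `feature_update` are hypothetical names, and the constant is the value shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define RESULT_UNSUPPORTED_FEATURE 0x78000003u  /* value shown in this thread */

typedef struct { bool probed; bool supported; } feature_state_t;

/* Record the first result seen for a feature and latch it.
 * Returns true if the caller should keep polling on later refreshes. */
bool feature_update(feature_state_t *st, uint32_t result)
{
    if (!st->probed) {
        st->probed = true;
        st->supported = (result != RESULT_UNSUPPORTED_FEATURE);
    }
    return st->supported;
}
```

The state is kept per device, so one GPU reporting the feature as unsupported does not disable the process view for another.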

@jketreno
Author

This looks like the relevant kernel patch series, but it's for the Xe KMD tree, not upstream: https://patchwork.freedesktop.org/series/144408/

I see it is failing in the patch tests:

Image

Assuming those errors get fixed, am I correct that the flow will be Xe KMD tree -> DRM next -> DRM -> kernel.org? Or would these go straight to kernel.org as a bug fix to the existing Xe KMD driver? Or might they get picked up in the linux-intel kernel in the Ubuntu intel-graphics PPA?

I'm just trying to figure out if I should abandon trying to get the B580 to work for a few more months while I wait for these patches to land, or if there might be a shorter path. I'm not too keen on rip/replace my system's kernel with one I build from source as I tend to end up with other random system failures anytime I use a tip-of-tree kernel and I'm trying to keep this system as a "production" config vs. a franken-developer config :)

Thanks,
James

@eero-t

eero-t commented Feb 21, 2025

Assuming those errors get fixed, am I correct that the flow will be Xe KMD tree -> DRM next -> DRM -> kernel.org? Or would these go straight to kernel.org as a bug fix to the existing Xe KMD driver?

I'm not a kernel developer, but it's a new feature (for the Xe KMD), and I would think even bug fixes normally go through the DRM integration tree, to make sure they do not break anything.

Or might they get picked up in the linux-intel kernel in the Ubuntu intel-graphics PPA?

I'm not familiar with that. Ubuntu HWE packages are LTS backports of things that have been tested for a few months in the latest non-LTS releases, so those would come with quite a lot of delay, but I guess PPAs could include anything. I don't think they would do backporting though, at least not for things like metrics, which do not block using the HW. So it would be either an upstream kernel with the Xe changes already merged, or a kernel test package from the Xe driver repo.

In the latter case, I personally would rather build test kernels myself. One might be able to forklift the latest driver source from the driver integration repo into the distro (HWE) kernel sources; either the whole driver, or specific source file(s). If you do that, you could let upstream know whether it worked or not (add your Tested-by tag if you tested the patch series).

I'm just trying to figure out if I should abandon trying to get the B580 to work for a few more months while I wait for these patches to land, or if there might be a shorter path. I'm not too keen on rip/replace my system's kernel with one I build from source as I tend to end up with other random system failures anytime I use a tip-of-tree kernel and I'm trying to keep this system as a "production" config vs. a franken-developer config :)

While one could use a stripped-down distro kernel config to speed up building your own kernels, I'd use the configs as-is (as much as possible) when you want to make sure everything works as expected. If you do build your own kernel and it fails, it would be good to notify upstream about that, at least about reproducible issues.

4 participants