NativeVariant.Cpu: add the capability to detect instruction-set extensions #37

Merged
merged 13 commits into from
Mar 9, 2025

Conversation

stephengold
Contributor

Here's my first cut at a solution for issue #36. I use the OSHI library to obtain CPU feature information. Any X86 ISA extensions that are present are cached in a private TreeSet that can be used to define custom predicates, like so:

        PlatformPredicate linuxWithHaswell = new PlatformPredicate(
                PlatformPredicate.LINUX_X86_64.evaluatePredicate()
                && NativeVariant.Cpu.hasExtensions("avx", "avx2", "bmi1", "f16c", "fma", "sse4_1", "sse4_2"));

I assume the dynamic libraries are searched in the order they are passed to registerNativeLibraries(), and the first match is used. That way I can use predefined platform predicates as a fallback, like so:

        NativeBinaryLoader loader = new NativeBinaryLoader(info);
        NativeDynamicLibrary[] libraries = {
            new NativeDynamicLibrary("linux/x86-64/haswell", linuxWithHaswell),
            new NativeDynamicLibrary("linux/x86-64", PlatformPredicate.LINUX_X86_64)
        };
        loader.registerNativeLibraries(libraries).initPlatformLibrary();

with the assurance that "linux/x86-64" will be loaded only if linuxWithHaswell is false.

@stephengold
Contributor Author

Reviewing last night's work in the light of day, I think extNameCache might be a premature optimization. It might be better to invoke OSHI and parse the CPU feature strings every time hasExtensions() is invoked.

Also, I still need to test hasExtensions() on macOS and Windows platforms.

@pavly-gerges
Member

@stephengold Thanks for working on this feature. In the next 24 hours, I will start looking into the requested changes. In the meantime, I would like to kindly ask whether you could create a simple testcase for the snaploader-examples module.

@pavly-gerges
Member

pavly-gerges commented Feb 28, 2025

Also, I still need to test hasExtensions() on macOS and Windows platforms.

I think this is doable in a CI/CD environment (I think I already have a setup for it here). And, also you are free to integrate and test Jolt-Jni in an example (I would actually prefer a real library that uses the feature to be in that test).

@stephengold
Contributor Author

I would actually prefer a real library that uses the feature to be in that test

I agree. The next release of jolt-jni ought to make a good test for jSnapLoader. Once I have extension detection working in my snap-jolt workbench, I'll copy one of those tests over to jSnapLoader.

Today I discovered that the CPU feature strings returned by OSHI differ between Linux and Windows, even for identical hardware! So I need to rework this PR in order for it to work on Windows.

@pavly-gerges
Member

pavly-gerges commented Mar 1, 2025

Today I discovered that the CPU feature strings returned by OSHI differ between Linux and Windows, even for the identical hardware!

If the check is OS-specific, then you might consider using the NativeVariant.Os utility.

EDIT:
Btw, I have just updated the master branch to fix the deprecated GitHub CI/CD artifacts actions. You can pull them to this branch to be able to test your testcases.

@stephengold
Contributor Author

My plan for today is:

  1. simplify the code by removing the cache
  2. adapt the code to work properly on Windows
  3. solve failing checks and
  4. investigate alternatives to the OSHI library.

@stephengold stephengold marked this pull request as draft March 1, 2025 17:57
@stephengold
Contributor Author

@pavly-gerges Please authorize the Actions workflow.

investigate alternatives to the OSHI library.

Aecsocket's cpu-features-java library provides Java bindings to Google's popular cpu_features project. It has a couple of drawbacks:

  1. apparently not maintained: the last commit was March 2023
  2. not compatible with Java 11: v1.0.1 requires Java 15, v2.0.0 requires Java 19 preview

Nihui's ruapu provides JNI bindings and buildscripts for Java; however, there are no artifacts at Maven Central.

Either of these projects could be adapted to address this feature request in case OSHI proves unsuitable for some reason.

@stephengold stephengold marked this pull request as ready for review March 1, 2025 20:14
@pavly-gerges
Member

pavly-gerges commented Mar 2, 2025

I assume the dynamic libraries are searched in the order they are passed to registerNativeLibraries(), and the first match is used. That way I can use predefined platform predicates as a fallback, like so:

Yes, that is true. It's here; once a predicate evaluates to true, the search terminates and the library is loaded:

if (nativeDynamicLibrary.getPlatformPredicate().evaluatePredicate()) {
    this.nativeDynamicLibrary = nativeDynamicLibrary;
    isSystemFound[0] = true;
}
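
The first-match behaviour confirmed above can be pictured with a small, self-contained sketch. It does not use jSnapLoader's classes: `FirstMatchDemo`, `Candidate`, and `selectFirstMatch` are hypothetical names, and `BooleanSupplier` merely stands in for `PlatformPredicate`.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;

final class FirstMatchDemo {
    /** Hypothetical (directory, predicate) pair; not jSnapLoader's NativeDynamicLibrary. */
    static final class Candidate {
        final String directory;
        final BooleanSupplier predicate;

        Candidate(String directory, BooleanSupplier predicate) {
            this.directory = directory;
            this.predicate = predicate;
        }
    }

    /** Returns the first candidate whose predicate is true, in registration order. */
    static Optional<Candidate> selectFirstMatch(List<Candidate> candidates) {
        for (Candidate candidate : candidates) {
            if (candidate.predicate.getAsBoolean()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        boolean hasHaswell = false; // assumed result of an extension check
        List<Candidate> candidates = List.of(
                new Candidate("linux/x86-64/haswell", () -> hasHaswell),
                new Candidate("linux/x86-64", () -> true));
        // The generic entry acts as a fallback because it is registered last.
        System.out.println(
                selectFirstMatch(candidates).map(c -> c.directory).orElse("none"));
    }
}
```

Because the search stops at the first true predicate, the most specific entries need to be registered before the generic fallback.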

EDIT:

I agree. The next release of jolt-jni ought to make a good test for jSnapLoader. Once I have extension detection working in my snap-jolt workbench, I'll copy one of those tests over to jSnapLoader.

It would be nice to have at least a testcase of those on this PR. However, I know there is currently much stress on you delivering some work on Jolt-Jni; thus I don't want to put much pressure on this, and we can have it on another PR later, if you have other things to do.

Member

@pavly-gerges pavly-gerges left a comment

Alright, this is an initial review. I would like to know if you have any other opinions about the requested changes. Thanks again for your time and effort to deliver the feature.

@stephengold
Contributor Author

I assume the dynamic libraries are searched in the order they are passed to registerNativeLibraries(), and the first match is used. That way I can use predefined platform predicates as a fallback

Yes, that is true.

If it's an official feature of the library (and not just an implementation detail) then it would be nice to document it as such.

It would be nice to have at least a testcase of those on this PR

Sometime this week I expect there'll be a jolt-jni release suitable for testing this feature. If you delay integration until the testcase is ready, I'll add it here. Otherwise I'll open another PR for it.

Thanks for your review comments. I'm studying them now...

@stephengold
Contributor Author

FYI: the new release of jolt-jni is waiting on jrouwe/JoltPhysics#1545. After that release I'll submit some test code for jSnapLoader.

@stephengold
Contributor Author

stephengold commented Mar 3, 2025

In the long run, it seems better to directly invoke cpu-id instructions via JNA. This would mostly eliminate the confusion caused by different OSes and OS versions. However, a thorough implementation looks to me like a lot of effort.

I envision cross-compiling a small shared library for each platform on which jSnapLoader might run. However, Java runs on dozens of platforms. Also, testing would be a huge challenge. For instance, I currently lack access to MIPS, PPC, S/390, LoongArch, and RISC-V hardware.

This PR establishes an API that ought to be compatible with such a future implementation.

@pavly-gerges
Member

pavly-gerges commented Mar 3, 2025

I envision cross-compiling a small shared library for each platform on which jSnapLoader might run. However, Java runs on dozens of platforms.

For these kinds of rare platforms, almost every one of them would run Linux or a variant of it. So, we could hack this by having jSnapLoader compile the CPU-ID library at runtime using a system call, then link the code immediately. This could all be done cleanly with a simple internalized API. This, however, would require CMake and GCC/G++ as system dependencies (but they should already be there on any Linux OS).

This PR establishes an API that ought to be compatible with such a future implementation.

That would be our main goal for now. But if we cannot do this, it's okay; we could deprecate them later if we arrive at better solutions.

Member

@pavly-gerges pavly-gerges left a comment

Alright, this wraps up the work. Thanks for allocating the time and effort to work on this feature. It's really nice to have you here :-). Let me know when you would like to get this merged. It's okay for me to keep it open until Jolt-Jni is ready to test this. Or, we could merge this and release a 1.1.0-alpha version to ease the testing for you. Either way, I don't mind.

EDIT:
We have a third option, as well. I could try deploying this branch as a 1.1.0-alpha w/o merging the PR to ease the testing on your CI/CD setup.

@pavly-gerges pavly-gerges changed the title add the capability to detect instruction-set extensions NativeVariant.Cpu: add the capability to detect instruction-set extensions Mar 3, 2025
@stephengold
Contributor Author

stephengold commented Mar 4, 2025

Being curious, I dumped the CPU features on various GitHub Actions runners:

  1. windows-2022:
CPU features:
    pf_avx2_instructions_available
    pf_avx_instructions_available
    pf_compare_exchange128
    pf_compare_exchange_double
    pf_fastfail_available
    pf_mmx_instructions_available
    pf_nx_enabled
    pf_pae_enabled
    pf_rdpid_instruction_available
    pf_rdrand_instruction_available
    pf_rdtsc_instruction_available
    pf_rdtscp_instruction_available
    pf_rdwrfsgsbase_available
    pf_sse3_instructions_available
    pf_sse4_1_instructions_available
    pf_sse4_2_instructions_available
    pf_ssse3_instructions_available
    pf_virt_firmware_enabled
    pf_xmmi64_instructions_available
    pf_xmmi_instructions_available
    pf_xsave_enabled
  2. ubuntu-22.04:
CPU features:
    abm
    adx
    aes
    aperfmperf
    apic
    arat
    avx
    avx2
    bmi1
    bmi2
    clflush
    clflushopt
    clwb
    clzero
    cmov
    cmp_legacy
    constant_tsc
    cpuid
    cr8_legacy
    cx16
    cx8
    de
    decodeassists
    dnowprefetch
    erms
    extd_apicid
    f16c
    flags
    flushbyasid
    fma
    fpu
    fsgsbase
    fsrm
    fxsr
    fxsr_opt
    ht
    hypervisor
    invpcid
    lahf_lm
    lm
    mca
    mce
    misalignsse
    mmx
    mmxext
    movbe
    msr
    mtrr
    nonstop_tsc
    nopl
    npt
    nrip_save
    nx
    osvw
    pae
    pat
    pausefilter
    pcid
    pclmulqdq
    pdpe1gb
    pfthreshold
    pge
    pni
    popcnt
    pse
    pse36
    rdpid
    rdpru
    rdrand
    rdseed
    rdtscp
    rep_good
    sep
    sha_ni
    smap
    smep
    sse
    sse2
    sse4_1
    sse4_2
    sse4a
    ssse3
    svm
    syscall
    topoext
    tsc
    tsc_reliable
    tsc_scale
    umip
    user_shstk
    v_vmsave_vmload
    vaes
    vmcb_clean
    vme
    vmmcall
    vpclmulqdq
    xgetbv1
    xsave
    xsavec
    xsaveerptr
    xsaveopt
    xsaves
  3. macOS-13 (x86_64 architecture)
CPU features:
    adx
    aes
    apic
    avx1
    avx2
    bmi1
    bmi2
    clfsh
    clfsopt
    cmov
    cpu
    cx16
    cx8
    de
    dscpl
    dtes64
    em64t
    erms
    extfeatures
    f16c
    features
    fma
    fpu
    fpu_csds
    fxsr
    gbpage
    hle
    htt
    invpcid
    ipt
    lahf
    leaf7_features
    lzcnt
    machdep
    mca
    mce
    mmx
    movbe
    mpx
    msr
    mtrr
    osxsave
    pae
    pat
    pbe
    pcid
    pclmulqdq
    pge
    popcnt
    prefetchw
    pse
    pse36
    rdrand
    rdseed
    rdwrfsgs
    rtm
    seglim64
    sep
    sgx
    smap
    smep
    ss
    sse
    sse2
    sse3
    sse4
    ssse3
    syscall
    tpr
    tsc
    tsc_thread_offset
    vme
    vmm
    vmx
    x2apic
    xd
    xsave
  4. macOS-15 (arm-v8 architecture)
CPU features:
    advsimd
    advsimd_hpfpcvt
    arm
    arm64
    armv8_1_atomics
    armv8_2_fhm
    armv8_2_sha3
    armv8_2_sha512
    armv8_3_compnum
    armv8_crc32
    armv8_gpi
    breakpoint
    caps
    feat_aes
    feat_afp
    feat_bf16
    feat_bti
    feat_csv2
    feat_csv3
    feat_dit
    feat_dotprod
    feat_dpb
    feat_dpb2
    feat_ecv
    feat_fcma
    feat_fhm
    feat_flagm
    feat_flagm2
    feat_fp16
    feat_fpac
    feat_frintts
    feat_i8mm
    feat_jscvt
    feat_lrcpc
    feat_lrcpc2
    feat_lse
    feat_lse2
    feat_pauth
    feat_pauth2
    feat_pmull
    feat_rdm
    feat_rpres
    feat_sb
    feat_sha1
    feat_sha256
    feat_sha3
    feat_sha512
    feat_sme
    feat_sme2
    feat_sme_f64f64
    feat_sme_i16i64
    feat_specres
    feat_ssbs
    feat_wfxt
    floatingpoint
    fp_syncexceptions
    hw
    neon
    neon_fp16
    neon_hpfp
    optional
    sme_b16f32
    sme_bi32i32
    sme_f16f32
    sme_f32f32
    sme_i16i32
    sme_i8i32
    ucnormal_mem
    watchpoint
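
The dumps above show why a cross-platform check needs some name normalization. As a rough sketch (not the code in this PR), the helper below maps Windows-style names of the shape seen on windows-2022 onto the shorter Linux-style names by stripping the `pf_` prefix and the `_instruction(s)_available` suffix. `FeatureNameDemo` and `normalizeWindowsName` are hypothetical names. The sketch only handles names of that shape; entries such as `pf_xmmi64_instructions_available` (which Microsoft documents as SSE2) would still need an explicit lookup table, as would macOS spellings such as `avx1`.

```java
import java.util.Locale;

final class FeatureNameDemo {
    private static final String[] SUFFIXES = {
        "_instructions_available", "_instruction_available"
    };

    /**
     * Converts a Windows-style name such as "pf_avx2_instructions_available"
     * to a short name such as "avx2". Names without the "pf_" prefix are
     * returned lower-cased; names without a recognized suffix keep whatever
     * follows the prefix.
     */
    static String normalizeWindowsName(String raw) {
        String name = raw.toLowerCase(Locale.ROOT);
        if (!name.startsWith("pf_")) {
            return name;
        }
        name = name.substring("pf_".length());
        for (String suffix : SUFFIXES) {
            if (name.endsWith(suffix)) {
                return name.substring(0, name.length() - suffix.length());
            }
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(normalizeWindowsName("pf_avx2_instructions_available"));
        System.out.println(normalizeWindowsName("pf_rdrand_instruction_available"));
    }
}
```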

@pavly-gerges
Member

I envisioned that the API you built would work with general features, that is, real bare-metal CPU features rather than OS-dependent ones. I see some OS stuff, for example "osxsave". What are these?!

@stephengold
Contributor Author

What are these?!

I investigated, and I have bad news. The data returned by getFeatureFlags() on macOS is in yet another format that will require further processing and filtering. Here is a sample from an Arm-based Mac:

hw.optional.arm.FEAT_FlagM: 1
hw.optional.arm.FEAT_FlagM2: 1
hw.optional.arm.FEAT_FHM: 1
hw.optional.arm.FEAT_DotProd: 1
hw.optional.arm.FEAT_SHA3: 1
hw.optional.arm.FEAT_RDM: 1
hw.optional.arm.FEAT_LSE: 1
hw.optional.arm.FEAT_SHA256: 1
hw.optional.arm.FEAT_SHA512: 1
hw.optional.arm.FEAT_SHA1: 1
hw.optional.arm.FEAT_AES: 1
hw.optional.arm.FEAT_PMULL: 1
hw.optional.arm.FEAT_SPECRES: 0
hw.optional.arm.FEAT_SB: 1
hw.optional.arm.FEAT_FRINTTS: 1
hw.optional.arm.FEAT_LRCPC: 1
hw.optional.arm.FEAT_LRCPC2: 1
hw.optional.arm.FEAT_FCMA: 1
hw.optional.arm.FEAT_JSCVT: 1
hw.optional.arm.FEAT_PAuth: 1
hw.optional.arm.FEAT_PAuth2: 0
hw.optional.arm.FEAT_FPAC: 0
hw.optional.arm.FEAT_DPB: 1
hw.optional.arm.FEAT_DPB2: 1
hw.optional.arm.FEAT_BF16: 0
hw.optional.arm.FEAT_I8MM: 0
hw.optional.arm.FEAT_WFxT: 0
hw.optional.arm.FEAT_RPRES: 0
hw.optional.arm.FEAT_ECV: 0
hw.optional.arm.FEAT_AFP: 0
hw.optional.arm.FEAT_LSE2: 1
hw.optional.arm.FEAT_CSV2: 1
hw.optional.arm.FEAT_CSV3: 1
hw.optional.arm.FEAT_DIT: 1
hw.optional.arm.FEAT_FP16: 1
hw.optional.arm.FEAT_SSBS: 1
hw.optional.arm.FEAT_BTI: 0
hw.optional.arm.FEAT_SME: 0
hw.optional.arm.FEAT_SME2: 0
hw.optional.arm.SME_F32F32: 0
hw.optional.arm.SME_BI32I32: 0
hw.optional.arm.SME_B16F32: 0
hw.optional.arm.SME_F16F32: 0
hw.optional.arm.SME_I8I32: 0
hw.optional.arm.SME_I16I32: 0
hw.optional.arm.FEAT_SME_F64F64: 0
hw.optional.arm.FEAT_SME_I16I64: 0
hw.optional.arm.FP_SyncExceptions: 1
hw.optional.arm.caps: 868632120666353663
hw.optional.floatingpoint: 1
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.watchpoint: 4
hw.optional.breakpoint: 6
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1
hw.optional.AdvSIMD: 1
hw.optional.AdvSIMD_HPFPCvt: 1
hw.optional.ucnormal_mem: 1
hw.optional.arm64: 1

We should probably strip off the "hw.optional." prefix and ignore list entries that end with ": 0".
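
A minimal sketch of that filtering, assuming every line has the form `name: value` as in the sample above. `SysctlFilterDemo` and `presentFeatures` are hypothetical names. The sketch treats any value other than `0` as "present", so non-boolean entries such as `hw.optional.arm.caps`, `hw.optional.watchpoint`, and `hw.optional.breakpoint` would slip through, and it leaves the `arm.` sub-prefix in place; a real implementation would probably handle both.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class SysctlFilterDemo {
    private static final String PREFIX = "hw.optional.";

    /** Collects lower-cased names from "hw.optional.<name>: <value>" lines. */
    static Set<String> presentFeatures(List<String> lines) {
        Set<String> result = new TreeSet<>();
        for (String line : lines) {
            int colon = line.lastIndexOf(": ");
            if (!line.startsWith(PREFIX) || colon < 0) {
                continue; // not in the expected format
            }
            String value = line.substring(colon + 2).trim();
            if (value.equals("0")) {
                continue; // feature is reported as absent
            }
            String name = line.substring(PREFIX.length(), colon);
            result.add(name.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "hw.optional.arm.FEAT_LSE: 1",
                "hw.optional.arm.FEAT_BF16: 0",
                "hw.optional.neon: 1");
        System.out.println(presentFeatures(sample));
    }
}
```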

@pavly-gerges
Member

pavly-gerges commented Mar 4, 2025

Have you investigated whether OSHI provides an algorithm to pull the CPU feature names out of those? I mean, they could be aware of that and may have solved it already.

@stephengold
Contributor Author

I did study the javadoc. I didn't search beyond that.

@stephengold
Contributor Author

One last change for your consideration: adding another PlatformPredicate constructor.

The added source code would look like this:

public PlatformPredicate(PlatformPredicate base, String... requiredNames) {
    this.predicate = base.evaluatePredicate()
            && NativeVariant.Cpu.hasExtensions(requiredNames);
}

This would allow efficient creation of new platform predicates that combine a pre-defined platform with one or more ISA extensions. For example:

PlatformPredicate linuxWithFma
        = new PlatformPredicate(PlatformPredicate.LINUX_X86_64, "avx", "fma");

@pavly-gerges
Member

One last change for your consideration: adding another PlatformPredicate constructor.

Yep, you can proceed with the proper documentation. I am okay with it. I think it's better to replace requiredNames with a proper reference name (e.g., requestedISAExtensions or just isaExtensions).

@pavly-gerges
Member

I've just copied the HEAD of your PR branch commits, and released a 1.1.0-alpha version including a copy of these changes. To further automate your work, you can now integrate these changes on your CI/CD runners on Jolt-Jni using Gradle.

I've staged and released it on the nexus manager; it should be available very soon.

@stephengold
Contributor Author

Thanks. Jorrit appears to be inactive this week, so I'll go ahead and cut a jolt-jni release without his fix. That'll give us something to test.

@stephengold
Contributor Author

Testing on the snap-jolt project uncovered one minor issue: if the application doesn't explicitly depend on OSHI, then it can crash at runtime with

Error: Exception in thread "main" java.lang.NoClassDefFoundError: oshi/SystemInfo
> Task :PrintConfig FAILED
	at electrostatic4j.snaploader.platform.util.NativeVariant$Cpu.readFeatureFlags(NativeVariant.java:280)
2 actionable tasks: 2 executed
	at electrostatic4j.snaploader.platform.util.NativeVariant$Cpu.hasExtensions(NativeVariant.java:350)
	at electrostatic4j.snaploader.platform.util.PlatformPredicate.<init>(PlatformPredicate.java:147)
	at com.github.stephengold.snapjolt.PrintConfig.main(PrintConfig.java:52)
Caused by: java.lang.ClassNotFoundException: oshi.SystemInfo
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:528)
	... 4 more

Adding an explicit dependency to the buildscript solves the problem:

stephengold/snap-jolt@d88d58b

but it would be better if jSnapLoader simply included a transitive dependency on OSHI.
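
Until the dependency is transitive, an application could guard against this failure by probing for the class before it builds CPU-dependent predicates. The sketch below is only an illustration: `ClasspathProbe` and `isClassAvailable` are hypothetical names, and `oshi.SystemInfo` is simply the class named in the stack trace above.

```java
final class ClasspathProbe {
    /** Returns true if the named class can be located by this class's loader. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (!isClassAvailable("oshi.SystemInfo")) {
            System.out.println("OSHI is not on the classpath; skipping CPU-feature checks.");
        }
    }
}
```

An application could then fall back to the predefined platform predicates instead of crashing with `NoClassDefFoundError`.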

@stephengold
Contributor Author

This morning I added an app to snaploader-examples that tests selection between native libraries based on CPU features. However, I'm getting some unexpected results.

I need to investigate further before committing that code.

@stephengold
Contributor Author

Ah, I see what's happening. The 2 native libraries for my platform that I'm trying to select between are both named "libjoltjni.so". Once one of them is extracted to the working directory, the loader keeps loading that file.

Is there an easy way to force jSnapLoader to replace the existing file with a freshly-extracted one?

@pavly-gerges
Member

Is there an easy way to force jSnapLoader to replace the existing file with a freshly-extracted one?

Sure, set a LoadingCriterion.CLEAN_EXTRACTION here:

public NativeBinaryLoader loadLibrary(LoadingCriterion criterion) throws Exception {

There is another, better solution, though: extract your libjoltjni.so to a folder, specifying the name of the library as a root directory, and load any one of them as required.
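
To picture the difference between reusing a previously extracted file and a clean extraction, here is a sketch in plain `java.nio`. It is not jSnapLoader's implementation: `ExtractionDemo`, `extract`, and the `clean` flag are hypothetical, and the sketch copies from an in-memory stream rather than from a jar entry.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ExtractionDemo {
    /**
     * Copies the stream to the target. If clean is false and the target
     * already exists, the existing (possibly stale) file is kept.
     */
    static void extract(InputStream source, Path target, boolean clean) throws IOException {
        if (!clean && Files.exists(target)) {
            return; // reuse whatever was extracted earlier
        }
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("libdemo", ".so");
        Files.write(target, "old".getBytes(StandardCharsets.UTF_8));

        extract(stream("new"), target, false);
        System.out.println(read(target)); // prints "old": the stale file was reused

        extract(stream("new"), target, true);
        System.out.println(read(target)); // prints "new": the file was replaced

        Files.delete(target);
    }

    private static InputStream stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    private static String read(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

The stale-file case is the situation described above, where two variants share the file name "libjoltjni.so" and the one extracted first keeps getting loaded.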

@stephengold
Contributor Author

Thank you. I think this PR is ready for final review.

@pavly-gerges
Member

Thank you. I think this PR is ready for final review.

Perfect! The new changes look good to me. I am curious though, have you tried the submitted testcase on your GitHub runners? What was the output on different systems?

@pavly-gerges
Member

pavly-gerges commented Mar 8, 2025

I don't know if changing the CI/CD runners file on the master branch will affect here, and run this test. Let's try.

EDIT:
Lol, it seems to be requiring a change here. Anyway, I don't much care about running it for now. Let me know the time you would like to get this merged.

@stephengold
Contributor Author

have you tried the submitted testcase on your GitHub runners? What was the output on different systems?

I've run similar code on the GitHub runners. In fact that's where I copied the expected outputs from. At GitHub, the Windows runners support AVX2 and the Linux runners support FMA.

Note that, as currently written, the new test won't run on MacOS or ARM.

Let me know the time you would like to get this merged.

I'm ready when you are.

@pavly-gerges
Member

pavly-gerges commented Mar 8, 2025

Note that, as currently written, the new test won't run on MacOS or ARM.

I expect there will be another PR to extract the features for some Mac architectures (e.g., ARM-based). I suggest opening an issue for them and specifying whether they are essential for the 1.1.0-stable release. Otherwise, I will merge this PR within a few hours.

@pavly-gerges pavly-gerges merged commit bba58e6 into Electrostat-Lab:master Mar 9, 2025
13 checks passed
@pavly-gerges
Member

@stephengold I appreciate your contributions.

@stephengold stephengold deleted the sgold/oshi branch March 9, 2025 21:00
Labels: core (Core API related stuff), enhancement (New feature or request)
2 participants