Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ULP requirements for fp16 divide #1278

Open
lakshmih opened this issue Nov 5, 2024 · 6 comments
Open

ULP requirements for fp16 divide #1278

lakshmih opened this issue Nov 5, 2024 · 6 comments

Comments

@lakshmih
Copy link
Contributor

lakshmih commented Nov 5, 2024

Referring to OpenCL C spec on ULP requirements

ULP requirements for single precision divide (x/y) and reciprocal(1.0/x) are ≤ 2.5 ulp

However, for half precision these are defined as needing to be 'correctly rounded'

We would like to propose that these be defined with specific (lower) ULP, following the pattern for other built-ins.
We would further like to set a precision requirement of <= 1 ulp for both of these cases for fp16.

Double precision ULP for these cases also suffer from the same discrepancy (specific ULP for float, correctly rounded for double) so these should be reviewed as well.

@svenvh
Copy link
Member

svenvh commented Nov 6, 2024

As an additional consideration, for fp32 the specification defines the CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT value for the CL_DEVICE_SINGLE_FP_CONFIG query:

CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT - divide and sqrt are correctly rounded as defined by the IEEE754 specification.

In case implementations want to keep advertising a correctly rounded divide for half/double, we could consider extending the CL_DEVICE_HALF_FP_CONFIG and CL_DEVICE_DOUBLE_FP_CONFIG with CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT. That may also require revisiting the sqrt ULP requirements.

@bashbaug
Copy link
Contributor

Discussed in the November 5th teleconference. The decisions we need to make are:

  1. Should we reduce the precision requirements for an fp16 divide? (Leaning towards “yes”.)
  2. If we reduce the precision requirements for an fp16 divide, what should we reduce it to? <=1ULP? <=0.5ULP? Something else?
  3. Do we need a build option like “-cl-fp32-correctly-rounded-divide-sqrt” to get back to a more precise fp16 divide?
  4. Do we want to relax the precision requirements for any other fp16 operations, or is this just an issue with the fp16 divide? Consider 1/x in particular.
  5. Do we want to relax the precision requirements for any fp64 operations?

@kpet
Copy link
Contributor

kpet commented Nov 12, 2024

See related CTS issue: KhronosGroup/OpenCL-CTS#1996

@lakshmih
Copy link
Contributor Author

Agreed on WG call 11/12 to:

@bashbaug
Copy link
Contributor

With our current GPU implementation, we have a 0.5 ULP error with some inputs:

ERROR: divide: 0.500000 ulp error at {0x1.ef4p-4 (0x2fbd), -0x1.cp+13 (0xf300)}
Expected: -0x1.1cp-17  (half 0x808e) 
Actual: -0x1.1ap-17 (half 0x808d) at index: 13227

If I relax the ULP requirements to 0.5 ULP then the test passes:

  1:          divide fp16     ................passed        0.50 @ {-0x1.0e4p-6, 0x1.7p+13}

Tested on an Arc A750 with the command line: test_bruteforce -f -r divide

Our current CPU implementation passes without relaxing the ULP requirements.

@lakshmih
Copy link
Contributor Author

Qualcomm would like to request 1 ULP for divide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Needs WG discussion
Development

No branches or pull requests

4 participants