Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inaccuracy for f32 erf on non-FMA instruction set #517

Open
bnprks opened this issue Feb 15, 2024 · 0 comments
Open

Inaccuracy for f32 erf on non-FMA instruction set #517

bnprks opened this issue Feb 15, 2024 · 0 comments
Labels

Comments

@bnprks
Copy link

bnprks commented Feb 15, 2024

Thanks for writing and maintaining such a fast and useful library!

I've noticed some precision issues where the erf function with f32 inputs may exceed the documented 1.0 ULP error bound when FMA is not available.

I ran my testing on x86, where scalar and AVX2 return consistently accurate values, but SSE4 and SSE2 sometimes return inaccurate values.

For example, the input 0x1.16945ap-126 (1.279174322e-38) has a 1.3 ULP error, and the input -0x1.d1b6c8p-127 (-1.069226882e-38) has 1.8 ULP error.

Here is some example code to reproduce these results, tested on a fresh pull of the main branch (a99491a):

#include <stdio.h>
#include <math.h>

#include "sleef.h"

//cmake -B build -S . -G Ninja
//cmake --build build
//gcc -o sleef_erf_example sleef_erf_example.c -lsleef -lm -L./build/lib -I ./build/include -march=native

int main() {
    float example_input = 0x1.16945ap-126;
    printf("Problem input:\t%1$.10g (%1$a)\n", example_input);
    printf("Sleef scalar:\t%1$.10g (%1$a)\n", Sleef_erff1_u10(example_input));
    printf("Sleef avx2:\t%1$.10g (%1$a)\n", _mm256_cvtss_f32(Sleef_erff8_u10avx2(_mm256_set1_ps(example_input))));
    printf("Sleef sse4:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse4(_mm_set1_ps(example_input))));
    printf("Sleef sse2:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse2(_mm_set1_ps(example_input))));
    printf("Correct result:\t%1$.20Lg (%1$La)\n", erfl((long double) example_input));
}

When run, this outputs:

Problem input:  1.279174322e-38 (0x1.16945ap-126)
Sleef scalar:   1.44339361e-38 (0x1.3a57e2p-126)
Sleef avx2:     1.44339361e-38 (0x1.3a57e2p-126)
Sleef sse4:     1.44339347e-38 (0x1.3a57ep-126)
Sleef sse2:     1.44339347e-38 (0x1.3a57ep-126)
Correct result: 1.4433936563104040097e-38 (0x9.d2bf154036c0db7p-129)

It appears the differences arise in the call to dfmul_vf2_vf2_vf on this line 3524 of sleefsimdsp.c

t2 = vsel_vf2_vo_vf2_vf2(vlt_vo_vf_vf(x, vcast_vf_f(1e-4)), dfmul_vf2_vf2_vf(vcast_vf2_f_f(-1.1283792257308959961, 5.8635383422197591097e-08), x), t2);

An algorithm update to improve non-FMA precision would of course be appreciated. For my use-case, however, I don't need the full 1.0 ULP precision. It would be fine to just update the docs to state that the error bound is 2.0 ULP when FMA isn't available.

@blapie blapie added the algo label Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants