Inaccuracy for f32 erf on non-FMA instruction set #517

bnprks · 2024-02-15T09:28:48Z

Thanks for writing and maintaining such a fast and useful library!

I've noticed some precision issues where the erf function with f32 inputs may exceed the documented 1.0 ULP error bound when FMA is not available.

I ran my testing on x86, where scalar and AVX2 return consistently accurate values, but SSE4 and SSE2 sometimes return inaccurate values.

For example, the input 0x1.16945ap-126 (1.279174322e-38) has a 1.3 ULP error, and the input -0x1.d1b6c8p-127 (-1.069226882e-38) has 1.8 ULP error.

Here is some example code to reproduce these results, tested on a fresh pull of the main branch (a99491a):

#include <stdio.h>
#include <math.h>

#include "sleef.h"

//cmake -B build -S . -G Ninja
//cmake --build build
//gcc -o sleef_erf_example sleef_erf_example.c -lsleef -lm -L./build/lib -I ./build/include -march=native

int main() {
    float example_input = 0x1.16945ap-126;
    printf("Problem input:\t%1$.10g (%1$a)\n", example_input);
    printf("Sleef scalar:\t%1$.10g (%1$a)\n", Sleef_erff1_u10(example_input));
    printf("Sleef avx2:\t%1$.10g (%1$a)\n", _mm256_cvtss_f32(Sleef_erff8_u10avx2(_mm256_set1_ps(example_input))));
    printf("Sleef sse4:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse4(_mm_set1_ps(example_input))));
    printf("Sleef sse2:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse2(_mm_set1_ps(example_input))));
    printf("Correct result:\t%1$.20Lg (%1$La)\n", erfl((long double) example_input));
}

When run, this outputs:

Problem input:  1.279174322e-38 (0x1.16945ap-126)
Sleef scalar:   1.44339361e-38 (0x1.3a57e2p-126)
Sleef avx2:     1.44339361e-38 (0x1.3a57e2p-126)
Sleef sse4:     1.44339347e-38 (0x1.3a57ep-126)
Sleef sse2:     1.44339347e-38 (0x1.3a57ep-126)
Correct result: 1.4433936563104040097e-38 (0x9.d2bf154036c0db7p-129)

It appears the differences arise in the call to dfmul_vf2_vf2_vf on this line 3524 of sleefsimdsp.c

sleef/src/libm/sleefsimdsp.c

Line 3524 in a99491a

    
           t2 = vsel_vf2_vo_vf2_vf2(vlt_vo_vf_vf(x, vcast_vf_f(1e-4)), dfmul_vf2_vf2_vf(vcast_vf2_f_f(-1.1283792257308959961, 5.8635383422197591097e-08), x), t2);

An algorithm update to improve non-FMA precision would of course be appreciated. For my use-case, however, I don't need the full 1.0 ULP precision. It would be fine to just update the docs to state that the error bound is 2.0 ULP when FMA isn't available.

The text was updated successfully, but these errors were encountered:

blapie added the algo label Mar 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inaccuracy for f32 erf on non-FMA instruction set #517

Inaccuracy for f32 erf on non-FMA instruction set #517

bnprks commented Feb 15, 2024 •

edited

Loading

Inaccuracy for f32 erf on non-FMA instruction set #517

Inaccuracy for f32 erf on non-FMA instruction set #517

Comments

bnprks commented Feb 15, 2024 • edited Loading

bnprks commented Feb 15, 2024 •

edited

Loading