Describe the bug
Our compatibility doc has been outdated since we switched the string-to-float cast to the new kernel.
The results of the current code still differ slightly from the CPU, but not in the way the compatibility doc describes. I'm not sure whether we are aware of this. I think the CPU results are correct, because they also match Python.
Steps/Code to reproduce bug
Compatibility doc:
### String to Float
Casting from string to floating-point types on the GPU returns incorrect results when the string
represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The
default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
- `1.7976931348623158E308 <= x < 1.7976931348623159E308`
- `-1.7976931348623159E308 < x <= -1.7976931348623158E308`
Also, the GPU does not support casting from strings containing hex values.
This configuration is enabled by default. To disable this operation on the GPU set
[`spark.rapids.sql.castStringToFloat.enabled`](additional-functionality/advanced_configs.md#sql.castStringToFloat.enabled) to `false`.
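As a quick reference point for the ranges quoted above, here is a minimal sketch that parses strings around the `Double.MaxValue` boundary with Python's built-in `float()`. It only shows what Python does; it does not run Spark on either the CPU or the GPU, and the probe strings and the `classify` helper are made up for illustration.

```python
import math
import sys


def classify(text):
    """Parse text with Python's float() and label the result."""
    value = float(text)
    if math.isinf(value):
        return "+Infinity" if value > 0 else "-Infinity"
    if abs(value) == sys.float_info.max:
        return "Double.MaxValue" if value > 0 else "-Double.MaxValue"
    return repr(value)


# Illustrative probe strings around the largest finite double.
boundary_probes = [
    "1.7976931348623157E308",
    "1.7976931348623158E308",
    "1.7976931348623159E308",
    "-1.7976931348623158E308",
    "-1.7976931348623159E308",
]

results = {text: classify(text) for text in boundary_probes}
for text, label in results.items():
    print(f"{text:>26} -> {label}")
```

Running the same strings through `cast(a as double)` on the CPU and the GPU and comparing against this output would show where each one diverges.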
My test code
Hack DoubleGen so that it produces strings:
 class DoubleGen(DataGen):
     """Generate doubles, which some built in corner cases."""
     def __init__(self, min_exp=DOUBLE_MIN_EXP, max_exp=DOUBLE_MAX_EXP, no_nans=False,
             nullable=True, special_cases = None):
         self._min_exp = min_exp
         self._max_exp = max_exp
         self._no_nans = no_nans
         self._use_full_range = (self._min_exp == DOUBLE_MIN_EXP) and (self._max_exp == DOUBLE_MAX_EXP)
         if special_cases is None:
             special_cases = [
                 self.make_from(1, self._max_exp, DOUBLE_MAX_FRACTION),
                 self.make_from(0, self._max_exp, DOUBLE_MAX_FRACTION),
                 self.make_from(1, self._min_exp, DOUBLE_MAX_FRACTION),
                 self.make_from(0, self._min_exp, DOUBLE_MAX_FRACTION)
             ]
             if self._min_exp <= 0 and self._max_exp >= 0:
                 special_cases.append(0.0)
                 special_cases.append(-0.0)
             if self._min_exp <= 3 and self._max_exp >= 3:
                 special_cases.append(1.0)
                 special_cases.append(-1.0)
             if not no_nans:
                 special_cases.append(float('inf'))
                 special_cases.append(float('-inf'))
                 special_cases.append(float('nan'))
                 special_cases.append(NEG_DOUBLE_NAN_MAX_VALUE)
-        super().__init__(DoubleType(), nullable=nullable, special_cases=special_cases)
+        super().__init__(StringType(), nullable=nullable, special_cases=special_cases)

     def _cache_repr(self):
         return super()._cache_repr() + '(' + str(self._min_exp) + ',' + str(self._max_exp) + ',' + str(self._no_nans) + ')'

     @staticmethod
     def make_from(sign, exp, fraction):
         sign = sign & 1 # 1 bit
         exp = (exp + 1023) & 0x7FF # add bias and 11 bits
         fraction = fraction & DOUBLE_MAX_FRACTION
         i = (sign << 63) | (exp << 52) | fraction
         p = struct.pack('L', i)
         ret = struct.unpack('d', p)[0]
         return ret

     def _fixup_nans(self, v):
         if self._no_nans and (math.isnan(v) or v == math.inf or v == -math.inf):
             v = None if self.nullable else 0.0
         return v

     def start(self, rand):
         if self._use_full_range:
             def gen_double():
                 i = rand.randint(LONG_MIN, LONG_MAX)
                 p = struct.pack('l', i)
-                return self._fixup_nans(struct.unpack('d', p)[0])
+                return str(self._fixup_nans(struct.unpack('d', p)[0]))
             self._start(rand, gen_double)
         else:
             def gen_part_double():
                 sign = rand.getrandbits(1)
                 exp = rand.randint(self._min_exp, self._max_exp)
                 fraction = rand.getrandbits(52)
-                return self._fixup_nans(self.make_from(sign, exp, fraction))
+                return str(self._fixup_nans(self.make_from(sign, exp, fraction)))
             self._start(rand, gen_part_double)
and run test:
def test_cast_string_to_double():
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, DoubleGen()).selectExpr("cast(a as double)"),
        conf={"spark.rapids.sql.castStringToFloat.enabled": True})
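To see which boundary strings the hacked generator feeds into the cast, here is a minimal standalone sketch of the `make_from` bit packing and the resulting `str()` output. The values of `DOUBLE_MIN_EXP`, `DOUBLE_MAX_EXP`, and `DOUBLE_MAX_FRACTION` are assumptions (the usual IEEE 754 double limits), not copied from the test suite, and the sketch swaps the native `'L'`/`'d'` struct formats for the fixed-size `'<Q'`/`'<d'` so that it behaves the same on every platform.

```python
import struct

# Assumed values: the usual IEEE 754 double limits, not copied from the test suite.
DOUBLE_MIN_EXP = -1022
DOUBLE_MAX_EXP = 1023
DOUBLE_MAX_FRACTION = (1 << 52) - 1


def make_from(sign, exp, fraction):
    """Assemble a double from its sign bit, unbiased exponent, and fraction bits."""
    sign &= 1                        # 1 sign bit
    biased = (exp + 1023) & 0x7FF    # add the bias, keep 11 exponent bits
    fraction &= DOUBLE_MAX_FRACTION  # keep 52 fraction bits
    bits = (sign << 63) | (biased << 52) | fraction
    # '<Q' and '<d' are always 8 bytes wide (swapped in for the native 'L' and 'd').
    return struct.unpack('<d', struct.pack('<Q', bits))[0]


# The four boundary special cases, in the same order as in the generator.
special_cases = [
    make_from(1, DOUBLE_MAX_EXP, DOUBLE_MAX_FRACTION),
    make_from(0, DOUBLE_MAX_EXP, DOUBLE_MAX_FRACTION),
    make_from(1, DOUBLE_MIN_EXP, DOUBLE_MAX_FRACTION),
    make_from(0, DOUBLE_MIN_EXP, DOUBLE_MAX_FRACTION),
]

# str() is what the modified generator applies before handing values to Spark.
special_strings = [str(v) for v in special_cases]
for s in special_strings:
    print(s)
```

Because Python's `str()` of a float round-trips through `float()`, any value the GPU returns that differs from the original double points at the cast rather than at the generated string.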
Results (part)
Expected behavior
At a minimum, the compatibility doc should note how the GPU results differ from the CPU. We may also want to fix the GPU cast to fully match the CPU one day.