PySpark: casting float to double is imprecise

kb5ga3dv · published 2022-11-01 in Spark

When I group and aggregate a float column with sum(), the result is not what I expect.
This is not specific to groupBy; it also happens when casting float to double.
Here is a code example.

>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> schema = StructType([ \
...     StructField("firstname",StringType(),True), \
...     StructField("middlename",StringType(),True), \
...     StructField("v",FloatType(),True)])
>>>
>>> df = spark.createDataFrame([["a","b",1.12],["a","b",2.23],["a","c",7.78]],schema=schema)
>>> df.show()
+---------+----------+----+
|firstname|middlename|   v|
+---------+----------+----+
|        a|         b|1.12|
|        a|         b|2.23|
|        a|         c|7.78|
+---------+----------+----+
>>> df.groupBy("firstname","middlename").agg(sum("v")).show()
+---------+----------+-----------------+
|firstname|middlename|           sum(v)|
+---------+----------+-----------------+
|        a|         b|3.350000023841858|
|        a|         c| 7.78000020980835|
+---------+----------+-----------------+
>>> df.groupBy("firstname","middlename").agg(sum("v").cast("float")).show()
+---------+----------+---------------------+
|firstname|middlename|CAST(sum(v) AS FLOAT)|
+---------+----------+---------------------+
|        a|         b|                 3.35|
|        a|         c|                 7.78|
+---------+----------+---------------------+
>>> df.select(col("v"), col("v").cast("double")).show()
+----+------------------+
|   v|                 v|
+----+------------------+
|1.12|1.1200000047683716|
|2.23|2.2300000190734863|
|7.78|  7.78000020980835|
+----+------------------+

I assume this comes from the types' precision (4 bytes vs. 8 bytes), but it still looks like a bug to me, because when a float value is cast to double the value should be preserved.
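
The same widening can be reproduced outside Spark; a minimal sketch, assuming NumPy is available:

>>> import numpy as np
>>> f = np.float32(1.12)   # nearest 32-bit approximation of 1.12
>>> print(f)               # display rounds to the shortest decimal form
1.12
>>> float(f)               # the exact stored value, widened to 64 bits
1.1200000047683716

So the extra digits appear as soon as the 32-bit value is read back at higher precision.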
I found a workaround by casting back to float after the aggregation, as shown above, but I find that unclean.
Is there a better way?


fjaof16o1#

I found an answer: cast column v to string (and then back to double) before aggregating it.

>>> from pyspark.sql import functions as F
>>> df.withColumn("v", col("v").cast("string").cast("double"))\
...     .groupBy("firstname","middlename").agg(F.sum("v")).show()
+---------+----------+------+
|firstname|middlename|sum(v)|
+---------+----------+------+
|        a|         b|  3.35|
|        a|         c|  7.78|
+---------+----------+------+
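
Why this works (a sketch against the same session): casting the float to string renders its shortest decimal form, e.g. "1.12", and casting that string to double then parses the decimal literal directly instead of widening the stored 32-bit value:

>>> df.select(col("v").cast("string").alias("as_text"),
...     col("v").cast("string").cast("double").alias("parsed"),
...     col("v").cast("double").alias("widened")).show()
+-------+------+------------------+
|as_text|parsed|           widened|
+-------+------+------------------+
|   1.12|  1.12|1.1200000047683716|
|   2.23|  2.23|2.2300000190734863|
|   7.78|  7.78|  7.78000020980835|
+-------+------+------------------+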
The resulting schema confirms the aggregated column is a double:

>>> df.withColumn("v", col("v").cast("string").cast("double"))\
...     .groupBy("firstname","middlename").agg(F.sum("v")).printSchema()
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- sum(v): double (nullable = true)
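
If exactness matters throughout, another option is to avoid binary floats entirely; a sketch, continuing the same session (so StructType and friends are still in scope) and using DecimalType(10, 2) as an illustrative precision/scale:

>>> from decimal import Decimal
>>> from pyspark.sql.types import DecimalType
>>> schema2 = StructType([
...     StructField("firstname", StringType(), True),
...     StructField("middlename", StringType(), True),
...     StructField("v", DecimalType(10, 2), True)])
>>> df2 = spark.createDataFrame(
...     [["a", "b", Decimal("1.12")], ["a", "b", Decimal("2.23")], ["a", "c", Decimal("7.78")]],
...     schema=schema2)
>>> df2.groupBy("firstname", "middlename").agg(F.sum("v")).show()
+---------+----------+------+
|firstname|middlename|sum(v)|
+---------+----------+------+
|        a|         b|  3.35|
|        a|         c|  7.78|
+---------+----------+------+

Decimal sums are exact, so no cast is needed after the aggregation. Note that DecimalType columns must be built from decimal.Decimal values, not Python floats.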
