In Databricks, calling the approx_count_distinct function with the 'rsd' parameter returns an error message. The function works when I omit this parameter.
Dataset
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Robert |Sales |4100 |
|Maria |Finance |3000 |
|James |Sales |3000 |
|Scott |Finance |3300 |
|Jen |Finance |3900 |
|Jeff |Marketing |3000 |
|Kumar |Marketing |2000 |
|Saif |Sales |4100 |
+-------------+----------+------+
Code
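from pyspark.sql.functions import approx_count_distinct, col
# Without rsd this call works; the failing call passed an integer as the second (rsd) argument.
df.agg(approx_count_distinct(col("salary")).alias("salaryDistinct"))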
Error message
py4j.Py4JException: Method approx_count_distinct([class org.apache.spark.sql.Column, class java.lang.Integer]) does not exist
1 Answer
I reproduced the above and got the same error.
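The error occurs when the rsd value is given as an integer. According to the documentation for pyspark.sql.functions.approx_count_distinct(), rsd should be a float. Passing a float gives the expected result.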
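A minimal sketch of the fix, assuming the DataFrame df from the question with a salary column. The error message suggests that Py4J forwarded the Python int as java.lang.Integer and found no matching (Column, Integer) overload on the JVM side, so converting rsd to a Python float before the call should avoid it. The names as_rsd and approx_distinct_salaries are hypothetical helpers introduced only for this illustration; they are not part of PySpark.

```python
def as_rsd(value):
    """Hypothetical helper: coerce an int-like rsd to a Python float.

    The error in the question shows an integer being forwarded as
    java.lang.Integer, which the JVM-side method does not accept.
    """
    return float(value)


def approx_distinct_salaries(df, rsd=0.05):
    """Approximate distinct count of the salary column (assumed to exist in df)."""
    # Imported lazily so as_rsd can be used without PySpark installed.
    from pyspark.sql.functions import approx_count_distinct, col

    return df.agg(
        # rsd is the maximum relative standard deviation allowed; it must be a float.
        approx_count_distinct(col("salary"), as_rsd(rsd)).alias("salaryDistinct")
    )
```

With the DataFrame from the question, approx_distinct_salaries(df, rsd=0.1).show() would display the approximate count. Writing the literal directly as a float, for example approx_count_distinct(col("salary"), 0.1), works the same way without any helper.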