Applying a UDF over multiple columns in Scala Spark

0lvr5msh · published 2021-05-27 in Spark

I have the following code in PySpark, which runs fine.

from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import udf, array
prod_cols = udf(lambda arr: float(arr[0])*float(arr[1]), DoubleType())
finalDf = finalDf.withColumn('click_factor', prod_cols(array('rating', 'score')))

Now I tried similar code in Scala.

val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf = finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))

Somehow, the second piece of code never gives the right answer: it always returns null or zero. Can you help me get the correct Scala code? Essentially, I just need code that multiplies two columns, taking into account that score or rating may be null.


hm2xizp91#

Pass only non-null values to the UDF.

Change the following code:

val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))

to:

import org.apache.spark.sql.functions.{udf, when, lit}

val prod_cols = udf((rating: Double, score: Double) => rating * score)
finalDf
  .withColumn("rating", $"rating".cast("double")) // Ignore this line if the column type is already double
  .withColumn("score", $"score".cast("double"))   // Ignore this line if the column type is already double
  .withColumn("cl_rate",
    when(
      $"rating".isNotNull && $"score".isNotNull,
      prod_cols($"rating", $"score")
    ).otherwise(lit(null).cast("double"))
  )
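
As a side note, a UDF is not strictly required for this: Spark's built-in column arithmetic already propagates nulls, so multiplying the columns directly yields null whenever either input is null. A minimal sketch, assuming the same finalDf with "rating" and "score" columns:

```scala
import org.apache.spark.sql.functions.col

// Built-in arithmetic on Columns is null-safe: if either operand is null,
// the product is null, so no explicit isNotNull guard or UDF is needed.
val result = finalDf.withColumn(
  "cl_rate",
  col("rating").cast("double") * col("score").cast("double")
)
```

Avoiding the UDF also lets Catalyst optimize the expression, whereas a Scala UDF is a black box to the optimizer.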
