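I want to transpose multiple columns in a Spark SQL table.
I found this solution, but it only works for two columns. I would like to know how to use the zip function with three columns: varA, varB and varC.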
```
import org.apache.spark.sql.functions.{udf, explode}

val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
  $"userId", $"someString",
  $"vars._1".alias("varA"), $"vars._2".alias("varB")).show
```
This is my DataFrame schema:
```
root
 |-- owningcustomerid: string (nullable = true)
 |-- event_stoptime: string (nullable = true)
 |-- balancename: string (nullable = false)
 |-- chargedvalue: string (nullable = false)
 |-- newbalance: string (nullable = false)
```
I tried this code:
```
val zip = udf((xs: Seq[String], ys: Seq[String], zs: Seq[String]) => (xs, ys, zs).zipped.toSeq)

df.printSchema

val df4 = df.withColumn("vars", explode(zip($"balancename", $"chargedvalue", $"newbalance"))).select(
  $"owningcustomerid", $"event_stoptime",
  // The third field of the tuple is _3 (the original line repeated _2 here).
  $"vars._1".alias("balancename"), $"vars._2".alias("chargedvalue"), $"vars._3".alias("newbalance"))
```
I get this error:
```
cannot resolve 'UDF(balancename, chargedvalue, newbalance)' due to data type mismatch: argument 1 requires array type, however, 'balancename' is of string type. argument 2 requires array type, however, 'chargedvalue' is of string type. argument 3 requires array type, however, 'newbalance' is of string type.;;
'Project [owningcustomerid#1085, event_stoptime#1086, balancename#1159, chargedvalue#1160, newbalance#1161, explode(UDF(balancename#1159, chargedvalue#1160, newbalance#1161)) AS vars#1167]
```
1 Answer
In Scala in general you can use `Tuple3.zipped`:
```
val zip = udf((xs: Seq[Long], ys: Seq[Long], zs: Seq[Long]) =>
  (xs, ys, zs).zipped.toSeq)

zip($"varA", $"varB", $"varC")
```
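To see what the body of that UDF produces without running Spark, here is a minimal plain-Scala sketch of zipping three sequences into triples. `ZipThree` and `zip3` are hypothetical names introduced only for illustration. The sketch uses nested `zip` calls instead of `.zipped`, which is deprecated as of Scala 2.13 in favor of `lazyZip`; like `.zipped`, it stops at the shortest input.

```scala
object ZipThree {
  // Hypothetical helper (not part of the original answer): zips three
  // sequences element-wise into triples, truncating to the shortest input.
  def zip3[A, B, C](xs: Seq[A], ys: Seq[B], zs: Seq[C]): Seq[(A, B, C)] =
    xs.zip(ys).zip(zs).map { case ((x, y), z) => (x, y, z) }
}
```

For example, `ZipThree.zip3(Seq(1L, 2L), Seq(10L, 20L), Seq(100L, 200L))` gives `Seq((1L, 10L, 100L), (2L, 20L, 200L))`, which is the shape that `explode` then turns into one row per triple.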
In Spark 2.4 or later you can use the built-in `arrays_zip` function:
```
import org.apache.spark.sql.functions.arrays_zip

arrays_zip($"varA", $"varB", $"varC")
```
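Putting this together with the column names from the question, a minimal sketch might look like the following. It assumes Spark 2.4 or later and that the three value columns are already arrays. The error message in the question reports them as strings, so they would have to be converted first (for example with `split`, if they hold delimited lists, which the question does not show). `ExplodeThreeColumns` and `explodeZipped` are hypothetical names, and the field names inside the zipped struct are assumed to follow the input column names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

object ExplodeThreeColumns {
  // Hypothetical helper: expects balancename, chargedvalue and newbalance
  // to be array columns (an assumption; the question's schema has strings).
  def explodeZipped(df: DataFrame): DataFrame =
    df
      // Zip the three arrays into an array of structs, then emit one row per struct.
      .withColumn("vars", explode(arrays_zip(col("balancename"), col("chargedvalue"), col("newbalance"))))
      .select(
        col("owningcustomerid"),
        col("event_stoptime"),
        // Struct fields are assumed to be named after the zipped columns.
        col("vars.balancename").alias("balancename"),
        col("vars.chargedvalue").alias("chargedvalue"),
        col("vars.newbalance").alias("newbalance"))
}
```

Calling `ExplodeThreeColumns.explodeZipped(df)` on a DataFrame with those columns should yield one row per array position, with the customer id and stop time repeated on each row.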