I am trying to modify the example from https://medium.com/build-and-learn/spark-aggregating-your-data-the-fast-way-e37b53314fad to work with arbitrary rows. The goal is to return the "latest" row of each group.
The aggregator is implemented like this:
class Latest(val f: Row => String, val schema: StructType) extends Aggregator[Row, (String, Row), Row] {
  // Buffer is (date string, row); start with a date that sorts before any real one.
  override def zero: (String, Row) = ("0000-00-00", null)
  // Fold a new row in by treating it as a one-element buffer.
  override def reduce(b: (String, Row), a: Row): (String, Row) = merge(b, (f(a), a))
  // Keep whichever buffer carries the lexicographically larger date.
  override def merge(b1: (String, Row), b2: (String, Row)): (String, Row) = Seq(b1, b2).maxBy(_._1)
  override def finish(reduction: (String, Row)): Row = reduction._2
  override def bufferEncoder: Encoder[(String, Row)] = Encoders.product[(String, Row)]
  override def outputEncoder: Encoder[Row] = RowEncoder(schema)
}
I am testing the aggregator with the following code:
class AggregatorSpec
    extends FunSpec
    with DataFrameComparer
    with SparkSessionTestWrapper {

  import spark.implicits._

  describe("main") {
    it("works") {
      val spark = SparkSession
        .builder
        .master("local")
        .appName("common typed aggregator implementations")
        .getOrCreate()

      val df = Seq(
        ("ham", "2019-01-01", 3L, "Yah"),
        ("cheese", "2018-12-31", 4L, "Woo"),
        ("fish", "2019-01-02", 5L, "Hah"),
        ("grain", "2019-01-01", 6L, "Community"),
        ("grain", "2019-01-02", 7L, "Community"),
        ("ham", "2019-01-04", 3L, "jamón")
      ).toDF("Key", "Date", "Numeric", "Text")

      println("input data:")
      df.show()

      println("running latest:")
      df.groupByKey(_.getString(0)).agg(new Latest(_.getString(1), df.schema).toColumn).show()

      spark.stop()
    }
  }
}
Running the above code produces the following error:
[info] - runs *** FAILED ***
[info] java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
[info] - field (class: "org.apache.spark.sql.Row", name: "_2")
[info] - root class: "scala.Tuple2"
[info] at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
[info] at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
[info] at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
[info] at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] at scala.collection.immutable.List.foreach(List.scala:381)
[info] at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
[info] at scala.collection.immutable.List.flatMap(List.scala:344)
[info] at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
[info] at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
I am fairly new to both Spark and Scala, and I am not even sure whether what I am trying to achieve is possible.
1 Answer
The problem lies in the creation of bufferEncoder: Encoders.product derives the encoder through Scala reflection, which fails on the Row field of the tuple because Row carries no schema information at the type level (hence "No Encoder found for org.apache.spark.sql.Row"). Change it to build the tuple encoder explicitly, as sketched below.
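A minimal sketch of the fix, keeping the rest of Latest unchanged: compose the buffer encoder with Encoders.tuple, using Encoders.STRING for the date and the same RowEncoder(schema) the class already uses for its output.

import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Build the (String, Row) encoder explicitly instead of deriving it,
// so Catalyst never has to reflect over org.apache.spark.sql.Row.
override def bufferEncoder: Encoder[(String, Row)] =
  Encoders.tuple(Encoders.STRING, RowEncoder(schema))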
I hope this is just a toy example you are using to try out Aggregator. If not, there is an alternative way to get the same result without an aggregator at all; one possibility follows.
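A sketch of one such alternative (my own suggestion, staying entirely in the untyped DataFrame API): take max over a struct whose first field is the ordering column. max compares struct fields left to right, so putting Date first picks the latest row per key; the variable name latest is illustrative.

import org.apache.spark.sql.functions.{col, max, struct}

// Date comes first in the struct so max orders by it; the remaining
// columns ride along with the winning row.
val latest = df
  .groupBy("Key")
  .agg(max(struct(col("Date"), col("Numeric"), col("Text"))).as("latest"))
  .select(col("Key"), col("latest.Date"), col("latest.Numeric"), col("latest.Text"))

latest.show()

Since this never leaves the DataFrame API, no encoders are involved and the reflection problem cannot occur.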