databricks、spark、scala不能在long上使用lag()

q8l4jmvw 于 2021-07-09 发布在 Spark

关注(0)|答案(2)|浏览(338)

我有一个名为q6的数据文件，它如下所示：

date,count
2019-01-07,9553
2019-01-08,9930
2019-01-28,10160
2019-01-30,9881
2019-01-26,10867
2019-02-01,8
2019-01-20,6823
2019-01-22,9796
2019-01-19,9295
2019-01-05,9432
2019-01-03,10063
2018-12-31,13
2019-01-31,9804
2019-01-10,11051
2019-01-17,11268
2019-01-04,10451

我想计算每个日期和前一个日期之间的差值，以及增加/减少的百分比。以下是我的文件架构：

root
 |-- date: date (nullable = true)
 |-- count: long (nullable = false)

以下是我尝试过的命令（无论如何，其中一些命令）：

q6 = q6.groupBy("date").count()
//q6 = q6.withColumn("count", $"count" as "Int")//col("count").cast("int")
//q6 = q6.sort("date")
//q6.printSchema()
val windowSpec  = Window.partitionBy("date").orderBy("date")
q6 = q6.withColumn("lag", lag("count",1).over(windowSpec))
//q6 = q6.withColumn("prev_value", lag(q6.count).over(windowSpec))
//q6 = q6.withColumn("diff", when(isnull(q6.count - q6.prev_value), 0).otherwise(q6.price - q6.prev_value))
display(q6)

这个运行没有错误，但我得到空值，如下所示：

date,count,lag
2019-01-07,9553,null
2019-01-08,9930,null
2019-01-28,10160,null
2019-01-30,9881,null
2019-01-26,10867,null
2019-02-01,8,null
2019-01-20,6823,null
2019-01-22,9796,null
2019-01-19,9295,null
2019-01-05,9432,null
2019-01-03,10063,null
2018-12-31,13,null

我使用sql server和窗口函数，虽然我对它们不是特别精通，但我基本上可以让它们正常工作。我把我的数据集扔进了sql server，一切都正常了！有什么问题吗？

scala apache-spark apache-spark-sql window-functions

来源：https://stackoverflow.com/questions/66736421/databricks-spark-scala-cannot-use-lag-on-long

2条答案

按热度按时间

8nuwlpux1#

我的问题是我划分了窗口规范。下面是我应该做的：

val windowSpec  = Window.partitionBy().orderBy("date")

请注意，partitionby（）函数不接受参数。它现在工作得很好。
另外，我正在努力解决这个问题，但是能够在SQLServer中实现一个解决方案是非常有帮助的。我们的问题是，我们已经超出了我们的数据库，需要对大量的数据进行分析。我花了很多年才学会sql server。我想我得花上好几年的时间才能学会Spark，但这绝对是值得的！

赞(0）回复(0）举报 2021-07-09

uemypmqf2#

在sql中，您可以将其编写为

lag(count) over (order by date)

所以在scala spark，你会写

val windowSpec = Window.orderBy("date")
q6 = q6.withColumn("lag", lag("count", 1).over(windowSpec))

如果按日期划分，因为每个日期只有一个关联行， lag 将导致null。没有必要按日期划分。

赞(0）回复(0）举报 2021-07-09

我来回答

databricks、spark、scala不能在long上使用lag()

2条答案

相关问题

热门标签

最新问答