spark:复制每一行,但在一列值中进行更改

iszxjhcz  于 2021-05-27  发布在  Spark
关注(0)|答案(2)|浏览(431)

如何在spark中执行以下操作,

Initially:
+-----------+-----+------+
|date       |col1 | col2 |
+-----------+-----+------+
|2020-08-16 | 2   | abc  |
|2020-08-17 | 3   | def  |
|2020-08-18 | 4   | ghi  |
|2020-08-19 | 5   | jkl  |
|2020-08-20 | 6   | mno  |
+-----------+-----+------+

Final result:
+-----------+-----+------+
|date       |col1 | col2 |
+-----------+-----+------+
|2020-08-16 | 2   | abc  |
|2020-08-15 | 2   | abc  |
|2020-08-17 | 3   | def  |
|2020-08-16 | 3   | def  |
|2020-08-18 | 4   | ghi  |
|2020-08-17 | 4   | ghi  |
|2020-08-19 | 5   | jkl  |
|2020-08-18 | 5   | jkl  |
|2020-08-20 | 6   | mno  |
|2020-08-19 | 6   | mno  |
+-----------+-----+------+

所以本质上需要复制每一行,其中一列值发生变化,也就是说,对于每一行,复制日期列为减去当前值的1天。

mwngjboj

mwngjboj1#

尝试 date_add 函数,然后创建带有日期列和日期-1列的数组,最后分解该列。 Example: ```
df.show()

/*
+----------+----+----+
| date|col1|col2|
+----------+----+----+
|2020-08-16| 2| abc|
|2020-08-17| 3| def|
+----------+----+----+

  • /

import org.apache.spark.sql.functions._

df.withColumn("new_date",array(col("date"),date_add(col("date"),-1))).
drop("date").
selectExpr("explode(new_date) as date","*").
drop("new_date").
show(10,false)

/*
+----------+----+----+
|date |col1|col2|
+----------+----+----+
|2020-08-16|2 |abc |
|2020-08-15|2 |abc |
|2020-08-17|3 |def |
|2020-08-16|3 |def |
+----------+----+----+

  • /
9jyewag0

9jyewag02#

我在想 union 对于这个解决方案来说是相当优雅的

// Union the two dataframes together, take 1 day away from the date
df.union(df.select(date_add($"date", -1), $"col1", $"col2"))

创建测试数据的完整示例脚本:

import org.apache.spark.sql.functions._

val dfOriginal = Seq(("2020-08-16", 2, "abc"), ("2020-08-17", 3, "def"), ("2020-08-18", 4, "ghi"), ("2020-08-19", 5, "jkl"), ("2020-08-20", 6, "mno"))
  .toDF("date", "col1", "col2")

val df = dfOriginal
  .select (to_date($"date", "yyyy-MM-dd").as("date"), $"col1", $"col2")

// Union the two dataframes together, take 1 day away from the date
df.union(df.select(date_add($"date", -1), $"col1", $"col2"))
  .orderBy("date", "col1", "col2")
  .show

我的结果:

相关问题