我有一张这种格式的表格
date dept rate
2020-07-06 Marketing. 20
2020-07-06 Sales. 15
2020-07-06 Engg. 40
2020-07-06 Sites. 18
2020-07-08 Sales. 5
2020-07-08 Engg. 10
2020-07-08 Sites. 7
我想添加新的“spendrate”列,以便在最近两天(示例中为7月7日和8日)将值从7月6日的“rate”复制到“spendrate”。。
date dept rate. Spendrate
2020-07-06 Marketing. 20 20
2020-07-06 Sales. 15 15
2020-07-06 Engg. 40 40
2020-07-06 Sites. 18 18
2020-07-07 Marketing. 20. 20
2020-07-08 Sales. 5. 15
2020-07-08 Engg. 10 40
2020-07-08 Sites. 7 18
2条答案
按热度按时间ct3nt3jp1#
mwecs4sa2#
使用窗口
first(col,ignoreNulls=True)
用rangebetween子句生成一个框架。Example:
```df.show()
+----------+---------+----+
| date| dept|rate|
+----------+---------+----+
|2020-07-06|Marketing| 20|
|2020-07-06| Sales| 15|
|2020-07-06| Engg| 40|
|2020-07-06| sites| 18|
|2020-07-08| Sales| 5|
|2020-07-08| Engg| 10|
|2020-07-08| sites| 7|
|2020-07-07|Marketing| 20|
+----------+---------+----+
sql("select *, first(rate,True) over(partition by dept order by cast (date as timestamp) RANGE BETWEEN INTERVAL 2 DAYS PRECEDING AND CURRENT ROW) as Spendrate from tmp order by date").show()
for more specific range by checking datediff -1 or 0 then generating Spendrate column.
sql("select date,dept,rate,case when diff=-1 then first(rate,True) over(partition by dept order by cast (date as timestamp) RANGE BETWEEN INTERVAL 2 DAYS PRECEDING AND CURRENT ROW) when diff=0 then first(rate,True) over(partition by dept order by cast (date as timestamp) RANGE BETWEEN INTERVAL 2 DAYS PRECEDING AND CURRENT ROW) else rate end as Spendrate from (select *,datediff(date,current_date)diff from tmp)t order by date").show()
+----------+---------+----+----------+
| date| dept|rate| Spendrate|
+----------+---------+----+----------+
|2020-07-06|Marketing| 20| 20 |
|2020-07-06| sites| 18| 18 |
|2020-07-06| Engg| 40| 40 |
|2020-07-06| Sales| 15| 15 |
|2020-07-07|Marketing| 20| 20 |
|2020-07-08| sites| 7| 18 |
|2020-07-08| Engg| 10| 40 |
|2020-07-08| Sales| 5| 15 |
+----------+---------+----+----------+