我有一个数据框,它有一个当前的日期值,我在数据框中创建一个新的列,一个月前这样计算,
spark_df = spark_df.withColumn("oneMonthAgo", expr("add_months(calendarday, -1)"))
我想返回并找到一行,该行的asofdate与onemonthago列匹配,并将其值作为名为1monthagovalue的新列包含。最初,我试图使用窗口函数返回,但由于没有一个monthHago列与日期匹配,因此它将在1monthagovalue列中返回自己的值。
Beginning Data Frame
+----------+-----------+-----------+
| AsofDate|oneMonthAgo| value|
+----------+-----------+-----------+
|2019-02-23| 2019-02-20| 2|
|2019-03-20| 2019-02-20| 7|
|2019-03-21| 2019-02-21| 12|
|2019-03-22| 2019-02-22| 27|
|2019-03-23| 2019-02-23| 91|
+----------+-----------+-----------+
Data Frame to end up with
+----------+-----------+-----------+-----------------------+
| AsofDate|oneMonthAgo| value| 1MonthAgoValue|
+----------+-----------+-----------+-----------------------+
|2019-02-23| 2019-02-20| 2| null |
|2019-03-20| 2019-02-20| 7| null |
|2019-03-21| 2019-02-21| 12| null |
|2019-03-22| 2019-02-22| 27| null |
|2019-03-23| 2019-02-23| 91| 2 | (oneMonthAgo matches first rows AsofDate include first rows value column in current row as 1MonthAgoValue )
+----------+-----------+-----------+-----------------------+
+---------------+-----------------------+-----------+--------------+-----+-------+----+------------+--------------+------------------+---------------+---------------+-----------------+---------------+-------------------+-------------------------+---------------+
|accountname| calendarday| market|returnposition| year| month| day|yearUnique|last_monday|firstDayOfMonth|oneMonthAgo|twoMonthAgo| threeMonthAgo| sixMonthAgo|twelveMonthAgo| indexCP|inceptionDate|
+--------------+-----------------------+-----------+--------------+-----+-------+----+------------+--------------+------------------+---------------+----------------+-----------------+---------------+-------------------+-------------------------+---------------+
| Giants|2015-01-02 00:00:00| null| null| 2015| 01| 02|2015-01-02| 2014-12-29| 2015-01-01| 2014-12-02| 2014-11-02| 2014-10-02| 2014-07-02| 2014-01-02| 100.0| 2015-01-02|
| Giants|2015-01-05 00:00:00|110086.25| 0.0105| 2015| 01| 05|2015-01-05| 2015-01-05| 2015-01-01| 2014-12-05| 2014-11-05| 2014-10-05| 2014-07-05| 2014-01-05| 101.05| 2015-01-02|
| Giants|2015-01-06 00:00:00| 201251.5| 2.0E-4| 2015| 01| 06|2015-01-06| 2015-01-05| 2015-01-01| 2014-12-06| 2014-11-06| 2014-10-06| 2014-07-06| 2014-01-06| 101.07020999999999| 2015-01-02|
| Giants|2015-01-07 00:00:00| 216786.5| -0.006| 2015| 01| 07|2015-01-07| 2015-01-05| 2015-01-01| 2014-12-07| 2014-11-07| 2014-10-07| 2014-07-07| 2014-01-07| 100.46378874| 2015-01-02|
| Giants|2015-01-08 00:00:00| 215464.5| -0.0063| 2015| 01| 08|2015-01-08| 2015-01-05| 2015-01-01| 2014-12-08| 2014-11-08| 2014-10-08| 2014-07-08| 2014-01-08| 99.830866870938| 2015-01-02|
| Giants|2015-01-09 00:00:00|214103.25| 0.0052| 2015| 01| 09|2015-01-09| 2015-01-05| 2015-01-01| 2014-12-09| 2014-11-09| 2014-10-09| 2014-07-09| 2014-01-09| 100.34998737866687| 2015-01-02|
| Giants|2015-01-12 00:00:00| 215218.5| -4.0E-4| 2015| 01| 12|2015-01-12| 2015-01-12| 2015-01-01| 2014-12-12| 2014-11-12| 2014-10-12| 2014-07-12| 2014-01-12| 100.30984738371541| 2015-01-02|
| Giants|2015-01-13 00:00:00|215125.25| 0.0036| 2015| 01| 13|2015-01-13| 2015-01-12| 2015-01-01| 2014-12-13| 2014-11-13| 2014-10-13| 2014-07-13| 2014-01-13| 100.67096283429677| 2015-01-02|
| Giants|2015-01-14 00:00:00| 215919.5| 8.0E-4| 2015| 01| 14|2015-01-14| 2015-01-12| 2015-01-01| 2014-12-14| 2014-11-14| 2014-10-14| 2014-07-14| 2014-01-14| 100.7514996045642| 2015-01-02|
| Giants|2015-01-15 00:00:00|216103.75| 4.0E-4| 2015| 01| 15|2015-01-15| 2015-01-12| 2015-01-01| 2014-12-15| 2014-11-15| 2014-10-15| 2014-07-15| 2014-01-15| 100.79180020440602| 2015-01-02|
| Giants|2015-01-16 00:00:00| 216205.5| 0.0052| 2015| 01| 16|2015-01-16| 2015-01-12| 2015-01-01| 2014-12-16| 2014-11-16| 2014-10-16| 2014-07-16| 2014-01-16| 101.31591756546894| 2015-01-02|
| Giants|2015-01-19 00:00:00| 347334.0| -0.0045| 2015| 01| 19|2015-01-19| 2015-01-19| 2015-01-01| 2014-12-19| 2014-11-19| 2014-10-19| 2014-07-19| 2014-01-19| 100.85999593642434| 2015-01-02|
| Giants|2015-01-20 00:00:00| 345767.0| 0.0015| 2015| 01| 20|2015-01-20| 2015-01-19| 2015-01-01| 2014-12-20| 2014-11-20| 2014-10-20| 2014-07-20| 2014-01-20| 101.01128593032898| 2015-01-02|
| Giants|2015-01-21 00:00:00| 346314.5| 2.0E-4| 2015| 01| 21|2015-01-21| 2015-01-19| 2015-01-01| 2014-12-21| 2014-11-21| 2014-10-21| 2014-07-21| 2014-01-21| 101.03148818751504| 2015-01-02|
| Giants|2015-01-22 00:00:00|346399.75| 0.0029| 2015| 01| 22|2015-01-22| 2015-01-19| 2015-01-01| 2014-12-22| 2014-11-22| 2014-10-22| 2014-07-22| 2014-01-22| 101.32447950325883| 2015-01-02|
| Giants|2015-01-23 00:00:00|347412.75| -6.0E-4| 2015| 01| 23|2015-01-23| 2015-01-19| 2015-01-01| 2014-12-23| 2014-11-23| 2014-10-23| 2014-07-23| 2014-01-23| 101.26368481555686| 2015-01-02|
| Giants|2015-01-26 00:00:00|348303.75| -6.0E-4| 2015| 01| 26|2015-01-26| 2015-01-26| 2015-01-01| 2014-12-26| 2014-11-26| 2014-10-26| 2014-07-26| 2014-01-26| 101.20292660466752| 2015-01-02|
| Giants|2015-01-27 00:00:00| 348541.0| -0.0044| 2015| 01| 27|2015-01-27| 2015-01-26| 2015-01-01| 2014-12-27| 2014-11-27| 2014-10-27| 2014-07-27| 2014-01-27| 100.75763372760697| 2015-01-02|
| Giants|2015-01-28 00:00:00|347579.25| 0.0015| 2015| 01| 28|2015-01-28| 2015-01-26| 2015-01-01| 2014-12-28| 2014-11-28| 2014-10-28| 2014-07-28| 2014-01-28| 100.9087701781984| 2015-01-02|
| Giants|2015-01-29 00:00:00|348431.75| 2.0E-4| 2015| 01| 29|2015-01-29| 2015-01-26| 2015-01-01| 2014-12-29| 2014-11-29| 2014-10-29| 2014-07-29| 2014-01-29| 100.92895193223403| 2015-01-02|
| Yankees|2015-01-02 00:00:00| null| null| 2015| 01| 02|2015-01-02| 2014-12-29| 2015-01-01| 2014-12-02| 2014-11-02| 2014-10-02| 2014-07-02| 2014-01-02| 100.0| 2015-01-02|
| Yankees|2015-01-05 00:00:00|110086.25| 0.0105| 2015| 01| 05|2015-01-05| 2015-01-05| 2015-01-01| 2014-12-05| 2014-11-05| 2014-10-05| 2014-07-05| 2014-01-05| 101.05| 2015-01-02|
| Yankees|2015-01-06 00:00:00| 201251.5| 2.0E-4| 2015| 01| 06|2015-01-06| 2015-01-05| 2015-01-01| 2014-12-06| 2014-11-06| 2014-10-06| 2014-07-06| 2014-01-06| 101.07020999999999| 2015-01-02|
| Yankees|2015-01-07 00:00:00| 216786.5| -0.006| 2015| 01| 07|2015-01-07| 2015-01-05| 2015-01-01| 2014-12-07| 2014-11-07| 2014-10-07| 2014-07-07| 2014-01-07| 100.46378874| 2015-01-02|
| Yankees|2015-01-08 00:00:00| 215464.5| -0.0063| 2015| 01| 08|2015-01-08| 2015-01-05| 2015-01-01| 2014-12-08| 2014-11-08| 2014-10-08| 2014-07-08| 2014-01-08| 99.830866870938| 2015-01-02|
| Yankees|2015-01-09 00:00:00|214103.25| 0.0052| 2015| 01| 09|2015-01-09| 2015-01-05| 2015-01-01| 2014-12-09| 2014-11-09| 2014-10-09| 2014-07-09| 2014-01-09| 100.34998737866687| 2015-01-02|
| Yankees|2015-01-12 00:00:00| 215218.5| -4.0E-4| 2015| 01| 12|2015-01-12| 2015-01-12| 2015-01-01| 2014-12-12| 2014-11-12| 2014-10-12| 2014-07-12| 2014-01-12| 100.30984738371541| 2015-01-02|
| Yankees|2015-01-13 00:00:00|215125.25| 0.0036| 2015| 01| 13|2015-01-13| 2015-01-12| 2015-01-01| 2014-12-13| 2014-11-13| 2014-10-13| 2014-07-13| 2014-01-13| 100.67096283429677| 2015-01-02|
| Yankees|2015-01-14 00:00:00| 215919.5| 8.0E-4| 2015| 01| 14|2015-01-14| 2015-01-12| 2015-01-01| 2014-12-14| 2014-11-14| 2014-10-14| 2014-07-14| 2014-01-14| 100.7514996045642| 2015-01-02|
| Yankees|2015-01-15 00:00:00|216103.75| 4.0E-4| 2015| 01| 15|2015-01-15| 2015-01-12| 2015-01-01| 2014-12-15| 2014-11-15| 2014-10-15| 2014-07-15| 2014-01-15| 100.79180020440602| 2015-01-02|
| Yankees|2015-01-16 00:00:00| 216205.5| 0.0052| 2015| 01| 16|2015-01-16| 2015-01-12| 2015-01-01| 2014-12-16| 2014-11-16| 2014-10-16| 2014-07-16| 2014-01-16| 101.31591756546894| 2015-01-02|
| Yankees|2015-01-19 00:00:00| 347334.0| -0.0045| 2015| 01| 19|2015-01-19| 2015-01-19| 2015-01-01| 2014-12-19| 2014-11-19| 2014-10-19| 2014-07-19| 2014-01-19| 100.85999593642434| 2015-01-02|
| Yankees|2015-01-20 00:00:00| 345767.0| 0.0015| 2015| 01| 20|2015-01-20| 2015-01-19| 2015-01-01| 2014-12-20| 2014-11-20| 2014-10-20| 2014-07-20| 2014-01-20| 101.01128593032898| 2015-01-02|
| Yankees|2015-01-21 00:00:00| 346314.5| 2.0E-4| 2015| 01| 21|2015-01-21| 2015-01-19| 2015-01-01| 2014-12-21| 2014-11-21| 2014-10-21| 2014-07-21| 2014-01-21| 101.03148818751504| 2015-01-02|
| Yankees|2015-01-22 00:00:00|346399.75| 0.0029| 2015| 01| 22|2015-01-22| 2015-01-19| 2015-01-01| 2014-12-22| 2014-11-22| 2014-10-22| 2014-07-22| 2014-01-22| 101.32447950325883| 2015-01-02|
| Yankees|2015-01-23 00:00:00|347412.75| -6.0E-4| 2015| 01| 23|2015-01-23| 2015-01-19| 2015-01-01| 2014-12-23| 2014-11-23| 2014-10-23| 2014-07-23| 2014-01-23| 101.26368481555686| 2015-01-02|
| Yankees|2015-01-26 00:00:00|348303.75| -6.0E-4| 2015| 01| 26|2015-01-26| 2015-01-26| 2015-01-01| 2014-12-26| 2014-11-26| 2014-10-26| 2014-07-26| 2014-01-26| 101.20292660466752| 2015-01-02|
| Yankees|2015-01-27 00:00:00| 348541.0| -0.0044| 2015| 01| 27|2015-01-27| 2015-01-26| 2015-01-01| 2014-12-27| 2014-11-27| 2014-10-27| 2014-07-27| 2014-01-27| 100.75763372760697| 2015-01-02|
| Yankees|2015-01-28 00:00:00|347579.25| 0.0015| 2015| 01| 28|2015-01-28| 2015-01-26| 2015-01-01| 2014-12-28| 2014-11-28| 2014-10-28| 2014-07-28| 2014-01-28| 100.9087701781984| 2015-01-02|
| Yankees|2015-01-29 00:00:00|348431.75| 2.0E-4| 2015| 01| 29|2015-01-29| 2015-01-26| 2015-01-01| 2014-12-29| 2014-11-29| 2014-10-29| 2014-07-29| 2014-01-29| 100.92895193223403| 2015-01-02|
| Yankees|2015-02-28 00:00:00|348431.75| 2.0E-4| 2015| 02| 28|2015-02-28| 2015-01-26| 2015-01-01| 2014-12-28| 2014-11-28| 2014-10-28| 2014-07-28| 2014-01-28| 100.92895193223403| 2015-01-02|
+--------------+-----------------------+-----------+--------------+-----+-------+----+------------+--------------+------------------+---------------+----------------+-----------------+---------------+-------------------+-------------------------+---------------+
2条答案
按热度按时间oyxsuwqo1#
你可以试试看
left_outer self join
这样地。from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy(F.dayofmonth("AsofDate"))
.orderBy(F.to_timestamp("AsofDate").cast("long"))
.rangeBetween(86400*-30,0)
first=F.first("value").over(w)
df.withColumn("1MonthAgoValue", F.when(first!=F.col("value"), first)
.otherwise(F.lit(None))).show()
lfapxunr2#
这是可行的,一个窗口函数是正确的方法,但是你不需要asofdate列
在实际的数据集中,您可能希望使其更为唯一,并按id/为每个id执行此操作,如果是这种情况,请将id列也添加到partitionby中。
ps:如果lag()得到错误的结果,请使用lead(),我总是忘记顺序,最后不得不同时尝试这两种方法。