Reindex a PySpark DataFrame with dates in year-week format for each group

Posted by fcg9iug3 on 2023-01-16 in Spark


I have the following PySpark DataFrame:

id1  id2   date         col1    col2 
1     1    2022-W01      5       10
1     2    2022-W02      2       5
1     3    2022-W03      3       8
1     5    2022-W05      5       3
2     2    2022-W03      2       2
2     6    2022-W05      4       1
2     8    2022-W07      3       2

I want to fill in the missing dates for each id1 and get something like this:

id1  id2   date         col1    col2 
1     1    2022-W01      5       10
1     2    2022-W02      2       5
1     3    2022-W03      3       8
1     NA   2022-W04      NA      NA
1     5    2022-W05      5       3
2     2    2022-W03      2       2
2     NA   2022-W04      NA      NA
2     6    2022-W05      4       1
2     NA   2022-W06      NA      NA
2     8    2022-W07      3       2

I started with this code:

(
    df.groupby('id1')
      .agg(F.expr('max(date)').alias('max_date'), F.expr('min(date)').alias('min_date'))
      .withColumn('date', F.expr("explode(sequence(min_date, max_date, interval 1 week))"))
      .drop('max_date', 'min_date')
)

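The main problem is that my dates are in a special format, '2022-W01', so they are strings rather than dates, and I could not find a quick solution.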
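To make the format problem concrete, here is a minimal plain-Python sketch of the week arithmetic involved. It assumes the labels are ISO 8601 year-week strings, which the question does not state explicitly. The helper names `parse_year_week`, `format_year_week` and `week_range` are made up for illustration. It uses only the standard library (`date.fromisocalendar` needs Python 3.8+) and is meant for checking results or reasoning about the logic, not as the Spark solution itself.

```python
from datetime import date, timedelta
from typing import List


def parse_year_week(label: str) -> date:
    """Return the Monday of an ISO week label such as '2022-W05' (assumed ISO 8601)."""
    year, week = label.split("-W")
    return date.fromisocalendar(int(year), int(week), 1)


def format_year_week(day: date) -> str:
    """Format a date back into a zero-padded 'YYYY-Www' label."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def week_range(first: str, last: str) -> List[str]:
    """List every week label from `first` to `last`, inclusive."""
    day, end = parse_year_week(first), parse_year_week(last)
    labels = []
    while day <= end:
        labels.append(format_year_week(day))
        day += timedelta(weeks=1)
    return labels


print(week_range("2022-W03", "2022-W07"))
# ['2022-W03', '2022-W04', '2022-W05', '2022-W06', '2022-W07']
```

Because the range is generated from real dates, it also crosses year boundaries correctly, which simple arithmetic on the week number alone would not.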

pyspark

Source: https://stackoverflow.com/questions/75103687/reindex-pyspark-dataframe-with-dates-in-year-week-format-for-each-group

1 Answer


krugob8w #1

from pyspark.sql import Window
from pyspark.sql import functions as F

# Week number as an integer: all trailing digits of e.g. '2022-W05'.
# Assumes every date within a group falls in the same year.
week = F.regexp_extract('date', r'\d+$', 0).cast('int')
w = Window.partitionBy('id1')

new = (
    df.withColumn('first_week', F.min(week).over(w))
      .withColumn('last_week', F.max(week).over(w))
      # weeks between first and last that do not occur in the group
      .withColumn('z', F.array_except(F.sequence('first_week', 'last_week'),
                                      F.collect_list(week).over(w)))
      # one row per missing week
      .withColumn('z', F.explode('z'))
      # rebuild the date: prefix such as '2022-W' plus the zero-padded week number
      .withColumn('date', F.concat(F.regexp_replace('date', r'\d+$', ''),
                                   F.lpad(F.col('z').cast('string'), 2, '0')))
      # keep only the key and the new date, once per missing week
      .select('id1', 'date')
      .dropDuplicates()
)

# Append the missing rows to the existing df (other columns become null) and sort
df.unionByName(new, allowMissingColumns=True).sort('id1', 'date').show()

+---+----+--------+----+----+
|id1| id2|    date|col1|col2|
+---+----+--------+----+----+
|  1|   1|2022-W01|   5|  10|
|  1|   2|2022-W02|   2|   5|
|  1|   3|2022-W03|   3|   8|
|  1|null|2022-W04|null|null|
|  1|   5|2022-W05|   5|   3|
|  2|   2|2022-W03|   2|   2|
|  2|null|2022-W04|null|null|
|  2|   6|2022-W05|   4|   1|
|  2|null|2022-W06|null|null|
|  2|   8|2022-W07|   3|   2|
+---+----+--------+----+----+
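One detail to watch when extracting the week number: a pattern that matches only a single trailing digit, such as `\d$`, works for the sample data (weeks 1 to 7) but truncates two-digit weeks. The sketch below illustrates this with Python's `re` module rather than Spark, on the assumption that the two regex engines behave the same for these simple patterns. `split_label` and `rebuild` are hypothetical helper names.

```python
import re


def split_label(label, pattern):
    """Split a year-week label into (prefix, trailing digits) using `pattern`."""
    digits = re.search(pattern, label).group(0)
    prefix = re.sub(pattern, "", label)
    return prefix, digits


def rebuild(prefix, week):
    """Join a prefix such as '2022-W' with a zero-padded week number."""
    return f"{prefix}{week:02d}"


# A single-digit pattern cuts week 12 down to '2' and leaves a stray '1' in the prefix.
print(split_label("2022-W12", r"\d$"))   # ('2022-W1', '2')

# Matching all trailing digits keeps the week number intact.
print(split_label("2022-W12", r"\d+$"))  # ('2022-W', '12')

# Zero-padding keeps the labels sortable as strings.
print(rebuild("2022-W", 4))              # 2022-W04
```

Comparing the extracted weeks as integers rather than strings matters for the same reason: as strings, '10' sorts before '2'.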

Answered 2023-01-16
