PySpark: filter a DataFrame based on fields and a time range in another DataFrame

Posted by ecbunoof on 2021-07-09 in Spark

I want to select the rows in df1 that satisfy all of the following conditions:

1) df1.number == df2.number
2) df1.timestamp >= df2.startdate
3) df1.timestamp <= df2.enddate

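The two DataFrames have different columns but share the number column:

df1

number  timestamp
213     2002-10-26T07:55:34
432     2020-11-26T07:55:34

df2

number  startdate   enddate
213     2002-10-26  2020-10-28
432     2020-10-13  2020-11-26

I can't figure it out. I thought a left semi join combined with a filter/where clause should do it, but it doesn't work: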

df3 = df1.join(df2, on=['number'], how='left_semi').where((df1.timestamp >= df2.startdate) & (df1.timestamp <= df2.enddate))

Thanks for your input!

Answer #1 (yfwxisqw)

// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import spark.implicits._
import org.apache.spark.sql.functions.to_date

val df1 = Seq((213, "2020-10-26T07:55:34"), (432, "2020-11-26T07:55:34"))
  .toDF("number", "timestamp")

val df2 = Seq(
  (213, "2020-10-26", "2020-10-28"),
  (432, "2020-10-13", "2020-11-26")
).toDF("number", "startdate", "enddate")

val df3 = df1.join(df2,
  df1.col("number") === df2.col("number") &&
    to_date(df1.col("timestamp")) >= df2.col("startdate") &&
    to_date(df1.col("timestamp")) <= df2.col("enddate")
  , "left_semi")

df3.show(false)
//  +------+-------------------+
//  |number|timestamp          |
//  +------+-------------------+
//  |213   |2020-10-26T07:55:34|
//  |432   |2020-11-26T07:55:34|
//  +------+-------------------+

Answer #2 (ej83mcc0)

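You cannot apply the filter after the join, because df2's columns no longer exist in the result: a left semi join returns only the columns of df1.
Instead, put all of the conditions into the on part of the join: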

df3 = df1.join(df2, (df1.number == df2.number) & (df1.timestamp >= df2.startdate) & (df1.timestamp <= df2.enddate), how='left_semi')
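
To make the semantics concrete without a Spark session, here is a minimal plain-Python sketch of what a left semi join with a range condition keeps. It is only a model of the behavior, not Spark code: the helper names `matches` and `left_semi_join` are made up for illustration, the sample values come from answer #1, and the row with number 999 is a hypothetical addition to show a non-matching case.

```python
from datetime import date, datetime

# Sample rows mirroring df1 and df2 (values from answer #1).
df1_rows = [
    {"number": 213, "timestamp": "2020-10-26T07:55:34"},
    {"number": 432, "timestamp": "2020-11-26T07:55:34"},
    {"number": 999, "timestamp": "2020-11-01T00:00:00"},  # hypothetical: no match in df2
]
df2_rows = [
    {"number": 213, "startdate": "2020-10-26", "enddate": "2020-10-28"},
    {"number": 432, "startdate": "2020-10-13", "enddate": "2020-11-26"},
]


def matches(left, right):
    """The three conditions from the question, compared as calendar dates."""
    day = datetime.fromisoformat(left["timestamp"]).date()
    start = date.fromisoformat(right["startdate"])
    end = date.fromisoformat(right["enddate"])
    return left["number"] == right["number"] and start <= day <= end


def left_semi_join(left_rows, right_rows, condition):
    """Keep a left row (once) if at least one right row satisfies the condition.

    Only left-side fields survive, which is why a later filter cannot
    refer to startdate or enddate.
    """
    return [l for l in left_rows if any(condition(l, r) for r in right_rows)]


result = left_semi_join(df1_rows, df2_rows, matches)
for row in result:
    print(row)
```

Because the right-hand rows are only consulted inside the condition, every test that needs startdate or enddate has to live in the join condition itself.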

It is also best to convert the columns to the same type before comparing them, for example:

import pyspark.sql.functions as F

df3 = df1.join(df2, 
    (df1.number == df2.number) & 
    (F.to_date(df1.timestamp) >= F.to_date(df2.startdate)) & 
    (F.to_date(df1.timestamp) <= F.to_date(df2.enddate)),
    'left_semi'
)
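
Why the explicit conversion matters can be illustrated in plain Python with the second sample row, whose timestamp falls on the end date itself. This is a sketch of the comparison logic only: it assumes a bare date is read as midnight at the start of that day when compared with a timestamp, and it does not reproduce Spark's own type-coercion rules, which is exactly the ambiguity the explicit `to_date` calls avoid.

```python
from datetime import date, datetime

ts = datetime.fromisoformat("2020-11-26T07:55:34")
end = date.fromisoformat("2020-11-26")

# Assumption: the end date is treated as 2020-11-26 00:00:00.
end_as_midnight = datetime.combine(end, datetime.min.time())

# Compared as timestamps, 07:55:34 is already past midnight, so the row is out of range.
within_as_timestamp = ts <= end_as_midnight

# Compared as calendar dates, the time of day is dropped and the row is in range.
within_as_date = ts.date() <= end

# Compared as raw strings, the longer string with the same prefix sorts after the shorter one.
within_as_string = "2020-11-26T07:55:34" <= "2020-11-26"

print(within_as_timestamp, within_as_date, within_as_string)
```

Truncating both sides to dates makes the end date inclusive for the whole day, which matches the result shown in answer #1.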
