python - How to loop one DataFrame against another DataFrame and get a single matching record in PySpark

Asked by 9jyewag0 on 2021-05-27 in Spark

Dataframe1

+----+--------+------------+
|key |dc_count|dc_day_count|
+----+--------+------------+
|123 |13      |66          |
|124 |13      |12          |
+----+--------+------------+

Rules DataFrame

+----+-------------+--------------+--------+
|key |rule_dc_count|rule_day_count|rule_out|
+----+-------------+--------------+--------+
|123 |2            |30            |139     |
|123 |null         |null          |64      |
|124 |2            |30            |139     |
|124 |null         |null          |64      |
+----+-------------+--------------+--------+

If dc_count > rule_dc_count and dc_day_count > rule_day_count, take the corresponding rule_out; otherwise fall back to the other rule_out for that key. For example, key 123 satisfies both conditions (13 > 2 and 66 > 30), so its rule_out is 139, while key 124 fails the day-count check (12 > 30 is false) and falls back to 64.
Expected output

+----+--------+
|key |rule_out|
+----+--------+
|123 |139     |
|124 |64      |
+----+--------+

11dmarpk1#

PySpark version
The challenge here is to fetch the value from the second row of the same column for a given key; the lead() analytic function solves this.
Creating the DataFrames:

from pyspark.sql import functions as F
# DataFrame 1 and the rules DataFrame (the question's nulls are represented as 0 here)
df = spark.createDataFrame([(123,13,66),(124,13,12)], ["key","dc_count","dc_day_count"])
df1 = spark.createDataFrame([(123,2,30,139),(123,0,0,64),(124,2,30,139),(124,0,0,64)],
                            ["key","rule_dc_count","rule_day_count","rule_out"])

Logic to get the desired result

from pyspark.sql import Window as W

# for each key, pull the next row's rule_out into the current row via lead()
_w = W.partitionBy('key').orderBy(F.col('key').desc())
df1 = df1.withColumn('rn', F.lead('rule_out').over(_w))
df1 = df1.join(df, 'key', 'left')

# use rule_out when both counts exceed the rule thresholds, otherwise the next row's rule_out
df1 = df1.withColumn('condition_col',
                     F.when(
                         (F.col('dc_count') > F.col('rule_dc_count')) &
                         (F.col('dc_day_count') > F.col('rule_day_count')),
                         F.col('rule_out'))
                     .otherwise(F.col('rn')))

# keep only the row per key whose lead() value is not null (the first of the two rule rows)
df1 = df1.filter(F.col('rn').isNotNull())

Output

df1.show()
+---+-------------+--------------+--------+---+--------+------------+-------------+
|key|rule_dc_count|rule_day_count|rule_out| rn|dc_count|dc_day_count|condition_col|
+---+-------------+--------------+--------+---+--------+------------+-------------+
|124|            2|            30|     139| 64|      13|          12|           64|
|123|            2|            30|     139| 64|      13|          66|          139|
+---+-------------+--------------+--------+---+--------+------------+-------------+
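
The condition_col column above already holds the desired value per key. If only the key and rule_out columns from the expected output are needed, one possible final step (not part of the original answer, reusing the names above) is:

result = df1.select('key', F.col('condition_col').alias('rule_out'))
result.show()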

z2acfund2#

Assuming the expected output is -

+---+--------+
|key|rule_out|
+---+--------+
|123|139     |
+---+--------+

The query below should work -

spark.sql(
      """
        |SELECT
        | t1.key, t2.rule_out
        |FROM table1 t1 join table2 t2 on t1.key=t2.key and
        |t1.dc_count > t2.rule_dc_count and t1.dc_day_count > t2.rule_day_count
      """.stripMargin)
      .show(false)
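
The snippet above is Scala-style (note stripMargin) and assumes that table1 and table2 have already been registered as views. A rough PySpark equivalent, assuming the df and df1 DataFrames from the first answer, might look like this:

# register the DataFrames as temp views (the names table1/table2 are assumed here)
df.createOrReplaceTempView("table1")
df1.createOrReplaceTempView("table2")

spark.sql("""
    SELECT t1.key, t2.rule_out
    FROM table1 t1
    JOIN table2 t2
      ON t1.key = t2.key
     AND t1.dc_count > t2.rule_dc_count
     AND t1.dc_day_count > t2.rule_day_count
""").show(truncate=False)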
