PySpark: join a DataFrame if the value of one column exists as a substring in another DataFrame

Posted by piah890a on 2022-11-01 in Spark

I have a DataFrame df1 as shown below:

and another DataFrame df2 as shown below:

How can I left-join df2 with df1 so that the output looks like the following?

Source: https://stackoverflow.com/questions/74152065/join-the-dataframe-if-value-of-one-column-exists-as-substring-in-another-datafra

2 Answers

rekjcdws1#

You can `split` the values in df1 and `explode` them before joining, so that each `;`-separated token gets its own row.

df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))
df4 = df2.join(df3, 'Value', 'left')

Full example:

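from pyspark.sql import SparkSession, functions as F

# Reuses the active session if one exists (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()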
df1 = spark.createDataFrame([('apple;banana', 150), ('carrot', 20)], ['Value', 'Amount'])
df2 = spark.createDataFrame([('apple',), ('orange',)], ['Value'])

df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))
df4 = df2.join(df3, 'Value', 'left')

df4.show()

# +------+------+
# | Value|Amount|
# +------+------+
# | apple|   150|
# |orange|  null|
# +------+------+
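
To make the row-level effect of the split/explode step easier to see, here is a minimal plain-Python sketch that emulates it on the same sample data. It does not use Spark at all; the names `df1_rows`, `df2_rows`, `exploded` and `joined` are made up for this illustration.

```python
# Plain-Python emulation of split -> explode -> left join (no Spark needed).
df1_rows = [("apple;banana", 150), ("carrot", 20)]
df2_rows = [("apple",), ("orange",)]

# split + explode: one output row per ';'-separated token
exploded = [
    (token, amount)
    for value, amount in df1_rows
    for token in value.split(";")
]

# left join from df2: keep every df2 row, fill None when nothing matches
joined = []
for (value,) in df2_rows:
    matches = [amount for token, amount in exploded if token == value]
    if matches:
        joined.extend((value, amount) for amount in matches)
    else:
        joined.append((value, None))

print(exploded)
print(joined)
```

Note that if the same token appeared in several df1 rows, the left join would return one output row per match, in Spark as well as in this sketch.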

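**Handling nulls.** If you want rows whose "Value" column is null in both DataFrames to join successfully, you need to use the `eqNullSafe` equality. Joining on this condition normally keeps the "Value" column from both DataFrames in the output DataFrame. So, to drop one of them explicitly, I suggest using `alias` on the DataFrames.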

from pyspark.sql import functions as F
df1 = spark.createDataFrame([('apple;banana', 150), (None, 20)], ['Value', 'Amount'])
df2 = spark.createDataFrame([('apple',), ('orange',), (None,)], ['Value'])

df3 = df1.withColumn('Value', F.explode(F.coalesce(F.split('Value', ';'), F.array(F.lit(None)))))
df4 = df2.alias('a').join(
    df3.alias('b'),
    df2.Value.eqNullSafe(df3.Value),
    'left'
).drop(F.col('b.Value'))

df4.show()

# +------+------+
# | Value|Amount|
# +------+------+
# | apple|   150|
# |  null|    20|
# |orange|  null|
# +------+------+
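
For readers wondering why a plain equality join would not match the null rows, the following is a rough plain-Python sketch of the two comparison semantics, with `None` standing in for SQL NULL. The function names are hypothetical and this only emulates the behavior; it is not Spark's implementation.

```python
def sql_equals(a, b):
    """Emulates SQL `a = b`: the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_null_safe_equals(a, b):
    """Emulates SQL `a <=> b` (eqNullSafe): two NULLs compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


# A join keeps a pair of rows only when the condition is exactly True.
print(sql_equals(None, None))            # unknown, so the rows do not join
print(sql_null_safe_equals(None, None))  # True, so the rows join
```

This is why the `(None, 20)` row from df1 pairs up with the null row of df2 in the example above.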

2022-11-01

6yt4nkrj2#

Use the SQL `like` operator in a left outer join. Try the following:

//Input

spark.sql(" select 'apple;banana' value,  150 amount union all  select 'carrot', 50 ").createOrReplaceTempView("df1")
spark.sql(" select 'apple' value union all  select 'orange' ").createOrReplaceTempView("df2")

//Output

spark.sql("""
select a.value, b.amount 
   from df2 a 
   left join df1 b 
   on ';'||b.value||';' like '%;'||a.value||';%' 
""").show(false)

+------+------+
|value |amount|
+------+------+
|apple |150   |
|orange|null  |
+------+------+
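
The reason for wrapping both sides in `;` may not be obvious, so here is a small plain-Python sketch of the idea. It assumes `;` is the only separator, and it ignores the fact that `like` would treat `%` and `_` inside `a.value` as wildcards. The function names are hypothetical.

```python
def naive_match(token, value):
    # Plain substring test, similar to `b.value like '%' || a.value || '%'`
    return token in value


def delimited_match(token, value, sep=";"):
    # Wrap both sides in the separator so only whole tokens match,
    # similar to `';'||b.value||';' like '%;'||a.value||';%'`
    return f"{sep}{token}{sep}" in f"{sep}{value}{sep}"


print(naive_match("apple", "pineapple;banana"))      # substring hit on a partial token
print(delimited_match("apple", "pineapple;banana"))  # no whole-token match
print(delimited_match("apple", "apple;banana"))      # whole-token match
```

Without the delimiters, `apple` would also join to a row containing `pineapple`, which is probably not what the question wants.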

2022-11-01
