How to join DataFrames in Apache Spark

rslzwgfq · asked on 2021-06-29 · Hive

I have the following two DataFrames:
df1

uid   text   frequency
11    a      1
12    a      2
12    b      1

df2

text
a
b
c
d

I want to create a DataFrame that looks like this:
output df

uid  text  frequency
11   a     1
11   b     0
11   c     0
11   d     0
12   a     2
12   b     1
12   c     0
12   d     0

I have been writing the join with Spark SQL like this:

sqlContext.sql("Select uid,df2.text,frequency from df1  right outer join df2 on df1.text= df2.text")

It doesn't return the correct result.
Any suggestions?

myzjeezk 1#

You have to do it like this. The right outer join alone can't work: texts missing from df1 come back with a null uid, so they are never repeated once per uid. You first need every (uid, text) combination, and only then join the frequencies back in:

// Imports needed for coalesce/lit and the $ column syntax
import org.apache.spark.sql.functions.{coalesce, lit}
import spark.implicits._

// Build every (uid, text) combination with an explicit cartesian product
df1.select("uid").distinct.crossJoin(df2.distinct)
  // Left join back to df1 to pick up the known frequencies
  .join(df1, Seq("uid", "text"), "leftouter")
  // Replace the missing frequencies with 0
  .withColumn("frequency", coalesce($"frequency", lit(0)))
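
For completeness, here is a minimal self-contained sketch that runs this end to end. The SparkSession setup and the outputDf name are illustrative additions, not from the original post; the input data mirrors the question's tables:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Inputs copied from the question
val df1 = Seq((11, "a", 1), (12, "a", 2), (12, "b", 1)).toDF("uid", "text", "frequency")
val df2 = Seq("a", "b", "c", "d").toDF("text")

val outputDf = df1.select("uid").distinct
  .crossJoin(df2.distinct)
  .join(df1, Seq("uid", "text"), "leftouter")
  .withColumn("frequency", coalesce($"frequency", lit(0)))

// Sorting just makes the output match the table in the question
outputDf.orderBy("uid", "text").show()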

This is roughly equivalent to the following SQL:

WITH tmp AS (SELECT DISTINCT df1.uid, df2.text FROM df1 CROSS JOIN df2)
SELECT tmp.uid, tmp.text, COALESCE(df1.frequency, 0) AS frequency
FROM tmp LEFT OUTER JOIN df1
ON tmp.uid = df1.uid AND tmp.text = df1.text
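
If you would rather run the SQL version, the two DataFrames have to be registered as temporary views first. A sketch assuming the Spark 2+ API (the question's sqlContext is the older entry point; spark here is an assumed SparkSession):

// Register the DataFrames so the SQL can refer to them by name
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

spark.sql("""
  WITH tmp AS (SELECT DISTINCT df1.uid, df2.text FROM df1 CROSS JOIN df2)
  SELECT tmp.uid, tmp.text, COALESCE(df1.frequency, 0) AS frequency
  FROM tmp LEFT OUTER JOIN df1
  ON tmp.uid = df1.uid AND tmp.text = df1.text
""").show()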
