How can I merge these two DataFrames in Spark Scala to produce a third DataFrame?

jchrr9hc posted on 2021-05-16 in Spark

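I'm having trouble joining these two DataFrame views, because specific column values can't be modified in place in Spark Scala. I think I need some kind of transpose followed by a join, but I can't work out how to do it.
Here is the first DataFrame: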

var sample_df = Seq(("john","morning","7am"),("john","night","10pm"),("bob","morning","8am"),("bob","night","11pm"),("phil","morning","9am"),("phil","night","10pm")).toDF("person","time_of_day","wake/sleep hour")


Here is the second DataFrame:

var sample_df2 = Seq(("john","6am","11pm"),("bob","7am","2am"),("phil","8am","1am")).toDF("person","morning_earliest","night_latest")


Here is the DataFrame I would like to produce:

var resulting_df = Seq(("john","morning","7am","6am"),("john","night","10pm","11pm"),("bob","morning","8am","7am"),("bob","night","11pm","2am"),("phil","morning","9am","8am"),("phil","night","10pm","1am")).toDF("person","time_of_day","wake/sleep hour","earliest/latest")


Any help would be greatly appreciated! Thanks, and have a nice day!

0ejtzxu1 #1

sample_df.createOrReplaceTempView("df1")
sample_df2.createOrReplaceTempView("df2")

spark.sql("""
select person, time_of_day, `wake/sleep hour`, `earliest/latest`
from (
    select person, stack(2, 'morning', morning_earliest, 'night', night_latest) as (time_of_day, `earliest/latest`)
    from df2
) df
join df1
using (time_of_day, person)
""").show()

+------+-----------+---------------+---------------+
|person|time_of_day|wake/sleep hour|earliest/latest|
+------+-----------+---------------+---------------+
|  john|    morning|            7am|            6am|
|  john|      night|           10pm|           11pm|
|   bob|    morning|            8am|            7am|
|   bob|      night|           11pm|            2am|
|  phil|    morning|            9am|            8am|
|  phil|      night|           10pm|            1am|
+------+-----------+---------------+---------------+
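
The query above unpivots the second DataFrame with `stack`, turning the `morning_earliest` and `night_latest` columns into one row per `time_of_day`, and then joins the result to the first DataFrame on `person` and `time_of_day`. For readers who prefer the DataFrame API to temp views, the same idea can be sketched as follows. This is a minimal sketch rather than the answerer's code: the object name `StackJoinSketch`, the helper `withEarliestLatest`, and the local `SparkSession` settings are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr}

// Hypothetical wrapper object; in spark-shell only the helper body is needed.
object StackJoinSketch {

  // Unpivot df2 with stack(), then inner-join it to df1 on (person, time_of_day).
  def withEarliestLatest(df1: DataFrame, df2: DataFrame): DataFrame = {
    val unpivoted = df2.select(
      col("person"),
      // Two output rows per input row: ('morning', morning_earliest) and ('night', night_latest).
      expr("stack(2, 'morning', morning_earliest, 'night', night_latest) as (time_of_day, `earliest/latest`)")
    )
    // Joining on a Seq of column names keeps a single copy of each join column.
    df1.join(unpivoted, Seq("person", "time_of_day"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption); spark-shell already provides `spark`.
    val spark = SparkSession.builder().master("local[*]").appName("stack-join-sketch").getOrCreate()
    import spark.implicits._

    val sample_df = Seq(("john","morning","7am"),("john","night","10pm"),("bob","morning","8am"),("bob","night","11pm"),("phil","morning","9am"),("phil","night","10pm")).toDF("person","time_of_day","wake/sleep hour")
    val sample_df2 = Seq(("john","6am","11pm"),("bob","7am","2am"),("phil","8am","1am")).toDF("person","morning_earliest","night_latest")

    withEarliestLatest(sample_df, sample_df2).show()
    spark.stop()
  }
}
```

A join does not guarantee row order, so add an explicit `orderBy` if the rows must come out in a particular sequence.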
