如何组合没有公共列的Dataframe?

uplii1fm  于 2021-05-17  发布在  Spark
关注(0)|答案(2)|浏览(446)

我有两个Dataframe

val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
df1.show
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  3|
|  1|  2|  3|
+---+---+---+

val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
df2.show
+---+---+---+
|  D|  E|  F|
+---+---+---+
| 11| 22| 33|
| 44| 55| 66|
+---+---+---+

我需要把上面的结合起来

val df3 = Seq(("1","2","3","","",""),("4","5","6","","",""),("","","","11","22","33"),("","","","44","55","66"))
.toDF("A","B","C","D","E","F")
df3.show
+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  1|  2|  3|   |   |   |
|  4|  5|  6|   |   |   |
|   |   |   | 11| 22| 33|
|   |   |   | 44| 55| 66|
+---+---+---+---+---+---+

现在,我正在为所有Dataframe手动创建缺少的列,以获得一个公共结构,然后使用 union . 此代码特定于Dataframe,不可伸缩
寻找一个解决方案,将与 x Dataframe y 每列

pftdvrlh

pftdvrlh1#

您可以手动在两个Dataframe中创建缺少的列,然后合并它们:

import org.apache.spark.sql.DataFrame

val allCols = df1.columns.toSet.union(df2.columns.toSet).toArray

val createMissingCols = (df: DataFrame, allCols: Array[String]) => allCols.foldLeft(df)(
  (_df, _col) => if (_df.columns.contains(_col)) _df else _df.withColumn(_col, lit(""))
).select(allCols.head, allCols.tail: _*)  
// select is needed to make sure the two data frames have the same order of columns

createMissingCols(df1, allCols).union(createMissingCols(df2, allCols)).show
+---+---+---+---+---+---+
|  E|  F|  A|  B|  C|  D|
+---+---+---+---+---+---+
|   |   |  1|  2|  3|   |
|   |   |  4|  5|  6|   |
| 22| 33|   |   |   | 11|
| 55| 66|   |   |   | 44|
+---+---+---+---+---+---+
zbq4xfa0

zbq4xfa02#

一种更简单的方法是创建一个完整的外部联接,并将联接表达式/条件设置为false:

val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")

val joined = df1.join(df2, lit(false), "full")

joined.show() 

+----+----+----+----+----+----+
|   A|   B|   C|   D|   E|   F|
+----+----+----+----+----+----+
|   1|   2|   3|null|null|null|
|   4|   5|   6|null|null|null|
|null|null|null|  11|  22|  33|
|null|null|null|  44|  55|  66|
+----+----+----+----+----+----+

如果要将空值实际设置为空字符串,只需添加:

val withEmptyString = joined.na.fill("")

withEmptyString.show() 

+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  1|  2|  3|   |   |   |
|  4|  5|  6|   |   |   |
|   |   |   | 11| 22| 33|
|   |   |   | 44| 55| 66|
+---+---+---+---+---+---+

总而言之 df1.join(df2, lit(false), "full").na.fill("") 应该会成功的。

相关问题