Joining two DataFrames in Spark Scala based on OR conditions

wlwcrazw  posted on 2021-05-27 in Spark

I have two DataFrames: 1) accounts and 2) customers. The accounts schema is as follows:

Name  Id  Telephone  Mob   email
AR    1   123        1234  test1@gmail.com
BR    2   213        4123  test2@gmail.com
CR    3   231        3214  test3@gmail.com
KR    4   132        1324  test4@gmail.com

The second table, customers, is:

Id  Phone  Email
2   2344   testq@gmail.com
6   132    testf@gmail.com
7   64562  test1@gmail.com

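I need to join these two DataFrames where Id matches Id, OR Phone matches Telephone or Mob, OR Email matches email. In the example above, the first customer row matches on Id, the second on phone, and the third on email. The join should be a left join that keeps all account records.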
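
To make the matching rule concrete, here is a minimal sketch in plain Scala collections (no Spark involved). The case classes `Account` and `Customer`, the object `OrJoinSketch`, and the helpers `matches` and `leftJoin` are hypothetical names used only for illustration; the data is the sample data from the question.

```scala
// Hypothetical case classes mirroring the two tables in the question.
final case class Account(name: String, id: Int, telephone: Int, mob: Int, email: String)
final case class Customer(id: Int, phone: Int, email: String)

object OrJoinSketch {
  val accounts: Seq[Account] = Seq(
    Account("AR", 1, 123, 1234, "test1@gmail.com"),
    Account("BR", 2, 213, 4123, "test2@gmail.com"),
    Account("CR", 3, 231, 3214, "test3@gmail.com"),
    Account("KR", 4, 132, 1324, "test4@gmail.com")
  )

  val customers: Seq[Customer] = Seq(
    Customer(2, 2344, "testq@gmail.com"),
    Customer(6, 132, "testf@gmail.com"),
    Customer(7, 64562, "test1@gmail.com")
  )

  // A customer matches an account if any one of the OR-ed conditions holds.
  def matches(a: Account, c: Customer): Boolean =
    a.id == c.id || a.telephone == c.phone || a.mob == c.phone || a.email == c.email

  // Left join: every account appears at least once; unmatched accounts pair with None.
  def leftJoin(accts: Seq[Account], custs: Seq[Customer]): Seq[(Account, Option[Customer])] =
    accts.flatMap { a =>
      val hits = custs.filter(c => matches(a, c))
      if (hits.isEmpty) Seq(a -> Option.empty[Customer])
      else hits.map(c => a -> Option(c))
    }

  def main(args: Array[String]): Unit =
    leftJoin(accounts, customers).foreach { case (a, c) =>
      println(s"${a.name} -> customer id ${c.map(_.id.toString).getOrElse("null")}")
    }
}
```

With the sample data, AR pairs with customer 7 (email), BR with customer 2 (Id), KR with customer 6 (telephone), and CR has no match. Note that an account matching several customers would appear once per matching customer.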

368yc8dk  #1

You can meet this requirement easily with Spark SQL.
Reference code:

import org.apache.spark.sql.functions._

// Assumes spark-shell, where sc, sql and the toDF implicits are already in scope.
val accountdf = sc.parallelize(Seq(
  ("AR", 1, 123, 1234, "test1@gmail.com"),
  ("BR", 2, 213, 4123, "test2@gmail.com"),
  ("CR", 3, 231, 3214, "test3@gmail.com"),
  ("KR", 4, 132, 1324, "test4@gmail.com")
)).toDF("name", "id", "telephone", "mob", "email")
accountdf.createOrReplaceTempView("account")

val customerdf = sc.parallelize(Seq(
  (2, 2344, "testq@gmail.com"),
  (6, 132, "testf@gmail.com"),
  (7, 64562, "test1@gmail.com")
)).toDF("id", "phone", "email")
customerdf.createOrReplaceTempView("customer")

// The left join keeps every account row; any one of the OR-ed conditions is enough to match.
sql("""select * from account a left join customer c
       on a.id = c.id or (a.telephone = c.phone or a.mob = c.phone) or a.email = c.email""").show(false)

+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email          |id  |phone|email          |
+----+---+---------+----+---------------+----+-----+---------------+
|BR  |2  |213      |4123|test2@gmail.com|2   |2344 |testq@gmail.com|
|KR  |4  |132      |1324|test4@gmail.com|6   |132  |testf@gmail.com|
|AR  |1  |123      |1234|test1@gmail.com|7   |64562|test1@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|null|null |null           |
+----+---+---------+----+---------------+----+-----+---------------+

vfwfrxfs  #2

Check the following code.

scala> accountDF.show(false)
+----+---+---------+----+---------------+
|name|id |telephone|mob |email          |
+----+---+---------+----+---------------+
|AR  |1  |123      |1234|test1@gmail.com|
|BR  |2  |213      |4123|test2@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|
|KR  |4  |132      |1324|test4@gmail.com|
+----+---+---------+----+---------------+
scala> customerDF.show(false)
+---+-----+---------------+
|id |phone|email          |
+---+-----+---------------+
|2  |2344 |testq@gmail.com|
|6  |132  |testf@gmail.com|
|7  |64562|test1@gmail.com|
+---+-----+---------------+
scala> accountDF.printSchema
root
 |-- name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- telephone: string (nullable = true)
 |-- mob: string (nullable = true)
 |-- email: string (nullable = true)
scala> customerDF.printSchema
root
 |-- id: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)
scala> accountDF.join(
         customerDF,
         accountDF("id") === customerDF("id") ||
           (accountDF("telephone") === customerDF("phone") ||
             accountDF("mob") === customerDF("phone")) ||
           accountDF("email") === customerDF("email"),
         "left"
       ).show(false)

+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email          |id  |phone|email          |
+----+---+---------+----+---------------+----+-----+---------------+
|AR  |1  |123      |1234|test1@gmail.com|7   |64562|test1@gmail.com|
|BR  |2  |213      |4123|test2@gmail.com|2   |2344 |testq@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|null|null |null           |
|KR  |4  |132      |1324|test4@gmail.com|6   |132  |testf@gmail.com|
+----+---+---------+----+---------------+----+-----+---------------+
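
This answer does not show how `accountDF` and `customerDF` were created, and `printSchema` reports every column as a string. A minimal standalone sketch, assuming Spark is on the classpath and a local-mode `SparkSession` is acceptable, might look like the following. The object name `OrJoinApp`, the method `joinAccounts`, and the app name are illustrative and not from the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OrJoinApp {
  // Builds the two sample DataFrames with string columns and left-joins them on the OR condition.
  def joinAccounts(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables toDF on local Seqs

    // All values are strings, matching the printSchema output above.
    val accountDF = Seq(
      ("AR", "1", "123", "1234", "test1@gmail.com"),
      ("BR", "2", "213", "4123", "test2@gmail.com"),
      ("CR", "3", "231", "3214", "test3@gmail.com"),
      ("KR", "4", "132", "1324", "test4@gmail.com")
    ).toDF("name", "id", "telephone", "mob", "email")

    val customerDF = Seq(
      ("2", "2344", "testq@gmail.com"),
      ("6", "132", "testf@gmail.com"),
      ("7", "64562", "test1@gmail.com")
    ).toDF("id", "phone", "email")

    accountDF.join(
      customerDF,
      accountDF("id") === customerDF("id") ||
        accountDF("telephone") === customerDF("phone") ||
        accountDF("mob") === customerDF("phone") ||
        accountDF("email") === customerDF("email"),
      "left"
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, intended for experimentation only.
    val spark = SparkSession.builder().appName("or-join-sketch").master("local[*]").getOrCreate()
    joinAccounts(spark).show(false)
    spark.stop()
  }
}
```

The row order printed by `show` is not guaranteed, so it may differ from the output above.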

dl5txlt9  #3

val sourceDF = Seq(
  ("AR", 1, 123, 1234, "test1@gmail.com"),
  ("BR", 2, 213, 4123, "test2@gmail.com"),
  ("CR", 3, 231, 3214, "test3@gmail.com"),
  ("KR", 4, 132, 1324, "test4@gmail.com")
).toDF("Name", "Id", "Telephone", "Mob", "email")

val sourceDF2 = Seq(
  (2, 2344, "testq@gmail.com"),
  (6, 132, "testf@gmail.com"),
  (7, 64562, "test1@gmail.com")
).toDF("Id", "Phone", "Email")

// Match on Id, or on Phone against either Telephone or Mob, or on email.
val joinDF = sourceDF.join(
  sourceDF2,
  sourceDF.col("Id") === sourceDF2.col("Id") ||
    (sourceDF.col("Telephone") === sourceDF2.col("Phone") ||
      sourceDF.col("Mob") === sourceDF2.col("Phone")) ||
    sourceDF.col("email") === sourceDF2.col("Email"),
  "left" // "inner" also works, but "left" keeps the unmatched account rows the question asks for
)
