Scala: filter out any rows where column A matches a value in column B

gajydyqb · published 2021-05-29 in Spark
Follow (0) | Answers (2) | Views (409)

Hi StackOverflow,

I want to drop every row of a DataFrame whose column A matches any distinct value in column B. I expected the code block below to do this, but it also seems to drop rows where column B is null, which is odd because the filter should only consider column A. How can I fix this code so it behaves as intended, i.e. drops every row whose column A matches any distinct value in column B?

import spark.implicits._

val df = Seq(
  (scala.math.BigDecimal(1), null),
  (scala.math.BigDecimal(2), scala.math.BigDecimal(1)),
  (scala.math.BigDecimal(3), scala.math.BigDecimal(4)),
  (scala.math.BigDecimal(4), null),
  (scala.math.BigDecimal(5), null),
  (scala.math.BigDecimal(6), null)
).toDF("A", "B")

// correct, has 1, 4
val to_remove = df
  .filter(df.col("B").isNotNull)
  .select(df("B"))
  .distinct()

// incorrect, returns 2, 3 instead of 2, 3, 5, 6
// (`final` is a reserved word in Scala, so the result is named finalDf here)
val finalDf = df.filter(!df.col("A").isin(to_remove.col("B")))

// 4 != 2
assert(4 === finalDf.collect().length)
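For reference, the intended behavior can be sketched with plain Scala collections (no Spark; `Option` stands in for the nullable B column):

```scala
// Plain-collections sketch of the intended logic: drop rows whose A appears
// among the non-null values of B; rows with a null B must survive untouched.
val rows: Seq[(BigDecimal, Option[BigDecimal])] = Seq(
  (BigDecimal(1), None),
  (BigDecimal(2), Some(BigDecimal(1))),
  (BigDecimal(3), Some(BigDecimal(4))),
  (BigDecimal(4), None),
  (BigDecimal(5), None),
  (BigDecimal(6), None)
)

val toRemove: Set[BigDecimal] = rows.flatMap(_._2).toSet // Set(1, 4)
val kept: Seq[BigDecimal] = rows.collect { case (a, _) if !toRemove(a) => a }

println(kept) // List(2, 3, 5, 6)
```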

xzv2uavs1#

Change the filter condition from `!df.col("A").isin(to_remove.col("B"))` to `!df.col("A").isin(to_remove.collect.map(_.getDecimal(0)):_*)`. Check the code below.

val finaldf = df
.filter(!df
         .col("A")
         .isin(to_remove.map(_.getDecimal(0)).collect:_*)
       )

scala> finaldf.show
+--------------------+--------------------+
|                   A|                   B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000|                null|
|6.000000000000000000|                null|
+--------------------+--------------------+

brccelvz2#

The `isin` function accepts a list. However, in your code you are passing a `Dataset[Row]`. According to the documentation https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.column@isin%28scala.collection.seq%29 it is declared as `def isin(list: Any*): Column`. You first need to extract the values into a sequence and then expand that sequence into the `isin` call with `:_*`. Note that this may affect performance.

scala> val to_remove = df.filter(df.col("B").isNotNull).select(df("B")).distinct().collect.map(_.getDecimal(0))
to_remove: Array[java.math.BigDecimal] = Array(1.000000000000000000, 4.000000000000000000)

scala> val finaldf = df.filter(!df.col("A").isin(to_remove:_*))
finaldf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [A: decimal(38,18), B: decimal(38,18)]

scala> finaldf.show
+--------------------+--------------------+
|                   A|                   B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000|                null|
|6.000000000000000000|                null|
+--------------------+--------------------+
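As a side note (an alternative not shown in either answer): when `to_remove` may be large, collecting it to the driver can be avoided entirely with a `left_anti` join, which keeps exactly the rows of `df` whose A has no match among the distinct non-null values of B. A sketch, assuming the `df` from the question is in scope:

```scala
// left_anti join keeps rows of df with no matching entry in toRemoveDf,
// so rows whose B is null are unaffected and only A values found in B drop out.
val toRemoveDf = df
  .filter(df("B").isNotNull)
  .select(df("B").as("rm"))
  .distinct()

val keptDf = df.join(toRemoveDf, df("A") === toRemoveDf("rm"), "left_anti")
// keptDf contains the rows with A = 2, 3, 5, 6
```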
