如何只选择至少有一个值与前一行不同的行

ngynwnxp  于 2021-05-18  发布在  Spark
关注(0)|答案(3)|浏览(420)

如果我有一个每天都以快照的形式跟踪值的表,我怎么能只选择每个id至少有一个值更改的行,当然不包括日期。每个id的第一行也应该包含在结果中。
例如,使用此表:

id    phone_number    email          date
1     12345           a@gmail.com    2020-01-01
1     12345           a@gmail.com    2020-01-02
1     23456           a@gmail.com    2020-01-03
1     34567           a@gmail.com    2020-01-04
1     34567           a@gmail.com    2020-01-05
1     45678           a@gmail.com    2020-01-06
1     45678           a@gmail.com    2020-01-07
2     56789           b@gmail.com    2020-01-01
2     56789           b@gmail.com    2020-01-02
2     56789           c@gmail.com    2020-01-03
2     67890           c@gmail.com    2020-01-04
2     67890           c@gmail.com    2020-01-05
3     78901           d@gmail.com    2020-01-01
3     78901           d@gmail.com    2020-01-02
3     78901           d@gmail.com    2020-01-03

将返回此结果:

id    phone_number    email          date
1     12345           a@gmail.com    2020-01-01
1     23456           a@gmail.com    2020-01-03
1     34567           a@gmail.com    2020-01-04
1     45678           a@gmail.com    2020-01-06
2     56789           b@gmail.com    2020-01-01
2     56789           c@gmail.com    2020-01-03
2     67890           c@gmail.com    2020-01-04
3     78901           d@gmail.com    2020-01-01
htrmnn0y

htrmnn0y1#

select distinct  phone_number,email from table
nvbavucw

nvbavucw2#

我将其作为sql查询使用,我可以这样运行:

df.createOrReplaceTempView("df")

df = spark.sql(
    """
    SELECT  a.*
    FROM (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS row, id, phone_number, email, date FROM df) AS a
    LEFT JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS row, id, phone_number, email, date FROM df) AS b
    ON a.row = b.row + 1 AND a.id = b.id 
    WHERE a.phone_number != b.phone_number OR b.phone_number IS NULL OR a.email != b.email OR b.email IS NULL
    """
)

这很好用,现在我只想把它翻译成原生的pyspark代码。

toe95027

toe950273#

iiuc,您需要使用带有行号()的窗口函数,并且只取第一个值-

在这里创建df

from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.window import Window as W
df = spark.createDataFrame([(1,12345,"a@gmail.com","2020-01-01"),
                           (1,12345,"a@gmail.com","2020-01-02"),
                           (1,23456,"a@gmail.com","2020-01-03"),
                           (1,34567,"a@gmail.com","2020-01-04"),
                           (1,34567,"a@gmail.com","2020-01-05"),
                           (1,45678,"a@gmail.com","2020-01-06"),
                           (1,45678,"a@gmail.com","2020-01-07"),
                           (2,56789,"b@gmail.com","2020-01-01"),
                           (2,56789,"b@gmail.com","2020-01-02"),
                           (2,56789,"c@gmail.com","2020-01-03"),
                           (2,67890,"c@gmail.com","2020-01-04"),
                           (2,67890,"c@gmail.com","2020-01-05"),
                           (3,78901,"d@gmail.com","2020-01-01"),
                           (3,78901,"d@gmail.com","2020-01-02"),
                           (3,78901,"d@gmail.com","2020-01-03"),
                           ],[ "id","phone_number","email","date"])
df.show()

输出

+---+------------+-----------+----------+
| id|phone_number|      email|      date|
+---+------------+-----------+----------+
|  1|       12345|a@gmail.com|2020-01-01|
|  1|       12345|a@gmail.com|2020-01-02|
|  1|       23456|a@gmail.com|2020-01-03|
|  1|       34567|a@gmail.com|2020-01-04|
|  1|       34567|a@gmail.com|2020-01-05|
|  1|       45678|a@gmail.com|2020-01-06|
|  1|       45678|a@gmail.com|2020-01-07|
|  2|       56789|b@gmail.com|2020-01-01|
|  2|       56789|b@gmail.com|2020-01-02|
|  2|       56789|c@gmail.com|2020-01-03|
|  2|       67890|c@gmail.com|2020-01-04|
|  2|       67890|c@gmail.com|2020-01-05|
|  3|       78901|d@gmail.com|2020-01-01|
|  3|       78901|d@gmail.com|2020-01-02|
|  3|       78901|d@gmail.com|2020-01-03|
+---+------------+-----------+----------+

逻辑在这里

_w = W.partitionBy("id","phone_number","email").orderBy("date")
df = df.withColumn("rnk", F.row_number().over(_w))
df = df.filter(F.col("rnk") ==F.lit(1))
df = df.orderBy("id", "phone_number")
df.show()
+---+------------+-----------+----------+---+
| id|phone_number|      email|      date|rnk|
+---+------------+-----------+----------+---+
|  1|       12345|a@gmail.com|2020-01-01|  1|
|  1|       23456|a@gmail.com|2020-01-03|  1|
|  1|       34567|a@gmail.com|2020-01-04|  1|
|  1|       45678|a@gmail.com|2020-01-06|  1|
|  2|       56789|c@gmail.com|2020-01-03|  1|
|  2|       56789|b@gmail.com|2020-01-01|  1|
|  2|       67890|c@gmail.com|2020-01-04|  1|
|  3|       78901|d@gmail.com|2020-01-01|  1|
+---+------------+-----------+----------+---+

相关问题