PySpark: comparing substrings across different columns

bvpmtnay  posted on 2023-01-12  in Spark

I have a DataFrame like this:

+----------------------+--------------------------------------------------+-------------------+
| column_1             | column_2                                         | Required_column   |
+----------------------+--------------------------------------------------+-------------------+
|K12B-45-84-6          |K12B-02-36-504, I05O-21-65-312, A301-21-25-363    | True              |
|J020-35-2-9           |P12K-05-31-602, M002-22-22-636,L630-51-32-544     | False             |
|L006-85-00-694        |M10P-22-94-349,L006-85-00-694, I553-35-12-240     | True              |
|M002-22-36-989        |U985-12-45-363,    M002-19-14-964                 | True              |
+----------------------+--------------------------------------------------+-------------------+

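Explanation: column_1 and column_2 are both strings. For ease of understanding, let's call the values in the DataFrame "switches". column_1 always holds exactly one switch value per row, while column_2 may hold several. The True or False result should come from comparing only the first 4 characters (for example, K12B == K12B in the first row).
Note: although the switch values in column_2 are comma-separated, the separator is never consistent (sometimes there are one or two extra spaces, and so on). One hint is that every switch value in column_1 or column_2 begins with a letter, so the logic needs to rely on that.
The goal is to produce Required_column, which returns True or False. A PySpark solution is needed.
Thanks in advance.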
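
For reference, the rule described in the question can be sketched in plain Python before porting it to Spark. This is only an illustration of the intended logic, not a PySpark solution: the helper name `has_matching_prefix` and the regular expression are assumptions introduced here, and the regex treats any run of non-comma, non-space characters that begins with a letter as one switch value.

```python
import re

# Assumption: a switch value is a run of characters without commas or
# whitespace that begins with a letter (the hint given in the question).
SWITCH_RE = re.compile(r"[A-Za-z][^,\s]*")


def has_matching_prefix(column_1: str, column_2: str, n: int = 4) -> bool:
    """Return True if any switch in column_2 shares its first n characters with column_1."""
    prefix = column_1.strip()[:n]
    return any(switch[:n] == prefix for switch in SWITCH_RE.findall(column_2))


# The four sample rows from the question.
rows = [
    ("K12B-45-84-6", "K12B-02-36-504, I05O-21-65-312, A301-21-25-363"),
    ("J020-35-2-9", "P12K-05-31-602, M002-22-22-636,L630-51-32-544"),
    ("L006-85-00-694", "M10P-22-94-349,L006-85-00-694, I553-35-12-240"),
    ("M002-22-36-989", "U985-12-45-363,    M002-19-14-964"),
]

results = [has_matching_prefix(c1, c2) for c1, c2 in rows]
print(results)  # [True, False, True, True]
```

The results agree with Required_column in the table above. Because the regex skips over commas and spaces, the inconsistent separators in column_2 do not matter.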

bmp9r5qi  #1

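Here is a solution that uses PySpark's substring and contains functions. With it you don't need to worry about how clean column_2 is; you only need to make sure column_1 is clean: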

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuse the active session if one exists (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

data = [
    ("K12B-45-84-6", "K12B-02-36-504, I05O-21-65-312, A301-21-25-363"),
    ("J020-35-2-9", "P12K-05-31-602, M002-22-22-636,L630-51-32-544"),
    ("L006-85-00-694", "M10P-22-94-349,L006-85-00-694, I553-35-12-240"),
    ("M002-22-36-989", "U985-12-45-363,    M002-19-14-964")]

columns = ["column_1", "column_2"]
df = spark.createDataFrame(data=data, schema=columns)

# True when the first 4 characters of column_1 occur anywhere in column_2
df = df.withColumn(
    "Required_column",
    F.when(
        F.col("column_2").contains(F.substring(F.col("column_1"), 1, 4)), True
    ).otherwise(False),
)
df.show()

Output:

+--------------+--------------------+---------------+
|      column_1|            column_2|Required_column|
+--------------+--------------------+---------------+
|  K12B-45-84-6|K12B-02-36-504, I...|           true|
|   J020-35-2-9|P12K-05-31-602, M...|          false|
|L006-85-00-694|M10P-22-94-349,L0...|           true|
|M002-22-36-989|U985-12-45-363,  ...|           true|
+--------------+--------------------+---------------+
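
One caveat worth noting: a plain substring test also matches when the 4-character prefix appears in the middle of a value in column_2, not only at the start of a switch. The plain-Python sketch below contrasts the two checks. The function names and the extra row are hypothetical, made up here for illustration; the same idea might be ported to Spark with a regular expression or a UDF, which is not shown or tested here.

```python
import re


def substring_check(column_1: str, column_2: str, n: int = 4) -> bool:
    """Mirror of the contains-based test: prefix found anywhere in column_2."""
    return column_1[:n] in column_2


def token_start_check(column_1: str, column_2: str, n: int = 4) -> bool:
    """Stricter test: prefix must sit at the start of the string or right after a comma or whitespace."""
    pattern = r"(?:^|[,\s])" + re.escape(column_1[:n])
    return re.search(pattern, column_2) is not None


# The four sample rows from the question: both checks agree on all of them.
sample_rows = [
    ("K12B-45-84-6", "K12B-02-36-504, I05O-21-65-312, A301-21-25-363"),
    ("J020-35-2-9", "P12K-05-31-602, M002-22-22-636,L630-51-32-544"),
    ("L006-85-00-694", "M10P-22-94-349,L006-85-00-694, I553-35-12-240"),
    ("M002-22-36-989", "U985-12-45-363,    M002-19-14-964"),
]
agree = all(substring_check(a, b) == token_start_check(a, b) for a, b in sample_rows)

# Hypothetical row (not from the question) where the prefix "A123"
# appears only in the middle of a value.
odd_row = ("A123-45-67-890", "ZA123-11-22-333, B456-00-00-000")
loose = substring_check(*odd_row)
strict = token_start_check(*odd_row)
print(agree, loose, strict)  # True True False
```

For data shaped like the sample, where letters appear only in the first four characters of each switch, the two checks give the same answer, so the contains-based solution is sufficient there.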
