Determine whether a PySpark DataFrame row value exists in other columns

mrzz3bfm  posted on 2021-07-14  in  Spark

I am working with a DataFrame in PySpark and need to check, row by row, whether a value exists in the other columns of the DataFrame. For example, given this DataFrame:
df:

+---------+--------------+-------+-------+-------+
|Subject  |SubjectTotal  |TypeA  |TypeB  |TypeC  |
+---------+--------------+-------+-------+-------+
|Subject1 |10            |5      |3      |2      |
+---------+--------------+-------+-------+-------+
|Subject2 |15            |0      |15     |0      |
+---------+--------------+-------+-------+-------+
|Subject3 |5             |0      |0      |5      |
+---------+--------------+-------+-------+-------+

As output, I need to determine which Type accounts for 100% of SubjectTotal. So my output would look like this:
df_output:

+---------+--------------+
|Subject  |Type          |
+---------+--------------+
|Subject2 |TypeB         |
+---------+--------------+
|Subject3 |TypeC         |
+---------+--------------+

Is this possible?
Thanks!


kognpnkq1#

You can build a when expression for each TypeX column in a list comprehension, then pass the list of expressions to coalesce:

from pyspark.sql import functions as F

df1 = df.select(
    F.col("Subject"),
    F.coalesce(*[F.when(F.col(c) == F.col("SubjectTotal"), F.lit(c)) for c in df.columns[2:]]).alias("Type")
).filter("Type is not null")

df1.show()

# +--------+-----+
# | Subject| Type|
# +--------+-----+
# |Subject2|TypeB|
# |Subject3|TypeC|
# +--------+-----+
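To make the row-level logic of the when/coalesce combination concrete, here is a minimal plain-Python sketch of what it computes for each row. It does not use Spark; the helper name first_matching_type is hypothetical, and the rows are the sample data from the question.

```python
def first_matching_type(row, total_col, type_cols):
    """Return the first column name in type_cols whose value equals
    row[total_col], or None when no column matches (like coalesce)."""
    # Caveat: Spark treats NULL = NULL as NULL (no match), whereas this
    # sketch would treat two None values as equal.
    for c in type_cols:
        if row[c] == row[total_col]:
            return c
    return None

rows = [
    {"Subject": "Subject1", "SubjectTotal": 10, "TypeA": 5, "TypeB": 3, "TypeC": 2},
    {"Subject": "Subject2", "SubjectTotal": 15, "TypeA": 0, "TypeB": 15, "TypeC": 0},
    {"Subject": "Subject3", "SubjectTotal": 5, "TypeA": 0, "TypeB": 0, "TypeC": 5},
]
type_cols = ["TypeA", "TypeB", "TypeC"]

# Keep only rows where some type matched, mirroring the "Type is not null" filter.
result = []
for r in rows:
    t = first_matching_type(r, "SubjectTotal", type_cols)
    if t is not None:
        result.append((r["Subject"], t))

print(result)
```

Because the first match wins, a row whose SubjectTotal and Type columns are all 0 would be reported as TypeA.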

xlpyo6sf2#

You can try the when().otherwise() PySpark SQL functions or a SQL case statement:

import pyspark.sql.functions as F
df = spark.createDataFrame(
[
  ("Subject1", 10, 5, 3, 2),
  ("Subject2", 15, 0, 15, 0),
  ("Subject3", 5, 0, 0, 5)
],
("subject", "subjectTotal", "TypeA", "TypeB", "TypeC"))
df.show()

+--------+------------+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC|
+--------+------------+-----+-----+-----+
|Subject1|          10|    5|    3|    2|
|Subject2|          15|    0|   15|    0|
|Subject3|           5|    0|    0|    5|
+--------+------------+-----+-----+-----+

df.withColumn(
    "Type",
    F.when(F.col("subjectTotal") == F.col("TypeA"), "TypeA")
     .when(F.col("subjectTotal") == F.col("TypeB"), "TypeB")
     .when(F.col("subjectTotal") == F.col("TypeC"), "TypeC")
     .otherwise(None),
).show()

+--------+------------+-----+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC| Type|
+--------+------------+-----+-----+-----+-----+
|Subject1|          10|    5|    3|    2| null|
|Subject2|          15|    0|   15|    0|TypeB|
|Subject3|           5|    0|    0|    5|TypeC|
+--------+------------+-----+-----+-----+-----+
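The SQL case variant is mentioned above but not shown. A minimal sketch, assuming the same column names as this answer: the helper build_case_expr is hypothetical (not part of PySpark) and only builds the expression string, which could then be passed to F.expr.

```python
def build_case_expr(total_col, type_cols):
    """Build a SQL CASE expression that yields the name of the first
    column in type_cols equal to total_col, and NULL otherwise."""
    # Backticks quote the identifiers in Spark SQL; with no ELSE branch,
    # CASE evaluates to NULL when nothing matches.
    branches = " ".join(
        f"WHEN `{c}` = `{total_col}` THEN '{c}'" for c in type_cols
    )
    return f"CASE {branches} END"

case_expr = build_case_expr("subjectTotal", ["TypeA", "TypeB", "TypeC"])
print(case_expr)

# With an active SparkSession and the DataFrame above (not executed here):
# df.withColumn("Type", F.expr(case_expr)).filter("Type is not null").show()
```

Adding the filter on Type removes the Subject1 row, which matches the output requested in the question.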

sg3maiej3#

You can use stack and keep only the rows where SubjectTotal equals the value in the type column:

df2 = df.selectExpr(
    'Subject', 
    'SubjectTotal', 
    "stack(3, 'TypeA', TypeA, 'TypeB', TypeB, 'TypeC', TypeC) as (type, val)"
).filter('SubjectTotal = val').select('Subject', 'type')

df2.show()
+--------+-----+
| Subject| type|
+--------+-----+
|Subject2|TypeB|
|Subject3|TypeC|
+--------+-----+
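The stack expression above hard-codes three columns. If the number of Type columns varies, the expression string could be generated instead. A rough sketch, where build_stack_expr is a hypothetical helper and not a PySpark function:

```python
def build_stack_expr(type_cols):
    """Build a stack(...) expression that unpivots type_cols into
    (type, val) pairs, one output row per column."""
    # Each pair is a string literal with the column name, followed by
    # the backtick-quoted column reference.
    pairs = ", ".join(f"'{c}', `{c}`" for c in type_cols)
    return f"stack({len(type_cols)}, {pairs}) as (type, val)"

stack_expr = build_stack_expr(["TypeA", "TypeB", "TypeC"])
print(stack_expr)

# With an active SparkSession and the DataFrame above (not executed here):
# df.selectExpr("Subject", "SubjectTotal", stack_expr) \
#   .filter("SubjectTotal = val").select("Subject", "type").show()
```

Note that stack expects all the unpivoted value columns to share one data type, and that this approach returns one row per matching type, so a tie produces several rows for the same Subject.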
