python-3.x 如何在pyspark Dataframe 列中根据OK和not OK条件分隔列?

bvjxkvbb  于 2022-11-26  发布在  Python
关注(0)|答案(1)|浏览(151)

我有一个 dataframedf,如下所示:

VehNum  Control_circuit control_circuit_status  partnumbers     errors     Flag
4234456 DOC             ok                      A567UR      Software Issue  0
4234456 DOC             not_okay                A568UR      Software Issue  1
4234456 DOC             not_okay                A569UR      Hardware issue  2
4234457 ACR             ok                      A234TY      Hardware issue  0
4234457 ACR             ok                      A235TY      Hardware issue  0
4234457 ACR             ok                      A234TY      Hardware issue  0
4234487 QWR             ok                      A276TY      Hardware issue  0
4234487 QWR             not_okay                A872UR      Hardware issue  1
3423448 QWR             not_okay                A872UR      Hardware issue  1

我想添加一个名为"Control_Flag"的新列,并执行以下操作:对于每个VehNumControl_circuit,如果它具有"control_circuit_status",则Control_circuit具有状态“OK,”因为"Control_Flag"值将是0,否则1
结果应如下所示:

VehNum  Control_circuit control_circuit_status  partnumbers     errors     Flag Control_Flag
4234456 DOC             ok                      A567UR      Software Issue  0   0
4234456 DOC             not_okay                A568UR      Software Issue  1   0
4234456 DOC             not_okay                A569UR      Hardware issue  2   0
4234457 ACR             ok                      A234TY      Hardware issue  0   0
4234457 ACR             ok                      A235TY      Hardware issue  0   0
4234457 ACR             ok                      A234TY      Hardware issue  0   0
4234487 QWR             ok                      A276TY      Hardware issue  0   1
4234487 QWR             not_okay                A872UR      Hardware issue  1   1
3423448 QWR             not_okay                A872UR      Hardware issue  1   1

如何使用pyspark实现这一点?

zbdgwd5y

zbdgwd5y1#

这是解决方案

from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import Window

df = spark.createDataFrame(
    [
        ("4234456", "DOC", "ok", "A567UR", "Software Issue", 0),
        ("4234456", "DOC", "not_okay", "A568UR", "Software Issue", 1),
        ("4234456", "DOC", "not_okay", "A569UR", "Hardware Issue", 2),        
        ("4234457", "ACR", "ok", "A234TY", "Hardware Issue", 0),
        ("4234457", "ACR", "ok", "A234TY", "Hardware Issue", 0),
        ("4234457", "ACR", "ok", "A234TY", "Hardware Issue", 0),        
        ("4234487", "QWR", "ok", "A276TY", "Hardware Issue", 0),
        ("4234487", "QWR", "not_okay", "A872UR", "Hardware Issue", 1),
        ("3423448", "QWR", "not_okay", "A872UR", "Hardware Issue", 1),
    ],
    ["VehNum", "Control_circuit", "control_circuit_status", "partnumbers", "errors", "Flag"],
)

df_agg_window = Window.partitionBy(
    "VehNum",
    "Control_circuit",
)

df = (
    df
    .withColumn(
        "cc_status",
        F.when(
            F.lower(F.col("control_circuit_status")) == "ok",
            F.lit(1),
        )
        .when(
            F.lower(F.col("control_circuit_status")) == "not_okay",
            F.lit(0),
        )
        .otherwise(F.lit(0)),
    )
    .withColumn(
        "flag_sum",
        F.sum("cc_status").over(df_agg_window),
    )
    .withColumn(
        "Control_Flag",
        F.when(
            F.lower(F.col("flag_sum")) > 0,
            F.lit(0),
        )
        .otherwise(F.lit(1)),
    )
    .drop("cc_status", "flag_sum")
)

df.show()

输出:

+-------+---------------+----------------------+-----------+--------------+----+------------+
| VehNum|Control_circuit|control_circuit_status|partnumbers|        errors|Flag|Control_Flag|
+-------+---------------+----------------------+-----------+--------------+----+------------+
|4234457|            ACR|                    ok|     A234TY|Hardware Issue|   0|           0|
|4234457|            ACR|                    ok|     A234TY|Hardware Issue|   0|           0|
|4234457|            ACR|                    ok|     A234TY|Hardware Issue|   0|           0|
|4234487|            QWR|              not_okay|     A872UR|Hardware Issue|   1|           0|
|4234487|            QWR|                    ok|     A276TY|Hardware Issue|   0|           0|
|4234456|            DOC|                    ok|     A567UR|Software Issue|   0|           0|
|4234456|            DOC|              not_okay|     A569UR|Hardware Issue|   2|           0|
|4234456|            DOC|              not_okay|     A568UR|Software Issue|   1|           0|
|3423448|            QWR|              not_okay|     A872UR|Hardware Issue|   1|           1|
+-------+---------------+----------------------+-----------+--------------+----+------------+

相关问题