I have a dataframe that looks like this:
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 9| 11| 0| XXXX2288|110XXXX2288MKKKKK...| CHAR0088| ERROR|Records out of se...| N|
| 9| 12| 0| XXXX2288|130XXXX22880011ZZ...| CHAR0088| ERROR|Records out of se...| N|
| 9| 18| 0| XXXX2288|140XXXX2288 ...| CHAR0088| ERROR|Records out of se...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
The following code uses UDFs to populate the errorType and errorDescription columns. The UDFs (resolveErrorTypeUDF and resolveErrorDescUDF) take an errorCode as input and return the corresponding errorType and errorDescription, respectively.
from pyspark.sql.functions import col, trim, when

errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", resolveErrorTypeUDF(col("errorCode"))) \
    .withColumn("errorDescription", resolveErrorDescUDF(col("errorCode"))) \
    .withColumn("isSuccessful", when(trim(col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()
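Note that until now I only got a single error code in the errorCode column. From now on, the errorCode column will contain one or more error codes separated by -. I need to map every code to its errorType and errorDescription and write the results to the corresponding columns, also separated by -.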
The new dataframe looks like this:
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 7| 1| 0| XXXX8822|010XXXX8822XBCDEF...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 11| 0| XXXX8822|110XXXX8822LLLLLL...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 12| 0| XXXX8822|120XXXX8822011GB ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX8822 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX88220 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
What changes are needed to handle the new case? Please help. Thanks.
1 Answer
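You only need to make minimal changes to your UDFs. Suppose you have a plain Python function, get_type_from_code, that converts a string holding a single error code into the corresponding type (the same approach works for the description).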
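One way to sketch that change is shown below: split the incoming string on -, resolve each code separately, and join the results with the same separator. Only get_type_from_code comes from the text above; the ERROR_TYPES table, the resolve_multi helper and the make_udfs factory are hypothetical names introduced for illustration, and the three code-to-type pairs are just the ones visible in the question's tables. It also assumes that neither a code nor a resolved value contains - itself.

```python
from typing import Callable, Optional

# Hypothetical lookup table; the pairs are taken from the question's sample rows.
ERROR_TYPES = {"CHAR0088": "ERROR", "CHAR0009": "ERROR", "CHAR0021": "WARN"}


def get_type_from_code(code: str) -> str:
    """Resolve ONE error code to its type (empty string if unknown)."""
    return ERROR_TYPES.get(code, "")


def resolve_multi(codes: Optional[str],
                  resolve_one: Callable[[str], str],
                  sep: str = "-") -> str:
    """Resolve a sep-delimited string of codes, keeping order and separator."""
    if codes is None or not codes.strip():
        return ""  # no error code: leave the target column empty
    return sep.join(resolve_one(c.strip()) for c in codes.split(sep))


def make_udfs(get_desc_from_code: Callable[[str], str]):
    """Wrap the helpers as Spark UDFs (assumes pyspark is installed).

    get_desc_from_code is a hypothetical single-code description lookup,
    analogous to get_type_from_code.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    type_udf = udf(lambda s: resolve_multi(s, get_type_from_code), StringType())
    desc_udf = udf(lambda s: resolve_multi(s, get_desc_from_code), StringType())
    return type_udf, desc_udf


print(resolve_multi("CHAR0009-CHAR0021", get_type_from_code))
```

If the two UDFs returned by make_udfs are bound to the names resolveErrorTypeUDF and resolveErrorDescUDF, the withColumn chain from the question should not need to change, since each UDF still takes the errorCode column and returns a single string.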