我试图通过将“fault_start”列标记为“True”来解决一个问题,即我需要在错误代码首次出现时对其行进行计数。如果它们连续重复,没有任何零出现,则将其标记为“假”。如果出现零,则表示故障已清除,如果再次出现相同故障,则将其标记为“True”。
我已经包括了下面的数据样本。请注意,“rn”列只是行号,如果需要可以忽略。此外,“fault_start”是滞后一行的“fault_code”列。
truck_fault_counts_1hz_df =
Truck Timestamp fault_code rn fault_start
251_PS1_PB_520_REF_111199 2023-07-18T00:01:02.739+0000 3 63 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:03.739+0000 244 64 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:04.739+0000 3 65 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:05.739+0000 244 66 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:06.739+0000 3 67 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:07.739+0000 244 68 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:08.739+0000 3 69 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:09.739+0000 244 70 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:10.739+0000 3 71 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:11.739+0000 244 72 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:12.741+0000 3 73 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:13.741+0000 244 74 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:14.839+0000 49 75 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:15.839+0000 0 76 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:16.840+0000 49 77 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:17.840+0000 0 78 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:18.840+0000 49 79 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:19.840+0000 0 80 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:20.840+0000 49 81 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:21.840+0000 0 82 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:22.840+0000 49 83 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:23.840+0000 0 84 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:24.840+0000 49 85 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:25.841+0000 0 86 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:26.842+0000 49 87 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:27.940+0000 3 88 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:28.940+0000 244 89 3
下面的解决方案对所有不为零的故障进行计数,但无法不对连续重复发生的故障进行计数:
# Initialize a list of numbers to not be included in the distinct_faults list
x = [0.0]
# Select the distinct faults in the fault_code column and convert it to Pandas
distinct_faults_df = truck_fault_counts_1hz_df.dropDuplicates(["fault_code"]).select("fault_code").toPandas()["fault_code"]
# Create a list of the distinct faults in the fault_code column and convert the values to ints
distinct_faults = [int(i) for i in distinct_faults_df if i not in x]
# Sort the list in ascending order
distinct_faults.sort()
# Convert the values to strings
distinct_faults = [str(i) for i in distinct_faults]
# Convert the fault_code and fault_start column to int then to string from float
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_code",truck_fault_counts_1hz_df.fault_code.cast('int'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_code",truck_fault_counts_1hz_df.fault_code.cast('string'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_start",truck_fault_counts_1hz_df.fault_start.cast('int'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_start",truck_fault_counts_1hz_df.fault_start.cast('string'))
# Fill any Null values with "0"
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.fillna(value="0", subset=["fault_start"])
for i in distinct_faults:
# Create a column for row count and fault start
# fault_start lags fault_code by one row
# Any row that has a number greater than 0 in fault_code
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.\
withColumn("fault_start",\
when(((col("fault_code") == i) & (col("fault_start") != i)),
lit("True")).\
when(((col("fault_code") == 0) & (col("fault_start") == 0)),
lit("False")).\
when(((col("fault_code") == i) & (col("fault_start") == i)),
lit("False")).\
when(((col("fault_code") == 0) & (col("fault_start") == i)),
lit("False")).\
otherwise(truck_fault_counts_1hz_df.fault_start))
truck_fault_counts_1hz_noSpam_df = truck_fault_counts_1hz_df.where(truck_fault_counts_1hz_df.fault_start == True).drop("rn")
display(truck_fault_counts_1hz_df)
我目前的输出:
Truck Timestamp fault_code rn fault_start
251_PS1_PB_520_REF_111199 2023-07-18T00:01:02.739+0000 3 63 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:03.739+0000 244 64 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:04.739+0000 3 65 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:05.739+0000 244 66 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:06.739+0000 3 67 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:07.739+0000 244 68 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:08.739+0000 3 69 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:09.739+0000 244 70 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:10.739+0000 3 71 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:11.739+0000 244 72 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:12.741+0000 3 73 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:13.741+0000 244 74 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:14.839+0000 49 75 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:15.839+0000 0 76 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:16.840+0000 49 77 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:17.840+0000 0 78 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:18.840+0000 49 79 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:19.840+0000 0 80 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:20.840+0000 49 81 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:21.840+0000 0 82 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:22.840+0000 49 83 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:23.840+0000 0 84 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:24.840+0000 49 85 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:25.841+0000 0 86 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:26.842+0000 49 87 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:27.940+0000 3 88 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:28.940+0000 244 89 True
我希望实现的输出:
Truck Timestamp fault_code rn fault_start
251_PS1_PB_520_REF_111199 2023-07-18T00:01:02.739+0000 3 63 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:03.739+0000 244 64 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:04.739+0000 3 65 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:05.739+0000 244 66 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:06.739+0000 3 67 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:07.739+0000 244 68 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:08.739+0000 3 69 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:09.739+0000 244 70 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:10.739+0000 3 71 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:11.739+0000 244 72 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:12.741+0000 3 73 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:13.741+0000 244 74 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:14.839+0000 49 75 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:15.839+0000 0 76 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:16.840+0000 49 77 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:17.840+0000 0 78 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:18.840+0000 49 79 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:19.840+0000 0 80 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:20.840+0000 49 81 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:21.840+0000 0 82 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:22.840+0000 49 83 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:23.840+0000 0 84 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:24.840+0000 49 85 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:25.841+0000 0 86 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:26.842+0000 49 87 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:27.940+0000 3 88 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:28.940+0000 244 89 True
我也一直在尝试实现一个字典,其中错误代码作为键,错误开始(True/False)作为值来跟踪它们已经开始,但还没有找到一种实用的方法来实现,不涉及迭代每一行的时间和内存密集型操作(这个数据集包含1900万行,所以这是不实际的)。我还考虑过为每个新的故障代码创建一个新的列,但也没有找到实现它的方法。
任何帮助都非常感谢!!
1条答案
按热度按时间wqsoz72f1#
代码
如何工作
用
Truck
对该矩阵进行分区,并按Timestamp
排序,然后计算fault_code
等于0
时的条件下的累积和。这将创建一个名为reset_cnt
的列,这将帮助我们区分fault_code
被重置的不同行块,即清除故障。创建
fault_start
列:按Truck
、reset_cnt
和fault_code
对该框架进行分区,并分配行号以标识每个分区的重复fault_codes
。使用false屏蔽
fault_start
中的行,其中fault_code
是0