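I'm looking for a way to check several conditions within a group. First, I check for overlaps between records (per ID). Second, I need to make an exception for the record with the highest tran number within the same contagious (chained) overlap. On top of that, the same ID can have several separate overlaps. For example: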
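from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Reuse the active session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

# (id, tran, start, end); tran numbers match the table printed below
data = [('A',1000,1,100),
        ('B',1001,0,10),
        ('B',1002,10,15),
        ('B',1003,20,22),
        ('B',1004,25,50),
        ('B',1005,50,55),
        ('B',1006,53,56),
        ('B',1007,60,100),
        ('C',1008,1,100)
]
schema = StructType([
    StructField("id",StringType(),True),
    StructField("tran",IntegerType(),True),
    StructField("start",IntegerType(),True),
    StructField("end",IntegerType(),True),
])
df = spark.createDataFrame(data=data,schema=schema)
df.show()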
+---+----+-----+---+
| id|tran|start|end|
+---+----+-----+---+
|  A|1000|    1|100|
|  B|1001|    0| 10|
|  B|1002|   10| 15|
|  B|1003|   20| 22|
|  B|1004|   25| 50|
|  B|1005|   50| 55|
|  B|1006|   53| 56|
|  B|1007|   60|100|
|  C|1008|    1|100|
+---+----+-----+---+
The desired DataFrame should look like this:
+---+----+-----+---+-----+
| id|tran|start|end|valid|
+---+----+-----+---+-----+
|  A|1000|    1|100|  yes| # this is valid because by id there is no overlap between start and end
|  B|1001|    0| 10|   no| # invalid because by id it overlaps with the next
|  B|1002|   10| 15|  yes| # it overlaps with the previous one but it has the highest tran number between the two
|  B|1003|   20| 22|  yes| # yes because no overlap
|  B|1004|   25| 50|   no| # invalid because overlaps and the tran is not the highest
|  B|1005|   50| 55|   no| # invalid because overlaps and the tran is not the highest
|  B|1006|   53| 56|  yes| # it overlaps with the previous ones but it has the highest tran number among the three contagiously overlapping ones
|  B|1007|   60|100|  yes| # no overlap
|  C|1008|    1|100|  yes| # no overlap
+---+----+-----+---+-----+
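The rule described in the comments above can be sketched in plain Python, independent of Spark. This is only an illustration under two assumptions read off the example rather than stated outright: intervals are closed (0-10 and 10-15 count as overlapping), and a chain keeps growing as long as the next record starts at or before the largest end seen so far. The helper names `flag_valid` and `_close_chain` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def flag_valid(rows):
    """Hypothetical helper: rows are (id, tran, start, end) tuples."""
    # Sort by id, then start, then end, so each chain is contiguous.
    ordered = sorted(rows, key=lambda r: (r[0], r[2], r[3]))
    flagged = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        chain, chain_end = [], None
        for row in group:
            # Assumption: closed intervals, so start == chain_end still overlaps.
            if chain and row[2] > chain_end:
                flagged.extend(_close_chain(chain))
                chain, chain_end = [], None
            chain.append(row)
            chain_end = row[3] if chain_end is None else max(chain_end, row[3])
        flagged.extend(_close_chain(chain))
    return flagged


def _close_chain(chain):
    # Only the highest tran in a chain is valid; a chain of one is always valid.
    top = max(r[1] for r in chain)
    return [r + ("yes" if r[1] == top else "no",) for r in chain]


rows = [('A', 1000, 1, 100), ('B', 1001, 0, 10), ('B', 1002, 10, 15),
        ('B', 1003, 20, 22), ('B', 1004, 25, 50), ('B', 1005, 50, 55),
        ('B', 1006, 53, 56), ('B', 1007, 60, 100), ('C', 1008, 1, 100)]
flags = [(r[1], r[4]) for r in flag_valid(rows)]
print(flags)
```

Tracking the running maximum of `end`, rather than only the previous record's `end`, is what lets one long record pull several later ones into the same chain.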
Many thanks to the legend who solves this :)
1 Answer
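1. Import the necessary packages
2. Create the DataFrame
3. Add a few extra columns that are needed to compare each record with the others
4. Let's filter out the records based on the conditions you specified
5. Let's join all the matching DataFrames together
6. Let's find the records that did not match
7. Let's combine them, sort, and drop some columns to get the desired output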
Here it is: