I have a huge DataFrame similar to this one:
l = [('20190503', 'par1', 'feat2', '0x0'),
('20190503', 'par1', 'feat3', '0x01'),
('date', 'part', 'feature', 'value'),
('20190501', 'par5', 'feat9', '0x00'),
('20190506', 'par8', 'feat2', '0x00f45'),
('date', 'part', 'feature', 'value'),
('20190501', 'par11', 'feat3', '0x000000000'),
('date', 'part', 'feature', 'value'),
('20190501', 'par3', 'feat9', '0x000'),
('20190501', 'par6', 'feat5', '0x000000'),
('date', 'part', 'feature', 'value'),
('20190506', 'par8', 'feat1', '0x00000'),
('20190508', 'par3', 'feat6', '0x00000000'),
('20190503', 'par4', 'feat3', '0x0c0deffe21'),
('20190503', 'par6', 'feat4', '0x0000000000'),
('20190501', 'par3', 'feat6', '0x0123fe'),
('20190501', 'par7', 'feat4', '0x00000d0')]
columns = ['date', 'part', 'feature', 'value']
+--------+-----+-------+------------+
|    date| part|feature|       value|
+--------+-----+-------+------------+
|20190503| par1|  feat2|         0x0|
|20190503| par1|  feat3|        0x01|
|    date| part|feature|       value|
|20190501| par5|  feat9|        0x00|
|20190506| par8|  feat2|     0x00f45|
|    date| part|feature|       value|
|20190501|par11|  feat3| 0x000000000|
|    date| part|feature|       value|
|20190501| par3|  feat9|       0x000|
|20190501| par6|  feat5|    0x000000|
|    date| part|feature|       value|
|20190506| par8|  feat1|     0x00000|
|20190508| par3|  feat6|  0x00000000|
|20190503| par4|  feat3|0x0c0deffe21|
|20190503| par6|  feat4|0x0000000000|
|20190501| par3|  feat6|    0x0123fe|
|20190501| par7|  feat4|   0x00000d0|
+--------+-----+-------+------------+
It contains rows that are identical to the header, and I want to remove all of those rows, so that the result is:
+--------+-----+-------+------------+
|    date| part|feature|       value|
+--------+-----+-------+------------+
|20190503| par1|  feat2|         0x0|
|20190503| par1|  feat3|        0x01|
|20190501| par5|  feat9|        0x00|
|20190506| par8|  feat2|     0x00f45|
|20190501|par11|  feat3| 0x000000000|
|20190501| par3|  feat9|       0x000|
|20190501| par6|  feat5|    0x000000|
|20190506| par8|  feat1|     0x00000|
|20190508| par3|  feat6|  0x00000000|
|20190503| par4|  feat3|0x0c0deffe21|
|20190503| par6|  feat4|0x0000000000|
|20190501| par3|  feat6|    0x0123fe|
|20190501| par7|  feat4|   0x00000d0|
+--------+-----+-------+------------+
I tried to get rid of them with the .distinct() method, but it always leaves one of them behind.
How can I do this?
1 Answer
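This will work (it essentially chains multiple filters, one per column; Spark takes care of merging them when it builds the physical plan).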
Input:
Output: