Python: remove rows that match the header from a PySpark DataFrame

Posted by 2admgd59 on 2023-02-02 in Python

I have a huge DataFrame similar to this one:

l = [('20190503', 'par1', 'feat2', '0x0'),
('20190503', 'par1', 'feat3', '0x01'),
('date', 'part', 'feature', 'value'),
('20190501', 'par5', 'feat9', '0x00'),
('20190506', 'par8', 'feat2', '0x00f45'),
('date', 'part', 'feature', 'value'),
('20190501', 'par11', 'feat3', '0x000000000'),
('date', 'part', 'feature', 'value'),
('20190501', 'par3', 'feat9', '0x000'),
('20190501', 'par6', 'feat5', '0x000000'),
('date', 'part', 'feature', 'value'),
('20190506', 'par8', 'feat1', '0x00000'),
('20190508', 'par3', 'feat6', '0x00000000'),
('20190503', 'par4', 'feat3', '0x0c0deffe21'),
('20190503', 'par6', 'feat4', '0x0000000000'),
('20190501', 'par3', 'feat6', '0x0123fe'),
('20190501', 'par7', 'feat4', '0x00000d0')]

columns = ['date', 'part', 'feature', 'value']

+--------+-----+-------+------------+
|    date| part|feature|       value|
+--------+-----+-------+------------+
|20190503| par1|  feat2|         0x0|
|20190503| par1|  feat3|        0x01|
|    date| part|feature|       value|
|20190501| par5|  feat9|        0x00|
|20190506| par8|  feat2|     0x00f45|
|    date| part|feature|       value|
|20190501|par11|  feat3| 0x000000000|
|    date| part|feature|       value|
|20190501| par3|  feat9|       0x000|
|20190501| par6|  feat5|    0x000000|
|    date| part|feature|       value|
|20190506| par8|  feat1|     0x00000|
|20190508| par3|  feat6|  0x00000000|
|20190503| par4|  feat3|0x0c0deffe21|
|20190503| par6|  feat4|0x0000000000|
|20190501| par3|  feat6|    0x0123fe|
|20190501| par7|  feat4|   0x00000d0|
+--------+-----+-------+------------+

It contains rows that match the header. I want to remove all of those rows, so the result would be:

+--------+-----+-------+------------+
|    date| part|feature|       value|
+--------+-----+-------+------------+
|20190503| par1|  feat2|         0x0|
|20190503| par1|  feat3|        0x01|
|20190501| par5|  feat9|        0x00|
|20190506| par8|  feat2|     0x00f45|
|20190501|par11|  feat3| 0x000000000|
|20190501| par3|  feat9|       0x000|
|20190501| par6|  feat5|    0x000000|
|20190506| par8|  feat1|     0x00000|
|20190508| par3|  feat6|  0x00000000|
|20190503| par4|  feat3|0x0c0deffe21|
|20190503| par6|  feat4|0x0000000000|
|20190501| par3|  feat6|    0x0123fe|
|20190501| par7|  feat4|   0x00000d0|
+--------+-----+-------+------------+

I tried to get rid of them with the .distinct() method, but it always leaves one of them behind.
How can I do this?

#1 jchrr9hc

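This should work. It chains one filter per column, and Spark takes care of merging them into a single predicate when it optimizes the query plan. Note that the chained filters drop a row as soon as any one column equals its column name, and that a row with a NULL in any column is dropped as well, because the != comparison evaluates to NULL for it.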

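from pyspark.sql import functions as F

# One filter per column: keep only rows whose value differs from the column name
for name in df.columns:
    df = df.filter(F.col(name) != name)

df.show()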
