I am working with a PySpark DataFrame that looks like this:
+-------+--------------------------------------------------+
|     id|                                             words|
+-------+--------------------------------------------------+
|1475569|[pt, m, reporting, delivery, scam, thank, 0a, 0...|
|1475568|[, , delivered, trblake, yahoo, com, received, ...|
|1475566|[, marco, v, washin, gton, thursday, de, cembe...|
|1475565|[, marco, v, washin, gton, wednesday, de, cembe...|
|1475563|[joyce, 20, begin, forwarded, message, 20, memo...|
+-------+--------------------------------------------------+
Data types:
id: 'bigint'
words: 'array<string>'
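I want to remove non-English words from the `words` column, including numbers and tokens that contain digits, such as bun20. I have already removed stop words, but how can I remove the remaining non-English words from this column?
Please help.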
1 Answer
You can use a UDF to check whether each word in the array is in the NLTK words corpus:
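A minimal sketch of that approach, not a definitive implementation. It assumes the NLTK `words` corpus has already been downloaded (`nltk.download('words')`) on the machine that builds the vocabulary. The helper names `keep_english`, `load_nltk_vocabulary`, and `remove_non_english` are hypothetical and chosen only for illustration. Tokens that are empty, numeric, or contain digits (such as bun20 or 0a) are rejected by a letters-only check before the dictionary lookup.

```python
import re

# Only plain ASCII letters count as a candidate word;
# this rejects "", "20", "bun20", and "0a" before any dictionary lookup.
ALPHA_ONLY = re.compile(r"^[A-Za-z]+$")


def keep_english(tokens, vocabulary):
    """Return the tokens that are purely alphabetic and present in `vocabulary`."""
    if tokens is None:  # Spark passes None for null array cells
        return None
    return [
        t for t in tokens
        if t and ALPHA_ONLY.match(t) and t.lower() in vocabulary
    ]


def load_nltk_vocabulary():
    """Load NLTK's English word list as a lowercase set.

    Assumption: nltk.download('words') has been run beforehand.
    """
    from nltk.corpus import words  # lazy import: keep_english itself needs no NLTK
    return {w.lower() for w in words.words()}


def remove_non_english(df, column="words"):
    """Overwrite `column` (array<string>) with only its English tokens."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Built once on the driver and shipped to executors inside the UDF closure.
    vocabulary = load_nltk_vocabulary()
    english_udf = F.udf(
        lambda tokens: keep_english(tokens, vocabulary),
        ArrayType(StringType()),
    )
    return df.withColumn(column, english_udf(F.col(column)))
```

With the DataFrame from the question, the call would be `df = remove_non_english(df)`. Note that the NLTK word list consists mostly of dictionary headwords, so inflected forms may be dropped as well; lemmatizing the tokens first, or using a larger vocabulary, may give better results. For a large vocabulary, a Spark broadcast variable is a reasonable alternative to capturing the set in the closure.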