I am working with PySpark DataFrames and need to perform data cleaning on one of the columns, which looks like this:
df.select('words').show(10, truncate = 100)
+----------------------------------------------------------------------------------------------------+
| words|
+----------------------------------------------------------------------------------------------------+
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
|[content, type, text, plain, charset, utf, 8, content, transfer, encoding, quoted, printable, x, ...|
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
|[, original, message, return, path, bounce, 19853e, 6fb54, visyak, 3djuno, com, cysticacneonchin,...|
|[, forwarded, message, return, pat, h, bounce, 19853e, 6fb54, visyak, 3djuno, com, cysticacneonch...|
|[, original, message, from, 248, 623, 1653, mailto, lisa, lahlahsales, com, 20, sent, tuesday, fe...|
|[2018, horse, trailer, closeouts, free, delivery, cash, back, click, here, to, view, it, online, ...|
|[, original, message, from, paypal, us, mailto, scottkahndmd, nc, rr, com, sent, 27, february, 20...|
|[2col, 1, 2, 09, client, specific, styles, outlook, a, padding, 0, force, outlook, to, provide, a...|
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows
I perform the following data cleaning steps:

from pyspark.ml.feature import StopWordsRemover
from nltk.stem import WordNetLemmatizer
import pyspark.sql.functions as F
import nltk  # requires the 'words' and 'wordnet' corpora (nltk.download)

# Remove stop words
remover = StopWordsRemover(inputCol='words', outputCol='words_clean')
df = remover.transform(df)
# Remove words with fewer than 3 characters
df = df.withColumn("words_filtered", F.expr("filter(words_clean, x -> not(length(x) < 3))")).where(F.size(F.col("words_filtered")) > 0)
# Keep only words whose lemma appears in the NLTK words corpus
wnl = WordNetLemmatizer()
@F.udf('array<string>')
def remove_words(words):
    return [word for word in words if wnl.lemmatize(word) in nltk.corpus.words.words()]
df = df.withColumn('words_final', remove_words('words_filtered'))
and I get the following output:
df.select('words_final').show(10, truncate = 100)
+----------------------------------------------------------------------------------------------------+
| words_final|
+----------------------------------------------------------------------------------------------------+
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
|[content, type, text, plain, content, transfer, printable, apparently, yahoo, tue, return, path, ...|
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
|[original, message, return, path, bounce, received, sender, bounce, tue, pst, results, received, ...|
|[message, return, pat, bounce, received, sender, bounce, tue, pst, results, received, ass, receiv...|
| [original, message, sent, ball, subject, get]|
|[horse, trailer, free, delivery, cash, back, click, view, horse, magazine, index, option, archive...|
|[original, message, sent, subject, notification, payment, number, hello, payment, amount, payment...|
|[client, specific, styles, outlook, padding, force, outlook, provide, view, browser, button, body...|
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows
I can see that stop words (are, the, in), many junk words such as scottkahndmd, and incomplete words such as furthe have been removed. However, a few legitimate English words like emails, tuesday, february, encoding, quoted, online are also being removed, and there may be more such English words being dropped unnoticed. What could be the reason for this?
1 Answer
In your case, the filtering happens in several places:

StopWordsRemover removes common words such as he, she, myself. These words are usually not very useful in text models, but that depends on the task you are trying to solve.
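If the default list removes words you actually need, you can start from Spark's built-in English list and pass your own via the stopWords parameter. A minimal sketch — the kept set below is a hypothetical example of words you might want to preserve:

from pyspark.ml.feature import StopWordsRemover

# Start from Spark's built-in English stop word list
default_stops = StopWordsRemover.loadDefaultStopWords("english")
# Hypothetical: keep negations, which can carry signal in some text models
kept = {"not", "no", "nor"}
custom_stops = [w for w in default_stops if w not in kept]

remover = StopWordsRemover(inputCol="words", outputCol="words_clean", stopWords=custom_stops)
df = remover.transform(df)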
The other layer of filtering is the WordNetLemmatizer step combined with the nltk.corpus.words membership check — that is most likely the culprit behind words such as emails, encoding, and tuesday being dropped. lemmatize() defaults to the noun part of speech, so verb forms like quoted are not reduced to their base form, and the words corpus is case-sensitive and rather dated, so lowercase tuesday or a newer word like email may simply not be in it. Try tuning that step so it is less aggressive about removing words.
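One way to make that last step less aggressive — a sketch, assuming "less aggressive" means lowercasing the vocabulary and trying several parts of speech; adjust it to your task:

import nltk
from nltk.stem import WordNetLemmatizer
import pyspark.sql.functions as F

wnl = WordNetLemmatizer()
# Build the vocabulary once, lowercased and as a set: membership tests become
# O(1), and lowercase tokens like 'tuesday' can match the corpus entry 'Tuesday'
vocab = {w.lower() for w in nltk.corpus.words.words()}

@F.udf('array<string>')
def remove_words(words):
    # Keep a word if any of its noun/verb/adjective lemmas is in the vocabulary,
    # so e.g. 'quoted' can match via its verb lemma 'quote'
    return [w for w in words if any(wnl.lemmatize(w, pos=p) in vocab for p in ('n', 'v', 'a'))]

Note that the words corpus is an old word list, so genuinely modern terms like email may still be missing no matter how you lemmatize; if that matters, extend vocab with your own terms.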
P.S. If you are doing NLP on Spark, I would recommend taking a look at the Spark NLP package — it can give you better performance, more features, and so on.
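For illustration, a Spark NLP pipeline along these lines could replace the hand-rolled steps — a sketch assuming the spark-nlp package is installed and that your DataFrame has a raw string column named text; the pretrained model is Spark NLP's default English lemmatizer:

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, StopWordsCleaner, LemmatizerModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
cleaner = StopWordsCleaner().setInputCols(["token"]).setOutputCol("clean_tokens")
lemmatizer = LemmatizerModel.pretrained().setInputCols(["clean_tokens"]).setOutputCol("lemma")

pipeline = Pipeline(stages=[document, tokenizer, cleaner, lemmatizer])
result = pipeline.fit(df).transform(df)

Everything here runs as native Spark ML stages, so you also avoid the per-row Python UDF overhead of the NLTK approach.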