Why are some English words removed after using stop words or the NLTK corpus?

cfh9epnr  asked 2021-07-14  in Spark
Follow (0) | Answers (1) | Views (289)

I am working with PySpark DataFrames and need to clean the data in one of the columns, which looks like this:

df.select('words').show(10, truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                               words|
+----------------------------------------------------------------------------------------------------+
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
|[content, type, text, plain, charset, utf, 8, content, transfer, encoding, quoted, printable, x, ...|
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
|[, original, message, return, path, bounce, 19853e, 6fb54, visyak, 3djuno, com, cysticacneonchin,...|
|[, forwarded, message, return, pat, h, bounce, 19853e, 6fb54, visyak, 3djuno, com, cysticacneonch...|
|[, original, message, from, 248, 623, 1653, mailto, lisa, lahlahsales, com, 20, sent, tuesday, fe...|
|[2018, horse, trailer, closeouts, free, delivery, cash, back, click, here, to, view, it, online, ...|
|[, original, message, from, paypal, us, mailto, scottkahndmd, nc, rr, com, sent, 27, february, 20...|
|[2col, 1, 2, 09, client, specific, styles, outlook, a, padding, 0, force, outlook, to, provide, a...|
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows

I perform the following data-cleaning steps:

from pyspark.ml.feature import StopWordsRemover
from pyspark.sql import functions as F
from nltk.stem import WordNetLemmatizer
import nltk

remover = StopWordsRemover(inputCol='words', outputCol='words_clean') # remove stop words
df = remover.transform(df)

df = df.withColumn("words_filtered", F.expr("filter(words_clean, x -> not(length(x) < 3))")).where(F.size(F.col("words_filtered")) > 0) # remove words with fewer than 3 characters

wnl = WordNetLemmatizer()
valid_words = set(nltk.corpus.words.words()) # build the lookup once; membership tests on the raw list are slow

@F.udf('array<string>')
def remove_words(words):
    return [word for word in words if wnl.lemmatize(word) in valid_words] # remove words whose lemma is not in the nltk corpus

df = df.withColumn('words_final', remove_words('words_filtered'))

I get output like this:

df.select('words_final').show(10, truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                         words_final|
+----------------------------------------------------------------------------------------------------+
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
|[content, type, text, plain, content, transfer, printable, apparently, yahoo, tue, return, path, ...|
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
|[original, message, return, path, bounce, received, sender, bounce, tue, pst, results, received, ...|
|[message, return, pat, bounce, received, sender, bounce, tue, pst, results, received, ass, receiv...|
|                                                       [original, message, sent, ball, subject, get]|
|[horse, trailer, free, delivery, cash, back, click, view, horse, magazine, index, option, archive...|
|[original, message, sent, subject, notification, payment, number, hello, payment, amount, payment...|
|[client, specific, styles, outlook, padding, force, outlook, provide, view, browser, button, body...|
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows

I can see that stop words (are, the, in), many junk words such as scottkahndmd, and incomplete words such as furthe have been removed. But a few legitimate words like emails, tuesday, february, encoding, quoted, online have also been removed, and there may be more English words like these being dropped unnoticed.
Is there a reason for this?

2w3kk1z51#

In your case, the filtering appears to happen in a couple of places:
The StopWordsRemover drops common function words such as he, she, myself. These words are usually not very informative for text models, but whether you want them removed depends on the task you are solving; you can print the default list to see exactly what it takes out (see the sketch below).
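
A minimal check, using PySpark's built-in loader for the default English stop-word list:

from pyspark.ml.feature import StopWordsRemover

# the default English list StopWordsRemover falls back to when none is supplied
english_stops = StopWordsRemover.loadDefaultStopWords("english")
print(len(english_stops))
print(sorted(english_stops)[:20])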
The other layer of filtering is your WordNetLemmatizer-based UDF; it is the likely culprit for dropping words such as emails and encoding. Try tuning it so that it is less aggressive about discarding words.
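
You can see why individual words disappear by testing the two pieces in isolation. A small diagnostic sketch — the exact results depend on the NLTK data you downloaded, so treat the comments as expectations rather than guarantees:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import words

wnl = WordNetLemmatizer()
vocab = set(words.words())

# lemmatize() defaults to pos='n', so verb forms often come back unchanged
print(wnl.lemmatize('quoted'))           # 'quoted' -- if that form is not in the corpus, the word is dropped
print(wnl.lemmatize('quoted', pos='v'))  # 'quote'

# the corpus keeps proper nouns capitalized, so lowercased tokens can miss
print('tuesday' in vocab)   # likely False
print('Tuesday' in vocab)   # likely True

One way to make the filter less aggressive is to case-fold the vocabulary and keep a token if either its raw form or a noun/verb lemma is known:

vocab_lower = {w.lower() for w in words.words()}

def keep(token):
    candidates = {token, wnl.lemmatize(token, pos='n'), wnl.lemmatize(token, pos='v')}
    return any(c.lower() in vocab_lower for c in candidates)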
P.S. If you are doing NLP on Spark, I recommend taking a look at the Spark NLP package. It tends to be more performant, has more functionality, and so on.
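
For illustration only, an equivalent Spark NLP pipeline might look like the sketch below. It assumes a raw string column named text (your df holds a pre-tokenized array, so you would start from the original text instead); the annotator names come from the John Snow Labs documentation, and the pretrained lemmatizer downloads a model on first use:

import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, StopWordsCleaner, LemmatizerModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

document   = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer  = Tokenizer().setInputCols(["document"]).setOutputCol("token")
cleaner    = StopWordsCleaner().setInputCols(["token"]).setOutputCol("clean_tokens")
lemmatizer = LemmatizerModel.pretrained().setInputCols(["clean_tokens"]).setOutputCol("lemma")
finisher   = Finisher().setInputCols(["lemma"])

pipeline = Pipeline(stages=[document, tokenizer, cleaner, lemmatizer, finisher])
df_final  = pipeline.fit(df).transform(df)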
