nltk: a one-letter spelling error in the English stopword list

7cwmlq89 posted 5 months ago in Other
from nltk.corpus import stopwords
stopwords.words('english')

The last element returned is 'wouldn', but I think it should be 'wouldn't'.

htrmnn0y1#

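Actually, every word in the stopword list that would take an apostrophe should have a letter after the '. It would be nicer if the words were returned in alphabetical order, but that is not a bug.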

8dtrkrch2#

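This stopword list assumes that words have already been tokenized on non-alphabetic characters, which is why wouldn and t are both listed.
Rather than removing these forms, it would be better to add the full forms, e.g. wouldn't.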
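
To illustrate the tokenization assumption described above, here is a minimal sketch using only the standard library. The `STOPWORDS` set is a small hand-picked subset for illustration, not the full NLTK list, and `tokenize_non_alpha` and `remove_stopwords` are hypothetical helper names.

```python
import re

# Illustrative subset only (an assumption); the real list comes from
# nltk.corpus.stopwords.words('english').
STOPWORDS = {"i", "wouldn", "t", "that"}

def tokenize_non_alpha(text):
    """Lowercase the text and split on runs of non-alphabetic characters."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

sentence = "I wouldn't eat that"

# Splitting on non-alphabetic characters breaks "wouldn't" into "wouldn" and "t",
# and both pieces are matched by the stopword set.
tokens = tokenize_non_alpha(sentence)
filtered = remove_stopwords(tokens)

# Splitting on whitespace keeps "wouldn't" whole, so it slips past the filter.
whitespace_tokens = sentence.lower().split()
whitespace_filtered = remove_stopwords(whitespace_tokens)
```

With a whitespace tokenizer the contracted form survives filtering, which is why adding the full forms to the list is useful.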

qni6mghb3#

That is what I wanted to request, yes.

On September 30, 2017 at 6:48 AM, "Steven Bird" wrote: This stopword list assumes that words have already been tokenized on non-alphabetic characters, which is why wouldn and t are both listed. Rather than removing these forms, it would be better to add the full forms, e.g. wouldn't.

tkqqtvp14#

So, inspecting the current list of English stopwords I came up with these additions. What am I missing?

you're
you've
you'll
you'd
she's
it's
that'll
don't
should've
aren't
couldn't
didn't
doesn't
hadn't
hasn't
haven't
isn't
mightn't
mustn't
needn't
shan't
shouldn't
wasn't
weren't
won't
wouldn't
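
One way to apply additions like these locally is sketched below. `base_stopwords` is a short stand-in for the list returned by `stopwords.words('english')`, and `extend_stopwords` is a hypothetical helper, not part of NLTK.

```python
# Stand-in for stopwords.words('english'); only a few entries are shown (assumption).
base_stopwords = ["you", "it", "don", "wouldn", "t"]

# A few of the contracted forms proposed above.
contractions = ["you're", "it's", "don't", "wouldn't"]

def extend_stopwords(base, extra):
    """Return base plus any entries of extra not already present, preserving order."""
    seen = set(base)
    merged = list(base)
    for word in extra:
        if word not in seen:
            merged.append(word)
            seen.add(word)
    return merged

extended = extend_stopwords(base_stopwords, contractions)
extended_set = set(extended)

# With whitespace tokenization, "wouldn't" is now filtered as a whole word.
tokens = "you wouldn't know".split()
kept = [tok for tok in tokens if tok not in extended_set]
```

The original short forms stay in the list, so text that was tokenized on non-alphabetic characters is still handled.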

h79rfbju5#

You can see my fork, and my attempt to correct these errors, here: https://github.com/fabianvf/python-rake/blob/master/RAKE/stoplists/NLTKStopList.py
