nltk: a one-letter spelling error in the English stopword list

Posted by 7cwmlq89 5 months ago in Other


from nltk.corpus import stopwords
stopwords.words('english')

The last element returned is 'wouldn', but I think it should be 'wouldn't'.


Source: https://github.com/nltk/nltk/issues/1800

5 answers


htrmnn0y1#

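Actually, this affects every word in the stopword list that should have a letter after the apostrophe, not just 'wouldn'. It would be nicer if the words were returned in alphabetical order, but that is not a bug.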


8dtrkrch2#

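This stopword list assumes that the words have already been tokenized on non-alphabetic characters, which is why both wouldn and t are listed.
Rather than removing these forms, it would be better to add the full forms, e.g. wouldn't.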
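
To illustrate the point about tokenization, here is a minimal sketch that splits text on non-alphabetic characters. The helper `tokenize_alpha` and the tiny `STOPWORDS_SUBSET` are assumptions made up for this example; they are not part of NLTK, and the subset is not the real list.

```python
import re


def tokenize_alpha(text):
    """Lowercase the text and keep only runs of ASCII letters.

    Hypothetical helper: any non-letter, including the apostrophe,
    acts as a separator.
    """
    return re.findall(r"[a-z]+", text.lower())


# Illustrative subset only; the real list comes from stopwords.words('english').
STOPWORDS_SUBSET = {"i", "wouldn", "t", "that"}

tokens = tokenize_alpha("I wouldn't eat that")
remaining = [tok for tok in tokens if tok not in STOPWORDS_SUBSET]

print(tokens)     # ['i', 'wouldn', 't', 'eat', 'that']
print(remaining)  # ['eat']
```

With this kind of tokenizer the apostrophe splits wouldn't into wouldn and t, so both fragments need to be in the list for the contraction to be filtered out.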


qni6mghb3#

Yes, that is what I wanted to request.

On September 30, 2017, at 6:48 AM, "Steven Bird" wrote: This stopword list assumes that the words have already been tokenized on non-alphabetic characters, which is why both wouldn and t are listed. Rather than removing these forms, it would be better to add the full forms, e.g. wouldn't.


tkqqtvp14#

So, inspecting the current list of English stopwords I came up with these additions. What am I missing?

you're
you've
you'll
you'd
she's
it's
that'll
don't
should've
aren't
couldn't
didn't
doesn't
hadn't
hasn't
haven't
isn't
mightn't
mustn't
needn't
shan't
shouldn't
wasn't
weren't
won't
wouldn't
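
As a rough sketch of why the full forms matter, the snippet below filters whitespace-split tokens, which keep their apostrophes. `BASE_STOPWORDS` is a small hypothetical stand-in for the existing list, `FULL_FORMS` takes only a few entries from the list above, and both helper functions are invented for this example rather than taken from NLTK.

```python
# Hypothetical stand-in for the existing list (fragments only).
BASE_STOPWORDS = ["i", "you", "wouldn", "t", "don", "that"]

# A few of the proposed additions from the list above.
FULL_FORMS = ["you're", "don't", "wouldn't"]


def extend_stopwords(base, extra):
    """Return base plus any extra words not already present, keeping order."""
    seen = set(base)
    result = list(base)
    for word in extra:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


def remove_stopwords(text, stopword_list):
    """Split on whitespace (apostrophes survive) and drop stopwords."""
    stop = set(stopword_list)
    return [tok for tok in text.lower().split() if tok not in stop]


extended = extend_stopwords(BASE_STOPWORDS, FULL_FORMS)

print(remove_stopwords("I wouldn't say that", BASE_STOPWORDS))  # ["wouldn't", 'say']
print(remove_stopwords("I wouldn't say that", extended))        # ['say']
```

Keeping both the fragments and the full forms means the list works whether or not the tokenizer splits on apostrophes.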


h79rfbju5#

You can see my fork, and my attempt to correct these errors, here: https://github.com/fabianvf/python-rake/blob/master/RAKE/stoplists/NLTKStopList.py
