from nltk.corpus import stopwords
stopwords.words('english')
The last element returned is 'wouldn', but I think it should be 'wouldn't'.
htrmnn0y1#
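Actually, every word in the stopword list that contains a ' should have a letter after it. It would also be nicer if the words were returned in alphabetical order, but that is not a bug.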
8dtrkrch2#
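This stopword list assumes that the words have already been tokenized on non-alphabetic characters, which is why wouldn and t are both listed. Rather than removing these forms, it would be better to add the full forms, e.g. wouldn't.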
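To illustrate the point about tokenization, here is a minimal sketch of splitting on non-alphabetic characters. The helper `split_on_non_letters` is a hypothetical stand-in written with the standard `re` module; it is not NLTK's tokenizer, and it only shows why a contraction ends up as two separate list entries.

```python
import re

def split_on_non_letters(text):
    # Assumed tokenizer: break on every run of non-alphabetic characters,
    # so the apostrophe separates "wouldn" from "t".
    return [tok for tok in re.split(r"[^A-Za-z]+", text.lower()) if tok]

tokens = split_on_non_letters("I wouldn't go")
print(tokens)  # ['i', 'wouldn', 't', 'go']
```

With input tokenized this way, the full form wouldn't never appears as a token, which is presumably why only the split pieces were listed.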
qni6mghb3#
Yes, that is what I wanted to request.
On September 30, 2017 at 6:48 AM, "Steven Bird" notifications@github.com wrote: This stopword list assumes that the words have already been tokenized on non-alphabetic characters, which is why wouldn and t are both listed. Rather than removing these forms, it would be better to add the full forms, e.g. wouldn't.
tkqqtvp14#
So, inspecting the current list of English stopwords I came up with these additions. What am I missing?
you're you've you'll you'd she's it's that'll don't should've aren't couldn't didn't doesn't hadn't hasn't haven't isn't mightn't mustn't needn't shan't shouldn't wasn't weren't won't wouldn't
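As a rough sketch of what adding the full forms would change, the snippet below filters a whitespace-tokenized sentence against two sets. `base_stopwords` is a small hypothetical excerpt, not the real NLTK list, and `remove_stopwords` is an illustrative helper, not a library function.

```python
# Hypothetical excerpt of the existing split forms; the real list is longer.
base_stopwords = {"i", "wouldn", "t", "don", "you", "it", "s"}

# A few of the full forms proposed above.
full_forms = {"wouldn't", "don't", "you're", "it's"}

extended = base_stopwords | full_forms

def remove_stopwords(text, stopwords):
    # Whitespace tokenization keeps contractions intact as single tokens.
    return [w for w in text.lower().split() if w not in stopwords]

print(remove_stopwords("I wouldn't eat that", base_stopwords))  # ["wouldn't", 'eat', 'that']
print(remove_stopwords("I wouldn't eat that", extended))        # ['eat', 'that']
```

Keeping the split forms alongside the full forms means both tokenization styles are covered.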
h79rfbju5#
You can see my fork, and my attempt at correcting these errors, here: https://github.com/fabianvf/python-rake/blob/master/RAKE/stoplists/NLTKStopList.py