Why does my code count 1720901 words in the Reuters corpus when it actually contains about 1.3 million?

Posted by km0tfn4u on 2022-10-22 in Python

I want to count the words in the Reuters corpus. The Python code below returns 1720901, although I know the correct figure is about 1.3 million words.

import nltk
len(nltk.corpus.reuters.words())

What causes this discrepancy?

ttcibm8c (answer #1)

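This is because the corpus is made up of multiple files, and the object returned by the words() method loads the words lazily, so the len function only tells you the number of tokens in the current view.
Demonstration: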

>>> import nltk
>>> help(nltk.corpus.reuters.words())

Read the description of the __len__ method.
Output:

Help on ConcatenatedCorpusView in module nltk.corpus.reader.util object:

class ConcatenatedCorpusView(nltk.collections.AbstractLazySequence)
 |  ConcatenatedCorpusView(corpus_views)
 |
 |  A 'view' of a corpus file that joins together one or more
 |  ``StreamBackedCorpusViews<StreamBackedCorpusView>``.  At most
 |  one file handle is left open at any time.
 |
 |  Method resolution order:
 |      ConcatenatedCorpusView
 |      nltk.collections.AbstractLazySequence
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, corpus_views)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  __len__(self)
 |      Return the number of tokens in the corpus file underlying this
 |      corpus view.
 |
 |  close(self)
 |
 |  iterate_from(self, start_tok)
 |      Return an iterator that generates the tokens in the corpus
 |      file underlying this corpus view, starting at the token number
 |      ``start``.  If ``start>=len(self)``, then this iterator will
 |      generate no tokens.
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.collections.AbstractLazySequence:
 |
 |  __add__(self, other)
 |      Return a list concatenating self with other.
 |
 |  __contains__(self, value)
 |      Return true if this list contains ``value``.
 |
 |  __eq__(self, other)
 |      Return self==value.
 |
 |  __ge__(self, other, NotImplemented=NotImplemented)
 |      Return a >= b.  Computed by @total_ordering from (not a < b).
 |
 |  __getitem__(self, i)
 |      Return the *i* th token in the corpus file underlying this
 |      corpus view.  Negative indices and spans are both supported.
 |
 |  __gt__(self, other, NotImplemented=NotImplemented)
 |      Return a > b.  Computed by @total_ordering from (not a < b) and (a != b).
 |
 |  __hash__(self)
 |      :raise ValueError: Corpus view objects are unhashable.
 |
 |  __iter__(self)
 |      Return an iterator that generates the tokens in the corpus
 |      file underlying this corpus view.
 |
 |  __le__(self, other, NotImplemented=NotImplemented)
 |      Return a <= b.  Computed by @total_ordering from (a < b) or (a == b).
 |
 |  __lt__(self, other)
 |      Return self<value.
 |
 |  __mul__(self, count)
 |      Return a list concatenating self with itself ``count`` times.
 |
 |  __ne__(self, other)
 |      Return self!=value.
 |
 |  __radd__(self, other)
 |      Return a list concatenating other with self.
 |
 |  __repr__(self)
 |      Return a string representation for this corpus view that is
 |      similar to a list's representation; but if it would be more
 |      than 60 characters long, it is truncated.
 |
 |  __rmul__(self, count)
 |      Return a list concatenating self with itself ``count`` times.
 |
 |  count(self, value)
 |      Return the number of times this list contains ``value``.
 |
 |  index(self, value, start=None, stop=None)
 |      Return the index of the first occurrence of ``value`` in this
 |      list that is greater than or equal to ``start`` and less than
 |      ``stop``.  Negative start and stop values are treated like negative
 |      slice bounds -- i.e., they count from the end of the list.
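
The docstring for __len__ speaks of tokens rather than words, so one way to see what the total is made of is to iterate over the view and tally the tokens by kind. The sketch below is only an illustration: `count_tokens` is a hypothetical helper, not part of NLTK, and the three categories are an arbitrary choice. It assumes the view yields plain strings and that punctuation comes out as separate tokens, which is worth checking on your own installation.

```python
from collections import Counter
from typing import Iterable


def count_tokens(tokens: Iterable[str]) -> Counter:
    """Tally tokens by rough kind: alphabetic, numeric, or other (e.g. punctuation)."""
    counts = Counter()
    for tok in tokens:
        if tok.isalpha():
            counts["alphabetic"] += 1
        elif tok.isdigit():
            counts["numeric"] += 1
        else:
            counts["other"] += 1
    # The total is the sum of the three categories tallied above.
    counts["total"] = sum(counts.values())
    return counts


# Tiny hand-made sample standing in for a corpus view.
sample = ["U", ".", "S", ".", "exports", "rose", "5", "pct", ","]
print(count_tokens(sample))

# With NLTK installed and the corpus downloaded (nltk.download("reuters")),
# the same helper should accept the lazy view directly:
#   import nltk
#   print(count_tokens(nltk.corpus.reuters.words()))
```

Because the helper walks the whole sequence, it forces every file behind the view to be read, so on the full corpus the "total" entry can be compared with what len reports and the "alphabetic" entry with the figure of about 1.3 million.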
