PyTorch: why do I get an error when loading the IMDB dataset?

ykejflvf, posted 2023-10-20 in Other
from torchtext.datasets import WikiText2, IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tkzer = get_tokenizer('basic_english')

tr_iter = WikiText2(split='train')
vocabulary = build_vocab_from_iterator(map(tkzer, tr_iter), specials=['<unk>'])  # works fine

tr_iter_imdb = IMDB(split='train')
vocabulary = build_vocab_from_iterator(map(tkzer, tr_iter_imdb), specials=['<unk>'])  # raises the error below

The WikiText2 code runs fine, but with IMDB I get the following error when calling build_vocab_from_iterator:

AttributeError: 'tuple' object has no attribute 'lower'

Can anyone help me understand why this happens? I assume it is because the IMDB data structure differs from WikiText2's. If so, how do I build a vocab for the IMDB dataset?

ghg1uchk

IMDB() yields tuples containing an int label and a str, as its docstring shows:

IMDB Dataset

For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/

Number of lines per split:

train: 25000
test: 25000
Args:
    root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
    split: split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

:returns: DataPipe that yields tuple of label (1 to 2) and text containing the movie review
:rtype: (int, str)

I suggest you first check that the text element of each tuple is what you want, and then update your map call accordingly, e.g. map(lambda x: tkzer(x[1]), tr_iter_imdb).
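Putting it together, the vocab build for IMDB could look like this (a sketch following the suggestion above; the generator expression is equivalent to the lambda-based map, and the set_default_index call is optional):

from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tkzer = get_tokenizer('basic_english')
tr_iter_imdb = IMDB(split='train')

# Tokenize only the review text (index 1 of each (label, text) tuple).
vocabulary = build_vocab_from_iterator(
    (tkzer(text) for _, text in tr_iter_imdb),
    specials=['<unk>'],
)
vocabulary.set_default_index(vocabulary['<unk>'])  # map out-of-vocabulary tokens to '<unk>'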
