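I am training a BPE tokenizer from scratch and using the Split pre-tokenizer. In the example below, I split on every digit so that numbers are represented by the sequence of digits they are made of.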
from datasets import load_dataset
from tokenizers import models, pre_tokenizers, trainers, Tokenizer
# Dataset
ds = load_dataset('HuggingFaceFW/fineweb', streaming = True)['train']
texts = [sample['text'] for sample in ds.take(10_000)]
print(len(texts))
# Init Tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="<UNK>", byte_fallback = True))
digit_split_pretokenization_pattern = r'\d'
split_pretokenizer = pre_tokenizers.Split(pattern = digit_split_pretokenization_pattern, behavior = "isolated", invert = False)
tokenizer.pre_tokenizer = split_pretokenizer
# Sentinel tokens
sentinel_tokens = ["<UNK>", "<BOS>", "<EOS>"]
# Digits
digits = [str(num) for num in range(10)]
# Combine
special_tokens = sentinel_tokens + digits
print('Number of Special Tokens:', len(special_tokens))
trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    min_frequency=2,
    limit_alphabet=1024,
    max_token_length = 32,
    show_progress = True
)
tokenizer.train_from_iterator(texts, trainer = trainer)
# Encode a test sample
text = "This is a text that involves numbers 123,456 and 789.321"
print(tokenizer.encode(text).tokens)
['This is ', 'a t', 'ext', ' that ', 'involves ', 'numbers ', '1', '2', '3', ',', '4', '5', '6', ' and ', '7', '8', '9', '.', '3', '2', '1']
Encoding works as expected: every digit is split into its own token.
# Check vocabulary
numeric_tokens_in_vocab = [token for token in tokenizer.get_vocab() if any(char.isnumeric() for char in token)]
print(numeric_tokens_in_vocab[:100])
['1996', '$1', '54 ', 'Version 7.', '$5 ', '3D ', ':5', '3 million ', '70 ', '.\n3', '6) ', '37 ', '228', '10-', '2010-', '500', '250 ', '360', '24/', '$200', '11-', '\n4', '12th ', '69', '-1', '2007', '6/', '1930', '11, ', '1:', '1 cup ', 'May 17', '2.0 ', '77', '7) ', '50-', '2013, ', '0, ', '4th ', '02 ', '⅘', '28 ', '0.3 ', '8 million ', '24', '160', '1.5', '18, ', '$2', '$1,', ', 2009 ', '61 ', '7\n', '27', 'z47', '187', '0', 'in 2012', '2), ', '10\n', '1% ', '9th ', '39 ', '3: ', '07/', '.\n4', ',000 ', '4 ', 'at 3', '15th ', '185', '96', '1993', '8, ', 'in 197', '(1)', '01/', '2011', '$17', '/2012', '7 million ', '482 U.S. 304, ', ':43 ', 'at 8', '5-', ':00 p.m', '14/', '130', '200', '15-', '1 or ', '8-', 'May 14', '10', '12) ', '62 ', '1940', ':53 ', '6:', '100']
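However, the pattern is not applied to the vocabulary.
The vocabulary still ends up with many tokens made up of varying numbers of digits, whereas my intent is for each digit to be a single token of its own, with no other tokens containing digits.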
2 Answers
#1
It looks like this is related to the issue here. Wrapping the pattern in tokenizers.Regex solved the problem.
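For illustration, here is a minimal sketch of that fix. It assumes a recent `tokenizers` release; the tiny in-memory corpus, the small `vocab_size`, and all variable names are made up for the example and are not from the question. A plain string passed as `pattern` is matched literally, so `r'\d'` looks for a backslash followed by `d`, while `Regex(r'\d')` matches any digit.

```python
from tokenizers import Regex, Tokenizer, models, pre_tokenizers, trainers

# A plain string pattern is treated as a literal substring, not a regex.
literal_split = pre_tokenizers.Split(pattern=r"\d", behavior="isolated")
# Wrapping the pattern in Regex makes Split interpret it as a regular expression.
regex_split = pre_tokenizers.Split(pattern=Regex(r"\d"), behavior="isolated")

sample = "numbers 123,456"
literal_pieces = [piece for piece, _ in literal_split.pre_tokenize_str(sample)]
regex_pieces = [piece for piece, _ in regex_split.pre_tokenize_str(sample)]
print(literal_pieces)  # the whole string stays in one piece
print(regex_pieces)    # every digit is isolated

# Train on a toy corpus (hypothetical data) with the Regex-based pre-tokenizer.
corpus = [
    "in 2012 we paid $200",
    "call 123,456 or 789.321",
    "May 17 2012 at 3 pm",
] * 50

tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = regex_split
trainer = trainers.BpeTrainer(
    vocab_size=200,
    special_tokens=["<UNK>"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# BPE merges never cross pre-token boundaries, so no learned token
# should contain a digit alongside any other character.
multi_digit_tokens = [
    token
    for token in tokenizer.get_vocab()
    if len(token) > 1 and any(char.isdigit() for char in token)
]
print(multi_digit_tokens)
```

With the literal pattern the pre-tokenizer never fires, so the trainer is free to merge digits with their neighbours. The digits in the question's encode output were presumably still isolated because they had been registered as special tokens, which are split out at encode time regardless of the pre-tokenizer.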
#2
Glad you found a solution!