llama.cpp support for the phi-2 tokenizer

ogsagwnx · posted 6 months ago · in: Other
Follow (0) | Answers (5) | Views (156)

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Feature Description

Add support for phi-2. Running the following commands produces an error:

python -c "
from huggingface_hub import snapshot_download;
snapshot_download(repo_id='microsoft/phi-2', local_dir='phi-2', local_dir_use_symlinks=False)
"
python convert-hf-to-gguf.py phi-2/ --outtype f16

Error:

Traceback (most recent call last):
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Phi-2 uses CodeGenTokenizer, which is a BPE tokenizer.
I'm not sure whether it is enough to simply add the following line here:

{ "name": "phi-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2" },

Edit: tried that; this is the resulting hash:

if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-2
    res = "phi-2"
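For context, the update script derives this chkhsh by encoding a fixed test string with the HF tokenizer and hashing the stringified token-id list with SHA-256. A minimal sketch of that scheme, where the token IDs are a made-up stand-in for a real tokenizer's output, not phi-2's actual encoding:

```python
from hashlib import sha256

def compute_chkhsh(token_ids):
    # The update script hashes str() of the token-id list produced by
    # tokenizer.encode() on a fixed test string full of edge cases.
    return sha256(str(token_ids).encode()).hexdigest()

# Hypothetical token IDs standing in for tokenizer.encode(chktxt):
ids = [50256, 198, 220, 930]
print(compute_chkhsh(ids))
```

A consequence of this scheme: two models whose tokenizers encode the test string identically collide on the same chkhsh, which matters later in this thread.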

d5vmydt92#

Can you confirm that HF tokenization and the tokenizer attached to the llama.cpp-quantized GGUF give the same results?

In particular when the text contains special characters, as in #7049 #7062
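One way to check, sketched below: run the same strings through both tokenizers and diff the ID sequences. The helper is self-contained; the commented usage is a hypothetical example assuming `transformers` and `llama-cpp-python` are installed, with placeholder paths:

```python
def diff_tokens(hf_ids, gguf_ids):
    """Return (index, hf_id, gguf_id) at the first divergence, or None if equal."""
    for i in range(max(len(hf_ids), len(gguf_ids))):
        a = hf_ids[i] if i < len(hf_ids) else None
        b = gguf_ids[i] if i < len(gguf_ids) else None
        if a != b:
            return (i, a, b)
    return None

# Hypothetical usage (needs network access and a converted GGUF file):
# from transformers import AutoTokenizer
# from llama_cpp import Llama
# hf = AutoTokenizer.from_pretrained("microsoft/phi-2")
# llm = Llama("phi-2-f16.gguf", vocab_only=True)
# for text in ["hello world", "<|endoftext|>", "def f(x):\n\treturn x"]:
#     print(repr(text), diff_tokens(hf.encode(text),
#                                   llm.tokenize(text.encode(), add_bos=False)))
```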


zynd9foi3#

@turian any ideas on how I could easily test this?


pokxtpni4#

Sorry if this is the wrong place to post, and I don't know whether this is useful, but I'd like to share my quick attempt at converting the Phi 2 model.
I ran into the error mentioned above and tried modifying the two .py conversion scripts for HuggingFace models.
In the update script I added more Phi lines based on what I found searching the issues here, in particular this thread and:

7219 (comment)

# convert-hf-to-gguf-update.py

    {"name": "phi",            "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-1", },
    {"name": "phi-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2", },
    {"name": "phi-3",          "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", },

...as well as the following changes to convert-hf-to-gguf.py:

if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-1
    res = "phi"
if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-2
    res = "phi-2"

Notice that the two chkhsh values are identical?
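Because both branches test the same hash, the second assignment always overwrites the first, so this chain can never yield "phi". A tiny demonstration of that shadowing:

```python
chkhsh = "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085"

res = None
if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    res = "phi"    # taken...
if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    res = "phi-2"  # ...then immediately overwritten

print(res)  # → phi-2
```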
For the Phi2Model class, I added a separate add_tokenizer_pre line:

self.gguf_writer.add_name("Phi2")
self.gguf_writer.add_tokenizer_pre("gpt-2")
self.gguf_writer.add_context_length(self.find_hparam(["n_positions", "max_position_embeddings"]))

Then I tried running my Frankenstein creation. The conversion seemed to work, but when I tested the model I saw this error:

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'model-q4_0.gguf'
main: error: unable to load model

I tried removing all the phi-1 references and ran it again. Now the error became:

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi-2'
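The "unknown pre-tokenizer type" failure comes from the model loader in llama.cpp itself (the C++ side), which maps the tokenizer.ggml.pre string written into the GGUF to an internal enum and aborts on names it has not been taught; adding the hash branch to the Python converter alone is not enough. A rough Python rendering of that dispatch, where the accepted-name set is an illustrative subset, not the real list:

```python
# Illustrative subset of pre-tokenizer names the loader recognizes;
# the real list lives in llama.cpp's C++ vocabulary-loading code.
KNOWN_PRE = {"default", "llama3", "gpt-2", "deepseek-llm", "falcon"}

def resolve_pre_tokenizer(name):
    """Mimic the loader: accept known names, abort on anything else."""
    if name not in KNOWN_PRE:
        raise ValueError(f"unknown pre-tokenizer type: '{name}'")
    return name
```

This is why a GGUF carrying a pre-tokenizer name the loader already knows (such as "gpt-2") loads, while one carrying "phi" or "phi-2" fails until matching branches are also added on the C++ side.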

xfb7svmp5#

Is there any way I could get this working with a workaround, as with Fietje?

Related issues