1 Answer

vltsax25 · #1
I ran into a piece of text of 24,164 characters containing 8,220 tokens; it therefore could not fit into the OpenAI embedding model's context window (via the OpenAI API). The text was the references section of a research paper.
OpenAI's rule of thumb is that one token corresponds on average to four characters of natural-language text, so a limit of 8,192 tokens suggests roughly 32,768 characters. But as my example shows, character-based chunking is unreliable: dense text like a references list packs far fewer characters into each token.
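To see how far off the 4-characters-per-token heuristic can be, here is a minimal sketch that measures the ratio directly instead of estimating from character counts. It assumes the tiktoken package is installed; the sample strings are my own illustrations:

```python
import tiktoken

# "cl100k_base" is the encoding used by OpenAI's text-embedding-ada-002
# and newer embedding models (8,192-token context window).
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    # Ordinary prose sits close to the 4-chars-per-token heuristic.
    "plain prose": "The quick brown fox jumps over the lazy dog.",
    # References are dense with names, initials, numbers, and punctuation,
    # which tokenize far less efficiently than ordinary prose.
    "references": "Doe, J.; Smith, A. (2021). arXiv:2101.00001, pp. 123-145.",
}

for label, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} chars / {n_tokens} tokens "
          f"= {len(text) / n_tokens:.2f} chars per token")
```

On my references text the ratio was under 3 characters per token (24,164 / 8,220 ≈ 2.94), which is why the 32,768-character estimate overshoots the real limit.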
I don't think the current implementation of chunk_by_attention_window helps much here: it only considers joining chunks, not splitting them. In fact, it requires every chunk produced by split_function to be smaller than max_input_size (the context window), and raises an exception otherwise:

unstructured/unstructured/staging/huggingface.py, lines 75 to 82 at a66661a:
    if num_tokens > max_chunk_size:
        raise ValueError(
            f"The number of tokens in the segment is {num_tokens}. "
            f"The maximum number of tokens is {max_chunk_size}. "
            "Consider using a different split_function to reduce the size "
            "of the segments under consideration. The text that caused the "
            f"error is: \n\n{segment}",
        )
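A possible workaround (my own sketch, not part of unstructured) is to fall back to a hard token-level split whenever a segment still exceeds the window after split_function has run, instead of raising. The helper name split_oversized_segment and the choice of bert-base-uncased are illustrative; any tokenizer exposing the HuggingFace encode()/decode() interface would do:

```python
from transformers import AutoTokenizer

def split_oversized_segment(segment: str, tokenizer, max_chunk_size: int) -> list[str]:
    """Hard-split one oversized segment into chunks of at most
    max_chunk_size tokens, rather than raising ValueError."""
    # The tokenizer may warn that the sequence exceeds the model's maximum
    # length; that is expected here, since splitting is the whole point.
    token_ids = tokenizer.encode(segment, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i : i + max_chunk_size])
        for i in range(0, len(token_ids), max_chunk_size)
    ]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "Doe, J.; Smith, A. (2021). Some paper title. " * 400  # stand-in for a references section
chunks = split_oversized_segment(long_text, tokenizer, max_chunk_size=256)
print(len(chunks), "chunks; largest =",
      max(len(tokenizer.encode(c, add_special_tokens=False)) for c in chunks), "tokens")
```

A hard token split can cut across sentences or even words, so in practice I would try progressively finer split_functions first (paragraphs, then sentences) and use something like this only as a last resort.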