unstructured feat/chunking_by_title_tokens

Posted by u4dcyp6a 4 months ago in Other


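Use the same soft-window and small-chunk combining logic, but count tokens instead of characters.
I see there is also a chunk_by_attention_window(), which is part of the staging/destination side, so it seems the two could be used in succession.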
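To make the request concrete, here is a minimal sketch of soft-window combining driven by token counts. It is not the library's implementation: `combine_by_tokens`, `max_tokens`, and `new_after_n_tokens` are hypothetical names chosen by analogy with the character-based limits, and the whitespace counter is a stand-in for whatever tokenizer the embedding model actually uses.

```python
from typing import Callable, Iterable, List


def count_whitespace_tokens(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def combine_by_tokens(
    pieces: Iterable[str],
    max_tokens: int,
    new_after_n_tokens: int,
    count_tokens: Callable[[str], int] = count_whitespace_tokens,
) -> List[str]:
    """Greedily pack small pieces into chunks, measuring size in tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for piece in pieces:
        n = count_tokens(piece)
        # Hard limit: close the open chunk if this piece would overflow it.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
        # Soft window: once the chunk is full enough, start a new one.
        if current_tokens >= new_after_n_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
    if current:
        chunks.append("\n\n".join(current))
    return chunks


chunks = combine_by_tokens(
    ["a b c", "d e", "f g h i", "j"], max_tokens=5, new_after_n_tokens=4
)
```

Two caveats on this sketch: a single piece larger than `max_tokens` still comes out oversized because nothing here splits it, and with a real subword tokenizer the token count of a joined chunk may differ slightly from the sum of its parts.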


Source: https://github.com/Unstructured-IO/unstructured/issues/2967

1 answer


vltsax25 #1

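I ran into a text of 24164 characters that contains 8220 tokens, so it does not fit in the context window of the OpenAI embedding model (via the OpenAI API). The text is the references section of a research paper.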

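OpenAI's rule of thumb is that natural language averages 4 characters per token, so a limit of 8192 tokens would correspond to 32768 characters. As my example shows, however, character-based chunking is not reliable.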
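The arithmetic behind that mismatch, using only the numbers quoted above (the variable names are just for illustration):

```python
CONTEXT_WINDOW_TOKENS = 8192
RULE_OF_THUMB_CHARS_PER_TOKEN = 4

text_chars = 24164
text_tokens = 8220

# Character budget implied by the rule of thumb: 8192 * 4 = 32768.
char_budget = CONTEXT_WINDOW_TOKENS * RULE_OF_THUMB_CHARS_PER_TOKEN

# Observed density for this references section: roughly 2.94 chars per token.
observed_ratio = text_chars / text_tokens

# A character-based check accepts the text...
fits_by_chars = text_chars <= char_budget
# ...while the token count is over the limit by 28 tokens.
fits_by_tokens = text_tokens <= CONTEXT_WINDOW_TOKENS
```

So the text passes a 32768-character limit with room to spare and still overflows the 8192-token window, presumably because reference lists are dense in names, numbers, and punctuation that tokenize poorly.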

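I don't think the current implementation of chunk_by_attention_window helps much. It only considers joining chunks, not splitting them. In fact, it requires every chunk produced by split_function to be smaller than max_input_size (the context window); otherwise it raises an exception.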

unstructured/unstructured/staging/huggingface.py
Lines 75 to 82 in a66661a

    if num_tokens > max_chunk_size:
        raise ValueError(
            f"The number of tokens in the segment is {num_tokens}. "
            f"The maximum number of tokens is {max_chunk_size}. "
            "Consider using a different split_function to reduce the size "
            "of the segments under consideration. The text that caused the "
            f"error is: \n\n{segment}",
        )
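For illustration, the missing splitting step could look roughly like the sketch below. `split_oversized` is a hypothetical helper, not part of unstructured, and the whitespace `tokenize`/`detokenize` defaults are assumptions that stand in for a real tokenizer's encode/decode pair.

```python
from typing import Any, Callable, List


def split_oversized(
    segment: str,
    max_tokens: int,
    tokenize: Callable[[str], List[Any]] = str.split,
    detokenize: Callable[[List[Any]], str] = " ".join,
) -> List[str]:
    """Cut a segment into consecutive windows of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(segment)
    # Segments that already fit are returned unchanged.
    if len(tokens) <= max_tokens:
        return [segment]
    # Otherwise slice the token list into fixed-size windows and rebuild text.
    return [
        detokenize(tokens[start : start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


parts = split_oversized("a b c d e", max_tokens=2)
```

Running each segment through a step like this before the joining logic would keep every piece under the limit, so the ValueError above would not be reached. The whitespace defaults normalize spacing when a segment is split; a subword tokenizer would avoid that but may cut in the middle of a word.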

Answered 4 months ago
