LangChain SemanticChunker: list index out of range

w51jfk4q · posted 3 months ago · in Other

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

In text_splitter.py (SemanticChunker):

def _calculate_sentence_distances(
    self, single_sentences_list: List[str]
) -> Tuple[List[float], List[dict]]:
    """Split text into multiple components."""

    _sentences = [
        {"sentence": x, "index": i} for i, x in enumerate(single_sentences_list)
    ]
    sentences = combine_sentences(_sentences, self.buffer_size)
    embeddings = self.embeddings.embed_documents(
        [x["combined_sentence"] for x in sentences]
    )
    for i, sentence in enumerate(sentences):
        # << fails here: embeddings ends up shorter than sentences, so embeddings[i] goes out of range
        sentence["combined_sentence_embedding"] = embeddings[i]

    return calculate_cosine_distances(sentences)

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/A72281951/telly/telly-backend/ingestion/main.py", line 132, in start
    store.load_data_to_db(configured_spaces)
  File "/Users/A72281951/telly/telly-backend/ingestion/common/utils.py", line 70, in wrapper
    value = func(*args, **kwargs)
  File "/Users/A72281951/telly/telly-backend/ingestion/agent/store/db.py", line 86, in load_data_to_db
    for docs in self.ingest_data(spaces):
  File "/Users/A72281951/telly/telly-backend/ingestion/agent/store/db.py", line 77, in ingest_data
    documents.extend(self.chunker.split_documents(docs))
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 258, in split_documents
    return self.create_documents(texts, metadatas=metadatas)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 243, in create_documents
    for chunk in self.split_text(text):
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 201, in split_text
    distances, sentences = self._calculate_sentence_distances(single_sentences_list)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 186, in _calculate_sentence_distances
    sentence["combined_sentence_embedding"] = embeddings[i]
IndexError: list index out of range

Description

  • I am trying to chunk a list of documents, but it fails
  • I am using SemanticChunker from langchain-experimental~=0.0.61
  • breakpoint_threshold_type = percentile, breakpoint_threshold_amount = 95.0

System Info

langchain==0.2.5
langchain-community==0.2.5
langchain-core==0.2.9
langchain-experimental==0.0.61
langchain-google-vertexai==1.0.5
langchain-postgres==0.0.8
langchain-text-splitters==0.2.1
Mac M3
Python 3.10.14


6uxekuva1#

Hi @amitjoy, could you please share your MVE? I cannot reproduce it using Cohere (instead of OpenAI) and the Greg Kamradt sample text referenced in the notebook. I have Python 3.10.11, but the same packages. The following code works without any error (percentile and 95 are the defaults, so they were not changed):

import os
from langchain.embeddings import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.docstore.document import Document
from langchain_community.embeddings import CohereEmbeddings

if __name__ == '__main__':
    os.environ["OPENAI_API_KEY"] = "<your_key>"
    os.environ["COHERE_API_KEY"] = "<your_key>"

    with open(r'./data/mit.txt') as file:
        essay = file.read()
        doc = Document(page_content=essay)

    # embeddings = OpenAIEmbeddings()
    embeddings = CohereEmbeddings(model="embed-english-light-v3.0")
    chunker = SemanticChunker(embeddings)
    docs = chunker.transform_documents([doc, ])
    print(f"{len(docs)}")

qvtsj1bj2#

I am currently using VertexAI Gemini to ingest data from Confluence:

self.chunker = SemanticChunker(
    embeddings=vector_db.embedding,  # VertexAIEmbeddings
    breakpoint_threshold_type=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.type,  # percentile
    breakpoint_threshold_amount=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.amount)  # 95.0

    def ingest_data(self, spaces: List[str]):
        for space in spaces:
            click.echo(f"⇢ Loading data from space '{space}'")
            confluence_loader = self.loader(space)

            documents: List[Document] = []
            if self.chunker is not None:
                docs: List[Document] = confluence_loader.load()
                documents.extend(self.chunker.split_documents(docs))
            elif self.splitter is not None:
                documents.extend(confluence_loader.load_and_split(self.splitter))

            # adding space ID to the existing metadata
            for doc in documents:
                doc.metadata["space_key"] = space
                # the following metadata is required for ragas
                doc.metadata['filename'] = space
            yield documents

pexxcrt23#

Hi @amitjoy, this is not an MVE (minimal verifiable example); for example, the elif branch does not even use the chunker.

My best guess is that you are not getting all the embeddings back. The stack trace is clear about this, so, for example, try printing the length of embeddings before the for loop.

I suggest reducing your code to a single document on which split_documents (or transform_documents, which is just a wrapper) is called. Also, try to leave out the confluence_loader, since it should not affect the end result.

An example of an MVE: take the code I provided (including the mentioned document), but simply replace CohereEmbeddings with VertexAIEmbeddings. If it fails, the problem is VertexAIEmbeddings. If it does not fail, try it with one of your documents. If that does not fail, the problem is the confluence_loader; otherwise, it is the document.
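
A minimal sketch of that experiment, assuming the same mit.txt sample file as in the snippet above and an already authenticated Vertex AI project; the naive sentence split is only a rough stand-in for SemanticChunker's internal splitting, and is there just to compare input and output counts:

from langchain_core.documents import Document
from langchain_experimental.text_splitter import SemanticChunker
from langchain_google_vertexai import VertexAIEmbeddings

# Assumes gcloud credentials, project and region are already configured.
embeddings = VertexAIEmbeddings("text-embedding-004")

with open("./data/mit.txt") as file:  # same sample essay as in the Cohere example
    doc = Document(page_content=file.read())

# Rough stand-in for SemanticChunker's own sentence splitting, just to compare counts.
sentences = [s for s in doc.page_content.split(". ") if s]
vectors = embeddings.embed_documents(sentences)
print(f"sentences: {len(sentences)}, embeddings returned: {len(vectors)}")  # should be equal

chunker = SemanticChunker(embeddings)  # percentile / 95.0 are the defaults
docs = chunker.transform_documents([doc])
print(len(docs))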


nbnkbykc4#

Hi @tibor-reiss, @amitjoy,
I ran into a similar issue. It can be reproduced with the following snippet:

import itertools
import lorem
from google.cloud import aiplatform
# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings # this one fails
from langchain_experimental.text_splitter import SemanticChunker

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = VertexAIEmbeddings("text-embedding-004")

text_splitter = SemanticChunker(embedding_model)

document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), 200)))

Note that the issue does not occur when using from langchain.embeddings import VertexAIEmbeddings, but that import triggers a deprecation warning.
The problem seems to come from the batch-size calculation in langchain_google_vertexai/embeddings.py, which produces an arbitrarily low batch size even when the total number of texts is high.
As a result, in text_splitter.py, embeddings ends up with a different length than sentences.

embeddings = self.embeddings.embed_documents(  # <<< does not return the correct number of embeddings
    [x["combined_sentence"] for x in sentences]
)
for i, sentence in enumerate(sentences):
    sentence["combined_sentence_embedding"] = embeddings[i]  # <<< IndexError raised here
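
To check that suspicion independently of SemanticChunker, a direct call along these lines could be used (a sketch only; it reuses the VertexAIEmbeddings setup from the snippet above and plain generated strings instead of real sentences):

from langchain_google_vertexai import VertexAIEmbeddings

embedding_model = VertexAIEmbeddings("text-embedding-004")

# 200 short texts, roughly matching the number of combined sentences above.
texts = [f"short test sentence number {i}" for i in range(200)]
vectors = embedding_model.embed_documents(texts)

print(f"inputs: {len(texts)}, embeddings: {len(vectors)}")
# SemanticChunker relies on exactly one embedding per input text.
assert len(vectors) == len(texts), "embed_documents returned fewer embeddings than inputs"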

r9f1avp55#

Hi jsconan, thanks for looking into this. As we suspected, this is an issue (or rather a feature) of the new VertexAIEmbeddings, not of SemanticChunker. I can see that there have indeed been some major changes in the source code. I suggest you change the title of this issue or, even better, open a new issue at https://github.com/langchain-ai/langchain-google.


svdrlsy46#

Thank you, @tibor-reiss. I have created an issue as you suggested: langchain-ai/langchain-google#353


bzzcjhmw7#

@amitjoy Please note that the issue has been fixed in an as-yet unreleased version: langchain-ai/langchain-google#353 (comment)
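
Until that fix ships, one possible interim workaround is a thin wrapper that feeds embed_documents small fixed-size batches and concatenates the results, so the number of returned vectors matches the inputs. This is only a sketch, and BatchedEmbeddings is a made-up helper name, not a LangChain API; whether it actually sidesteps the problem depends on where exactly the upstream batch-size calculation goes wrong.

from typing import List

from langchain_core.embeddings import Embeddings


class BatchedEmbeddings(Embeddings):
    """Hypothetical wrapper: delegate to an inner Embeddings model in small batches."""

    def __init__(self, inner: Embeddings, batch_size: int = 32):
        self.inner = inner
        self.batch_size = batch_size

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        vectors: List[List[float]] = []
        for start in range(0, len(texts), self.batch_size):
            # Embed one small slice at a time and concatenate the results.
            vectors.extend(self.inner.embed_documents(texts[start:start + self.batch_size]))
        return vectors

    def embed_query(self, text: str) -> List[float]:
        return self.inner.embed_query(text)

SemanticChunker(BatchedEmbeddings(VertexAIEmbeddings("text-embedding-004"))) could then be used in place of the raw model; once the released langchain-google-vertexai contains the fix, the wrapper can simply be dropped.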
