llama_index [Question]: Storing duplicates in vector DB

Posted by qlckcl4x 5 months ago in Other

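Question Validation

I have searched both the documentation and Discord for an answer.

Question

I have created nodes, and they are unique. However, when I try to create a vector store index, it creates duplicate entries in my vector database. I am using Qdrant, and when I inspect the vector database I can see the duplicate entries. This is my Qdrant initialization: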

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(
    url="xxxxx",
    api_key="xxxxx",
)
asynclient = qdrant_client.AsyncQdrantClient(
    url="xxxxxx",
    api_key="xxxxxx",
)
# vector_store = QdrantVectorStore(client=client, collection_name="Tender-Db", aclient=asynclient)

After creating the nodes, I use a storage context:

from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    nodes,
    storage_context=storage_context,
    embed_model = VoyageEmbeddModel
)
# index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=VoyageEmbeddModel)

Can anyone tell me how to solve this problem?

llama_index

Source: https://github.com/run-llama/llama_index/issues/13595

4 Answers

laawzig2 #1

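To avoid duplicate entries in the Qdrant database when using QdrantVectorStore, you should check whether a node already exists before adding it. The QdrantVectorStore class does not deduplicate by content on its own; it relies only on the node ID as the unique identifier.
Here is a strategy for making sure only unique nodes are added:

1. Use the get_nodes method to query the database by ID for nodes that already exist.
2. Filter out of the list of nodes to be added any node that is already in the database.
3. Add only the new, unique nodes to the database.
This approach compares each node's unique identifier (assuming every BaseNode has a unique node_id attribute) against the existing entries in the database and adds the node only if it is not there yet.
Here is a simplified code snippet illustrating the approach: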

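def add_unique_nodes(node_list):
    # Placeholder helper: returns the IDs already stored in the database.
    existing_ids = set(query_database_for_existing_nodes())
    for node in node_list:
        # A BaseNode exposes its unique identifier as node_id.
        if node.node_id not in existing_ids:
            # Placeholder helper: writes a single node to the database.
            add_node_to_database(node)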

With this approach, nodes whose IDs are already present are never written again, so the VectorStoreIndex keeps one entry per node ID.
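
The filter-before-insert idea can be tried out without a running Qdrant instance. The following is only a sketch: `InMemoryStore`, `existing_ids`, and `add_unique_nodes` are hypothetical stand-ins for a real vector store and its lookup call, and `Node` is a simplified substitute for a llama_index node, not the library's own class.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified stand-in for a llama_index node (hypothetical)."""
    node_id: str
    text: str


@dataclass
class InMemoryStore:
    """Hypothetical store keyed by node ID; not a real Qdrant client."""
    points: dict = field(default_factory=dict)

    def existing_ids(self, node_ids):
        # Return the subset of the requested IDs that is already stored.
        return {nid for nid in node_ids if nid in self.points}

    def add(self, nodes):
        for node in nodes:
            self.points[node.node_id] = node


def add_unique_nodes(store, nodes):
    """Add only nodes whose ID is not in the store; return the ones added."""
    seen = set(store.existing_ids([n.node_id for n in nodes]))
    new_nodes = []
    for node in nodes:
        if node.node_id not in seen:
            new_nodes.append(node)
            seen.add(node.node_id)  # also drops duplicates inside the batch
    store.add(new_nodes)
    return new_nodes


store = InMemoryStore()
batch = [Node("a", "first"), Node("b", "second"), Node("a", "first")]
first_run = add_unique_nodes(store, batch)
second_run = add_unique_nodes(store, batch)
print(len(first_run), len(second_run), len(store.points))
```

Running the same batch twice adds two nodes the first time and none the second. Note that an ID check like this only helps when node IDs are stable across runs; nodes that receive a fresh random ID on every run will never match an existing entry.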

This response is meant to save you time and be useful. It is not a precise solution, but a starting point for your own research.

5 months ago

bogh5gae #2

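You should use an ingestion pipeline with a docstore and a vector store attached.

Assuming each input document has a consistent document ID, the pipeline can look up the existing hash for that document (if any) in the docstore and decide whether to update, skip, or insert.

Here is an example (ignoring the cache):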

[code example not preserved in this copy of the thread]
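
The update/skip/insert decision described in this answer can be illustrated in plain Python. This is a sketch of the idea only, not llama_index's actual implementation: `upsert_documents`, the `docstore` dictionary, and the choice of SHA-256 as the content hash are all assumptions made for the example.

```python
import hashlib


def content_hash(text):
    # Any stable hash works; SHA-256 is an assumption for this sketch.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_documents(docstore, documents):
    """Decide per document whether to insert, update, or skip.

    docstore maps doc_id -> hash of the last ingested content.
    documents maps doc_id -> current text.
    Returns a dict doc_id -> action taken.
    """
    actions = {}
    for doc_id, text in documents.items():
        new_hash = content_hash(text)
        old_hash = docstore.get(doc_id)
        if old_hash is None:
            actions[doc_id] = "insert"  # never seen before
        elif old_hash != new_hash:
            actions[doc_id] = "update"  # same ID, content changed
        else:
            actions[doc_id] = "skip"    # unchanged, nothing to embed
            continue
        docstore[doc_id] = new_hash     # remember the latest content hash
    return actions


docstore = {}
first = upsert_documents(docstore, {"a.txt": "hello", "b.txt": "world"})
second = upsert_documents(docstore, {"a.txt": "hello", "b.txt": "world!"})
print(first)
print(second)
```

Only documents marked insert or update would need to be embedded again and written to the vector store, which is why consistent document IDs matter: without them, every run looks like a fresh insert.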

5 months ago

whlutmcx #3

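Thanks @logan-markewich. If I want to use a Chroma vector_store together with a locally stored docstore, is this the implementation you would recommend?

Embed the content and store it in the Chroma vector_store and the local docstore: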

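import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore

# embed_model and documents are assumed to be defined elsewhere.
pipeline = IngestionPipeline(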
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=SimpleDocumentStore(),
    vector_store=ChromaVectorStore(
        chroma_collection=chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("collection_name")
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS
)

nodes = pipeline.run(documents=documents)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

pipeline.persist("./pipeline_storage")

Load the content from the Chroma vector_store and the local docstore:

documents = SimpleDirectoryReader(
    "./test_redis_data", filename_as_id=True
).load_data()

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ]
)

pipeline.load("./pipeline_storage")

nodes = pipeline.run(documents=documents)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

5 months ago

ig9co6j1 #4

@130jd not quite -- you should pass in the vector store again when loading. Tbh I would load both the vector and docstore outside of the pipeline and just pass it in. But that's just me

5 months ago
