BERTopic LangChain representation: the [KEYWORDS] tag is not included during generation

vdgimpew posted 3 months ago in Other

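Hi,
Why isn't the [KEYWORDS] tag included when generating representations with a LangChain chain?
Could it be added? I think it might improve label accuracy.
Could this be implemented in _langchain.py with something as simple as the following?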

# `self.chain` must take `input_documents` and `question` as input keys
# Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
if "[KEYWORDS]" in self.prompt:
    prompts = []
    for topic in topics:
        keywords = list(zip(*topics[topic]))[0]
        prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
        prompts.append(prompt)

    inputs = [
        {"input_documents": docs, "question": prompt}
        for docs, prompt in zip(chain_docs, prompts)
    ]
else:
    inputs = [
        {"input_documents": docs, "question": self.prompt}
        for docs in chain_docs
    ]

instead of:

inputs = [
    {"input_documents": docs, "question": self.prompt}
    for docs in chain_docs
]
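A minimal, dependency-free sketch of the same idea, written as a standalone helper. The name `build_chain_inputs` is hypothetical and not part of BERTopic, the documents are plain strings rather than LangChain `Document` objects, and the topics are toy values. It assumes `topics` and `chain_docs` are in the same topic order, and it uses a list comprehension instead of `list(zip(*...))[0]`, which would raise an `IndexError` for a topic with no keywords.

```python
from typing import Dict, List, Tuple


def build_chain_inputs(
    prompt: str,
    topics: Dict[str, List[Tuple[str, float]]],
    chain_docs: List[List[str]],
) -> List[dict]:
    """Hypothetical helper: build one chain input per topic."""
    inputs = []
    # Assumption: `topics` and `chain_docs` are in the same topic order.
    for (topic, words), docs in zip(topics.items(), chain_docs):
        question = prompt
        if "[KEYWORDS]" in prompt:
            # Keep only the words, dropping their scores.
            keywords = [word for word, _ in words]
            question = prompt.replace("[KEYWORDS]", ", ".join(keywords))
        inputs.append({"input_documents": docs, "question": question})
    return inputs


# Toy data for illustration only; the second topic has no keywords.
topics = {"-1": [("Well", 0.3), ("Sell", 0.2)], "0": []}
chain_docs = [["Hi", "Bye"], ["No", "So"]]
inputs = build_chain_inputs("Label this topic. Keywords: [KEYWORDS]", topics, chain_docs)
print([item["question"] for item in inputs])
```

A prompt without the tag passes through unchanged, so the current behaviour is preserved for existing users.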
bpzcxfmw #2

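I think this should definitely be possible, but I'm not sure exactly how LangChain handles the prompt. For example, does it add the documents before or after the question? In your example, what would the prompt look like?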

jaxagkaj #3

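"I think this should definitely be possible, but I'm not sure exactly how LangChain handles the prompt. For example, does it add the documents before or after the question? In your example, what would the prompt look like?"

Please note that I'm not a programmer, so I'm not sure I understand what you mean.

That said, I ran a test based on the format of your documents and topics. It appears to satisfy the input requirements of the batch process and has the same structure as your current input: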

import pandas as pd
from langchain.docstore.document import Document
from typing import Callable, Dict, Mapping, List, Tuple, Union

repr_docs = [["Hi", "Bye", "Sigh"], ["No", "So", "Glow"]]

chain_docs: List[List[Document]] = [
    [Document(page_content=doc) for doc in docs]
    for docs in repr_docs
]

topics = {"-1": [("Well", .3), ("Sell", .2)], "0": [("Sap", .33), ("Cap", .21)]}

self_prompt = "Huh? [KEYWORDS]"
if "[KEYWORDS]" in self_prompt:
    prompts = []
    for topic in topics:
        keywords = list(zip(*topics[topic]))[0]
        prompt = self_prompt.replace("[KEYWORDS]", ", ".join(keywords))
        prompts.append(prompt)

    inputs = [
        {"input_documents": docs, "question": prompt}
        for docs, prompt in zip(chain_docs, prompts)
    ]
else:
    inputs = [
        {"input_documents": docs, "question": self_prompt}
        for docs in chain_docs
    ]

# `self.chain` must return a dict with an `output_text` key
# same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
#outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
#labels = [output["output_text"].strip() for output in outputs]
inputs

#Example prompt template: prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")
#...
#Example batch input: chain.batch([{"topic": "bears"}, {"topic": "cats"}], config={"max_concurrency": 5})

Input format with [KEYWORDS]:

[{'input_documents': [Document(page_content='Hi'),
Document(page_content='Bye'),
Document(page_content='Sigh')],
'question': 'Huh? Well, Sell'},
{'input_documents': [Document(page_content='No'),
Document(page_content='So'),
Document(page_content='Glow')],
'question': 'Huh? Sap, Cap'}]

Input format without [KEYWORDS] (your current implementation):

[{'input_documents': [Document(page_content='Hi'),
Document(page_content='Bye'),
Document(page_content='Sigh')],
'question': 'Huh?'},
{'input_documents': [Document(page_content='No'),
Document(page_content='So'),
Document(page_content='Glow')],
'question': 'Huh?'}]
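
Since this mirrors your input format, only with a different prompt per topic, I would expect LangChain to handle it the same way. It still receives a list of documents and a prompt for each topic; the prompt is now topic-specific rather than a single static value, but it is ingested in the same way.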

ej83mcc0 #4

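Thanks for sharing this. What I mean is that it is still unclear to me which exact prompt is sent to the underlying LLM. For example, self.prompt is not the actual prompt but probably a template, since LangChain processes it to include the documents.
In other words, does the underlying prompt look like this: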

"""
What are these documents about? Please give a single label.

Doc 1
Doc 2
Doc 3
"""

or like this:

"""
Doc 1
Doc 2
Doc 3

What are these documents about? Please give a single label.
"""

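This is important to know, because simply adding a [KEYWORDS] tag is not enough if users don't know how to use it. For example, if you want to use [KEYWORDS], what would the input prompt look like? Like this: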

"""
I have a topic with the following keywords: `[KEYWORDS]`
Create a topic label based on the keywords and the following documents:

Doc 1
Doc 2
Doc 3
"""

or like this:

"""
Doc 1
Doc 2
Doc 3

These documents are about a topic that has the following keywords: `[KEYWORDS]`

Create a topic label based on the keywords and these most representative documents.
"""
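
To make the two options concrete, here is a small sketch that renders both layouts with plain string handling. No LangChain is involved; `render_prompt` and the `[DOCUMENTS]` placeholder are hypothetical names used only for this illustration, and the keywords and documents are toy values.

```python
from typing import List

# Option 1: keywords and instruction first, documents last.
KEYWORDS_FIRST = (
    "I have a topic with the following keywords: `[KEYWORDS]`\n"
    "Create a topic label based on the keywords and the following documents:\n\n"
    "[DOCUMENTS]"
)
# Option 2: documents first, keywords and instruction last.
DOCUMENTS_FIRST = (
    "[DOCUMENTS]\n\n"
    "These documents are about a topic that has the following keywords: `[KEYWORDS]`\n\n"
    "Create a topic label based on the keywords and these most representative documents."
)


def render_prompt(template: str, keywords: List[str], docs: List[str]) -> str:
    """Fill both placeholders; a stand-in for what the chain would do."""
    filled = template.replace("[KEYWORDS]", ", ".join(keywords))
    return filled.replace("[DOCUMENTS]", "\n".join(docs))


keywords = ["Sap", "Cap"]
docs = ["Doc 1", "Doc 2", "Doc 3"]
keywords_first = render_prompt(KEYWORDS_FIRST, keywords, docs)
documents_first = render_prompt(DOCUMENTS_FIRST, keywords, docs)
print(keywords_first)
print("-" * 20)
print(documents_first)
```

Which layout applies in practice depends on where the chain's own template places the documents relative to the question, which is the open point here.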

u5rb5r59 #5

@MaartenGr
Got it. I believe it comes from this initialization:
stuff_prompt.py
load_qa_chain (here: question_answering/__init__.py) calls _load_stuff_chain for the "stuff" chain, which goes through:

def _load_stuff_chain(
    llm: BaseLanguageModel,
    prompt: Optional[BasePromptTemplate] = None,
    document_variable_name: str = "context",
    verbose: Optional[bool] = None,
    callback_manager: Optional[BaseCallbackManager] = None,
    callbacks: Callbacks = None,
    **kwargs: Any,
) -> StuffDocumentsChain:
    _prompt = prompt or stuff_prompt.PROMPT_SELECTOR.get_prompt(llm)

So it is either the optional custom prompt template, or it falls back to stuff_prompt.PROMPT_SELECTOR (from stuff_prompt.py above), i.e.:

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=PROMPT, conditionals=[(is_chat_model, CHAT_PROMPT)]
)

and the prompt then comes from:

prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

So the default template takes the form of the latter option:

"""
Doc 1
Doc 2
Doc 3

These documents are about a topic that has the following keywords: `[KEYWORDS]`

Create a topic label based on the keywords and these most representative documents.
"""

At least going by that documentation, I believe these still apply and have not been deprecated.
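
As a quick check of that reading, the default template quoted above can be filled in with plain `str.format`. This sketch does not call LangChain; it assumes the "stuff" chain joins the documents with a blank line between them before substituting them for `{context}`, and the question and keywords are toy values.

```python
# Template text copied from the default prompt quoted above.
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""

docs = ["Doc 1", "Doc 2", "Doc 3"]
# Assumption: documents are joined with a blank line between them.
context = "\n\n".join(docs)

# The per-topic question, with [KEYWORDS] already replaced by toy keywords.
question = "What are these documents about? Keywords: Sap, Cap. Please give a single label."

rendered = prompt_template.format(context=context, question=question)
print(rendered)
```

With this template the documents land before the question, so any [KEYWORDS] text in the question appears after the documents.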

y0u0uwnf #6

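In that case, the base prompt that BERTopic currently uses with LangChain should be updated to incorporate the [KEYWORDS] tag. I'm also curious how this would affect the quality of the generated topic labels compared with the original approach. If you're up for it, a PR would be greatly appreciated!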
