BERTopic: n-gram keywords need to be delimited in OpenAI()

Posted by gfttwv5a 5 months ago in Other


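Hi Maarten,

I think there is a bug in how the OpenAI representation model builds its prompt. Keywords are separated only by a space rather than a comma, which is a problem for n-grams > 1 because the boundaries between multi-word keywords are lost.

Lines 203 to 209 at commit 244215a: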

    def _create_prompt(self, docs, topic, topics):
        keywords = list(zip(*topics[topic]))[0]

        # Use the Default Chat Prompt
        if self.prompt == DEFAULT_CHAT_PROMPT or self.prompt == DEFAULT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", " ".join(keywords))
            prompt = self._replace_documents(prompt, docs)

Without a proper delimiter, I get prompts like the following:

I have a topic that contains the following documents:

Legumes for mitigation of climate change and the provision of feedstock for biofuels and biorefineries. A review.
A global spectral library to characterize the world's soil.
Classification of natural flow regimes in Australia to support environmental flow management.
Laboratory characterisation of shale properties.
Effects of climate extremes on the terrestrial carbon cycle: concepts, processes and potential future impacts.
Threat of plastic pollution to seabirds is global, pervasive, and increasing.
Pushing the limits in marine species distribution modelling: lessons from the land present challenges and opportunities.
Land-use futures in the shared socio-economic pathways.
The WULCA consensus characterization model for water scarcity footprints: assessing impacts of water consumption based on available water remaining (AWARE).
BIOCHAR APPLICATION TO SOIL: AGRONOMIC AND ENVIRONMENTAL BENEFITS AND UNINTENDED CONSEQUENCES.

The topic is described by the following keywords: food land use global properties climate using review potential change different production environmental data changes high study based years model models time used area future terrestrial plant field analysis management

Based on the information above, extract a short topic label in the following format:
topic:


TextGeneration and Cohere look to be okay.

Lines 130 to 136 at commit 244215a:

    def _create_prompt(self, docs, topic, topics):
        keywords = ", ".join(list(zip(*topics[topic]))[0])

        # Use the default prompt and replace keywords
        if self.prompt == DEFAULT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", keywords)
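
To see why the delimiter matters, the two join styles can be compared outside BERTopic. This is a minimal sketch; the bigram keywords are made up for illustration and are not taken from the model output above.

```python
# Hypothetical n-gram keywords, chosen only for illustration.
keywords = ["climate change", "land use", "food production"]

space_joined = " ".join(keywords)   # style used in the OpenAI snippet
comma_joined = ", ".join(keywords)  # style used in the second snippet

print(space_joined)
print(comma_joined)

# Only the comma-delimited string lets a reader (or an LLM) recover
# where one multi-word keyword ends and the next begins.
print(comma_joined.split(", ") == keywords)  # True
print(space_joined.split(" ") == keywords)   # False
```

With the space-joined string, the three bigrams are indistinguishable from six unigrams.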

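Also, it would be very helpful to have a way to generate an example prompt with [DOCUMENTS] and [KEYWORDS] filled in, to help with testing, so that users can actually see what is being sent. Because I am using ChatGPT on AWS, I have a custom class with some extra loggers, but with standard BERTopic it is hard to see the prompt.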
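
As a rough idea of what such a preview could look like, here is a minimal standalone sketch. The `preview_prompt` helper and the template text are assumptions made for illustration; neither is part of BERTopic, and the real prompt templates may format documents differently.

```python
def preview_prompt(template, docs, keywords):
    """Fill the [KEYWORDS] and [DOCUMENTS] tags of a prompt template.

    Hypothetical helper, not part of BERTopic. Keywords are joined with
    commas and each document is placed on its own line.
    """
    prompt = template.replace("[KEYWORDS]", ", ".join(keywords))
    doc_block = "\n".join(f"- {doc}" for doc in docs)
    return prompt.replace("[DOCUMENTS]", doc_block)


# Illustrative template modelled on the prompt shown earlier in this post.
template = (
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
    "Based on the information above, extract a short topic label."
)

example = preview_prompt(
    template,
    docs=["Laboratory characterisation of shale properties."],
    keywords=["land use", "climate change"],
)
print(example)
```

Printing the filled-in template once before fitting makes it easy to check the delimiters and the document formatting without sending anything to the API.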


Source: https://github.com/MaartenGr/BERTopic/issues/1546

6 answers


ulydmbyx1#

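Thanks for the detailed description! I will make sure to change this in #1539.

Regarding your request: "Also, it would be very helpful to have a way to generate an example prompt with [DOCUMENTS] and [KEYWORDS] filled in, to help with testing, so that users can actually see what is being sent. Because I am using ChatGPT on AWS, I have a custom class with some extra loggers, but with standard BERTopic it is hard to see the prompt."

Indeed, I could enable verbosity so that the prompt given at each call is printed, but that might produce too much logging if you have a very large dataset.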

Posted 5 months ago

jmo0nnb32#

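What I set up in my custom class only prints the prompt for topic 0 (or for the outlier topic if there are no topics), so if you would rather use verbosity than a function that makes prompts, that might be a good way to do it.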
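
The selection rule described here (topic 0, falling back to the outlier topic) can be sketched as follows. The `pick_example_topic` helper and the placeholder prompts are hypothetical and only illustrate the rule; they are not taken from the custom class or from BERTopic.

```python
def pick_example_topic(topic_ids):
    """Choose which topic's prompt to print as the single example.

    Prefers topic 0 and falls back to the outlier topic (-1) when no
    regular topic exists. Returns None if neither is present.
    Hypothetical helper, not part of BERTopic.
    """
    if 0 in topic_ids:
        return 0
    if -1 in topic_ids:
        return -1
    return None


# Prompts keyed by topic id, with placeholder text.
prompts = {
    -1: "prompt for the outlier topic",
    0: "prompt for topic 0",
    1: "prompt for topic 1",
}

example_topic = pick_example_topic(prompts)
if example_topic is not None:
    print(f"Example prompt (topic {example_topic}):\n{prompts[example_topic]}")
```

Printing a single prompt keeps the output short regardless of how many topics the dataset produces.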

Posted 5 months ago

5us2dqdw3#

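The thing is, when only one topic is logged, users might want every one of them logged, and vice versa. I could add extra verbosity levels to the LLMs themselves, but that feels less intuitive from a user-experience point of view, because verbosity is handled differently throughout the rest of BERTopic.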

Posted 5 months ago

ymdaylpp4#

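Yes, it would be good to have an easier way to access all prompts than pulling them out of the logs. Are the selection and diversification of representative documents deterministic? If so, instead of looping over topics, generating a prompt, and fetching a description one at a time, you could generate all prompts up front and then loop over the prompts to get each representation. The prompt generation could then be abstracted into a function or method that users can call to get all prompts, using the same parameters they originally sent to the LLM, and perhaps attach them to .get_topic_info() if they want. Pseudocode might look like this: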

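    from bertopic import BERTopic
    from bertopic.representation import OpenAI

    # docs is the list of input documents (strings)
    representation_model = OpenAI(delay_in_seconds=5, nr_docs=10, diversity=0.2)

    topic_model = BERTopic(representation_model=representation_model)
    topics, probs = topic_model.fit_transform(docs)

    topic_info = topic_model.get_topic_info()
    # generate_prompts() is the method proposed here, not an existing one
    topic_info['prompts'] = representation_model.generate_prompts()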


Then, instead of:
BERTopic/bertopic/representation/_openai.py
Lines 192 to 220 in [817ad86](https://github.com/MaartenGr/BERTopic/commit/817ad86e0c42462dac659f7b4846c6e5f7432449)

    # Generate using OpenAI's Language Model
    updated_topics = {}
    for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
        truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
        prompt = self._create_prompt(truncated_docs, topic, topics)

        # Delay
        if self.delay_in_seconds:
            time.sleep(self.delay_in_seconds)

        if self.chat:
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ]
            kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
            if self.exponential_backoff:
                response = chat_completions_with_backoff(**kwargs)
            else:
                response = openai.ChatCompletion.create(**kwargs)
            label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
        else:
            if self.exponential_backoff:
                response = completions_with_backoff(model=self.model, prompt=prompt, **self.generator_kwargs)
            else:
                response = openai.Completion.create(model=self.model, prompt=prompt, **self.generator_kwargs)
            label = response["choices"][0]["text"].strip()

        updated_topics[topic] = [(label, 1)]

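you could have something like:

    # Generate prompts
    prompts = self.generate_prompts(topic_model, repr_docs_mappings, topics)

    # Log an example prompt: the second one if it exists (topic 0 when the
    # outlier topic comes first), otherwise the only prompt available
    if prompts:
        logger.info("Example prompt: \n{}".format(prompts[min(1, len(prompts) - 1)]))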

    # Generate using OpenAI's Language Model
    updated_topics = {}
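    # Iterate over repr_docs_mappings so each prompt is paired with the topic it was built for
    for topic, p in tqdm(zip(repr_docs_mappings, prompts), total=len(prompts), disable=not topic_model.verbose):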
        
        # Delay
        if self.delay_in_seconds:
            time.sleep(self.delay_in_seconds)

        if self.chat:
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": p}
            ]
            kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
            if self.exponential_backoff:
                response = chat_completions_with_backoff(**kwargs)
            else:
                response = openai.ChatCompletion.create(**kwargs)
            label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
        else:
            if self.exponential_backoff:
                response = completions_with_backoff(model=self.model, prompt=p, **self.generator_kwargs)
            else:
                response = openai.Completion.create(model=self.model, prompt=p, **self.generator_kwargs)
            label = response["choices"][0]["text"].strip()

        updated_topics[topic] = [(label, 1)]

    return updated_topics

    def generate_prompts(self, topic_model, repr_docs_mappings, topics):
        # Build one prompt per topic, in the iteration order of repr_docs_mappings
        prompts = []
        for topic, docs in repr_docs_mappings.items():
            truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
            prompts.append(self._create_prompt(truncated_docs, topic, topics))

        return prompts


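This code is based on [#1539](https://github.com/MaartenGr/BERTopic/pull/1539) and still needs some work. It can generate representations, but `representation_model.generate_prompts()` still does not work, because `generate_prompts` lives inside `extract_topics` and depends on things that are not easy to obtain from outside. There is no point in spending more time on this before getting your feedback, though.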

Posted 5 months ago

hpcdzsge5#

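Good idea! Generating the prompts before passing them to the LLM is feasible, since they currently do not depend on previous prompts. That might change in the future, however, so I think I would prefer to simply save the prompts after generating them iteratively. You could then save the prompts to the representation model and access them there.
Because the prompts also depend on the order of the representation models (KeyBERT -> OpenAI), I think .generate_prompts would only work when OpenAI is used standalone. If other representation methods are present, the method would not work without running all of those other methods first, which might prove too computationally inefficient.
Also, in your example you would actually create the prompts twice: once when running .fit_transform and again when running .generate_prompts. Instead, you could save the prompts to representation_model.OpenAI while the representations are being created and then access them with something like representation_model.generated_prompts_.
Based on this, I would suggest the following: in any LLM representation model, save the prompts while creating the representations, with an option to log either every prompt or only the first one. That way the prompts are created only once, during .fit_transform, and can easily be accessed afterwards.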
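
A minimal sketch of this suggestion, using a toy class rather than BERTopic's own representation models. The class name, the `log_prompts` option, the prompt wording, and the fake LLM are all assumptions made for illustration; only the attribute name `generated_prompts_` follows the naming floated above.

```python
import logging

logger = logging.getLogger("prompt_demo")


class PromptSavingLabeler:
    """Toy stand-in for an LLM representation model (not BERTopic's class)."""

    def __init__(self, generate_fn, log_prompts="first"):
        # generate_fn: any callable that maps a prompt string to a label.
        # log_prompts: "first", "all", or None (hypothetical option values).
        self.generate_fn = generate_fn
        self.log_prompts = log_prompts
        self.generated_prompts_ = {}

    def extract_topics(self, topic_keywords):
        # Reset so a second fit never mixes old and new prompts.
        self.generated_prompts_ = {}
        labels = {}
        for i, (topic, keywords) in enumerate(topic_keywords.items()):
            prompt = "Keywords: " + ", ".join(keywords) + "\nLabel:"
            # Save the prompt at the moment it is built, so it is created once.
            self.generated_prompts_[topic] = prompt
            if self.log_prompts == "all" or (self.log_prompts == "first" and i == 0):
                logger.info("Prompt for topic %s:\n%s", topic, prompt)
            labels[topic] = self.generate_fn(prompt)
        return labels


def fake_llm(prompt):
    """Pretend LLM: returns the first keyword, so no API call is needed."""
    first_line = prompt.splitlines()[0]
    return first_line[len("Keywords: "):].split(", ")[0]


model = PromptSavingLabeler(fake_llm)
labels = model.extract_topics({0: ["land use", "climate change"], 1: ["soil", "biochar"]})
print(labels)
print(model.generated_prompts_[0])
```

Because the prompts are stored as a side effect of the single labelling pass, reading them afterwards costs nothing and always reflects exactly what was sent, including any keywords changed by earlier steps in a representation chain.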