BERTopic representation_model: 'NoneType' object is not iterable

Posted by bmp9r5qi 4 months ago in Other

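Hello!
First of all, thank you for developing BERTopic, it's great! However, I ran into a problem when trying to rename my cluster representations. As long as I only use embedding_model, everything works. But as soon as I start using representation_model, I keep getting the same error.
Below is some example code, inspired by this documentation.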

# 1. Import the necessary libraries
import pandas as pd
from bertopic import BERTopic
from bertopic.representation import TextGeneration
from transformers import pipeline

# prompt = f"I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"

# 2. Create your representation model (uses the default prompt)
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator)

# 3. Get some sample data
data = pd.read_excel('testdata.xlsx')

# 4. Initialize BERTopic with the representation model
topic_model = BERTopic(
    embedding_model='paraphrase-multilingual-mpnet-base-v2',
    representation_model=representation_model  # if commented out, the code works
)

# 5. Fit BERTopic to the sample texts
topics, _ = topic_model.fit_transform(data['text'])

# 6. Get the topic information
topic_info = topic_model.get_topic_info()

# 7. Print the topic information
print(topic_info)

The error I get is:

TypeError                                 Traceback (most recent call last)
Cell In[3], line 26
     20 topic_model = BERTopic(
     21     embedding_model= 'paraphrase-multilingual-mpnet-base-v2',
     22     representation_model = representation_model
     23 )
     25 # 6. Fit BERTopic to the sample texts
---> 26 topics, _ = topic_model.fit_transform(data['Absatz'])
     28 # 6. Get the topic information
     29 topic_info = topic_model.get_topic_info()

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:433, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    430     self._save_representative_docs(custom_documents)
    431 else:
    432     # Extract topics by calculating c-TF-IDF
--> 433     self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    435     # Reduce topics
    436     if self.nr_topics:

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3637, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3635 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   3636 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3637 self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3638 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3639 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
   3640                       for key, values in
   3641                       self.topic_representations_.items()}

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3922, in BERTopic._extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
   3920         topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
   3921 elif isinstance(self.representation_model, BaseRepresentation):
-> 3922     topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
   3923 elif isinstance(self.representation_model, dict):
   3924     if self.representation_model.get("Main"):

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/representation/_textgeneration.py:147, in TextGeneration.extract_topics(self, topic_model, documents, c_tf_idf, topics)
    143 updated_topics = {}
    144 for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
    145 
    146     # Prepare prompt
--> 147     truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
    148     prompt = self._create_prompt(truncated_docs, topic, topics)
    149     self.prompts_.append(prompt)

TypeError: 'NoneType' object is not iterable

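I'm running this on an M1 Mac, if that helps. Any help would be appreciated. I also tried copying all of the code from the best practices page and got the same error.
Best wishes!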
亚历克斯·穆尔豪森

zour9fqk #1

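Honestly, I'm not sure what is happening here. I believe there is another open issue reporting the same problem that has not been resolved yet, but it may be related to the underlying T5 model. Also, have you tried passing the documents as a list of strings instead of a pandas Series?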
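To rule out the input type as a cause, the documents can be turned into a plain list of strings before fitting. The following is a minimal sketch, not part of BERTopic: the helper name `to_document_list` is hypothetical, and it assumes that missing cells show up as `None` or as float NaN, which is how pandas usually represents empty cells.

```python
import math
from typing import Iterable, List


def to_document_list(values: Iterable[object]) -> List[str]:
    """Return the non-missing entries of `values` as a plain list of strings.

    Hypothetical helper for illustration; it is not a BERTopic function.
    """
    documents = []
    for value in values:
        # Skip missing cells: None, or the float NaN used for empty cells.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        documents.append(str(value))
    return documents


# Stand-in for a spreadsheet column that contains two empty cells.
sample = ["first paragraph", None, float("nan"), "second paragraph"]
docs = to_document_list(sample)
print(docs)
```

With the code from the question, this would be used as `docs = to_document_list(data['text'])` followed by `topic_model.fit_transform(docs)`. A later reply in this thread reports that passing a list did not make the error go away, so this only eliminates one possible cause.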

v1l68za4 #2

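I ran into the same problem, but only when using the TextGeneration representation model. The other representation models work for me without any issues. I did try passing the documents as a list of strings, but the error persists.
The same code runs successfully for me on v0.15.0.
Edit: I did some digging and found that the problem is in this line. It seems that whenever the default prompt is used, the top representative documents will be None.
To fix this, an empty list can be assigned as the default value in the else condition at line 141. I opened a PR with this change.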
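The failure mode described here can be reproduced in isolation. The sketch below is a simplified stand-in and not the BERTopic source: `mappings_without_docs` and `truncate_all` are hypothetical names that only mimic the shape of the loop shown in the traceback, where each topic's value is iterated. It shows why a `None` placeholder raises the error and why an empty list does not.

```python
from typing import Dict, List, Optional


def mappings_without_docs(topic_ids: List[int], use_empty_list: bool) -> Dict[int, Optional[List[str]]]:
    """Hypothetical stand-in for the branch that collects no representative documents."""
    # A fresh list is created per topic so that the topics do not share one object.
    return {topic: ([] if use_empty_list else None) for topic in topic_ids}


def truncate_all(mappings: Dict[int, Optional[List[str]]], max_chars: int = 20) -> Dict[int, List[str]]:
    """Iterate over every topic's documents, as the loop in the traceback does."""
    return {topic: [doc[:max_chars] for doc in docs] for topic, docs in mappings.items()}


topic_ids = [-1, 0, 1]

# With None as the placeholder, iterating over a topic's documents fails.
try:
    truncate_all(mappings_without_docs(topic_ids, use_empty_list=False))
    outcome = "no error"
except TypeError as exc:
    outcome = str(exc)

# With an empty list as the placeholder, the loop simply has nothing to do.
fixed = truncate_all(mappings_without_docs(topic_ids, use_empty_list=True))

print(outcome)
print(fixed)
```

An empty list keeps the downstream code path unchanged: the comprehension yields no truncated documents, and prompt creation can continue with the keywords alone.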

puruo6ea #3

Thanks for the PR. I just merged #1726, which should resolve the issue. Could one of you test it so that I know it works for others as well?
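When testing a merged change, it helps to confirm which build is actually installed in the environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical. The thread does not say which release contains the merged change, so a version number alone does not prove the fix is present, and installing from the repository's main branch may be necessary.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when BERTopic is not installed.
print("bertopic:", installed_version("bertopic"))
```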

z6psavjg #4

Thanks for the update! I tested it, and it ran without any errors on my end.
