```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer, ClassTfidfTransformer
from river import stream
from river import cluster


class River:
    """Wrapper that lets a river clustering model act as BERTopic's cluster model."""

    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        # Learn the reduced embeddings one at a time
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        # Predict a cluster label for every embedding and expose them as .labels_
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)

        self.labels_ = labels
        return self


# Using DBSTREAM to detect new topics as they come in
cluster_model = River(cluster.DBSTREAM())
vectorizer_model = OnlineCountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)

# Prepare model
topic_model_v2 = BERTopic(
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)
```
```python
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]

all_my_topics = []

# Incrementally fit the topic model by training on 1000 documents at a time
for batch in batches:
    topic_model_v2.partial_fit(batch)
```
You did not update the internal `topic_model_v2.topics_`, which should still be done. So like this:
```python
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]

all_my_topics = []

# Incrementally fit the topic model by training on 1000 documents at a time,
# keeping track of the topics assigned in every batch
for batch in batches:
    topic_model_v2.partial_fit(batch)
    all_my_topics.extend(topic_model_v2.topics_)

# partial_fit only stores the topics of the last batch, so restore the full list
topic_model_v2.topics_ = all_my_topics
```
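Before calling downstream functions such as `topics_over_time` or `merge_topics`, a quick sanity check helps; the sketch below reuses the variable names from the loop above and assumes it has just been run:

```python
# Sketch: the collected topic assignments should line up one-to-one with the documents,
# otherwise downstream functions will fail on the size mismatch.
assert len(all_my_topics) == len(text_col), "topics_ and documents are out of sync"

# Inspect the topics found so far via the standard BERTopic API
print(topic_model_v2.get_topic_info().head())
```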
8 answers
x759pob21#
Please make sure that the number of documents in `text_col` is the same as the number of dates in `date`. The only way to run `.topics_over_time` is to make sure that the documents and the dates have the same size. I assume you still have the original data saved, so why not use that as the input for `topics_over_time`?
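For reference, a minimal sketch of that call; `docs` and `timestamps` are placeholders for the original, saved documents and their dates (they are not variables from this thread), and `nr_bins` is only an illustrative value:

```python
# Sketch: docs and timestamps come from the original saved dataset and
# must have exactly the same length.
assert len(docs) == len(timestamps)

topics_over_time = topic_model_v2.topics_over_time(docs, timestamps, nr_bins=20)
topic_model_v2.visualize_topics_over_time(topics_over_time)
```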
e5njpo682#
Hi Maarten! Thank you so much for taking the time to answer my question. I realized I made a mistake during batch processing and did not update the topics as suggested in the online topic modeling section. Once I fixed that, the length of the topics matched the length of the documents and I was able to plot the hierarchical topics. The problem I am running into now is that when I try to merge topics with the following code:
```python
topics_to_merge = [1, 2]
topic_model.merge_topics(docs, topics_to_merge)
```
I get a KeyError and cannot merge the topics. I am attaching the batch processing code for more context:
```python
indices = np.arange(len(text_column))
np.random.shuffle(indices)
text_column = [text_column[i] for i in indices]

chunk_size = 10000
text_chunks = [text_column[i:i + chunk_size] for i in range(0, len(text_column), chunk_size)]
topics = []

for i in tqdm(range(len(text_chunks)), desc="Processing chunks"):
    text_chunk = text_chunks[i]
    topics_chunk = model.fit_transform(text_chunk)
    topics.extend(topics_chunk)

topic_model.topics_ = topics
```
ttcibm8c3#
Could you format your code with ```python so that it is easier to read? Also, could you share your full code? Moreover, could you share the full error that you received? Without it, it is difficult to understand where the issue lies.
kd3sttzy4#
Hi Maarten! Thank you for your patience. Here is the full code I used and the error I got when trying to merge topics:
In addition, I also tried using the River package from the online topic modeling tutorial, but I ran into the same error every time (I tried installing it locally on my Mac machine as well as on Google Colab):
vsikbqxv5#
Update: I managed to get River working (an older version had to be installed), but I ran into the same problem again:
Also, the results of partial_fit are very different from those of fit_transform(), where I got 200 topics versus only 18 with partial_fit.
vmdwslir6#
In your code, you did not update the internal `topic_model_v2.topics_`, which should still be done, i.e. as in the corrected loop shown near the top of this thread.

As for `partial_fit` giving very different results from `fit_transform()` (only 18 topics versus 200): this is to be expected, since you are using two completely different clustering models. With `fit_transform` you are using HDBSCAN, whereas with `partial_fit` you are using DBSTREAM. Moreover, the training procedure also differs, because with `fit_transform` you train UMAP on the entire dataset, whilst with `partial_fit` UMAP is only trained on the first batch of data.
Instead, your use case might be a good fit for the newly introduced `.merge_models` method, which allows different topic models to be merged together. When you combine two models with it, the first model stays unchanged and the second model is only added insofar as it contains new clusters; existing clusters are not added, since they were already found in the first model. You can keep doing this and continuously merge models by training a new model whenever new data comes in, which means you can use this method for incremental learning. You can read more about it in #1516.
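A minimal sketch of that incremental workflow, assuming the `.merge_models` classmethod described above and reusing the `batches` list from earlier in the thread (`min_topic_size` is only an illustrative setting):

```python
from bertopic import BERTopic

# Sketch: train a separate model on each new batch of documents and merge it
# into the running model; only genuinely new clusters are added to the merged model.
merged_model = None
for batch in batches:
    new_model = BERTopic(min_topic_size=10).fit(batch)
    if merged_model is None:
        merged_model = new_model
    else:
        merged_model = BERTopic.merge_models([merged_model, new_model])

print(merged_model.get_topic_info().head())
```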
b1zrtrql7#
Thank you for the detailed answer! Although I have tried your suggestion, I am still running into the same problem.
dddzy1tm8#
Could you share the full code of the proposal you tried? That would make it a bit easier to communicate.