BERTopic: merging topics

pb3s4cty posted 5 months ago in Other

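Hey Maarten,

I have tuned my topic model further, but I am now seeing some strange behavior.

If I merge topics (even a single topic into one other topic), my topic model seems to shrink from 32 topics to about 4, including the outlier topic. The same thing happens when I use update_topics() with a CountVectorizer.

I have also noticed that setting the min_topic_size hyperparameter has no effect, even though I am using HDBSCAN. Is this something you have come across before?

Cheers.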

ikfrs5lh #1

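This is not the expected behavior, and I have not seen it before. Could you share your full code? Without it, it is hard to understand what is happening here. Please make it as complete as possible.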

2izufjch #2

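import os

import openai
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# docs is assumed to be a list of strings loaded elsewhere (not shown in the post)
embedding_model = SentenceTransformer('all-mpnet-base-v2')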
embeddings = embedding_model.encode(docs, show_progress_bar=True)

umap_model = UMAP(n_neighbors=15, 
                  n_components=5, 
                  min_dist=0.0, 
                  metric='cosine', 
                  random_state=42)

hdbscan_model = HDBSCAN(
    min_cluster_size=10, 
    metric='euclidean', 
    cluster_selection_method='eom', 
    min_samples=8, # added to reduce outliers
    prediction_data=True)

vectorizer_model = CountVectorizer(stop_words="english")

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# GPT-3.5
openai.api_key=os.environ['openai_api_key'] 
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

from bertopic.representation import ZeroShotClassification

candidate_topics = [
    'x', 
    # 'y',
    'z',
    ]

zero_shot_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")

# representation_model = zero_shot_model

representation_model = {
    "Main": zero_shot_model,
    'KeyBERT': keybert_model,
    # 'OpenAI': openai_model,  # Uncomment if you will use OpenAI
    'MMR': mmr_model,
    # 'POS': pos_model,
    # 'ZeroShot': zero_shot_model,
}

seed_topic_list = [
    ['x'],
  ]

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)

topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  # vectorizer_model=                       # Step 4 - Tokenize topics. Don't do this! It removed the entire abortion topic.
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model, # Step 6 - (Optional) Fine-tune topic representations
  seed_topic_list= seed_topic_list,
  min_topic_size=10, # 10 is the default nope
  nr_topics=29, # 32
  verbose=True,
  n_gram_range=(1,3), # allows Brothers of Italy
  calculate_probabilities=True,
)

topics, probs = topic_model.fit_transform(docs)
topic_labels = topic_model.generate_topic_labels(nr_words=3, topic_prefix=False, word_length=20, separator=', ')
topic_model.set_topic_labels(topic_labels)
# 889 outliers
topic_model.get_topic_info()

yes             = [-1, 11]    # yes, yes yes
thanks          = [-1, 14]    # you thank, you
good_morning    = [-1, 23]    # good morning
why             = [-1, 27]    # why, why why

topics_to_merge = [good_morning, why, thanks, yes]
print(topics_to_merge)
topic_model.merge_topics(docs, topics_to_merge=topics_to_merge)     
topic_model.get_topic_info().head()

a7qyws3x #3

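There is a lot going on in your code, and some of it seems redundant (for example the OpenAI part, which you do not appear to use). Also, is this the exact code you are running? I ask because the following does not look like actual topics: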

candidate_topics = [
    'x', 
    # 'y',
    'z',
    ]

In any case, I think the problem is here:

topics_to_merge = [good_morning, why, thanks, yes]

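I believe these should be topic identifiers rather than labels. See the example here: https://maartengr.github.io/BERTopic/getting_started/topicreduction/topicreduction.html
In general, I strongly recommend reading the best practices guide carefully, because it shows a number of useful tips and tricks for getting the topics you are looking for.
You also wrote: "I have also noticed that if I set the min_topic_size hyperparameter, it does nothing even though I am using HDBSCAN. Have you come across this?"
If you are passing your own hdbscan_model, this is expected behavior, because min_topic_size is essentially the min_cluster_size parameter. In other words, it gets overridden.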
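
To make the override concrete, here is a minimal sketch, assuming BERTopic and hdbscan are installed. The helper names hdbscan_settings and build_topic_model are hypothetical and only serve the illustration. The point is that, once a custom hdbscan_model is passed, the minimum cluster size has to be set on that HDBSCAN object through min_cluster_size, while min_topic_size only matters when BERTopic creates its own default clustering model.

```python
def hdbscan_settings(min_size, min_samples=None):
    """Keyword arguments for a custom HDBSCAN model (hypothetical helper)."""
    settings = {
        "min_cluster_size": min_size,  # this is what controls the topic size
        "metric": "euclidean",
        "cluster_selection_method": "eom",
        "prediction_data": True,
    }
    if min_samples is not None:
        settings["min_samples"] = min_samples
    return settings


def build_topic_model(min_size, min_samples=None):
    """Build a BERTopic model whose topic size is set on HDBSCAN itself."""
    # Imported lazily so hdbscan_settings can be used without the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN

    hdbscan_model = HDBSCAN(**hdbscan_settings(min_size, min_samples))
    # min_topic_size is deliberately not passed: with a custom hdbscan_model
    # it would have no effect.
    return BERTopic(hdbscan_model=hdbscan_model)
```

With the settings from the question, build_topic_model(10, min_samples=8) would reproduce the posted clustering configuration; changing the first argument is then the way to change the minimum topic size.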

yyhrrdl8 #4

Are you saying that Python does not treat yes = [-1, 11] as a variable? I have read the best practices guide and have spent a lot of time trying to fix this.

xuo3flqw #5

For clarity:

seed_topic_list = [  
    ['Brothers of Italy', 'brothers of italy', 'Italy', 'Italian'],
    ['we are ready'], # The FDI's Campaign slogan     
    ['immigration', 'migration', 'migrants', 'refugee', 'traffickers'],
    ['abortion', 'abort', '194', 'law 194'],
    ['election', 'government', 'vote'],
    ['inflation', 'bills'],
    ['freedom'],
    ['rape', 'raped'],
    ['women'],

    
    ['climate' , 'environmental', 'ecological',  'sustainability'],
    ['fake', 'fake news', 'lies', 'journalism'],
    ['tax', 'income'],
    ['crime'],
    ['minimum wage'],
    ['Nazis', 'nazis'],
    ['pensions'],
    ['family', 'families'],
    ['pets', 'animals'], # added as pets get merge into the migrants topic    
    ['russia']       
    ]
candidate_topics = [
    'migrants', 
    # 'immigration',
    'abortion', 
    'fake news', 
    'Brothers of Italy', 
    'we are ready',
    'rape',
    'Nazis',
    'minimum wage',
    'ecological',
    'green pass',
    'russia',
    'crime', # this is used to separate out crime from migration
    'authoritarian',
    
    'women',
    # 'crime', 

    'inflation', 
    'citizenship', 
    'freedom',
    'prices',
    'pensions',
    'tax',
    'family',
    # 'government'    
    ]
k0pti3hp #6

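Oops, my mistake! I completely misread that. I thought they were strings rather than identifiers.

In fact, this may be related to the repeated topic in your example. If I am not mistaken, you intend to merge all of the following topics into the outlier topic: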

yes             = [-1, 11]    # yes, yes yes
thanks          = [-1, 14]    # you thank, you
good_morning    = [-1, 23]    # good morning
why             = [-1, 27]    # why, why why

topics_to_merge = [good_morning, why, thanks, yes]

I think you need to do this instead:

topics_to_merge = [-1, 11, 14, 23, 27]

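That way, all of these topics are merged into topic -1. Because the same topic (-1) is repeated in every merge group, BERTopic tries to perform the merges iteratively, which can cause problems.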
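
As a hedged illustration of the fix, the sketch below collapses merge groups that share a topic ID into a single group before they are handed to merge_topics. The helper consolidate_merge_groups is hypothetical and not part of BERTopic; it only rearranges plain lists of integers.

```python
def consolidate_merge_groups(groups):
    """Union merge groups that share at least one topic ID."""
    merged = []
    for group in groups:
        current = set(group)
        remaining = []
        for existing in merged:
            if existing & current:
                # Overlapping groups belong together, so fold them into one.
                current |= existing
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return [sorted(g) for g in merged]


# The groups from the question all contain the outlier topic -1.
groups = consolidate_merge_groups([[-1, 23], [-1, 27], [-1, 14], [-1, 11]])
print(groups)  # [[-1, 11, 14, 23, 27]]

# With a fitted model, the single group could then be passed on, e.g.:
# topic_model.merge_topics(docs, topics_to_merge=groups[0])
```

Groups that do not overlap are left alone, so genuinely separate merges can still be passed as a list of lists.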

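Finally, there is an open PR for an upcoming feature (zero-shot topic modeling rather than classification) that may suit your specific use case better. It can generate the specific topics you are looking for in candidate_topics (including labels).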
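
For readers on a BERTopic release that already includes this feature, a rough sketch of how it might be used follows. The parameter names zeroshot_topic_list and zeroshot_min_similarity are taken from later BERTopic releases rather than from this thread, so treat them as an assumption and check them against your installed version. The helper names zeroshot_kwargs and build_zeroshot_model, and the 0.85 threshold, are hypothetical choices for the illustration.

```python
def zeroshot_kwargs(candidate_topics, min_similarity=0.85):
    """Build keyword arguments for zero-shot topic modeling (hypothetical helper).

    Labels are stripped of whitespace, and empty or duplicate labels are dropped
    while the original order is preserved.
    """
    cleaned = (label.strip() for label in candidate_topics)
    labels = list(dict.fromkeys(label for label in cleaned if label))
    return {
        "zeroshot_topic_list": labels,
        "zeroshot_min_similarity": min_similarity,  # assumed threshold, tune per dataset
    }


def build_zeroshot_model(candidate_topics, min_similarity=0.85):
    """Create a BERTopic model that tries to assign documents to predefined topics."""
    # Imported lazily so zeroshot_kwargs can be used without BERTopic installed.
    from bertopic import BERTopic

    return BERTopic(**zeroshot_kwargs(candidate_topics, min_similarity))
```

The same candidate_topics list that was given to ZeroShotClassification could then be passed to build_zeroshot_model, followed by fit_transform(docs) as usual.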

ax6ht2ek #7

Oh, now I get it. Very cool. I will take a look later today.
That PR looks like exactly what I want! Thanks, Maarten.
