AttributeError: 'BERTopic' object has no attribute 'c_tf_idf'

km0tfn4u · asked 3 months ago

I am following the steps in this issue to test how metadata affects topic prevalence/content:
#360
But when I run it, I get AttributeError: 'BERTopic' object has no attribute 'c_tf_idf'.
ests = estimate_effect(topic_model=topic_model,
                      topics=[-1, 0],
                      metadata=metadata,
                      docs=enr_df_docs,
                      probs=probs,
                      estimator="content ~ score",
                      y="content")
print([est.summary() for est in ests])

wi3ka0sx1#

There is a bunch of code in that issue, so I am not sure which part you are referring to. Could you share it?

fjaof16o2#

I tried the following.
First, I ran a basic BERTopic model:

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(enr_df_docs)

Then I ran the estimate_effect function from the comment in that issue:

from typing import Any, Callable, List, Mapping, Union

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics.pairwise import cosine_similarity
from statsmodels.base import wrapper as wrap

def estimate_effect(topic_model, 
                    docs: List[str], 
                    topics: Union[int, List[int]], 
                    metadata: pd.DataFrame, 
                    y: str = "prevalence", 
                    probs: np.ndarray = None, 
                    estimator: Union[str, Callable] = None,
                    estimator_kwargs: Mapping[str, Any] = None) -> List[wrap.ResultsWrapper]:
    
    """ Estimate the effect of metadata on topic prevalence and topic content
    
    Arguments:
        docs: The original list of documents the model was trained on
        probs: An m x n probability matrix, where *m* is the number of documents and 
               *n* the number of topics. It represents the probabilities of all topics 
               across all documents. 
        topics: The topic(s) for which you want to estimate the effect of metadata
        metadata: The metadata in a dataframe. Make sure that the columns have the exact same 
                  name as the elements in the estimator
        y: The target, either "prevalence" (topic prevalence) or "content" (topic content)
        estimator: Either the formula used in the estimator or a custom estimator. 
                   When it is used as a formula, it follows R-style formulas, for example:
                      * 'prevalence ~ rating'
                      * 'prevalence ~ rating + day + rating:day'
                   Make sure that the target is either 'prevalence' or 'content'
                   The custom estimator should be a `statsmodels.formula.api`, currently, 
                   `statsmodels.api` is not supported.
        estimator_kwargs: The arguments needed within the estimator, needs at 
                          least a "formula" argument
                          
    Returns:
        fitted_estimators: List of fitted estimators for either topic prevalence or topic content
    """

    data = metadata.loc[::] 
    data["topics"] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
    data["docs"] = docs
    fitted_estimators = []
    
    if isinstance(topics, int):
        topics = [topics]
    
    # As a proxy for the topic prevalence, we take the probability of a document
    # belonging to a specific topic. We assume that a higher probability of a document 
    # belonging to that topic also results in that document talking more about that topic.
    if y == "prevalence":
        for topic in topics:
            # Prepare topic prevalence.
            # Exclude probs == 1, as no zero-one-inflated beta regressions are currently available
            data["prevalence"] = list(probs[:, topic])
            data_filtered = data.loc[data.prevalence < 1, :]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=data_filtered, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=data_filtered, family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)

    # Topic content is modeled on a document-level by calculating the document cTFIDF 
    # representation. Based on that representation, we calculate its cosine similarity 
    # with its topic cTFIDF representation. The assumption here is that we expect different 
    # similarity scores if a covariate changes the topic content.
    elif y == "content":
        for topic in topics:
            # Extract topic content and prevalence
            selected_data = data.loc[data.topics == topic, :]
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
            selected_data["content"] = sim_matrix[:, topic+1]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=selected_data, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=selected_data, 
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()  # perhaps remove the gamma + link?
            fitted_estimators.append(est)

    return fitted_estimators
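
For reference, the inputs I pass in look roughly like this. The values below are just placeholders, not my real data; the point is that enr_df_docs is a list of document strings and metadata is a DataFrame with one row per document whose column name ("score") matches the term used in the estimator formula:

import pandas as pd

# Placeholder inputs, only to illustrate the expected shapes (my real data differ):
# one document string per row of metadata, plus a numeric "score" covariate
enr_df_docs = ["first document text ...", "second document text ...", "third document text ..."]
metadata = pd.DataFrame({"score": [0.7, 0.3, 0.9]})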

The prevalence code runs fine:

ests = estimate_effect(topic_model=topic_model, 
                      topics=[-1, 1],
                      metadata=metadata, 
                      docs=enr_df_docs, 
                      probs=probs, 
                      estimator="prevalence ~ score",
                      y="prevalence")
print([est.summary() for est in ests])

But the content code returns the error:

ests = estimate_effect(topic_model=topic_model, 
                      topics=[-1, 0],
                      metadata=metadata, 
                      docs=enr_df_docs, 
                      probs=probs, 
                      estimator="content ~ score",
                      y="content")
print([est.summary() for est in ests])

I guess I am getting something wrong here, but I really did not change any of the code:

elif y == "content":
        for topic in topics:
            # Extract topic content and prevalence
            selected_data = data.loc[data.topics == topic, :]
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
            selected_data["content"] = sim_matrix[:, topic+1]

Sorry for the trouble, and thanks in advance for your reply.

mzsu5hc03#

I think you need to change .c_tf_idf to .c_tf_idf_ to get the correct variable. I believe it was renamed a while ago, which would explain your issue.
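
Concretely, only that one attribute in the content branch should need to change. Something like this (an untested sketch based on the code you posted, everything else stays the same):

    elif y == "content":
        for topic in topics:
            # Documents assigned to this topic
            selected_data = data.loc[data.topics == topic, :]

            # Per-document c-TF-IDF representation (no refit)
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)

            # Compare against the per-topic c-TF-IDF matrix; note the trailing underscore
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf_)
            selected_data["content"] = sim_matrix[:, topic + 1]

On recent BERTopic versions you can also sanity-check which name your installed version uses by calling hasattr(topic_model, "c_tf_idf_") after fitting.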
