3 answers
jobtbby3 #1
Question 1: Is there a general approach or parameter I can tweak to control the granularity of the topic clustering?
Generally, the granularity of the topic clustering is controlled, to an extent, by the size of the clusters. The larger a cluster, the broader it tends to be. By increasing the number of micro-clusters generated, you are likely to get more fine-grained topics. To do so, you can either decrease `min_topic_size` or control the parameters of HDBSCAN directly.
Question 2: I found that topic -1, the outlier topic, sometimes contains too many sentences. Is there any way to reduce the noise? I feel that some of these sentences are actually misclassified as noise.
For this, you can apply outlier reduction.
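As a rough sketch of both suggestions, assuming the library under discussion is BERTopic (the thread never names it, but `min_topic_size` and topic -1 point that way): the function names, the default value of 10, and the `count_outliers` helper are illustrative choices and not part of the original answer. The BERTopic import sits inside the function so that the helper can be used without the package installed.

```python
# Sketch only; assumes `pip install bertopic`. Names here are illustrative.

def count_outliers(topics):
    """Count documents assigned to the outlier topic (-1)."""
    return sum(1 for t in topics if t == -1)


def fit_and_reduce_outliers(docs, min_topic_size=10):
    """Fit a topic model, then reassign outlier documents to real topics.

    A smaller min_topic_size allows smaller clusters, which tends to give
    more, finer-grained topics.
    """
    from bertopic import BERTopic  # lazy import: only needed when fitting

    topic_model = BERTopic(min_topic_size=min_topic_size)
    topics, _ = topic_model.fit_transform(docs)

    # Outlier reduction: move documents out of topic -1 into a close topic.
    new_topics = topic_model.reduce_outliers(docs, topics)

    # Refresh topic representations so they match the new assignments.
    topic_model.update_topics(docs, topics=new_topics)
    return topic_model, topics, new_topics
```

Comparing `count_outliers(topics)` with `count_outliers(new_topics)` gives a quick check of how many documents left the outlier topic. Whether the reassigned documents truly belong to their new topics still needs a manual look.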
bf1o4zei #2
Could you explain in more detail how to control the HDBSCAN parameters directly? It has many parameters, for example min_cluster_size, min_samples, metric, cluster_selection_method, and cluster_selection_epsilon.
wsxa1bj1 #3
I highly recommend reading the documentation of HDBSCAN itself, as it describes these parameters in much more detail.
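For orientation while reading that documentation, here is a minimal sketch of passing a custom HDBSCAN model, again assuming the topic library is BERTopic. The values below are hypothetical starting points rather than recommendations, and `HDBSCAN_PARAMS` and `build_model_with_custom_hdbscan` are names made up for this example.

```python
# Sketch only; assumes `pip install bertopic hdbscan`.
# The values are hypothetical starting points, not tuned recommendations.
HDBSCAN_PARAMS = {
    "min_cluster_size": 15,             # smallest group treated as a cluster
    "min_samples": 5,                   # lower values label fewer points as noise
    "metric": "euclidean",              # distance used on the reduced embeddings
    "cluster_selection_method": "eom",  # "leaf" tends to give smaller clusters
    "cluster_selection_epsilon": 0.0,   # clusters closer than this are merged
    "prediction_data": True,            # kept so new documents can be assigned
}


def build_model_with_custom_hdbscan(params=None):
    """Build a BERTopic model that clusters with a user-configured HDBSCAN."""
    from hdbscan import HDBSCAN    # lazy imports: only needed when building
    from bertopic import BERTopic

    hdbscan_model = HDBSCAN(**(params or HDBSCAN_PARAMS))
    # With a custom hdbscan_model, min_cluster_size takes over the role
    # that min_topic_size plays in the default setup.
    return BERTopic(hdbscan_model=hdbscan_model)
```

Lowering `min_cluster_size` or switching `cluster_selection_method` to `"leaf"` pushes toward finer topics, while lowering `min_samples` is the usual lever for shrinking topic -1. The exact effect depends on the data, so it is worth changing one parameter at a time.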