BERTopic: visualizing documents with Arabic text

mwngjboj  posted 4 months ago in Other

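When the topics are in Arabic or a similar language, visualizing them with the visualize_documents function does not work properly. ... The text should be converted into a form suitable for display in a Plotly figure.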

hiz5n14c #1

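Thank you for sharing this issue. Could you explain in more detail what exactly the problem is and how it could be solved? I am not familiar with Arabic or similar languages, so I need some help understanding it.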

jjjwad0x #2

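Sure.
First of all, thank you for your work. You have really done a great job.
This is a text encoding issue. I have solved it and sent you a request. Here are the details:
Arabic script has two important characteristics:

  1. It is written from right to left.
  2. Characters change shape depending on the characters around them.

So when you try to print Arabic text in an application or library that does not support it, you will most likely get something like this:

There are two problems here. First, the characters are in their isolated forms, meaning each character is rendered without regard to its neighbours. Second, the text is laid out from left to right.
To fix the second problem, we only need the Unicode bidirectional algorithm, which python-bidi implements purely in Python. If you use it, you get something like this:

The remaining problem is reshaping the characters, that is, replacing each one with the correct form for its context. A reshaping library such as arabic_reshaper takes care of this, so that we get the correct result.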
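The two characteristics listed above can be observed with Python's standard `unicodedata` module alone. This is only an illustrative sketch, not part of the fix: the helper names `bidi_class` and `base_letter` are made up for this example, and the letter ain (U+0639) is picked arbitrarily.

```python
import unicodedata

AIN = "\u0639"  # ARABIC LETTER AIN, chosen arbitrarily for illustration


def bidi_class(char: str) -> str:
    """Return the Unicode bidirectional class of a single character.

    Arabic letters report "AL" (right-to-left Arabic), Latin letters "L".
    """
    return unicodedata.bidirectional(char)


# Unicode encodes a separate "presentation form" for each position a letter
# can take in a word. A reshaper swaps the base letter for one of these
# forms depending on its neighbours.
AIN_FORMS = {
    "isolated": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}


def base_letter(presentation_form: str) -> str:
    """Map a presentation form back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", presentation_form)


if __name__ == "__main__":
    print(bidi_class(AIN), bidi_class("a"))
    for position, form in AIN_FORMS.items():
        print(position, unicodedata.name(form), base_letter(form) == AIN)
```

A renderer without Arabic support shows only the isolated form of every letter, which is the first problem described above.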

r7s23pms #3

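Thanks for the detailed description! It really helps me understand how to render Arabic text correctly. The implementation itself is my main concern, because rendering the text correctly requires additional dependencies, many of which most users do not need. Optional dependencies currently cover only embeddings, although that may change depending on further development and the needs of the community. Perhaps some kind of check could be done to see whether the relevant packages are installed and, if so, use them. Typically, such packages would only be present if the user installed them manually.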

egmofgnx #4

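Hello, I have solved this issue as follows:

As @apoalquaary mentioned, I added the libraries needed to render the text correctly. Note that this applies not only to Arabic but also to other languages that are written from right to left.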
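To make the point about other right-to-left languages concrete, here is a small sketch (not part of the original fix) that detects whether a label contains any right-to-left characters, using only Python's standard `unicodedata` module. The function name `contains_rtl` is hypothetical. As far as I know, arabic_reshaper only changes Arabic-script letters, so for a script such as Hebrew it is the bidirectional reordering step that matters.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters:
# "R" covers scripts such as Hebrew, "AL" covers Arabic-type scripts.
_RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in `text` is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in _RTL_CLASSES for ch in text)


if __name__ == "__main__":
    for sample in ["0_topic_words", "\u0645\u0631\u062d\u0628\u0627", "\u05e9\u05dc\u05d5\u05dd"]:
        print(repr(sample), contains_rtl(sample))
```

A check like this could be used to skip the extra processing for labels that are purely left-to-right.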
In addition, for anyone interested in this solution, here is how I implemented it:

  1. Install the following packages:
pip install python-bidi
pip install arabic_reshaper
  2. Edit this file from the library: .env/lib/python3.8/site-packages/bertopic/plotting/_documents.py
import numpy as np
import pandas as pd
import plotly.graph_objects as go

from umap import UMAP
from typing import List, Union

from bidi.algorithm import get_display
import arabic_reshaper

def visualize_documents(topic_model,
                        docs: List[str],
                        topics: List[int] = None,
                        embeddings: np.ndarray = None,
                        reduced_embeddings: np.ndarray = None,
                        sample: float = None,
                        hide_annotations: bool = False,
                        hide_document_hover: bool = False,
                        custom_labels: Union[bool, str] = False,
                        title: str = "<b>Documents and Topics</b>",
                        width: int = 1200,
                        height: int = 750):
    """ Visualize documents and their topics in 2D

    Arguments:
        topic_model: A fitted BERTopic instance.
        docs: The documents you used when calling either `fit` or `fit_transform`
        topics: A selection of topics to visualize.
                Not to be confused with the topics that you get from `.fit_transform`.
                For example, if you want to visualize only topics 1 through 5:
                `topics = [1, 2, 3, 4, 5]`.
        embeddings: The embeddings of all documents in `docs`.
        reduced_embeddings: The 2D reduced embeddings of all documents in `docs`.
        sample: The percentage of documents in each topic that you would like to keep.
                Value can be between 0 and 1. Setting this value to, for example,
                0.1 (10% of documents in each topic) makes it easier to visualize
                millions of documents as a subset is chosen.
        hide_annotations: Hide the names of the traces on top of each cluster.
        hide_document_hover: Hide the content of the documents when hovering over
                             specific points. Helps to speed up generation of visualization.
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.

    Examples:

    To visualize the topics simply run:

    ```python
    topic_model.visualize_documents(docs)
    ```

    Do note that this re-calculates the embeddings and reduces them to 2D.
    The advised and preferred pipeline for using this function is as follows:

    ```python
    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic
    from umap import UMAP

    # Prepare embeddings
    docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = sentence_model.encode(docs, show_progress_bar=False)

    # Train BERTopic
    topic_model = BERTopic().fit(docs, embeddings)

    # Reduce dimensionality of embeddings, this step is optional
    # reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

    # Run the visualization with the original embeddings
    topic_model.visualize_documents(docs, embeddings=embeddings)

    # Or, if you have reduced the original embeddings already:
    topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
    ```

    Or if you want to save the resulting figure:

    ```python
    fig = topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
    fig.write_html("path/to/file.html")
    ```
    """
    # Topic assignment of every document, taken from the fitted model
    # (this line was missing from the pasted code but is used below)
    topic_per_doc = topic_model.topics_

    # Sample the data to optimize for visualization and dimensionality reduction
    if sample is None or sample > 1:
        sample = 1

    indices = []
    for topic in set(topic_per_doc):
        s = np.where(np.array(topic_per_doc) == topic)[0]
        size = len(s) if len(s) < 100 else int(len(s) * sample)
        indices.extend(np.random.choice(s, size=size, replace=False))
    indices = np.array(indices)

    df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
    df["doc"] = [docs[index] for index in indices]
    df["topic"] = [topic_per_doc[index] for index in indices]

    # Extract embeddings if not already done
    if sample is None:
        if embeddings is None and reduced_embeddings is None:
            embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
        else:
            embeddings_to_reduce = embeddings
    else:
        if embeddings is not None:
            embeddings_to_reduce = embeddings[indices]
        elif embeddings is None and reduced_embeddings is None:
            embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")

    # Reduce input embeddings
    if reduced_embeddings is None:
        umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit(embeddings_to_reduce)
        embeddings_2d = umap_model.embedding_
    elif sample is not None and reduced_embeddings is not None:
        embeddings_2d = reduced_embeddings[indices]
    elif sample is None and reduced_embeddings is not None:
        embeddings_2d = reduced_embeddings

    unique_topics = set(topic_per_doc)
    if topics is None:
        topics = unique_topics

    # Combine data
    df["x"] = embeddings_2d[:, 0]
    df["y"] = embeddings_2d[:, 1]

    # Prepare text and names
    if isinstance(custom_labels, str):
        names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
        names = ["_".join([label[0] for label in labels[:4]]) for labels in names]
        names = [label if len(label) < 30 else label[:27] + "..." for label in names]
    elif topic_model.custom_labels_ is not None and custom_labels:
        names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
    else:
        names = [f"{topic}_" + "_".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]

    # Visualize
    fig = go.Figure()

    # Outliers and non-selected topics
    non_selected_topics = set(unique_topics).difference(topics)
    if len(non_selected_topics) == 0:
        non_selected_topics = [-1]

    selection = df.loc[df.topic.isin(non_selected_topics), :]
    selection["text"] = ""
    selection.loc[len(selection), :] = [None, None, selection.x.mean(), selection.y.mean(), "Other documents"]

    fig.add_trace(
        go.Scattergl(
            x=selection.x,
            y=selection.y,
            hovertext=selection.doc if not hide_document_hover else None,
            hoverinfo="text",
            mode='markers+text',
            name="other",
            showlegend=False,
            marker=dict(color='#CFD8DC', size=5, opacity=0.5)
        )
    )

    # Selected topics
    for name, topic in zip(names, unique_topics):
        if topic in topics and topic != -1:
            selection = df.loc[df.topic == topic, :]
            selection["text"] = ""

            # Modification for right-to-left text: reshape the letters into
            # their contextual forms, then reorder them for display
            reshaped_text = arabic_reshaper.reshape(name)
            name_flipped = get_display(reshaped_text)

            # Only the annotation drawn on the plot uses the converted label;
            # the trace name and hover text are left unchanged
            if not hide_annotations:
                selection.loc[len(selection), :] = [None, None, selection.x.mean(), selection.y.mean(), name_flipped]

            fig.add_trace(
                go.Scattergl(
                    x=selection.x,
                    y=selection.y,
                    hovertext=selection.doc if not hide_document_hover else None,
                    hoverinfo="text",
                    text=selection.text,
                    mode='markers+text',
                    name=name,
                    textfont=dict(
                        size=12,
                    ),
                    marker=dict(size=5, opacity=0.5)
                )
            )

    # Add grid in a 'plus' shape
    x_range = (df.x.min() - abs((df.x.min()) * .15), df.x.max() + abs((df.x.max()) * .15))
    y_range = (df.y.min() - abs((df.y.min()) * .15), df.y.max() + abs((df.y.max()) * .15))
    fig.add_shape(type="line",
                  x0=sum(x_range) / 2, y0=y_range[0], x1=sum(x_range) / 2, y1=y_range[1],
                  line=dict(color="#CFD8DC", width=2))
    fig.add_shape(type="line",
                  x0=x_range[0], y0=sum(y_range) / 2, x1=x_range[1], y1=sum(y_range) / 2,
                  line=dict(color="#9E9E9E", width=2))
    fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
    fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)

    # Stylize layout
    fig.update_layout(
        template="simple_white",
        title={
            'text': f"{title}",
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(
                size=22,
                color="Black")
        },
        width=width,
        height=height
    )

    fig.update_xaxes(visible=False)
    fig.update_yaxes(visible=False)
    return fig

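I hope the library will support this in the future.
@MaartenGr, please let me know if you need any support with this. I am happy to help with anything related to my language.

7fhtutme #5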

This was already solved back then. Please take a look at my repository.

au9on6nz #6

OK, I had not seen that repository before. Thank you :)
