Computing cosine similarity between Hugging Face Hub Dolly model embeddings with numpy

tv6aics1 · asked 2023-05-17

In Python, I have a text query variable and a dataset with the following structure:

text = "hey how are you doing today love"
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]

I am trying to compute the cosine similarity between the Dolly embeddings of the text and of the dataset using the following pipeline:

# Import Pipeline

from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Create Feature Extraction Object

feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b', 
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True, 
                              device_map="auto")

# Define Inputs

text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]

# Create Embeddings

text_embeddings = feature_extraction(text)[0]
dataset_embeddings = feature_extraction(dataset)

text_embeddings = np.array(text_embeddings)
dataset_embeddings = np.array(dataset_embeddings)

text_embeddings = normalize(text_embeddings, norm='l2')
dataset_embeddings = normalize(dataset_embeddings, norm='l2')

cosine_similarity = np.dot(text_embeddings, dataset_embeddings.T)
angular_distance = np.arccos(cosine_similarity) / np.pi

The L2 normalization fails, and if I comment it out I run into the following error:

ValueError: shapes (1,7,2560) and (1,3) not aligned: 2560 (dim 2) != 1 (dim 0)

I understand the error is related to the misaligned shapes of text_embeddings and dataset_embeddings, but I am not sure what I can do to fix it.
Help!

x6h2sr28:

There are a few things going on here:

  • dolly-v2-3b produces multiple embeddings for a given text input, and the number of embeddings depends on the input you provide. For example, while the model produces 7 embeddings (i.e. vectors) for the first sentence in dataset, it produces 4 embeddings for each of the 2 sentences that follow.
  • Cosine similarity measures the similarity between two vectors. The code you posted tries to compare the multiple vectors of one sentence against the multiple vectors of another sentence, which is not what cosine similarity does. The embeddings therefore need to be "squashed" into a single vector before computing the similarity; the code below uses a technique called vector averaging, which simply takes the mean of the vectors.
  • np.average (for the vector averaging) and sklearn's normalize need to be called separately for each sentence in dataset.
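
The pooling step above can be sketched in isolation with small dummy token embeddings (random arrays standing in for real Dolly output, with a made-up embedding size of 4 instead of 2560, so the shapes are easy to follow):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical token embeddings: 7 tokens for the query sentence,
# 7/4/4 tokens for the three dataset sentences.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(1, 7, 4))
dataset_tokens = [rng.normal(size=(1, n, 4)) for n in (7, 4, 4)]

# Vector averaging: collapse the token axis so each sentence
# becomes a single vector.
text_vec = np.average(text_tokens, axis=1)                                 # (1, 4)
dataset_vecs = np.vstack([np.average(t, axis=1) for t in dataset_tokens])  # (3, 4)

# After L2 normalization, a plain dot product is the cosine similarity.
text_vec = normalize(text_vec, norm='l2')
dataset_vecs = normalize(dataset_vecs, norm='l2')
similarities = text_vec @ dataset_vecs.T                                   # (1, 3)
print(similarities.shape)
```

Because every sentence is reduced to one fixed-size vector, the ragged token counts no longer matter and the shapes align.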

The code below runs without errors and returns a cosine similarity of 1 for the first comparison, which is expected since we are comparing the sentence with itself. The np.NaN angular distance for that first comparison also makes sense: floating-point error pushes the self-similarity slightly above 1, where np.arccos is undefined.

# Installations required in Google Colab
# %pip install transformers
# %pip install torch
# %pip install accelerate

from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Create Feature Extraction Object

feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b', 
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True, 
                              device_map="auto")

# Define Inputs

text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]

# Create Embeddings
text_embeddings = feature_extraction(text)
dataset_embeddings = feature_extraction(dataset)

# Perform Vector Averaging
text_embeddings_avg = np.average(text_embeddings[0], axis=1)
dataset_embeddings_avg = np.array(
    [
        np.average(text_embedding, axis=1)
        for text_embedding
        in dataset_embeddings
    ]
)
print(text_embeddings_avg.shape)  # (1, 2560)
print(dataset_embeddings_avg.shape)  # (3, 1, 2560)

# Perform Normalization
text_embeddings_avg_norm = normalize(text_embeddings_avg, norm='l2')
dataset_embeddings_avg_norm = np.array(
    [
        normalize(text_embedding, norm='l2')
        for text_embedding
        in dataset_embeddings_avg
     ]
)
print(text_embeddings_avg_norm.shape)  # (1, 2560)
print(dataset_embeddings_avg_norm.shape)  # (3, 1, 2560)

# Cosine Similarity
cosine_similarity = np.array(
    [
        np.dot(text_embeddings_avg_norm, text_embedding.T)
        for text_embedding
        in dataset_embeddings_avg_norm
    ]
)
angular_distance = np.arccos(cosine_similarity) / np.pi
print(cosine_similarity.tolist())  # [[[1.0000000000000007]], [[0.7818918337438344]], [[0.7921756683919716]]]
print(angular_distance.tolist())  # [[[nan]], [[0.21425490131377858]], [[0.2089483418862303]]]
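
Two small refinements worth noting (these are suggestions, not part of the answer above): sklearn's cosine_similarity, which is already imported but unused, can replace the manual normalize-and-dot steps, and clipping into [-1, 1] before np.arccos avoids the NaN caused by the self-similarity landing slightly above 1. A sketch with dummy pooled vectors standing in for the averaged Dolly embeddings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine

# Hypothetical pooled sentence vectors (one 8-dim vector per sentence);
# the first dataset row duplicates the query to mimic the self-comparison.
rng = np.random.default_rng(0)
text_vec = rng.normal(size=(1, 8))
dataset_vecs = np.vstack([text_vec[0], rng.normal(size=(2, 8))])

# cosine_similarity handles the L2 normalization internally.
sims = sk_cosine(text_vec, dataset_vecs)                # shape (1, 3)

# Clipping keeps arccos in its domain, so the self-comparison
# yields an angular distance near 0 instead of nan.
angles = np.arccos(np.clip(sims, -1.0, 1.0)) / np.pi
print(sims[0, 0])    # ≈ 1.0 (self-comparison)
print(angles[0, 0])  # finite, ≈ 0.0
```

With the clip in place, every angular distance is finite even when rounding error nudges a similarity past 1.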
