tensorflow: ValueError when setting an array element with a sequence in Python

tcomlyy6 posted on 2023-08-06 in Python

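I'm working on an NLP project, trying to use pre-trained word embeddings to train an LSTM model for sentiment analysis. However, I get a ValueError when I try to assign the word embeddings to a numpy array. Here is the code for my model: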

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

df = pd.read_csv('/content/drive/MyDrive/machinelearning/processed_data_CLEAN.csv')

subset_size = 10000
subset_data = df.sample(n=subset_size, random_state=42)
processed_tweets = subset_data['Tweet'].tolist()

w2v_model = Word2Vec.load('/content/drive/MyDrive/machinelearning/word2vec_model/model.bin')

num_tweets = len(processed_tweets)

embedding_dim = w2v_model.vector_size

max_sequence_length = max(len(tweet) for tweet in processed_tweets)

tweet_matrix = np.zeros((num_tweets, max_sequence_length, embedding_dim))

for i, tweet in enumerate(processed_tweets):
    for j, word in enumerate(tweet):
        if word in w2v_model.wv:
            tweet_matrix[i, j] = w2v_model.wv[word]
        else:
            tweet_matrix[i, j] = np.zeros((embedding_dim,))

polarity = subset_data['Polarity']

processed_polarity = polarity.replace({0: 0, 2: 1, 4: 2})

labels = to_categorical(processed_polarity, num_classes=3)

X_train, X_test, y_train, y_test = train_test_split(tweet_matrix, labels, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
processed_tweets_subset = processed_tweets[:num_tweets]
vocab_size = len(set(word for tweet in processed_tweets_subset for word in tweet))
num_classes = 3 # for negative,neutral and positive

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(units=num_classes, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
model.fit(X_train, y_train, epochs=5, batch_size=32,shuffle=True)

The error I get is this:

ValueError: in user code:    ValueError: Exception encountered when calling layer 'sequential_3' (type Sequential).
    
Input 0 of layer "lstm_3" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (32, 152, 100, 100)
    
    Call arguments received by layer 'sequential_3' (type Sequential):
      • inputs=tf.Tensor(shape=(32, 152, 100), dtype=float32)
      • training=True
      • mask=None


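Apparently the LSTM receives an unexpected 4-dimensional input of shape (32, 152, 100, 100), and I don't know where the extra dimension comes from. I have verified that the word embeddings have the correct shape and that the array dimensions match. Am I missing something, or is there a different way to assign word embeddings to an array?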

vltsax25 #1

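What looks most suspicious is that the first layer of your network, the Embedding layer, is sized exactly as large as your vocabulary, which means it is meant to receive *word indices*. Instead, you are passing X_train, where *each* element is already a looked-up vector of embedding_dim (probably 100) dimensions.
That is probably where the extra dimension in the error message comes from: the layer performs a second lookup on top of vectors that have already been looked up.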
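To make the shape arithmetic concrete, here is a minimal NumPy sketch that mimics an embedding lookup with plain array indexing. It is only an illustration, not the Keras implementation: `table`, `index_batch`, and `vector_batch` are made-up names, `vocab_size = 500` is an arbitrary value, and the batch size is reduced from 32 to 2 to keep the arrays small.

```python
import numpy as np

batch, seq_len, embedding_dim = 2, 152, 100
vocab_size = 500  # arbitrary value, for illustration only

# An embedding layer behaves like a lookup table: one row of
# embedding_dim values per vocabulary index.
table = np.zeros((vocab_size, embedding_dim), dtype=np.float32)

# Expected input: integer word indices of shape (batch, seq_len).
index_batch = np.zeros((batch, seq_len), dtype=np.int64)
looked_up = table[index_batch]
print(looked_up.shape)  # (2, 152, 100): 3 dimensions, fine for an LSTM

# Input in the question: vectors that were already looked up,
# shape (batch, seq_len, embedding_dim). Every element is treated
# as an index, so one more axis of size embedding_dim is appended.
vector_batch = np.zeros((batch, seq_len, embedding_dim), dtype=np.float32)
double_lookup = table[vector_batch.astype(np.int64)]
print(double_lookup.shape)  # (2, 152, 100, 100): 4 dimensions
```

With a batch size of 32 the second shape becomes (32, 152, 100, 100), which matches the shape reported in the error message.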
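Are you trying to combine two techniques that make different assumptions about where the word embeddings come from?
You probably want to initialize the Embedding layer with the pre-trained vectors, each in its correct row, and then build your X_train so that each example is an array of word indices rather than an array of word vectors.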
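A rough sketch of that second approach, using NumPy only, might look like the following. Everything here is a stand-in: `tokenized_tweets` assumes the tweets have already been split into word lists, the `pretrained` dict plays the role of `w2v_model.wv`, and `word_index` is a hypothetical helper mapping. Index 0 is reserved for padding by assumption.

```python
import numpy as np

# Hypothetical stand-ins for the tokenized tweets and the pre-trained vectors.
tokenized_tweets = [["good", "movie"], ["bad", "movie", "really", "bad"]]
embedding_dim = 4
rng = np.random.default_rng(0)
pretrained = {w: rng.normal(size=embedding_dim) for w in ["good", "bad", "movie"]}

# Assign every distinct word an integer index; 0 is kept for padding.
word_index = {}
for tweet in tokenized_tweets:
    for word in tweet:
        word_index.setdefault(word, len(word_index) + 1)

# Row i of the matrix holds the pre-trained vector of the word with index i.
# Words without a pre-trained vector (and the padding row) stay all zeros.
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
for word, idx in word_index.items():
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]

# The model input is a 2-D array of word indices, padded with 0 on the right.
max_sequence_length = max(len(tweet) for tweet in tokenized_tweets)
X = np.zeros((len(tokenized_tweets), max_sequence_length), dtype=np.int64)
for i, tweet in enumerate(tokenized_tweets):
    X[i, :len(tweet)] = [word_index[w] for w in tweet]

print(X.shape)                 # (num_tweets, max_sequence_length)
print(embedding_matrix.shape)  # (vocab_size, embedding_dim)
```

An array like `X` would then replace `tweet_matrix` as the model input, and `embedding_matrix` could be handed to the Embedding layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)` together with `trainable=False` if the vectors should stay frozen. Note that this sketch assumes each tweet is a list of words; if the `Tweet` column holds plain strings, iterating over a tweet yields characters rather than words.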
