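I'm working on an NLP project in which I'm trying to use pretrained word embeddings to train an LSTM model for sentiment analysis. However, when I try to assign the word embeddings to a numpy array, I get a ValueError. Here is the code for my model: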
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
df = pd.read_csv('/content/drive/MyDrive/machinelearning/processed_data_CLEAN.csv')
subset_size = 10000
subset_data = df.sample(n=subset_size, random_state=42)
processed_tweets = subset_data['Tweet'].tolist()
w2v_model = Word2Vec.load('/content/drive/MyDrive/machinelearning/word2vec_model/model.bin')
num_tweets = len(processed_tweets)
embedding_dim = w2v_model.vector_size
max_sequence_length = max(len(tweet) for tweet in processed_tweets)
tweet_matrix = np.zeros((num_tweets, max_sequence_length, embedding_dim))
for i, tweet in enumerate(processed_tweets):
    for j, word in enumerate(tweet):
        if word in w2v_model.wv:
            tweet_matrix[i, j] = w2v_model.wv[word]
        else:
            tweet_matrix[i, j] = np.zeros((embedding_dim,))
polarity = subset_data['Polarity']
processed_polarity = polarity.replace({0: 0, 2: 1, 4: 2})
labels = to_categorical(processed_polarity, num_classes=3)
X_train, X_test, y_train, y_test = train_test_split(tweet_matrix, labels, test_size=0.2, random_state=42)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
processed_tweets_subset = processed_tweets[:num_tweets]
vocab_size = len(set(word for tweet in processed_tweets_subset for word in tweet))
num_classes = 3  # negative, neutral, and positive
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(units=num_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, epochs=5, batch_size=32, shuffle=True)
The error I get is:
ValueError: in user code: ValueError: Exception encountered when calling layer 'sequential_3' (type Sequential).
Input 0 of layer "lstm_3" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (32, 152, 100, 100)
Call arguments received by layer 'sequential_3' (type Sequential):
• inputs=tf.Tensor(shape=(32, 152, 100), dtype=float32)
• training=True
• mask=None
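Apparently there is an unexpected ndim, with shape (32, 152, 100, 100), and I don't know where it comes from. I have verified that the word embeddings have the correct shape and that the array dimensions match. Am I missing something, or is there a different way to assign the word embeddings to the array?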
1 answer
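What looks most suspicious is that the first Embedding layer of your network is sized exactly to your vocabulary, which means it is meant to receive *word indices*. Instead, you are passing it X_train, in which *every* element is already a looked-up embedding_dim-dimensional (probably 100-dimensional) vector. The Embedding layer then performs a second lookup on each of those values, which is likely where the extra dimension in the error message comes from.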
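The shape effect can be reproduced with plain numpy. The sketch below is only an illustration, not Keras itself: it assumes an embedding lookup behaves like row indexing into a weight table, and the sizes and variable names are made up rather than taken from the question.

```python
import numpy as np

vocab_size, embedding_dim = 50, 4   # toy sizes, not the ones from the question
batch, seq_len = 2, 5

# Stand-in for the Embedding layer's weights: one row per word index.
table = np.random.default_rng(0).normal(size=(vocab_size, embedding_dim))

# What an embedding lookup expects: integer word indices, shape (batch, seq_len).
index_input = np.zeros((batch, seq_len), dtype=int)
looked_up = table[index_input]
print(looked_up.shape)      # (2, 5, 4): 3 dimensions, which an LSTM accepts

# What the question passes in: vectors that were already looked up,
# shape (batch, seq_len, embedding_dim). Every number is treated as an index.
vector_input = np.zeros((batch, seq_len, embedding_dim))
double_lookup = table[vector_input.astype(int)]
print(double_lookup.shape)  # (2, 5, 4, 4): one dimension too many
```

With the question's sizes (batch 32, sequence length 152, embedding dimension 100), the same double lookup would give (32, 152, 100, 100), which matches the shape in the error message.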
Are you perhaps trying to combine two techniques that make different assumptions about where the word embeddings come from?
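You probably want to initialize the Embedding layer, in the proper place, with the pretrained vectors, and then build your X_train so that each example is an array of word indices rather than an array of word vectors.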
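A minimal sketch of that idea, using only numpy. The dictionary `pretrained` is a hypothetical stand-in for `w2v_model.wv`, the tweets are assumed to be already tokenized into lists of words, and reserving index 0 for padding and unknown words is one possible convention, not something the question specifies.

```python
import numpy as np

# Hypothetical stand-in for w2v_model.wv: word -> pretrained vector.
pretrained = {
    "good": np.array([0.1, 0.2, 0.3]),
    "bad": np.array([-0.1, -0.2, -0.3]),
    "movie": np.array([0.5, 0.0, -0.5]),
}
embedding_dim = 3

# Tokenized tweets (lists of words); "meh" has no pretrained vector.
tweets = [["good", "movie"], ["bad", "movie", "meh"]]

# Assumption: index 0 is reserved for padding and out-of-vocabulary words.
word_to_index = {word: i + 1 for i, word in enumerate(sorted(pretrained))}
vocab_size = len(word_to_index) + 1

# Row i of the matrix holds the pretrained vector of the word with index i.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_to_index.items():
    embedding_matrix[idx] = pretrained[word]

# Each example becomes a padded array of word indices, not word vectors.
max_len = max(len(tweet) for tweet in tweets)
X = np.zeros((len(tweets), max_len), dtype="int32")
for i, tweet in enumerate(tweets):
    for j, word in enumerate(tweet):
        X[i, j] = word_to_index.get(word, 0)

print(X.shape)                 # 2-D: (number of tweets, max_len)
print(embedding_matrix.shape)  # (vocab_size, embedding_dim)
```

An array like `X` has the 2-D shape an Embedding layer expects, and `embedding_matrix` can be used as that layer's initial weights with `input_dim=vocab_size`. In tf.keras this is commonly done with `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)`, but check this against the Keras version you have installed.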