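I am attempting the Kaggle getting-started competition Natural Language Processing with Disaster Tweets as the exam project for my university deep learning course.
I am trying to solve it with a multi-input network in which the keyword and location columns are each processed by a separate Conv1D network and the text column is processed by a TransformerEncoder. I got the Conv1D networks working, but the TransformerEncoder gives me the error in the title. I am using word embeddings (I tried both training from scratch and GloVe embeddings, and both give the same error) and positional encoding, based on the implementation (the TransformerEncoder and PositionalEncoding classes) in the second edition of Chollet's Deep Learning with Python.
Here is how I process the dataset:
1. I import the .csv as a Pandas DataFrame and, after some preprocessing, split it into training and validation sets.
2. I convert the DataFrames into tf.data datasets: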
train_text = data.Dataset.from_tensor_slices((train_data['text'].values.astype(str), train_data['target'].values.astype(bool)))
train_keywords = data.Dataset.from_tensor_slices((train_data['keyword'].values.astype(str), train_data['target'].values.astype(bool)))
train_loc = data.Dataset.from_tensor_slices((train_data['location'].values.astype(str), train_data['target'].values.astype(bool)))
val_text = data.Dataset.from_tensor_slices((validation_data['text'].values.astype(str), validation_data['target'].values.astype(bool)))
val_keywords = data.Dataset.from_tensor_slices((validation_data['keyword'].values.astype(str), validation_data['target'].values.astype(bool)))
val_loc = data.Dataset.from_tensor_slices((validation_data['location'].values.astype(str), validation_data['target'].values.astype(bool)))
I also tried an approach closer to the one in the Load a pandas DataFrame tutorial, but I got the same result.
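For reference, a dictionary-based variant in the spirit of that tutorial might look roughly like the sketch below. The small DataFrame is a made-up stand-in for the real training data, and the dictionary keys are chosen to match the Input names used later in this post; the strings would still need to be vectorized with a map step before reaching the model.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical stand-in for the preprocessed training DataFrame.
train_data = pd.DataFrame({
    "text": ["forest fire near la ronge", "i love fruits"],
    "keyword": ["fire", "none"],
    "location": ["canada", "unknown"],
    "target": [1, 0],
})

# Build one dataset whose elements are ({feature_name: string}, label) pairs,
# instead of three separate datasets that each carry their own copy of the label.
features = {
    col: train_data[col].values.astype(str)
    for col in ("text", "keyword", "location")
}
labels = train_data["target"].values.astype(bool)
train_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Pull the first element to see what the pipeline yields.
first_features, first_label = next(iter(train_ds))
```

With this layout the three columns stay aligned with a single label, so no zipping step is needed afterwards.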
3. I perform tokenization, standardization, and vectorization:
text_vectorization = TextVectorization(
    max_tokens=MAX_TOKENS_TEXT,
    output_sequence_length=max_text_length,
    standardize=standardize_text,
    output_mode='int'
)
keyword_vectorization = TextVectorization(
    max_tokens=MAX_TOKENS_KEYWORDS,
    output_sequence_length=MAX_KEYWORD_LENGTH,
    standardize=standardize_keywords,
    output_mode='int'
)
loc_vectorization = TextVectorization(
    max_tokens=MAX_TOKENS_KEYWORDS,
    output_sequence_length=MAX_LOC_LENGTH,
    standardize=standardize_loc,
    output_mode='int'
)
text_vectorization.adapt(train_text.map(lambda x, y: x))
keyword_vectorization.adapt(train_keywords.map(lambda x, y: x))
loc_vectorization.adapt(train_loc.map(lambda x, y: x))
train_text_vectorized = train_text.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=-1  # According to the documentation, -1 means auto
).batch(BATCH_SIZE)
train_loc_vectorized = train_loc.map(
    lambda x, y: (loc_vectorization(x), y),
    num_parallel_calls=-1
).batch(BATCH_SIZE)
train_keywords_vectorized = train_keywords.map(
    lambda x, y: (keyword_vectorization(x), y),
    num_parallel_calls=-1
).batch(BATCH_SIZE)
val_text_vectorized = val_text.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=-1
).batch(BATCH_SIZE)
val_loc_vectorized = val_loc.map(
    lambda x, y: (loc_vectorization(x), y),
    num_parallel_calls=-1
).batch(BATCH_SIZE)
val_keywords_vectorized = val_keywords.map(
    lambda x, y: (keyword_vectorization(x), y),
    num_parallel_calls=-1
).batch(BATCH_SIZE)
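Here I also tried the following, with the same result:
- Using a batch size of 1 instead of 32
- Not using .batch at all
- Using .padded_batch instead of .batch
4. Following the suggestion here, I zip the data into an input format the network accepts: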
def dataset_zipper(loc, text, keyword):
    # Each argument is an (inputs, label) pair; keep the three inputs and one copy of the label.
    return (loc[0], text[0], keyword[0]), text[1]
train_full_vectorized = data.Dataset.zip((train_loc_vectorized, train_text_vectorized, train_keywords_vectorized))
train_full_vectorized = train_full_vectorized.map(dataset_zipper, num_parallel_calls=-1)
val_full_vectorized = data.Dataset.zip((val_loc_vectorized, val_text_vectorized, val_keywords_vectorized))
val_full_vectorized = val_full_vectorized.map(dataset_zipper, num_parallel_calls=-1)
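One way to narrow down a batching problem in a pipeline like this is to look at each dataset's element_spec, which reports the per-element shapes without running the pipeline. The sketch below uses made-up tensors (3 samples, 5 token ids each) rather than the real vectorized tweets, so the numbers are only illustrative.

```python
import tensorflow as tf

# Hypothetical vectorized samples: 3 tweets, each padded to 5 token ids.
tokens = tf.zeros((3, 5), dtype=tf.int64)
labels = tf.constant([True, False, True])

unbatched = tf.data.Dataset.from_tensor_slices((tokens, labels))
batched = unbatched.batch(2)

# element_spec is a structure of TensorSpec objects mirroring each element.
unbatched_shape = unbatched.element_spec[0].shape.as_list()
batched_shape = batched.element_spec[0].shape.as_list()

print(unbatched_shape)  # [5]
print(batched_shape)    # [None, 5]
```

Comparing these shapes against the shape declared in each Input layer shows whether a dataset has been batched zero, one, or more times, and whether the sequence length matches what the model expects.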
Now I build the network:
loc_input = Input(shape=(MAX_TOKENS_KEYWORDS,), dtype='int64', name='location')
keyword_input = Input(shape=(MAX_TOKENS_KEYWORDS,), dtype='int64', name='keyword')
text_input = Input(shape=(MAX_TOKENS_TEXT,), dtype="int64", name='text')
full_network = concatenate([
    generate_convnet(loc=True, input_layer=loc_input),
    generate_transformer(input_layer=text_input),
    generate_convnet(loc=False, input_layer=keyword_input)
])
full_network = Dropout(0.3)(full_network)
full_network = Dense(1, activation='sigmoid')(full_network) # This is the classifier - since this is binary classification, I will use sigmoid activation
model = Model(inputs=[loc_input, text_input, keyword_input], outputs=full_network)
model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['binary_accuracy'])
callbacks = [
    ModelCheckpoint('twitter_disasters_v1.h5', save_best_only=True),
    EarlyStopping(monitor='val_loss', patience=5, mode='min')]
results = model.fit(x=train_full_vectorized, validation_data=val_full_vectorized, class_weight=class_weights, callbacks=callbacks, epochs=100)
This is where I get the error:
Node: 'IteratorGetNext'
2 root error(s) found.
(0) INVALID_ARGUMENT: Cannot add tensor to the batch: number of elements does not match. Shapes are: [tensor]: [5], [batch]: [0]
[[{{node IteratorGetNext}}]]
[[gradient_tape/model_1/transformer_encoder_1/multi_head_attention_1/query/einsum/Einsum/_144]]
(1) INVALID_ARGUMENT: Cannot add tensor to the batch: number of elements does not match. Shapes are: [tensor]: [5], [batch]: [0]
[[{{node IteratorGetNext}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_117129]
1 Answer
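Never mind, I should have experimented more. Moving the .batch call from step 3 to step 4 (where I zip the datasets) and setting the batch size to 1 did the trick; the network is now training.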
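A minimal sketch of that reordering, assuming the three vectorized datasets are left unbatched and batched only once after zipping. The tensors, sequence lengths, and BATCH_SIZE value below are placeholders, not the real data or settings.

```python
import tensorflow as tf

BATCH_SIZE = 2  # illustrative value only

# Hypothetical stand-ins for the vectorized but unbatched datasets.
labels = tf.constant([True, False, True])
loc = tf.data.Dataset.from_tensor_slices((tf.zeros((3, 4), tf.int64), labels))
text = tf.data.Dataset.from_tensor_slices((tf.ones((3, 6), tf.int64), labels))
keyword = tf.data.Dataset.from_tensor_slices((tf.zeros((3, 2), tf.int64), labels))


def dataset_zipper(loc, text, keyword):
    # Each argument is an (inputs, label) pair from one of the zipped datasets.
    return (loc[0], text[0], keyword[0]), text[1]


# Zip per-sample elements first, then batch the combined dataset once.
full = tf.data.Dataset.zip((loc, text, keyword))
full = full.map(dataset_zipper, num_parallel_calls=tf.data.AUTOTUNE)
full = full.batch(BATCH_SIZE)

(loc_b, text_b, keyword_b), label_b = next(iter(full))
```

Batching after the zip means every input and the label in a batch share the same leading dimension.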
Now I just need to deal with the fact that the loss is NaN while binary_accuracy works fine...