Keras: How to specify the input sequence length for the BERT tokenizer in TensorFlow?

Posted by p1iqtdky on 2022-12-04 in Other


I am following this example to use BERT for sentiment classification.

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # Not used directly; registers the ops the preprocessing model needs.

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3") # 128 by default
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].
embedding_model = tf.keras.Model(text_input, pooled_output)
sentences = tf.constant(["(your text here)"])
print(embedding_model(sentences))

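Judging from the output shape of encoder_inputs, the default sequence length appears to be 128. However, I don't know how to change it. Ideally, I would like to use a larger sequence length.
The preprocessor page has an example of modifying the sequence length, but I don't know how to incorporate it into my functional model definition above. I would greatly appreciate any help with this.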

keras

Source: https://stackoverflow.com/questions/68936835/how-to-specify-input-sequence-length-for-bert-tokenizer-in-tensorflow

1 Answer


nlejzf6q1#

Just going off the documentation here (I haven't tested this), but you could load the preprocessor with hub.load:

preprocessor = hub.load(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")

text_inputs = [tf.keras.layers.Input(shape=(), dtype=tf.string)]

It looks like your code above does not tokenize the input explicitly; see below:

tokenize = hub.KerasLayer(preprocessor.tokenize)
tokenized_inputs = [tokenize(segment) for segment in text_inputs]

Next, choose the sequence length:

seq_length = 128  # Your choice here.

This is where seq_length gets passed in:

bert_pack_inputs = hub.KerasLayer(
    preprocessor.bert_pack_inputs,
    arguments=dict(seq_length=seq_length))  # Optional argument.

Now encode the inputs by running bert_pack_inputs (this replaces preprocessor(text_input) above):

encoder_inputs = bert_pack_inputs(tokenized_inputs)
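
To make the effect of seq_length concrete, here is a minimal pure-Python sketch of what packing a single segment to a fixed length amounts to conceptually. It is an illustration, not the TF Hub implementation: the function name pack_single_segment is hypothetical, the token ids in the usage lines are arbitrary examples, and the special-token ids (0 for [PAD], 101 for [CLS], 102 for [SEP]) assume the standard uncased English BERT vocabulary.

```python
from typing import Dict, List

# Special-token ids, assuming the standard uncased English BERT vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pack_single_segment(token_ids: List[int], seq_length: int) -> Dict[str, List[int]]:
    """Hypothetical stand-in showing how one segment is packed to seq_length."""
    if seq_length < 2:
        raise ValueError("seq_length must leave room for [CLS] and [SEP]")
    # Truncate so that [CLS] + tokens + [SEP] fits into seq_length positions.
    kept = token_ids[: seq_length - 2]
    ids = [CLS_ID] + kept + [SEP_ID]
    n_pad = seq_length - len(ids)
    return {
        "input_word_ids": ids + [PAD_ID] * n_pad,    # padded token ids
        "input_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,          # single segment: all zeros
    }


packed = pack_single_segment([7592, 2088], seq_length=8)
print(packed["input_word_ids"])  # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(packed["input_mask"])      # [1, 1, 1, 1, 0, 0, 0, 0]
```

Under these assumptions, a larger seq_length leaves more room before truncation, while short inputs are simply padded out. Note that the BERT encoder used here has a fixed number of position embeddings (512), so seq_length cannot usefully exceed that.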

Then the rest of your code:

encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True)

outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].
embedding_model = tf.keras.Model(text_inputs, pooled_output)  # text_inputs: the Input list defined above.
sentences = tf.constant(["(your text here)"])
print(embedding_model(sentences))

