Huggingface Transformers NER - offset mapping causes a ValueError in NumPy boolean array indexing assignment

41ik7eoe · posted 2023-10-19

I tried the NER tutorial Token Classification with W-NUT Emerging Entities (https://huggingface.co/transformers/custom_tagets.html#tok-ner) in Google Colab, using the Annotated Corpus for Named Entity Recognition from Kaggle (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv).
I will walk through my process in detail so it is clear what I am doing, and so the community can help me pinpoint where the indexing-assignment error comes from.
To load the data I had saved to Google Drive, I used the following code:

# import pandas library
import pandas as pd

# columns to select
cols_to_select = ["Sentence #", "Word", "Tag"]

# google drive data path
data_path =  '/content/drive/MyDrive/Colab Notebooks/ner/ner_dataset.csv'

# load the csv from the mounted drive and forward-fill the sentence numbers
dataset = pd.read_csv(data_path, encoding="latin-1")[cols_to_select].fillna(method='ffill')
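
Note that data_path above only resolves if Google Drive is mounted in the Colab runtime; if it is not, a minimal mount step using Colab's standard drive helper is:

# mount google drive so that /content/drive paths resolve
from google.colab import drive
drive.mount('/content/drive')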

I ran the following code to parse the sentences and tags:

class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # collect each sentence's words and tags as (word, tag) pairs
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def retrieve(self):
        # return the next sentence as a list of (word, tag) pairs
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

# get full data
getter = SentenceGetter(dataset)

# get sentences
sentences = [[s[0] for s in sent] for sent in getter.sentences]

# get tags/labels
tags = [[s[1] for s in sent] for sent in getter.sentences]

# take a look at the data
print(sentences[0][0:5], tags[0][0:5], sep='\n')

Then I split the data into train, validation, and test sets:

# import the sklearn module
from sklearn.model_selection import train_test_split

# split data into temp (train + validation) and test sets
temp_texts, test_texts, temp_tags, test_tags = train_test_split(sentences,
                                                                tags,
                                                                test_size=0.20,
                                                                random_state=15)

# split data into train and validation sets
train_texts, val_texts, train_tags, val_tags = train_test_split(temp_texts,
                                                                temp_tags, 
                                                                test_size=0.20,
                                                                random_state=15)

After splitting the data, I created the tag encodings (tag-to-id and id-to-tag mappings):

# unique tags across the full dataset
unique_tags = dataset.Tag.unique()

# create tags to id
tag2id = {tag: id for id, tag in enumerate(unique_tags)}

# create id to tags
id2tag = {id: tag for tag, id in tag2id.items()}
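
As a quick sanity check (not part of the tutorial), the two mappings should round-trip:

# every tag should map to an id and back to itself
assert all(id2tag[tag2id[tag]] == tag for tag in unique_tags)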

Then I installed the Transformers library in Colab:

# install the transformers library
! pip install transformers

Next, I loaded the tokenizer for the small BERT model:

# import the transformers module
from transformers import BertTokenizerFast

# load the fast tokenizer for the small bert checkpoint
model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

Then I created the encodings for the tokens:

# create train set encodings
train_encodings = tokenizer(train_texts, 
                            is_split_into_words=True, 
                            return_offsets_mapping=True, 
                            padding=True,
                            max_length=128,
                            truncation=True)

# create validation set encodings
val_encodings = tokenizer(val_texts,
                          is_split_into_words=True, 
                          return_offsets_mapping=True, 
                          padding=True,
                          max_length=128,
                          truncation=True)

# create test set encodings
test_encodings = tokenizer(test_texts,
                          is_split_into_words=True,
                          return_offsets_mapping=True, 
                          padding=True,
                          max_length=128,
                          truncation=True)
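
For context, each offset pair is the character span of a sub-token within its original word; special tokens such as [CLS], [SEP], and [PAD] get the offset (0, 0). A quick way to inspect this (a small sketch, assuming the fast tokenizer's Encoding API) is:

# print each sub-token of the first training example next to its offset pair
for token, offset in zip(train_encodings[0].tokens,
                         train_encodings.offset_mapping[0]):
    print(token, offset)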

The tutorial uses the offset mapping to handle the problems introduced by word-piece tokenization, in particular the mismatch between tokens and labels. It was when running the tutorial's offset-mapping code that I got the error. Here is the offset-mapping function used in the tutorial:

# the offset function
import numpy as np

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # initialize all positions to -100 (labels to be ignored)
        doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels

# return the encoded labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
test_labels = encode_tags(test_tags, test_encodings)

After running the code above, I got the following error, and I cannot figure out where it comes from. Any help or pointers would be greatly appreciated.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-afdff0186eb3> in <module>()
     17 
     18 # return the encoded labels
---> 19 train_labels = encode_tags(train_tags, train_encodings)
     20 val_labels = encode_tags(val_tags, val_encodings)
     21 test_labels = encode_tags(test_tags, test_encodings)

<ipython-input-19-afdff0186eb3> in encode_tags(tags, encodings)
     11 
     12         # set labels whose first offset position is 0 and the second is not 0
---> 13         doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
     14         encoded_labels.append(doc_enc_labels.tolist())
     15 

ValueError: NumPy boolean array indexing assignment cannot assign 38 input values to the 37 output values where the mask is true

nwo49xxi1#

You can replace the encode_tags function with the code below to understand your error:

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    i = 0
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # if the mask selects a different number of positions than there
        # are labels, print every token with its offsets to expose the
        # offending sub-token before the assignment fails
        if len(doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)]) != len(doc_labels):
            for a, b in zip(encodings[i].tokens, doc_offset):
                print(a, b)
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
        i += 1

    return encoded_labels

In my case I had the same error, because the tokenizer was adding empty sub-tokens with offset_mapping = (0, 1).
Hope this helps someone track down the problem.
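
One way to avoid such mismatches entirely (a sketch, assuming a fast tokenizer so that word_ids() is available; the helper name is arbitrary) is to align the labels through word_ids() instead of offsets, labelling only the first sub-token of each word:

def encode_tags_by_word_ids(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for i, doc_labels in enumerate(labels):
        # word_ids() returns None for special tokens and repeats the
        # word index for continuation sub-tokens
        word_ids = encodings.word_ids(batch_index=i)
        doc_enc_labels = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None or word_id == previous_word_id:
                doc_enc_labels.append(-100)  # ignored by the loss
            else:
                doc_enc_labels.append(doc_labels[word_id])
            previous_word_id = word_id
        encoded_labels.append(doc_enc_labels)
    return encoded_labels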
