PyTorch: training BartForSequenceClassification returns data with non-uniform dimensions

zbsbpyhn  posted on 2023-06-06 in Other

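I am trying to fine-tune a BART-based model on a dataset I have. The dataset has the columns "id", "text", "label", and "dataset_id". The "text" column is plain text and is what I want to use as the model input. "label" is either 0 or 1.
I wrote the training code with transformers==4.28.0.
Here is the code for the dataset class: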

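import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, encodings):
        # Mapping of field name -> tensor (input_ids, attention_mask, labels)
        self.encodings = encodings

    def __getitem__(self, idx):
        # The values are already tensors (return_tensors="pt"), so use
        # torch.as_tensor to avoid the copy warning raised by torch.tensor.
        return {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])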

Here is the code for loading and encoding the data:

import os
import pandas as pd

def load_data(directory):
    files = os.listdir(directory)
    dfs = []
    for file in files:
        if file.endswith('train.csv'):
            df = pd.read_csv(os.path.join(directory, file))
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
print(len(load_data("splitted_data/gender-bias")))

def encode_data(tokenizer, text, labels):
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    inputs['labels'] = torch.tensor(labels)
    return inputs

Here is the code for the evaluation metric. I use the f1_score function from scikit-learn.

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
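For reference, f1_score with its default settings reports the F1 of the positive class (label 1), that is, the harmonic mean of precision and recall. The sketch below spells out that arithmetic by hand. The helper name binary_f1 and the sample labels and predictions are made up for illustration and are not part of the original post.

```python
def binary_f1(labels, predictions, positive=1):
    """Hand-rolled binary F1 for the `positive` class (illustrative helper)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 true positives, 1 false positive, 1 false negative.
sample_labels = [1, 0, 1, 1, 0, 0]
sample_predictions = [1, 0, 0, 1, 1, 0]
sample_f1 = binary_f1(sample_labels, sample_predictions)
print(sample_f1)
```

Here precision and recall are both 2/3, so the F1 is 2/3 as well.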

Here is the training function:

from transformers import Trainer, TrainingArguments

def train_model(train_dataset, eval_dataset):
    # Define the training arguments
    training_args = TrainingArguments(
        output_dir='./baseline/results',           # output directory
        num_train_epochs=5,               # total number of training epochs
        per_device_train_batch_size=32,   # batch size per device during training
        per_device_eval_batch_size=64,    # batch size for evaluation
        warmup_steps=500,                 # number of warmup steps for learning rate scheduler
        weight_decay=0.01,                # strength of weight decay
        evaluation_strategy="steps",      # evaluate every `eval_steps` training steps
        eval_steps=50,                    # number of training steps between evaluations
        load_best_model_at_end=True,      # load the best model when finished training (defaults to `False`)
        save_strategy='steps',            # save a checkpoint every `save_steps` training steps
        save_steps=500,                   # number of training steps between saves
        metric_for_best_model='f1',       # metric to use to compare models
        greater_is_better=True            # whether a larger metric value is better
    )

    # Define the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

    return trainer

And this is how I define the model and the rest:

from transformers import BartForSequenceClassification, BartTokenizer

model = BartForSequenceClassification.from_pretrained('facebook/bart-base', num_labels=2)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

train_df = load_data("splitted_data/gender-bias")
train_encodings = encode_data(tokenizer, train_df['text'].tolist(), train_df['label'].tolist())

# For simplicity, let's split our training data to create a pseudo-evaluation set
train_size = int(0.9 * len(train_encodings['input_ids']))  # 90% for training
train_dataset = {k: v[:train_size] for k, v in train_encodings.items()}
print(train_dataset)
print(len(train_dataset))

eval_dataset = {k: v[train_size:] for k, v in train_encodings.items()}  # 10% for evaluation

# Convert the dictionary data to PyTorch Dataset
train_dataset = TextDataset(train_dataset)
eval_dataset = TextDataset(eval_dataset)

trainer = train_model(train_dataset, eval_dataset)

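Training itself looks fine. However, when evaluation runs during training, my compute_metrics function, which receives the model's output as its argument, raises an error. The model should be a binary classifier that, I believe, returns a probability for each label in its output.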

np.argmax(np.array(logits), axis=-1) 21 
ValueError: could not broadcast input array from shape (3208,2) into shape (3208,)

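I printed the type of logits, and type(logits) turned out to be a tuple. Thinking that the evaluation dataset might have been split into batches, and that the returned tuple held many separate NumPy arrays, I also tried concatenating the tuple.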

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    logits = np.concatenate(logits, axis=0)
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}

But this raised a new error:

packages/numpy/core/overrides.py in concatenate(*args, **kwargs) 

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)
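Both errors can be reproduced with small stand-in arrays. The sketch below assumes (the question does not state this) that eval_pred.predictions is a tuple of two arrays with different ranks, one 2-D and one 3-D. The shapes and the helper raises_value_error are made up for illustration.

```python
import numpy as np

# Stand-ins for a (num_examples, num_labels) array and a
# (num_examples, seq_len, hidden_size) array; the sizes are arbitrary.
logits = np.zeros((4, 2))
hidden = np.zeros((4, 3, 5))
predictions = (logits, hidden)


def raises_value_error(fn):
    """Return True if calling fn() raises ValueError."""
    try:
        fn()
    except ValueError:
        return True
    return False


# Packing the ragged tuple into a single array fails.
stack_fails = raises_value_error(lambda: np.array(predictions))

# Concatenation fails too, because the arrays have 2 and 3 dimensions.
concat_fails = raises_value_error(lambda: np.concatenate(predictions, axis=0))

# Printing the shapes of the tuple's elements makes the mismatch visible.
shapes = [p.shape for p in predictions]
print(stack_fails, concat_fails, shapes)
```

Inspecting the shape of each element of the tuple, as in the last step, is usually the quickest way to see what compute_metrics is actually receiving.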

How can I solve this?

hgc7kmma (answer #1)

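I found the answer. The returned tuple has the shapes [(3208, 2), (3208, 128, 768)], so it returns two things at once. The first element of the tuple holds the predicted binary logits, while the second appears to be the output of one of the layers of my BART model. With that in mind, the code runs fine when written like this: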

def compute_metrics(eval_pred):
    logits = eval_pred.predictions[0]
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
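A slightly more defensive variant is sketched below. It assumes, as observed above, that the logits are the first element whenever the predictions arrive as a tuple. The helper name extract_logits and the sample arrays are hypothetical. The second function is written for the Trainer's preprocess_logits_for_metrics argument, which is called with (logits, labels) for each evaluation batch before the results are accumulated.

```python
import numpy as np


def extract_logits(predictions):
    """Return the classification logits from the Trainer's predictions.

    Hypothetical helper: assumes that when `predictions` is a tuple or list,
    the logits are its first element.
    """
    if isinstance(predictions, (tuple, list)):
        predictions = predictions[0]
    return np.asarray(predictions)


def preprocess_logits_for_metrics(logits, labels):
    """Reduce the model outputs to class ids for each batch.

    Works on any object with an `argmax` method: torch tensors during
    evaluation, NumPy arrays in the demo below.
    """
    if isinstance(logits, (tuple, list)):
        logits = logits[0]
    return logits.argmax(-1)


# Demo with made-up values: three examples, two labels, plus an extra 3-D array.
fake_predictions = (
    np.array([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]),
    np.zeros((3, 4, 8)),
)
class_ids = np.argmax(extract_logits(fake_predictions), axis=-1)
hook_ids = preprocess_logits_for_metrics(fake_predictions, labels=None)
print(class_ids, hook_ids)
```

Inside compute_metrics, extract_logits(eval_pred.predictions) could replace the hard-coded [0]. Alternatively, passing preprocess_logits_for_metrics=preprocess_logits_for_metrics to Trainer should make eval_pred.predictions hold class ids directly, in which case compute_metrics would skip its own argmax. That route also avoids accumulating the large (3208, 128, 768) array over the whole evaluation set.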
