I am trying to fine-tune a BART-based model on a dataset I have. The dataset has the columns "id", "text", "label" and "dataset_id". The "text" column, which is plain text, is what I want to use as the model input, and "label" is a value of 0 or 1.
I have written the training code with transformers==4.28.0.
Here is the code for the dataset class:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Return one sample as a dict of tensors (input_ids, attention_mask, labels, ...)
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])
Here is the code that loads and encodes the data:
import os
import pandas as pd

def load_data(directory):
    # Concatenate every file ending in "train.csv" in the directory into one DataFrame
    files = os.listdir(directory)
    dfs = []
    for file in files:
        if file.endswith('train.csv'):
            df = pd.read_csv(os.path.join(directory, file))
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

print(len(load_data("splitted_data/gender-bias")))

def encode_data(tokenizer, text, labels):
    # Tokenize to fixed-length (max_length=128) tensors and attach the labels
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    inputs['labels'] = torch.tensor(labels)
    return inputs
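As a quick sanity check (an illustrative sketch; it assumes the tokenizer instantiated further below), encode_data produces fixed-length tensors:

# Illustrative only: `tokenizer` is defined further down in the post
sample = encode_data(tokenizer, ["an example sentence"], [0])
print(sample['input_ids'].shape)  # torch.Size([1, 128])
print(sample['labels'].shape)     # torch.Size([1])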
Here is the code for the evaluation metric. I use the f1_score function from scikit-learn.
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
Here is the training function:
from transformers import Trainer, TrainingArguments

def train_model(train_dataset, eval_dataset):
    # Define the training arguments
    training_args = TrainingArguments(
        output_dir='./baseline/results',  # output directory
        num_train_epochs=5,               # total number of training epochs
        per_device_train_batch_size=32,   # batch size per device during training
        per_device_eval_batch_size=64,    # batch size for evaluation
        warmup_steps=500,                 # number of warmup steps for the learning rate scheduler
        weight_decay=0.01,                # strength of weight decay
        evaluation_strategy="steps",      # evaluate every `eval_steps` training steps
        eval_steps=50,                    # number of training steps between evaluations
        load_best_model_at_end=True,      # load the best model when finished training (defaults to `False`)
        save_strategy='steps',            # save a checkpoint every `save_steps` training steps
        save_steps=500,                   # number of training steps between saves
        metric_for_best_model='f1',       # metric used to compare checkpoints
        greater_is_better=True            # whether a larger metric value is better
    )

    # Define the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()
    return trainer
And this is how I define the model and run everything:
from transformers import BartForSequenceClassification, BartTokenizer

model = BartForSequenceClassification.from_pretrained('facebook/bart-base', num_labels=2)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

train_df = load_data("splitted_data/gender-bias")
train_encodings = encode_data(tokenizer, train_df['text'].tolist(), train_df['label'].tolist())

# For simplicity, let's split our training data to create a pseudo-evaluation set
train_size = int(0.9 * len(train_encodings['input_ids']))  # 90% for training
train_dataset = {k: v[:train_size] for k, v in train_encodings.items()}
print(train_dataset)
print(len(train_dataset))
eval_dataset = {k: v[train_size:] for k, v in train_encodings.items()}  # 10% for evaluation

# Convert the dictionary data to PyTorch Datasets
train_dataset = TextDataset(train_dataset)
eval_dataset = TextDataset(eval_dataset)

trainer = train_model(train_dataset, eval_dataset)
Training itself runs fine. However, when evaluation is performed during training, my compute_metrics function, which takes the model's output as its argument, raises an error. The model is supposed to be a binary classification model, and I believed it returned the probability of each label in its output:
np.argmax(np.array(logits), axis=-1)
ValueError: could not broadcast input array from shape (3208,2) into shape (3208,)
I printed the type of logits, and it turned out to be a Tuple. Suspecting that this happens because the evaluation dataset is split into batches, so that the returned Tuple holds one individual numpy array per batch, I also tried concatenating the tuple (a minimal inspection sketch is included after the second traceback below):
def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    logits = np.concatenate(logits, axis=0)
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
But this raised a new error:
packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)
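For reference, here is a minimal sketch of how the tuple can be inspected (inspect_metrics is a hypothetical stand-in that can be passed to the Trainer as compute_metrics just to print what it receives):

def inspect_metrics(eval_pred):
    # Print the container type and the shape of every element the Trainer passes in
    preds = eval_pred.predictions
    print(type(preds))
    if isinstance(preds, tuple):
        for i, p in enumerate(preds):
            print(i, np.asarray(p).shape)
    return {}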
How can I fix this?
1 Answer
I found the answer. The returned tuple has shapes [(3208, 2), (3208, 128, 768)], so the model returns two things at once. The first element of the tuple holds the predicted binary logits, while the second seems to be the output of one of my BART model's layers (128 matches the max_length and 768 the model's hidden size). So compute_metrics should only use the first element; written this way, the code runs fine:
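def compute_metrics(eval_pred):
    # The first element of the predictions tuple holds the (num_samples, 2) classification logits
    logits = eval_pred.predictions[0]
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}

(As an aside, the Trainer also accepts a preprocess_logits_for_metrics callback, which can be used to keep only the logits before the evaluation outputs are accumulated across batches.)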