Training a BartForSequenceClassification returns data with non-uniform dimensions

Question
I am trying to fine-tune a BART-base model on a dataset that I have. The dataset has the columns "id", "text", "label" and "dataset_id". The "text" column is what I want to use as input to the model, and it is plain text. "label" is a value of either 0 or 1.

I've already written the training code, using transformers==4.28.0.
This is the code for the dataset class:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Return one example as a dict of tensors, indexing into each
        # encoding field (input_ids, attention_mask, labels).
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])
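As an aside (not in the original post), here is a minimal sketch of how this class behaves, using hypothetical toy encodings in place of real tokenizer output:

toy_encodings = {
    "input_ids": [[0, 100, 2], [0, 200, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "labels": [0, 1],
}
ds = TextDataset(toy_encodings)
print(len(ds))  # 2 -- one entry per row of input_ids
print(ds[0])    # {'input_ids': tensor([0, 100, 2]), 'attention_mask': tensor([1, 1, 1]), 'labels': tensor(0)}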
This is the code for loading and encoding the data:

import os
import pandas as pd

def load_data(directory):
    # Read and concatenate every CSV in the directory whose name ends in "train.csv".
    files = os.listdir(directory)
    dfs = []
    for file in files:
        if file.endswith('train.csv'):
            df = pd.read_csv(os.path.join(directory, file))
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

print(len(load_data("splitted_data/gender-bias")))

def encode_data(tokenizer, text, labels):
    # Tokenize to a fixed length of 128 and attach the labels as a tensor.
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    inputs['labels'] = torch.tensor(labels)
    return inputs
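A quick illustration (not in the original post) of what encode_data returns; this assumes the facebook/bart-base tokenizer that the post constructs further down. Because return_tensors="pt" is used, every value is already a padded PyTorch tensor of fixed width 128:

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
enc = encode_data(tokenizer, ["a short example", "another one"], [0, 1])
print(enc["input_ids"].shape)       # torch.Size([2, 128]) -- padded to max_length
print(enc["attention_mask"].shape)  # torch.Size([2, 128])
print(enc["labels"])                # tensor([0, 1])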
This is the code for the evaluation metric. I use the f1_score function from scikit-learn.

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
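For reference (not in the original post), this function expects predictions to be a single (num_examples, num_labels) array; a minimal sketch with a hand-built EvalPrediction shows the intended input shape:

from transformers import EvalPrediction

fake = EvalPrediction(predictions=np.array([[0.2, 0.8], [0.9, 0.1]]),
                      label_ids=np.array([1, 0]))
print(compute_metrics(fake))  # {'f1': 1.0}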
This is the training function:
from transformers import Trainer, TrainingArguments

def train_model(train_dataset, eval_dataset):
    # Define the training arguments
    training_args = TrainingArguments(
        output_dir='./baseline/results',   # output directory
        num_train_epochs=5,                # total number of training epochs
        per_device_train_batch_size=32,    # batch size per device during training
        per_device_eval_batch_size=64,     # batch size for evaluation
        warmup_steps=500,                  # number of warmup steps for the learning rate scheduler
        weight_decay=0.01,                 # strength of weight decay
        evaluation_strategy="steps",       # evaluate every `eval_steps` training steps
        eval_steps=50,                     # number of training steps between evaluations
        load_best_model_at_end=True,       # load the best model when finished training (defaults to `False`)
        save_strategy='steps',             # save a checkpoint every `save_steps` training steps
        save_steps=500,                    # number of training steps between saves
        metric_for_best_model='f1',        # metric to use to compare models
        greater_is_better=True             # whether a larger metric value is better
    )
    # Define the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )
    # Train the model
    trainer.train()
    return trainer
This is how I defined the model, the tokenizer, and the datasets:
from transformers import BartForSequenceClassification, BartTokenizer

model = BartForSequenceClassification.from_pretrained('facebook/bart-base', num_labels=2)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

train_df = load_data("splitted_data/gender-bias")
train_encodings = encode_data(tokenizer, train_df['text'].tolist(), train_df['label'].tolist())

# For simplicity, let's split our training data to create a pseudo-evaluation set
train_size = int(0.9 * len(train_encodings['input_ids']))  # 90% for training
train_dataset = {k: v[:train_size] for k, v in train_encodings.items()}
print(train_dataset)
print(len(train_dataset))
eval_dataset = {k: v[train_size:] for k, v in train_encodings.items()}  # 10% for evaluation

# Convert the dictionary data to PyTorch Dataset
train_dataset = TextDataset(train_dataset)
eval_dataset = TextDataset(eval_dataset)
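# (Illustrative, not from the original post) a quick sanity check before training:
# every item should be a dict of fixed-size tensors so that batches stack uniformly.
print(len(train_dataset), len(eval_dataset))  # number of train / eval examples
print({k: tuple(v.shape) for k, v in train_dataset[0].items()})
# expected: {'input_ids': (128,), 'attention_mask': (128,), 'labels': ()}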
trainer = train_model(train_dataset, eval_dataset)
The training looks just fine. However, when it comes to evaluation during training, an error is raised from my compute_metrics function, which takes the output of the model as a parameter. The model should be a binary classification model, returning the probabilities (logits) of each label in its output, I believe.
np.argmax(np.array(logits), axis=-1)

ValueError: could not broadcast input array from shape (3208,2) into shape (3208,)
I've tried printing the type of the logits, and it turns out that type(logits) returns tuple. Considering that this might be caused by the fact that the evaluation dataset is split into batches, and that the returned tuple is a number of separate NumPy arrays, I've also tried to concatenate the tuple:
def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    logits = np.concatenate(logits, axis=0)
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
But this raised a new error:
packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)
How can I solve this issue?
Answer 1

Score: 0
I've found the answer. Since the returned tuple has shapes [(3208, 2), (3208, 128, 768)], it is returning two things simultaneously. The first element of this tuple holds the binary classification logits for the predictions, while the second seems to be the output of one of my BART model's layers (its (batch, sequence_length, hidden_size) shape suggests the encoder's last hidden state). Hence, the code works well when I write it as below:
def compute_metrics(eval_pred):
    # Keep only the first element of the predictions tuple: the classification logits.
    logits = eval_pred.predictions[0]
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
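As an aside (not part of the original answer), recent transformers versions, including the 4.28 used here, also let you strip the extra outputs before they are accumulated across batches, via the Trainer's preprocess_logits_for_metrics hook. A minimal sketch, with the hypothetical helper name keep_logits_only:

def keep_logits_only(logits, labels):
    # The model's forward pass can return extra tensors (e.g. hidden states)
    # alongside the classification logits; drop everything but the logits so
    # compute_metrics receives a plain (num_examples, num_labels) array.
    if isinstance(logits, tuple):
        return logits[0]
    return logits

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=keep_logits_only,
)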