Training a BartForSequenceClassification returns data with non-uniform dimensions

Question

I am trying to fine-tune a BART-base model on my own dataset. The dataset has the columns "id", "text", "label", and "dataset_id". The "text" column holds plain text and is what I want to use as the model input; "label" is either 0 or 1.

I've already written the training code, using transformers==4.28.0.

This is the code for the dataset class:

import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # One example: a dict of tensors (input_ids, attention_mask, labels)
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])

This is the code for loading and encoding the data:

import os
import pandas as pd

def load_data(directory):
    # Concatenate every file in the directory whose name ends in 'train.csv'
    files = os.listdir(directory)
    dfs = []
    for file in files:
        if file.endswith('train.csv'):
            df = pd.read_csv(os.path.join(directory, file))
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

print(len(load_data("splitted_data/gender-bias")))

def encode_data(tokenizer, text, labels):
    # Pad or truncate every text to 128 tokens and attach the labels
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    inputs['labels'] = torch.tensor(labels)
    return inputs

This is the code for the evaluation metric. I use the f1_score function from scikit-learn.

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}

This is the training function:

from transformers import Trainer, TrainingArguments

def train_model(train_dataset, eval_dataset):
    # Define the training arguments
    training_args = TrainingArguments(
        output_dir='./baseline/results',  # output directory
        num_train_epochs=5,               # total number of training epochs
        per_device_train_batch_size=32,   # batch size per device during training
        per_device_eval_batch_size=64,    # batch size for evaluation
        warmup_steps=500,                 # number of warmup steps for learning rate scheduler
        weight_decay=0.01,                # strength of weight decay
        evaluation_strategy="steps",      # evaluate every `eval_steps` training steps
        eval_steps=50,                    # number of training steps between evaluations
        load_best_model_at_end=True,      # load the best model when finished training (defaults to `False`)
        save_strategy='steps',            # save a checkpoint every `save_steps` training steps
        save_steps=500,                   # number of training steps between saves
        metric_for_best_model='f1',       # metric to use to compare models
        greater_is_better=True            # whether a larger metric value is better
    )

    # Define the trainer (uses the module-level `model`)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

    return trainer

This is how I defined the model, the tokenizer, and the datasets:

from transformers import BartForSequenceClassification, BartTokenizer

model = BartForSequenceClassification.from_pretrained('facebook/bart-base', num_labels=2)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

train_df = load_data("splitted_data/gender-bias")
train_encodings = encode_data(tokenizer, train_df['text'].tolist(), train_df['label'].tolist())

# For simplicity, let's split our training data to create a pseudo-evaluation set
train_size = int(0.9 * len(train_encodings['input_ids']))  # 90% for training
train_dataset = {k: v[:train_size] for k, v in train_encodings.items()}
print(train_dataset)
print(len(train_dataset))  # number of keys in the dict, not number of examples

eval_dataset = {k: v[train_size:] for k, v in train_encodings.items()}  # 10% for evaluation

# Convert the dictionary data to PyTorch Dataset
train_dataset = TextDataset(train_dataset)
eval_dataset = TextDataset(eval_dataset)

trainer = train_model(train_dataset, eval_dataset)

Training itself runs fine. However, when evaluation runs during training, my compute_metrics function raises an error. The function receives the model's output as its argument. The model should be a binary classifier, so I believe its output contains a score for each of the two labels.

np.argmax(np.array(logits), axis=-1) 21 
ValueError: could not broadcast input array from shape (3208,2) into shape (3208,)
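
For reference, this is what compute_metrics expects to receive when the predictions are a single array of logits. The sketch below uses made-up logits for three examples (not values from the dataset above) and a hand-written softmax helper. It shows that a two-label head gives one score per label and that argmax over the last axis picks the predicted label.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw logits into probabilities along the last axis."""
    # Subtract the row maximum first for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for 3 examples and 2 labels, i.e. shape (3, 2)
logits = np.array([[2.0, -1.0],
                   [0.5, 1.5],
                   [-0.3, -0.2]])

probs = softmax(logits)                   # each row sums to 1
predictions = np.argmax(logits, axis=-1)  # index of the larger logit per row

print(probs)
print(predictions)
```

Softmax does not change which entry is largest, so taking argmax of the logits gives the same labels as taking argmax of the probabilities. This only works when the input is one array of shape (num_examples, 2), which is not what the traceback above suggests was passed in.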

I printed the type of the logits, and type(logits) turns out to be a tuple. I suspected that the evaluation dataset is split into batches and that the tuple holds one NumPy array per batch, so I also tried concatenating the tuple:

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    logits = np.concatenate(logits, axis=0)
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}

But this raised a new error:

packages/numpy/core/overrides.py in concatenate(*args, **kwargs) 
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)
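
The message says the tuple holds a 2-D array and a 3-D array, so its elements are not per-batch chunks of the same thing. A minimal sketch of how one might inspect such a tuple, and why np.concatenate rejects it, follows. The arrays are small made-up stand-ins, and `describe` is a hypothetical helper, not part of transformers or NumPy.

```python
import numpy as np

# Made-up stand-ins: one 2-D array and one 3-D array with the same first dimension
fake_logits = np.zeros((8, 2))
fake_hidden = np.zeros((8, 4, 6))
predictions = (fake_logits, fake_hidden)

def describe(predictions):
    """Return the shape of each array, whether predictions is one array or a tuple."""
    if isinstance(predictions, (tuple, list)):
        return [np.asarray(p).shape for p in predictions]
    return [np.asarray(predictions).shape]

shapes = describe(predictions)
print(shapes)

# Concatenation needs every input to have the same number of dimensions
try:
    np.concatenate(predictions, axis=0)
    concat_failed = False
except ValueError as err:
    concat_failed = True
    print("concatenate failed:", err)
```

Printing the shapes inside compute_metrics in the same way is a quick way to find out which element of the tuple actually holds the logits.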

How can I solve this issue?

Answer 1

Score: 0

I've found the answer. The returned tuple holds two arrays, with shapes (3208, 2) and (3208, 128, 768), so the model is returning two things at once. The first element contains the logits for the two classes. The second appears to be the output of one of the layers of my BART model, most likely the encoder's last hidden state (128 token positions by 768 hidden units per example). The code therefore works when I take only the first element:

def compute_metrics(eval_pred):
    logits = eval_pred.predictions[0]
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
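
A possible alternative, assuming the installed transformers version's Trainer accepts the `preprocess_logits_for_metrics` argument, is to reduce the model output to class ids before the Trainer accumulates it. That would avoid keeping the large 3-D array in memory for the whole evaluation set. The sketch below is untested against the setup in the question. The demo at the bottom uses made-up NumPy arrays in place of the tensors the Trainer would pass.

```python
import numpy as np

def preprocess_logits_for_metrics(logits, labels):
    """Keep only the predicted class ids from the model output.

    `logits` may be a single array/tensor or a tuple whose first element
    holds the classification logits (as observed in the answer above).
    """
    if isinstance(logits, (tuple, list)):
        logits = logits[0]
    # Positional -1 works for both torch tensors and NumPy arrays
    return logits.argmax(-1)

# Demo with made-up stand-ins for the (logits, hidden_state) tuple
demo_output = (np.array([[0.1, 0.9], [2.0, -1.0]]), np.zeros((2, 4, 6)))
class_ids = preprocess_logits_for_metrics(demo_output, None)
print(class_ids)

# Assumed wiring (not run here):
# trainer = Trainer(..., compute_metrics=compute_metrics,
#                   preprocess_logits_for_metrics=preprocess_logits_for_metrics)
```

With this in place, eval_pred.predictions would already contain class ids, so compute_metrics would pass them to f1_score directly and skip the np.argmax call.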

Posted by huangapple on May 28, 2023 at 09:27:59
Source: https://go.coder-hub.com/76349622.html