Training a BARTForSequenceClassification returns data with non-uniform dimensions

Question

I am trying to fine-tune a BART-base model on a dataset that I have. The dataset has the columns "id", "text", "label" and "dataset_id". The "text" column contains plain text and is what I want to use as the model input; "label" is a value of either 0 or 1.
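
For illustration, rows with this layout could look like the following (these particular values are made up for this example, not taken from the actual data):

    import pandas as pd

    # Hypothetical example rows: the real train.csv files follow the same
    # column layout, but the values below are invented for illustration.
    example = pd.DataFrame({
        "id": [0, 1],
        "text": ["an example sentence", "another example sentence"],
        "label": [0, 1],        # binary target: 0 or 1
        "dataset_id": [3, 3],   # identifier of the source dataset
    })
    print(example)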

I've already written the code for training, using transformers==4.28.0.

This is the code for the dataset class:

    import torch
    from torch.utils.data import Dataset

    class TextDataset(Dataset):
        def __init__(self, encodings):
            self.encodings = encodings

        def __getitem__(self, idx):
            return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

        def __len__(self):
            return len(self.encodings['input_ids'])

This is the code for loading and encoding of the data:

    import os
    import pandas as pd

    def load_data(directory):
        files = os.listdir(directory)
        dfs = []
        for file in files:
            if file.endswith('train.csv'):
                df = pd.read_csv(os.path.join(directory, file))
                dfs.append(df)
        return pd.concat(dfs, ignore_index=True)

    print(len(load_data("splitted_data/gender-bias")))

    def encode_data(tokenizer, text, labels):
        inputs = tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
        inputs['labels'] = torch.tensor(labels)
        return inputs

This is the code for the evaluation metrics. I use the f1_score function from scikit-learn.

    import numpy as np
    from sklearn.metrics import f1_score

    def compute_metrics(eval_pred):
        logits = eval_pred.predictions
        labels = eval_pred.label_ids
        predictions = np.argmax(logits, axis=-1)
        return {"f1": f1_score(labels, predictions)}

This is the training function:

    from transformers import TrainingArguments, Trainer

    def train_model(train_dataset, eval_dataset):
        # Define the training arguments
        training_args = TrainingArguments(
            output_dir='./baseline/results',   # output directory
            num_train_epochs=5,                # total number of training epochs
            per_device_train_batch_size=32,    # batch size per device during training
            per_device_eval_batch_size=64,     # batch size for evaluation
            warmup_steps=500,                  # number of warmup steps for the learning rate scheduler
            weight_decay=0.01,                 # strength of weight decay
            evaluation_strategy="steps",       # evaluate every `eval_steps` training steps
            eval_steps=50,                     # number of training steps between evaluations
            load_best_model_at_end=True,       # load the best model when finished training (defaults to `False`)
            save_strategy='steps',             # save a checkpoint every `save_steps` training steps
            save_steps=500,                    # number of training steps between saves
            metric_for_best_model='f1',        # metric to use to compare models
            greater_is_better=True             # whether a larger metric value is better
        )

        # Define the trainer
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics
        )

        # Train the model
        trainer.train()
        return trainer

This is how I defined the model and the rest of the setup:

    from transformers import BartForSequenceClassification, BartTokenizer

    model = BartForSequenceClassification.from_pretrained('facebook/bart-base', num_labels=2)
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

    train_df = load_data("splitted_data/gender-bias")
    train_encodings = encode_data(tokenizer, train_df['text'].tolist(), train_df['label'].tolist())

    # For simplicity, let's split our training data to create a pseudo-evaluation set
    train_size = int(0.9 * len(train_encodings['input_ids']))  # 90% for training
    train_dataset = {k: v[:train_size] for k, v in train_encodings.items()}
    print(train_dataset)
    print(len(train_dataset))
    eval_dataset = {k: v[train_size:] for k, v in train_encodings.items()}  # 10% for evaluation

    # Convert the dictionary data to PyTorch Dataset
    train_dataset = TextDataset(train_dataset)
    eval_dataset = TextDataset(eval_dataset)

    trainer = train_model(train_dataset, eval_dataset)

The training itself looks just fine. However, when evaluation runs during training, an error is raised from my compute_metrics function, which receives the output of the model as its parameter. The model should be a binary classification model, returning the probability of each label in its output, I believe.

    np.argmax(np.array(logits), axis=-1)
    ValueError: could not broadcast input array from shape (3208,2) into shape (3208,)

I tried printing the type of the logits, and it turns out that type(logits) returns tuple. Considering that this might be because the evaluation dataset is split into batches, so that the returned tuple is a number of separate numpy arrays, I have also tried concatenating the tuple:

    def compute_metrics(eval_pred):
        logits = eval_pred.predictions
        labels = eval_pred.label_ids
        logits = np.concatenate(logits, axis=0)
        predictions = np.argmax(logits, axis=-1)
        return {"f1": f1_score(labels, predictions)}

But this raised a new error:

    packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
    ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)
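
For reference, one way to inspect what eval_pred.predictions actually contains is to print the shape of every element before computing anything. This is only a small debugging sketch (the compute_metrics_debug name is made up for this example and is not part of my training code):

    import numpy as np

    def compute_metrics_debug(eval_pred):
        preds = eval_pred.predictions
        print(type(preds))
        # predictions may be a single array or a tuple of arrays, depending on
        # what the model returns in addition to the classification logits
        if isinstance(preds, tuple):
            for i, p in enumerate(preds):
                print(i, np.asarray(p).shape)
        else:
            print(np.asarray(preds).shape)
        return {}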

How can I solve this issue?

Answer 1

Score: 0

I've found the answer. The elements of the returned tuple have shapes (3208, 2) and (3208, 128, 768), so it is returning two things at the same time. The first element of the tuple holds the binary logits for the predictions, while the second element seems to be the output of one of the layers of my BART model. Hence, the code works well when I write it as below:

    def compute_metrics(eval_pred):
        logits = eval_pred.predictions[0]
        labels = eval_pred.label_ids
        predictions = np.argmax(logits, axis=-1)
        return {"f1": f1_score(labels, predictions)}
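
As a side note that goes beyond the original answer: if you would rather keep compute_metrics unchanged, the Trainer also accepts a preprocess_logits_for_metrics callback, which can drop the extra (3208, 128, 768) array before the predictions are accumulated. A minimal sketch under the same transformers==4.28.0 setup as in the question (the keep_only_logits name is made up for this example):

    def keep_only_logits(logits, labels):
        # The model output here is a tuple (classification logits, extra hidden-state output);
        # keep only the first element so that eval_pred.predictions is a single array.
        if isinstance(logits, tuple):
            return logits[0]
        return logits

    # Same arguments as in train_model above, plus the extra callback.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        preprocess_logits_for_metrics=keep_only_logits,
    )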
