Training a BARTForSequenceClassification returns data with non-uniform dimensions

Question

I am trying to fine-tune a BART-base model on a dataset that I have. The dataset has the columns "id", "text", "label" and "dataset_id". The "text" column contains plain text and is what I want to use as the model input; "label" is a value of either 0 or 1.
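
For illustration, rows with this layout could look like the following (these particular values are made up for this example, not taken from the actual data):

    import pandas as pd

    # Hypothetical example rows: the real train.csv files follow the same
    # column layout, but the values below are invented for illustration.
    example = pd.DataFrame({
        "id": [0, 1],
        "text": ["an example sentence", "another example sentence"],
        "label": [0, 1],        # binary target: 0 or 1
        "dataset_id": [3, 3],   # identifier of the source dataset
    })
    print(example)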

I've already written the code for training, using transformers==4.28.0.

This is the code for the dataset class:

    import torch
    from torch.utils.data import Dataset

    class TextDataset(Dataset):
        def __init__(self, encodings):
            self.encodings = encodings

        def __getitem__(self, idx):
            return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

        def __len__(self):
            return len(self.encodings['input_ids'])

This is the code for loading and encoding of the data:

    import os
    import pandas as pd

    def load_data(directory):
        files = os.listdir(directory)
        dfs = []
        for file in files:
            if file.endswith('train.csv'):
                df = pd.read_csv(os.path.join(directory, file))
                dfs.append(df)
        return pd.concat(dfs, ignore_index=True)

    print(len(load_data("splitted_data/gender-bias")))

    def encode_data(tokenizer, text, labels):
        inputs = tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
        inputs['labels'] = torch.tensor(labels)
        return inputs

This is the code for the evaluation metrics. I use the f1_score function from scikit-learn.

    import numpy as np
    from sklearn.metrics import f1_score

    def compute_metrics(eval_pred):
        logits = eval_pred.predictions
        labels = eval_pred.label_ids
        predictions = np.argmax(logits, axis=-1)
        return {"f1": f1_score(labels, predictions)}

This is the training function:

    from transformers import TrainingArguments, Trainer

    def train_model(train_dataset, eval_dataset):
        # Define the training arguments
        training_args = TrainingArguments(
            output_dir='./baseline/results',   # output directory
            num_train_epochs=5,                # total number of training epochs
            per_device_train_batch_size=32,    # batch size per device during training
            per_device_eval_batch_size=64,     # batch size for evaluation
            warmup_steps=500,                  # number of warmup steps for the learning rate scheduler
            weight_decay=0.01,                 # strength of weight decay
            evaluation_strategy="steps",       # evaluate every `eval_steps` training steps
            eval_steps=50,                     # number of training steps between evaluations
            load_best_model_at_end=True,       # load the best model when finished training (defaults to `False`)
            save_strategy='steps',             # save a checkpoint every `save_steps` training steps
            save_steps=500,                    # number of training steps between saves
            metric_for_best_model='f1',        # metric to use to compare models
            greater_is_better=True             # whether a larger metric value is better
        )

        # Define the trainer
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics
        )

        # Train the model
        trainer.train()
        return trainer

This is how I defined the model and the rest of the setup:

    from transformers import BartForSequenceClassification, BartTokenizer

    model = BartForSequenceClassification.from_pretrained('facebook/bart-base', num_labels=2)
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

    train_df = load_data("splitted_data/gender-bias")
    train_encodings = encode_data(tokenizer, train_df['text'].tolist(), train_df['label'].tolist())

    # For simplicity, let's split our training data to create a pseudo-evaluation set
    train_size = int(0.9 * len(train_encodings['input_ids']))  # 90% for training
    train_dataset = {k: v[:train_size] for k, v in train_encodings.items()}
    print(train_dataset)
    print(len(train_dataset))
    eval_dataset = {k: v[train_size:] for k, v in train_encodings.items()}  # 10% for evaluation

    # Convert the dictionary data to PyTorch Dataset
    train_dataset = TextDataset(train_dataset)
    eval_dataset = TextDataset(eval_dataset)

    trainer = train_model(train_dataset, eval_dataset)

The training itself looks just fine. However, when evaluation runs during training, an error is raised from my compute_metrics function, which receives the output of the model as its parameter. The model should be a binary classification model, returning the probability of each label in its output, I believe.

    np.argmax(np.array(logits), axis=-1)
    ValueError: could not broadcast input array from shape (3208,2) into shape (3208,)

I tried printing the type of the logits, and it turns out that type(logits) returns tuple. Considering that this might be because the evaluation dataset is split into batches, so that the returned tuple is a number of separate numpy arrays, I have also tried concatenating the tuple:

    def compute_metrics(eval_pred):
        logits = eval_pred.predictions
        labels = eval_pred.label_ids
        logits = np.concatenate(logits, axis=0)
        predictions = np.argmax(logits, axis=-1)
        return {"f1": f1_score(labels, predictions)}

But this raised a new error:

    packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
    ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)
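
For reference, one way to inspect what eval_pred.predictions actually contains is to print the shape of every element before computing anything. This is only a small debugging sketch (the compute_metrics_debug name is made up for this example and is not part of my training code):

    import numpy as np

    def compute_metrics_debug(eval_pred):
        preds = eval_pred.predictions
        print(type(preds))
        # predictions may be a single array or a tuple of arrays, depending on
        # what the model returns in addition to the classification logits
        if isinstance(preds, tuple):
            for i, p in enumerate(preds):
                print(i, np.asarray(p).shape)
        else:
            print(np.asarray(preds).shape)
        return {}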

How can I solve this issue?

Answer 1

Score: 0

I've found the answer. The elements of the returned tuple have shapes (3208, 2) and (3208, 128, 768), so it is returning two things at the same time. The first element of the tuple holds the binary logits for the predictions, while the second element seems to be the output of one of the layers of my BART model. Hence, the code works well when I write it as below:

    def compute_metrics(eval_pred):
        logits = eval_pred.predictions[0]
        labels = eval_pred.label_ids
        predictions = np.argmax(logits, axis=-1)
        return {"f1": f1_score(labels, predictions)}
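
As a side note that goes beyond the original answer: if you would rather keep compute_metrics unchanged, the Trainer also accepts a preprocess_logits_for_metrics callback, which can drop the extra (3208, 128, 768) array before the predictions are accumulated. A minimal sketch under the same transformers==4.28.0 setup as in the question (the keep_only_logits name is made up for this example):

    def keep_only_logits(logits, labels):
        # The model output here is a tuple (classification logits, extra hidden-state output);
        # keep only the first element so that eval_pred.predictions is a single array.
        if isinstance(logits, tuple):
            return logits[0]
        return logits

    # Same arguments as in train_model above, plus the extra callback.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        preprocess_logits_for_metrics=keep_only_logits,
    )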
