Error: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2])) while training a binary classifier with deepset/gbert-base
Question
I am aware of most of the solutions previously discussed here for this problem, but I still had no luck with them.
I'm trying to implement a binary classifier. I'm using a custom dataset with one text column containing German text; the label column has two classes, either 0 or 1.
I'm using the deepset/gbert-base model with the number of labels set to 2.
I have followed the official Hugging Face tutorial: https://huggingface.co/learn/nlp-course/chapter3/4
Everything matches the tutorial until this step:
outputs = model(**batch)
I have tried the following workarounds suggested in this forum and other coding forums:
- I checked the PyTorch version (online forums suggested updating any version below 2); I'm using 2.0.0+cu118.
- The labels are of float type and do not contain any null values (online forums suggested checking that the label dtype is float, since the model supposedly expects that format).
- I tried changing the label shape from [0] and [1] to [1, 0] for class 0 and [0, 1] for class 1, because the error says the input from the model to the loss function has size [8, 2] while the targets (the labels here) have size [8]. Changing the shape this way did not solve the problem either. (A minimal reproduction of this shape mismatch is sketched just after this list.)
- I also tried the Trainer API, following the official Hugging Face tutorial https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt, and customized the loss function from binary_cross_entropy_with_logits to nn.CrossEntropyLoss(), just to see whether the code would run with a different loss, but I ended up with the same error.
- I also tried different models apart from the one mentioned above:
nlptown/bert-base-multilingual-uncased-sentiment
papluca/xlm-roberta-base-language-detection
oliverguhr/german-sentiment-bert
But I'm still getting the same error.
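For reference, the shape requirement behind the error can be reproduced in plain PyTorch, independent of the model. This is a minimal sketch (tensor sizes chosen to match the traceback below):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 2)  # model output: [batch_size, num_labels]
labels = torch.zeros(8)     # float labels of shape [batch_size]

# binary_cross_entropy_with_logits requires target.size() == input.size(),
# so [8] vs. [8, 2] raises the same ValueError as in the traceback below.
try:
    F.binary_cross_entropy_with_logits(logits, labels)
except ValueError as e:
    print(e)

# cross_entropy instead expects integer class indices of shape [batch_size]:
print(F.cross_entropy(logits, labels.long()))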
Code:
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")

def tokenize_function(examples):
    return tokenizer(examples["text1"], truncation=True)

tokenized_datasets = final_dataset_dict.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)

# Drop the raw text column so the collator only receives tensor-compatible fields.
tokenized_datasets = tokenized_datasets.remove_columns(["text1"])
tokenized_datasets["train"].column_names
tokenized_datasets.set_format("torch")

from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
# The DatasetDict shown below has train/test/validation splits; "unsupervised"
# (as written in the original post) does not exist here, so "validation" is used.
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)

# Grab a single batch to inspect its shapes.
for batch in train_dataloader:
    break
print({k: v.shape for k, v in batch.items()})

from transformers import AutoModelForSequenceClassification

checkpoint = "deepset/gbert-base"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
outputs = model(**batch)  # <-- fails here
print(outputs.loss, outputs.logits.shape)
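One diagnostic worth running here (a sketch, not part of the training code above): in recent transformers versions, when config.problem_type is unset, the sequence-classification head picks the loss from the label dtype. Integer labels with num_labels > 1 select single-label classification (CrossEntropyLoss, targets of shape [batch]), while float labels select multi-label classification (binary_cross_entropy_with_logits, targets of shape [batch, num_labels]).

print(batch["labels"].dtype)      # float32 here, which routes to the BCE path
print(model.config.problem_type)  # None until the model infers it from the labels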
After tokenization, my data looks like this:
> DatasetDict({
    train: Dataset({
        features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2512
    })
    test: Dataset({
        features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1255
    })
    validation: Dataset({
        features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1255
    })
})
The batch items in the train_dataloader look like this:
> {'labels': torch.Size([8]), 'input_ids': torch.Size([8, 69]), 'token_type_ids': torch.Size([8, 69]), 'attention_mask': torch.Size([8, 69])}
The detailed error is as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-b84c8f6552ab> in <cell line: 1>()
----> 1 outputs = model(**batch)
2 #print(outputs.shape)
3 print(outputs.loss, outputs.logits.shape)
4 frames
/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
3161
3162 if not (target.size() == input.size()):
-> 3163 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
3164
3165 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2]))
Any lead on this problem would be very much appreciated.
I expect output like the tutorial's: the loss tensor and the logits shape.
Answer 1
Score: 1
Changing the label datatype to integer solved the problem.
df['labels'] = df['labels'].astype(int)
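This works because the model infers the loss from the label dtype (see the note under the code in the question): with integer labels and num_labels=2 it selects CrossEntropyLoss, which accepts targets of shape [batch]. A minimal sketch of the same cast applied directly to the tokenized DatasetDict from the question, assuming the "labels" column shown in the splits above (cast_column and Value are part of the datasets library):

from datasets import Value

# Cast the float labels to int64 so the model routes to CrossEntropyLoss.
tokenized_datasets = tokenized_datasets.cast_column("labels", Value("int64"))
tokenized_datasets.set_format("torch")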