Error : Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2])) while training a binary classifier deepset/gbert-base

Question

I am aware of most of the solutions discussed here previously for this problem, but I still had no luck with them.

I’m trying to implement a binary classifier. I’m using a custom dataset with one text column containing German text and a label column with two classes, either 0 or 1.

I’m using the deepset/gbert-base model with the number of labels set to 2.
I have followed the official Hugging Face tutorial https://huggingface.co/learn/nlp-course/chapter3/4 and everything matches until this step:

outputs = model(**batch)

I have tried the following workarounds suggested in this forum and other coding forums:

  1. I checked the PyTorch version (online forums suggest updating PyTorch if the version is below 2); I’m using 2.0.0+cu118.

  2. The labels are of float type and do not contain any null values (online forums suggest checking that the label datatype is float, since the model expects it in that format).

  3. I also tried changing the label shape from [0] and [1] to [1,0] for class 0 and [0,1] for class 1, because the error says the input from the model to the loss function has size [8, 2] while the targets (the labels here) have size [8]. But changing the shape did not solve the problem either; a sketch of this attempt appears after this list.

  4. I also tried going through the Trainer API, following the official Hugging Face tutorial https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt, and customized the loss function from binary_cross_entropy_with_logits to nn.CrossEntropyLoss(), just to see whether the code would run, but I ended up with the same error. A sketch of such a compute_loss override also appears after this list.

  5. I also tried different models apart from the one mentioned above:

    • nlptown/bert-base-multilingual-uncased-sentiment
    • papluca/xlm-roberta-base-language-detection
    • oliverguhr/german-sentiment-bert

But I still got the same error with all of them.
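
For reference, here is a minimal sketch of the one-hot attempt from item 3 (the variable labels is hypothetical and stands in for a 1-D float label tensor):

import torch

# Hypothetical illustration of attempt 3: expand 1-D labels into one-hot rows
# so that the target shape matches the [batch, 2] logits.
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])  # float labels, shape [4]
one_hot = torch.nn.functional.one_hot(labels.long(), num_classes=2).float()
print(one_hot.shape)  # torch.Size([4, 2])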
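
And a sketch of the compute_loss override from item 4, following the Trainer customization pattern from the tutorial (CrossEntropyTrainer is a made-up name; note that nn.CrossEntropyLoss expects integer class indices, hence the cast):

from transformers import Trainer
import torch.nn as nn

class CrossEntropyTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # nn.CrossEntropyLoss expects logits of shape [batch, num_classes]
        # and integer class indices of shape [batch]
        loss = nn.CrossEntropyLoss()(outputs.logits.view(-1, 2), labels.long().view(-1))
        return (loss, outputs) if return_outputs else loss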

Code:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")

def tokenize_function(examples):
    # only truncate here; padding is applied per batch by the collator
    return tokenizer(examples["text1"], truncation=True)

tokenized_datasets = final_dataset_dict.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)
tokenized_datasets = tokenized_datasets.remove_columns(["text1"])
tokenized_datasets["train"].column_names  # inspect the remaining columns
tokenized_datasets.set_format("torch")

from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
# evaluate on the "validation" split of the DatasetDict (train/test/validation)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)

# grab a single batch to inspect the tensor shapes
for batch in train_dataloader:
    break
print({k: v.shape for k, v in batch.items()})

from transformers import AutoModelForSequenceClassification

checkpoint = "deepset/gbert-base"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

After tokenization my data looks like this:

> DatasetDict({
      train: Dataset({
          features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
          num_rows: 2512
      })
      test: Dataset({
          features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
          num_rows: 1255
      })
      validation: Dataset({
          features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
          num_rows: 1255
      })
  })
The batch items in the train_dataloader look like this:
> {'labels': torch.Size([8]), 'input_ids': torch.Size([8, 69]), 'token_type_ids': torch.Size([8, 69]), 'attention_mask': torch.Size([8, 69])}
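
The shapes alone don’t show the label dtype; checking it on the same batch (the labels are floats here, per point 2 above):

print(batch["labels"].dtype)  # torch.float32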
The detailed error is as follows:

 ---------------------------------------------------------------------------
 ValueError                                Traceback (most recent call last)
 <ipython-input-36-b84c8f6552ab> in <cell line: 1>()
 ----> 1 outputs = model(**batch)
       2 #print(outputs.shape)
       3 print(outputs.loss, outputs.logits.shape)
 
 4 frames
 /usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
    3161 
    3162     if not (target.size() == input.size()):
 -> 3163         raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
    3164 
    3165     return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
 
 ValueError: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2]))

Any lead on this problem would be very much appreciated.

I expect the output to be the loss tensor and the logits shape (torch.Size([8, 2])), as shown in the tutorial.

Answer 1

Score: 1

Changing the label datatype to integer solved the problem. With num_labels=2 and float-typed labels, AutoModelForSequenceClassification infers problem_type="multi_label_classification" and applies binary_cross_entropy_with_logits, which expects targets of shape [batch, num_labels]; with integer labels it infers single-label classification and uses CrossEntropyLoss, which takes class indices of shape [batch], matching the [8] labels here.

df['labels'] = df['labels'].astype(int)  # cast the float labels to integer class indices
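
If the labels live in a datasets DatasetDict (as in the question’s code) rather than a pandas DataFrame, a sketch of the equivalent cast would be:

from datasets import Value

# cast the float label column to int64 so the model infers
# single-label classification and uses CrossEntropyLoss
tokenized_datasets = tokenized_datasets.cast_column("labels", Value("int64"))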
