Do Transformers always use only a single Linear layer for the classification head?


Question

For example, in the definition of the class BertForSequenceClassification, only a single Linear layer is used for the classifier. If only one Linear layer is used, doesn't it just perform a linear projection of pooled_out? Will such a classifier produce good predictions? Why not use multiple Linear layers? Does transformers offer any option for using multiple Linear layers as the classification head?

I looked at several other classes. They all use a single Linear layer as the classification head.
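
To make the question concrete, the head in question amounts to roughly the following (a simplified sketch with illustrative hidden_size and num_labels, not the exact Hugging Face source):

import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2              # illustrative values (bert-base hidden size)
pooled_output = torch.randn(1, hidden_size)   # stand-in for the encoder's pooled_out
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(pooled_output)            # one linear projection -> shape (1, num_labels)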


Answer 1

Score: 2

To add on to the other answer:

  1. The embedding layers (self.bert = BertModel(config) in your case) transform the raw data (a sentence, an image, etc.) into a semantic-aware vector space. This is where all the architectural design comes in (attention, CNNs, LSTMs, etc.), and those designs are far better suited to their chosen tasks than a simple fully connected (FC) layer. So if you have the capacity to add multiple FC layers, why not add another attention block instead? On the other hand, the embeddings from a decent model should have large inter-class distances and small intra-class variance, so they can easily be mapped to their corresponding classes in a linear fashion, and a single FC layer is more than enough.

  2. It is also ideal to make the pretrained portion as large as possible, so that, as a downstream user, you only have to train/fine-tune a tiny part of the model (e.g., the FC classification layer); a minimal sketch of this pattern follows the list.
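
A minimal sketch of point 2, assuming bert-base-uncased and four labels purely for illustration: freeze the pretrained encoder and let the optimizer update only the small classification head.

import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

# Freeze the pretrained portion of the model...
for param in model.bert.parameters():
    param.requires_grad = False

# ...so that only the classification head's parameters receive updates.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)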


Answer 2

Score: 1

Since there are infinitely many downstream scenarios and no one-size-fits-all head for a given task, the transformers library only adds a head that is sufficient to perform the task. If the performance does not meet your expectations, you need to experiment with your own data and different architectures.

The library is fully compatible with PyTorch, so you can use each model's base class as a module in your own neural network:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class MyOwnBert(nn.Module):
    def __init__(self, model_id, num_labels):
        super(MyOwnBert, self).__init__()
        # Pretrained BERT encoder used as the embedding backbone
        self.bert = BertModel.from_pretrained(model_id)

        # Custom multi-layer head in place of the default single Linear classifier
        self.my_fancy_outputLayer = nn.Sequential(
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(self.bert.config.hidden_size, self.bert.config.hidden_size),
            nn.GELU(),
            nn.Linear(self.bert.config.hidden_size, num_labels),
        )
        
    def forward(
        self,
        input_ids = None,
        attention_mask = None,
        token_type_ids = None,
        position_ids = None,
        labels = None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
        )

        # pooler_output: the [CLS] representation passed through BERT's pooling layer
        pooled_output = outputs.pooler_output

        logits = self.my_fancy_outputLayer(pooled_output)

        return logits

# Quick check: tokenize a sentence and run a forward pass through the custom model
model_id = "bert-base-uncased"
t = BertTokenizer.from_pretrained(model_id)
m = MyOwnBert(model_id, 4)
m(**t("this is just an example", return_tensors="pt"))

Output:

tensor([[ 0.0291, -0.0370,  0.0255,  0.0234]], grad_fn=<AddmmBackward0>)
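
Note that while the BERT backbone loads pretrained weights, the custom head above is randomly initialized, so these logits are meaningless until the model is fine-tuned on labeled data.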
