Does transformers always use only a single Linear layer for the classification head?
Question
For example, in the class BertForSequenceClassification definition, only one Linear layer is used for the classifier. If just one Linear layer is used, doesn’t it just do linear projection for pooled_out? Will such a classifier produce good predictions? Why not use multiple Linear layers? Does transformers offer any option for using multiple Linear layers as the classification head?
I looked at several other classes. They all use a single Linear layer as the classification head.
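For context, the head in question really is minimal: inside BertForSequenceClassification, the pooled [CLS] output is passed through dropout and a single Linear layer. Below is a paraphrased sketch of that head (not the verbatim library source; the hidden_size and num_labels values are purely illustrative):

import torch.nn as nn

hidden_size, num_labels = 768, 2            # illustrative values (bert-base, 2 classes)

dropout = nn.Dropout(0.1)                   # classifier dropout
classifier = nn.Linear(hidden_size, num_labels)

def classification_head(pooled_output):
    # pooled_output: (batch_size, hidden_size), i.e. the model's pooler output
    return classifier(dropout(pooled_output))   # logits: (batch_size, num_labels)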
Answer 1
Score: 2
To add onto the previous answer:

- Embedding layers (self.bert = BertModel(config) in your case) transform the original data (a sentence, an image, etc.) into some semantic-aware vector space. This is where all the architectural design comes in (e.g. attention, CNN, LSTM, etc.), and those components are far superior to a simple fully connected (FC) layer for their chosen tasks. So if you have the capacity to add multiple FC layers, why not just add another attention block instead? Conversely, the embeddings from a decent model should have large inter-class distance and small intra-class variance, so they can easily be projected onto their corresponding classes in a linear fashion, and a single FC layer is more than enough.
- It is also ideal to keep the pretrained portion as large as possible so that, as a downstream user, you only have to train/finetune a tiny part of the model, e.g. the FC classification layer (see the sketch below).
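To make the second point concrete, here is a minimal sketch (assuming the stock BertForSequenceClassification head from transformers and the bert-base-uncased checkpoint) of freezing the pretrained encoder so that only the single Linear classifier gets trained:

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

# Freeze every parameter of the pretrained BERT encoder (including the pooler)...
for param in model.bert.parameters():
    param.requires_grad = False

# ...so only the classification head remains trainable.
print([name for name, p in model.named_parameters() if p.requires_grad])
# ['classifier.weight', 'classifier.bias']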
Answer 2
Score: 1
Since there is an infinite number of downstream scenarios and no one-size-fits-all head for a task, the transformers library only adds a head that is sufficient to perform the task. If the performance doesn't meet your expectations, you need to experiment with your own data and different architectures.
The library is fully compatible with PyTorch, so you can use each model's base class as a module in your own neural network:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
class MyOwnBert(nn.Module):
    def __init__(self, model_id, num_labels):
        super(MyOwnBert, self).__init__()
        self.bert = BertModel.from_pretrained(model_id)
        self.my_fancy_outputLayer = nn.Sequential(
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(self.bert.config.hidden_size, self.bert.config.hidden_size),
            nn.GELU(),
            nn.Linear(self.bert.config.hidden_size, num_labels),
        )

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
        )
        pooled_output = outputs.pooler_output
        logits = self.my_fancy_outputLayer(pooled_output)
        return logits


model_id = "bert-base-uncased"
t = BertTokenizer.from_pretrained(model_id)
m = MyOwnBert(model_id, 4)
m(**t("this is just an example", return_tensors="pt"))
Output:
tensor([[ 0.0291, -0.0370, 0.0255, 0.0234]], grad_fn=<AddmmBackward0>)
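Since this custom forward only returns logits (the labels argument is accepted but unused), the loss is computed outside the model. A minimal training-step sketch, reusing t and m from above and assuming ordinary cross-entropy classification with an illustrative integer label:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

batch = t("this is just an example", return_tensors="pt")
labels = torch.tensor([2])        # hypothetical gold label for this single example

logits = m(**batch)               # shape: (batch_size, num_labels)
loss = criterion(logits, labels)
loss.backward()                   # gradients flow into both the encoder and the custom head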