How to improve the results of this fine-tuned BERT model's neural network?


# Question

I'm working on an NLP classification problem where I'm trying to classify training courses into 99 categories. I managed to build a few models, including [the Bayesian classifier][1], but its accuracy was only 55% (very poor).

Given those results, I tried to fine-tune the camemBERT model (my data is in French) to improve the results, but I had never used these methods before, so I tried to follow this [example][2] and adapt it to my code.

In the example above there are 2 labels, while I have 99.

I left certain parts intact:

epochs = 5
MAX_LEN = 128
batch_size = 16
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = CamembertTokenizer.from_pretrained('camembert-base', do_lower_case=True)

I kept the same variable names: `text` holds the feature column and `labels` holds the labels.

text = training['Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_x']
labels = training['Domaine sou domaine ']

I tokenized and padded the sequences using the same values as in the example, because I didn't know which values were right for my data.

# Use the tokenizer to convert each sentence into token ids
input_ids = [tokenizer.encode(sent, add_special_tokens=True, max_length=MAX_LEN) for sent in text]

# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

# Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)


I noticed that the labels in the example are numeric, so I converted my labels to numeric ids with this code:

label_map = {label: i for i, label in enumerate(set(labels))}
numeric_labels = [label_map[label] for label in labels]
labels = numeric_labels

I then started building the model, beginning with the tensors.

# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    input_ids, labels, random_state=42, test_size=0.1
)

train_masks, validation_masks = train_test_split(
    attention_masks, random_state=42, test_size=0.1
)

# Convert the data to torch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

# Create the data loaders
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

# Define the model architecture
model = CamembertForSequenceClassification.from_pretrained('camembert-base', num_labels=99)

# Move the model to the appropriate device
model.to(device)

the output is:

CamembertForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=99, bias=True)
  )
)

Then I proceeded to set up the optimizer and the training loop.

param_optimizer = list(model.named_parameters())
# Note: the expression inside 'params' was lost by the original post's formatting;
# the usual grouping from the example is simply all named parameters.
optimizer_grouped_parameters = [{'params': [p for n, p in param_optimizer],
                                 'weight_decay_rate': 0.01}]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5, eps=10e-8)

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

train_loss_set = []

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
    # Tracking variables for training
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    # Train the model
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Add batch to device CPU or GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        # Get loss value
        loss = outputs[0]
        # Add it to the train loss list
        train_loss_set.append(loss.item())
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()
        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
    print("Train loss: {}".format(tr_loss / nb_tr_steps))

    # Tracking variables for validation
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    # Validation of the model
    model.eval()
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Add batch to device CPU or GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
            loss, logits = outputs[:2]
        # Move logits and labels to CPU if GPU is used
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1
    print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))

And the code ran, but the accuracy was only around 30%, which is far worse than the Bayesian classifier, even though that one uses a very simple algorithm and straightforward calculations. This made me realize that I must have fine-tuned the model incorrectly, but I don't understand fine-tuning well enough to know where I went wrong.
[1]: https://stackoverflow.com/questions/76490589/valueerror-when-using-model-fit-even-with-the-vectors-being-aligned
[2]: https://www.kaggle.com/code/houssemayed/camembert-for-french-tweets-classification/comments
# Answer 1
**Score**: 2
I'm currently working on a sequence classification task, and something I noticed during my training will probably help in your case.

**Truncation:** If a sentence is longer than 128 tokens (MAX_LEN) and you truncate it, then the model can essentially only predict on a partial data point (a partial string, since the string is cut off whenever its length is >128 tokens).

- For my use case I was using a RoBERTa model, which has a maximum length of 512 tokens, and I cannot go beyond that for a given data point. So I had to window each string into multiple sub-sequences of 512 tokens and pad the last sub-sequence (if it was shorter than 512 tokens), since a data point will not always be an exact multiple of 512 tokens. Then I aggregated the predictions over the sub-sequences.

While that was a trick that seemed realistic to me, what you can actually do is the following:

- I'm not aware of the exact BERT model you are using, but you could try increasing the max length to the maximum allowed (I'm not sure whether it is 128 itself) to accommodate most of your data points without any truncation.
- How to do this: look at the distribution of token counts per data point and see whether the median / mean / nth percentile / max of that distribution can serve as the max_length parameter, then train the model with it; a sketch is given below.
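A minimal sketch of that last suggestion, reusing the `text` Series and the `tokenizer` already defined in the question (the 95th-percentile choice below is only an illustration):

```python
import numpy as np

# Token count per data point, including special tokens
token_lengths = [len(tokenizer.encode(sent, add_special_tokens=True)) for sent in text]

print("mean:", np.mean(token_lengths))
print("median:", np.median(token_lengths))
print("95th percentile:", np.percentile(token_lengths, 95))
print("max:", np.max(token_lengths))

# e.g. pick the 95th percentile, capped at the model's 512-token limit,
# so that almost no training example gets truncated
MAX_LEN = int(min(np.percentile(token_lengths, 95), 512))
```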
# Answer 2
**Score**: 2

You should use [camembert](https://huggingface.co/docs/transformers/model_doc/camembert) or any other language model just to extract text features. After that, you can use a classifier that takes those feature vectors as inputs.

Training a language model can require a lot of data and compute; if you don't have those, using a pretrained network as a feature extractor works better.

```python
from transformers import AutoTokenizer, CamembertModel
from sklearn.neighbors import KNeighborsClassifier
import torch

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

# Store features of all inputs
input_features = []
input_labels = []

with torch.no_grad():
    for input_text, label in data.items():  # Or however your data is stored
        inputs = tokenizer(input_text, return_tensors="pt")
        outputs = model(**inputs)
        last_hidden_states = outputs.last_hidden_state
        # You might have to convert the last_hidden_states tensor to a numpy array
        # I am using [0, 0] assuming a batch of 1 and a [CLS]-like token position, similar to BERT
        input_features.append(last_hidden_states[0, 0])
        input_labels.append(label)

# Use any classifier which might work well for a large number of classes
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(input_features, input_labels)
```
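A possible usage example (not part of the original answer): once `neigh` is fitted, a new course title can be classified the same way. As the comment above suggests, the feature tensors may need to be converted to NumPy arrays, both here and before `fit`; the example string is made up:

```python
# Hypothetical new input; any French course title would do
new_text = "Initiation à la comptabilité générale"

with torch.no_grad():
    inputs = tokenizer(new_text, return_tensors="pt")
    feature = model(**inputs).last_hidden_state[0, 0].numpy()

predicted_label = neigh.predict([feature])
print(predicted_label)
```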

Edit:

Explanation: CamembertForSequenceClassification (or any deep-learning model that accomplishes a task like classification) can be seen as having 2 parts:

  1. A base model that does feature extraction, i.e. maps inputs (texts) to a latent space (a high-dimensional space). This mapping is just a representation of inputs such as text in a different format, one that describes the "qualities" of each input sample.

  2. A task head that performs the required task, like classification, using this new format of the data. It basically makes the decision that if a data point lies at coordinate X in the latent space, it is likely in class y for a classification task, or in something else for some other task.

In the case of CamembertForSequenceClassification, the feature extractor is CamembertModel and the classifier head is CamembertClassificationHead (a linear, dropout, then linear layer); refer to the Transformers source code for the exact definition.

As you can see, the classification head is just 2 layers, which can be trained easily, and you can make use of the pretrained nature of the base model. Since the base model is also available separately, you can use a classification method other than the 2 linear layers, such as KNN, which might work better for a large number of classes with few samples per class.
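As an illustration of that two-part view (this sketch is not from the original answer): you can also keep CamembertForSequenceClassification, freeze the pretrained base, and train only the small classification head, which needs far less data and compute than full fine-tuning:

```python
import torch
from transformers import CamembertForSequenceClassification

model = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=99)

# Freeze the feature extractor (the `roberta` base shown in the question's printout)
for param in model.roberta.parameters():
    param.requires_grad = False

# Only the classification head (dense -> dropout -> out_proj) remains trainable
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```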

# Answer 3

**Score**: 2

The OP mentions in the comments that some classes have a lot more samples than others.

I suggested using SMOTE (Synthetic Minority Oversampling Technique).

That, or class weighting, to help the model pay more 'attention' to the underrepresented classes during training; a minimal sketch of that idea is given below.
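A minimal sketch of class weighting, assuming the numeric `labels` list and the `device` from the question; the loss is then computed from the model's logits instead of the loss the model returns:

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=np.array(labels))
class_weights = torch.tensor(weights, dtype=torch.float).to(device)

loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

# Inside the training loop, replace the model-computed loss with:
#   outputs = model(b_input_ids, attention_mask=b_input_mask)
#   loss = loss_fn(outputs.logits, b_labels)
```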

However, the OP adds:

> I do have 40k observations for 99 categories, but some categories have only 2 observations while others have hundreds, it's very imbalanced. I'm going to try and use SMOTE or filter out certain categories and test the model.
>
> Can SMOTE be used for string data or is it better to look for some text augmentation methods?

SMOTE is an algorithm originally designed for continuous data, and using it with categorical or text data can be a bit tricky. There are adaptations of SMOTE for categorical data (like SMOTE-NC), but even these might not be perfect for text data.
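A much cruder but text-friendly alternative to SMOTE is plain random oversampling, i.e. duplicating rows of the rare classes. A minimal pandas sketch, assuming the `training` DataFrame and the label column from the question (the target count per class is arbitrary):

```python
import pandas as pd

label_col = "Domaine sou domaine "   # column name as used in the question (including the trailing space)
min_count = 50                       # arbitrary floor per class; tune for your data

parts = []
for _, group in training.groupby(label_col):
    if len(group) < min_count:
        # Duplicate rare-class rows by sampling with replacement
        group = group.sample(n=min_count, replace=True, random_state=42)
    parts.append(group)

# Shuffle the oversampled set; only ever oversample the training split, never the validation data
training_balanced = pd.concat(parts).sample(frac=1, random_state=42)
```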

For text data, there are several ways you can perform augmentation; some of them are listed with the nlpaug library below.

These techniques can help to create more examples of the under-represented classes in your text-classification task. One thing to note, though, is that while these techniques create more examples, the examples are not truly 'new' data, so the model might still struggle if the classes with few examples are fundamentally hard to classify.

Text augmentation tools such as the Python library nlpaug can help you perform these kinds of augmentation. It provides functionality for various augmentation methods, including substituting words using word embeddings, substituting characters, inserting new characters/words, swapping characters/words, and deleting characters/words.

Another option is to combine text augmentation with class weighting (as I mentioned before) to handle the imbalance problem. That could work better if the classes with very few examples in your dataset are hard to predict even with augmented data.

Remember to verify the quality of your augmented data, and ensure that the augmented data maintains the original meaning and context. The quality of your augmented data can significantly affect your model's performance.

Lastly, you could also look at more advanced over-sampling techniques for text data, such as the Contextualized Over-Sampling (COS) method, which leverages transformers (like BERT) to generate semantically similar sentences. See for instance "BERT for Sequence Labelling with Imbalanced Data" by Lorenzo Pozzi.
