SetFit training with a pandas DataFrame

Question

I would like to train a zero-shot classifier on an annotated sample dataset.

I am following some tutorials, but since they all use their own data and the same pretrained model, I am trying to confirm: is this the best approach?

Data example: 

import pandas as pd
from datasets import Dataset
    
# Sample feedback data; the full dataset has 8 samples per label
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]

# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)

# convert to Dataset format
df = Dataset.from_pandas(df)

With the data in the format above, this is the approach for fine-tuning the model:

from setfit import SetFitModel, SetFitTrainer

# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)

trainer.train()

The issue is that the process never finishes, even after more than 500 hours on a laptop, although the dataset has only about 88 records with 11 labels.

Answer 1

Score: 3

I tried to run the example you posted on Google Colab, and the training took 37 seconds.

Here's your code with a few tweaks to make it work on Colab:

### Install libraries
%%capture
!pip install datasets setfit

After installing the libraries, run the following code:

### Import dataset
import pandas as pd
from datasets import Dataset
# Sample feedback data; the full dataset has 8 samples per label
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]
# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)
# convert to Dataset format
df = Dataset.from_pandas(df)

### Run training
from setfit import SetFitModel, SetFitTrainer
# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)
trainer.train()
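
Once training finishes, the fitted model can be queried directly. A minimal sketch, assuming the usual SetFitModel call interface; the two feedback strings below are made-up examples, not taken from the original data:

# Hypothetical new feedback to classify (not from the original dataset)
preds = model(["The delivery was late.", "Support answered all my questions quickly."])
print(preds)  # predicted labels from the trained classification head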

And finally, you can save the trained model to Google Drive and then download it to your PC manually.

### Download model to drive
from google.colab import drive
drive.mount('/content/drive')
trainer.model._save_pretrained('/content/drive/path/to/target/folder')
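
To reuse the saved model later, it can be loaded back from that folder; from_pretrained also accepts a local directory path (the path below is the same placeholder as above):

from setfit import SetFitModel

# Load the model that was saved to Drive above
model = SetFitModel.from_pretrained('/content/drive/path/to/target/folder')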

If your main issue is the training time, this should fix it.
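
If you still need to train on a CPU-only laptop, one knob worth trying is to reduce how much contrastive training SetFit does. This is only a sketch, assuming the same SetFitTrainer API as above; num_iterations, num_epochs and batch_size control how many text pairs are generated and processed, and the values below are illustrative rather than tuned:

from setfit import SetFitModel, SetFitTrainer

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=df,
    eval_dataset=df,
    column_mapping={"text": "text", "label": "label"},
    num_iterations=5,  # fewer contrastive pairs per sample than the library default
    num_epochs=1,      # a single pass over the generated pairs
    batch_size=8,      # smaller batches keep memory use modest on a laptop
)
trainer.train()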

Answer 2

Score: -1

Nothing wrong with your code, but you need a more powerful machine, ideally with a GPU, to train Transformers; they are not meant for underpowered hardware. Try Colab or Kaggle for free, or a private VM if you have the chance. A few epochs take only a few seconds there.

I am sharing a Colab Notebook here, and this is what the performance and resource usage look like:

(screenshot of the Colab runtime's performance and resource usage)

My advice would be to use the free Kaggle Notebooks with GPU: slower than Colab (by a factor of about 4x in my experience), but more generous in terms of availability and time limits. Here is the Kaggle Notebook too, for comparison and experimentation.

Happy GPU training!
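
Before launching a long run, it is worth a quick check that the runtime actually exposes a GPU. A minimal sketch using PyTorch, which SetFit relies on under the hood:

import torch

# Prints True and the device name on a Colab/Kaggle GPU runtime;
# on a CPU-only machine SetFit falls back to the much slower CPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))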
