SetFit training with a pandas DataFrame

Question

I would like to train a zero-shot classifier on an annotated sample dataset.

I am following some tutorials, but since they all use their own data and the same pretrained model, I am trying to confirm: is this the best approach?

Data example: 

import pandas as pd
from datasets import Dataset
    
# Sample feedback data; the full dataset has 8 samples per label
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]

# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)

# convert to Dataset format
df = Dataset.from_pandas(df)

With the data in the format above, this is the approach for fine-tuning the model:

from setfit import SetFitModel, SetFitTrainer

# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)

trainer.train()

The issue is that the process never finishes, even after more than 500 hours on a laptop, although the dataset has only about 88 records with 11 labels.

Answer 1

Score: 3

I tried to run the example you posted on Google Colab, and the training took 37 seconds.

Here's your code with a few tweaks to make it work on Colab:

### Install libraries
%%capture
!pip install datasets setfit

After installing the libraries, run the following code:

### Import dataset
import pandas as pd
from datasets import Dataset
# Sample feedback data; the full dataset has 8 samples per label
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]
# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)
# convert to Dataset format
df = Dataset.from_pandas(df)

### Run training
from setfit import SetFitModel, SetFitTrainer
# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)
trainer.train()
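
Once training finishes, the fitted model can be queried directly. A minimal sketch, assuming the usual SetFitModel call interface; the two feedback strings below are made-up examples, not taken from the original data:

# Hypothetical new feedback to classify (not from the original dataset)
preds = model(["The delivery was late.", "Support answered all my questions quickly."])
print(preds)  # predicted labels from the trained classification head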

And finally, you can save the trained model to Google Drive and then download it to your PC manually.

### Download model to drive
from google.colab import drive
drive.mount('/content/drive')
trainer.model._save_pretrained('/content/drive/path/to/target/folder')
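
To reuse the saved model later, it can be loaded back from that folder; from_pretrained also accepts a local directory path (the path below is the same placeholder as above):

from setfit import SetFitModel

# Load the model that was saved to Drive above
model = SetFitModel.from_pretrained('/content/drive/path/to/target/folder')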

If your main issue is the training time, this should fix it.
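
If you still need to train on a CPU-only laptop, one knob worth trying is to reduce how much contrastive training SetFit does. This is only a sketch, assuming the same SetFitTrainer API as above; num_iterations, num_epochs and batch_size control how many text pairs are generated and processed, and the values below are illustrative rather than tuned:

from setfit import SetFitModel, SetFitTrainer

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=df,
    eval_dataset=df,
    column_mapping={"text": "text", "label": "label"},
    num_iterations=5,  # fewer contrastive pairs per sample than the library default
    num_epochs=1,      # a single pass over the generated pairs
    batch_size=8,      # smaller batches keep memory use modest on a laptop
)
trainer.train()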

Answer 2

Score: -1

Nothing wrong with your code, but you need a more powerful machine, ideally with a GPU, to train Transformers; they are not meant for underpowered hardware. Try Colab or Kaggle for free, or a private VM if you have the chance. A few epochs take only a few seconds there.

I am sharing a Colab Notebook here, and this is what the performance and resource usage look like:

(screenshot of the Colab runtime's performance and resource usage)

My advice would be to use the free Kaggle Notebooks with GPU: slower than Colab (by a factor of about 4x in my experience), but more generous in terms of availability and time limits. Here is the Kaggle Notebook too, for comparison and experimentation.

Happy GPU training!
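
Before launching a long run, it is worth a quick check that the runtime actually exposes a GPU. A minimal sketch using PyTorch, which SetFit relies on under the hood:

import torch

# Prints True and the device name on a Colab/Kaggle GPU runtime;
# on a CPU-only machine SetFit falls back to the much slower CPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))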
