sentence transformer use of evaluator
Question
I came across this script, which is the second link on this page, along with this explanation.
I am using `all-mpnet-base-v2` (link) with my custom data.
I am having a hard time understanding the use of:

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    dev_samples, name='sts-dev')
The documentation says:
> evaluator – an evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc.
But in this case, as we are fine-tuning on our own examples, `train_dataloader` has `train_samples`, which contains our model sentences and scores.

Q1. How is `train_samples` different from `dev_samples`?

Q2a: If the model is going to print performance against `dev_samples`, then how is it going to help "to determine the best model that is saved to disc"?

Q2b: Are we required to run `dev_samples` against the model saved on the disc and then compare scores?

Q3. If my goal is to take a single model and then fine-tune it, is it okay to skip the parameters `evaluator` and `evaluation_steps`?

Q4. How do I determine the total number of steps in the model? Do I need to set `evaluation_steps`?
Updated
I followed the answer provided by Kyle and have the following follow-up questions.
In the `fit` method, I used the `evaluator`, and the data below was written to a file.

Q5. Which metric is used to select the best epoch? Is it `cosine_pearson`?

Q6: Why are the steps set to `-1` in the above output?
Q7a: How do I find the number of steps based on the size of my data, batch size, etc.?
Currently, I have kept them at 1000. But I'm not sure if that is too much. I am running for 10 epochs, and I have 2509 examples in the training data, and the batch size is 64.
Q7b: Are my steps going to be 2509/64? If yes, then 1000 seems to be too high a number.
Answer 1
Score: 2
Question 1
> How is `train_samples` different from `dev_samples` in the context of the `EmbeddingSimilarityEvaluator`?
One needs to have a "held-out" split of data to be used for evaluation during training to avoid over-fitting. This "held-out" set is commonly referred to as the "development set" as it is the set of data that is used during development of the model/system. A pedagogical analogy can be drawn between a traditional education curriculum and that of training deep learning models: if one were to give students all the questions for a given topic, and then use the same subset of questions for evaluation, then eventually (most) students will learn to memorise the set of answers they repeatedly see while practicing, instead of learning the procedures to solve the questions in general. So if you are using your own custom data, make sure that a subset of that data is allocated to dev_samples
in addition to train_samples
and test_samples
. Alternatively, if your own data is scarce, you can use the original training data to supplement your own training, development and test sets. The "test set" is the one that is only used after training has completed to determine the final performance of the model (i.e. all samples in the test set (ideally) haven't been seen before).
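For illustration, a minimal sketch of such a split over custom data might look like the following (the sample pairs and the 80/10/10 ratios are assumptions for illustration, not part of the original script):

```python
import random
from sentence_transformers import InputExample

# Hypothetical custom data: (sentence1, sentence2, similarity score in [0, 1]) triples.
raw_pairs = [
    ("A man is playing guitar", "Someone plays an instrument", 0.8),
    ("The weather is sunny", "It is raining heavily", 0.1),
    # ... the rest of your examples ...
]

random.seed(42)
random.shuffle(raw_pairs)

examples = [InputExample(texts=[s1, s2], label=float(score)) for s1, s2, score in raw_pairs]

# Assumed 80/10/10 split into train / dev / test.
n = len(examples)
train_samples = examples[: int(0.8 * n)]
dev_samples = examples[int(0.8 * n): int(0.9 * n)]
test_samples = examples[int(0.9 * n):]
```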
Question 2
> How is the model going to determine the best model that is saved to disc? Are we required to run `dev_samples` against the model saved on the disc and then compare scores?
The previous answer alludes to how this will work, but in brief: once the `evaluator` has been instantiated, it will measure the correlation against the gold labels and then return the similarity score (depending on what `main_similarity` was initially set to). If the produced embeddings (based on the development set) offer a higher correlation with their gold labels, and therefore a higher score overall, then this "better" model is saved to disk. Hence, there is no need for you to "run `dev_samples` against the model saved on the disc and then compare scores"; this process happens automatically, provided everything has been set up appropriately.
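Conceptually, the dev score the evaluator tracks boils down to something like the sketch below (this is an illustrative approximation, not the library's exact internals, which also report Pearson and other distance-based scores):

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

# dev_samples: a list of InputExample objects, e.g. from the split sketched above.
sentences1 = [ex.texts[0] for ex in dev_samples]
sentences2 = [ex.texts[1] for ex in dev_samples]
gold_scores = [ex.label for ex in dev_samples]

emb1 = model.encode(sentences1, convert_to_numpy=True)
emb2 = model.encode(sentences2, convert_to_numpy=True)

# Cosine similarity for each sentence pair.
cosine_scores = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)

# Correlation with the gold labels; the higher this is, the "better" the model,
# and model.fit() keeps whichever checkpoint scored best on the dev set.
dev_score, _ = spearmanr(cosine_scores, gold_scores)
print(f"Dev Spearman correlation: {dev_score:.4f}")
```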
Question 3
> If my goal is to take a single model and then fine-tune it, is it okay to skip the parameters `evaluator` and `evaluation_steps`?

Based on the above answers, you can understand why you cannot "skip the `evaluator` and `evaluation_steps`". The `evaluator` is an integral part of "fine-tuning" (i.e. training) the model.
Question 4
> How do I determine the total number of steps for the model? Do I need to set `evaluation_steps`?

The `evaluation_steps` parameter sets the number of training steps that must occur before the model is evaluated using the `evaluator`. If the authors have set this to 1000, then leave it as is unless you notice problems with training. Alternatively, experiment with either increasing or decreasing it and select the value that works best for training.
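As a hedged sketch of where these parameters plug into the training call (the `warmup_steps` value and output path below are illustrative assumptions, not values from the original script):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("all-mpnet-base-v2")

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)
train_loss = losses.CosineSimilarityLoss(model)

# Evaluator built from the held-out dev split.
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name="sts-dev")

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=10,
    evaluation_steps=1000,   # evaluate every 1000 training steps (plus at each epoch end)
    warmup_steps=100,        # illustrative value
    output_path="output/my-finetuned-model",
    save_best_model=True,    # save the best-scoring checkpoint to output_path
)
```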
Follow-Up Questions
Question 5
> Which metric is used to select the best epoch? Is it `cosine_pearson`?

By default, the maximum of the Cosine Spearman, Manhattan Spearman, Euclidean Spearman and Dot Product Spearman scores is used.
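If you want to check this against your own run, a small sketch like the one below can pick out the best epoch from the evaluator's CSV output (the file path and column names follow the evaluator's usual conventions, but treat them as assumptions and adjust them to whatever your run actually wrote):

```python
import pandas as pd

# Typically written under <output_path>/eval/similarity_evaluation_<name>_results.csv
df = pd.read_csv("output/my-finetuned-model/eval/similarity_evaluation_sts-dev_results.csv")

# Per evaluation, the file records Pearson and Spearman correlations for the
# cosine, Euclidean, Manhattan and dot-product similarities.
spearman_cols = ["cosine_spearman", "euclidean_spearman", "manhattan_spearman", "dot_spearman"]

# Selection score: the maximum over the Spearman columns, per row.
df["selection_score"] = df[spearman_cols].max(axis=1)
best = df.loc[df["selection_score"].idxmax()]
print(best[["epoch", "steps", "selection_score"]])
```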
Question 6
> Why are the steps `-1` in the output?

The `-1` lets the user know that the evaluator was called after all training steps for a particular epoch had occurred.
If `steps_per_epoch` was not set when calling `model.fit()`, it defaults to `None`, which sets `steps_per_epoch` to the size of the `train_dataloader` that is passed to `train_objectives` when `model.fit()` is initially called, i.e.:

model.fit(train_objectives=[(train_dataloader, train_loss)],
          ...)

In your case, `train_samples` has 2,509 examples and `train_batch_size` is 64, so the size of `train_dataloader`, and therefore `steps_per_epoch`, will be 39 (2,509 / 64 rounded down; 40 if the final partial batch is kept, which is the `DataLoader` default).
If `steps_per_epoch` is less than `evaluation_steps`, then the number of training steps won't reach or exceed `evaluation_steps`, so the additional calls to `_eval_during_training` on line 737 won't occur. This isn't a problem, as the evaluation is forced to run at the end of each epoch anyway, based on line 747.
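A quick way to verify this for your own numbers (just measuring the loader length, using dummy examples in place of your real data):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

# Dummy stand-ins for the 2,509 training examples; only len() is used here.
train_samples = [InputExample(texts=["a", "b"], label=0.5) for _ in range(2509)]

print(len(DataLoader(train_samples, shuffle=True, batch_size=64)))
# -> 40 with the default drop_last=False (39 full batches + 1 partial batch)

print(len(DataLoader(train_samples, shuffle=True, batch_size=64, drop_last=True)))
# -> 39 when the final partial batch is dropped
```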
Question 7
> How do I find the number of `evaluation_steps` based on the size of my training data (2,509 samples) and batch size (64)? Is 1000 too high?

The `evaluation_steps` parameter tells the model whether it should run an evaluation with the `evaluator` part-way through an epoch. Otherwise, the evaluation is only forced to run at the end of each epoch, after `steps_per_epoch` training steps have completed.

Based on the numbers you provided, you could, for example, set `evaluation_steps` to 20 to get an evaluation to run approximately half-way through an epoch (assuming an epoch is 39 training steps). See this answer and its question for more info on batch size vs. epochs vs. steps per epoch.
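Put differently, a simple heuristic (an assumption on my part, not a library rule) is to derive `evaluation_steps` from the epoch length:

```python
import math

num_examples = 2509
batch_size = 64

steps_per_epoch = math.ceil(num_examples / batch_size)  # 40 (39 with drop_last=True)

# Evaluate roughly twice per epoch; any value >= steps_per_epoch means the
# mid-epoch evaluation never triggers and only the end-of-epoch one runs.
evaluation_steps = max(1, steps_per_epoch // 2)
print(steps_per_epoch, evaluation_steps)  # 40 20
```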