Sentence transformer use of evaluator

Question

I came across this script which is the second link on this page and this explanation.

I am using all-mpnet-base-v2 (link) and I am using my custom data.

I am having a hard time understanding the use of:

    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
        dev_samples, name='sts-dev')

The documentation says:

> evaluator – an evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc.

But in this case, as we are fine-tuning on our own examples, train_dataloader is built from train_samples, which already contains our own sentences and scores.

Q1. How is train_samples different than dev_samples?

Q2a: If the model is going to print performance against dev_samples, then how is it going to help "to determine the best model that is saved to disc"?

Q2b: Are we required to run dev_samples against the model saved on the disc and then compare scores?

Q3. If my goal is to take a single model and then fine-tune it, is it okay to skip parameters evaluator and evaluation_steps?

Q4. How to determine the total steps in the model? Do I need to set evaluation_steps?


Updated

I followed the answer provided by Kyle and have the following follow-up questions.

In the fit method, I used the evaluator, and its per-epoch results were written to a file (shown as a screenshot in the original post).

Q5. Which metric is used to select the best epoch? Is it cosine_pearson?

Q6: Why are the steps set to -1 in the above output?

Q7a: How to find steps based on the size of my data, batch size, etc.?

Currently, I have kept them at 1000. But I'm not sure if that is too much. I am running for 10 epochs, and I have 2509 examples in the training data, and the batch size is 64.

Q7b: Are my steps going to be 2509/64? If yes, then 1000 seems to be too high a number.


Answer 1

Score: 2


Question 1

> How is train_samples different from dev_samples in the context of the EmbeddingSimilarityEvaluator?

One needs to have a "held-out" split of data to be used for evaluation during training to avoid over-fitting. This "held-out" set is commonly referred to as the "development set" as it is the set of data that is used during development of the model/system. A pedagogical analogy can be drawn between a traditional education curriculum and that of training deep learning models: if one were to give students all the questions for a given topic, and then use the same subset of questions for evaluation, then eventually (most) students will learn to memorise the set of answers they repeatedly see while practicing, instead of learning the procedures to solve the questions in general.

So if you are using your own custom data, make sure that a subset of that data is allocated to dev_samples in addition to train_samples and test_samples. Alternatively, if your own data is scarce, you can use the original training data to supplement your own training, development and test sets. The "test set" is the one that is only used after training has completed to determine the final performance of the model (i.e. all samples in the test set (ideally) haven't been seen before).
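For illustration, here is a minimal sketch of such a split using InputExample. The (sentence1, sentence2, score) tuple format, the variable names and the 80/10/10 ratios are assumptions for the example, not taken from the question.

    import random
    from sentence_transformers import InputExample

    # `pairs` is assumed to be your custom data as (sentence1, sentence2, score) tuples
    random.seed(42)
    random.shuffle(pairs)

    n = len(pairs)
    train_pairs = pairs[:int(0.8 * n)]
    dev_pairs = pairs[int(0.8 * n):int(0.9 * n)]
    test_pairs = pairs[int(0.9 * n):]

    def to_examples(rows):
        return [InputExample(texts=[s1, s2], label=float(score)) for s1, s2, score in rows]

    train_samples = to_examples(train_pairs)  # used to build train_dataloader
    dev_samples = to_examples(dev_pairs)      # used by the evaluator during training
    test_samples = to_examples(test_pairs)    # only used after training has finished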

Question 2

> How is the model going to determine the best model that is saved to disc? Are we required to run dev_samples against the model saved on the disc and then compare scores?

The previous answer alludes to how this works, but in brief: once the evaluator has been instantiated, it measures the correlation between the embedding similarities and the gold labels and then returns a score (which score depends on what main_similarity was set to). If the embeddings produced for the development set show a higher correlation with their gold labels, and therefore a higher score overall, then this "better" model is saved to disk. Hence, there is no need for you to "run dev_samples against the model saved on the disc and then compare scores"; this process happens automatically, provided everything has been set up appropriately.
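As a hedged sketch of that flow, assuming model and dev_samples are already defined as in the training script (the explicit main_similarity and the manual call are purely illustrative; during model.fit() the evaluator is invoked for you):

    from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction

    # Scores embeddings of the dev pairs against their gold similarity labels.
    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
        dev_samples,
        name='sts-dev',
        main_similarity=SimilarityFunction.COSINE)  # optional; if omitted, the best of the metrics is used

    # Calling the evaluator directly returns the score (a single float in
    # sentence-transformers 2.x) that fit() compares between checkpoints to
    # decide which model to keep on disk.
    score = evaluator(model)
    print(score)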

Question 3

> If my goal is to take a single model and then fine tune it, is it okay to skip parameters evaluator and evaluation_steps?

Based on the above answers, you can understand why you cannot "skip the evaluator and evaluation_steps". The evaluator is an integral part of "fine-tuning" (i.e. training) the model.

Question 4

> How to determine the total number of steps for the model? Do I need to set evaluation_steps?

The evaluation_steps parameter sets the number of training steps that must occur before the model is evaluated using the evaluator. If the authors have set this to 1000, then leave it as is unless you notice problems with training. Alternatively, experiment with either increasing or decreasing it and select a value that works best for training.
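For reference, a hedged sketch of passing both parameters to fit(). The epoch count, warm-up steps and output path are placeholder values, and train_samples/evaluator are assumed to be built as shown earlier.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, losses

    model = SentenceTransformer('all-mpnet-base-v2')

    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)
    train_loss = losses.CosineSimilarityLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              evaluator=evaluator,          # built from dev_samples as above
              epochs=10,                    # placeholder
              evaluation_steps=1000,        # evaluate every 1000 training steps within an epoch
              warmup_steps=100,             # placeholder
              output_path='output/finetuned-all-mpnet-base-v2',  # assumed path
              save_best_model=True)         # keep the checkpoint with the best evaluator score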

Follow-Up Questions

Question 5

> Which metric is used to select the best epoch? Is it cosine_pearson?

By default, the maximum of the Cosine Spearman, Manhattan Spearman, Euclidean Spearman and Dot Product Spearman is used.
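If you want to inspect this yourself, the evaluator also writes its per-epoch metrics to a CSV. The sketch below reads that file with pandas; the exact path is an assumption (it depends on the output_path passed to fit() and can vary between sentence-transformers versions), while the column names follow the evaluator's CSV header.

    import pandas as pd

    # Assumed location: <output_path>/eval/similarity_evaluation_sts-dev_results.csv
    results = pd.read_csv('output/finetuned-all-mpnet-base-v2/eval/similarity_evaluation_sts-dev_results.csv')

    # Score used for model selection when main_similarity is not set:
    results['selection_score'] = results[['cosine_spearman', 'manhattan_spearman',
                                          'euclidean_spearman', 'dot_spearman']].max(axis=1)
    print(results[['epoch', 'steps', 'cosine_pearson', 'selection_score']])
    print('best epoch:', results.loc[results['selection_score'].idxmax(), 'epoch'])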

Question 6

> Why are steps -1 in the output?

The -1 lets the user know that the evaluator was called after all training steps occurred for a particular epoch.

If steps_per_epoch is not set when calling model.fit(), it defaults to None, in which case steps_per_epoch becomes the size of the train_dataloader that is passed to train_objectives when model.fit() is called, i.e.:

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              ...)

In your case, train_samples has 2,509 examples and train_batch_size is 64, so the size of train_dataloader, and therefore steps_per_epoch, will be about 39 (2,509 / 64 ≈ 39.2, i.e. 39 or 40 batches depending on whether the DataLoader drops the last, partial batch).
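You can confirm the value directly, as in the sketch below (the batch size is the one from your question; drop_last is shown only to make the rounding behaviour explicit):

    from torch.utils.data import DataLoader

    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)  # drop_last=False -> 40 batches
    print(len(train_dataloader))  # this is what steps_per_epoch defaults to

    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64, drop_last=True)  # -> 39 batches
    print(len(train_dataloader))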

If steps_per_epoch is less than evaluation_steps, then the number of training steps won't reach or exceed evaluation_steps, and so the additional calls to _eval_during_training on line 737 won't occur. This isn't a problem, as the evaluation is forced to run at the end of each epoch anyway, based on line 747.

Question 7

> How do I find the number of evaluation_steps based on the size of my training data (2,509 samples) and batch size (64)? Is 1000 too high?

The evaluation_steps parameter tells the model during training whether it should run an additional evaluation with the evaluator part-way through an epoch. Otherwise, the evaluation is only forced to run at the end of each epoch, after steps_per_epoch training steps have completed.

Based on the numbers you provided, you could, for example, set evaluation_steps to 20 to get an evaluation to run approximately half-way through an epoch (assuming an epoch is 39 training steps). See this answer and its question for more info on batch size vs. epochs vs. steps per epoch.
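As a quick back-of-the-envelope check with the numbers from your question (pure arithmetic, no library calls; halving steps_per_epoch is just the suggestion above, not a rule):

    import math

    num_examples = 2509
    batch_size = 64

    steps_per_epoch = math.ceil(num_examples / batch_size)  # 40 (39 if the last partial batch is dropped)
    evaluation_steps = max(1, steps_per_epoch // 2)          # ~20: also evaluate roughly mid-epoch

    print(steps_per_epoch, evaluation_steps)
    # With evaluation_steps=1000, no mid-epoch evaluation would ever trigger,
    # since an epoch only has ~40 training steps; evaluation would still run
    # at the end of every epoch.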
