sentence transformer use of evaluator
Question
I came across this script, which is the second link on this page, along with this explanation.
I am using `all-mpnet-base-v2` (link) with my custom data.
I am having a hard time understanding the use of:

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    dev_samples, name='sts-dev')
The documentation says:
> evaluator – an evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc.
But in this case, as we are fine-tuning on our own examples, `train_dataloader` has `train_samples`, which contains our model sentences and scores.

Q1. How is `train_samples` different from `dev_samples`?

Q2a: If the model is going to print performance against `dev_samples`, then how is it going to help "to determine the best model that is saved to disc"?

Q2b: Are we required to run `dev_samples` against the model saved on the disc and then compare scores?

Q3. If my goal is to take a single model and then fine-tune it, is it okay to skip the parameters `evaluator` and `evaluation_steps`?

Q4. How do I determine the total number of steps in the model? Do I need to set `evaluation_steps`?
Updated
I followed the answer provided by Kyle and have the following follow-up questions.
In the `fit` method, I used the `evaluator`, and the data below was written to a file.

Q5. Which metric is used to select the best epoch? Is it `cosine_pearson`?

Q6: Why are the steps set to `-1` in the above output?
Q7a: How do I find the number of steps based on the size of my data, batch size, etc.?
Currently, I have kept them at 1000. But I'm not sure if that is too much. I am running for 10 epochs, and I have 2509 examples in the training data, and the batch size is 64.
Q7b: Are my steps going to be 2509/64? If yes, then 1000 seems to be too high a number.
Answer 1
Score: 2
Question 1
> How is `train_samples` different from `dev_samples` in the context of the `EmbeddingSimilarityEvaluator`?
One needs to have a "held-out" split of data to be used for evaluation during training to avoid over-fitting. This "held-out" set is commonly referred to as the "development set" as it is the set of data that is used during development of the model/system. A pedagogical analogy can be drawn between a traditional education curriculum and that of training deep learning models: if one were to give students all the questions for a given topic, and then use the same subset of questions for evaluation, then eventually (most) students will learn to memorise the set of answers they repeatedly see while practicing, instead of learning the procedures to solve the questions in general. So if you are using your own custom data, make sure that a subset of that data is allocated to dev_samples
in addition to train_samples
and test_samples
. Alternatively, if your own data is scarce, you can use the original training data to supplement your own training, development and test sets. The "test set" is the one that is only used after training has completed to determine the final performance of the model (i.e. all samples in the test set (ideally) haven't been seen before).
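For illustration, a minimal sketch of such a split over custom data might look like the following (the sample pairs and the 80/10/10 ratios are assumptions for illustration, not part of the original script):

```python
import random
from sentence_transformers import InputExample

# Hypothetical custom data: (sentence1, sentence2, similarity score in [0, 1]) triples.
raw_pairs = [
    ("A man is playing guitar", "Someone plays an instrument", 0.8),
    ("The weather is sunny", "It is raining heavily", 0.1),
    # ... the rest of your examples ...
]

random.seed(42)
random.shuffle(raw_pairs)

examples = [InputExample(texts=[s1, s2], label=float(score)) for s1, s2, score in raw_pairs]

# Assumed 80/10/10 split into train / dev / test.
n = len(examples)
train_samples = examples[: int(0.8 * n)]
dev_samples = examples[int(0.8 * n): int(0.9 * n)]
test_samples = examples[int(0.9 * n):]
```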
Question 2
> How is the model going to determine the best model that is saved to disc? Are we required to run `dev_samples` against the model saved on the disc and then compare scores?
The previous answer alludes to how this will work, but in brief: once the `evaluator` has been instantiated, it will measure the correlation against the gold labels and then return the similarity score (depending on what `main_similarity` was initially set to). If the produced embeddings (based on the development set) offer a higher correlation with their gold labels, and therefore a higher score overall, then this "better" model is saved to disk. Hence, there is no need for you to "run `dev_samples` against the model saved on the disc and then compare scores"; this process happens automatically, provided everything has been set up appropriately.
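Conceptually, the dev score the evaluator tracks boils down to something like the sketch below (this is an illustrative approximation, not the library's exact internals, which also report Pearson and other distance-based scores):

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

# dev_samples: a list of InputExample objects, e.g. from the split sketched above.
sentences1 = [ex.texts[0] for ex in dev_samples]
sentences2 = [ex.texts[1] for ex in dev_samples]
gold_scores = [ex.label for ex in dev_samples]

emb1 = model.encode(sentences1, convert_to_numpy=True)
emb2 = model.encode(sentences2, convert_to_numpy=True)

# Cosine similarity for each sentence pair.
cosine_scores = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)

# Correlation with the gold labels; the higher this is, the "better" the model,
# and model.fit() keeps whichever checkpoint scored best on the dev set.
dev_score, _ = spearmanr(cosine_scores, gold_scores)
print(f"Dev Spearman correlation: {dev_score:.4f}")
```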
Question 3
> If my goal is to take a single model and then fine-tune it, is it okay to skip the parameters `evaluator` and `evaluation_steps`?

Based on the above answers, you can understand why you cannot "skip the `evaluator` and `evaluation_steps`". The `evaluator` is an integral part of "fine-tuning" (i.e. training) the model.
Question 4
> How do I determine the total number of steps for the model? Do I need to set `evaluation_steps`?

The `evaluation_steps` parameter sets the number of training steps that must occur before the model is evaluated using the `evaluator`. If the authors have set this to 1000, then leave it as is unless you notice problems with training. Alternatively, experiment with either increasing or decreasing it and select the value that works best for training.
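As a hedged sketch of where these parameters plug into the training call (the `warmup_steps` value and output path below are illustrative assumptions, not values from the original script):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("all-mpnet-base-v2")

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)
train_loss = losses.CosineSimilarityLoss(model)

# Evaluator built from the held-out dev split.
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name="sts-dev")

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=10,
    evaluation_steps=1000,   # evaluate every 1000 training steps (plus at each epoch end)
    warmup_steps=100,        # illustrative value
    output_path="output/my-finetuned-model",
    save_best_model=True,    # save the best-scoring checkpoint to output_path
)
```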
Follow-Up Questions
Question 5
> Which metric is used to select the best epoch? Is it `cosine_pearson`?

By default, the maximum of the Cosine Spearman, Manhattan Spearman, Euclidean Spearman and Dot Product Spearman scores is used.
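If you want to check this against your own run, a small sketch like the one below can pick out the best epoch from the evaluator's CSV output (the file path and column names follow the evaluator's usual conventions, but treat them as assumptions and adjust them to whatever your run actually wrote):

```python
import pandas as pd

# Typically written under <output_path>/eval/similarity_evaluation_<name>_results.csv
df = pd.read_csv("output/my-finetuned-model/eval/similarity_evaluation_sts-dev_results.csv")

# Per evaluation, the file records Pearson and Spearman correlations for the
# cosine, Euclidean, Manhattan and dot-product similarities.
spearman_cols = ["cosine_spearman", "euclidean_spearman", "manhattan_spearman", "dot_spearman"]

# Selection score: the maximum over the Spearman columns, per row.
df["selection_score"] = df[spearman_cols].max(axis=1)
best = df.loc[df["selection_score"].idxmax()]
print(best[["epoch", "steps", "selection_score"]])
```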
Question 6
> Why are the steps `-1` in the output?

The `-1` lets the user know that the evaluator was called after all training steps for a particular epoch had occurred.
If `steps_per_epoch` was not set when calling `model.fit()`, it defaults to `None`, which sets `steps_per_epoch` to the size of the `train_dataloader` that is passed to `train_objectives` when `model.fit()` is initially called, i.e.:

model.fit(train_objectives=[(train_dataloader, train_loss)],
          ...)

In your case, `train_samples` has 2,509 examples and `train_batch_size` is 64, so the size of `train_dataloader`, and therefore `steps_per_epoch`, will be 39 (2,509 / 64 rounded down; 40 if the final partial batch is kept, which is the `DataLoader` default).
If `steps_per_epoch` is less than `evaluation_steps`, then the number of training steps won't reach or exceed `evaluation_steps`, so the additional calls to `_eval_during_training` on line 737 won't occur. This isn't a problem, as the evaluation is forced to run at the end of each epoch anyway, based on line 747.
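A quick way to verify this for your own numbers (just measuring the loader length, using dummy examples in place of your real data):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

# Dummy stand-ins for the 2,509 training examples; only len() is used here.
train_samples = [InputExample(texts=["a", "b"], label=0.5) for _ in range(2509)]

print(len(DataLoader(train_samples, shuffle=True, batch_size=64)))
# -> 40 with the default drop_last=False (39 full batches + 1 partial batch)

print(len(DataLoader(train_samples, shuffle=True, batch_size=64, drop_last=True)))
# -> 39 when the final partial batch is dropped
```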
Question 7
> How do I find the number of `evaluation_steps` based on the size of my training data (2,509 samples) and batch size (64)? Is 1000 too high?

The `evaluation_steps` parameter tells the model whether it should run an evaluation with the `evaluator` part-way through an epoch. Otherwise, the evaluation is only forced to run at the end of each epoch, after `steps_per_epoch` training steps have completed.

Based on the numbers you provided, you could, for example, set `evaluation_steps` to 20 to get an evaluation to run approximately half-way through an epoch (assuming an epoch is 39 training steps). See this answer and its question for more info on batch size vs. epochs vs. steps per epoch.
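Put differently, a simple heuristic (an assumption on my part, not a library rule) is to derive `evaluation_steps` from the epoch length:

```python
import math

num_examples = 2509
batch_size = 64

steps_per_epoch = math.ceil(num_examples / batch_size)  # 40 (39 with drop_last=True)

# Evaluate roughly twice per epoch; any value >= steps_per_epoch means the
# mid-epoch evaluation never triggers and only the end-of-epoch one runs.
evaluation_steps = max(1, steps_per_epoch // 2)
print(steps_per_epoch, evaluation_steps)  # 40 20
```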