June 9, 2023 01:26:56

English:

How to get the logits of the model with a text classification pipeline from HuggingFace?

Question

You can obtain the logits from the distilbert-base-uncased-finetuned-sst-2-english model by calling the model directly on the tokenized input, alongside the classifier pipeline, as follows:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
model = AutoModelForSequenceClassification.from_pretrained(selected_model, num_labels=2)
# The pipeline still returns label/score; it is not needed for the logits below
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
# Input text
text = "Your input sentence here"
# Tokenize the text
inputs = tokenizer(text, return_tensors="pt")
# Get the logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)

This code first tokenizes your input text using the tokenizer, then passes it through the model to obtain the logits. You can replace "Your input sentence here" with the text you want to analyze, and it will return the logits for that input text.
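For orientation, the label and score that the pipeline normally reports can be recovered from such logits with a softmax. The sketch below uses plain Python rather than torch; the logit pair is copied from the answer's output further down, and the index-to-label mapping (0 = NEGATIVE, 1 = POSITIVE) is an assumption that should be checked against model.config.id2label. The helper names are hypothetical.

```python
import math

# Assumed mapping; verify against model.config.id2label for your checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_label_and_score(logits):
    """Mimic the pipeline's default output: the top label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

example_logits = [1.5094, -1.2056]  # one sentence's logits, taken from the answer
result = to_label_and_score(example_logits)
print(result)
```

Going the other way is not possible from the default output alone: a single score does not determine both logits, which is why the raw model output has to be captured before the softmax is applied.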

English:

I need to use pipeline to run tokenization and inference with the distilbert-base-uncased-finetuned-sst-2-english model over my dataset.

My data is a list of sentences; for reproduction purposes, assume it is:

texts = ["this is the first sentence", "of my data.", "In fact, thats not true,", "but we are going to assume it", "is"]

Before using pipeline, I was getting the logits from the model outputs like this:

with torch.no_grad():
    logits = model(**tokenized_test).logits

Now I have to use pipeline, so this is the way I'm getting the model's output:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
model = AutoModelForSequenceClassification.from_pretrained(selected_model, num_labels=2)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(classifier(texts))

which gives me:

[{'label': 'POSITIVE', 'score': 0.9746173024177551}, {'label': 'NEGATIVE', 'score': 0.5020197629928589}, {'label': 'NEGATIVE', 'score': 0.9995120763778687}, {'label': 'NEGATIVE', 'score': 0.9802979826927185}, {'label': 'POSITIVE', 'score': 0.9274746775627136}]

And I can't get the 'logits' field anymore.

Is there a way to get the logits instead of the label and score? Would a custom pipeline be the best and/or easiest way to do it?

Answer 1

Score: 4

When you use the default pipeline, the postprocess function will usually apply a softmax to the logits, e.g.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
# Build the default pipeline before calling it
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same length',
 'to be tokenized by the hf tokenizer for the defined model']
classifier(text, batch_size=2, truncation="only_first")

[out]:

[{'label': 'NEGATIVE', 'score': 0.9379090666770935},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]

So what you want is to override the postprocess logic by inheriting from the pipeline.

To check which pipeline class the classifier is an instance of, do this:

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
type(classifier)

[out]:

transformers.pipelines.text_classification.TextClassificationPipeline

Now that you know the parent class of the task pipeline you want to use, you can subclass it and still enjoy the perks of the precoded batching from TextClassificationPipeline:

from transformers import TextClassificationPipeline
class MarioThePlumber(TextClassificationPipeline):
    # **kwargs absorbs any postprocess parameters the base pipeline may pass
    def postprocess(self, model_outputs, **kwargs):
        # Return the raw logits instead of the softmaxed label/score dict
        return model_outputs["logits"]
pipe = MarioThePlumber(model=model, tokenizer=tokenizer)
pipe(text, batch_size=2, truncation="only_first")

[out]:

[tensor([[ 1.5094, -1.2056]]),
 tensor([[-3.4114,  3.5229]]),
 tensor([[ 1.8835, -1.6886]]),
 tensor([[ 3.0780, -2.5745]]),
 tensor([[ 2.5383, -2.1984]])]
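
A possible alternative that avoids subclassing: TextClassificationPipeline accepts a function_to_apply argument, and passing "none" skips the softmax so that each score is the raw logit, while top_k=None asks for every label rather than only the best one. The following is a minimal sketch, assuming a reasonably recent transformers release; the helper names raw_logits and to_rows are hypothetical, the sample values are illustrative, and the NEGATIVE/POSITIVE label order is an assumption to verify against model.config.id2label.

```python
def raw_logits(texts, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run the stock pipeline but keep raw logits instead of probabilities."""
    # Imported lazily so the pure-Python helper below works without transformers.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    # function_to_apply="none" disables the softmax; top_k=None returns all labels.
    return classifier(texts, function_to_apply="none", top_k=None)

def to_rows(results, label_order=("NEGATIVE", "POSITIVE")):
    """Turn per-text lists of {'label', 'score'} dicts into rows in a fixed label order."""
    rows = []
    for per_text in results:
        by_label = {item["label"]: item["score"] for item in per_text}
        rows.append([by_label[label] for label in label_order])
    return rows

# Assumed shape of the pipeline output for two texts (values are illustrative).
sample = [
    [{"label": "NEGATIVE", "score": 1.5094}, {"label": "POSITIVE", "score": -1.2056}],
    [{"label": "POSITIVE", "score": 3.5229}, {"label": "NEGATIVE", "score": -3.4114}],
]
print(to_rows(sample))
```

With the model available locally, to_rows(raw_logits(text)) would be expected to give one [negative, positive] logit pair per sentence. The subclass keeps the tensors as they are, whereas this route returns plain Python floats keyed by label.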
