May 23, 2023, 01:23:38

Create an unknown label for spaCy when returning list of text and label

Question

I'm trying to write a conditional statement for a function that returns the text and label for each item in a list passed to it.
Here's the code:

def get_label(text: list):
    doc = nlp('. '.join(text) + '.')
    keywords = []
    for ent in doc.ents:
        keywords.append((ent.text, ent.label_))
    return keywords

The input is:

['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

The output is:

[('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

However, the output should include the entities that were not labelled, assigning them the "UNKNOWN" label like this:

[('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('YouTube', 'UNKNOWN'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

I've tried using:

for token in doc.sents:
    keywords.append((token.text, token.label_))

Which returns:

[('Kaggle.', ''), ('Google.', ''), ('San Francisco.', ''), ('this week.', ''), ('as early as tomorrow.', ''), ('Kag-ingle.', ''), ('about half a million.', ''), ('Ben Hamner. 2010.', ''), ('Earlier this month.', ''), ('YouTube.', ''), ('Google Cloud Platform.', ''), ('Crunchbase.', ''), ('$12.5 to $13 million.', ''), ('Index Ventures.', ''), ('SV Angel.', ''), ('Hal Varian.', ''), ('Khosla Ventures.', ''), ('Yuri Milner.', '')]

I assume this is because there is a period at the end of each token, which prevents any label from being returned.

If anyone has an idea of how I can fix this, I'd really appreciate the help.

Answer 1

Score: 1

Iterate over the items passed in and check whether they match one of the returned entities after spaCy has performed the labelling (see solution below).

Notes:

The output labels vary depending on the spaCy version and pipeline/pipeline version being used. I used spaCy 3.5.3 and the en_core_web_trf==3.5.0 pipeline to produce the following results.
spaCy returned "Ben Hamner." (with the trailing period) as the labelled entity for "Ben Hamner", hence the extra condition in the if statement to handle such edge cases.
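
The trailing-period edge case from the second note can be isolated in a small helper. This is only a sketch in plain Python: `matches_item` is a hypothetical name (not part of spaCy), and `ent_text` is assumed to be a string read from `ent.text`.

```python
def matches_item(item: str, ent_text: str) -> bool:
    """Return True if ent_text equals item, ignoring one trailing period.

    Hypothetical helper for illustration; ent_text stands in for ent.text.
    """
    # endswith() is used instead of ent_text[-1] so an empty string is safe.
    return item == ent_text or (ent_text.endswith(".") and item == ent_text[:-1])


print(matches_item("Ben Hamner", "Ben Hamner."))        # True
print(matches_item("Ben Hamner", "Ben Hamner"))         # True
print(matches_item("Google", "Google Cloud Platform"))  # False
```

The helper performs the same comparison as the if statement in the solution, so it could replace that condition without changing the results.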

Solution:

import spacy

txt = ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

# Load the transformer-based English pipeline (it must be installed separately).
nlp = spacy.load("en_core_web_trf")


def get_label(text: list):
    # Join the items into one document so spaCy labels them in a single pass.
    doc = nlp(". ".join(text) + ".")
    keywords = []
    for item in text:
        found_label = False
        for ent in doc.ents:
            # Match the entity text exactly, or with its trailing period removed.
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                break
        if not found_label:
            # No entity matched this item, so fall back to the UNKNOWN label.
            keywords.append((item, "UNKNOWN"))
    return keywords


for kw in get_label(txt):
    print(kw)

Output:

('Kaggle', 'UNKNOWN')
('Google', 'ORG')
('San Francisco', 'GPE')
('this week', 'DATE')
('as early as tomorrow', 'DATE')
('Kag-ingle', 'UNKNOWN')
('about half a million', 'CARDINAL')
('Ben Hamner', 'PERSON')
('2010', 'DATE')
('Earlier this month', 'DATE')
('YouTube', 'ORG')
('Google Cloud Platform', 'UNKNOWN')
('Crunchbase', 'ORG')
('$12.5 to $13 million', 'MONEY')
('Index Ventures', 'ORG')
('SV Angel', 'UNKNOWN')
('Hal Varian', 'PERSON')
('Khosla Ventures', 'ORG')
('Yuri Milner', 'PERSON')

Here is a (possibly premature) optimization of the get_label function. It may be faster when the spaCy pipeline returns a very large number of labelled entities (i.e. a very large tuple in doc.ents), because each matched entity is removed from the list being searched. I'll leave it up to you to time the difference and see whether this variation is worth using in your end application:

def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    ents = list(doc.ents)
    keywords = []
    for item in text:
        found_label = False
        for idx, ent in enumerate(ents):
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                ents.pop(idx)  # reduce size of list to make subsequent searches faster
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords
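
As a further variation on the same idea, the inner loop can be replaced by a dictionary lookup, so each item is resolved in roughly constant time. This is a sketch under stated assumptions, not a tested replacement: `label_items` is a hypothetical name, and it takes plain `(text, label)` pairs (for example built with `[(ent.text, ent.label_) for ent in doc.ents]`) so that it runs without loading a pipeline. The sample pairs below are stand-ins, not real spaCy output.

```python
from typing import Dict, Iterable, List, Tuple


def label_items(
    items: List[str],
    entities: Iterable[Tuple[str, str]],
    default: str = "UNKNOWN",
) -> List[Tuple[str, str]]:
    """Pair every item with its entity label, or `default` if none matched."""
    lookup: Dict[str, str] = {}
    for ent_text, label in entities:
        # setdefault keeps the first label seen for a key, mirroring the
        # `break` on first match in the loop-based versions.
        lookup.setdefault(ent_text, label)
        if ent_text.endswith("."):
            # Also index the text without its trailing period, so that
            # "Ben Hamner." is found when looking up "Ben Hamner".
            lookup.setdefault(ent_text[:-1], label)
    return [(item, lookup.get(item, default)) for item in items]


# Stand-in entity pairs (assumed for illustration only).
sample_entities = [("Google", "ORG"), ("Ben Hamner.", "PERSON"), ("2010", "DATE")]
print(label_items(["Kaggle", "Google", "Ben Hamner", "2010"], sample_entities))
# [('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE')]
```

Unlike the pop-based version, this does not consume entities, so an item that appears twice in the input receives the same label both times. Whether that is preferable depends on your data.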
