Create an unknown label for spaCy when returning list of text and label

Question

I'm trying to write a conditional statement in a function that returns the text and label for each item in a list passed to it.
Here's the code:

def get_label(text: list):
    doc = nlp('. '.join(text) + '.')
    keywords = []
    for ent in doc.ents:
        keywords.append((ent.text, ent.label_))
    return keywords

The input is:

['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

The output is:

[('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

However, the output should include the entities that were not labelled, assigning them the "UNKNOWN" label like this:

[('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('YouTube', 'UNKNOWN'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

I've tried using:

for token in doc.sents:
    keywords.append((token.text, token.label_))

Which returns:

[('Kaggle.', ''), ('Google.', ''), ('San Francisco.', ''), ('this week.', ''), ('as early as tomorrow.', ''), ('Kag-ingle.', ''), ('about half a million.', ''), ('Ben Hamner. 2010.', ''), ('Earlier this month.', ''), ('YouTube.', ''), ('Google Cloud Platform.', ''), ('Crunchbase.', ''), ('$12.5 to $13 million.', ''), ('Index Ventures.', ''), ('SV Angel.', ''), ('Hal Varian.', ''), ('Khosla Ventures.', ''), ('Yuri Milner.', '')]

I assume this is because the period at the end of each token prevents any label from being returned.

If anyone has an idea of how I can fix this, I'd really appreciate the help.

Answer 1

Score: 1

Iterate over the items passed in and check whether they match one of the returned entities after spaCy has performed the labelling (see solution below).

Notes:

  • The output labels vary depending on the spaCy version and pipeline/pipeline version being used. I used spaCy 3.5.3 and the en_core_web_trf==3.5.0 pipeline to produce the following results.
  • spaCy returned "Ben Hamner." (with the trailing period) as the labelled entity for "Ben Hamner", hence the extra condition in the if statement to handle such edge cases.
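
To make the edge case from the second note concrete, here is a minimal sketch of the comparison on its own. The helper name matches_entity is hypothetical (it is not part of spaCy); it only restates the condition used in the if statement of the solution, assuming the entity text is a plain string such as ent.text.

```python
def matches_entity(item: str, ent_text: str) -> bool:
    """Return True if ent_text is the item itself or the item plus one trailing period.

    Hypothetical helper for illustration; equivalent to
    item == ent_text or (ent_text[-1] == "." and item == ent_text[:-1]).
    """
    return ent_text == item or ent_text == item + "."


# Example: the joined input ends each item with ". ", so an entity span
# may swallow the period that follows the item.
print(matches_entity("Ben Hamner", "Ben Hamner."))        # entity with trailing period
print(matches_entity("Google", "Google Cloud Platform"))  # different entity
```

Only a single trailing period is tolerated, so an item is never matched against a longer entity that merely starts with the same text.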

Solution:

import spacy

txt = ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

nlp = spacy.load("en_core_web_trf")

def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    keywords = []
    for item in text:
        found_label = False
        for ent in doc.ents:
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords

for kw in get_label(txt):
    print(kw)

Output:

('Kaggle', 'UNKNOWN')
('Google', 'ORG')
('San Francisco', 'GPE')
('this week', 'DATE')
('as early as tomorrow', 'DATE')
('Kag-ingle', 'UNKNOWN')
('about half a million', 'CARDINAL')
('Ben Hamner', 'PERSON')
('2010', 'DATE')
('Earlier this month', 'DATE')
('YouTube', 'ORG')
('Google Cloud Platform', 'UNKNOWN')
('Crunchbase', 'ORG')
('$12.5 to $13 million', 'MONEY')
('Index Ventures', 'ORG')
('SV Angel', 'UNKNOWN')
('Hal Varian', 'PERSON')
('Khosla Ventures', 'ORG')
('Yuri Milner', 'PERSON')

Some premature optimization for the get_label function, which may be faster when the spaCy pipeline returns very large documents (i.e. a very large tuple of labelled entities in doc.ents). I'll leave it up to you to time the difference and see whether this variation is worth using in your end application:

def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    ents = list(doc.ents)
    keywords = []
    for item in text:
        found_label = False
        for idx, ent in enumerate(ents):
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                ents.pop(idx)  # reduce size of list to make subsequent searches faster
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords
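
If the nested loop ever becomes a bottleneck, another option is to build a dictionary once and look each item up in it. The following is only a sketch under a few assumptions: the function name label_items and its parameters are hypothetical, and it takes plain (text, label) pairs, which you could build yourself with [(ent.text, ent.label_) for ent in doc.ents]. It follows the first-match behaviour of the original get_label; unlike the pop variation, repeated items in the input all receive the same label.

```python
def label_items(items, ent_pairs, default="UNKNOWN"):
    """Label each item using (entity_text, label) pairs, falling back to `default`.

    Hypothetical helper for illustration; ent_pairs is assumed to be built as
    [(ent.text, ent.label_) for ent in doc.ents].
    """
    lookup = {}
    for ent_text, label in ent_pairs:
        # setdefault keeps the first label seen for a given text.
        lookup.setdefault(ent_text, label)
        # Also register the text without one trailing period ("Ben Hamner." -> "Ben Hamner").
        if ent_text.endswith("."):
            lookup.setdefault(ent_text[:-1], label)
    return [(item, lookup.get(item, default)) for item in items]


# Usage with hand-written pairs standing in for doc.ents:
pairs = [("Google", "ORG"), ("Ben Hamner.", "PERSON")]
for kw in label_items(["Kaggle", "Google", "Ben Hamner"], pairs):
    print(kw)
```

The dictionary is built in a single pass over the entities and each lookup is a hash lookup, so the cost no longer grows with the number of entities per item. As with the variation above, time it on your own data before deciding whether it is worth using.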

huangapple
  • Published on May 23, 2023, 01:23:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76308600.html