Create an unknown label for spaCy when returning list of text and label
Question
for token in doc:
    keywords.append((token.text, token.ent_type_ if token.ent_type_ else 'UNKNOWN'))
I'm trying to write a conditional for a function that returns the text and label of each item in a passed list.
Here's the code:
def get_label(text: list):
    doc = nlp('. '.join(text) + '.')
    keywords = []
    for ent in doc.ents:
        keywords.append((ent.text, ent.label_))
    return keywords
The input is:
['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']
The output is:
[('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]
However, the output should include the entities that were not labelled, assigning them the "UNKNOWN" label like this:
[('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('YouTube', 'UNKNOWN'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]
I've tried using:
for token in doc.sents:
    keywords.append((token.text, token.label_))
Which returns:
[('Kaggle.', ''), ('Google.', ''), ('San Francisco.', ''), ('this week.', ''), ('as early as tomorrow.', ''), ('Kag-ingle.', ''), ('about half a million.', ''), ('Ben Hamner. 2010.', ''), ('Earlier this month.', ''), ('YouTube.', ''), ('Google Cloud Platform.', ''), ('Crunchbase.', ''), ('$12.5 to $13 million.', ''), ('Index Ventures.', ''), ('SV Angel.', ''), ('Hal Varian.', ''), ('Khosla Ventures.', ''), ('Yuri Milner.', '')]
I assume this is because there is a period at the end of each item, which prevents any label from being returned.
If anyone has an idea of how I can fix this, I'd really appreciate the help.
Answer 1
Score: 1
Iterate over the items passed in and check whether they match one of the returned entities after spaCy has performed the labelling (see solution below).
Notes:
- The output labels vary depending on the spaCy version and the pipeline (and pipeline version) being used. I used spaCy 3.5.3 and the en_core_web_trf==3.5.0 pipeline to produce the following results.
- spaCy returned "Ben Hamner" as "Ben Hamner." (with the trailing period) as the labelled entity, hence the extra condition in the if statement to check for these edge cases.
Solution:
import spacy

txt = ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

nlp = spacy.load("en_core_web_trf")

def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    keywords = []
    for item in text:
        found_label = False
        for ent in doc.ents:
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords

for kw in get_label(txt):
    print(kw)
Output:
('Kaggle', 'UNKNOWN')
('Google', 'ORG')
('San Francisco', 'GPE')
('this week', 'DATE')
('as early as tomorrow', 'DATE')
('Kag-ingle', 'UNKNOWN')
('about half a million', 'CARDINAL')
('Ben Hamner', 'PERSON')
('2010', 'DATE')
('Earlier this month', 'DATE')
('YouTube', 'ORG')
('Google Cloud Platform', 'UNKNOWN')
('Crunchbase', 'ORG')
('$12.5 to $13 million', 'MONEY')
('Index Ventures', 'ORG')
('SV Angel', 'UNKNOWN')
('Hal Varian', 'PERSON')
('Khosla Ventures', 'ORG')
('Yuri Milner', 'PERSON')
Some premature optimization for the get_label function, which may be faster when the spaCy pipeline returns a very large document (i.e. a very large tuple of labelled entities in doc.ents). I'll leave it up to you to time the difference and see whether this variation is worth using in your end application:
def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    ents = list(doc.ents)
    keywords = []
    for item in text:
        found_label = False
        for idx, ent in enumerate(ents):
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                ents.pop(idx)  # reduce size of list to make subsequent searches faster
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords
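Another way to avoid the nested loop in get_label, offered only as a sketch that I have not timed: build a dictionary from entity text to label once, then look each item up in it. The helper name label_items and the (text, label) pair input are my own assumptions for illustration; with spaCy you would build the pairs with something like [(ent.text, ent.label_) for ent in doc.ents]. The sample pairs below are hand-written stand-ins, not real model output.

```python
from typing import Iterable, List, Tuple


def label_items(items: List[str], ents: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair each item with its entity label, or "UNKNOWN" if none matched.

    `ents` is assumed to be (text, label) pairs, e.g. built with
    [(ent.text, ent.label_) for ent in doc.ents].
    """
    lookup = {}
    for ent_text, label in ents:
        # Keep the first label seen for a given text, like the loop-and-break version.
        lookup.setdefault(ent_text, label)
        # Also register the text without a trailing period ("Ben Hamner." -> "Ben Hamner").
        if ent_text.endswith("."):
            lookup.setdefault(ent_text[:-1], label)
    return [(item, lookup.get(item, "UNKNOWN")) for item in items]


# Hand-written pairs standing in for spaCy output (not real model predictions).
sample_ents = [("Google", "ORG"), ("Ben Hamner.", "PERSON")]
result = label_items(["Kaggle", "Google", "Ben Hamner"], sample_ents)
print(result)  # [('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('Ben Hamner', 'PERSON')]
```

Each item then costs one dictionary lookup instead of a scan over all entities, and repeated items in the input still get the same label.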