Create an unknown label for spaCy when returning list of text and label

Question

I'm trying to write a conditional statement for a function that returns the text and label for each item in a list passed to it.
Here's the code:

    def get_label(text: list):
        doc = nlp('. '.join(text) + '.')
        keywords = []
        for ent in doc.ents:
            keywords.append((ent.text, ent.label_))
        return keywords

The input is:

    ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

The output is:

    [('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

However, the output should include the entities that were not labelled, assigning them the "UNKNOWN" label like this:

    [('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('YouTube', 'UNKNOWN'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

I've tried using:

    for token in doc.sents:
        keywords.append((token.text, token.label_))

Which returns:

    [('Kaggle.', ''), ('Google.', ''), ('San Francisco.', ''), ('this week.', ''), ('as early as tomorrow.', ''), ('Kag-ingle.', ''), ('about half a million.', ''), ('Ben Hamner. 2010.', ''), ('Earlier this month.', ''), ('YouTube.', ''), ('Google Cloud Platform.', ''), ('Crunchbase.', ''), ('$12.5 to $13 million.', ''), ('Index Ventures.', ''), ('SV Angel.', ''), ('Hal Varian.', ''), ('Khosla Ventures.', ''), ('Yuri Milner.', '')]

I assume this is because the period at the end of each item prevents any label from being returned.

If anyone has an idea of how I can fix this, I'd really appreciate the help.

Answer 1

Score: 1

Iterate over the items passed in and check whether they match one of the returned entities after spaCy has performed the labelling (see solution below).

Notes:

  • The output labels vary depending on the spaCy version and on the pipeline and pipeline version being used. I used spaCy 3.5.3 and the en_core_web_trf==3.5.0 pipeline to produce the following results.
  • spaCy returned "Ben Hamner." (with the trailing period) as the labelled entity for "Ben Hamner", hence the extra condition in the if statement to catch such edge cases.

Solution:

    import spacy

    txt = ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

    nlp = spacy.load("en_core_web_trf")

    def get_label(text: list):
        doc = nlp(". ".join(text) + ".")
        keywords = []
        for item in text:
            found_label = False
            for ent in doc.ents:
                # Match the item exactly, or an entity that kept the joining period.
                if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                    found_label = True
                    keywords.append((item, ent.label_))
                    break
            # No entity matched this item, so give it the fallback label.
            if not found_label:
                keywords.append((item, "UNKNOWN"))
        return keywords

    for kw in get_label(txt):
        print(kw)

Output:

    ('Kaggle', 'UNKNOWN')
    ('Google', 'ORG')
    ('San Francisco', 'GPE')
    ('this week', 'DATE')
    ('as early as tomorrow', 'DATE')
    ('Kag-ingle', 'UNKNOWN')
    ('about half a million', 'CARDINAL')
    ('Ben Hamner', 'PERSON')
    ('2010', 'DATE')
    ('Earlier this month', 'DATE')
    ('YouTube', 'ORG')
    ('Google Cloud Platform', 'UNKNOWN')
    ('Crunchbase', 'ORG')
    ('$12.5 to $13 million', 'MONEY')
    ('Index Ventures', 'ORG')
    ('SV Angel', 'UNKNOWN')
    ('Hal Varian', 'PERSON')
    ('Khosla Ventures', 'ORG')
    ('Yuri Milner', 'PERSON')
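
The trailing-period check is only needed because the items are joined into one string before labelling. As a rough sketch of a different approach (not part of the answer above, and not tested against en_core_web_trf), each item could be passed through the pipeline as its own document, keeping a label only when a single entity covers the whole item. The function name label_each is made up for illustration. Bear in mind that the model then sees each item without any surrounding text, so the labels may differ from those shown above.

```python
def label_each(items, nlp, default="UNKNOWN"):
    """Label each item by running it through the pipeline as its own document.

    `nlp` is assumed to be a loaded spaCy pipeline, e.g. spacy.load("en_core_web_trf").
    """
    keywords = []
    # nlp.pipe yields one Doc per input text, in the same order as the input.
    for item, doc in zip(items, nlp.pipe(items)):
        label = default
        for ent in doc.ents:
            # Accept only an entity that spans the entire item.
            if ent.text == item.strip():
                label = ent.label_
                break
        keywords.append((item, label))
    return keywords
```

With the names from the solution above, this would be called as label_each(txt, nlp).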

Here is a (possibly premature) optimization of the get_label function, which may be faster when the spaCy pipeline returns a very large document (i.e. a very large tuple of labelled entities in doc.ents). I'll leave it to you to time the difference and decide whether this variation is worth using in your application:

    def get_label(text: list):
        doc = nlp(". ".join(text) + ".")
        ents = list(doc.ents)
        keywords = []
        for item in text:
            found_label = False
            for idx, ent in enumerate(ents):
                if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                    found_label = True
                    keywords.append((item, ent.label_))
                    ents.pop(idx)  # reduce size of list to make subsequent searches faster
                    break
            if not found_label:
                keywords.append((item, "UNKNOWN"))
        return keywords
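
If lookup speed really matters, another variation (my own sketch, not timed and not part of the code above) is to build a dictionary from the labelled entities once and look each item up in it, instead of scanning the entity list for every item. The function name label_items and the hand-written sample pairs are made up for illustration. The function takes plain (text, label) pairs so that it runs without loading a model.

```python
def label_items(items, labelled_entities, default="UNKNOWN"):
    """Pair each item with its entity label, or `default` if nothing matched.

    `labelled_entities` is an iterable of (text, label) pairs, for example
    [(ent.text, ent.label_) for ent in doc.ents].
    """
    lookup = {}
    for ent_text, label in labelled_entities:
        # setdefault keeps the first label seen for a given text.
        lookup.setdefault(ent_text, label)
        # Also register the text without a trailing period, so that
        # "Ben Hamner." still matches the item "Ben Hamner".
        if ent_text.endswith("."):
            lookup.setdefault(ent_text[:-1], label)
    return [(item, lookup.get(item, default)) for item in items]


# Hand-written pairs standing in for doc.ents (illustrative only).
sample_entities = [("Google", "ORG"), ("Ben Hamner.", "PERSON")]
print(label_items(["Kaggle", "Google", "Ben Hamner"], sample_entities))
# prints [('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('Ben Hamner', 'PERSON')]
```

Unlike the ents.pop(idx) version, this does not consume matches, so an item that appears twice in the input gets the same label both times.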

huangapple
  • Published on May 23, 2023 at 01:23:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76308600.html