How to create an Entity Ruler pattern that includes a dot and a hyphen?
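Question
You can try HTML-decoding the text to handle special characters and using a regular-expression pattern to match the CPF. Here is code you can try:
import spacy
import html
nlp = spacy.load("pt_core_news_sm")
text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
# Decode HTML character entities
text = html.unescape(text)
# Add the entity ruler
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [{"TEXT": {"REGEX": r"\d{3}\.\d{3}\.\d{3}-\d{2}"}}]},
]
ruler.add_patterns(patterns)
doc = nlp(text)
# Extract the entities
for ent in doc.ents:
    print(ent.text, ent.label_)
This extracts the CPF only if the tokenizer keeps it as a single token, because a token-level REGEX is tested against one token at a time.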
I am trying to add the Brazilian CPF as an entity in my NER app using spaCy. My current code is as follows:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load("pt_core_news_sm")
text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [{"SHAPE": "ddd.ddd.ddd-dd"}]},
]
ruler.add_patterns(patterns)
doc = nlp(text)
# Extract the entities
for ent in doc.ents:
    print(ent.text, ent.label_)
The result was only:
João PER
Bahia LOC
I tried using regex too:
{"label": "CPF", "pattern": [{"TEXT": {"REGEX": r"^\d{3}\.\d{3}\.\d{3}\-\d{2}$"}}]},
But that did not work either.
How can I fix this so that the CPF is retrieved?
Answer 1
Score: 1
After inspecting the token spacing, I found that the Portuguese tokenizer (pt_core_news_sm) splits the CPF into two parts:
token_spacings = [token.text_with_ws for token in doc]
Result:
['João ', 'mora ', 'na ', 'Bahia', ', ', '22/11/1985', ', ', 'seu ', 'cpf ', 'é ', '111.222.', '333-11']
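To make the effect of that split concrete, here is a small sketch using only the standard library. It assumes the token texts are exactly those listed above; the names `CPF_RE`, `single_token_hits` and `pair_hits` are made up for illustration. It checks the CPF regex against each token on its own, which is how a token-level pattern is evaluated, and then against each pair of neighbouring tokens.

```python
import re

# Token texts as reported above, with trailing whitespace stripped.
tokens = ["João", "mora", "na", "Bahia", ",", "22/11/1985", ",",
          "seu", "cpf", "é", "111.222.", "333-11"]

# Hypothetical helper name; same CPF pattern as in the question, without anchors.
CPF_RE = re.compile(r"\d{3}\.\d{3}\.\d{3}-\d{2}")

# A token-level pattern sees one token at a time, so no single token can match.
single_token_hits = [t for t in tokens if CPF_RE.search(t)]

# Joining neighbouring tokens shows the CPF only exists across two tokens.
pair_hits = [(a, b) for a, b in zip(tokens, tokens[1:]) if CPF_RE.search(a + b)]

print(single_token_hits)
print(pair_hits)
```

No single token contains the full CPF, which is consistent with both the one-token SHAPE pattern and the one-token REGEX pattern failing in the question.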
So I think you could try this:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load("pt_core_news_sm")
text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [
        {"SHAPE": "ddd.ddd."},
        {"SHAPE": "ddd-dd"},
    ]},
]
ruler.add_patterns(patterns)
doc = nlp(text)
# Extract the entities
for ent in doc.ents:
    print(ent.text, ent.label_)
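Because the two-token pattern above depends on how the tokenizer happens to split the number, another option may be to look for the CPF in the raw string first. The sketch below is not part of the original answer. It uses only Python's `re` module, and `find_cpfs` is a hypothetical helper name. It returns character offsets, which could then be mapped onto spaCy spans (for instance with `Doc.char_span`), provided the offsets line up with token boundaries; that step is not shown here.

```python
import re

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"

# Formatted CPF: three groups of three digits separated by dots, then a hyphen and two digits.
CPF_RE = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")

def find_cpfs(raw_text):
    """Return (start, end, matched_text) for every formatted CPF in raw_text."""
    return [(m.start(), m.end(), m.group()) for m in CPF_RE.finditer(raw_text)]

cpf_spans = find_cpfs(text)
for start, end, value in cpf_spans:
    print(start, end, value)
```

This only checks the formatting of the number, not its check digits, and it does not handle CPFs written without punctuation.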