How to create an Entity Ruler pattern that includes a dot and a hyphen?

Question

I am trying to add the Brazilian CPF as an entity in my NER app using spaCy. The current code is as follows:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [{"SHAPE": "ddd.ddd.ddd-dd"}]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

# extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

The result was only:

João PER
Bahia LOC

I tried using regex too:

{"label": "CPF", "pattern": [{"TEXT": {"REGEX": r"^\d{3}\.\d{3}\.\d{3}\-\d{2}$"}}]},

But that did not work either.

How can I fix this so that the CPF is retrieved as an entity?

Answer 1

Score: 1

Looking at the token texts shows that the Portuguese tokenizer splits the CPF into two parts:

token_spacings = [token.text_with_ws for token in doc]

Result:

['João ', 'mora ', 'na ', 'Bahia', ', ', '22/11/1985', ', ', 'seu ', 'cpf ', 'é ', '111.222.', '333-11']
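
This token list also suggests why the single-token attempts in the question found nothing: each entry in an entity ruler pattern describes one token, and no single token here contains the whole CPF. The following is a minimal, stdlib-only sketch of that reasoning, not spaCy code. The token texts are copied from the result above with whitespace stripped, and the names `CPF_REGEX` and `tokens` are illustrative only.

```python
import re

# Regex from the question, which describes a complete CPF.
CPF_REGEX = re.compile(r"^\d{3}\.\d{3}\.\d{3}-\d{2}$")

# Token texts as reported in the result above (trailing whitespace stripped).
tokens = ["João", "mora", "na", "Bahia", ",", "22/11/1985", ",",
          "seu", "cpf", "é", "111.222.", "333-11"]

# The regex matches the CPF when it sees the whole string...
full_match = bool(CPF_REGEX.search("111.222.333-11"))

# ...but a per-token check only sees the pieces, so no token matches.
token_matches = [t for t in tokens if CPF_REGEX.search(t)]

print(full_match)     # True
print(token_matches)  # []
```

The same applies to the single `SHAPE` pattern `ddd.ddd.ddd-dd`: it describes one token, while the text contains two.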

So I think you could try a pattern with one entry per token:

import spacy

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"

# Add the entity ruler to the pipeline
ruler = nlp.add_pipe("entity_ruler")

# The tokenizer splits the CPF into two tokens, so the pattern needs one entry per token
patterns = [
    {"label": "CPF", "pattern": [
        {"SHAPE": "ddd.ddd."},
        {"SHAPE": "ddd-dd"},
    ]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)
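
The two `SHAPE` values in the pattern correspond to the two tokens. As a rough illustration, the hypothetical helper `approx_shape` below mimics the idea behind spaCy's shape attribute: digits become `d`, letters become `x` or `X`, other characters are kept, and long runs of the same symbol are shortened. It is an approximation written for this example, not spaCy's implementation, so when in doubt check `token.shape_` on the actual tokens.

```python
def approx_shape(text: str) -> str:
    """Rough approximation of a word shape (an assumption, not spaCy's own code)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            current = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            current = "d"
        else:
            current = ch  # punctuation such as "." and "-" is kept as-is
        if current == last:
            run += 1
        else:
            run = 0
            last = current
        # Keep at most four consecutive copies of the same shape symbol.
        if run < 4:
            shape.append(current)
    return "".join(shape)


print(approx_shape("111.222."))        # ddd.ddd.
print(approx_shape("333-11"))          # ddd-dd
print(approx_shape("111.222.333-11"))  # ddd.ddd.ddd-dd
```

The last line is the shape the question's pattern expected from a single token. It never occurs because the text is tokenized as two pieces.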

huangapple
  • Posted on June 8, 2023, 09:41:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76428082.html