How to create an Entity Ruler pattern that includes a dot and a hyphen?


Question


I am trying to include the Brazilian CPF as an entity in my NER app using spaCy. The current code is the following:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [{"SHAPE": "ddd.ddd.ddd-dd"}]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

# extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

The result was only:

João PER
Bahia LOC

I tried using regex too:

{"label": "CPF", "pattern": [{"TEXT": {"REGEX": r"^\d{3}\.\d{3}\.\d{3}\-\d{2}$"}}]},

But that did not work either.

How can I fix that to retrieve CPF?

Answer 1

Score: 1


Looking at the token spacings, the Brazilian Portuguese tokenizer splits the CPF into two parts:

token_spacings = [token.text_with_ws for token in doc]

Result:

['João ', 'mora ', 'na ', 'Bahia', ', ', '22/11/1985', ', ', 'seu ', 'cpf ', 'é ', '111.222.', '333-11']
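The regex the question tried is itself fine: applied to the raw string with plain `re` (no spaCy involved), it matches. The failure above is purely a matter of per-token matching once the tokenizer has split the number.

```python
import re

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"

# Same pattern as in the question, minus the ^…$ anchors
m = re.search(r"\d{3}\.\d{3}\.\d{3}-\d{2}", text)
print(m.group())  # 111.222.333-11
```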

So I think you can try this:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [
            {"SHAPE": "ddd.ddd."},
            {"SHAPE": "ddd-dd"},
    ]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

# extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)
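If adjusting token patterns to the tokenizer's splits feels fragile, an alternative (a sketch, not from the original answer) is to skip token patterns entirely: run the regex over the raw text and convert the character offsets into entity spans with spaCy's `doc.char_span`. The check-digit validation below is optional extra filtering using the standard CPF modulo-11 algorithm; note that `111.222.333-11` from the question is format-valid but fails the checksum.

```python
import re

CPF_RE = re.compile(r"\d{3}\.\d{3}\.\d{3}-\d{2}")

def cpf_check_digits_ok(cpf: str) -> bool:
    """Validate the two CPF check digits (standard modulo-11 algorithm)."""
    digits = [int(c) for c in cpf if c.isdigit()]
    if len(digits) != 11:
        return False
    for n in (9, 10):
        # Weights run from n+1 down to 2 over the first n digits
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        if (total * 10) % 11 % 10 != digits[n]:
            return False
    return True

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"

for m in CPF_RE.finditer(text):
    print(m.start(), m.end(), m.group(), cpf_check_digits_ok(m.group()))
    # Prints: 42 56 111.222.333-11 False
    # With a spaCy doc, these offsets could become entities via:
    #   span = doc.char_span(m.start(), m.end(), label="CPF")
    #   if span is not None:  # None if offsets don't align with token boundaries
    #       doc.ents = list(doc.ents) + [span]
```

Same-digit sequences such as `111.111.111-11` pass the checksum but are rejected by convention, so stricter validators also exclude those.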

huangapple
  • Published on 2023-06-08 09:41:53
  • When reposting, please keep this link: https://go.coder-hub.com/76428082.html