How to normalise keywords extracted with Named Entity Recognition
Question
I'm trying to use NER to extract keywords (tags) from job postings. These can be anything along the lines of React, AWS, Team Building, Marketing.
After training a custom model in spaCy I've run into a problem: the extracted tags are not unified/normalized across the data.
For example, if a job posting is about frontend development, NER can extract the keyword frontend in many ways (depending on the job description), for example: Frontend, Front End, Front-End, front-end and so on.
Is there a reliable way to normalise/unify the extracted keywords? All the keywords go directly into the database and, with all the variants of each keyword, I would end up with too much noise.
One way to tackle the problem would be to create mappings such as:
"Frontend": ["Front End", "Front-End", "front-end"]
but that approach doesn't seem very smart. Perhaps spaCy itself has an option to normalise tags?
Answer 1
Score: 3
Certainly, these simple rules can quickly help you collapse similar strings s:
s.lower()
s.replace("-", " ")
s.replace(" ", "")
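The three rules above can be chained into a single normalisation key. The following is a minimal sketch, not a definitive implementation: the helper names `normalise_key` and `canonicalise` are hypothetical, and picking the most frequent spelling as the canonical form is an assumption made for illustration.

```python
from collections import Counter, defaultdict


def normalise_key(tag: str) -> str:
    """Collapse case, hyphens and spaces so spelling variants share one key."""
    return tag.lower().replace("-", " ").replace(" ", "")


def canonicalise(tags):
    """Map every raw tag to the most frequent spelling among its variants."""
    variants = defaultdict(Counter)
    for tag in tags:
        variants[normalise_key(tag)][tag] += 1
    # Assumption: the most common surface form is the one to store.
    canonical = {key: counts.most_common(1)[0][0] for key, counts in variants.items()}
    return [canonical[normalise_key(tag)] for tag in tags]


raw = ["Frontend", "Front End", "Front-End", "front-end", "Frontend", "AWS", "aws", "AWS"]
print(canonicalise(raw))
```

Note that removing spaces entirely is aggressive: two genuinely different tags that differ only by spacing or hyphenation would be merged as well, so it is worth reviewing the resulting groups before writing them to the database.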
There are several phonetic algorithms, such as Metaphone, that are good at collapsing "sounds alike" variants into a single base entity.
A frequent bi-gram analysis may help you to identify
common two-word phrases that denote a single entity.
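A frequent bi-gram analysis can be sketched with the standard library alone. This is an illustrative example under stated assumptions: the function name `frequent_bigrams`, the `min_count` threshold, the simple regex tokenizer and the sample postings are all made up here, and a real pipeline would more likely reuse spaCy's tokens.

```python
import re
from collections import Counter


def frequent_bigrams(texts, min_count=2):
    """Count adjacent word pairs across all texts and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on non-alphanumerics, so "front-end" -> front, end.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), n) for pair, n in counts.most_common() if n >= min_count]


postings = [
    "Front end developer with AWS experience",
    "We need a front-end engineer",
    "Senior Front End role, team building skills",
    "Team building and marketing",
]
print(frequent_bigrams(postings))
```

On these sample postings the pairs "front end" and "team building" are the only ones that occur at least twice, which is the signal that they probably denote a single entity.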
spaCy's token.lemma_ (the base form of token.text) can help with lemmatization.
Learning that e.g. "React" and "Frontend" are more or less synonyms in this context would require a heavier-weight approach, such as word2vec, WordNet, or an LLM like ChatGPT.
Answer 2
Score: 2
<details>
<summary>English:</summary>
To supplement J_H's great answer, if we want to find related terms like "React" and "frontend", this can be done with spaCy out of the box. E.g., let's find all the named entities in the opening paragraphs of the Wikipedia entry for Charles Dickens and cluster them.
$ python -m spacy download en_core_web_lg # 600 MiB, only need to do this once
import spacy
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
nlp = spacy.load("en_core_web_lg")
paragraph = """
Charles John Huffam Dickens (/ˈdɪkɪnz/; 7 February 1812 – 9 June 1870) was an English writer and social critic. He created some of the world's best-known fictional characters and is regarded by many as the greatest novelist of the Victorian era.[1] His works enjoyed unprecedented popularity during his lifetime and, by the 20th century, critics and scholars had recognised him as a literary genius. His novels and short stories are widely read today.[2][3]
Born in Portsmouth, Dickens left school at the age of 12 to work in a boot-blacking factory when his father was incarcerated in a debtors' prison. After three years he returned to school, before he began his literary career as a journalist. Dickens edited a weekly journal for 20 years, wrote 15 novels, five novellas, hundreds of short stories and non-fiction articles, lectured and performed readings extensively, was an indefatigable letter writer, and campaigned vigorously for children's rights, for education, and for other social reforms.
Dickens's literary success began with the 1836 serial publication of The Pickwick Papers, a publishing phenomenon—thanks largely to the introduction of the character Sam Weller in the fourth episode—that sparked Pickwick merchandise and spin-offs. Within a few years Dickens had become an international literary celebrity, famous for his humour, satire and keen observation of character and society. His novels, most of them published in monthly or weekly installments, pioneered the serial publication of narrative fiction, which became the dominant Victorian mode for novel publication.[4][5] Cliffhanger endings in his serial publications kept readers in suspense.[6] The instalment format allowed Dickens to evaluate his audience's reaction, and he often modified his plot and character development based on such feedback.[5] For example, when his wife's chiropodist expressed distress at the way Miss Mowcher in David Copperfield seemed to reflect her own disabilities, Dickens improved the character with positive features.[7] His plots were carefully constructed and he often wove elements from topical events into his narratives.[8] Masses of the illiterate poor would individually pay a halfpenny to have each new monthly episode read to them, opening up and inspiring a new class of readers.[9]
His 1843 novella A Christmas Carol remains especially popular and continues to inspire adaptations in every artistic genre. Oliver Twist and Great Expectations are also frequently adapted and, like many of his novels, evoke images of early Victorian London. His 1859 novel A Tale of Two Cities (set in London and Paris) is his best-known work of historical fiction. The most famous celebrity of his era, he undertook, in response to public demand, a series of public reading tours in the later part of his career.[10] The term Dickensian is used to describe something that is reminiscent of Dickens and his writings, such as poor social or working conditions, or comically repulsive characters."""
doc = nlp(paragraph)
df = pd.DataFrame([(e.text, e.label_, np.array(e.vector)) for e in doc.ents], columns=['text', 'type', 'vec'])
X = np.vstack(df.vec.to_numpy())
dbscan = DBSCAN(metric='cosine', min_samples=1, eps=0.4)
df['cluster'] = dbscan.fit_predict(X)
Finally, let's display the clusters:
groups = df.groupby(by=['cluster'])['text']
for _, texts in groups:
    print(texts.values)
Resulting in
['Charles John Huffam Dickens' 'Dickens' 'Dickens' 'Dickens' 'Dickens'
'Dickens' 'David Copperfield' 'Dickens' 'Dickens']
['7 February 1812']
['English']
['the 20th century']
['Portsmouth']
['the age of 12' 'three years' '20 years' '15' 'five' 'fourth'
'a few years' 'A Tale of Two Cities']
['weekly' 'monthly' 'weekly' 'monthly']
['hundreds']
['1836' '1843' '1859']
['The Pickwick Papers' 'Pickwick']
['Sam Weller']
['Victorian' 'early Victorian London' 'London' 'Paris']
['Mowcher']
['a halfpenny']
['A Christmas Carol']
['Oliver Twist']
</details>