How to extract only the required labels from the given text?

Question


I am given this text:

results = text = ['->\nReligion',
'Label: lgbtq',
'Answer: Ageism',
'Answer: Religion',
'No Bias',
'Label: no_bias',
'Answer: no_bias',
'This unknown sample does not appear to contain any references to any of the labels listed, so it would be classified as "no_bias".',
'Label: occupation',
'->\nRace_Ethnicity',
'Answer: occupation',
'Classification: no_bias',
'This unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, gender identity terms, so it can be classified as "no_bias".',
'No_bias',
'No Bias',
'->\nNo Bias',
'No_bias: The unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, or gender identity terms, so it can be classified as "no_bias".',
'No_bias',
'Answer: Occupation',
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
'Label: No Bias',
'Classification: No Bias',
'Classification: No Bias',
'Label: No Bias',
'No_bias',
'Classification: Occupation',
'Label: Occupation',
'Classification: Occupation',
'->\nRace_Ethnicity',
'Classification: race_ethnicity',
'No_bias',
'->\nRace_Ethnicity',
'Classification: No Bias',
'The label for the unknown sample is "race_ethnicity".',
'Answer: race_ethnicity',
'This unknown sample can be classified as race_ethnicity.',
'No_bias: This sentence does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, or gender identity. Therefore, it can be classified as "no_bias".',
'Answer: No Bias',
'No_bias',
'Classification: No Bias',
'Answer: Occupation',
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
'No_bias',
'Classification: No Bias',
'Classification: No Bias',
'This unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, gender identity terms, or any other bias. Therefore, it can be classified as "no_bias".',
'Classification: No Bias',
'Classification: No Bias',
'No_bias',
'Answer: Religion',
'Classification: Ageism',
'->\nOccupation',
'This unknown sample can be classified as gender as it contains a reference to gender with the term "sex-bomb".',
'This unknown sample can be classified as gender.',
'This unknown sample would be classified as gender.',
'Classification: No Bias',
'This unknown sample would be classified as gender.',
'no_bias',
'Classification: gender',
'Classification: Ageism',
'Classification: gender',
'Classification: gender',
'No_bias',
'No_bias',
'No_bias',
'Classification: No Bias',
'This unknown sample would be classified as gender.',
'This unknown sample does not contain any references to any of the labels, so it would be classified as "no_bias".',
'Classification: occupation',
'This unknown sample would be classified as race_ethnicity.',
'Classification: no_bias',
'Answer: gender',
'Answer: Gender',
'Answer: Gender',
'This unknown sample does not appear to contain any references to any of the labels listed, so it would be classified as "no_bias".',
'Classification: No Bias',
'Classification: Religion',
'Classification: No Bias',
'Classification: no_bias',
'Classification: occupation',
'Answer: occupation',
'Answer: No Bias',
'Answer: No Bias',
'Classification: No Bias',
'No_bias',
'No_bias',
'This unknown sample can be classified as "occupation" as it is referring to a job title.',
'No_bias',
'Answer: lgbtq',
'Classification: No Bias',
'Answer: No Bias',
'->\nOccupation',
'Classification: No Bias',
'This unknown sample can be classified as "occupation" as it is referring to a job title, profession, and work role.',
'Classification: occupation',
'Answer: No Bias',
'Classification: Religion',
'Classification: No Bias',
'Classification: No Bias',
'This unknown sample can be classified as "occupation" as it references to a job title "politician".',
'No_bias',
'Classification: No Bias',
'Answer: occupation',
'Classification: No Bias',
'->\nOccupation',
'->\nOccupation',
'Classification: No Bias']

This array contains 108 phrases, one per row. Each phrase encodes one of these labels: {'race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias'}. For example, row 1 is Religion, row 2 is lgbtq, and so on. How can I extract only these labels?

I tried the following, but neither attempt works:

1.

if "->" in results:
    results = results.split("->")[0]
elif "Label:" in results:
    results = results.split("Label:")[1].strip()
elif "Answer:" in results:
    results = results.split("Answer:")[1].strip()
elif "Classification:" in results:
    results = results.split("Classification:")[1].strip()

2.

import re

labels = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']
result = []
for line in text:
    matches = re.findall(r'->\n|Label: ([a-z_]+)|Classification: ([a-z_]+)|Answer: ([a-z_]+)|This unknown sample can be classified as ([a-z_]+)', line)
    for match in matches:
        for m in match:
            if m and m in labels:
                result.append(m.capitalize())
print(result)

Can anyone help me with this? The final result should be an array of 108 elements, like ['Religion', 'lgbtq', 'Ageism', 'Religion', 'No_Bias', ...].
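
A quick check suggests why the second attempt returns far fewer than 108 items. The character class [a-z_]+ only matches lowercase letters and underscores, so capitalized values such as "Ageism" or "No Bias" are never captured, and the '->\n' alternative has no capture group, so the label after the arrow is dropped. The sketch below reuses the pattern from the question; the helper name captured is only illustrative and not part of the original code.

```python
import re

# Pattern copied from the second attempt in the question.
PATTERN = (r'->\n|Label: ([a-z_]+)|Classification: ([a-z_]+)|Answer: ([a-z_]+)'
           r'|This unknown sample can be classified as ([a-z_]+)')

def captured(line):
    """Return the non-empty groups that the pattern captures from one line."""
    return [m for groups in re.findall(PATTERN, line) for m in groups if m]

# Only the all-lowercase 'Label: lgbtq' line yields a label.
for line in ['Label: lgbtq', 'Answer: Ageism', 'Classification: No Bias', '->\nReligion']:
    print(repr(line), '->', captured(line))
```

The first attempt has a separate problem if results is the whole list: "->" in results tests whether some element equals "->" exactly, which is never the case here, and a list has no split method.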

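Answer 1

Score: 1

Here is a script that works with the data you provided. There is a slight hack: I assume you want "No Bias" to match the "no_bias" label, so I added a check that replaces _ with a space before matching. Let me know if you would like to see a regex-based solution instead.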

# text = [...]
results = []
labels = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']
for line in text:
    found = False
    lower_line = line.lower()
    for label in labels:
        if label in lower_line or label.replace("_", " ") in lower_line:
            results.append(label)
            found = True
            break
    if not found:
        print(f"line: '{line}' does not contain a recognized label")

print(f"Initial size: {len(text)}")
print(f"Results size: {len(results)} results: {results}")
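
For comparison, here is a minimal sketch of a regex-based variant. It is not the answerer's code: the names LABEL_RE and extract_label are made up for illustration, and taking the last label mentioned in a line is a heuristic assumed here. The reason for that choice is that some explanatory lines list several categories ("... age, gender, occupation, religion ...") before concluding with "no_bias", and checking labels in list order would report 'gender' for those lines.

```python
import re

LABELS = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']

# One case-insensitive alternation over all labels. An underscore in a label
# may also appear as whitespace in the text, so "No Bias" matches no_bias.
LABEL_RE = re.compile(
    "|".join(r"[_\s]".join(map(re.escape, label.split("_"))) for label in LABELS),
    re.IGNORECASE,
)

def extract_label(line):
    """Return the canonical form of the last label mentioned in line, or None."""
    mentions = LABEL_RE.findall(line)
    if not mentions:
        return None
    # Normalize e.g. "No Bias" or "No_bias" to "no_bias".
    return re.sub(r"\s", "_", mentions[-1].lower())

sample = [
    '->\nReligion',
    'Label: lgbtq',
    'Answer: Ageism',
    'No Bias',
    'This unknown sample does not contain any references to race, skin color, '
    'ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, '
    'gender identity terms, so it can be classified as "no_bias".',
]
print([extract_label(line) for line in sample])
# ['religion', 'lgbtq', 'ageism', 'no_bias', 'no_bias']
```

Applied to the full list, [extract_label(line) for line in text] keeps one entry per input line, with None marking any line where no label was recognized, so a length mismatch cannot go unnoticed.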

huangapple
  • Posted on August 9, 2023, 03:52:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76862825.html