How to extract only required labels from the given text?
Question
I am given this text:
results = text = ['->\nReligion',
'Label: lgbtq',
'Answer: Ageism',
'Answer: Religion',
'No Bias',
'Label: no_bias',
'Answer: no_bias',
'This unknown sample does not appear to contain any references to any of the labels listed, so it would be classified as "no_bias".',
'Label: occupation',
'->\nRace_Ethnicity',
'Answer: occupation',
'Classification: no_bias',
'This unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, gender identity terms, so it can be classified as "no_bias".',
'No_bias',
'No Bias',
'->\nNo Bias',
'No_bias: The unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, or gender identity terms, so it can be classified as "no_bias".',
'No_bias',
'Answer: Occupation',
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
'Label: No Bias',
'Classification: No Bias',
'Classification: No Bias',
'Label: No Bias',
'No_bias',
'Classification: Occupation',
'Label: Occupation',
'Classification: Occupation',
'->\nRace_Ethnicity',
'Classification: race_ethnicity',
'No_bias',
'->\nRace_Ethnicity',
'Classification: No Bias',
'The label for the unknown sample is "race_ethnicity".',
'Answer: race_ethnicity',
'This unknown sample can be classified as race_ethnicity.',
'No_bias: This sentence does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, or gender identity. Therefore, it can be classified as "no_bias".',
'Answer: No Bias',
'No_bias',
'Classification: No Bias',
'Answer: Occupation',
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
'No_bias',
'Classification: No Bias',
'Classification: No Bias',
'This unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, gender identity terms, or any other bias. Therefore, it can be classified as "no_bias".',
'Classification: No Bias',
'Classification: No Bias',
'No_bias',
'Answer: Religion',
'Classification: Ageism',
'->\nOccupation',
'This unknown sample can be classified as gender as it contains a reference to gender with the term "sex-bomb".',
'This unknown sample can be classified as gender.',
'This unknown sample would be classified as gender.',
'Classification: No Bias',
'This unknown sample would be classified as gender.',
'no_bias',
'Classification: gender',
'Classification: Ageism',
'Classification: gender',
'Classification: gender',
'No_bias',
'No_bias',
'No_bias',
'Classification: No Bias',
'This unknown sample would be classified as gender.',
'This unknown sample does not contain any references to any of the labels, so it would be classified as "no_bias".',
'Classification: occupation',
'This unknown sample would be classified as race_ethnicity.',
'Classification: no_bias',
'Answer: gender',
'Answer: Gender',
'Answer: Gender',
'This unknown sample does not appear to contain any references to any of the labels listed, so it would be classified as "no_bias".',
'Classification: No Bias',
'Classification: Religion',
'Classification: No Bias',
'Classification: no_bias',
'Classification: occupation',
'Answer: occupation',
'Answer: No Bias',
'Answer: No Bias',
'Classification: No Bias',
'No_bias',
'No_bias',
'This unknown sample can be classified as "occupation" as it is referring to a job title.',
'No_bias',
'Answer: lgbtq',
'Classification: No Bias',
'Answer: No Bias',
'->\nOccupation',
'Classification: No Bias',
'This unknown sample can be classified as "occupation" as it is referring to a job title, profession, and work role.',
'Classification: occupation',
'Answer: No Bias',
'Classification: Religion',
'Classification: No Bias',
'Classification: No Bias',
'This unknown sample can be classified as "occupation" as it references to a job title "politician".',
'No_bias',
'Classification: No Bias',
'Answer: occupation',
'Classification: No Bias',
'->\nOccupation',
'->\nOccupation',
'Classification: No Bias']
This array contains 108 strings, one per row. Each row corresponds to one of these labels: {'race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias'}. For example, row 1 is Religion, row 2 is lgbtq, and so on. How can I extract only the label from each row?
I tried string splitting and a regex, but neither works:
1.
if "->" in results:
    results = results.split("->")[0]
elif "Label:" in results:
    results = results.split("Label:")[1].strip()
elif "Answer:" in results:
    results = results.split("Answer:")[1].strip()
elif "Classification:" in results:
    results = results.split("Classification:")[1].strip()
2.
import re

labels = (['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias'])
result = []
for line in text:
    matches = re.findall(r'->\n|Label: ([a-z_]+)|Classification: ([a-z_]+)|Answer: ([a-z_]+)|This unknown sample can be classified as ([a-z_]+)', line)
    for match in matches:
        for m in match:
            if m and m in labels:
                result.append(m.capitalize())
print(result)
Can anyone help me with this? The final result should be a list of 108 elements, like ['Religion', 'lgbtq', 'Ageism', 'Religion', 'No_Bias', ...]
Answer 1
Score: 1
Here is a script that works with the data you provided. There is a slight hack: I assume that you want "No Bias" to match the "no_bias" label, so I added a check that also looks for each label with _ replaced by a space. Let me know if you want to see a solution implemented using regex instead.
# text = [...]
results = []
labels = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']
for line in text:
    found = False
    lower_line = line.lower()
    for label in labels:
        if label in lower_line or label.replace("_", " ") in lower_line:
            results.append(label)
            found = True
            break
    if not found:
        print(f"line: '{line}' does not contain a recognized label")
print(f"Initial size: {len(text)}")
print(f"Results size: {len(results)} results: {results}")
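For reference, the regex variant mentioned above could look roughly like the sketch below. It is not the answerer's code, and the names `LABEL_PATTERN` and `extract_label` are made up for illustration. One thing to watch with a first-match substring scan: it appears to return the first label in list order that occurs anywhere in the row, so an explanatory row that lists "gender, occupation, religion" before concluding "no_bias" would come back as gender. The sketch assumes instead that the last label mentioned in a row is the intended one, and it treats "_" and a space as interchangeable so that "No Bias" maps to no_bias.

```python
import re

LABELS = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']

# One alternation over all labels. "_" may also appear as a space,
# so "No Bias" and "no_bias" both match. Matching is case-insensitive.
LABEL_PATTERN = re.compile(
    r'\b(' + '|'.join(label.replace('_', '[_ ]') for label in LABELS) + r')\b',
    re.IGNORECASE,
)

def extract_label(line):
    """Return the last known label mentioned in line, or None if there is none."""
    matches = LABEL_PATTERN.findall(line)
    if not matches:
        return None
    # Assumption: the conclusion comes last, so take the final mention
    # and normalise it back to the canonical lower-case form.
    return matches[-1].lower().replace(' ', '_')

# A few rows copied from the question's data.
sample = [
    '->\nReligion',
    'Label: lgbtq',
    'Answer: Ageism',
    'No Bias',
    'This unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, gender identity terms, so it can be classified as "no_bias".',
    'This unknown sample can be classified as gender as it contains a reference to gender with the term "sex-bomb".',
]

extracted = [extract_label(line) for line in sample]
print(extracted)
# → ['religion', 'lgbtq', 'ageism', 'no_bias', 'no_bias', 'gender']
```

Applied to the full list, `[extract_label(line) for line in text]` keeps one entry per row, with `None` marking any row where no label was recognised. Call `.capitalize()` on each entry if the capitalised form from the question is wanted.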