August 9, 2023, 03:52:24

How to extract only required labels from the given text?

Question

I am given this text:

results = text = ['->\nReligion',
'Label: lgbtq',
'Answer: Ageism',
'Answer: Religion',
'No Bias',
'Label: no_bias',
'Answer: no_bias',
'This unknown sample does not appear to contain any references to any of the labels listed, so it would be classified as "no_bias".',
'Label: occupation',
'->\nRace_Ethnicity',
'Answer: occupation',
'Classification: no_bias',
'This unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, gender identity terms, so it can be classified as "no_bias".',
'No_bias',
'No Bias',
'->\nNo Bias',
'No_bias: The unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, or gender identity terms, so it can be classified as "no_bias".',
'No_bias',
'Answer: Occupation',
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
'Label: No Bias',
'Classification: No Bias',
'Classification: No Bias',
'Label: No Bias',
'No_bias',
'Classification: Occupation',
'Label: Occupation',
'Classification: Occupation',
'->\nRace_Ethnicity',
'Classification: race_ethnicity',
'No_bias',
'->\nRace_Ethnicity',
'Classification: No Bias',
'The label for the unknown sample is "race_ethnicity".',
'Answer: race_ethnicity',
'This unknown sample can be classified as race_ethnicity.',
'No_bias: This sentence does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, or gender identity. Therefore, it can be classified as "no_bias".',
'Answer: No Bias',
'No_bias',
'Classification: No Bias',
'Answer: Occupation',
"This unknown sample does not belong to any of the labels given and hence it can be classified as 'no_bias'.",
'No_bias',
'Classification: No Bias',
'Classification: No Bias',
'This unknown sample does not contain any references to race, skin color, ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, gender identity terms, or any other bias. Therefore, it can be classified as "no_bias".',
'Classification: No Bias',
'Classification: No Bias',
'No_bias',
'Answer: Religion',
'Classification: Ageism',
'->\nOccupation',
'This unknown sample can be classified as gender as it contains a reference to gender with the term "sex-bomb".',
'This unknown sample can be classified as gender.',
'This unknown sample would be classified as gender.',
'Classification: No Bias',
'This unknown sample would be classified as gender.',
'no_bias',
'Classification: gender',
'Classification: Ageism',
'Classification: gender',
'Classification: gender',
'No_bias',
'No_bias',
'No_bias',
'Classification: No Bias',
'This unknown sample would be classified as gender.',
'This unknown sample does not contain any references to any of the labels, so it would be classified as "no_bias".',
'Classification: occupation',
'This unknown sample would be classified as race_ethnicity.',
'Classification: no_bias',
'Answer: gender',
'Answer: Gender',
'Answer: Gender',
'This unknown sample does not appear to contain any references to any of the labels listed, so it would be classified as "no_bias".',
'Classification: No Bias',
'Classification: Religion',
'Classification: No Bias',
'Classification: no_bias',
'Classification: occupation',
'Answer: occupation',
'Answer: No Bias',
'Answer: No Bias',
'Classification: No Bias',
'No_bias',
'No_bias',
'This unknown sample can be classified as "occupation" as it is referring to a job title.',
'No_bias',
'Answer: lgbtq',
'Classification: No Bias',
'Answer: No Bias',
'->\nOccupation',
'Classification: No Bias',
'This unknown sample can be classified as "occupation" as it is referring to a job title, profession, and work role.',
'Classification: occupation',
'Answer: No Bias',
'Classification: Religion',
'Classification: No Bias',
'Classification: No Bias',
'This unknown sample can be classified as "occupation" as it references to a job title "politician".',
'No_bias',
'Classification: No Bias',
'Answer: occupation',
'Classification: No Bias',
'->\nOccupation',
'->\nOccupation',
'Classification: No Bias']

This array contains 108 phrases. Each phrase refers to one of these labels: ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']. For example, row 1 is Religion, row 2 is lgbtq, and so on. How can I extract only these labels?

I tried string splitting and a regex, but neither is working:

if "->" in results:
    results = results.split("->")[0]
elif "Label:" in results:
    results = results.split("Label:")[1].strip()
elif "Answer:" in results:
    results = results.split("Answer:")[1].strip()
elif "Classification:" in results:
    results = results.split("Classification:")[1].strip()

labels = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']
result = []
for line in text:
    matches = re.findall(r'->\n|Label: ([a-z_]+)|Classification: ([a-z_]+)|Answer: ([a-z_]+)|This unknown sample can be classified as ([a-z_]+)', line)
    for match in matches:
        for m in match:
            if m and m in labels:
                result.append(m.capitalize())
print(result)

Can anyone help me with this? The final result should be an array of 108 elements, like ['Religion', 'lgbtq', 'Ageism', 'Religion', 'No_Bias', ...].

Answer 1

Score: 1

Here is a script that works with the data you provided. There is one slight hack: I assume that you want "No Bias" to match the "no_bias" label, so the check also tries each label with _ replaced by a ' ' character. The comparison runs on the lowercased line, so capitalization does not matter. Let me know if you want to see a solution implemented using regex instead.

# text = [...]
results = []
labels = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']
for line in text:
    found = False
    lower_line = line.lower()
    for label in labels:
        if label in lower_line or label.replace("_", " ") in lower_line:
            results.append(label)
            found = True
            break
    if not found:
        print(f"line: '{line}' does not contain a recognized label")
print(f"Initial size: {len(text)}")
print(f"Results size: {len(results)} results: {results}")
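
One caveat with the substring check above: it returns the first label in `labels` order that appears anywhere in the line. A sentence that lists several categories before concluding "no_bias" (such as the ones mentioning "age, gender, occupation, religion") would therefore seem to come out as gender. Below is a minimal regex-based sketch of one way around that, assuming the verdict is the last label mentioned in each phrase. That "last mention wins" rule, and the names `LABEL_RE` and `extract_label`, are assumptions made for illustration and are not part of the original answer.

```python
import re

LABELS = ['race_ethnicity', 'ageism', 'gender', 'religion', 'occupation', 'lgbtq', 'no_bias']

# Accept either "_" or a space inside a label, ignoring case,
# so that "No Bias", "no_bias" and "No_bias" all match.
LABEL_RE = re.compile(
    "|".join(label.replace("_", "[_ ]") for label in LABELS),
    re.IGNORECASE,
)

def extract_label(line):
    """Return the last label mentioned in the line, or None if there is none."""
    matches = LABEL_RE.findall(line)
    if not matches:
        return None
    # Normalize to the canonical lowercase, underscore-separated form.
    return matches[-1].lower().replace(" ", "_")

# A few phrases taken from the question's data.
sample = [
    '->\nReligion',
    'Label: lgbtq',
    'No Bias',
    'This unknown sample does not contain any references to race, skin color, '
    'ancestry, ethnicity, age, gender, occupation, religion, sexual orientation, '
    'gender identity terms, so it can be classified as "no_bias".',
]
print([extract_label(line) for line in sample])
# → ['religion', 'lgbtq', 'no_bias', 'no_bias']
```

Running `[extract_label(line) for line in text]` over the full list gives one entry per phrase. Any `None` in the result marks a phrase with no recognizable label, which is worth inspecting by hand.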
