How can I add a delimiter to my "findall" result when only one match is found for a given cell?
Question
I'm trying to extract substrings containing equipment names from the cells in a dataframe. Because of the way the data was created, these substrings can be in any cell. I created this program which uses "findall" and some regex to create a list of all the equipment found in the cells in a given row.
The problem is, the output isn't exactly as I need it. For example, if "findall" matches only one substring in the cell, my script does not add a delimiter afterwards. When the program continues to the next column, it joins the first column match with the second column matches, without a delimiter between the results. And I need the delimiter so I can explode the list later on.
Here is the code:
import pandas as pd

# IMPORT FILE AND CREATE DATAFRAME
d = {'Cause':['Consider checking XXX-1000 for deficiencies prior to train switch', 'XXX-2000 AND PPP-2200 to be taken out of service', 'Need to check XXX-3000 and potentially XXX-1000 for degradation'], 'Mitigation':['ZZZ-9999 is dependent on ZZZ-8000', 'These equipment will be out of service in 2025, not applicable', 'No further comments']}
df = pd.DataFrame(data=d)

# Trying the findall technique
df['equipment'] = ""
for column in df.columns:
    df['equipment'] = df['equipment'] + df[column].str.findall(r'\s*(\w{3,}-\d{4}\D*?) ').str.join('|')
    if df['equipment'].str.contains('|') == False:
        df['equipment'] += '|'
My output looks like this:
0 XXX-1000ZZZ-9999|ZZZ-8000
1 XXX-2000|PPP-2200
2 XXX-3000|XXX-1000
But I want it to look like this:
0 XXX-1000|ZZZ-9999|ZZZ-8000
1 XXX-2000|PPP-2200
2 XXX-3000|XXX-1000
So I added the last two lines of the code above to try to add the pipe character. That doesn't work and raises the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is because the program expects a single boolean value, but I can't figure out how to fix it.
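For reference, the error comes from the fact that `df['equipment'].str.contains('|') == False` yields a boolean Series with one value per row, while `if` needs a single True or False. Note also that `str.contains` treats `'|'` as a regular expression by default, so that pattern matches every string. The following is only a minimal sketch of an element-wise alternative that keeps the loop from the question; the `source_columns` list and the `needs_sep` mask are names introduced here for illustration, and the regex is simplified on the assumption that only the equipment tag itself is wanted.

```python
import pandas as pd

d = {
    'Cause': [
        'Consider checking XXX-1000 for deficiencies prior to train switch',
        'XXX-2000 AND PPP-2200 to be taken out of service',
        'Need to check XXX-3000 and potentially XXX-1000 for degradation',
    ],
    'Mitigation': [
        'ZZZ-9999 is dependent on ZZZ-8000',
        'These equipment will be out of service in 2025, not applicable',
        'No further comments',
    ],
}
df = pd.DataFrame(data=d)

# Hypothetical fixed list, so the result column itself is never scanned
source_columns = ['Cause', 'Mitigation']

df['equipment'] = ''
for column in source_columns:
    # One pipe-joined string per row ('' when the cell has no match)
    found = df[column].str.findall(r'\w{3,}-\d{4}').str.join('|')
    # Row-wise mask: a separator is needed only where both sides are non-empty
    needs_sep = (df['equipment'] != '') & (found != '')
    sep = needs_sep.map({True: '|', False: ''})
    df['equipment'] = df['equipment'] + sep + found

print(df['equipment'].tolist())
```

The mask replaces the `if` statement: the decision is made per row instead of once for the whole column.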
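Answer 1
Score: 0
I suggest this solution:
import pandas as pd

# IMPORT FILE AND CREATE DATAFRAME
d = {'Cause':['Consider checking XXX-1000 for deficiencies prior to train switch', 'XXX-2000 AND PPP-2200 to be taken out of service', 'Need to check XXX-3000 and potentially XXX-1000 for degradation'], 'Mitigation':['ZZZ-9999 is dependent on ZZZ-8000', 'These equipment will be out of service in 2025, not applicable', 'No further comments']}
df = pd.DataFrame(data=d)

# Combine both columns, extract every equipment tag, and join the matches with '|'
df['equipment'] = (df['Cause'] + ' ' + df['Mitigation']).str.findall(r'(\w{3,}-\d{4})').apply(lambda x: '|'.join(x))
# Strip a trailing '|' if one is present ('|'.join does not add one, so this is only a safeguard)
df['equipment'] = df['equipment'].apply(lambda x: x.rstrip('|') if x.endswith('|') else x)

for i in df['equipment']:
    print(i)
which returns:
XXX-1000|ZZZ-9999|ZZZ-8000
XXX-2000|PPP-2200
XXX-3000|XXX-1000
or simply
df['equipment']
giving
0 XXX-1000|ZZZ-9999|ZZZ-8000
1 XXX-2000|PPP-2200
2 XXX-3000|XXX-1000
Name: equipment, dtype: object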
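Since the question mentions exploding the result later, the pipe-joined string may not be needed at all. A minimal sketch, assuming the end goal is one row per equipment tag: `findall` already returns a list per row, and `DataFrame.explode` can expand that list directly. The variable names `combined` and `exploded` are introduced here for illustration.

```python
import pandas as pd

d = {
    'Cause': [
        'Consider checking XXX-1000 for deficiencies prior to train switch',
        'XXX-2000 AND PPP-2200 to be taken out of service',
        'Need to check XXX-3000 and potentially XXX-1000 for degradation',
    ],
    'Mitigation': [
        'ZZZ-9999 is dependent on ZZZ-8000',
        'These equipment will be out of service in 2025, not applicable',
        'No further comments',
    ],
}
df = pd.DataFrame(data=d)

# Concatenate the text columns so a single findall covers the whole row
combined = df['Cause'] + ' ' + df['Mitigation']

# Keep the matches as lists, then expand each list element into its own row
exploded = df.assign(equipment=combined.str.findall(r'\w{3,}-\d{4}')).explode('equipment')

print(exploded['equipment'].tolist())
```

Each exploded row keeps the index of the row it came from, so the tags can still be traced back to the original record.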