Pandas split corresponding rows based on separator in two columns duplicating everything else
Question
I have an Excel sheet:
Col1 Col2 Col3 Col4
John English\nMaths 34\n33 Pass
Sam Science 40 Pass
Jack English\nHistory\nGeography 89\n07\n98 Pass
I need to convert it to:
Col1 Col2 Col3 Col4
John English 34 Pass
John Maths 33 Pass
Sam Science 40 Pass
Jack English 89 Pass
Jack History 07 Pass
Jack Geography 98 Pass
The Excel sheet uses \n as the separator between corresponding values in Col2 and Col3. I just need to pull each subject into a new row with its corresponding marks and copy the contents of all other columns as they are.
This is what I tried:
split_cols = ['Col2', 'Col3']
# loop over the columns and split them
separator = '\n'
for col in split_cols:
    df[[f'{col}_Split1', f'{col}_Split2']] = df[col].str.split(separator, n=1, expand=True).fillna('')
# create two new dataframes with the desired columns
df1 = df[['Col1', 'Col2_Split1', 'Col3_Split1', 'Col4']].rename(columns={'Col2_Split1': 'D', 'Col3_Split1': 'C'})
df2 = df[['Col1', 'Col2_Split2', 'Col3_Split2', 'Col4']].rename(columns={'Col2_Split2': 'D', 'Col3_Split2': 'C'})
# concatenate the two dataframes
final_df = pd.concat([df1, df2], ignore_index=True)
# print the final dataframe
print(final_df)
Answer 1
Score: 3
You can explode on multiple columns (this requires pandas >= 1.3) after splitting each string into a list:
# First pass
out = (df.assign(Col2=df['Col2'].str.split('\n'),
                 Col3=df['Col3'].str.split('\n')))
# Fix unbalanced lists by right-padding the shorter one
# (np.pad defaults to constant padding with 0)
def pad(sr):
    n = max(sr.str.len())
    sr['Col2'] = np.pad(sr['Col2'], (0, n-len(sr['Col2'])))
    sr['Col3'] = np.pad(sr['Col3'], (0, n-len(sr['Col3'])))
    return sr
m = out['Col2'].str.len() != out['Col3'].str.len()
out.loc[m, ['Col2', 'Col3']] = out.loc[m, ['Col2', 'Col3']].apply(pad, axis=1)
# Second pass
out = out.explode(['Col2', 'Col3'], ignore_index=True)
print(out)
# Output
Col1 Col2 Col3 Col4
0 John English 34 Pass
1 John Maths 33 Pass
2 Sam Science 40 Pass
3 Jack English 89 Pass
4 Jack History 07 Pass
5 Jack Geography 98 Pass
6 Ryan Maths 12 Failed
7 Ryan Science 10 Failed
8 Ryan History 0 Failed
Input dataframe:
import pandas as pd
import numpy as np
data = {'Col1': ['John', 'Sam', 'Jack', 'Ryan'],
        'Col2': ['English\nMaths', 'Science', 'English\nHistory\nGeography', 'Maths\nScience\nHistory'],
        'Col3': ['34\n33', '40', '89\n07\n98', '12\n10'],
        'Col4': ['Pass', 'Pass', 'Pass', 'Failed']}
df = pd.DataFrame(data)
print(df)
# Output
Col1 Col2 Col3 Col4
0 John English\nMaths 34\n33 Pass
1 Sam Science 40 Pass
2 Jack English\nHistory\nGeography 89\n07\n98 Pass
3 Ryan Maths\nScience\nHistory 12\n10 Failed
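The padding step above fills a missing mark with 0, which can be mistaken for a real score. As a rough alternative sketch, assuming you would rather keep the gap as a missing value, the rows can be built in plain Python with itertools.zip_longest. The variable names `rows` and `out` are only illustrative, and the sample data is the input dataframe shown above.

```python
from itertools import zip_longest

import pandas as pd

df = pd.DataFrame({
    "Col1": ["John", "Sam", "Jack", "Ryan"],
    "Col2": ["English\nMaths", "Science", "English\nHistory\nGeography", "Maths\nScience\nHistory"],
    "Col3": ["34\n33", "40", "89\n07\n98", "12\n10"],
    "Col4": ["Pass", "Pass", "Pass", "Failed"],
})

rows = []
for rec in df.to_dict("records"):
    subjects = rec["Col2"].split("\n")
    marks = rec["Col3"].split("\n")
    # zip_longest pairs the pieces positionally and pads the shorter
    # side with None instead of 0 (an assumption about what you want)
    for subject, mark in zip_longest(subjects, marks, fillvalue=None):
        # copy every other column unchanged, overriding only Col2 and Col3
        rows.append({**rec, "Col2": subject, "Col3": mark})

out = pd.DataFrame(rows, columns=df.columns)
print(out)
```

With this variant, Ryan's History row carries a missing value in Col3, so it can be located with out['Col3'].isna() and handled explicitly.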
Answer 2
Score: 1
EDITED.
You can achieve your goal using the .str.split + .explode methods.
import pandas
df = pandas.DataFrame([
["John", "English\nMaths", "34\n33", "Pass"],
["Sam", "Science", "40", "Pass"],
["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
])
df[1] = df[1].str.split("\n")
df[2] = df[2].str.split("\n")
df = df.explode([1, 2])
print(df)
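The same split-then-explode idea can be wrapped in a small reusable function. The sketch below is only an illustration: `split_rows` is a hypothetical helper name, it assumes named columns as in the question and pandas >= 1.3, and it adds an up-front check because exploding several columns at once fails when the lists in a row have different lengths.

```python
import pandas as pd


def split_rows(df, cols, sep="\n"):
    """Hypothetical helper: split `cols` on `sep`, then explode them together."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].str.split(sep)
    # Count the pieces per cell; every row must have the same count in all `cols`
    lengths = out[cols].apply(lambda s: s.str.len())
    unbalanced = lengths.nunique(axis=1) > 1
    if unbalanced.any():
        raise ValueError(f"unbalanced rows at index {list(out.index[unbalanced])}")
    # ignore_index=True renumbers the result 0..n-1 instead of repeating labels
    return out.explode(cols, ignore_index=True)


df = pd.DataFrame(
    [
        ["John", "English\nMaths", "34\n33", "Pass"],
        ["Sam", "Science", "40", "Pass"],
        ["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
    ],
    columns=["Col1", "Col2", "Col3", "Col4"],
)

result = split_rows(df, ["Col2", "Col3"])
print(result)
```

Passing ignore_index=True is a design choice: without it, the exploded rows keep the index label of the row they came from, so John's two rows would both be labelled 0.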