Pandas: split corresponding rows based on a separator in two columns, duplicating everything else

Question

I have an Excel sheet:

Col1    Col2                          Col3            Col4
John    English\nMaths                34\n33          Pass
Sam     Science                       40              Pass
Jack    English\nHistory\nGeography   89\n07\n98      Pass

I need to convert it to:

Col1    Col2      Col3    Col4
John    English   34      Pass
John    Maths     33      Pass
Sam     Science   40      Pass
Jack    English   89      Pass
Jack    History   07      Pass
Jack    Geography 98      Pass

The Excel sheet uses \n as the separator between corresponding values in Col2 and Col3. I just need to put each subject on its own row with its corresponding marks, and copy the contents of all the other columns as they are.

I tried:

split_cols = ['Col2', 'Col3']

# loop over the columns and split them
separator = '\n'
for col in split_cols:
    df[[f'{col}_Split1', f'{col}_Split2']] = df[col].str.split(separator, n=1, expand=True).fillna('')

# create two new dataframes with the desired columns
df1 = df[['Col1', 'Col2_Split1', 'Col3_Split1', 'Col4']].rename(columns={'Col2_Split1': 'D', 'Col3_Split1': 'C'})
df2 = df[['Col1', 'Col2_Split2', 'Col3_Split2', 'Col4']].rename(columns={'Col2_Split2': 'D', 'Col3_Split2': 'C'})

# concatenate the two dataframes
final_df = pd.concat([df1, df2], ignore_index=True)

# print the final dataframe
print(final_df)

Answer 1

Score: 3

You can explode on multiple columns at once (this requires pandas >= 1.3) after splitting each string into a list:

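# Assumes df, pd and np from the "Input dataframe" snippet further down

# First pass: split each string into a list
out = (df.assign(Col2=df['Col2'].str.split('\n'), 
                 Col3=df['Col3'].str.split('\n')))

# Fix unbalanced lists: pad the shorter list with zeros (np.pad's default)
def pad(sr):
    n = max(sr.str.len())  # length of the longest list in this row
    sr['Col2'] = np.pad(sr['Col2'], (0, n-len(sr['Col2'])))
    sr['Col3'] = np.pad(sr['Col3'], (0, n-len(sr['Col3'])))
    return sr

# Only rows whose two lists differ in length need padding
m = out['Col2'].str.len() != out['Col3'].str.len()
out.loc[m, ['Col2', 'Col3']] = out.loc[m, ['Col2', 'Col3']].apply(pad, axis=1)

# Second pass: explode both columns in parallel
out = out.explode(['Col2', 'Col3'], ignore_index=True)
print(out)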

# Output
   Col1       Col2 Col3    Col4
0  John    English   34    Pass
1  John      Maths   33    Pass
2   Sam    Science   40    Pass
3  Jack    English   89    Pass
4  Jack    History   07    Pass
5  Jack  Geography   98    Pass
6  Ryan      Maths   12  Failed
7  Ryan    Science   10  Failed
8  Ryan    History    0  Failed
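
In the output above, Ryan's missing History mark appears as 0 because the shorter list was padded with zeros. If a missing value is preferable, one alternative is to pair the values in plain Python with itertools.zip_longest, which fills gaps with None. This is only a minimal sketch, not the method used in this answer: the helper name expand_rows is made up for illustration, and it assumes the same four column names as the sample data.

```python
from itertools import zip_longest

import pandas as pd


def expand_rows(df, sep="\n"):
    """Hypothetical helper: build one output row per (Col2, Col3) pair."""
    rows = []
    for rec in df.to_dict("records"):
        subjects = str(rec["Col2"]).split(sep)
        marks = str(rec["Col3"]).split(sep)
        # zip_longest pads the shorter list with None instead of 0
        for subject, mark in zip_longest(subjects, marks):
            # Copy every other column unchanged, overriding only Col2 and Col3
            rows.append({**rec, "Col2": subject, "Col3": mark})
    return pd.DataFrame(rows, columns=df.columns)


df = pd.DataFrame({
    "Col1": ["John", "Ryan"],
    "Col2": ["English\nMaths", "Maths\nScience\nHistory"],
    "Col3": ["34\n33", "12\n10"],
    "Col4": ["Pass", "Failed"],
})
out = expand_rows(df)
print(out)
```

Because the pairing happens before the DataFrame is rebuilt, this variant does not rely on multi-column explode, so it should also work on pandas versions older than 1.3.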

Input dataframe:

import pandas as pd
import numpy as np

data = {'Col1': ['John', 'Sam', 'Jack', 'Ryan'],
        'Col2': ['English\nMaths', 'Science', 'English\nHistory\nGeography', 'Maths\nScience\nHistory'],
        'Col3': ['34\n33', '40', '89\n07\n98', '12\n10'],
        'Col4': ['Pass', 'Pass', 'Pass', 'Failed']}
df = pd.DataFrame(data)
print(df)

# Output
   Col1                         Col2        Col3    Col4
0  John               English\nMaths      34\n33    Pass
1   Sam                      Science          40    Pass
2  Jack  English\nHistory\nGeography  89\n07\n98    Pass
3  Ryan      Maths\nScience\nHistory      12\n10  Failed

Answer 2

Score: 1

EDITED.

You can achieve this by combining .str.split with .explode (exploding several columns at once requires pandas 1.3 or later):

import pandas

df = pandas.DataFrame([
  ["John", "English\nMaths", "34\n33", "Pass"],
  ["Sam", "Science", "40", "Pass"],
  ["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
])

df[1] = df[1].str.split("\n")
df[2] = df[2].str.split("\n")
df = df.explode([1, 2])
print(df)
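
The same split-then-explode idea can be wrapped in a small reusable function that keeps the original column names and resets the index. The sketch below is illustrative only: split_explode is a hypothetical name, the column labels Col1 to Col4 come from the question rather than from this answer, and it assumes every row has the same number of items in each split column (otherwise explode raises a ValueError).

```python
import pandas as pd


def split_explode(df, cols, sep="\n"):
    """Hypothetical helper: split each column in `cols` on `sep`, then explode them together."""
    out = df.copy()
    for col in cols:
        # astype(str) guards against cells that were read as numbers, e.g. a lone 40
        out[col] = out[col].astype(str).str.split(sep)
    # Multi-column explode needs pandas >= 1.3 and equal list lengths per row
    return out.explode(cols, ignore_index=True)


df = pd.DataFrame(
    [
        ["John", "English\nMaths", "34\n33", "Pass"],
        ["Sam", "Science", "40", "Pass"],
        ["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
    ],
    columns=["Col1", "Col2", "Col3", "Col4"],
)
result = split_explode(df, ["Col2", "Col3"])
print(result)
```

Passing ignore_index=True gives the result a fresh 0..n-1 index instead of repeating the original row labels.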

Posted by huangapple on March 1, 2023 at 13:23:28
Link: https://go.coder-hub.com/75599861.html