Pandas根据两列中的分隔符拆分对应的行,并复制其他所有内容。

huangapple go评论88阅读模式
英文:

Pandas split corresponding rows based on separator in two columns duplicating everything else

问题

I have an excel sheet

  1. Col1 Col2 Col3 Col4
  2. John English\nMaths 34\n33 Pass
  3. Sam Science 40 Pass
  4. Jack English\nHistory\nGeography 89\n07\n98 Pass

Need to convert it to

  1. Col1 Col2 Col3 Col4
  2. John English 34 Pass
  3. John Maths 33 Pass
  4. Sam Science 40 Pass
  5. Jack English 89 Pass
  6. Jack History 07 Pass
  7. Jack Geography 98 Pass

The excel sheet has \n as separator for corresponding Col2 and col3 column. Just need to pull each subject in a new row with its corresponding marks and copy all the other column contents as it is.

英文:

I have an excel sheet

  1. Col1 Col2 Col3 Col4
  2. John English\nMaths 34\n33 Pass
  3. Sam Science 40 Pass
  4. Jack English\nHistory\nGeography 89\n07\n98 Pass

Need to convert it to

  1. Col1 Col2 Col3 Col4
  2. John English 34 Pass
  3. John Maths 33 Pass
  4. Sam Science 40 Pass
  5. Jack English 89 Pass
  6. Jack History 07 Pass
  7. Jack Geography 98 Pass

The excel sheet has \n as separator for corresponding Col2 and col3 column. Just need to pull each subject in a new row with its corresponding marks and copy all the other column contents as it is.

Tried

  1. split_cols = ['Col2', 'Col3']
  2. # loop over the columns and split them
  3. separator = '\n'
  4. for col in split_cols:
  5. df[[f'{col}_Split1', f'{col}_Split2']] = df[col].str.split(separator, n=1, expand=True).fillna('')
  6. # create two new dataframes with the desired columns
  7. df1 = df[['Col1', 'Col2_Split1', 'Col3_Split1', 'Col4']].rename(columns={'Col2_Split1': 'D', 'Col3_Split1': 'C'})
  8. df2 = df[['Col1', 'Col2_Split2', 'Col3_Split2', 'Col4']].rename(columns={'Col2_Split2': 'D', 'Col3_Split2': 'C'})
  9. # concatenate the two dataframes
  10. final_df = pd.concat([df1, df2], ignore_index=True)
  11. # print the final dataframe
  12. print(final_df)

答案1

得分: 3

以下是您要翻译的内容:

  1. # First pass
  2. out = (df.assign(Col2=df['Col2'].str.split('\n'),
  3. Col3=df['Col3'].str.split('\n')))
  4. # Fix unbalanced lists
  5. def pad(sr):
  6. n = max(sr.str.len())
  7. sr['Col2'] = np.pad(sr['Col2'], (0, n-len(sr['Col2'])))
  8. sr['Col3'] = np.pad(sr['Col3'], (0, n-len(sr['Col3']))
  9. return sr
  10. m = out['Col2'].str.len() != out['Col3'].str.len()
  11. out.loc[m, ['Col2', 'Col3']] = out.loc[m, ['Col2', 'Col3']].apply(pad, axis=1)
  12. # Second pass
  13. out = out.explode(['Col2', 'Col3'], ignore_index=True)
  14. print(out)

输入数据框:

  1. import pandas as pd
  2. import numpy as np
  3. data = {'Col1': ['John', 'Sam', 'Jack', 'Ryan'],
  4. 'Col2': ['English\nMaths', 'Science', 'English\nHistory\nGeography', 'Maths\nScience\nHistory'],
  5. 'Col3': ['34\n33', '40', '89\n07\n98', '12\n10'],
  6. 'Col4': ['Pass', 'Pass', 'Pass', 'Failed']}
  7. df = pd.DataFrame(data)
  8. print(df)

输出:

  1. Col1 Col2 Col3 Col4
  2. 0 John English\nMaths 34\n33 Pass
  3. 1 Sam Science 40 Pass
  4. 2 Jack English\nHistory\nGeography 89\n07\n98 Pass
  5. 3 Ryan Maths\nScience\nHistory 12\n10 Failed
英文:

You can explode on multiple columns (with a recent version of Pandas >= 1.3) after exploding each string into list:

  1. # First pass
  2. out = (df.assign(Col2=df['Col2'].str.split('\n'),
  3. Col3=df['Col3'].str.split('\n')))
  4. # Fix unbalanced lists
  5. def pad(sr):
  6. n = max(sr.str.len())
  7. sr['Col2'] = np.pad(sr['Col2'], (0, n-len(sr['Col2'])))
  8. sr['Col3'] = np.pad(sr['Col3'], (0, n-len(sr['Col3'])))
  9. return sr
  10. m = out['Col2'].str.len() != out['Col3'].str.len()
  11. out.loc[m, ['Col2', 'Col3']] = out.loc[m, ['Col2', 'Col3']].apply(pad, axis=1)
  12. # Second pass
  13. out = out.explode(['Col2', 'Col3'], ignore_index=True)
  14. print(out)
  15. # Output
  16. Col1 Col2 Col3 Col4
  17. 0 John English 34 Pass
  18. 1 John Maths 33 Pass
  19. 2 Sam Science 40 Pass
  20. 3 Jack English 89 Pass
  21. 4 Jack History 07 Pass
  22. 5 Jack Geography 98 Pass
  23. 6 Ryan Maths 12 Failed
  24. 7 Ryan Science 10 Failed
  25. 8 Ryan History 0 Failed

Input dataframe:

  1. import pandas as pd
  2. import numpy as np
  3. data = {'Col1': ['John', 'Sam', 'Jack', 'Ryan'],
  4. 'Col2': ['English\nMaths', 'Science', 'English\nHistory\nGeography', 'Maths\nScience\nHistory'],
  5. 'Col3': ['34\n33', '40', '89\n07\n98', '12\n10'],
  6. 'Col4': ['Pass', 'Pass', 'Pass', 'Failed']}
  7. df = pd.DataFrame(data)
  8. print(df)
  9. # Output
  10. Col1 Col2 Col3 Col4
  11. 0 John English\nMaths 34\n33 Pass
  12. 1 Sam Science 40 Pass
  13. 2 Jack English\nHistory\nGeography 89\n07\n98 Pass
  14. 3 Ryan Maths\nScience\nHistory 12\n10 Failed

答案2

得分: 1

你可以使用.str.split + .explode方法来实现你的目标。

  1. import pandas
  2. df = pandas.DataFrame([
  3. ["John", "English\nMaths", "34\n33", "Pass"],
  4. ["Sam", "Science", "40", "Pass"],
  5. ["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
  6. ])
  7. df[1] = df[1].str.split("\n")
  8. df[2] = df[2].str.split("\n")
  9. df = df.explode([1, 2])
  10. print(df)
英文:

EDITED.

You can achieve your goals using .str.split + .explode methods.

  1. import pandas
  2. df = pandas.DataFrame([
  3. ["John", "English\nMaths", "34\n33", "Pass"],
  4. ["Sam", "Science", "40", "Pass"],
  5. ["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
  6. ])
  7. df[1] = df[1].str.split("\n")
  8. df[2] = df[2].str.split("\n")
  9. df = df.explode([1, 2])
  10. print(df)

huangapple
  • 本文由 发表于 2023年3月1日 13:23:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/75599861.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定