如何使用pandas重新排序数据

huangapple go评论48阅读模式
英文:

How to reorder data with pandas

问题

I'm trying to reorder from raw data to a report form but I don't know how to do it. Below is my sample.

import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'], 
     'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
     'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
     'Limit': [1000, 1000, 900, 1200, 1200, 1200], 
     'Balance': [400, 600, 800, 300, 400, 200]}
df = pd.DataFrame(data=d)
df

But I need something as below:

d2 = {'Customer Name': ['A', '123456', '123457', 'B', '234567', 'C', '345678', '345679', '345680'], 
     'ID_NO': ['AA', '', '', 'BB', '', 'CC', '', '', ''],
     'Limit': [1000, '', '', 900, '', 1200, '', '', ''], 
     'Balance': [1000, 600, 400, 800, 800, 900, 300, 400, 200]}
df2 = pd.DataFrame(data=d2)
df2

What should I do in this case. I appreciate all the suggestions and help.

One more question that if I have two Limit numbers for one customer as below:

import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'], 
     'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
     'Limit Number': ['AAA', 'AAA', 'BBB', 'CCC', 'CCC', 'CCD'],
     'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
     'Limit': [1000, 1000, 900, 1200, 1200, 1500], 
     'Balance': [400, 600, 800, 300, 400, 600]}
df = pd.DataFrame(data=d)

If I use the code below:

out = (pd.concat([df.reset_index()
                    .groupby(['Customer Name','ID_NO', 'Limit Number'], as_index=False)
                    .agg({'Balance':'sum','Limit':'first','index':'first'})
                    .set_index('index'), 
                 df[['Account Number','Balance']]
                      .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True)
         [['Customer Name','ID_NO','Limit','Balance']])
out

I will get the below result:

Below is my expected Output:

英文:

I'm trying to reorder from raw data to a report form but I don't know how to do it. Below is my sample.

import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'], 
     'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
     'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
     'Limit': [1000, 1000, 900, 1200, 1200, 1200], 
     'Balance': [400, 600, 800, 300, 400, 200]}
df = pd.DataFrame(data=d)
df

如何使用pandas重新排序数据

But I need something as below:

d2 = {'Customer Name': ['A', '123456', '123457', 'B', '234567', 'C', '345678', '345679', '345680'], 
     'ID_NO': ['AA', '', '', 'BB', '', 'CC', '', '', ''],
     'Limit': [1000, '', '', 900, '', 1200, '', '', ''], 
     'Balance': [1000, 600, 400, 800, 800, 900, 300, 400, 200]}
df2 = pd.DataFrame(data=d2)
df2

如何使用pandas重新排序数据

What should I do in this case. I appreciate all the suggestions and help.

One more question that if I have two Limit number for one customer as below:

import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'], 
     'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
     'Limit Number': ['AAA', 'AAA', 'BBB', 'CCC', 'CCC', 'CCD'],
     'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
     'Limit': [1000, 1000, 900, 1200, 1200, 1500], 
     'Balance': [400, 600, 800, 300, 400, 600]}
df = pd.DataFrame(data=d)

如何使用pandas重新排序数据

If I used this below code:

out = (pd.concat([df.reset_index()
                    .groupby(['Customer Name','ID_NO', 'Limit Number'], as_index=False)
                    .agg({'Balance':'sum','Limit':'first','index':'first'})
                    .set_index('index'), 
                 df[['Account Number','Balance']]
                      .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True)
         [['Customer Name','ID_NO','Limit','Balance']])
out

I will get below result:

如何使用pandas重新排序数据

Below is my expected Output:

如何使用pandas重新排序数据

答案1

得分: 2

以下是翻译好的部分:

你可以使用concat来合并数据,并通过删除重复值的方式使用第一个数据框,删除重复值的依据是Customer NameID_NO,同时删除Account Number列,并将Balance列设置为两列的总和,然后再使用rename选择必要的列,最后使用DataFrame.sort_index进行排序:

要求默认索引

# 如果需要的话
# df = df.reset_index(drop=True)

out = (pd.concat([df.drop_duplicates(['Customer Name','ID_NO'])
                    .drop('Account Number', axis=1)
                    .assign(Balance=df.groupby(['Customer Name','ID_NO'])['Balance']
                    .transform('sum')), 
       df[['Account Number','Balance']]
                    .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True))
print(out)
  Customer Name ID_NO   Limit  Balance
0             A    AA  1000.0     1000
1        123456   NaN     NaN      400
2        123457   NaN     NaN      600
3             B    BB   900.0      800
4        234567   NaN     NaN      800
5             C    CC  1200.0      900
6        345678   NaN     NaN      300
7        345679   NaN     NaN      400
8        345680   NaN     NaN      200

或者你可以使用sumGroupBy.first进行聚合,只需在第一个数据框中避免原始索引,所以使用了reset_indexset_index的技巧:

要求默认索引

# 如果需要的话
# df = df.reset_index(drop=True)

out = (pd.concat([df.reset_index()
                    .groupby(['Customer Name','ID_NO'], as_index=False)
                    .agg({'Balance':'sum','Limit':'first','index':'first'})
                    .set_index('index'), 
                 df[['Account Number','Balance']]
                      .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True)
         [['Customer Name','ID_NO','Limit','Balance']])
print(out)
  Customer Name ID_NO   Limit  Balance
0             A    AA  1000.0     1000
1        123456   NaN     NaN      400
2        123457   NaN     NaN      600
3             B    BB   900.0      800
4        234567   NaN     NaN      800
5             C    CC  1200.0      900
6        345678   NaN     NaN      300
7        345679   NaN     NaN      400
8        345680   NaN     NaN      200

编辑:为了对Limit列的唯一值进行求和,可以使用自定义函数:

out = (pd.concat([df.reset_index()
                    .groupby(['Customer Name','ID_NO'], as_index=False)
                    .agg({'Balance':'sum',
                          'Limit':lambda x: x.unique().sum(),
                          'index':'first'})
                    .set_index('index'), 
                 df[['Account Number','Balance']]
                      .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True)
         [['Customer Name','ID_NO','Limit','Balance']])
    
print(out)
  Customer Name ID_NO   Limit  Balance
0             A    AA  1000.0     1000
1        123456   NaN     NaN      400
2        123457   NaN     NaN      600
3             B    BB   900.0      800
4        234567   NaN     NaN      800
5             C    CC  2700.0     1300
6        345678   NaN     NaN      300
7        345679   NaN     NaN      400
8        345680   NaN     NaN      600
英文:

You can use concat and pass first DataFrame created by remove repeated values by Customer Name and ID_NO with delete Account Number column with set Balance to sum per both columns, then second by seelct necessary columns with rename and last use DataFrame.sort_index:

Requirement is default index.

#if necessary
#df = df.reset_index(drop=True)

out = (pd.concat([df.drop_duplicates(['Customer Name','ID_NO'])
                    .drop('Account Number', axis=1)
                    .assign(Balance=df.groupby(['Customer Name','ID_NO'])['Balance']
                    .transform('sum')), 
       df[['Account Number','Balance']]
                    .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True))
print (out)
  Customer Name ID_NO   Limit  Balance
0             A    AA  1000.0     1000
1        123456   NaN     NaN      400
2        123457   NaN     NaN      600
3             B    BB   900.0      800
4        234567   NaN     NaN      800
5             C    CC  1200.0      900
6        345678   NaN     NaN      300
7        345679   NaN     NaN      400
8        345680   NaN     NaN      200

Or you can use aggreagtion by sum and GroupBy.first, only necessary avoid original index in first DataFrame, so used reset_index with set_index trick:

Requirement is default index.

#if necessary
#df = df.reset_index(drop=True)

out = (pd.concat([df.reset_index()
                    .groupby(['Customer Name','ID_NO'], as_index=False)
                    .agg({'Balance':'sum','Limit':'first','index':'first'})
                    .set_index('index'), 
                 df[['Account Number','Balance']]
                      .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True)
         [['Customer Name','ID_NO','Limit','Balance']])
print (out)
  Customer Name ID_NO   Limit  Balance
0             A    AA  1000.0     1000
1        123456   NaN     NaN      400
2        123457   NaN     NaN      600
3             B    BB   900.0      800
4        234567   NaN     NaN      800
5             C    CC  1200.0      900
6        345678   NaN     NaN      300
7        345679   NaN     NaN      400
8        345680   NaN     NaN      200

EDIT: For sum unique values of Limit column use custom function:

out = (pd.concat([df.reset_index()
                    .groupby(['Customer Name','ID_NO'], as_index=False)
                    .agg({'Balance':'sum',
                          'Limit':lambda x: x.unique().sum(),
                          'index':'first'})
                    .set_index('index'), 
                 df[['Account Number','Balance']]
                      .rename(columns={'Account Number':'Customer Name'})])
         .sort_index(kind='stable', ignore_index=True)
         [['Customer Name','ID_NO','Limit','Balance']])

print (out)
  Customer Name ID_NO   Limit  Balance
0             A    AA  1000.0     1000
1        123456   NaN     NaN      400
2        123457   NaN     NaN      600
3             B    BB   900.0      800
4        234567   NaN     NaN      800
5             C    CC  2700.0     1300
6        345678   NaN     NaN      300
7        345679   NaN     NaN      400
8        345680   NaN     NaN      600

huangapple
  • 本文由 发表于 2023年6月16日 12:51:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76487038.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定