英文:
How to reorder data with pandas
问题
I'm trying to reorder from raw data to a report form but I don't know how to do it. Below is my sample.
import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'],
'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
'Limit': [1000, 1000, 900, 1200, 1200, 1200],
'Balance': [400, 600, 800, 300, 400, 200]}
df = pd.DataFrame(data=d)
df
But I need something as below:
d2 = {'Customer Name': ['A', '123456', '123457', 'B', '234567', 'C', '345678', '345679', '345680'],
'ID_NO': ['AA', '', '', 'BB', '', 'CC', '', '', ''],
'Limit': [1000, '', '', 900, '', 1200, '', '', ''],
'Balance': [1000, 600, 400, 800, 800, 900, 300, 400, 200]}
df2 = pd.DataFrame(data=d2)
df2
What should I do in this case. I appreciate all the suggestions and help.
One more question that if I have two Limit numbers for one customer as below:
import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'],
'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
'Limit Number': ['AAA', 'AAA', 'BBB', 'CCC', 'CCC', 'CCD'],
'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
'Limit': [1000, 1000, 900, 1200, 1200, 1500],
'Balance': [400, 600, 800, 300, 400, 600]}
df = pd.DataFrame(data=d)
If I use the code below:
out = (pd.concat([df.reset_index()
.groupby(['Customer Name','ID_NO', 'Limit Number'], as_index=False)
.agg({'Balance':'sum','Limit':'first','index':'first'})
.set_index('index'),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True)
[['Customer Name','ID_NO','Limit','Balance']])
out
I will get the below result:
Below is my expected Output:
英文:
I'm trying to reorder from raw data to a report form but I don't know how to do it. Below is my sample.
import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'],
'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
'Limit': [1000, 1000, 900, 1200, 1200, 1200],
'Balance': [400, 600, 800, 300, 400, 200]}
df = pd.DataFrame(data=d)
df
But I need something as below:
d2 = {'Customer Name': ['A', '123456', '123457', 'B', '234567', 'C', '345678', '345679', '345680'],
'ID_NO': ['AA', '', '', 'BB', '', 'CC', '', '', ''],
'Limit': [1000, '', '', 900, '', 1200, '', '', ''],
'Balance': [1000, 600, 400, 800, 800, 900, 300, 400, 200]}
df2 = pd.DataFrame(data=d2)
df2
What should I do in this case. I appreciate all the suggestions and help.
One more question that if I have two Limit number for one customer as below:
import pandas as pd
d = {'Customer Name': ['A', 'A', 'B', 'C', 'C', 'C'],
'ID_NO': ['AA', 'AA', 'BB', 'CC', 'CC', 'CC'],
'Limit Number': ['AAA', 'AAA', 'BBB', 'CCC', 'CCC', 'CCD'],
'Account Number': ['123456', '123457', '234567', '345678', '345679', '345680'],
'Limit': [1000, 1000, 900, 1200, 1200, 1500],
'Balance': [400, 600, 800, 300, 400, 600]}
df = pd.DataFrame(data=d)
If I used this below code:
out = (pd.concat([df.reset_index()
.groupby(['Customer Name','ID_NO', 'Limit Number'], as_index=False)
.agg({'Balance':'sum','Limit':'first','index':'first'})
.set_index('index'),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True)
[['Customer Name','ID_NO','Limit','Balance']])
out
I will get below result:
Below is my expected Output:
答案1
得分: 2
以下是翻译好的部分:
你可以使用concat
来合并数据,并通过删除重复值的方式使用第一个数据框,删除重复值的依据是Customer Name
和ID_NO
,同时删除Account Number
列,并将Balance
列设置为两列的总和,然后再使用rename
选择必要的列,最后使用DataFrame.sort_index
进行排序:
要求默认索引。
# 如果需要的话
# df = df.reset_index(drop=True)
out = (pd.concat([df.drop_duplicates(['Customer Name','ID_NO'])
.drop('Account Number', axis=1)
.assign(Balance=df.groupby(['Customer Name','ID_NO'])['Balance']
.transform('sum')),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True))
print(out)
Customer Name ID_NO Limit Balance
0 A AA 1000.0 1000
1 123456 NaN NaN 400
2 123457 NaN NaN 600
3 B BB 900.0 800
4 234567 NaN NaN 800
5 C CC 1200.0 900
6 345678 NaN NaN 300
7 345679 NaN NaN 400
8 345680 NaN NaN 200
或者你可以使用sum
和GroupBy.first
进行聚合,只需在第一个数据框中避免原始索引,所以使用了reset_index
和set_index
的技巧:
要求默认索引。
# 如果需要的话
# df = df.reset_index(drop=True)
out = (pd.concat([df.reset_index()
.groupby(['Customer Name','ID_NO'], as_index=False)
.agg({'Balance':'sum','Limit':'first','index':'first'})
.set_index('index'),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True)
[['Customer Name','ID_NO','Limit','Balance']])
print(out)
Customer Name ID_NO Limit Balance
0 A AA 1000.0 1000
1 123456 NaN NaN 400
2 123457 NaN NaN 600
3 B BB 900.0 800
4 234567 NaN NaN 800
5 C CC 1200.0 900
6 345678 NaN NaN 300
7 345679 NaN NaN 400
8 345680 NaN NaN 200
编辑:为了对Limit
列的唯一值进行求和,可以使用自定义函数:
out = (pd.concat([df.reset_index()
.groupby(['Customer Name','ID_NO'], as_index=False)
.agg({'Balance':'sum',
'Limit':lambda x: x.unique().sum(),
'index':'first'})
.set_index('index'),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True)
[['Customer Name','ID_NO','Limit','Balance']])
print(out)
Customer Name ID_NO Limit Balance
0 A AA 1000.0 1000
1 123456 NaN NaN 400
2 123457 NaN NaN 600
3 B BB 900.0 800
4 234567 NaN NaN 800
5 C CC 2700.0 1300
6 345678 NaN NaN 300
7 345679 NaN NaN 400
8 345680 NaN NaN 600
英文:
You can use concat
and pass first DataFrame created by remove repeated values by Customer Name
and ID_NO
with delete Account Number
column with set Balance
to sum
per both columns, then second by seelct necessary columns with rename
and last use DataFrame.sort_index
:
Requirement is default index.
#if necessary
#df = df.reset_index(drop=True)
out = (pd.concat([df.drop_duplicates(['Customer Name','ID_NO'])
.drop('Account Number', axis=1)
.assign(Balance=df.groupby(['Customer Name','ID_NO'])['Balance']
.transform('sum')),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True))
print (out)
Customer Name ID_NO Limit Balance
0 A AA 1000.0 1000
1 123456 NaN NaN 400
2 123457 NaN NaN 600
3 B BB 900.0 800
4 234567 NaN NaN 800
5 C CC 1200.0 900
6 345678 NaN NaN 300
7 345679 NaN NaN 400
8 345680 NaN NaN 200
Or you can use aggreagtion by sum
and GroupBy.first
, only necessary avoid original index in first DataFrame
, so used reset_index
with set_index
trick:
Requirement is default index.
#if necessary
#df = df.reset_index(drop=True)
out = (pd.concat([df.reset_index()
.groupby(['Customer Name','ID_NO'], as_index=False)
.agg({'Balance':'sum','Limit':'first','index':'first'})
.set_index('index'),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True)
[['Customer Name','ID_NO','Limit','Balance']])
print (out)
Customer Name ID_NO Limit Balance
0 A AA 1000.0 1000
1 123456 NaN NaN 400
2 123457 NaN NaN 600
3 B BB 900.0 800
4 234567 NaN NaN 800
5 C CC 1200.0 900
6 345678 NaN NaN 300
7 345679 NaN NaN 400
8 345680 NaN NaN 200
EDIT: For sum unique
values of Limit
column use custom function:
out = (pd.concat([df.reset_index()
.groupby(['Customer Name','ID_NO'], as_index=False)
.agg({'Balance':'sum',
'Limit':lambda x: x.unique().sum(),
'index':'first'})
.set_index('index'),
df[['Account Number','Balance']]
.rename(columns={'Account Number':'Customer Name'})])
.sort_index(kind='stable', ignore_index=True)
[['Customer Name','ID_NO','Limit','Balance']])
print (out)
Customer Name ID_NO Limit Balance
0 A AA 1000.0 1000
1 123456 NaN NaN 400
2 123457 NaN NaN 600
3 B BB 900.0 800
4 234567 NaN NaN 800
5 C CC 2700.0 1300
6 345678 NaN NaN 300
7 345679 NaN NaN 400
8 345680 NaN NaN 600
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论