英文:
Column Pair-wise aggregation and reorganization in Pandas
问题
我正在将一个csv文件导入到一个pandas dataframe中,如下所示:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: '净成本', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: '费用描述', 1: '附加费 A', 2: '折扣 X', 3: '折扣 X', 4: '折扣 X', 5: '附加费 A'}, 3: {0: '费用金额', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: '费用描述', 1: '折扣 X', 2: '', 3: '', 4: '附加费 B', 5: '折扣 X'}, 5: {0: '费用金额', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: '费用描述', 1: '折扣 Y', 2: '', 3: '', 4: '', 5: '附加费 B'}, 7: {0: '费用金额', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: '费用描述', 1: '附加费 B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: '费用金额', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
第一行是包含列名的标题,其中“费用描述”和“费用金额”成对出现多次。
期望的输出是一个df,每个描述都有一个唯一的列,重新组织的列按字母顺序排序,并且NaN显示为0:
| ID | 净成本 | 附加费 A | 附加费 B | 折扣 X | 折扣 Y |
|-----:|-------:|----------:|----------:|-------:|-------:|
| 1 | 30 | 9.5 | 2.5 | -11.5 | -3.25 |
| 2 | 40 | 0 | 0 | -12.5 | 0 |
| 3 | 50 | 0 | 0 | -11.5 | 0 |
| 4 | 35 | 0 | 3.5 | -5.5 | 0 |
| 5 | 45 | 9.5 | 4.5 | -10.5 | 0 |
英文:
I am importing a csv file into a pandas dataframe such as:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
ID | Net Cost | Charge Description | Charge Amount | Charge Description | Charge Amount | Charge Description | Charge Amount | Charge Description | Charge Amount |
1 | 30 | Surcharge A | 9.5 | Discount X | -11.5 | Discount Y | -3.25 | Surcharge B | 2.5 |
2 | 40 | Discount X | -12.5 | ||||||
3 | 50 | Discount X | -11.5 | ||||||
4 | 35 | Discount X | -5.5 | Surcharge B | 3.5 | ||||
5 | 45 | Surcharge A | 9.5 | Discount X | -10.5 | Surcharge B | 4.5 |
The first row are the headers with column names Charge Description and Charge Amount forming pairs and appearing multiple times.
Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:
ID | Net Cost | Surcharge A | Surcharge B | Discount X | Discount Y |
---|---|---|---|---|---|
1 | 30 | 9.5 | 2.5 | -11.5 | -3.25 |
2 | 40 | 0 | 0 | -12.5 | 0 |
3 | 50 | 0 | 0 | -11.5 | 0 |
4 | 35 | 0 | 3.5 | -5.5 | 0 |
5 | 45 | 9.5 | 4.5 | -10.5 | 0 |
This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.
答案1
得分: 2
I used the file you shared, and edited the columns with the initial dataframe df
shared (Pandas automatically adds suffixes to columns to make them unique) to keep the non uniqueness:
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount']
print(invoice)
ID Net Cost Charge Description Charge Amount ... Charge Description Charge Amount Charge Description Charge Amount
0 1 30 Surcharge A 9.5 ... Discount Y -3.25 Surcharge B 2.5
1 2 40 Discount X -12.5 ... NaN NaN NaN NaN
2 3 50 Discount X -11.5 ... NaN NaN NaN NaN
3 4 35 Discount X -5.5 ... NaN NaN NaN NaN
4 5 45 Surcharge A 9.5 ... Surcharge B 4.50 NaN NaN
First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description
is followed by charge amount
- we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge
and Discount
values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot
, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot
.
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
index = index,
names_to = arr,
names_pattern = arr,
dropna=True)
.pivot_wider(
index=index,
names_from='Charge Description',
values_from='Charge Amount')
.fillna(0)
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0.00 0.0 0.0
2 3 50 -11.5 0.00 0.0 0.0
3 4 35 -5.5 0.00 0.0 3.5
4 5 45 -10.5 0.00 9.5 4.5
Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID
and Net Cost
columns (ensure they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get ->
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can convert the data dtypes for Discount
to numeric
英文:
I used the file you shared, and edited the columns with the initial dataframe df
shared (Pandas automatically adds suffixes to columns to make them unique) to keep the non uniqueness:
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount']
print(invoice)
ID Net Cost Charge Description Charge Amount ... Charge Description Charge Amount Charge Description Charge Amount
0 1 30 Surcharge A 9.5 ... Discount Y -3.25 Surcharge B 2.5
1 2 40 Discount X -12.5 ... NaN NaN NaN NaN
2 3 50 Discount X -11.5 ... NaN NaN NaN NaN
3 4 35 Discount X -5.5 ... NaN NaN NaN NaN
4 5 45 Surcharge A 9.5 ... Surcharge B 4.50 NaN NaN
First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description
is followed by charge amount
- we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge
and Discount
values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot
, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot
.
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
index = index,
names_to = arr,
names_pattern = arr,
dropna=True)
.pivot_wider(
index=index,
names_from='Charge Description',
values_from='Charge Amount')
.fillna(0)
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0.00 0.0 0.0
2 3 50 -11.5 0.00 0.0 0.0
3 4 35 -5.5 0.00 0.0 3.5
4 5 45 -10.5 0.00 9.5 4.5
Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID
and Net Cost
columns (ensure they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get ->
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can convert the data dtypes for Discount
to numeric
答案2
得分: 0
你可以首先使用 melt
函数来展开你的数据框,然后在清理后使用 pivot_table
函数来重新塑造它:
# 第一步
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
.melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 第二步
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
out[~m].reset_index(drop=True)], axis=1)
.query("_value != ''")
.pivot_table(index=['ID', 'Net Cost'], columns='_value',
values='value', aggfunc='first')
.rename_axis(columns=None).reset_index().fillna(0))
输出:
>>> out
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
英文:
You can flatten your dataframe first with melt
then reshape with pivot_table
after cleaning it up:
# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
.melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
out[~m].reset_index(drop=True)], axis=1)
.query("_value != ''")
.pivot_table(index=['ID', 'Net Cost'], columns='_value',
values='value', aggfunc='first')
.rename_axis(columns=None).reset_index().fillna(0))
Output:
>>> out
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
答案3
得分: 0
您可以在连接成对的数据后使用 pivot_table
:
import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# 设置第一行为标题
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0, 1, i, i + 1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
print(df[df['Charge Description'] != 0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))
输出:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 0.00 0.0 0.0
3 50 -11.5 0.00 0.0 0.0
4 35 -5.5 0.00 0.0 3.5
5 45 -10.5 0.00 9.5 4.5
英文:
You can use pivot_table
after concatenating pair-wise:
import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0,1,i,i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
print(df[df['Charge Description']!=0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))
Output:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 0.00 0.0 0.0
3 50 -11.5 0.00 0.0 0.0
4 35 -5.5 0.00 0.0 3.5
5 45 -10.5 0.00 9.5 4.5
答案4
得分: 0
我将使用melt
函数来堆叠具有相同名称的列,然后使用pivot
函数来创建你想要的结果。
# 确保第一行现在是列名,然后删除第一行。
df.columns = df.iloc[0]
df = df[1:]
# 创建两个已经melt的数据框,并按索引将它们连接起来。
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# 进行一些清理工作,重命名df1中添加的'value'列。
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# 对数据进行透视。
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
df1的结果如下:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 NaN NaN NaN
3 50 -11.5 NaN NaN NaN
4 35 -5.5 NaN NaN 3.5
5 45 -10.5 NaN 9.5 4.5
英文:
I would use melt
to stack the identically named columns, then pivot to create the outcome you want.
# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little, rename the added 'value' column from df1.
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
Result of df1:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 NaN NaN NaN
3 50 -11.5 NaN NaN NaN
4 35 -5.5 NaN NaN 3.5
5 45 -10.5 NaN 9.5 4.5`
答案5
得分: 0
以下是翻译好的部分:
"My first thought was to read the data out into a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that."(我最初的想法是将数据读取到一个字典列表中,每个字典代表一行(从数据值中同时生成键和值),然后从中创建一个新的数据框。)
"For your example, that would make..."(对于你的示例,这将产生...)
"For the SMALL sample dataset, using comprehensions appears to be quite quick for that..."(对于小规模示例数据集,使用推导似乎相当快速...)
"To sort the column names, add the following..."(要对列名进行排序,请添加以下内容...)
"df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]"(df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]])
英文:
My first thought was to read the data out in to a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that.
For your example, that would make...
[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]
For the SMALL sample dataset, using comprehensions appears to be quite quick for that...
import pandas as pd
from itertools import chain
rows = [
{
name: value
for name, value in chain(
[
("ID", row[0]),
("Net Cost", row[1]),
],
zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
)
if name
}
for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)
Demo (including timings of this and three other answers):
EDIT:
To sort the column names, add the following...
df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论