英文:
Column Pair-wise aggregation and reorganization in Pandas
问题
我正在将一个csv文件导入到一个pandas dataframe中,如下所示:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: '净成本', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: '费用描述', 1: '附加费 A', 2: '折扣 X', 3: '折扣 X', 4: '折扣 X', 5: '附加费 A'}, 3: {0: '费用金额', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: '费用描述', 1: '折扣 X', 2: '', 3: '', 4: '附加费 B', 5: '折扣 X'}, 5: {0: '费用金额', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: '费用描述', 1: '折扣 Y', 2: '', 3: '', 4: '', 5: '附加费 B'}, 7: {0: '费用金额', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: '费用描述', 1: '附加费 B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: '费用金额', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
第一行是包含列名的标题,其中“费用描述”和“费用金额”成对出现多次。
期望的输出是一个df,每个描述都有一个唯一的列,重新组织的列按字母顺序排序,并且NaN显示为0:
|   ID |   净成本 |   附加费 A |   附加费 B |   折扣 X |   折扣 Y |
|-----:|-------:|----------:|----------:|-------:|-------:|
|    1 |     30 |       9.5 |       2.5 |  -11.5 |  -3.25 |
|    2 |     40 |         0 |         0 |  -12.5 |      0 |
|    3 |     50 |         0 |         0 |  -11.5 |      0 |
|    4 |     35 |         0 |       3.5 |   -5.5 |      0 |
|    5 |     45 |       9.5 |       4.5 |  -10.5 |      0 |
英文:
I am importing a csv file into a pandas dataframe such as:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 
|---|---|---|---|---|---|---|---|---|---|
| ID | Net Cost | Charge Description | Charge Amount | Charge Description | Charge Amount | Charge Description | Charge Amount | Charge Description | Charge Amount | 
| 1 | 30 | Surcharge A | 9.5 | Discount X | -11.5 | Discount Y | -3.25 | Surcharge B | 2.5 | 
| 2 | 40 | Discount X | -12.5 | ||||||
| 3 | 50 | Discount X | -11.5 | ||||||
| 4 | 35 | Discount X | -5.5 | Surcharge B | 3.5 | ||||
| 5 | 45 | Surcharge A | 9.5 | Discount X | -10.5 | Surcharge B | 4.5 | 
The first row are the headers with column names Charge Description and Charge Amount forming pairs and appearing multiple times.
Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:
| ID | Net Cost | Surcharge A | Surcharge B | Discount X | Discount Y | 
|---|---|---|---|---|---|
| 1 | 30 | 9.5 | 2.5 | -11.5 | -3.25 | 
| 2 | 40 | 0 | 0 | -12.5 | 0 | 
| 3 | 50 | 0 | 0 | -11.5 | 0 | 
| 4 | 35 | 0 | 3.5 | -5.5 | 0 | 
| 5 | 45 | 9.5 | 4.5 | -10.5 | 0 | 
This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.
答案1
得分: 2
I used the file you shared, and edited the columns with the initial dataframe df shared (Pandas automatically adds suffixes to columns to make them unique) to keep the non uniqueness:
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount']
print(invoice)
   ID  Net Cost Charge Description  Charge Amount  ... Charge Description  Charge Amount Charge Description  Charge Amount
0   1        30        Surcharge A            9.5  ...         Discount Y          -3.25        Surcharge B            2.5
1   2        40         Discount X          -12.5  ...                NaN            NaN                NaN            NaN
2   3        50         Discount X          -11.5  ...                NaN            NaN                NaN            NaN
3   4        35         Discount X           -5.5  ...                NaN            NaN                NaN            NaN
4   5        45        Surcharge A            9.5  ...        Surcharge B           4.50                NaN            NaN
First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description is followed by charge amount - we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot.
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
    index = index, 
    names_to = arr, 
    names_pattern = arr, 
    dropna=True)
.pivot_wider(
    index=index,
    names_from='Charge Description', 
    values_from='Charge Amount')
.fillna(0)
)
   ID  Net Cost  Discount X  Discount Y  Surcharge A  Surcharge B
0   1        30       -11.5       -3.25          9.5          2.5
1   2        40       -12.5        0.00          0.0          0.0
2   3        50       -11.5        0.00          0.0          0.0
3   4        35        -5.5        0.00          0.0          3.5
4   5        45       -10.5        0.00          9.5          4.5
Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID and Net Cost columns (ensure they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get -> 
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
   ID  Net Cost Discount X Discount Y Surcharge A Surcharge B
0   1        30      -11.5      -3.25         9.5         2.5
1   2        40      -12.5          0           0           0
2   3        50      -11.5          0           0           0
3   4        35       -5.5          0           0         3.5
4   5        45      -10.5          0         9.5         4.5
You can convert the data dtypes for Discount to numeric
英文:
I used the file you shared, and edited the columns with the initial dataframe df shared (Pandas automatically adds suffixes to columns to make them unique) to keep the non uniqueness:
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount']
print(invoice)
   ID  Net Cost Charge Description  Charge Amount  ... Charge Description  Charge Amount Charge Description  Charge Amount
0   1        30        Surcharge A            9.5  ...         Discount Y          -3.25        Surcharge B            2.5
1   2        40         Discount X          -12.5  ...                NaN            NaN                NaN            NaN
2   3        50         Discount X          -11.5  ...                NaN            NaN                NaN            NaN
3   4        35         Discount X           -5.5  ...                NaN            NaN                NaN            NaN
4   5        45        Surcharge A            9.5  ...        Surcharge B           4.50                NaN            NaN
First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description is followed by charge amount - we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot.
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
    index = index, 
    names_to = arr, 
    names_pattern = arr, 
    dropna=True)
.pivot_wider(
    index=index,
    names_from='Charge Description', 
    values_from='Charge Amount')
.fillna(0)
)
   ID  Net Cost  Discount X  Discount Y  Surcharge A  Surcharge B
0   1        30       -11.5       -3.25          9.5          2.5
1   2        40       -12.5        0.00          0.0          0.0
2   3        50       -11.5        0.00          0.0          0.0
3   4        35        -5.5        0.00          0.0          3.5
4   5        45       -10.5        0.00          9.5          4.5
Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID and Net Cost columns (ensure they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get -> 
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
   ID  Net Cost Discount X Discount Y Surcharge A Surcharge B
0   1        30      -11.5      -3.25         9.5         2.5
1   2        40      -12.5          0           0           0
2   3        50      -11.5          0           0           0
3   4        35       -5.5          0           0         3.5
4   5        45      -10.5          0         9.5         4.5
You can convert the data dtypes for Discount to numeric
答案2
得分: 0
你可以首先使用 melt 函数来展开你的数据框,然后在清理后使用 pivot_table 函数来重新塑造它:
# 第一步
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
         .melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 第二步
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
                  out[~m].reset_index(drop=True)], axis=1)
         .query("_value != ''")
         .pivot_table(index=['ID', 'Net Cost'], columns='_value',
                      values='value', aggfunc='first')
         .rename_axis(columns=None).reset_index().fillna(0))
输出:
>>> out
  ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0  1       30      -11.5      -3.25         9.5         2.5
1  2       40      -12.5          0           0           0
2  3       50      -11.5          0           0           0
3  4       35       -5.5          0           0         3.5
4  5       45      -10.5          0         9.5         4.5
英文:
You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:
# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
         .melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
                  out[~m].reset_index(drop=True)], axis=1)
         .query("_value != ''")
         .pivot_table(index=['ID', 'Net Cost'], columns='_value',
                      values='value', aggfunc='first')
         .rename_axis(columns=None).reset_index().fillna(0))
Output:
>>> out
  ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0  1       30      -11.5      -3.25         9.5         2.5
1  2       40      -12.5          0           0           0
2  3       50      -11.5          0           0           0
3  4       35       -5.5          0           0         3.5
4  5       45      -10.5          0         9.5         4.5
答案3
得分: 0
您可以在连接成对的数据后使用 pivot_table:
import pandas as pd
df = pd.DataFrame.from_dict(
    {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# 设置第一行为标题
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0, 1, i, i + 1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
print(df[df['Charge Description'] != 0]
    .pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
    .fillna(0))
输出:
Charge Description  Discount X  Discount Y  Surcharge A  Surcharge B
ID Net Cost                                                         
1  30                    -11.5       -3.25          9.5          2.5
2  40                    -12.5        0.00          0.0          0.0
3  50                    -11.5        0.00          0.0          0.0
4  35                     -5.5        0.00          0.0          3.5
5  45                    -10.5        0.00          9.5          4.5
英文:
You can use pivot_table after concatenating pair-wise:
import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0,1,i,i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
print(df[df['Charge Description']!=0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))
Output:
Charge Description  Discount X  Discount Y  Surcharge A  Surcharge B
ID Net Cost                                                         
1  30                    -11.5       -3.25          9.5          2.5
2  40                    -12.5        0.00          0.0          0.0
3  50                    -11.5        0.00          0.0          0.0
4  35                     -5.5        0.00          0.0          3.5
5  45                    -10.5        0.00          9.5          4.5
答案4
得分: 0
我将使用melt函数来堆叠具有相同名称的列,然后使用pivot函数来创建你想要的结果。
# 确保第一行现在是列名,然后删除第一行。
df.columns = df.iloc[0]
df = df[1:]
# 创建两个已经melt的数据框,并按索引将它们连接起来。
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# 进行一些清理工作,重命名df1中添加的'value'列。
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# 对数据进行透视。
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
df1的结果如下:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost                                                     
1  30                   -11.5      -3.25         9.5         2.5
2  40                   -12.5        NaN         NaN         NaN
3  50                   -11.5        NaN         NaN         NaN
4  35                    -5.5        NaN         NaN         3.5
5  45                   -10.5        NaN         9.5         4.5
英文:
I would use melt to stack the identically named columns, then pivot to create the outcome you want.
# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little, rename the added 'value' column from df1. 
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
Result of df1:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost                                                     
1  30                   -11.5      -3.25         9.5         2.5
2  40                   -12.5        NaN         NaN         NaN
3  50                   -11.5        NaN         NaN         NaN
4  35                    -5.5        NaN         NaN         3.5
5  45                   -10.5        NaN         9.5         4.5`
答案5
得分: 0
以下是翻译好的部分:
"My first thought was to read the data out into a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that."(我最初的想法是将数据读取到一个字典列表中,每个字典代表一行(从数据值中同时生成键和值),然后从中创建一个新的数据框。)
"For your example, that would make..."(对于你的示例,这将产生...)
"For the SMALL sample dataset, using comprehensions appears to be quite quick for that..."(对于小规模示例数据集,使用推导似乎相当快速...)
"To sort the column names, add the following..."(要对列名进行排序,请添加以下内容...)
"df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]"(df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]])
英文:
My first thought was to read the data out in to a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that.
For your example, that would make...
[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]
For the SMALL sample dataset, using comprehensions appears to be quite quick for that...
import pandas as pd
from itertools import chain
rows = [
{
name: value
for name, value in chain(
[
("ID", row[0]),
("Net Cost", row[1]),
],
zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
)
if name
}
for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)
Demo (including timings of this and three other answers):
EDIT:
To sort the column names, add the following...
df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论