列逐对聚合和重组在Pandas中

huangapple go评论72阅读模式
英文:

Column Pair-wise aggregation and reorganization in Pandas

问题

我正在将一个csv文件导入到一个pandas dataframe中,如下所示:

df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: '净成本', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: '费用描述', 1: '附加费 A', 2: '折扣 X', 3: '折扣 X', 4: '折扣 X', 5: '附加费 A'}, 3: {0: '费用金额', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: '费用描述', 1: '折扣 X', 2: '', 3: '', 4: '附加费 B', 5: '折扣 X'}, 5: {0: '费用金额', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: '费用描述', 1: '折扣 Y', 2: '', 3: '', 4: '', 5: '附加费 B'}, 7: {0: '费用金额', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: '费用描述', 1: '附加费 B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: '费用金额', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )

第一行是包含列名的标题,其中“费用描述”和“费用金额”成对出现多次。

期望的输出是一个df,每个描述都有一个唯一的列,重新组织的列按字母顺序排序,并且NaN显示为0:

|   ID |   净成本 |   附加费 A |   附加费 B |   折扣 X |   折扣 Y |
|-----:|-------:|----------:|----------:|-------:|-------:|
|    1 |     30 |       9.5 |       2.5 |  -11.5 |  -3.25 |
|    2 |     40 |         0 |         0 |  -12.5 |      0 |
|    3 |     50 |         0 |         0 |  -11.5 |      0 |
|    4 |     35 |         0 |       3.5 |   -5.5 |      0 |
|    5 |     45 |       9.5 |       4.5 |  -10.5 |      0 |
英文:

I am importing a csv file into a pandas dataframe such as:

df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
0 1 2 3 4 5 6 7 8 9
ID Net Cost Charge Description Charge Amount Charge Description Charge Amount Charge Description Charge Amount Charge Description Charge Amount
1 30 Surcharge A 9.5 Discount X -11.5 Discount Y -3.25 Surcharge B 2.5
2 40 Discount X -12.5
3 50 Discount X -11.5
4 35 Discount X -5.5 Surcharge B 3.5
5 45 Surcharge A 9.5 Discount X -10.5 Surcharge B 4.5

The first row are the headers with column names Charge Description and Charge Amount forming pairs and appearing multiple times.

Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:

ID Net Cost Surcharge A Surcharge B Discount X Discount Y
1 30 9.5 2.5 -11.5 -3.25
2 40 0 0 -12.5 0
3 50 0 0 -11.5 0
4 35 0 3.5 -5.5 0
5 45 9.5 4.5 -10.5 0

This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.

答案1

得分: 2

I used the file you shared, and edited the columns with the initial dataframe df shared (Pandas automatically adds suffixes to columns to make them unique) to keep the non uniqueness:

invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount']
print(invoice)
   ID  Net Cost Charge Description  Charge Amount  ... Charge Description  Charge Amount Charge Description  Charge Amount
0   1        30        Surcharge A            9.5  ...         Discount Y          -3.25        Surcharge B            2.5
1   2        40         Discount X          -12.5  ...                NaN            NaN                NaN            NaN
2   3        50         Discount X          -11.5  ...                NaN            NaN                NaN            NaN
3   4        35         Discount X           -5.5  ...                NaN            NaN                NaN            NaN
4   5        45        Surcharge A            9.5  ...        Surcharge B           4.50                NaN            NaN

First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description is followed by charge amount - we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot.

# pip install pyjanitor
import pandas as pd
import janitor

index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
    index = index, 
    names_to = arr, 
    names_pattern = arr, 
    dropna=True)
.pivot_wider(
    index=index,
    names_from='Charge Description', 
    values_from='Charge Amount')
.fillna(0)
)

   ID  Net Cost  Discount X  Discount Y  Surcharge A  Surcharge B
0   1        30       -11.5       -3.25          9.5          2.5
1   2        40       -12.5        0.00          0.0          0.0
2   3        50       -11.5        0.00          0.0          0.0
3   4        35        -5.5        0.00          0.0          3.5
4   5        45       -10.5        0.00          9.5          4.5

Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID and Net Cost columns (ensure they are correctly paired), and then pivot to get your final data:

index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get -> 
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)

   ID  Net Cost Discount X Discount Y Surcharge A Surcharge B
0   1        30      -11.5      -3.25         9.5         2.5
1   2        40      -12.5          0           0           0
2   3        50      -11.5          0           0           0
3   4        35       -5.5          0           0         3.5
4   5        45      -10.5          0         9.5         4.5

You can convert the data dtypes for Discount to numeric

英文:

I used the file you shared, and edited the columns with the initial dataframe df shared (Pandas automatically adds suffixes to columns to make them unique) to keep the non uniqueness:

invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount']
print(invoice)
   ID  Net Cost Charge Description  Charge Amount  ... Charge Description  Charge Amount Charge Description  Charge Amount
0   1        30        Surcharge A            9.5  ...         Discount Y          -3.25        Surcharge B            2.5
1   2        40         Discount X          -12.5  ...                NaN            NaN                NaN            NaN
2   3        50         Discount X          -11.5  ...                NaN            NaN                NaN            NaN
3   4        35         Discount X           -5.5  ...                NaN            NaN                NaN            NaN
4   5        45        Surcharge A            9.5  ...        Surcharge B           4.50                NaN            NaN

First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description is followed by charge amount - we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot.

# pip install pyjanitor
import pandas as pd
import janitor

index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
    index = index, 
    names_to = arr, 
    names_pattern = arr, 
    dropna=True)
.pivot_wider(
    index=index,
    names_from='Charge Description', 
    values_from='Charge Amount')
.fillna(0)
)

   ID  Net Cost  Discount X  Discount Y  Surcharge A  Surcharge B
0   1        30       -11.5       -3.25          9.5          2.5
1   2        40       -12.5        0.00          0.0          0.0
2   3        50       -11.5        0.00          0.0          0.0
3   4        35        -5.5        0.00          0.0          3.5
4   5        45       -10.5        0.00          9.5          4.5

Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID and Net Cost columns (ensure they are correctly paired), and then pivot to get your final data:

index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get -> 
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)

   ID  Net Cost Discount X Discount Y Surcharge A Surcharge B
0   1        30      -11.5      -3.25         9.5         2.5
1   2        40      -12.5          0           0           0
2   3        50      -11.5          0           0           0
3   4        35       -5.5          0           0         3.5
4   5        45      -10.5          0         9.5         4.5

You can convert the data dtypes for Discount to numeric

答案2

得分: 0

你可以首先使用 melt 函数来展开你的数据框,然后在清理后使用 pivot_table 函数来重新塑造它:

# 第一步
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
         .melt(['ID', 'Net Cost'], ignore_index=False))

m = out['variable'] == 'Charge Description'

# 第二步
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
                  out[~m].reset_index(drop=True)], axis=1)
         .query("_value != ''")
         .pivot_table(index=['ID', 'Net Cost'], columns='_value',
                      values='value', aggfunc='first')
         .rename_axis(columns=None).reset_index().fillna(0))

输出:

>>> out
  ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0  1       30      -11.5      -3.25         9.5         2.5
1  2       40      -12.5          0           0           0
2  3       50      -11.5          0           0           0
3  4       35       -5.5          0           0         3.5
4  5       45      -10.5          0         9.5         4.5
英文:

You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:

# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
         .melt(['ID', 'Net Cost'], ignore_index=False))

m = out['variable'] == 'Charge Description'

# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
                  out[~m].reset_index(drop=True)], axis=1)
         .query("_value != ''")
         .pivot_table(index=['ID', 'Net Cost'], columns='_value',
                      values='value', aggfunc='first')
         .rename_axis(columns=None).reset_index().fillna(0))

Output:

>>> out
  ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0  1       30      -11.5      -3.25         9.5         2.5
1  2       40      -12.5          0           0           0
2  3       50      -11.5          0           0           0
3  4       35       -5.5          0           0         3.5
4  5       45      -10.5          0         9.5         4.5

答案3

得分: 0

您可以在连接成对的数据后使用 pivot_table

import pandas as pd

df = pd.DataFrame.from_dict(
    {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})

# 设置第一行为标题
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)

df = pd.concat([df.iloc[:, [0, 1, i, i + 1]] for i in range(2, len(df.columns), 2)]).replace('', 0)

print(df[df['Charge Description'] != 0]
    .pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
    .fillna(0))

输出:

Charge Description  Discount X  Discount Y  Surcharge A  Surcharge B
ID Net Cost                                                         
1  30                    -11.5       -3.25          9.5          2.5
2  40                    -12.5        0.00          0.0          0.0
3  50                    -11.5        0.00          0.0          0.0
4  35                     -5.5        0.00          0.0          3.5
5  45                    -10.5        0.00          9.5          4.5
英文:

You can use pivot_table after concatenating pair-wise:

import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0,1,i,i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
print(df[df['Charge Description']!=0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))

Output:

Charge Description  Discount X  Discount Y  Surcharge A  Surcharge B
ID Net Cost                                                         
1  30                    -11.5       -3.25          9.5          2.5
2  40                    -12.5        0.00          0.0          0.0
3  50                    -11.5        0.00          0.0          0.0
4  35                     -5.5        0.00          0.0          3.5
5  45                    -10.5        0.00          9.5          4.5

答案4

得分: 0

我将使用melt函数来堆叠具有相同名称的列,然后使用pivot函数来创建你想要的结果。

# 确保第一行现在是列名,然后删除第一行。
df.columns = df.iloc[0]
df = df[1:]

# 创建两个已经melt的数据框,并按索引将它们连接起来。
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']

# 进行一些清理工作,重命名df1中添加的'value'列。
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()

# 对数据进行透视。
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')

df1的结果如下:

Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost                                                     
1  30                   -11.5      -3.25         9.5         2.5
2  40                   -12.5        NaN         NaN         NaN
3  50                   -11.5        NaN         NaN         NaN
4  35                    -5.5        NaN         NaN         3.5
5  45                   -10.5        NaN         9.5         4.5
英文:

I would use melt to stack the identically named columns, then pivot to create the outcome you want.

# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little, rename the added 'value' column from df1. 
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')

Result of df1:

Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost                                                     
1  30                   -11.5      -3.25         9.5         2.5
2  40                   -12.5        NaN         NaN         NaN
3  50                   -11.5        NaN         NaN         NaN
4  35                    -5.5        NaN         NaN         3.5
5  45                   -10.5        NaN         9.5         4.5`

答案5

得分: 0

以下是翻译好的部分:

"My first thought was to read the data out into a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that."(我最初的想法是将数据读取到一个字典列表中,每个字典代表一行(从数据值中同时生成键和值),然后从中创建一个新的数据框。)

"For your example, that would make..."(对于你的示例,这将产生...)

"For the SMALL sample dataset, using comprehensions appears to be quite quick for that..."(对于小规模示例数据集,使用推导似乎相当快速...)

"To sort the column names, add the following..."(要对列名进行排序,请添加以下内容...)

"df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]"(df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]])

英文:

My first thought was to read the data out in to a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that.

For your example, that would make...

[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]

For the SMALL sample dataset, using comprehensions appears to be quite quick for that...

import pandas as pd
from itertools import chain
rows = [
{
name: value
for name, value in chain(
[
("ID", row[0]),
("Net Cost", row[1]),
],
zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
)
if name
}
for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)

Demo (including timings of this and three other answers):

EDIT:

To sort the column names, add the following...

df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]

huangapple
  • 本文由 发表于 2023年2月16日 03:24:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/75464568.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定