从 pandas 透视表复制到数据框以计算小计

huangapple go评论61阅读模式
英文:

Copying from pandas pivot table to dataframe to compute subtotals

问题

以下是您要翻译的代码部分:

# Your original code...
pivot = df.pivot_table(
    values=['val1', 'val2', 'val3'], 
    index=['col1', 'col2', 'col3'], 
    aggfunc=np.sum, fill_value=0)

qry = f"Total == 'Y' & val1 == 0 & val2 == 0 & val3 == 0 & col2 != 0"

df.loc[df.eval(qry), ['val1', 'val2', 'val3']] = pivot.loc[(df['col1'], df['col2'], df['col3'])]

请注意,这是原始代码的翻译,不包括代码的解释或其他内容。如果您有任何进一步的问题或需要进一步的帮助,请随时提出。

英文:

(27-feb: edit 1, see below)

A question about pandas pivot tables and accessing information from this table.

My dataset is (simplified) as follows:

col1 col2 col3 total val1 val2 val3
   1    0    0     Y  246  912 1578
   1    1    0     Y  123  456  789
   1    1    1     N   61  228  394
   1    1    2     N   62  228  395
   1    2    0     Y  123  456  789
   1    2    1     N   61  228  394
   1    2    2     N   62  228  395

Explanation: a subtotal line is indicated by a Y which should add up to the running total of the underlying N lines. Columns 1,2,3 represent a hierarchy, so 1.1.1 plus 1.1.2 roll up to 1.1.0 and above that, 1.1.0 and 1.2.0 roll up to the end total of 1.0.0.

My problem: sometimes the subtotal lines are not filled. That results in an input of (after fillna(0)):

   1    0    0     Y  246  912 1578
   1    1    0     Y    0    0    0
   1    1    1     N   61  228  394
   1    1    2     N   62  228  395
   1    2    0     Y    0    0    0
   1    2    1     N   61  228  394
   1    2    2     N   62  228  395

What I thought would be a good way around this - or rather, to complete the dataframe since I need a dataframe that is completely filled - is to make a pivot table to compute the totals and then copy those values over to the main dataframe where total = Y but value = 0.

My attempt:

pivot = df.pivot_table(
    values=['val1', 'val2', 'val3'], 
    index=['col1', 'col2', 'col3'], 
    aggfunc=np.sum, fill_value=0)

# in reality there are more columns, so a mask of Total = Y only doesn't suffice
qry = f"Total == 'Y' & val1 == 0 & val2 == 0 & val3 == 0 & col2 != 0"

df.loc[df.eval(qry), ['val1', 'val2', 'val3']] = pivot.loc[(df['col1'], df['col2'], df['col3'])]

But no. I get a ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

Any help on getting this to work is most appreciated. Also, if there is a better way to fill the zeros with the totals from the row below, let me know.

Thanks,
Chris


Edit:
While I am personally more attracted to the oneliner-ness of rhug123, I could not get it to work. It yields an InvalidIndexError without further explanation. @rhug123, did I fail to adjust your example correctly?

columns = ['Value1', 'Value2', 'Value3', 'Value4', 'Value5']
index = ['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
df = (df.set_index(index).fillna(df.loc[df['Total'].eq('N')].groupby(index)[columns].sum()).reset_index())

The code snippet by Laurent worked first try and it seems like it is almost there. In the simplified df, it does its job, but in a real world example it sums too many rows. Or rather, the code snippet does not seem to take col1, col2 and col3 into account (those forming the unique id). @Laurent, how should I best add to your code to only sum the lines with the correct identifiying column values?

Real life sample from df: (col1-col3 are actually col1-col6 and val1-val3 are actually val1-val5)

, Article name, Year, Column1, Column2, Column3, Column4, Column5, Column6, Total, Description, Value1, Value2, Value3, Value4, Value5
(snip)
22, Wetgeving en controle TK, 2022, 2A, 3, 0, 4, 2, U, N, Onderzoeksbudget, 2383, 2383, -225, 0, 2158
23, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 0, U, J, Materiële uitgaven, , , , , 
24, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 1, U, N, Drukwerk, 1929, 1929, 79, 0, 2008
25, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 2, U, N, Fractiekosten, 38367, 41742, 1136, 0, 42878
26, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 3, U, N, Uitzending leden, 465, 465, 19, 0, 484
27, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 4, U, N, Parlementaire enquêtes, 2070, 3892, 67, 0, 3959
28, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 5, U, N, Bijdrage ProDemos, 2120, 2120, 81, 0, 2201
29, Wetgeving en controle TK, 2022, 2A, 3, 99, 0, 0, O, N, Ontvangsten, 3639, 3639, 0, 7700, 11339
30, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, V, N, Verplichtingen, 1535, 1535, 39, 0, 1574
31, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, U, J, Uitgaven, 1535, 1535, 39, 0, 1574
32, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 0, U, J, Materiële uitgaven, , , , , 
33, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 1, U, N, Interparlementaire betrekkingen, 1535, 1535, 39, 0, 1574
(snip)

In this sample, Laurents code fills row 23 with the sum of rows 24 through 31 (where it finds the next NA row at 32). For a large part of the dataset, this is a perfect solution. But not always unfortunately. In this example, it should take col1-col6 into account and only sum rows 24 through 28 because those rows have the same value for col1-col4 and col6 (with row 23 having col5=0)

答案1

得分: 1

以下是已翻译的代码部分:

  1. 第一个代码块中,您提供了一个名为df的数据框(DataFrame)的示例。
  2. 第二个代码块中,对数据框df进行了分割,将第一行为NA的子数据框进行求和,并将它们连接起来,生成了一个新的数据框new_df

第三个代码块中,您提供了另一个数据框df的示例,然后使用类似的方法将具有第一行为NA的子数据框进行求和,并将它们连接成一个新的数据框new_df

希望这有助于您理解代码的逻辑和功能。如果您有任何问题或需要进一步的帮助,请告诉我。

英文:

WIth the dataframe you provided:

import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 1, 1, 1, 1, 1, 1],
        "col2": [0, 1, 1, 1, 2, 2, 2],
        "col3": [0, 0, 1, 2, 0, 1, 2],
        "total": ["Y", "Y", "N", "N", "Y", "N", "N"],
        "val1": [246, pd.NA, 61, 62, pd.NA, 61, 62],
        "val2": [912, pd.NA, 228, 228, pd.NA, 228, 228],
        "val3": [1578, pd.NA, 394, 395, pd.NA, 394, 395],
    }
)

print(df)
# Output
   col1  col2  col3 total  val1  val2  val3
0     1     0     0     Y   246   912  1578
1     1     1     0     Y  <NA>  <NA>  <NA>
2     1     1     1     N    61   228   394
3     1     1     2     N    62   228   395
4     1     2     0     Y  <NA>  <NA>  <NA>
5     1     2     1     N    61   228   394
6     1     2     2     N    62   228   395

Here is another way to do it:

# Slice df in sub dataframes, in which first row is NA
# and the following are to be summed up
na_rows = df.loc[df[["val1", "val2", "val3"]].isna().all(axis=1), :].index
dfs = []
for i, _ in enumerate(na_rows):
    try:
        tmp = df.loc[na_rows[i] : na_rows[i + 1] - 1, :]
        tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
            ["val1", "val2", "val3"]
        ].sum()
        dfs.append(tmp)
    except IndexError:
        tmp = df.loc[na_rows[i] :, :]
        tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
            ["val1", "val2", "val3"]
        ].sum()
        dfs.append(tmp)

# Concatenate sub dataframes and avoid duplicated rows with df
tmp = pd.concat(dfs)
new_df = pd.concat([df[~df.index.isin(tmp.index)], tmp]).sort_index()

Then:

print(df)
# Output
   col1  col2  col3 total val1 val2  val3
0     1     0     0     Y  246  912  1578
1     1     1     0     Y  123  456   789
2     1     1     1     N   61  228   394
3     1     1     2     N   62  228   395
4     1     2     0     Y  123  456   789
5     1     2     1     N   61  228   394
6     1     2     2     N   62  228   395

As for your extended question, with a shortened version of your real life sample:

df = pd.DataFrame({'Column1': ['2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A'], 'Column2': ['3', '3', '3', '3', '3', '3', '3', '3', '4', '4', '4', '4'], 'Column3': ['0', '0', '0', '0', '0', '0', '0', '99', '0', '0', '0', '0'], 'Column4': ['4', '8', '8', '8', '8', '8', '8', '0', '0', '0', '8', '8'], 'Column5': ['2', '0', '1', '2', '3', '4', '5', 'O', '0', '0', '0', '1'], 'Column6': ['U', 'U', 'U', 'U', 'U', 'U', 'U', 'N', 'V', 'U', 'U', 'U'], 'Value1': [2383, <NA>, 1929, 38367, 465, 2070, 2120, 3639, 1535, 1535, <NA>, 1535], 'Value2': [2383, <NA>, 1929, 41742, 465, 3892, 2120, 0, 1535, 1535, <NA>, 1535], 'Value3': [-225, <NA>, 79, 1136, 19, 67, 81, 7700, 39, 39, <NA>, 39], 'Value4': [0, <NA>, 0, 0, 0, 0, 0, 11339, 0, 0, <NA>, 0], 'Value5': [2158.0, <NA>, 2008.0, 42878.0, 484.0, 3959.0, 2201.0, <NA>, 1574.0, 1574.0, <NA>, 1574.0]})

Here is one way to deal with it:

# Slice df in sub dataframes, in which first row is NA
# and the following are to be summed up
df = df.set_index(["Column1", "Column2", "Column3", "Column4"])
dfs = []
for idx in df.index.unique():
    tmp = df.loc[idx, :]
    if tmp.isna().any(axis=1).any() and tmp.shape[0] > 1:
        tmp.loc[
            tmp.isna().any(axis=1), ["Value1", "Value2", "Value3", "Value4", "Value5"]
        ] = (
            tmp[["Value1", "Value2", "Value3", "Value4", "Value5"]]
            .fillna(0)
            .sum()
            .tolist()
        )
        dfs.append(tmp)

# Concatenate sub dataframes and avoid duplicated rows with df
new_df = pd.concat(dfs)
new_df = (
    pd.concat([df[~df.index.isin(new_df.index)], new_df]).sort_index().reset_index()
)

Then:

   Column1 Column2 Column3 Column4 Column5 Column6   Value1   Value2  Value3  \
0       2A       3       0       4       2       U     2383     2383    -225   
1       2A       3       0       8       0       U  44951.0  50148.0  1382.0   
2       2A       3       0       8       1       U     1929     1929      79   
3       2A       3       0       8       2       U    38367    41742    1136   
4       2A       3       0       8       3       U      465      465      19   
5       2A       3       0       8       4       U     2070     3892      67   
6       2A       3       0       8       5       U     2120     2120      81   
7       2A       3      99       0       O       N     3639        0    7700   
8       2A       4       0       0       0       V     1535     1535      39   
9       2A       4       0       0       0       U     1535     1535      39   
10      2A       4       0       8       0       U   1535.0   1535.0    39.0   
11      2A       4       0       8       1       U     1535     1535      39   

   Value4   Value5  
0       0   2158.0  
1     0.0  51530.0  
2       0   2008.0  
3       0  42878.0  
4       0    484.0  
5       0   3959.0  
6       0   2201.0  
7   11339     <NA>  
8       0   1574.0  
9       0   1574.0  
10    0.0   1574.0  
11      0   1574.0  

答案2

得分: 0

尝试:

(df.set_index(['col1','col2'])
.fillna(df.loc[df['total'].eq('N')]
.groupby(['col1','col2'])[['val1','val2','val3']].sum())
.reset_index())

输出:

       col1  col2  col3 total   val1   val2   val3
    0     1     0     0     Y    246    912   1578
    1     1     1     0     Y  123.0  456.0  789.0
    2     1     1     1     N     61    228    394
    3     1     1     2     N     62    228    395
    4     1     2     0     Y  123.0  456.0  789.0
    5     1     2     1     N     61    228    394
    6     1     2     2     N     62    228    395
英文:

Try:

(df.set_index(['col1','col2'])
.fillna(df.loc[df['total'].eq('N')]
.groupby(['col1','col2'])[['val1','val2','val3']].sum())
.reset_index())

Output:

   col1  col2  col3 total   val1   val2   val3
0     1     0     0     Y    246    912   1578
1     1     1     0     Y  123.0  456.0  789.0
2     1     1     1     N     61    228    394
3     1     1     2     N     62    228    395
4     1     2     0     Y  123.0  456.0  789.0
5     1     2     1     N     61    228    394
6     1     2     2     N     62    228    395

huangapple
  • 本文由 发表于 2023年2月24日 16:54:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/75554436.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定