February 24, 2023, 16:54:00

Copying from pandas pivot table to dataframe to compute subtotals

Question

(27-feb: edit 1, see below)

A question about pandas pivot tables and accessing information from this table.

My dataset is (simplified) as follows:

col1 col2 col3 total val1 val2 val3
   1    0    0     Y  246  912 1578
   1    1    0     Y  123  456  789
   1    1    1     N   61  228  394
   1    1    2     N   62  228  395
   1    2    0     Y  123  456  789
   1    2    1     N   61  228  394
   1    2    2     N   62  228  395

Explanation: a subtotal line is indicated by a Y which should add up to the running total of the underlying N lines. Columns 1,2,3 represent a hierarchy, so 1.1.1 plus 1.1.2 roll up to 1.1.0 and above that, 1.1.0 and 1.2.0 roll up to the end total of 1.0.0.

My problem: sometimes the subtotal lines are not filled. That results in an input of (after fillna(0)):

   1    0    0     Y  246  912 1578
   1    1    0     Y    0    0    0
   1    1    1     N   61  228  394
   1    1    2     N   62  228  395
   1    2    0     Y    0    0    0
   1    2    1     N   61  228  394
   1    2    2     N   62  228  395

What I thought would be a good way around this - or rather, to complete the dataframe since I need a dataframe that is completely filled - is to make a pivot table to compute the totals and then copy those values over to the main dataframe where total = Y but value = 0.

My attempt:

pivot = df.pivot_table(
    values=['val1', 'val2', 'val3'],
    index=['col1', 'col2', 'col3'],
    aggfunc=np.sum, fill_value=0)
# in reality there are more columns, so a mask of Total = Y only doesn't suffice
qry = f"Total == 'Y' & val1 == 0 & val2 == 0 & val3 == 0 & col2 != 0"
df.loc[df.eval(qry), ['val1', 'val2', 'val3']] = pivot.loc[(df['col1'], df['col2'], df['col3'])]

But no. I get a ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

Any help on getting this to work is most appreciated. Also, if there is a better way to fill the zeros with the totals from the row below, let me know.

Thanks,
Chris

Edit:
While I am personally more attracted to the oneliner-ness of rhug123, I could not get it to work. It yields an InvalidIndexError without further explanation. @rhug123, did I fail to adjust your example correctly?

columns = ['Value1', 'Value2', 'Value3', 'Value4', 'Value5']
index = ['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
df = (df.set_index(index).fillna(df.loc[df['Total'].eq('N')].groupby(index)[columns].sum()).reset_index())

The code snippet by Laurent worked first try and it seems like it is almost there. On the simplified df it does its job, but on a real-world example it sums too many rows. Or rather, the code snippet does not seem to take col1, col2 and col3 into account (those forming the unique id). @Laurent, how should I best extend your code so that it only sums the lines with the correct identifying column values?

Real life sample from df: (col1-col3 are actually col1-col6 and val1-val3 are actually val1-val5)

, Article name, Year, Column1, Column2, Column3, Column4, Column5, Column6, Total, Description, Value1, Value2, Value3, Value4, Value5
(snip)
22, Wetgeving en controle TK, 2022, 2A, 3, 0, 4, 2, U, N, Onderzoeksbudget, 2383, 2383, -225, 0, 2158
23, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 0, U, J, Materiële uitgaven, , , , , 
24, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 1, U, N, Drukwerk, 1929, 1929, 79, 0, 2008
25, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 2, U, N, Fractiekosten, 38367, 41742, 1136, 0, 42878
26, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 3, U, N, Uitzending leden, 465, 465, 19, 0, 484
27, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 4, U, N, Parlementaire enquêtes, 2070, 3892, 67, 0, 3959
28, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 5, U, N, Bijdrage ProDemos, 2120, 2120, 81, 0, 2201
29, Wetgeving en controle TK, 2022, 2A, 3, 99, 0, 0, O, N, Ontvangsten, 3639, 3639, 0, 7700, 11339
30, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, V, N, Verplichtingen, 1535, 1535, 39, 0, 1574
31, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, U, J, Uitgaven, 1535, 1535, 39, 0, 1574
32, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 0, U, J, Materiële uitgaven, , , , , 
33, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 1, U, N, Interparlementaire betrekkingen, 1535, 1535, 39, 0, 1574
(snip)

In this sample, Laurent's code fills row 23 with the sum of rows 24 through 31 (it stops where it finds the next NA row, at 32). For a large part of the dataset this is a perfect solution, but unfortunately not always. In this example, it should take col1-col6 into account and sum only rows 24 through 28, because those rows have the same values for col1-col4 and col6 as row 23 (row 23 itself has col5=0).

Answer 1

Score: 1

With the dataframe you provided:

import pandas as pd
df = pd.DataFrame(
    {
        "col1": [1, 1, 1, 1, 1, 1, 1],
        "col2": [0, 1, 1, 1, 2, 2, 2],
        "col3": [0, 0, 1, 2, 0, 1, 2],
        "total": ["Y", "Y", "N", "N", "Y", "N", "N"],
        "val1": [246, pd.NA, 61, 62, pd.NA, 61, 62],
        "val2": [912, pd.NA, 228, 228, pd.NA, 228, 228],
        "val3": [1578, pd.NA, 394, 395, pd.NA, 394, 395],
    }
)
print(df)
# Output
   col1  col2  col3 total  val1  val2  val3
0     1     0     0     Y   246   912  1578
1     1     1     0     Y  <NA>  <NA>  <NA>
2     1     1     1     N    61   228   394
3     1     1     2     N    62   228   395
4     1     2     0     Y  <NA>  <NA>  <NA>
5     1     2     1     N    61   228   394
6     1     2     2     N    62   228   395

Here is another way to do it:

# Slice df in sub dataframes, in which first row is NA
# and the following are to be summed up
na_rows = df.loc[df[["val1", "val2", "val3"]].isna().all(axis=1), :].index
dfs = []
for i, _ in enumerate(na_rows):
    try:
        # Rows from this NA row up to (not including) the next NA row;
        # .copy() avoids assigning into a slice of df
        tmp = df.loc[na_rows[i] : na_rows[i + 1] - 1, :].copy()
        tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
            ["val1", "val2", "val3"]
        ].sum()
        dfs.append(tmp)
    except IndexError:
        # Last NA row: take everything up to the end of df
        tmp = df.loc[na_rows[i] :, :].copy()
        tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
            ["val1", "val2", "val3"]
        ].sum()
        dfs.append(tmp)
# Concatenate sub dataframes and avoid duplicated rows with df
tmp = pd.concat(dfs)
new_df = pd.concat([df[~df.index.isin(tmp.index)], tmp]).sort_index()

Then:

print(new_df)
# Output
   col1  col2  col3 total val1 val2  val3
0     1     0     0     Y  246  912  1578
1     1     1     0     Y  123  456   789
2     1     1     1     N   61  228   394
3     1     1     2     N   62  228   395
4     1     2     0     Y  123  456   789
5     1     2     1     N   61  228   394
6     1     2     2     N   62  228   395
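
The same "sum everything up to the next empty row" idea can also be sketched without an explicit loop. This is only an illustration under a few assumptions: the empty subtotal rows are stored as NaN (built from None here), the rows are already in hierarchy order, and the names is_empty, block, block_sums and filled are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 1, 1, 1, 1, 1, 1],
        "col2": [0, 1, 1, 1, 2, 2, 2],
        "col3": [0, 0, 1, 2, 0, 1, 2],
        "total": ["Y", "Y", "N", "N", "Y", "N", "N"],
        "val1": [246, None, 61, 62, None, 61, 62],
        "val2": [912, None, 228, 228, None, 228, 228],
        "val3": [1578, None, 394, 395, None, 394, 395],
    }
)
value_cols = ["val1", "val2", "val3"]

# True for subtotal rows whose values are all missing
is_empty = df[value_cols].isna().all(axis=1)

# Every empty row starts a new block; the rows that follow it
# belong to the same block until the next empty row
block = is_empty.cumsum()

# Per-block sums, broadcast back to every row (NaN is skipped by sum)
block_sums = df.groupby(block)[value_cols].transform("sum")

# Only overwrite the empty subtotal rows
filled = df.copy()
filled.loc[is_empty, value_cols] = block_sums.loc[is_empty, value_cols]
print(filled)
```

Like the loop, this relies purely on row order, so it has the same limitation: it does not look at col1-col3 when deciding which rows belong together.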

As for your extended question, with a shortened version of your real life sample:

df = pd.DataFrame({
    'Column1': ['2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A'],
    'Column2': ['3', '3', '3', '3', '3', '3', '3', '3', '4', '4', '4', '4'],
    'Column3': ['0', '0', '0', '0', '0', '0', '0', '99', '0', '0', '0', '0'],
    'Column4': ['4', '8', '8', '8', '8', '8', '8', '0', '0', '0', '8', '8'],
    'Column5': ['2', '0', '1', '2', '3', '4', '5', 'O', '0', '0', '0', '1'],
    'Column6': ['U', 'U', 'U', 'U', 'U', 'U', 'U', 'N', 'V', 'U', 'U', 'U'],
    'Value1': [2383, pd.NA, 1929, 38367, 465, 2070, 2120, 3639, 1535, 1535, pd.NA, 1535],
    'Value2': [2383, pd.NA, 1929, 41742, 465, 3892, 2120, 0, 1535, 1535, pd.NA, 1535],
    'Value3': [-225, pd.NA, 79, 1136, 19, 67, 81, 7700, 39, 39, pd.NA, 39],
    'Value4': [0, pd.NA, 0, 0, 0, 0, 0, 11339, 0, 0, pd.NA, 0],
    'Value5': [2158.0, pd.NA, 2008.0, 42878.0, 484.0, 3959.0, 2201.0, pd.NA, 1574.0, 1574.0, pd.NA, 1574.0],
})

Here is one way to deal with it:

# Group rows by the first four hierarchy columns; within each group,
# the NA row is filled with the sum of the other rows
df = df.set_index(["Column1", "Column2", "Column3", "Column4"])
dfs = []
for idx in df.index.unique():
    # .copy() avoids assigning into a slice of df
    tmp = df.loc[idx, :].copy()
    if tmp.isna().any(axis=1).any() and tmp.shape[0] > 1:
        tmp.loc[
            tmp.isna().any(axis=1), ["Value1", "Value2", "Value3", "Value4", "Value5"]
        ] = (
            tmp[["Value1", "Value2", "Value3", "Value4", "Value5"]]
            .fillna(0)
            .sum()
            .tolist()
        )
        dfs.append(tmp)
# Concatenate sub dataframes and avoid duplicated rows with df
new_df = pd.concat(dfs)
new_df = (
    pd.concat([df[~df.index.isin(new_df.index)], new_df]).sort_index().reset_index()
)

Then:

   Column1 Column2 Column3 Column4 Column5 Column6   Value1   Value2  Value3  \
0       2A       3       0       4       2       U     2383     2383    -225   
1       2A       3       0       8       0       U  44951.0  50148.0  1382.0   
2       2A       3       0       8       1       U     1929     1929      79   
3       2A       3       0       8       2       U    38367    41742    1136   
4       2A       3       0       8       3       U      465      465      19   
5       2A       3       0       8       4       U     2070     3892      67   
6       2A       3       0       8       5       U     2120     2120      81   
7       2A       3      99       0       O       N     3639        0    7700   
8       2A       4       0       0       0       V     1535     1535      39   
9       2A       4       0       0       0       U     1535     1535      39   
10      2A       4       0       8       0       U   1535.0   1535.0    39.0   
11      2A       4       0       8       1       U     1535     1535      39   
   Value4   Value5  
0       0   2158.0  
1     0.0  51530.0  
2       0   2008.0  
3       0  42878.0  
4       0    484.0  
5       0   3959.0  
6       0   2201.0  
7   11339     <NA>  
8       0   1574.0  
9       0   1574.0  
10    0.0   1574.0  
11      0   1574.0
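
For comparison, the per-group fill can be sketched with groupby and transform instead of a loop. This is a rough sketch, not a drop-in replacement: it uses a small subset of the rows from the question (23 to 25, 29, 32 and 33, with only Value1 and Value2), assumes the empty subtotal rows are NaN, and takes Column1-Column4 plus Column6 as the grouping key because the question describes those as the columns a subtotal shares with its detail rows. The names key_cols, is_empty, group_sums and filled are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Column1": ["2A", "2A", "2A", "2A", "2A", "2A"],
        "Column2": [3, 3, 3, 3, 4, 4],
        "Column3": [0, 0, 0, 99, 0, 0],
        "Column4": [8, 8, 8, 0, 8, 8],
        "Column5": [0, 1, 2, 0, 0, 1],
        "Column6": ["U", "U", "U", "O", "U", "U"],
        "Value1": [None, 1929, 38367, 3639, None, 1535],
        "Value2": [None, 1929, 41742, 3639, None, 1535],
    }
)
# Assumption: a subtotal and its detail rows agree on these columns
key_cols = ["Column1", "Column2", "Column3", "Column4", "Column6"]
value_cols = ["Value1", "Value2"]

# True for subtotal rows whose values are all missing
is_empty = df[value_cols].isna().all(axis=1)

# Sum per key group, broadcast back to every row (NaN is skipped by sum)
group_sums = df.groupby(key_cols)[value_cols].transform("sum")

# Only overwrite the empty subtotal rows
filled = df.copy()
filled.loc[is_empty, value_cols] = group_sums.loc[is_empty, value_cols]
print(filled)
```

One difference from the loop: an empty row that has no detail rows in its group would be filled with 0 here rather than left as NA, so that case may need separate handling.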

Answer 2

Score: 0

Try:

(df.set_index(['col1','col2'])
.fillna(df.loc[df['total'].eq('N')]
.groupby(['col1','col2'])[['val1','val2','val3']].sum())
.reset_index())

Output:

   col1  col2  col3 total   val1   val2   val3
0     1     0     0     Y    246    912   1578
1     1     1     0     Y  123.0  456.0  789.0
2     1     1     1     N     61    228    394
3     1     1     2     N     62    228    395
4     1     2     0     Y  123.0  456.0  789.0
5     1     2     1     N     61    228    394
6     1     2     2     N     62    228    395
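
The chain above can be unpacked into separate steps, which may make it easier to adapt to more key columns. The following is a sketch under assumptions: empty subtotals are NaN, and the sums are looked up with join on the key columns instead of through the index of df. The names key_cols, child_sums, lookup and filled are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 1, 1, 1, 1, 1, 1],
        "col2": [0, 1, 1, 1, 2, 2, 2],
        "col3": [0, 0, 1, 2, 0, 1, 2],
        "total": ["Y", "Y", "N", "N", "Y", "N", "N"],
        "val1": [246, None, 61, 62, None, 61, 62],
        "val2": [912, None, 228, 228, None, 228, 228],
        "val3": [1578, None, 394, 395, None, 394, 395],
    }
)
key_cols = ["col1", "col2"]
value_cols = ["val1", "val2", "val3"]

# Step 1: sum the detail (N) rows per parent key
child_sums = df.loc[df["total"].eq("N")].groupby(key_cols)[value_cols].sum()

# Step 2: look up the matching sum for every row of df;
# keys without detail rows (here col2 == 0) get NaN
lookup = df[key_cols].join(child_sums, on=key_cols)[value_cols]

# Step 3: fill only the missing values, existing values are kept
filled = df.copy()
filled[value_cols] = df[value_cols].fillna(lookup)
print(filled)
```

Because lookup keeps the original row index of df, the fill is aligned row by row and existing subtotals such as the grand total in row 0 stay untouched.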
