从 pandas 透视表复制到数据框以计算小计

huangapple go评论92阅读模式
英文:

Copying from pandas pivot table to dataframe to compute subtotals

问题

以下是您要翻译的代码部分:

  1. # Your original code...
  2. pivot = df.pivot_table(
  3. values=['val1', 'val2', 'val3'],
  4. index=['col1', 'col2', 'col3'],
  5. aggfunc=np.sum, fill_value=0)
  6. qry = f"Total == 'Y' & val1 == 0 & val2 == 0 & val3 == 0 & col2 != 0"
  7. df.loc[df.eval(qry), ['val1', 'val2', 'val3']] = pivot.loc[(df['col1'], df['col2'], df['col3'])]

请注意,这是原始代码的翻译,不包括代码的解释或其他内容。如果您有任何进一步的问题或需要进一步的帮助,请随时提出。

英文:

(27-feb: edit 1, see below)

A question about pandas pivot tables and accessing information from this table.

My dataset is (simplified) as follows:

  1. col1 col2 col3 total val1 val2 val3
  2. 1 0 0 Y 246 912 1578
  3. 1 1 0 Y 123 456 789
  4. 1 1 1 N 61 228 394
  5. 1 1 2 N 62 228 395
  6. 1 2 0 Y 123 456 789
  7. 1 2 1 N 61 228 394
  8. 1 2 2 N 62 228 395

Explanation: a subtotal line is indicated by a Y which should add up to the running total of the underlying N lines. Columns 1,2,3 represent a hierarchy, so 1.1.1 plus 1.1.2 roll up to 1.1.0 and above that, 1.1.0 and 1.2.0 roll up to the end total of 1.0.0.

My problem: sometimes the subtotal lines are not filled. That results in an input of (after fillna(0)):

  1. 1 0 0 Y 246 912 1578
  2. 1 1 0 Y 0 0 0
  3. 1 1 1 N 61 228 394
  4. 1 1 2 N 62 228 395
  5. 1 2 0 Y 0 0 0
  6. 1 2 1 N 61 228 394
  7. 1 2 2 N 62 228 395

What I thought would be a good way around this - or rather, to complete the dataframe since I need a dataframe that is completely filled - is to make a pivot table to compute the totals and then copy those values over to the main dataframe where total = Y but value = 0.

My attempt:

  1. pivot = df.pivot_table(
  2. values=['val1', 'val2', 'val3'],
  3. index=['col1', 'col2', 'col3'],
  4. aggfunc=np.sum, fill_value=0)
  5. # in reality there are more columns, so a mask of Total = Y only doesn't suffice
  6. qry = f"Total == 'Y' & val1 == 0 & val2 == 0 & val3 == 0 & col2 != 0"
  7. df.loc[df.eval(qry), ['val1', 'val2', 'val3']] = pivot.loc[(df['col1'], df['col2'], df['col3'])]

But no. I get a ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

Any help on getting this to work is most appreciated. Also, if there is a better way to fill the zeros with the totals from the row below, let me know.

Thanks,
Chris


Edit:
While I am personally more attracted to the oneliner-ness of rhug123, I could not get it to work. It yields an InvalidIndexError without further explanation. @rhug123, did I fail to adjust your example correctly?

  1. columns = ['Value1', 'Value2', 'Value3', 'Value4', 'Value5']
  2. index = ['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
  3. df = (df.set_index(index).fillna(df.loc[df['Total'].eq('N')].groupby(index)[columns].sum()).reset_index())

The code snippet by Laurent worked first try and it seems like it is almost there. In the simplified df, it does its job, but in a real world example it sums too many rows. Or rather, the code snippet does not seem to take col1, col2 and col3 into account (those forming the unique id). @Laurent, how should I best add to your code to only sum the lines with the correct identifiying column values?

Real life sample from df: (col1-col3 are actually col1-col6 and val1-val3 are actually val1-val5)

  1. , Article name, Year, Column1, Column2, Column3, Column4, Column5, Column6, Total, Description, Value1, Value2, Value3, Value4, Value5
  2. (snip)
  3. 22, Wetgeving en controle TK, 2022, 2A, 3, 0, 4, 2, U, N, Onderzoeksbudget, 2383, 2383, -225, 0, 2158
  4. 23, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 0, U, J, Materiële uitgaven, , , , ,
  5. 24, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 1, U, N, Drukwerk, 1929, 1929, 79, 0, 2008
  6. 25, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 2, U, N, Fractiekosten, 38367, 41742, 1136, 0, 42878
  7. 26, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 3, U, N, Uitzending leden, 465, 465, 19, 0, 484
  8. 27, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 4, U, N, Parlementaire enquêtes, 2070, 3892, 67, 0, 3959
  9. 28, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 5, U, N, Bijdrage ProDemos, 2120, 2120, 81, 0, 2201
  10. 29, Wetgeving en controle TK, 2022, 2A, 3, 99, 0, 0, O, N, Ontvangsten, 3639, 3639, 0, 7700, 11339
  11. 30, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, V, N, Verplichtingen, 1535, 1535, 39, 0, 1574
  12. 31, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, U, J, Uitgaven, 1535, 1535, 39, 0, 1574
  13. 32, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 0, U, J, Materiële uitgaven, , , , ,
  14. 33, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 1, U, N, Interparlementaire betrekkingen, 1535, 1535, 39, 0, 1574
  15. (snip)

In this sample, Laurents code fills row 23 with the sum of rows 24 through 31 (where it finds the next NA row at 32). For a large part of the dataset, this is a perfect solution. But not always unfortunately. In this example, it should take col1-col6 into account and only sum rows 24 through 28 because those rows have the same value for col1-col4 and col6 (with row 23 having col5=0)

答案1

得分: 1

以下是已翻译的代码部分:

  1. 第一个代码块中,您提供了一个名为df的数据框(DataFrame)的示例。
  2. 第二个代码块中,对数据框df进行了分割,将第一行为NA的子数据框进行求和,并将它们连接起来,生成了一个新的数据框new_df

第三个代码块中,您提供了另一个数据框df的示例,然后使用类似的方法将具有第一行为NA的子数据框进行求和,并将它们连接成一个新的数据框new_df

希望这有助于您理解代码的逻辑和功能。如果您有任何问题或需要进一步的帮助,请告诉我。

英文:

WIth the dataframe you provided:

  1. import pandas as pd
  2. df = pd.DataFrame(
  3. {
  4. "col1": [1, 1, 1, 1, 1, 1, 1],
  5. "col2": [0, 1, 1, 1, 2, 2, 2],
  6. "col3": [0, 0, 1, 2, 0, 1, 2],
  7. "total": ["Y", "Y", "N", "N", "Y", "N", "N"],
  8. "val1": [246, pd.NA, 61, 62, pd.NA, 61, 62],
  9. "val2": [912, pd.NA, 228, 228, pd.NA, 228, 228],
  10. "val3": [1578, pd.NA, 394, 395, pd.NA, 394, 395],
  11. }
  12. )
  13. print(df)
  14. # Output
  15. col1 col2 col3 total val1 val2 val3
  16. 0 1 0 0 Y 246 912 1578
  17. 1 1 1 0 Y <NA> <NA> <NA>
  18. 2 1 1 1 N 61 228 394
  19. 3 1 1 2 N 62 228 395
  20. 4 1 2 0 Y <NA> <NA> <NA>
  21. 5 1 2 1 N 61 228 394
  22. 6 1 2 2 N 62 228 395

Here is another way to do it:

  1. # Slice df in sub dataframes, in which first row is NA
  2. # and the following are to be summed up
  3. na_rows = df.loc[df[["val1", "val2", "val3"]].isna().all(axis=1), :].index
  4. dfs = []
  5. for i, _ in enumerate(na_rows):
  6. try:
  7. tmp = df.loc[na_rows[i] : na_rows[i + 1] - 1, :]
  8. tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
  9. ["val1", "val2", "val3"]
  10. ].sum()
  11. dfs.append(tmp)
  12. except IndexError:
  13. tmp = df.loc[na_rows[i] :, :]
  14. tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
  15. ["val1", "val2", "val3"]
  16. ].sum()
  17. dfs.append(tmp)
  18. # Concatenate sub dataframes and avoid duplicated rows with df
  19. tmp = pd.concat(dfs)
  20. new_df = pd.concat([df[~df.index.isin(tmp.index)], tmp]).sort_index()

Then:

  1. print(df)
  2. # Output
  3. col1 col2 col3 total val1 val2 val3
  4. 0 1 0 0 Y 246 912 1578
  5. 1 1 1 0 Y 123 456 789
  6. 2 1 1 1 N 61 228 394
  7. 3 1 1 2 N 62 228 395
  8. 4 1 2 0 Y 123 456 789
  9. 5 1 2 1 N 61 228 394
  10. 6 1 2 2 N 62 228 395

As for your extended question, with a shortened version of your real life sample:

  1. df = pd.DataFrame({'Column1': ['2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A'], 'Column2': ['3', '3', '3', '3', '3', '3', '3', '3', '4', '4', '4', '4'], 'Column3': ['0', '0', '0', '0', '0', '0', '0', '99', '0', '0', '0', '0'], 'Column4': ['4', '8', '8', '8', '8', '8', '8', '0', '0', '0', '8', '8'], 'Column5': ['2', '0', '1', '2', '3', '4', '5', 'O', '0', '0', '0', '1'], 'Column6': ['U', 'U', 'U', 'U', 'U', 'U', 'U', 'N', 'V', 'U', 'U', 'U'], 'Value1': [2383, <NA>, 1929, 38367, 465, 2070, 2120, 3639, 1535, 1535, <NA>, 1535], 'Value2': [2383, <NA>, 1929, 41742, 465, 3892, 2120, 0, 1535, 1535, <NA>, 1535], 'Value3': [-225, <NA>, 79, 1136, 19, 67, 81, 7700, 39, 39, <NA>, 39], 'Value4': [0, <NA>, 0, 0, 0, 0, 0, 11339, 0, 0, <NA>, 0], 'Value5': [2158.0, <NA>, 2008.0, 42878.0, 484.0, 3959.0, 2201.0, <NA>, 1574.0, 1574.0, <NA>, 1574.0]})

Here is one way to deal with it:

  1. # Slice df in sub dataframes, in which first row is NA
  2. # and the following are to be summed up
  3. df = df.set_index(["Column1", "Column2", "Column3", "Column4"])
  4. dfs = []
  5. for idx in df.index.unique():
  6. tmp = df.loc[idx, :]
  7. if tmp.isna().any(axis=1).any() and tmp.shape[0] > 1:
  8. tmp.loc[
  9. tmp.isna().any(axis=1), ["Value1", "Value2", "Value3", "Value4", "Value5"]
  10. ] = (
  11. tmp[["Value1", "Value2", "Value3", "Value4", "Value5"]]
  12. .fillna(0)
  13. .sum()
  14. .tolist()
  15. )
  16. dfs.append(tmp)
  17. # Concatenate sub dataframes and avoid duplicated rows with df
  18. new_df = pd.concat(dfs)
  19. new_df = (
  20. pd.concat([df[~df.index.isin(new_df.index)], new_df]).sort_index().reset_index()
  21. )

Then:

  1. Column1 Column2 Column3 Column4 Column5 Column6 Value1 Value2 Value3 \
  2. 0 2A 3 0 4 2 U 2383 2383 -225
  3. 1 2A 3 0 8 0 U 44951.0 50148.0 1382.0
  4. 2 2A 3 0 8 1 U 1929 1929 79
  5. 3 2A 3 0 8 2 U 38367 41742 1136
  6. 4 2A 3 0 8 3 U 465 465 19
  7. 5 2A 3 0 8 4 U 2070 3892 67
  8. 6 2A 3 0 8 5 U 2120 2120 81
  9. 7 2A 3 99 0 O N 3639 0 7700
  10. 8 2A 4 0 0 0 V 1535 1535 39
  11. 9 2A 4 0 0 0 U 1535 1535 39
  12. 10 2A 4 0 8 0 U 1535.0 1535.0 39.0
  13. 11 2A 4 0 8 1 U 1535 1535 39
  14. Value4 Value5
  15. 0 0 2158.0
  16. 1 0.0 51530.0
  17. 2 0 2008.0
  18. 3 0 42878.0
  19. 4 0 484.0
  20. 5 0 3959.0
  21. 6 0 2201.0
  22. 7 11339 <NA>
  23. 8 0 1574.0
  24. 9 0 1574.0
  25. 10 0.0 1574.0
  26. 11 0 1574.0

答案2

得分: 0

尝试:

  1. (df.set_index(['col1','col2'])
  2. .fillna(df.loc[df['total'].eq('N')]
  3. .groupby(['col1','col2'])[['val1','val2','val3']].sum())
  4. .reset_index())

输出:

  1. col1 col2 col3 total val1 val2 val3
  2. 0 1 0 0 Y 246 912 1578
  3. 1 1 1 0 Y 123.0 456.0 789.0
  4. 2 1 1 1 N 61 228 394
  5. 3 1 1 2 N 62 228 395
  6. 4 1 2 0 Y 123.0 456.0 789.0
  7. 5 1 2 1 N 61 228 394
  8. 6 1 2 2 N 62 228 395
英文:

Try:

  1. (df.set_index(['col1','col2'])
  2. .fillna(df.loc[df['total'].eq('N')]
  3. .groupby(['col1','col2'])[['val1','val2','val3']].sum())
  4. .reset_index())

Output:

  1. col1 col2 col3 total val1 val2 val3
  2. 0 1 0 0 Y 246 912 1578
  3. 1 1 1 0 Y 123.0 456.0 789.0
  4. 2 1 1 1 N 61 228 394
  5. 3 1 1 2 N 62 228 395
  6. 4 1 2 0 Y 123.0 456.0 789.0
  7. 5 1 2 1 N 61 228 394
  8. 6 1 2 2 N 62 228 395

huangapple
  • 本文由 发表于 2023年2月24日 16:54:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/75554436.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定