英文:
Copying from pandas pivot table to dataframe to compute subtotals
问题
以下是您要翻译的代码部分:
# Your original code...
pivot = df.pivot_table(
values=['val1', 'val2', 'val3'],
index=['col1', 'col2', 'col3'],
aggfunc=np.sum, fill_value=0)
qry = f"Total == 'Y' & val1 == 0 & val2 == 0 & val3 == 0 & col2 != 0"
df.loc[df.eval(qry), ['val1', 'val2', 'val3']] = pivot.loc[(df['col1'], df['col2'], df['col3'])]
请注意,这是原始代码的翻译,不包括代码的解释或其他内容。如果您有任何进一步的问题或需要进一步的帮助,请随时提出。
英文:
(27-feb: edit 1, see below)
A question about pandas pivot tables and accessing information from this table.
My dataset is (simplified) as follows:
col1 col2 col3 total val1 val2 val3
1 0 0 Y 246 912 1578
1 1 0 Y 123 456 789
1 1 1 N 61 228 394
1 1 2 N 62 228 395
1 2 0 Y 123 456 789
1 2 1 N 61 228 394
1 2 2 N 62 228 395
Explanation: a subtotal line is indicated by a Y which should add up to the running total of the underlying N lines. Columns 1,2,3 represent a hierarchy, so 1.1.1 plus 1.1.2 roll up to 1.1.0 and above that, 1.1.0 and 1.2.0 roll up to the end total of 1.0.0.
My problem: sometimes the subtotal lines are not filled. That results in an input of (after fillna(0)
):
1 0 0 Y 246 912 1578
1 1 0 Y 0 0 0
1 1 1 N 61 228 394
1 1 2 N 62 228 395
1 2 0 Y 0 0 0
1 2 1 N 61 228 394
1 2 2 N 62 228 395
What I thought would be a good way around this - or rather, to complete the dataframe since I need a dataframe that is completely filled - is to make a pivot table to compute the totals and then copy those values over to the main dataframe where total = Y but value = 0.
My attempt:
pivot = df.pivot_table(
values=['val1', 'val2', 'val3'],
index=['col1', 'col2', 'col3'],
aggfunc=np.sum, fill_value=0)
# in reality there are more columns, so a mask of Total = Y only doesn't suffice
qry = f"Total == 'Y' & val1 == 0 & val2 == 0 & val3 == 0 & col2 != 0"
df.loc[df.eval(qry), ['val1', 'val2', 'val3']] = pivot.loc[(df['col1'], df['col2'], df['col3'])]
But no. I get a ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
Any help on getting this to work is most appreciated. Also, if there is a better way to fill the zeros with the totals from the row below, let me know.
Thanks,
Chris
Edit:
While I am personally more attracted to the oneliner-ness of rhug123, I could not get it to work. It yields an InvalidIndexError
without further explanation. @rhug123, did I fail to adjust your example correctly?
columns = ['Value1', 'Value2', 'Value3', 'Value4', 'Value5']
index = ['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
df = (df.set_index(index).fillna(df.loc[df['Total'].eq('N')].groupby(index)[columns].sum()).reset_index())
The code snippet by Laurent worked first try and it seems like it is almost there. In the simplified df, it does its job, but in a real world example it sums too many rows. Or rather, the code snippet does not seem to take col1, col2 and col3 into account (those forming the unique id). @Laurent, how should I best add to your code to only sum the lines with the correct identifiying column values?
Real life sample from df: (col1-col3 are actually col1-col6 and val1-val3 are actually val1-val5)
, Article name, Year, Column1, Column2, Column3, Column4, Column5, Column6, Total, Description, Value1, Value2, Value3, Value4, Value5
(snip)
22, Wetgeving en controle TK, 2022, 2A, 3, 0, 4, 2, U, N, Onderzoeksbudget, 2383, 2383, -225, 0, 2158
23, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 0, U, J, Materiële uitgaven, , , , ,
24, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 1, U, N, Drukwerk, 1929, 1929, 79, 0, 2008
25, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 2, U, N, Fractiekosten, 38367, 41742, 1136, 0, 42878
26, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 3, U, N, Uitzending leden, 465, 465, 19, 0, 484
27, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 4, U, N, Parlementaire enquêtes, 2070, 3892, 67, 0, 3959
28, Wetgeving en controle TK, 2022, 2A, 3, 0, 8, 5, U, N, Bijdrage ProDemos, 2120, 2120, 81, 0, 2201
29, Wetgeving en controle TK, 2022, 2A, 3, 99, 0, 0, O, N, Ontvangsten, 3639, 3639, 0, 7700, 11339
30, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, V, N, Verplichtingen, 1535, 1535, 39, 0, 1574
31, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 0, 0, U, J, Uitgaven, 1535, 1535, 39, 0, 1574
32, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 0, U, J, Materiële uitgaven, , , , ,
33, Wetgeving en controle EK en TK, 2022, 2A, 4, 0, 8, 1, U, N, Interparlementaire betrekkingen, 1535, 1535, 39, 0, 1574
(snip)
In this sample, Laurents code fills row 23 with the sum of rows 24 through 31 (where it finds the next NA row at 32). For a large part of the dataset, this is a perfect solution. But not always unfortunately. In this example, it should take col1-col6 into account and only sum rows 24 through 28 because those rows have the same value for col1-col4 and col6 (with row 23 having col5=0)
答案1
得分: 1
以下是已翻译的代码部分:
- 第一个代码块中,您提供了一个名为
df
的数据框(DataFrame)的示例。 - 第二个代码块中,对数据框
df
进行了分割,将第一行为NA的子数据框进行求和,并将它们连接起来,生成了一个新的数据框new_df
。
第三个代码块中,您提供了另一个数据框df
的示例,然后使用类似的方法将具有第一行为NA的子数据框进行求和,并将它们连接成一个新的数据框new_df
。
希望这有助于您理解代码的逻辑和功能。如果您有任何问题或需要进一步的帮助,请告诉我。
英文:
WIth the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
{
"col1": [1, 1, 1, 1, 1, 1, 1],
"col2": [0, 1, 1, 1, 2, 2, 2],
"col3": [0, 0, 1, 2, 0, 1, 2],
"total": ["Y", "Y", "N", "N", "Y", "N", "N"],
"val1": [246, pd.NA, 61, 62, pd.NA, 61, 62],
"val2": [912, pd.NA, 228, 228, pd.NA, 228, 228],
"val3": [1578, pd.NA, 394, 395, pd.NA, 394, 395],
}
)
print(df)
# Output
col1 col2 col3 total val1 val2 val3
0 1 0 0 Y 246 912 1578
1 1 1 0 Y <NA> <NA> <NA>
2 1 1 1 N 61 228 394
3 1 1 2 N 62 228 395
4 1 2 0 Y <NA> <NA> <NA>
5 1 2 1 N 61 228 394
6 1 2 2 N 62 228 395
Here is another way to do it:
# Slice df in sub dataframes, in which first row is NA
# and the following are to be summed up
na_rows = df.loc[df[["val1", "val2", "val3"]].isna().all(axis=1), :].index
dfs = []
for i, _ in enumerate(na_rows):
try:
tmp = df.loc[na_rows[i] : na_rows[i + 1] - 1, :]
tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
["val1", "val2", "val3"]
].sum()
dfs.append(tmp)
except IndexError:
tmp = df.loc[na_rows[i] :, :]
tmp.loc[na_rows[i], ["val1", "val2", "val3"]] = tmp[
["val1", "val2", "val3"]
].sum()
dfs.append(tmp)
# Concatenate sub dataframes and avoid duplicated rows with df
tmp = pd.concat(dfs)
new_df = pd.concat([df[~df.index.isin(tmp.index)], tmp]).sort_index()
Then:
print(df)
# Output
col1 col2 col3 total val1 val2 val3
0 1 0 0 Y 246 912 1578
1 1 1 0 Y 123 456 789
2 1 1 1 N 61 228 394
3 1 1 2 N 62 228 395
4 1 2 0 Y 123 456 789
5 1 2 1 N 61 228 394
6 1 2 2 N 62 228 395
As for your extended question, with a shortened version of your real life sample:
df = pd.DataFrame({'Column1': ['2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A', '2A'], 'Column2': ['3', '3', '3', '3', '3', '3', '3', '3', '4', '4', '4', '4'], 'Column3': ['0', '0', '0', '0', '0', '0', '0', '99', '0', '0', '0', '0'], 'Column4': ['4', '8', '8', '8', '8', '8', '8', '0', '0', '0', '8', '8'], 'Column5': ['2', '0', '1', '2', '3', '4', '5', 'O', '0', '0', '0', '1'], 'Column6': ['U', 'U', 'U', 'U', 'U', 'U', 'U', 'N', 'V', 'U', 'U', 'U'], 'Value1': [2383, <NA>, 1929, 38367, 465, 2070, 2120, 3639, 1535, 1535, <NA>, 1535], 'Value2': [2383, <NA>, 1929, 41742, 465, 3892, 2120, 0, 1535, 1535, <NA>, 1535], 'Value3': [-225, <NA>, 79, 1136, 19, 67, 81, 7700, 39, 39, <NA>, 39], 'Value4': [0, <NA>, 0, 0, 0, 0, 0, 11339, 0, 0, <NA>, 0], 'Value5': [2158.0, <NA>, 2008.0, 42878.0, 484.0, 3959.0, 2201.0, <NA>, 1574.0, 1574.0, <NA>, 1574.0]})
Here is one way to deal with it:
# Slice df in sub dataframes, in which first row is NA
# and the following are to be summed up
df = df.set_index(["Column1", "Column2", "Column3", "Column4"])
dfs = []
for idx in df.index.unique():
tmp = df.loc[idx, :]
if tmp.isna().any(axis=1).any() and tmp.shape[0] > 1:
tmp.loc[
tmp.isna().any(axis=1), ["Value1", "Value2", "Value3", "Value4", "Value5"]
] = (
tmp[["Value1", "Value2", "Value3", "Value4", "Value5"]]
.fillna(0)
.sum()
.tolist()
)
dfs.append(tmp)
# Concatenate sub dataframes and avoid duplicated rows with df
new_df = pd.concat(dfs)
new_df = (
pd.concat([df[~df.index.isin(new_df.index)], new_df]).sort_index().reset_index()
)
Then:
Column1 Column2 Column3 Column4 Column5 Column6 Value1 Value2 Value3 \
0 2A 3 0 4 2 U 2383 2383 -225
1 2A 3 0 8 0 U 44951.0 50148.0 1382.0
2 2A 3 0 8 1 U 1929 1929 79
3 2A 3 0 8 2 U 38367 41742 1136
4 2A 3 0 8 3 U 465 465 19
5 2A 3 0 8 4 U 2070 3892 67
6 2A 3 0 8 5 U 2120 2120 81
7 2A 3 99 0 O N 3639 0 7700
8 2A 4 0 0 0 V 1535 1535 39
9 2A 4 0 0 0 U 1535 1535 39
10 2A 4 0 8 0 U 1535.0 1535.0 39.0
11 2A 4 0 8 1 U 1535 1535 39
Value4 Value5
0 0 2158.0
1 0.0 51530.0
2 0 2008.0
3 0 42878.0
4 0 484.0
5 0 3959.0
6 0 2201.0
7 11339 <NA>
8 0 1574.0
9 0 1574.0
10 0.0 1574.0
11 0 1574.0
答案2
得分: 0
尝试:
(df.set_index(['col1','col2'])
.fillna(df.loc[df['total'].eq('N')]
.groupby(['col1','col2'])[['val1','val2','val3']].sum())
.reset_index())
输出:
col1 col2 col3 total val1 val2 val3
0 1 0 0 Y 246 912 1578
1 1 1 0 Y 123.0 456.0 789.0
2 1 1 1 N 61 228 394
3 1 1 2 N 62 228 395
4 1 2 0 Y 123.0 456.0 789.0
5 1 2 1 N 61 228 394
6 1 2 2 N 62 228 395
英文:
Try:
(df.set_index(['col1','col2'])
.fillna(df.loc[df['total'].eq('N')]
.groupby(['col1','col2'])[['val1','val2','val3']].sum())
.reset_index())
Output:
col1 col2 col3 total val1 val2 val3
0 1 0 0 Y 246 912 1578
1 1 1 0 Y 123.0 456.0 789.0
2 1 1 1 N 61 228 394
3 1 1 2 N 62 228 395
4 1 2 0 Y 123.0 456.0 789.0
5 1 2 1 N 61 228 394
6 1 2 2 N 62 228 395
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论