Pandas groupby and sum are dropping numeric columns
Question
I have the following Python/Pandas code:
standardized_df = get_somehow()
standardized_df['TermDaysAmountProduct'] = standardized_df['TermDays'] * standardized_df['Amount']
standardized_df['DaysToCollectAmountProduct'] = standardized_df['DaysToCollect'] * standardized_df['Amount']
logger.info("standardized_df cols are {}".format(standardized_df.head()))
grouped_df = standardized_df.groupby(["Customer ID"], as_index=False).sum()
logger.info("grouped_df cols are {}".format(grouped_df.head()))
When this runs it produces the following logs:
standardized_df cols are Customer ID Customer Name ... TermDaysAmountProduct DaysToCollectAmountProduct
grouped_df cols are Customer ID Amount
So apparently, during the groupby, the TermDaysAmountProduct and DaysToCollectAmountProduct columns (which are both numeric and should be summed) are getting removed for some reason. How can I keep these columns in the dataframe after the sum?
Answer 1
Score: 1
I hadn't noticed before that pandas drops non-numeric columns when applying a sum. Interesting. Anyway, a workaround is to supply the column names manually to the aggregate function.
grouped_df = standardized_df.groupby(["Customer ID"], as_index=False).aggregate({"<col_1>": sum, "<col_2>": sum})
In general, you can always apply aggregate({foo: bar}) to a pandas.core.groupby.DataFrameGroupBy object, where foo is a column name and bar is a function that takes a pd.Series argument (or the name of a built-in aggregation, such as "sum").
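As a minimal sketch of the dictionary form, the example below uses a small made-up dataframe (the rows and values are invented for illustration; only the column names come from the question). It passes the string "sum" rather than the built-in sum, which pandas also accepts.

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the question.
df = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "Customer Name": ["Acme", "Acme", "Globex"],
    "Amount": [100.0, 50.0, 20.0],
    "TermDaysAmountProduct": [3000.0, 1500.0, 200.0],
})

# Name each column to aggregate explicitly, so none is dropped silently.
grouped = df.groupby(["Customer ID"], as_index=False).aggregate(
    {"Amount": "sum", "TermDaysAmountProduct": "sum"}
)
print(grouped)
```

Columns that are not listed in the dictionary (here, Customer Name) do not appear in the result.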
Note: if you have a large number of columns and you want them all to be summed without typing out a long dictionary, you can build the aggregate dictionary programmatically. Leave out the grouping column, since it is the key rather than a value to sum.
aggregates = {col: sum for col in standardized_df.columns if col != "Customer ID"}
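The question does not show the column dtypes, so the following is only a sketch under one assumption: that a column which looks numeric is actually stored as a non-numeric dtype (strings, in this made-up data), which would explain why it disappears from the sum. The sketch converts that column with pd.to_numeric, then builds the dictionary from the numeric columns only.

```python
import pandas as pd

# Hypothetical sample data; the string-typed column is an assumption
# about what might be happening, not something shown in the question.
df = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "Customer Name": ["Acme", "Acme", "Globex"],
    "Amount": [100.0, 50.0, 20.0],
    "DaysToCollectAmountProduct": ["10", "20", "30"],  # stored as strings
})
key = "Customer ID"

# Inspect dtypes first: object/string columns are not treated as numeric.
print(df.dtypes)

# Convert the column that should be numeric.
df["DaysToCollectAmountProduct"] = pd.to_numeric(df["DaysToCollectAmountProduct"])

# Build the aggregation dictionary from numeric columns, skipping the key.
numeric_cols = df.select_dtypes(include="number").columns
aggregates = {col: "sum" for col in numeric_cols if col != key}

grouped = df.groupby([key], as_index=False).aggregate(aggregates)
print(grouped)
```

If the real columns hold something other than numeric strings (for example Decimal objects), the conversion step would need to be adapted accordingly.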