describe() when the data is in value_counts form
Question
I have a data frame with two columns, value and count. For example, a row with (value, count) = (2, 1000) means that the value 2 occurs 1000 times. I want to compute the min, max, median, and percentiles so that the results match what df.describe() would return on the ungrouped data.
Thank you.
I could not find anything on this.
Answer 1
Score: 0
The generic way is to restore the original data, then compute the statistics:
import pandas as pd

# aggregated data: each value together with the number of times it occurs
df = pd.DataFrame({'value': [1, 2, 3], 'count': [5, 1, 4]})
# repeat each row 'count' times, then compute the statistics on the expanded column
out = df.loc[df.index.repeat(df['count']), 'value'].describe()
print(out)
Of course, you can do better depending on exactly which statistics you need: min and max are unchanged by the aggregation; mean and std can be computed with numpy.average or statsmodels.stats.weightstats.DescrStatsW through their weights parameter, and so on. Check which statistics you actually need and decide whether they can be computed without unaggregating.
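As a rough illustration of the weighted route for count, mean, std, min and max, here is a minimal sketch that never expands the rows. It assumes the counts are non-negative integers and that the sample standard deviation (ddof=1, the one describe() reports) is wanted. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# aggregated data: each value together with the number of times it occurs
df = pd.DataFrame({'value': [1, 2, 3], 'count': [5, 1, 4]})

values = df['value'].to_numpy(dtype=float)
counts = df['count'].to_numpy()

n = counts.sum()                              # total number of observations
mean = np.average(values, weights=counts)     # count-weighted mean
# sample variance (ddof=1): weighted sum of squared deviations divided by n - 1
std = np.sqrt(np.sum(counts * (values - mean) ** 2) / (n - 1))

present = values[counts > 0]                  # ignore values that never occur
summary = pd.Series({
    'count': float(n),
    'mean': mean,
    'std': std,
    'min': present.min(),
    'max': present.max(),
})
print(summary)
```

For this small frame the numbers agree with the describe() output on the expanded data; the percentiles are not covered by this sketch.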
Output:
count    10.000000
mean      1.900000
std       0.994429
min       1.000000
25%       1.000000
50%       1.500000
75%       3.000000
max       3.000000
Name: value, dtype: float64
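The 25%, 50% and 75% rows can also be recovered from the counts alone. The sketch below is one possible approach, not something from the answer itself: grouped_quantile is a hypothetical helper that uses cumulative counts to find which value sits at a given rank in the (never materialised) sorted data, then applies the same linear interpolation that describe() uses by default. It assumes non-negative integer counts with at least one observation.

```python
import numpy as np

def grouped_quantile(values, counts, q):
    """Quantile q (between 0 and 1) of data given as (value, count) pairs."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts)
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    cum = np.cumsum(counts)        # cum[i] = number of observations <= values[i]
    n = cum[-1]
    pos = q * (n - 1)              # fractional 0-based rank in the expanded data
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    # the observation at 0-based rank k is the first value whose cumulative count exceeds k
    v_lo = values[np.searchsorted(cum, lo, side='right')]
    v_hi = values[np.searchsorted(cum, hi, side='right')]
    return v_lo + (pos - lo) * (v_hi - v_lo)

values, counts = [1, 2, 3], [5, 1, 4]
for q in (0.25, 0.5, 0.75):
    print(q, grouped_quantile(values, counts, q))
```

The memory cost depends on the number of distinct values rather than on the total count, which matters when a single row stands for thousands of observations.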