Average time difference of one column over condition
Question
I have a dataframe like this:
[{'Year': 2019, 'time_diff_new': 0, 'type1': 'Nan', 'Age': 861},
{'Year': 2019, 'time_diff_new': 100, 'type1': 'Nan', 'Age': 861},
{'Year': 2019, 'time_diff_new': 105, 'type1': 'Nan', 'Age': 861},
{'Year': 2019, 'time_diff_new': 113, 'type1': 'Breaking', 'Age': 861},
{'Year': 2019, 'time_diff_new': 127, 'type1': 'Breaking', 'Age': 861},
{'Year': 2020, 'time_diff_new': 299, 'type1': 'Breaking', 'Age': 861},
{'Year': 2020, 'time_diff_new': 462, 'type1': 'Nan', 'Age': 861},
{'Year': 2021, 'time_diff_new': 601, 'type1': 'Breaking', 'Age': 861},
{'Year': 2021, 'time_diff_new': 605, 'type1': 'Nan', 'Age': 861},
{'Year': 2021, 'time_diff_new': 771, 'type1': 'Breaking', 'Age': 861},
{'Year': 2021, 'time_diff_new': 855, 'type1': 'Nan', 'Age': 861},
{'Year': 2021, 'time_diff_new': 861, 'type1': 'Nan', 'Age': 861}]
time_diff_new represents the difference in days from the first date.
I want to calculate over the Age column: I want to check how many days it takes until the next Breaking change (from the type1 column). For Age, however, I want to consider ranges rather than each individual Age value.
Is there any way this could be achieved? I haven't done something like this before, so I am not sure where to start. Any suggestions or ideas would be highly appreciated.
Answer 1
Score: 1
As you mentioned in the comments that you do not know the expected output for your example DataFrame yourself, it is difficult to know whether any solution we provide is what you are looking for. From what I understand, you want the mean difference between consecutive type1="Breaking" rows for different Age ranges.
A couple of options for what you might be looking for:
import numpy as np
import pandas as pd

# analyse only the "Breaking" rows, sorted by time so that diff() is meaningful
# (sorting first also keeps the positional numpy groups below aligned with the rows)
df1 = df[df.type1.eq("Breaking")].sort_values("time_diff_new")
General grouping of ages, whether you choose the bin edges yourself or not:
# using pandas cut
groups = pd.cut(df1.Age, bins=3)
# or using numpy
smallest = np.min(df1.Age)
largest = np.max(df1.Age)
num_edges = 3
groups = np.digitize(df1.Age, np.linspace(smallest, largest, num_edges))
# note that you could also use your own groups here, e.g.
# groups = np.digitize(df1.Age, [0, 20, 50, 100, 400, 600, 1000])
# group by age bin and take the mean gap between consecutive "Breaking" rows within each bin
df1.groupby(groups, observed=True)["time_diff_new"].apply(lambda s: s.diff().mean())
# only 1 group is returned for the sample data, as "Age" is 861 in every row
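To try the grouping approach end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and computes the mean gap per age range. The bin edges in `age_edges` and the names `breaking`, `age_bin` and `mean_gap` are assumptions made for illustration; they are not part of the original post.

```python
import pandas as pd

# Sample records from the question: (Year, time_diff_new, type1).
records = [
    (2019, 0, "Nan"), (2019, 100, "Nan"), (2019, 105, "Nan"),
    (2019, 113, "Breaking"), (2019, 127, "Breaking"), (2020, 299, "Breaking"),
    (2020, 462, "Nan"), (2021, 601, "Breaking"), (2021, 605, "Nan"),
    (2021, 771, "Breaking"), (2021, 855, "Nan"), (2021, 861, "Nan"),
]
df = pd.DataFrame(records, columns=["Year", "time_diff_new", "type1"])
df["Age"] = 861  # Age is 861 in every sample row

# Hypothetical bin edges, chosen only for illustration.
age_edges = [0, 500, 1000]

# Keep the "Breaking" rows and sort them by time so diff() is meaningful.
breaking = df[df["type1"].eq("Breaking")].sort_values("time_diff_new")
age_bin = pd.cut(breaking["Age"], bins=age_edges)

# Mean gap in days between consecutive "Breaking" rows, per age range.
mean_gap = breaking.groupby(age_bin, observed=True)["time_diff_new"].apply(
    lambda s: s.diff().mean()
)
print(mean_gap)
```

With the sample data only the (500, 1000] range is populated, so the result has a single entry. `observed=True` drops the empty ranges; pass `observed=False` if you would rather see them as NaN.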
Or a function that returns the mean for a specified age range:
def range_mean(df, min_age, max_age):
    # keep only rows whose Age lies in [min_age, max_age] and whose type1 is "Breaking"
    sub = df.loc[df.Age.between(min_age, max_age) & df.type1.eq("Breaking")]
    # sort by time so that diff() gives the gap between consecutive "Breaking" rows
    sub = sub.sort_values("time_diff_new")
    # return the mean gap in days
    return sub.time_diff_new.diff().mean()
range_mean(df, 860, 862)
# 164.5
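The question can also be read differently: for every row, how many days pass until the next "Breaking" row. The sketch below shows one possible way to compute that, assuming this reading is what is wanted. The column name `days_to_next_breaking` is made up for the example.

```python
import pandas as pd

# Sample data from the question, sorted by time.
df = pd.DataFrame({
    "time_diff_new": [0, 100, 105, 113, 127, 299, 462, 601, 605, 771, 855, 861],
    "type1": ["Nan", "Nan", "Nan", "Breaking", "Breaking", "Breaking",
              "Nan", "Breaking", "Nan", "Breaking", "Nan", "Nan"],
    "Age": 861,
}).sort_values("time_diff_new")

# Time of each "Breaking" row, NaN for all other rows.
breaking_time = df["time_diff_new"].where(df["type1"].eq("Breaking"))

# Shift up by one row so that a "Breaking" row looks past itself, then
# back-fill so every row sees the time of the next "Breaking" row after it.
next_breaking = breaking_time.shift(-1).bfill()

# Days from each row to the next "Breaking" row (NaN if none follows).
df["days_to_next_breaking"] = next_breaking - df["time_diff_new"]
print(df[["time_diff_new", "type1", "days_to_next_breaking"]])
```

Rows after the last "Breaking" entry get NaN, since no later "Breaking" row exists. The new column can then be averaged per age range with `pd.cut` and `groupby`, in the same way as above.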