Average time difference of one column over condition

Question

I have a dataframe like this:

[{'Year': 2019, 'time_diff_new': 0, 'type1': 'Nan', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 100, 'type1': 'Nan', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 105, 'type1': 'Nan', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 113, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 127, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2020, 'time_diff_new': 299, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2020, 'time_diff_new': 462, 'type1': 'Nan', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 601, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 605, 'type1': 'Nan', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 771, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 855, 'type1': 'Nan', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 861, 'type1': 'Nan', 'Age': 861}]

time_diff_new is the number of days elapsed since the first date.
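For anyone who wants to reproduce this, here is a minimal sketch of loading the same values into a DataFrame. It is written column-wise for brevity, and the variable name `df` is just an assumption. Note that 'Nan' in type1 is a literal string here, not a missing value.

```python
import pandas as pd

# Same values as the records above, written column-wise for brevity.
df = pd.DataFrame({
    "Year": [2019] * 5 + [2020] * 2 + [2021] * 5,
    "time_diff_new": [0, 100, 105, 113, 127, 299, 462, 601, 605, 771, 855, 861],
    "type1": ["Nan", "Nan", "Nan", "Breaking", "Breaking", "Breaking",
              "Nan", "Breaking", "Nan", "Breaking", "Nan", "Nan"],
    "Age": [861] * 12,
})

# 'Nan' is an ordinary string, so isna() does not flag it.
print(df.type1.isna().sum())            # number of real missing values
print(df.type1.eq("Breaking").sum())    # number of "Breaking" rows
```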

I want to do the calculation over the Age column: I want to check how many days it takes until the next breaking change (from the type1 column). For Age, though, I want it to consider ranges, not each Age value individually.

Is there any way this could be achieved? I haven't done anything like this before, so I am not sure where to start. Any suggestions or ideas would be highly appreciated.

Answer 1

Score: 1

As you mentioned in the comments, you don't know yourself what the expected output for your example DataFrame should be, so it is very difficult to know whether what we might provide as a solution is what you are looking for. From what I understand, you are looking to find the mean difference between consecutive type1="Breaking" rows for different Age ranges.
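To make that reading concrete, here is a small sketch of the arithmetic for the sample data, assuming "mean difference" means the average gap in days between consecutive "Breaking" rows. The day values are copied from the question; the names `breaking_days` and `gaps` are only illustrative.

```python
import pandas as pd

# time_diff_new of the five "Breaking" rows in the sample data
breaking_days = pd.Series([113, 127, 299, 601, 771], name="time_diff_new")

# gap between each "Breaking" row and the previous one (first gap is NaN)
gaps = breaking_days.diff()
print(gaps.tolist())  # [nan, 14.0, 172.0, 302.0, 170.0]

# mean() skips the NaN, so this is (14 + 172 + 302 + 170) / 4
print(gaps.mean())    # 164.5
```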

A couple of options for what you might be looking for:

# only analyse the "Breaking" rows; .copy() avoids a SettingWithCopyWarning if df1 is modified later
df1 = df[df.type1.eq("Breaking")].copy()

General grouping of ages, whether determined by yourself or not:

# ensure sorted before building the groups, so positional group labels line up with the rows
df1 = df1.sort_values("time_diff_new")

# using pandas cut
groups = pd.cut(df1.Age, bins=3)
# or using numpy
import numpy as np
smallest = np.min(df1.Age)
largest = np.max(df1.Age)
num_edges = 3
groups = np.digitize(df1.Age, np.linspace(smallest, largest, num_edges))
# note that you could also use your own groups here, e.g.
# groups = np.digitize(df1.Age, [0, 20, 50, 100, 400, 600, 1000])

# difference between consecutive "Breaking" rows within each group, then the mean per group
df1.groupby(groups)["time_diff_new"].diff().groupby(groups).mean()
# only 1 group returned here for sample data, as data has "Age"=861 for all rows
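Since the sample data only has one Age value, here is a sketch with made-up data to show what the grouped result could look like when there are several age ranges. The `demo` frame, its Age values and the bin edges in `age_bins` are all hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: these values are invented for illustration only.
demo = pd.DataFrame({
    "time_diff_new": [10, 40, 50, 90, 120, 200],
    "type1": ["Breaking"] * 6,
    "Age": [100, 150, 700, 120, 800, 750],
})

# keep "Breaking" rows and sort by time before taking differences
breaking = demo[demo.type1.eq("Breaking")].sort_values("time_diff_new")

# assumed bin edges: (0, 500] and (500, 1000]
age_bins = [0, 500, 1000]
groups = pd.cut(breaking.Age, bins=age_bins)

# gap to the previous "Breaking" row within the same age range
gaps = breaking.groupby(groups, observed=True)["time_diff_new"].diff()

# mean gap per age range
per_group = gaps.groupby(groups, observed=True).mean()
print(per_group)
```

With these invented values the (0, 500] range has gaps of 30 and 50 days, and the (500, 1000] range has gaps of 70 and 80 days, so the per-range means are 40 and 75.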

Or a function that returns the mean for a specified age range:

def range_mean(df, min_age, max_age):
    # filter for values between ages only, and where type1 is "Breaking"
    df = df.loc[df.Age.between(min_age, max_age) & df.type1.eq("Breaking")]
    # ensure sorted (returns a new frame, so the caller's DataFrame is not modified)
    df = df.sort_values("time_diff_new")
    # return the mean difference
    return df.time_diff_new.diff().mean()


range_mean(df, 860, 862)
# 164.5
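If you want the result for several ranges at once, one option is to call the function in a loop. This is only a sketch: the ranges in `age_ranges` are assumed for illustration, and the function is repeated here so the snippet runs on its own.

```python
import pandas as pd

def range_mean(df, min_age, max_age):
    # keep "Breaking" rows whose Age falls in [min_age, max_age]
    subset = df.loc[df.Age.between(min_age, max_age) & df.type1.eq("Breaking")]
    # sort by time, then average the gaps between consecutive rows
    return subset.sort_values("time_diff_new").time_diff_new.diff().mean()

# sample data from the question, written column-wise
df = pd.DataFrame({
    "Year": [2019] * 5 + [2020] * 2 + [2021] * 5,
    "time_diff_new": [0, 100, 105, 113, 127, 299, 462, 601, 605, 771, 855, 861],
    "type1": ["Nan", "Nan", "Nan", "Breaking", "Breaking", "Breaking",
              "Nan", "Breaking", "Nan", "Breaking", "Nan", "Nan"],
    "Age": [861] * 12,
})

# assumed age ranges (inclusive on both ends, as between() is by default)
age_ranges = [(0, 500), (501, 1000)]
result = pd.Series(
    {f"{low}-{high}": range_mean(df, low, high) for low, high in age_ranges}
)
print(result)
```

A range with no "Breaking" rows (here 0-500) comes back as NaN, because there are no gaps to average.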

huangapple
  • Posted on May 21, 2023, 07:12:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76297678.html