May 21, 2023 07:12:26

Average time difference of one column over condition

Question

I have a dataframe like this:

[{'Year': 2019, 'time_diff_new': 0, 'type1': 'Nan', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 100, 'type1': 'Nan', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 105, 'type1': 'Nan', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 113, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2019, 'time_diff_new': 127, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2020, 'time_diff_new': 299, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2020, 'time_diff_new': 462, 'type1': 'Nan', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 601, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 605, 'type1': 'Nan', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 771, 'type1': 'Breaking', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 855, 'type1': 'Nan', 'Age': 861},
 {'Year': 2021, 'time_diff_new': 861, 'type1': 'Nan', 'Age': 861}]

time_diff_new is the number of days elapsed since the first date.

I want to calculate, over the Age column, how many days (taken from time_diff_new) it takes until the next "Breaking" change in the type1 column. For Age, though, I want to work with ranges rather than with each individual Age value.

Is there any way this could be achieved? I haven't done anything like this before, so I am not sure where to start. Any suggestions or ideas would be highly appreciated.

Answer 1

Score: 1

As you mentioned in the comments that you don't know yourself what the expected output for your example DataFrame should be, it is difficult to know whether a proposed solution is what you are looking for. From what I understand, you want the mean difference in time_diff_new between consecutive type1="Breaking" rows for different Age ranges.
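For the snippets in this answer to run, the records from the question need to be loaded into a DataFrame first. A minimal sketch, assuming pandas is installed; the name `records` is only illustrative, while `df` matches the name used in the rest of this answer:

```python
import pandas as pd

# Sample records copied from the question
records = [
    {"Year": 2019, "time_diff_new": 0, "type1": "Nan", "Age": 861},
    {"Year": 2019, "time_diff_new": 100, "type1": "Nan", "Age": 861},
    {"Year": 2019, "time_diff_new": 105, "type1": "Nan", "Age": 861},
    {"Year": 2019, "time_diff_new": 113, "type1": "Breaking", "Age": 861},
    {"Year": 2019, "time_diff_new": 127, "type1": "Breaking", "Age": 861},
    {"Year": 2020, "time_diff_new": 299, "type1": "Breaking", "Age": 861},
    {"Year": 2020, "time_diff_new": 462, "type1": "Nan", "Age": 861},
    {"Year": 2021, "time_diff_new": 601, "type1": "Breaking", "Age": 861},
    {"Year": 2021, "time_diff_new": 605, "type1": "Nan", "Age": 861},
    {"Year": 2021, "time_diff_new": 771, "type1": "Breaking", "Age": 861},
    {"Year": 2021, "time_diff_new": 855, "type1": "Nan", "Age": 861},
    {"Year": 2021, "time_diff_new": 861, "type1": "Nan", "Age": 861},
]
df = pd.DataFrame(records)

# Quick sanity check: number of rows and number of "Breaking" rows
n_rows = len(df)
n_breaking = int(df.type1.eq("Breaking").sum())
print(n_rows, n_breaking)
```

Note that 'Nan' here is an ordinary string, not a missing value, so comparisons such as `df.type1.eq("Breaking")` work without any special NaN handling.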

A couple of options for what you might be looking for:

# only analyse the "Breaking" rows (copy so later changes don't touch df)
df1 = df[df.type1.eq("Breaking")].copy()

General grouping of ages, whether determined by yourself or not:

import numpy as np
import pandas as pd

# sort first, so that positional group labels stay aligned with the rows
df1 = df1.sort_values("time_diff_new")
# using pandas cut
groups = pd.cut(df1.Age, bins=3)
# or using numpy
smallest = np.min(df1.Age)
largest = np.max(df1.Age)
num_edges = 3
groups = np.digitize(df1.Age, np.linspace(smallest, largest, num_edges))
# note that you could also use your own groups here, e.g.
# groups = np.digitize(df1.Age, [0, 20, 50, 100, 400, 600, 1000])
# group by groups and return the mean difference between "Breaking" rows per group
df1.groupby(groups)["time_diff_new"].apply(lambda s: s.diff().mean())
# only 1 group returned here for sample data, as data has "Age"=861 for all rows
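Because every row in the sample has Age = 861, the grouping above cannot show more than one group. The following sketch uses made-up data (the `demo` frame and the bin edges `[0, 500, 1000]` are hypothetical, chosen only for illustration) to show what a per-range result could look like:

```python
import pandas as pd

# Hypothetical data with the same columns as the question, but varied "Age" values
demo = pd.DataFrame({
    "time_diff_new": [10, 30, 70, 20, 50, 110, 40],
    "type1": ["Breaking", "Breaking", "Breaking",
              "Breaking", "Breaking", "Breaking", "Nan"],
    "Age": [100, 120, 150, 700, 720, 750, 400],
})

# Keep only "Breaking" rows and sort by elapsed days
breaking = demo[demo.type1.eq("Breaking")].sort_values("time_diff_new")

# Explicit, illustrative age ranges: (0, 500] and (500, 1000]
age_group = pd.cut(breaking.Age, bins=[0, 500, 1000])

# Mean gap in days between consecutive "Breaking" rows within each age range
per_group = (
    breaking.groupby(age_group, observed=True)["time_diff_new"]
    .apply(lambda s: s.diff().mean())
)
print(per_group)
```

Grouping by the Series returned by `pd.cut` aligns on the index, so it is safe to sort before or after computing the groups; with `np.digitize`, which returns a plain array, the groups have to be computed after sorting.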

Or a function that returns the mean for a specified age range:

def range_mean(df, min_age, max_age):
    # filter for values between ages only, and where type1 is "Breaking"
    df = df.loc[df.Age.between(min_age, max_age) & df.type1.eq("Breaking")]
    # ensure sorted (returns a new frame, so the caller's df is left untouched)
    df = df.sort_values("time_diff_new")
    # return the mean difference
    return df.time_diff_new.diff().mean()

range_mean(df, 860, 862)
# 164.5
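If what is wanted is instead the number of days from every row to the next "Breaking" row, which is another possible reading of the question, something along these lines might work. This is only a sketch under that assumption; the function name `days_to_next_breaking` and the column name `days_to_next` are made up for illustration:

```python
import pandas as pd

# Sample data from the question (only the columns needed here)
df = pd.DataFrame({
    "time_diff_new": [0, 100, 105, 113, 127, 299, 462, 601, 605, 771, 855, 861],
    "type1": ["Nan", "Nan", "Nan", "Breaking", "Breaking", "Breaking",
              "Nan", "Breaking", "Nan", "Breaking", "Nan", "Nan"],
    "Age": [861] * 12,
})

def days_to_next_breaking(df):
    # work on a sorted copy with a clean positional index
    df = df.sort_values("time_diff_new").reset_index(drop=True)
    # elapsed days on "Breaking" rows, NaN everywhere else
    breaking_time = df["time_diff_new"].where(df["type1"].eq("Breaking"))
    # for each row, the elapsed days of the first "Breaking" row strictly after it
    next_breaking = breaking_time.shift(-1).bfill()
    # rows with no later "Breaking" row get NaN
    return df.assign(days_to_next=next_breaking - df["time_diff_new"])

out = days_to_next_breaking(df)
print(out[["time_diff_new", "type1", "days_to_next"]])
```

The resulting `days_to_next` column can then be averaged per age range in the same way as above, for example with `out.groupby(pd.cut(out.Age, bins=3), observed=True)["days_to_next"].mean()`.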
