July 13, 2023, 00:25:42

numpy dataframe get maximum difference in a single column

Question

I want to get the biggest temperature change in a dataframe of countries.

My first idea was to make groups: df.groupby('country_code')['temperature'].max(), df.groupby('country_code')['temperature'].min(), subtract them, and get the maximum.

I guess there is a better way to do that?
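For reference, the first idea described above can be sketched as follows. This is only a minimal illustration: the sample data is made up (the column names `country_code` and `temperature` come from the question, the values do not), and using `idxmax` to find the country with the largest range is one possible way to do the final step.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 10 countries with 5 temperatures each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country_code": np.repeat(np.arange(10), 5),
    "temperature": rng.integers(-10, 50, size=50),
})

grouped = df.groupby("country_code")["temperature"]

# Per-country range: highest minus lowest temperature
temp_range = grouped.max() - grouped.min()

# Largest range overall, and the country it belongs to
biggest_change = temp_range.max()
country = temp_range.idxmax()
print(country, biggest_change)
```

The groupby object is built once and reused for both `max()` and `min()`, so the grouping work is not repeated.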

Answer 1

Score: 2

Looks like passing a custom max-min function to agg on the grouped column (SeriesGroupBy.agg) is the fastest of the three approaches timed below, at least on this small sample. The alternatives are slower:

using np.ptp inside agg;
using apply (see solution by @MariaKozlova)

import pandas as pd
import numpy as np
np.random.seed(0) # for reproducibility
# sample df: 10 countries with 5 temperatures
data = {'country_code': np.repeat(range(10),5), 'temperature': np.random.randint(-10,50,50)}
df = pd.DataFrame(data)
# method1 (pass lambda function to `agg`)
out = df.groupby('country_code', sort=False)['temperature'].agg(lambda x: max(x) - min(x))
# method2 (pass `np.ptp` to `agg`)
out2 = df.groupby('country_code', sort=False).agg({'temperature': np.ptp})
out.equals(out2['temperature'])
# True
out
country_code
0    53
1    56
2    44
3    57
4    23
5    29
6    42
7    47
8    27
9    25
Name: temperature, dtype: int64

Performance comparison

# intriguingly, `np.ptp` is actually quite a bit slower
%timeit df.groupby('country_code', sort=False)['temperature'].agg(lambda x: max(x) - min(x))
# 238 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.groupby('country_code', sort=False).agg({'temperature': np.ptp})
# 1.26 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# adding comparison for `apply` (solution by @MariaKolzova)
def temp_range(group):
    return group.max() - group.min()
%timeit df.groupby('country_code')['temperature'].apply(temp_range)
# 434 µs ± 9.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
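The `%timeit` lines above only run inside IPython or Jupyter. A rough equivalent for a plain script, using the standard-library `timeit` module, might look like the sketch below. The helper names (`lambda_agg`, `builtin_max_min`, `apply_range`) are made up for illustration, and the `builtin_max_min` variant (the question's original idea of subtracting the grouped `min()` from the grouped `max()`) was not part of the timings above. Results will vary with the machine and the pandas version, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)  # for reproducibility
df = pd.DataFrame({
    "country_code": np.repeat(range(10), 5),
    "temperature": np.random.randint(-10, 50, 50),
})

def lambda_agg():
    # custom max-min function passed to agg
    return df.groupby("country_code", sort=False)["temperature"].agg(lambda x: max(x) - min(x))

def builtin_max_min():
    # grouped max() minus grouped min(), as described in the question
    g = df.groupby("country_code", sort=False)["temperature"]
    return g.max() - g.min()

def apply_range():
    # max-min computed per group through apply
    return df.groupby("country_code")["temperature"].apply(lambda s: s.max() - s.min())

candidates = {
    "lambda_agg": lambda_agg,
    "builtin_max_min": builtin_max_min,
    "apply_range": apply_range,
}

runs = 200
# Average wall-clock seconds per call for each candidate
results = {name: timeit.timeit(fn, number=runs) / runs for name, fn in candidates.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

On a 50-row frame the measurements are dominated by fixed overhead, so it is worth repeating them on data closer to the real size before drawing conclusions.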

Answer 2

Score: 1

Here's a slightly different approach that deals with only one set of groups:

def temp_range(group):
    return group.max() - group.min()
df.groupby('country_code')['temperature'].apply(temp_range)

Not sure if it's better, though.
