rolling apply return dict
Question
I have a custom function that returns a dict, and I want to store its result in the cell of every row:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
def custom_rolling_apply(arr):
    return {'sum': np.sum(arr), 'mean': np.mean(arr)}
df['rolling_dict'] = np.NaN
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['rolling_dict'] = df['A'].rolling(window=3).apply(custom_rolling_apply, raw=True)
Why does this raise the following error?
TypeError: must be real number, not dict
pandas version: 1.5.3
Answer 1
Score: 2
Unfortunately, the function passed to rolling apply must produce a single value: https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.apply.html
Instead, we can iterate over the rolling windows to get the same result:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
def custom_rolling_apply(arr):
    q = {'sum': np.sum(arr), 'mean': np.mean(arr)}
    return q
df['rolling_dict'] = [custom_rolling_apply(i) for i in df['A'].rolling(window=3)]
print(df)
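One caveat with the iteration approach: as far as I can tell, iterating over a Rolling object also yields the shorter, incomplete windows at the start (lengths 1 and 2 here), so the first rows get dicts computed from partial data rather than NaN. A minimal sketch that leaves those rows empty instead; the helper name window_stats and the use of None as the placeholder are assumptions for illustration, not part of the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
window = 3

def window_stats(win):
    # Hypothetical helper: return None until the window is full,
    # mirroring the NaN rows that built-in rolling aggregations produce.
    if len(win) < window:
        return None
    return {'sum': float(np.sum(win)), 'mean': float(np.mean(win))}

# Iterating over the Rolling object yields one Series per row.
df['rolling_dict'] = [window_stats(w) for w in df['A'].rolling(window=window)]
print(df)
```

Whether partial windows should produce a dict or a placeholder depends on what the downstream code expects, so treat the length check as optional.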
Answer 2
Score: 2
You should use rolling.aggregate instead of apply:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df['rolling_dict'] = np.NaN
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['A'].rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)
Output:
    sum  mean
0   NaN   NaN
1   NaN   NaN
2   6.0   2.0
3   9.0   3.0
4  12.0   4.0
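If a column of dicts is still what you want, the aggregated frame can be converted row by row. A minimal sketch, assuming the built-in 'sum' and 'mean' aggregations (passed by name) are acceptable substitutes for np.sum and np.mean; the variable name stats is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Compute both statistics at once; string names select the built-in
# rolling implementations and give one column per statistic.
stats = df['A'].rolling(window=3).agg(['sum', 'mean'])

# One dict per row, keyed by column name. Rows with an incomplete
# window hold NaN values inside the dict.
df['rolling_dict'] = stats.to_dict('records')
print(df)
```

This keeps the rolling computation vectorised and only builds the Python dicts at the end.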
From the rolling.apply documentation:
> func : function. Must produce a single value from an ndarray input if
> raw=True or a single value from a Series if raw=False. Can also accept
> a Numba JIT function with engine='numba' specified.
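In other words, each call has to hand back one scalar. A minimal sketch of staying within that contract: call apply once per statistic and combine the resulting Series afterwards. The names roll and result are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
roll = df['A'].rolling(window=3)

# Each function returns a single number per window, which is what
# apply expects; raw=True passes the window as an ndarray.
rolling_sum = roll.apply(np.sum, raw=True)
rolling_mean = roll.apply(np.mean, raw=True)

# Combine the two Series into one frame with a column per statistic.
result = pd.DataFrame({'sum': rolling_sum, 'mean': rolling_mean})
print(result)
```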
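Note that apply carries a performance penalty if your data is large. The benchmark below times iterating over the windows with a Python function against rolling.aggregate: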
import numpy as np
import timeit
import matplotlib.pyplot as plt
import pandas as pd
def custom_rolling_apply(arr):
    q = {'sum': np.sum(arr), 'mean': np.mean(arr)}
    return q
def rolling_with_aggregate(arr):
    q = arr.rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)
    return q
def profile_rolling_operation(data_size):
    rolling_times_1 = []
    rolling_times_2 = []
    data_sizes = []
    for i in range(1, data_size + 1):
        data_sizes.append(i)
        df = pd.DataFrame({'A': np.random.randint(1, 10, i)})
        elapsed_time_1 = timeit.timeit(lambda: [custom_rolling_apply(arr) for arr in df['A'].rolling(window=3)], number=2)
        rolling_times_1.append(elapsed_time_1)
        elapsed_time_2 = timeit.timeit(lambda: rolling_with_aggregate(df['A']), number=2)
        rolling_times_2.append(elapsed_time_2)
    return data_sizes, rolling_times_1, rolling_times_2
max_data_size = 1000
data_sizes, rolling_times_1, rolling_times_2 = profile_rolling_operation(max_data_size)
plt.plot(data_sizes, rolling_times_1, label='Custom Rolling Apply')
plt.plot(data_sizes, rolling_times_2, label='Rolling with Aggregate')
plt.xlabel('Data Size')
plt.ylabel('Execution Time (seconds)')
plt.title('Comparison')
plt.legend()
plt.show()