transform with groupby: printing each column name per group inside the function passed to transform gives a strange extra output
I am using transform with groupby and want to print each column name in each group from inside the function passed to transform. The output is strange: there is an extra entry for '600010.SH', which is the first group key.
My data (test_data.csv):
code | date | high | low |
---|---|---|---|
600010.SH | 2022/12/29 | 1.99 | 1.92 |
600010.SH | 2022/12/29 | 1.94 | 1.91 |
600028.SH | 2022/12/30 | 4.41 | 4.34 |
600028.SH | 2022/12/29 | 4.38 | 4.35 |
My code is:
import pandas as pd

i = 1

def MAD_single(grp):
    global i
    print(i, grp.name)
    print(grp.head(10))
    i += 1
    # computing here...

df = pd.read_csv('data_input/test_data.csv')
temp = df.groupby('code')[['high', 'low']].transform(MAD_single)
The output is:
1 high
0 1.99
1 1.94
Name: high, dtype: float64
2 low
0 1.92
1 1.91
Name: low, dtype: float64
3 600010.SH
high low
0 1.99 1.92
1 1.94 1.91
4 high
2 4.41
3 4.38
Name: high, dtype: float64
5 low
2 4.34
3 4.35
Name: low, dtype: float64
I am wondering:
- Why is the following output printed?
- Why is the second group key not printed the way 600010.SH is?
The output in question:
3 600010.SH
high low
0 1.99 1.92
1 1.94 1.91
Thanks!
Answer 1
Score: 2
This is a great question, but to understand the result you get, you need to understand how pandas handles the transformation.
Here I only cover the case where you transform:
- a DataFrame (not a Series)
- with a custom user function (not a string such as 'mean' or 'sum')
- with the default engine (not numba or cython)
The implementation of this case can be found in the _transform_general method. All other cases are almost the same, but not quite.
When you transform a DataFrame, there are 2 main steps:
- Apply your function to the first group
- Apply your function to the other groups
During the first step, pandas tries to find the best path for applying the transformation.
It defines two lambda functions in the _define_paths method:
- fast_path: the function is applied to the whole data frame of the group at once
- slow_path: the function is applied to each column (or index) one at a time
fast_path is something like func(group), and slow_path is group.apply(lambda x: func(x)).
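To make the difference concrete, here is a minimal sketch of the two call styles on a single group. It does not call the private pandas methods; the names group, func and seen are illustrative assumptions, and group is just a hand-built stand-in for the first group of the question's data.

```python
import pandas as pd

# Stand-in for the first group (600010.SH) of the question's data.
group = pd.DataFrame({"high": [1.99, 1.94], "low": [1.92, 1.91]})

seen = []  # class name of every object the function receives


def func(x):
    seen.append(type(x).__name__)
    return x * 5


# Fast-path style: one call that receives the whole group as a DataFrame.
fast_result = func(group)

# Slow-path style: DataFrame.apply hands the function one column (a Series) at a time.
slow_result = group.apply(lambda col: func(col))

print(seen)                             # first entry is the DataFrame call
print(fast_result.equals(slow_result))  # both styles give the same values here
```

This is why the function in the question sometimes sees a Series named after a column and sometimes a DataFrame named after the group key.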
This is probably why the functions are named fast and slow.
Once they are defined, pandas can call them on the first group. This is what the _choose_path method does.
And here is the trick. First, pandas (safely) applies the slow_path function to the first group, then it tries to apply the fast_path function to the same group.
If that second call succeeds and its result passes some checks, pandas uses fast_path for the remaining groups to speed up the process.
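This matters for large transformations with many groups. Suppose you have a dataframe with 100 groups and 30 columns to transform.
On the slow path, your function is called 3001 times: 30 columns x 100 groups = 3000 per-column calls, plus the single fast-path trial on the first group. On the fast path, it is called only 130 times: 30 per-column calls and 1 whole-frame call on the first group, then 1 call for each of the 99 remaining groups.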
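The arithmetic above can be written as a small helper. This is only a sketch of the counting argument, under the assumption that there are at least two groups (so the fast-path trial happens); transform_call_counts is a hypothetical name, not a pandas function.

```python
def transform_call_counts(n_groups, n_columns):
    """Return (slow_total, fast_total) calls of the user function.

    Assumes n_groups >= 2 and the strategy described above:
    per-column calls on the first group, then one whole-frame trial.
    """
    # Slow path everywhere: every column of every group, plus the one trial call.
    slow_total = n_columns * n_groups + 1
    # Fast path accepted: first group costs n_columns + 1, then one call per remaining group.
    fast_total = n_columns + 1 + (n_groups - 1)
    return slow_total, fast_total


print(transform_call_counts(100, 30))  # (3001, 130)
print(transform_call_counts(2, 2))     # (5, 4)
```

The second line corresponds to the question's data (2 groups, 2 columns): 5 calls when the slow path is kept, which matches the five numbered outputs in the question.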
However, keep in mind that whether the fast or the slow path is used depends only on what the custom function returns. I slightly modified your custom function to make this easier to see:
i = 0

def MAD_single(grp, slow_strategy=True):
    global i
    print(f'{i}: "{grp.name}" is a {grp.__class__.__name__}')
    i += 1
    return 5 if slow_strategy else grp * 5

g = df.groupby('code')[['high', 'low']]
In the first run, we force slow_path by returning a scalar value:
>>> g.transform(MAD_single, slow_strategy=True)
0: "high" is a Series # 1st group, slow_path
1: "low" is a Series # 1st group, slow_path
2: "600010.SH" is a DataFrame # 1st group, fast_path
3: "high" is a Series # 2nd group, slow path
4: "low" is a Series # 2nd group, slow path
The fast_path function was called only once because its result (a scalar) did not pass the checks.
Now we force fast_path by returning an object with the same shape as the input:
>>> g.transform(MAD_single, slow_strategy=False)
0: "high" is a Series # 1st group, slow_path
1: "low" is a Series # 1st group, slow_path
2: "600010.SH" is a DataFrame # 1st group, fast_path
3: "600028.SH" is a DataFrame # 2nd group, fast_path
Because the result passes the checks, the fast_path function is now used for the remaining groups.
In conclusion, the transformation process seems to follow these steps:
- 1 - Apply the transformation to each column of the first group (slow path)
- 2 - Try to apply the transformation to all columns of the first group at once (fast path)
- 3 - Check the result of 2: use the fast path (4a) if the checks pass, otherwise fall back to the slow path (4b)
- 4a - Apply the transformation to all columns of each remaining group at once (fast path)
- 4b - Apply the transformation to each column of the remaining groups (slow path)
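The steps above can be checked with a self-contained script that needs no CSV file. This is a sketch: the DataFrame is rebuilt by hand from the question's table, record_calls is a hypothetical helper, and the counts reflect the behaviour described in this answer, which may differ in other pandas versions.

```python
import pandas as pd

# The question's data, built inline (date column omitted since it is not transformed).
df = pd.DataFrame({
    "code": ["600010.SH", "600010.SH", "600028.SH", "600028.SH"],
    "high": [1.99, 1.94, 4.41, 4.38],
    "low": [1.92, 1.91, 4.34, 4.35],
})


def record_calls(return_frame):
    """Run transform once and return the class name of each object the function received."""
    calls = []

    def func(grp):
        calls.append(type(grp).__name__)
        # A same-shaped result lets the fast path pass the checks; a scalar does not.
        return grp * 5 if return_frame else 5

    df.groupby("code")[["high", "low"]].transform(func)
    return calls


scalar_calls = record_calls(return_frame=False)
frame_calls = record_calls(return_frame=True)

# Whole-group (DataFrame) calls: only the trial when the slow path is kept,
# the trial plus one per remaining group when the fast path is chosen.
print(scalar_calls.count("DataFrame"))
print(frame_calls.count("DataFrame"))
```

Collecting the calls in a list instead of printing from a global counter keeps the two runs independent of each other.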
Disclaimer: I'm not a pandas developer, so some of my understanding may be wrong, but the results obtained and the code analysis seem to confirm this strategy.