groupby with transform: printing each column name per group inside the function passed to transform gives unexpected output

Question

I am using transform with groupby and want to print each column name in each group from inside the function passed to transform. The output is strange: there is one extra entry labeled '600010.SH', which is the first group key.

My data (test_data.csv):

code date high low
600010.SH 2022/12/29 1.99 1.92
600010.SH 2022/12/29 1.94 1.91
600028.SH 2022/12/30 4.41 4.34
600028.SH 2022/12/29 4.38 4.35

My code:

import pandas as pd

i = 1
def MAD_single(grp):
    global i
    print(i, grp.name)
    print(grp.head(10))
    i += 1
    # computing here...

df = pd.read_csv('data_input/test_data.csv')
temp = df.groupby('code')[['high', 'low']].transform(MAD_single)

The output is:

1 high
0    1.99
1    1.94
Name: high, dtype: float64
2 low
0    1.92
1    1.91
Name: low, dtype: float64
3 600010.SH
   high   low
0  1.99  1.92
1  1.94  1.91
4 high
2    4.41
3    4.38
Name: high, dtype: float64
5 low
2    4.34
3    4.35
Name: low, dtype: float64

I am wondering:

  1. Why is the following output printed?
  2. Why is the second group key (600028.SH) not printed in the same way as 600010.SH?

3 600010.SH
   high   low
0  1.99  1.92
1  1.94  1.91

Thanks!

Answer 1

Score: 2

This is a great question, but to understand the result you get, you need to understand how Pandas handles the transformation.

Here, I will only deal with the case where you transform:

  1. a DataFrame (not a Series)
  2. with a custom user function (not a string, like 'mean' or 'sum')
  3. with the default engine (not numba or cython)

The implementation of this case can be found in the _transform_general method. All other cases are almost the same but not quite.

When trying to transform a DataFrame, there are 2 main steps:

  1. Apply your function on the first group
  2. Apply your function on other groups

During the first step, Pandas tries to find the best path for applying the transformation.

Pandas defines two lambda functions in the _define_paths method:

  • fast_path: the function is applied to the whole DataFrame of the group at once
  • slow_path: the function is applied to each column (or row) of the group, one at a time

fast_path is something like func(group) and slow_path is something like group.apply(lambda x: func(x)).
This is why the two functions are named fast and slow: the fast one makes a single call per group, while the slow one makes one call per column.
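To make the difference concrete, here is a minimal sketch of the two call patterns on a single group. The helper names fast_path, slow_path and times_five are my own stand-ins written as plain functions; they only imitate the idea described above and are not the pandas internals themselves.

```python
import pandas as pd

def fast_path(group, func):
    # One call: func receives the whole group as a DataFrame.
    return func(group)

def slow_path(group, func):
    # One call per column: func receives each column as a Series.
    return group.apply(lambda col: func(col))

# The first group from the question, written inline.
group = pd.DataFrame({"high": [1.99, 1.94], "low": [1.92, 1.91]})

calls = []

def times_five(obj):
    # Record what kind of object the function was handed.
    calls.append(type(obj).__name__)
    return obj * 5

fast_result = fast_path(group, times_five)
slow_result = slow_path(group, times_five)

print(calls)                            # kinds of objects seen by times_five
print(fast_result.equals(slow_result))  # do both paths agree?
```

For an element-wise function like this one, both paths should produce the same frame; only the number and type of calls differ.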

Once they are defined, Pandas can call them on the first group. This is what the _choose_path method does.
Here is the trick. Pandas first applies the slow_path function to the first group (this is the safe option), then tries to apply the fast_path function to the same group.
If the fast_path call succeeds and its result passes some checks, Pandas uses fast_path for the remaining groups to speed up the process.
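The selection logic can be sketched roughly as follows. This is a simplified approximation based on my reading, not the actual pandas code: choose_path is a hypothetical name, and the real checks may differ in detail between versions.

```python
import pandas as pd

def choose_path(func, first_group):
    """Return ("fast" or "slow", result for the first group)."""
    # Always run the per-column (slow) version first.
    slow_result = first_group.apply(lambda col: func(col))
    try:
        # Then try one call on the whole group.
        fast_result = func(first_group)
    except Exception:
        return "slow", slow_result
    # Assumed checks: the fast result must be a DataFrame with the same
    # columns, and it must equal the slow result.
    if not isinstance(fast_result, pd.DataFrame):
        return "slow", slow_result
    if not fast_result.columns.equals(first_group.columns):
        return "slow", slow_result
    if fast_result.equals(slow_result):
        return "fast", slow_result
    return "slow", slow_result

first_group = pd.DataFrame({"high": [1.99, 1.94], "low": [1.92, 1.91]})

print(choose_path(lambda x: x * 5, first_group)[0])  # fast
print(choose_path(lambda x: 5, first_group)[0])      # slow
```

Note that the function is called on the first group by both paths no matter which one wins, which is where the extra DataFrame call in the question comes from.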

This matters for large transformations with many groups. Suppose you have a DataFrame with 100 groups and 30 columns to transform.
With the slow path, your function is called 3001 times (30 × 100 column calls plus 1 fast_path attempt on the first group), against only 130 times (30 + 1 + 99) with the fast path.
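The arithmetic can be checked with a few lines of Python. count_calls is a hypothetical helper that just encodes the counting described here: the first group always costs one call per column plus one fast_path attempt, and each remaining group costs either one call or one call per column.

```python
def count_calls(n_groups, n_cols, fast):
    # First group: slow pass over every column, plus one fast-path attempt.
    first_group = n_cols + 1
    # Remaining groups: one call each on the fast path, n_cols each otherwise.
    per_group = 1 if fast else n_cols
    return first_group + (n_groups - 1) * per_group

print(count_calls(100, 30, fast=False))  # 3001
print(count_calls(100, 30, fast=True))   # 130
print(count_calls(2, 2, fast=False))     # 5, as in the question's output
```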

However, keep in mind that whether the fast or the slow path is used depends only on what the custom function returns. I slightly modified your custom function to make the demonstration clearer:

i = 0
def MAD_single(grp, slow_strategy=True):
    global i
    print(f'{i}: "{grp.name}" is a {grp.__class__.__name__}')
    i += 1
    return 5 if slow_strategy else grp * 5

g = df.groupby('code')[['high', 'low']]

In the first run, we force slow_path by returning a scalar value:

>>> g.transform(MAD_single, slow_strategy=True)
0: "high" is a Series  # 1st group, slow_path
1: "low" is a Series  # 1st group, slow_path
2: "600010.SH" is a DataFrame  # 1st group, fast_path
3: "high" is a Series  # 2nd group, slow path
4: "low" is a Series  # 2nd group, slow path

The fast_path function was called only once because its result didn't pass the checks.

Now, we force fast_path by returning an object with the same dimensions as the input:

>>> g.transform(MAD_single, slow_strategy=False)
0: "high" is a Series  # 1st group, slow_path
1: "low" is a Series  # 1st group, slow_path
2: "600010.SH" is a DataFrame  # 1st group, fast_path
3: "600028.SH" is a DataFrame  # 2nd group, fast_path

As the result passes the checks, the fast_path function is used for the remaining groups.
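If you want to reproduce this without the CSV file, here is a self-contained sketch that builds the question's data inline and records each call in a list instead of a global counter. The names make_logger and times_five are my own, and the exact call sequence is an assumption that may vary with the pandas version, so no expected log is shown.

```python
import pandas as pd

# Same rows as test_data.csv, minus the unused date column.
df = pd.DataFrame({
    "code": ["600010.SH", "600010.SH", "600028.SH", "600028.SH"],
    "high": [1.99, 1.94, 4.41, 4.38],
    "low": [1.92, 1.91, 4.34, 4.35],
})

def make_logger(log):
    def times_five(grp):
        # grp.name is the column name for a Series and (when pandas sets it)
        # the group key for a DataFrame.
        log.append((getattr(grp, "name", None), type(grp).__name__))
        return grp * 5
    return times_five

log = []
result = df.groupby("code")[["high", "low"]].transform(make_logger(log))

for name, kind in log:
    print(name, kind)
print(result)
```

Whatever path is chosen, the returned frame should be the same: every value multiplied by 5, aligned with the original index.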

In conclusion, the transformation process seems to follow these steps:

  • 1 - Apply the transformation on each column of the first group (slow path)
  • 2 - Try to apply the transformation on all columns of the first group at once (fast path)
  • 3 - Check the result of 2: continue with the fast path (4b) if the checks pass, otherwise fall back to the slow path (4a)
  • 4a - Apply the transformation on each column of the remaining groups (slow path)
  • 4b - Apply the transformation on all columns of the remaining groups at once (fast path)

Disclaimer: I'm not a Pandas developer, so some of my understanding may be wrong, but the observed results and my reading of the code seem to confirm this strategy.

Posted by huangapple on June 16, 2023. Original link: https://go.coder-hub.com/76486673.html