transform with groupby: want to print each column name in each group in the function fed to transform, but something strange happens


I am using transform with groupby and want to print each column name in each group inside the function passed to transform. Something strange happens: there is an extra output with '600010.SH', which is the first group key.

My data (test_data.csv):

    code       date        high  low
    600010.SH  2022/12/29  1.99  1.92
    600010.SH  2022/12/29  1.94  1.91
    600028.SH  2022/12/30  4.41  4.34
    600028.SH  2022/12/29  4.38  4.35

My code is:

    import pandas as pd

    i = 1

    def MAD_single(grp):
        global i
        print(i, grp.name)  # grp.name is a column name or a group key
        print(grp.head(10))
        i += 1
        # computing here...

    df = pd.read_csv('data_input/test_data.csv')
    temp = df.groupby('code')[['high', 'low']].transform(MAD_single)

The output is:

    1 high
    0    1.99
    1    1.94
    Name: high, dtype: float64
    2 low
    0    1.92
    1    1.91
    Name: low, dtype: float64
    3 600010.SH
       high   low
    0  1.99  1.92
    1  1.94  1.91
    4 high
    2    4.41
    3    4.38
    Name: high, dtype: float64
    5 low
    2    4.34
    3    4.35
    Name: low, dtype: float64

I am wondering:

  1. Why is the following output printed at all?
  2. Why is the second group key (600028.SH) not printed the same way as 600010.SH?

    3 600010.SH
       high   low
    0  1.99  1.92
    1  1.94  1.91


Score: 2





This is a great question, but to understand the result you get, you need to understand how Pandas handles the transformation.

Here, I will only deal with the case where you transform:

  1. a DataFrame (not a Series)
  2. with a custom user function (not a string, like 'mean' or 'sum')
  3. with the default engine (not numba or cython)

The implementation of this case can be found in the _transform_general method. All other cases are almost the same but not quite.

When trying to transform a DataFrame, there are 2 main steps:

  1. Apply your function on the first group
  2. Apply your function on other groups

During the first step, Pandas tries to find the best path to apply the transformation.

Pandas defines two lambda functions in the _define_paths method:

  • fast_path: the function is applied to the whole DataFrame of the group at once
  • slow_path: the function is applied to each column (or index), one at a time

fast_path is something like func(group) and slow_path is group.apply(lambda x: func(x)).
Maybe you understand now why these functions are named fast and slow.
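
To make the difference concrete, here is a minimal sketch of what the two paths do to a single group. It is only an illustration and not the Pandas source: func, group and seen are names made up for this example, and the values are taken from the first group in the question.

```python
import pandas as pd

# One group, roughly as it would be handed to the two paths.
group = pd.DataFrame({"high": [1.99, 1.94], "low": [1.92, 1.91]})

seen = []  # records the type of object func receives on each call


def func(x):
    seen.append(type(x).__name__)
    return x * 5


# fast path: a single call that receives the whole DataFrame
fast_result = func(group)

# slow path: one call per column, each receiving a Series
slow_result = group.apply(lambda col: func(col))

print(seen)
print(fast_result.equals(slow_result))
```

Both paths produce the same values here, which is why the function can be called once per group instead of once per column.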

Once they are defined, Pandas can call them on the first group. This is what the _choose_path method does.
And here is the trick: Pandas first applies the slow_path function to the first group, then it (safely) tries to apply the fast_path function to the same group.
If the fast_path call succeeds and its result passes some checks, Pandas uses fast_path for the remaining groups to speed up the process.

This matters for large transformations with many groups. Suppose you have a DataFrame with 100 groups and 30 columns to transform.
With the slow path, your function is called 3001 times (30 columns x 100 groups, plus 1 fast_path attempt on the first group), against only 130 times (30 + 1 + 99) with the fast path.
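
The arithmetic can be checked with a small helper. expected_calls is a hypothetical function written for this answer, assuming the first group always costs one call per column plus one fast_path attempt.

```python
def expected_calls(n_groups, n_columns, fast_path_accepted):
    """Number of times the user function is called, under the assumptions above."""
    # First group: slow path on every column, plus one fast path attempt.
    first_group = n_columns + 1
    # Remaining groups: one call each (fast) or one call per column (slow).
    per_group = 1 if fast_path_accepted else n_columns
    return first_group + (n_groups - 1) * per_group


print(expected_calls(100, 30, fast_path_accepted=False))  # 3001
print(expected_calls(100, 30, fast_path_accepted=True))   # 130
print(expected_calls(2, 2, fast_path_accepted=False))     # 5
```

The last line matches the question: 2 groups and 2 columns on the slow path give the 5 printed calls.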

However, keep in mind that whether the fast or the slow path is used depends only on what the custom function returns. I slightly modified your custom function for demonstration purposes:

    i = 0

    def MAD_single(grp, slow_strategy=True):
        global i
        # grp.name is a column name (Series) or a group key (DataFrame)
        print(f'{i}: "{grp.name}" is a {grp.__class__.__name__}')
        i += 1
        return 5 if slow_strategy else grp * 5

    g = df.groupby('code')[['high', 'low']]

First run, we force slow_path by returning a scalar value:

    >>> g.transform(MAD_single, slow_strategy=True)
    0: "high" is a Series          # 1st group, slow_path
    1: "low" is a Series           # 1st group, slow_path
    2: "600010.SH" is a DataFrame  # 1st group, fast_path
    3: "high" is a Series          # 2nd group, slow_path
    4: "low" is a Series           # 2nd group, slow_path

The fast_path function was called only once because the result didn't pass the checks.

Now, we force fast_path by returning an object with the same dimensions as the input (after resetting i to 0):

    >>> g.transform(MAD_single, slow_strategy=False)
    0: "high" is a Series          # 1st group, slow_path
    1: "low" is a Series           # 1st group, slow_path
    2: "600010.SH" is a DataFrame  # 1st group, fast_path
    3: "600028.SH" is a DataFrame  # 2nd group, fast_path

Because the result passes the checks, the fast_path function is used for the remaining groups.
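
If you want to reproduce both runs without the CSV file, something like the following should work. It is a sketch under a few assumptions: the DataFrame is rebuilt inline from the rows in the question, run and logged are helper names invented here, and the calls are collected in a list instead of being printed with a global counter. The exact sequence of calls may vary with the Pandas version.

```python
import pandas as pd

# Same rows as the question's test_data.csv, built inline.
df = pd.DataFrame({
    "code": ["600010.SH", "600010.SH", "600028.SH", "600028.SH"],
    "date": ["2022/12/29", "2022/12/29", "2022/12/30", "2022/12/29"],
    "high": [1.99, 1.94, 4.41, 4.38],
    "low": [1.92, 1.91, 4.34, 4.35],
})


def run(slow_strategy):
    """Run transform once and return (result, log of calls)."""
    calls = []

    def logged(grp):
        # getattr: do not assume every object passed in carries a name
        calls.append((getattr(grp, "name", None), type(grp).__name__))
        return 5 if slow_strategy else grp * 5

    result = df.groupby("code")[["high", "low"]].transform(logged)
    return result, calls


slow_result, slow_calls = run(slow_strategy=True)
fast_result, fast_calls = run(slow_strategy=False)

for label, calls in (("slow_strategy=True", slow_calls),
                     ("slow_strategy=False", fast_calls)):
    print(label)
    for name, kind in calls:
        print(f'  "{name}" is a {kind}')
```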

In conclusion, the transformation process seems to follow these steps:

  • 1 - Apply the transformation on each column of the first group (slow path)
  • 2 - Try to apply the transformation on all columns of the first group (fast path)
  • 3 - Check the result of 2: use the fast path (4b) if the checks pass, else fall back to the slow path (4a)
  • 4a - Apply the transformation on each column of the remaining groups (slow path)
  • 4b - Apply the transformation on all columns of the remaining groups (fast path)

Disclaimer: I'm not a Pandas developer, so some of my understanding may be wrong, but the observed results and my reading of the code seem to confirm this strategy.

