June 16, 2023 10:49:51

transform with groupby: want to print each column name in each group in the function fed to transform, but something strange happened

Question

I am using transform with groupby and want to print each column name in each group inside the function passed to transform. The output is strange, though: there is an extra block labelled '600010.SH', which is the first group key.

My data is in test_data.csv:

code	date	high	low
600010.SH	2022/12/29	1.99	1.92
600010.SH	2022/12/29	1.94	1.91
600028.SH	2022/12/30	4.41	4.34
600028.SH	2022/12/29	4.38	4.35

My code is:

import pandas as pd
i = 1
def MAD_single(grp):
    global i
    print(i, grp.name)
    print(grp.head(10))
    i += 1
    # computing here...
df = pd.read_csv('data_input/test_data.csv')
temp = df.groupby('code')[['high', 'low']].transform(MAD_single)

The output is:

1 high
0    1.99
1    1.94
Name: high, dtype: float64
2 low
0    1.92
1    1.91
Name: low, dtype: float64
3 600010.SH
   high   low
0  1.99  1.92
1  1.94  1.91
4 high
2    4.41
3    4.38
Name: high, dtype: float64
5 low
2    4.34
3    4.35
Name: low, dtype: float64

I am wondering:

Why is the following output printed?
Why is the second group key not printed the way 600010.SH is?

3 600010.SH
   high   low
0  1.99  1.92
1  1.94  1.91

Thanks!

Answer 1

Score: 2

This is a great question, but to understand the result you get, you need to understand how pandas handles the transformation.

Here, I will only deal with the case where you transform:

a DataFrame (not a Series)
with a custom user function (not a string, like 'mean' or 'sum')
with the default engine (not numba or cython)

The implementation of this case can be found in the _transform_general method. All other cases are almost the same but not quite.

When trying to transform a DataFrame, there are 2 main steps:

Apply your function on the first group
Apply your function on other groups

During the first step, pandas tries to find the best path to apply the transformation.

Pandas defines two lambda functions in the _define_paths method:

fast_path: the function is applied to the whole DataFrame at once
slow_path: the function is applied to each column (or index) one at a time

fast_path is something like func(group), and slow_path is something like group.apply(lambda x: func(x)).
This may explain why the two functions are named fast and slow.
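
As a rough illustration, the sketch below mimics the two paths on the first group of the question's data. The lambdas only paraphrase what _define_paths builds; func, group and seen are names made up for this example, not pandas internals.

```python
import pandas as pd

seen = []  # records the type of object passed to each call of func


def func(x):
    seen.append(type(x).__name__)
    return x * 5


# First group of the question's data (code == '600010.SH')
group = pd.DataFrame({"high": [1.99, 1.94], "low": [1.92, 1.91]})

# Paraphrase of the two paths: whole frame at once vs. column by column
fast_path = lambda g: func(g)
slow_path = lambda g: g.apply(lambda col: func(col))

fast_result = fast_path(group)
fast_seen = list(seen)
seen.clear()

slow_result = slow_path(group)
slow_seen = list(seen)

print(fast_seen)       # types received on the fast path
print(set(slow_seen))  # types received on the slow path
print(fast_result.equals(slow_result))
```

The fast path hands the function a DataFrame, while the slow path hands it one Series per column, which is why grp.name is sometimes a column name and sometimes a group key.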

Once they are defined, pandas can call them on the first group. This is what the _choose_path method does.
The trick is here: pandas first applies the slow_path function to the first group (the safe option), then tries to apply the fast_path function to the same group.
If the fast_path call succeeds and its result passes some checks, pandas uses fast_path for the remaining groups to speed up the process.
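
The selection logic can be sketched as follows. This is a simplified paraphrase based on my reading of _choose_path, not the actual pandas code: choose_path is a hypothetical helper, and details such as special handling of some exception types are left out.

```python
import pandas as pd


def choose_path(func, group):
    """Return 'fast' or 'slow' for the first group (simplified sketch)."""
    # The slow path always runs first: one call per column
    slow_result = group.apply(lambda col: func(col))
    try:
        # Trial call on the whole frame
        fast_result = func(group)
    except Exception:
        return "slow"
    # The fast result must be labelled by the group's columns...
    if isinstance(fast_result, pd.DataFrame):
        if not fast_result.columns.equals(group.columns):
            return "slow"
    elif isinstance(fast_result, pd.Series):
        if not fast_result.index.equals(group.columns):
            return "slow"
    else:
        return "slow"  # e.g. a scalar
    # ...and must match what the slow path produced
    return "fast" if fast_result.equals(slow_result) else "slow"


group = pd.DataFrame({"high": [1.99, 1.94], "low": [1.92, 1.91]})
print(choose_path(lambda x: x * 5, group))  # same-shaped result
print(choose_path(lambda x: 5, group))      # scalar result
```

Either way, the function is called once with the first group as a whole DataFrame, which accounts for the extra 600010.SH block in the question.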

This matters for large transformations with many groups. Suppose you have a DataFrame with 100 groups and 30 columns to transform.
With the slow path, your function is called 3001 times (30*100 for every column of every group, plus 1 for the fast-path attempt on the first group), against only 130 times with the fast path (30+1 for the first group, plus 99 for the remaining groups).
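
The counts can be checked with a little arithmetic. The helper names below are made up for this sketch, and it assumes at least two groups, so that the fast-path attempt on the first group actually happens.

```python
def slow_path_calls(n_groups, n_cols):
    # Every column of every group, plus one fast-path attempt on the first group
    return n_groups * n_cols + 1


def fast_path_calls(n_groups, n_cols):
    # First group: n_cols slow calls + 1 fast attempt,
    # then one call per remaining group
    return n_cols + 1 + (n_groups - 1)


print(slow_path_calls(100, 30))  # 3001
print(fast_path_calls(100, 30))  # 130

# The question's data: 2 groups, 2 columns
print(slow_path_calls(2, 2), fast_path_calls(2, 2))  # 5 4
```

For the question's data this gives 5 calls on the slow path, matching the five numbered blocks in the output.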

However, keep in mind that whether the fast or slow path is used depends only on what the custom function returns. I slightly modified your custom function to make this easier to see:

i = 0
def MAD_single(grp, slow_strategy=True):
    global i
    print(f'{i}: "{grp.name}" is a {grp.__class__.__name__}')
    i += 1
    return 5 if slow_strategy else grp * 5
g = df.groupby('code')[['high', 'low']]

First run, we force slow_path by returning a scalar value:

>>> g.transform(MAD_single, slow_strategy=True)
0: "high" is a Series  # 1st group, slow_path
1: "low" is a Series  # 1st group, slow_path
2: "600010.SH" is a DataFrame  # 1st group, fast_path
3: "high" is a Series  # 2nd group, slow_path
4: "low" is a Series  # 2nd group, slow_path

The fast_path function was called only once because the result didn't pass the checks.

Now, we force fast_path by returning an object with the same shape as the input (assuming i has been reset to 0):

>>> g.transform(MAD_single, slow_strategy=False)
0: "high" is a Series  # 1st group, slow_path
1: "low" is a Series  # 1st group, slow_path
2: "600010.SH" is a DataFrame  # 1st group, fast_path
3: "600028.SH" is a DataFrame  # 2nd group, fast_path

As the result passes the checks, the fast_path function is used for the remaining groups.

In conclusion, the transformation process seems to follow these steps:

1 - Apply the transformation on each column of the first group (slow path)
2 - Try to apply the transformation on all columns of the first group at once (fast path)
3 - Check the result of 2: use the fast path (4a) if the checks pass, else the slow path (4b) as fallback
4a - Apply the transformation on all columns of the remaining groups at once (fast path)
4b - Apply the transformation on each column of the remaining groups (slow path)

Disclaimer: I'm not a pandas developer, so some of my understanding may be wrong, but the results obtained and the code analysis seem to confirm this strategy.
