How to create a new column based on other rows in a pandas DataFrame?
Question
I have a DataFrame with 200,000 rows, and I am trying to add columns based on conditions over other rows. I implemented it, but it takes a very long time (about 2 hours).
Here is my code:
for index in dataset.index:
    A_id = dataset.loc[index, 'A_id']
    B_id = dataset.loc[index, 'B_id']
    C_date = dataset.loc[index, 'C_date']
    subset = dataset[
        (dataset['A_id'] == A_id) & (dataset['B_id'] == B_id) & (
                dataset['C_date'] < C_date)]
    dataset.at[index, 'D_mean'] = subset['D'].mean()
    dataset.at[index, 'E_mean'] = subset['E'].mean()
My DataFrame looks like this:
A = [1, 2, 1, 2, 1, 2]
B = [10, 20, 10, 20, 10, 20]
C = ["22-02-2019", "28-02-19", "07-03-2019", "14-03-2019", "21-12-2019", "11-10-2019"]
D = [10, 12, 21, 81, 20, 1]
E = [7, 10, 14, 31, 61, 9]
dataset = pd.DataFrame({
    'A_id': A,
    'B_id': B,
    'C_date': C,
    'D': D,
    'E': E,
})
dataset.C_date = pd.to_datetime(dataset.C_date)
dataset
Out[27]: 
   A_id  B_id     C_date   D   E
0     1    10 2019-02-22  10   7
1     2    20 2019-02-28  12  10
2     1    10 2019-07-03  21  14
3     2    20 2019-03-14  81  31
4     1    10 2019-12-21  20  61
5     2    20 2019-11-10   1   9
I would like to get this result in a more efficient way than my solution:
   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5
Do you have any ideas?
Answer 1
Score: 3
We can use a combination of functions to achieve this, most notably pd.DataFrame.rolling to calculate the moving average.
import numpy as np

def custom_agg(group):
    cols = ['D', 'E']
    for col in cols:
        name = '{}_mean'.format(col)
        # Mean of all previous rows: shift() drops the current row, and the
        # full-length rolling window with min_periods=2 averages what is left.
        group[name] = group[col].shift() \
                                .rolling(len(group[col]), min_periods=2) \
                                .mean() \
                                .fillna(group[col].iloc[0])
        # The first row of each group has no history, so blank it out.
        group.iloc[0, group.columns.get_loc(name)] = np.nan
    return group

dataset.groupby(['A_id', 'B_id'], as_index=False).apply(custom_agg)
   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5
There might be an even more elegant way of doing this, but you should already see a performance increase using this method. Just make sure the C_date column is sorted ahead of time, since this is a moving average.
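The sorting requirement can be sketched as follows. This is my own illustrative variant (not the answerer's code), using `expanding().mean().shift()` to average all strictly earlier rows in each group; it assumes dates are unique within each (A_id, B_id) group, since `shift` excludes only the current row rather than all rows sharing its date.

```python
import pandas as pd

# Rebuild the question's example data (dates normalized to ISO format).
dataset = pd.DataFrame({
    'A_id': [1, 2, 1, 2, 1, 2],
    'B_id': [10, 20, 10, 20, 10, 20],
    'C_date': pd.to_datetime(['2019-02-22', '2019-02-28', '2019-07-03',
                              '2019-03-14', '2019-12-21', '2019-11-10']),
    'D': [10, 12, 21, 81, 20, 1],
    'E': [7, 10, 14, 31, 61, 9],
})

# Sort so that within each (A_id, B_id) group the rows are in date order.
dataset = dataset.sort_values(['A_id', 'B_id', 'C_date'])

# expanding().mean() is the cumulative mean of the group so far;
# shift() drops the current row, leaving the mean of earlier rows only.
grouped = dataset.groupby(['A_id', 'B_id'])
for col in ['D', 'E']:
    dataset[col + '_mean'] = grouped[col].transform(
        lambda s: s.expanding().mean().shift())

dataset = dataset.sort_index()  # restore the original row order
```

This avoids the per-group Python loop entirely, which matters at 200k rows.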
Answer 2
Score: 0
I suspected that your creation of subset in the loop was expensive, and my testing showed that your algorithm was processing about 11,000 indices per minute. I came up with an alternative algorithm that pre-sorts the data so that computing the subset becomes trivial; running it over a 200k-row dataset of random data takes under 5 minutes.
dataset.sort_values(by=['A_id', 'B_id', 'C_date'], inplace=True)
dataset.reset_index(drop=True, inplace=True)
last_A = None
last_B = None
first_index = -1
for index in dataset.index:
    A_id = dataset.loc[index, 'A_id']
    B_id = dataset.loc[index, 'B_id']
    C_date = dataset.loc[index, 'C_date']
    if (last_A != A_id) | (last_B != B_id):
        first_index = index
        last_A = A_id
        last_B = B_id
    subset = dataset[first_index:index]
    dataset.at[index, 'D_mean'] = subset['D'].mean()
    dataset.at[index, 'E_mean'] = subset['E'].mean()
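Since each group's rows become contiguous after sorting, the subset means can also be maintained incrementally. The following is a hypothetical variation of the approach above (not the answerer's code): instead of slicing a growing subset and re-averaging it, keep running sums and a count per group, so each row is O(1).

```python
import math
import pandas as pd

# The question's example data (dates normalized to ISO format).
dataset = pd.DataFrame({
    'A_id': [1, 2, 1, 2, 1, 2],
    'B_id': [10, 20, 10, 20, 10, 20],
    'C_date': pd.to_datetime(['2019-02-22', '2019-02-28', '2019-07-03',
                              '2019-03-14', '2019-12-21', '2019-11-10']),
    'D': [10, 12, 21, 81, 20, 1],
    'E': [7, 10, 14, 31, 61, 9],
})

dataset.sort_values(by=['A_id', 'B_id', 'C_date'], inplace=True)
dataset.reset_index(drop=True, inplace=True)

last_key = None
sum_d = sum_e = count = 0
for index in dataset.index:
    key = (dataset.at[index, 'A_id'], dataset.at[index, 'B_id'])
    if key != last_key:  # entering a new group: reset the running totals
        last_key = key
        sum_d = sum_e = count = 0
    # Mean over strictly earlier rows of the same group, updated in O(1).
    dataset.at[index, 'D_mean'] = sum_d / count if count else math.nan
    dataset.at[index, 'E_mean'] = sum_e / count if count else math.nan
    sum_d += dataset.at[index, 'D']
    sum_e += dataset.at[index, 'E']
    count += 1
```

Note that the rows stay in sorted order here; reattach the original index before sorting if the input order must be restored.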
Answer 3
Score: 0
Here's one way to do it using .apply:
dataset[['D_mean', 'E_mean']] = (dataset
                                .apply(lambda df: dataset[(dataset['A_id'] == df['A_id']) & 
                                                          (dataset['B_id'] == df['B_id']) & 
                                                          (dataset['C_date'] < df['C_date'])
                                                          ][['D','E']].mean(axis=0), axis=1))
   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5
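The row-wise .apply above still evaluates the full boolean mask for every row, so it remains quadratic in the number of rows. As a sketch of my own (not from any of the answers), the same strict-date condition can be vectorized with per-date cumulative sums; unlike shift-based approaches, this also handles duplicate dates within a group correctly.

```python
import pandas as pd

# The question's example data (dates normalized to ISO format).
dataset = pd.DataFrame({
    'A_id': [1, 2, 1, 2, 1, 2],
    'B_id': [10, 20, 10, 20, 10, 20],
    'C_date': pd.to_datetime(['2019-02-22', '2019-02-28', '2019-07-03',
                              '2019-03-14', '2019-12-21', '2019-11-10']),
    'D': [10, 12, 21, 81, 20, 1],
    'E': [7, 10, 14, 31, 61, 9],
})

# Aggregate per (group, date) so rows sharing a date are treated together.
per_date = (dataset.groupby(['A_id', 'B_id', 'C_date'])[['D', 'E']]
                   .agg(['sum', 'count']))

# Cumulative sums/counts over dates, shifted by one date so each date
# only sees strictly earlier dates within its group.
cum = (per_date.groupby(level=['A_id', 'B_id']).cumsum()
               .groupby(level=['A_id', 'B_id']).shift())

means = pd.DataFrame({
    'D_mean': cum[('D', 'sum')] / cum[('D', 'count')],
    'E_mean': cum[('E', 'sum')] / cum[('E', 'count')],
})

# Map the per-date means back onto the original rows.
result = dataset.join(means, on=['A_id', 'B_id', 'C_date'])
```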