How to create a new column based on other rows in a pandas DataFrame?

Question
I have a DataFrame with 200k rows, and I am trying to add columns based on other rows that satisfy some conditions. I implemented it, but it takes a very long time (about 2 hours).
Here is my code:
for index in dataset.index:
    A_id = dataset.loc[index, 'A_id']
    B_id = dataset.loc[index, 'B_id']
    C_date = dataset.loc[index, 'C_date']
    subset = dataset[
        (dataset['A_id'] == A_id) & (dataset['B_id'] == B_id) &
        (dataset['C_date'] < C_date)]
    dataset.at[index, 'D_mean'] = subset['D'].mean()
    dataset.at[index, 'E_mean'] = subset['E'].mean()
My DataFrame looks like this:
import pandas as pd

A = [1, 2, 1, 2, 1, 2]
B = [10, 20, 10, 20, 10, 20]
C = ["22-02-2019", "28-02-19", "07-03-2019", "14-03-2019", "21-12-2019", "11-10-2019"]
D = [10, 12, 21, 81, 20, 1]
E = [7, 10, 14, 31, 61, 9]

dataset = pd.DataFrame({
    'A_id': A,
    'B_id': B,
    'C_date': C,
    'D': D,
    'E': E,
})
dataset.C_date = pd.to_datetime(dataset.C_date)
dataset
Out[27]:
   A_id  B_id     C_date   D   E
0     1    10 2019-02-22  10   7
1     2    20 2019-02-28  12  10
2     1    10 2019-07-03  21  14
3     2    20 2019-03-14  81  31
4     1    10 2019-12-21  20  61
5     2    20 2019-11-10   1   9
I would like to get this result in a more efficient way than my solution:
   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5
Do you have any ideas?
Answer 1 (score: 3)
We can use a combination of functions to achieve this, most notably pd.DataFrame.rolling to calculate the moving average:
import numpy as np

def custom_agg(group):
    cols = ['D', 'E']
    for col in cols:
        name = '{}_mean'.format(col)
        # Mean of all *previous* rows: shift, then a rolling window as
        # wide as the group; fillna handles the single-predecessor case.
        group[name] = group[col].shift() \
            .rolling(len(group[col]), min_periods=2) \
            .mean() \
            .fillna(group[col].iloc[0])
        # The first row of each group has no previous rows.
        group.loc[group.index[0], name] = np.nan
    return group

dataset.groupby(['A_id', 'B_id'], as_index=False).apply(custom_agg)
   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5
There might be an even more elegant way of doing this; however, you should already see a performance increase using this method. Just make sure the C_date column is sorted ahead of time, since this is a moving average.
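As a sketch of the same "mean of all previous rows per group" idea, a groupby with an expanding mean avoids the rolling-window sizing and fillna juggling. This variant is not from the answer above; it rebuilds the sample data with the dates already parsed (ISO format, matching the question's Out[27] table):

```python
import pandas as pd

A = [1, 2, 1, 2, 1, 2]
B = [10, 20, 10, 20, 10, 20]
C = pd.to_datetime(["2019-02-22", "2019-02-28", "2019-07-03",
                    "2019-03-14", "2019-12-21", "2019-11-10"])
dataset = pd.DataFrame({'A_id': A, 'B_id': B, 'C_date': C,
                        'D': [10, 12, 21, 81, 20, 1],
                        'E': [7, 10, 14, 31, 61, 9]})

# Sort within each (A_id, B_id) group by date, take the mean of all
# *previous* rows (shift, then expanding mean), then restore row order.
dataset = dataset.sort_values(['A_id', 'B_id', 'C_date'])
grouped = dataset.groupby(['A_id', 'B_id'])
for col in ['D', 'E']:
    dataset[col + '_mean'] = grouped[col].transform(
        lambda s: s.shift().expanding().mean())
dataset = dataset.sort_index()
```

The shift pushes each row's own value out of its window, so the expanding mean only ever sees strictly earlier rows, and the first row of each group naturally comes out as NaN.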
Answer 2 (score: 0)
I suspected that your creation of subset inside the loop was expensive, and my testing showed your algorithm running at roughly 11,000 indices per minute. I came up with an alternative algorithm that pre-sorts the data so that computing the subset becomes trivial; running it over a 200k-row dataset of random data takes under 5 minutes.
dataset.sort_values(by=['A_id', 'B_id', 'C_date'], inplace=True)
dataset.reset_index(drop=True, inplace=True)

last_A = None
last_B = None
first_index = -1
for index in dataset.index:
    A_id = dataset.loc[index, 'A_id']
    B_id = dataset.loc[index, 'B_id']
    C_date = dataset.loc[index, 'C_date']
    if (last_A != A_id) | (last_B != B_id):
        first_index = index
        last_A = A_id
        last_B = B_id
    subset = dataset[first_index:index]
    dataset.at[index, 'D_mean'] = subset['D'].mean()
    dataset.at[index, 'E_mean'] = subset['E'].mean()
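The same pre-sorted idea can be vectorized all the way, because the mean of the previous rows is just (running sum - current value) / (running count). A sketch on the sample data, not part of the original answer:

```python
import pandas as pd

A = [1, 2, 1, 2, 1, 2]
B = [10, 20, 10, 20, 10, 20]
C = pd.to_datetime(["2019-02-22", "2019-02-28", "2019-07-03",
                    "2019-03-14", "2019-12-21", "2019-11-10"])
dataset = pd.DataFrame({'A_id': A, 'B_id': B, 'C_date': C,
                        'D': [10, 12, 21, 81, 20, 1],
                        'E': [7, 10, 14, 31, 61, 9]})

# Same pre-sort as above, then replace the Python loop with per-group
# cumulative sums and counts of the rows strictly before each row.
dataset = dataset.sort_values(['A_id', 'B_id', 'C_date'])
g = dataset.groupby(['A_id', 'B_id'])
prev_count = g.cumcount()                    # rows before this one in its group
for col in ['D', 'E']:
    prev_sum = g[col].cumsum() - dataset[col]
    dataset[col + '_mean'] = prev_sum / prev_count   # 0/0 -> NaN on first rows
dataset = dataset.sort_index()
```

This does a constant number of grouped passes over the data instead of one Python-level iteration per row, so it scales much better to 200k rows.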
Answer 3 (score: 0)
Here's one way to do it using .apply:
dataset[['D_mean', 'E_mean']] = (dataset
    .apply(lambda df: dataset[(dataset['A_id'] == df['A_id']) &
                              (dataset['B_id'] == df['B_id']) &
                              (dataset['C_date'] < df['C_date'])
                              ][['D', 'E']].mean(axis=0), axis=1))
   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5
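For completeness, here is a self-contained variant of this .apply approach on the sample data. Assigning the result column by column (rather than through the two-column setitem) is my own tweak to sidestep version-dependent column-alignment behavior; note the approach is still quadratic, since every row scans the whole frame:

```python
import pandas as pd

A = [1, 2, 1, 2, 1, 2]
B = [10, 20, 10, 20, 10, 20]
C = pd.to_datetime(["2019-02-22", "2019-02-28", "2019-07-03",
                    "2019-03-14", "2019-12-21", "2019-11-10"])
dataset = pd.DataFrame({'A_id': A, 'B_id': B, 'C_date': C,
                        'D': [10, 12, 21, 81, 20, 1],
                        'E': [7, 10, 14, 31, 61, 9]})

# For each row, average D and E over earlier rows of the same
# (A_id, B_id) pair; an empty selection yields NaN means.
means = dataset.apply(
    lambda row: dataset.loc[(dataset['A_id'] == row['A_id']) &
                            (dataset['B_id'] == row['B_id']) &
                            (dataset['C_date'] < row['C_date']),
                            ['D', 'E']].mean(axis=0),
    axis=1)
dataset['D_mean'] = means['D']
dataset['E_mean'] = means['E']
```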