Pandas increment every nth row in dataframe with conditions and groupby
Question
This is a simplified example of my current dataframe (df1), after merging results from a nested loop calculation back to customer data.
My actual dataframe has more than 10 million rows, so I'm dealing with large data and would prefer the fastest approach.
import pandas as pd

df1 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'],  # customer data
                    "calc_amt": [1000,500,200,300,100,50,30,200],
                    "month_end": ['28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023'],
                    "period": [2,2,2,2,6,6,6,6]})
I am trying to find the most efficient way to do the following.
Grouping by each user's id, I'd like to:
- increment the period column by 1 starting at every 3rd row, i.e. after each block of 2 rows
- increment the month_end column by one month on the same rows (to the next month-end date)
- add a calculation number column (calc_num) to label the calculations within each block.
My expected output (df2):
df2 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'], #customer data
"calc_amt": [1000,500,200,300,100,50,30,200],
"month_end":['28-02-2023','28-02-2023','31-03-2023','31-03-2023','28-02-2023','28-02-2023','31-03-2023','31-03-2023'],
"period":[2,2,3,3,6,6,7,7],
"calc_num":[1,2,1,2,1,2,1,2],})
Answer 1
Score: 1
You can use groupby.cumcount to enumerate the rows per group, then use floor division (how many blocks of N rows precede the row) and modulo (the row's position within its block):
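import numpy as np
import pandas as pd

N = 2  # periodicity: rows per block
# ensure datetime (the sample dates are day-first strings)
df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')
# enumerate rows within each id
c = df1.groupby('id').cumcount()
df1['period'] += c.floordiv(N)
df1['calc_num'] = c.mod(N).add(1)
# move each date forward by one month end per preceding block
df1['month_end'] += np.array([pd.offsets.MonthEnd(x) for x in c.floordiv(N)])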
NB. If you want to create a new dataframe, first run df2 = df1.copy(), then use df2 in place of df1.
Output (this run used a longer sample in which z111 has six rows, so a third block is visible):
id calc_amt month_end period calc_num
0 z111 1000 2023-02-28 2 1
1 z111 500 2023-02-28 2 2
2 z111 200 2023-03-31 3 1
3 z111 300 2023-03-31 3 2
4 z111 400 2023-04-30 4 1
5 z111 450 2023-04-30 4 2
6 z112 100 2023-02-28 6 1
7 z112 50 2023-02-28 6 2
8 z112 30 2023-03-31 7 1
9 z112 200 2023-03-31 7 2
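To make the roles of cumcount, floor division and modulo concrete, here is a minimal sketch on a toy frame that contains only the id column. The column names cumcount and block are illustrative labels chosen here, not part of the original answer.

```python
import pandas as pd

# Toy frame: only the grouping key matters for this illustration.
df = pd.DataFrame({"id": ["z111"] * 4 + ["z112"] * 4})

N = 2  # rows per block
c = df.groupby("id").cumcount()  # restarts at 0 for every id

steps = pd.DataFrame({
    "id": df["id"],
    "cumcount": c,          # position of the row inside its id group
    "block": c // N,        # number of full blocks before this row
    "calc_num": c % N + 1,  # 1-based position inside the current block
})
print(steps)
```

The block column is what gets added to period (and, as a number of months, to month_end), while calc_num is used as is.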
Answer 2
Score: 0
Use GroupBy.cumcount as a per-group counter and apply integer division and modulo by 2. For the dates, convert to monthly periods with Series.dt.to_period, add the number of months, and convert back to month-end timestamps:
# the sample dates are day-first strings, so state the format explicitly
df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')
g = df1.groupby('id').cumcount()
df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = (df1['month_end'].dt.to_period('M') +
                              g // 2).dt.to_timestamp(how='e').dt.normalize())
print (df2)
id calc_amt month_end period calc_num
0 z111 1000 2023-02-28 2 1
1 z111 500 2023-02-28 2 2
2 z111 200 2023-03-31 3 1
3 z111 300 2023-03-31 3 2
4 z112 100 2023-02-28 6 1
5 z112 50 2023-02-28 6 2
6 z112 30 2023-03-31 7 1
7 z112 200 2023-03-31 7 2
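As a self-contained sketch, the period-based approach can be wrapped in a function with the block size as a parameter and checked against the expected values from the question. The function name add_block_increments and its n and key parameters are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def add_block_increments(df, n=2, key="id"):
    """Hypothetical helper: bump period and month_end once per block of n rows per key."""
    c = df.groupby(key).cumcount()
    block = c // n
    # Assumes day-first strings such as '28-02-2023', as in the question.
    month_end = pd.to_datetime(df["month_end"], format="%d-%m-%Y")
    return df.assign(
        period=df["period"] + block,
        calc_num=c % n + 1,
        month_end=(month_end.dt.to_period("M") + block)
        .dt.to_timestamp(how="e")
        .dt.normalize(),
    )

df1 = pd.DataFrame({
    "id": ["z111"] * 4 + ["z112"] * 4,
    "calc_amt": [1000, 500, 200, 300, 100, 50, 30, 200],
    "month_end": ["28-02-2023"] * 8,
    "period": [2, 2, 2, 2, 6, 6, 6, 6],
})

df2 = add_block_increments(df1)
print(df2)
```

Because assign returns a new frame, df1 is left untouched, which makes it easy to compare input and output side by side.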
Or use a list comprehension with offsets.MonthEnd:
# the sample dates are day-first strings, so state the format explicitly
df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')
g = df1.groupby('id').cumcount()
df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = [x + pd.offsets.MonthEnd(y)
                              for x, y in zip(df1['month_end'], g // 2)])
print (df2)
id calc_amt month_end period calc_num
0 z111 1000 2023-02-28 2 1
1 z111 500 2023-02-28 2 2
2 z111 200 2023-03-31 3 1
3 z111 300 2023-03-31 3 2
4 z112 100 2023-02-28 6 1
5 z112 50 2023-02-28 6 2
6 z112 30 2023-03-31 7 1
7 z112 200 2023-03-31 7 2
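Because you are working with a large DataFrame, here is a trick for adding months efficiently in NumPy: truncate each date to its month, add the number of months from the integer division by 2 plus one more month, and then subtract one day to land on the month end: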
import numpy as np

# the sample dates are day-first strings, so state the format explicitly
df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')
g = df1.groupby('id').cumcount()
df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = df1['month_end'].values.astype('datetime64[M]') +
                             np.array(g.to_numpy() // 2 + 1, dtype='timedelta64[M]') -
                             np.array([1], dtype='timedelta64[D]')
                 )
print (df2)
id calc_amt month_end period calc_num
0 z111 1000 2023-02-28 2 1
1 z111 500 2023-02-28 2 2
2 z111 200 2023-03-31 3 1
3 z111 300 2023-03-31 3 2
4 z112 100 2023-02-28 6 1
5 z112 50 2023-02-28 6 2
6 z112 30 2023-03-31 7 1
7 z112 200 2023-03-31 7 2
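A quick way to gain confidence in the NumPy trick is to compare it with the period-based version on dates that cross a year boundary and hit a leap-year February. This is a minimal sketch with made-up sample dates; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up month-end dates and the number of months to add to each.
dates = pd.Series(pd.to_datetime(
    ["2023-02-28", "2023-12-31", "2024-01-31", "2023-11-30"]))
k = pd.Series([0, 1, 1, 3])

# Period-based version: month period + k months, then back to the month end.
via_period = (dates.dt.to_period("M") + k).dt.to_timestamp(how="e").dt.normalize()

# NumPy version: first day of the month + (k + 1) months - 1 day.
month_start = dates.to_numpy().astype("datetime64[M]")
via_numpy = (month_start
             + (k.to_numpy() + 1).astype("timedelta64[M]")
             - np.timedelta64(1, "D"))

period_days = via_period.dt.strftime("%Y-%m-%d").tolist()
numpy_days = pd.to_datetime(via_numpy).strftime("%Y-%m-%d").tolist()
print(period_days)
print(numpy_days)
```

Both versions always return the last day of the target month, so they match the MonthEnd-offset approach only when the input dates are themselves month-end dates, as they are in the question.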