Pandas increment every nth row in dataframe with conditions and groupby

Question

This is a simplified example of my current dataframe (df1), after merging the results of a nested loop calculation back into the customer data.
My actual dataframe has more than 10 million rows, so I'm dealing with large data and would prefer the fastest approach.


df1 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'], #customer data
                    "calc_amt": [1000,500,200,300,100,50,30,200],
                    "month_end":['28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023'],
                    "period":[2,2,2,2,6,6,6,6],})
  
  

I am trying to find the most efficient way to do the following. Grouping by each user's id, I'd like to:

  1. Increment the period column by 1 starting at every 3rd row (that is, after each block of two rows, as the expected output shows).
  2. Increment the month_end column by one month (to the next month-end date) on the same rows.
  3. Add a calculation number column (calc_num) to label the calculations.

My expected output (df2):


df2 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'], #customer data
                    "calc_amt": [1000,500,200,300,100,50,30,200],
                    "month_end":['28-02-2023','28-02-2023','31-03-2023','31-03-2023','28-02-2023','28-02-2023','31-03-2023','31-03-2023'],
                    "period":[2,2,3,3,6,6,7,7],
                    "calc_num":[1,2,1,2,1,2,1,2],})
                    

Answer 1

Score: 1

You can use groupby.cumcount to enumerate the rows per group, then use the modulo or floor division:

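import numpy as np
import pandas as pd

N = 2  # periodicity: number of rows per block

# ensure datetime (the sample dates are day-first strings)
df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

# enumerate rows within each id (0, 1, 2, ...)
c = df1.groupby('id').cumcount()

df1['period'] += c.floordiv(N)
df1['calc_num'] = c.mod(N).add(1)
# MonthEnd(0) leaves a month-end date unchanged; MonthEnd(k) moves it k month ends forward
df1['month_end'] += np.array([pd.offsets.MonthEnd(x) for x in c.floordiv(N)])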

Note: to create a new dataframe instead of modifying df1 in place, first run df2 = df1.copy(), then apply the steps above to df2.

Output (shown for an extended version of the sample data in which z111 has two additional rows):

     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z111       400 2023-04-30       4         1
5  z111       450 2023-04-30       4         2
6  z112       100 2023-02-28       6         1
7  z112        50 2023-02-28       6         2
8  z112        30 2023-03-31       7         1
9  z112       200 2023-03-31       7         2
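
To make the role of each operation visible, here is a minimal sketch that prints the intermediate values for the question's sample ids. The `steps` frame and its `cumcount` and `block` column names are only illustrative and are not part of the original answer.

```python
import pandas as pd

# The question's sample frame, reduced to the columns needed here.
df = pd.DataFrame({
    "id": ["z111"] * 4 + ["z112"] * 4,
    "period": [2, 2, 2, 2, 6, 6, 6, 6],
})

N = 2  # rows per block, as in the answer above

# 0-based position of each row within its id group
c = df.groupby("id").cumcount()

# Illustrative table of the intermediate values
steps = pd.DataFrame({
    "id": df["id"],
    "cumcount": c,
    "block": c // N,        # number of periods/months to add
    "calc_num": c % N + 1,  # label 1..N inside each block
    "period": df["period"] + c // N,
})
print(steps)
```

Floor division turns the row counter into a block index (0 for the first N rows, 1 for the next N, and so on), while the modulo restarts the label inside every block.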

Answer 2

Score: 0


Use GroupBy.cumcount as a per-group counter, apply integer division and modulo by 2, then add the resulting number of months after converting the dates to monthly periods with Series.dt.to_period:

# the sample dates are day-first strings
df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

# 0-based row counter within each id
g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = (df1['month_end'].dt.to_period('M') + 
                              g // 2).dt.to_timestamp(how='e').dt.normalize())

print (df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2
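
The period round trip used above can be seen in isolation in the following sketch. The two dates and the `shift` values are made up for illustration; the second date is chosen so that the result crosses a year boundary and lands in a leap-year February.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-02-28", "2023-12-31"]))
shift = pd.Series([1, 2])  # hypothetical number of months to add per row

# Convert to monthly periods, add whole months, then go back to timestamps.
periods = dates.dt.to_period("M")
shifted = periods + shift

# how='e' returns the last instant of the period; normalize() resets the
# time of day to midnight so only the date remains.
result = shifted.dt.to_timestamp(how="e").dt.normalize()
print(result)
```

Because the arithmetic happens on monthly periods, differing month lengths are handled by the conversion back to timestamps rather than by the addition itself.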

Or use a list comprehension with offsets.MonthEnd:

df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = [x + pd.offsets.MonthEnd(y) for x , y 
                              in zip(df1['month_end'], g // 2)])

print (df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2
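
One detail worth checking before relying on offsets.MonthEnd is how it treats dates that are not already on a month end. The sketch below uses a hypothetical mid-month date for contrast; the question's data only contains month-end dates, so it is not affected.

```python
import pandas as pd

on_month_end = pd.Timestamp("2023-02-28")
mid_month = pd.Timestamp("2023-02-15")  # hypothetical, not in the question's data

# Starting on a month end: 0 keeps the date, 1 moves to the next month end.
same = on_month_end + pd.offsets.MonthEnd(0)
following = on_month_end + pd.offsets.MonthEnd(1)

# Starting mid-month: both 0 and 1 land on the end of the current month.
rolled_zero = mid_month + pd.offsets.MonthEnd(0)
rolled_one = mid_month + pd.offsets.MonthEnd(1)

print(same, following, rolled_zero, rolled_one)
```

If the real data could contain dates inside a month, the period-based or NumPy-based variants in this answer may be the safer choice, since they always count whole months from the date's own month.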

Because you are working with a large DataFrame, here is a trick for adding months efficiently: truncate each date to its month, add the integer-division result plus one month, then subtract one day:

import numpy as np

df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = df1['month_end'].values.astype('datetime64[M]') + 
                             np.array(g.to_numpy() // 2 + 1, dtype='timedelta64[M]') - 
                             np.array([1], dtype='timedelta64[D]')
                             )

print (df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2
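
The NumPy arithmetic behind this trick can be followed step by step in the sketch below. The dates and `months_to_add` values are illustrative assumptions, not taken from the question.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2023-02-28", "2023-12-31"]).to_numpy()
months_to_add = np.array([1, 2])  # hypothetical per-row shift

# 1) Truncate to month resolution (first day of each month).
month_start = dates.astype("datetime64[M]")

# 2) Jump to the first day of the month AFTER the target month.
next_start = month_start + np.array(months_to_add + 1, dtype="timedelta64[M]")

# 3) Step back one day to land on the last day of the target month.
month_end = next_start - np.timedelta64(1, "D")
print(month_end)
```

Every step is a vectorized operation on whole arrays, with no Python-level loop over offset objects, which is why this variant tends to suit very large frames.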

huangapple
  • Posted on March 8, 2023, 18:50:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75672075.html