March 8, 2023 18:50:38

Pandas increment every nth row in dataframe with conditions and groupby

Question

This is a simplified example of my current dataframe (df1), after merging the results of a nested-loop calculation back into the customer data.
My actual dataframe has more than 10 million rows, so I'm dealing with large data and would prefer the fastest approach.


df1 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'], #customer data
                    "calc_amt": [1000,500,200,300,100,50,30,200],
                    "month_end":['28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023'],
                    "period":[2,2,2,2,6,6,6,6],})

I am trying to find the most efficient way to do the following.
Grouping by each user's id, I'd like to:

increment every 3rd row in the column named period by 1
increment every 3rd row in the column named month_end by 1 month (to the next month-end date)
add a calculation number column (calc_num) to label the calculations.

My expected output (df2)


df2 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'], #customer data
                    "calc_amt": [1000,500,200,300,100,50,30,200],
                    "month_end":['28-02-2023','28-02-2023','31-03-2023','31-03-2023','28-02-2023','28-02-2023','31-03-2023','31-03-2023'],
                    "period":[2,2,3,3,6,6,7,7],
                    "calc_num":[1,2,1,2,1,2,1,2],})

Answer 1

Score: 1

You can use groupby.cumcount to enumerate the rows within each group, then derive the increment with floor division and the calculation number with the modulo:

import numpy as np
import pandas as pd

N = 2  # periodicity: number of rows per block

# ensure datetime (the dates are day-first, e.g. 28-02-2023)
df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

# enumerate rows within each id (0, 1, 2, ...)
c = df1.groupby('id').cumcount()

df1['period'] += c.floordiv(N)
df1['calc_num'] = c.mod(N).add(1)
df1['month_end'] += np.array([pd.offsets.MonthEnd(x) for x in c.floordiv(N)])

NB. If you want to create a new dataframe, first run df2 = df1.copy(), then apply the steps above to df2.

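Output (produced from an input with two additional z111 rows, calc_amt 400 and 450, that are not in the df1 shown in the question):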

     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z111       400 2023-04-30       4         1
5  z111       450 2023-04-30       4         2
6  z112       100 2023-02-28       6         1
7  z112        50 2023-02-28       6         2
8  z112        30 2023-03-31       7         1
9  z112       200 2023-03-31       7         2
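
To make the role of each piece clearer, here is a minimal sketch of the intermediate values that cumcount, floordiv and mod produce. The frame toy and the columns pos and block are hypothetical names used only for illustration; they are not part of the answer.

```python
import pandas as pd

# Hypothetical toy frame: two ids with four rows each, as in the question.
toy = pd.DataFrame({"id": ["z111"] * 4 + ["z112"] * 4})

N = 2  # rows per block
c = toy.groupby("id").cumcount()   # position of each row within its id

toy["pos"] = c                     # 0, 1, 2, 3 per id
toy["block"] = c.floordiv(N)       # 0, 0, 1, 1 per id: amount added to period and months
toy["calc_num"] = c.mod(N).add(1)  # 1, 2, 1, 2 per id: label within the block
print(toy)
```

Because all three columns come from vectorized integer operations, this part should scale well to millions of rows; the date shift is the step most likely to dominate the runtime.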

Answer 2

Score: 0

Use GroupBy.cumcount to build a per-group counter, take its integer division and modulo by 2, then add the resulting number of months after converting the dates to monthly periods with Series.dt.to_period:

df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = (df1['month_end'].dt.to_period('M') + 
                              g // 2).dt.to_timestamp(how='e').dt.normalize())

print(df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2
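
As a small illustration of why the period round trip yields month-end dates, the sketch below applies the same chain to three hand-picked dates. The names dates, steps and result, and the sample values, are assumptions chosen for this example only.

```python
import pandas as pd

# Hypothetical sample: three month-end dates and the number of months to add.
dates = pd.Series(pd.to_datetime(["2023-02-28", "2023-02-28", "2023-12-31"]))
steps = pd.Series([0, 1, 2])

periods = dates.dt.to_period("M")   # 2023-02, 2023-02, 2023-12
shifted = periods + steps           # integer addition moves whole months

# how='e' gives the last instant of each month; normalize() resets it to midnight.
result = shifted.dt.to_timestamp(how="e").dt.normalize()
print(result)
```

Adding integers to monthly periods handles the year rollover and the leap-year February without any per-row Python code.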

Or use a list comprehension with offsets.MonthEnd:

df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = [x + pd.offsets.MonthEnd(y) for x, y
                              in zip(df1['month_end'], g // 2)])

print(df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2
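
One caveat with this variant, shown in the hedged sketch below: MonthEnd(n) counts month ends starting from the given date, so it only shifts by exactly n months when the input is already a month-end date, as it is in the question's data. The dates on_end and mid_month are hypothetical examples.

```python
import pandas as pd

on_end = pd.Timestamp("2023-02-28")     # already a month-end date
mid_month = pd.Timestamp("2023-02-15")  # hypothetical mid-month date

# On a month-end date, MonthEnd(0) is a no-op and MonthEnd(1) moves to the next month end.
same = on_end + pd.offsets.MonthEnd(0)
next_end = on_end + pd.offsets.MonthEnd(1)

# On a mid-month date, both roll forward to the end of the *current* month.
rolled0 = mid_month + pd.offsets.MonthEnd(0)
rolled1 = mid_month + pd.offsets.MonthEnd(1)

print(same.date(), next_end.date(), rolled0.date(), rolled1.date())
```

If the real data could contain dates that are not month ends, the period-based variant would be the safer choice, since it always shifts by whole calendar months.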

Because you are working with a large DataFrame, here is a trick for adding months efficiently: truncate the dates to month precision, add the integer division by 2 plus one extra month, then subtract one day:

import numpy as np

df1['month_end'] = pd.to_datetime(df1['month_end'], format='%d-%m-%Y')

g = df1.groupby('id').cumcount()

# first day of the month + (g // 2 + 1) months - 1 day = shifted month end
df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = df1['month_end'].values.astype('datetime64[M]') + 
                             np.array(g.to_numpy() // 2 + 1, dtype='timedelta64[M]') - 
                             np.array([1], dtype='timedelta64[D]')
                             )

print(df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2
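
The arithmetic behind this trick can be followed step by step in plain NumPy. This is only a sketch under assumed sample values; dates, steps and the intermediate variable names are hypothetical.

```python
import numpy as np

# Hypothetical sample: three month-end dates and the number of months to add.
dates = np.array(["2023-02-28", "2023-02-28", "2023-12-31"], dtype="datetime64[D]")
steps = np.array([0, 1, 2])

# Step 1: truncate to month precision (day information is dropped).
months = dates.astype("datetime64[M]")

# Step 2: move to the month *after* the target month.
after_target = months + (steps + 1).astype("timedelta64[M]")

# Step 3: the first day of that month minus one day is the target month end.
month_end = after_target.astype("datetime64[D]") - np.timedelta64(1, "D")
print(month_end)
```

Everything here runs on whole arrays, which is why this version tends to be faster than building one offset object per row.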
