2023-03-07 08:57:40


How to conditionally count based on grouping and time constraints

Question


I am counting the number of cases of lameness in cattle. I want to determine the number of cases in the current lactation and over the animal's lifetime.

The following sample data is provided as an example.

[[52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-03-04 00:00:00'), Timestamp('2022-03-04 00:00:00'), 106, 1, 1],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-09-21 00:00:00'), Timestamp('2022-03-04 00:00:00'), 106, 1, 1],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-09-21 00:00:00'), Timestamp('2022-09-21 00:00:00'), 307, 2, 2],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-10-04 00:00:00'), Timestamp('2022-03-04 00:00:00'), 106, 1, 2],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-10-04 00:00:00'), Timestamp('2022-09-21 00:00:00'), 307, 2, 3],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-10-05 00:00:00'), Timestamp('2022-03-04 00:00:00'), 106, 1, 2],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-10-05 00:00:00'), Timestamp('2022-09-21 00:00:00'), 307, 2, 3],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-10-06 00:00:00'), Timestamp('2022-03-04 00:00:00'), 106, 1, 2],
 [52316, Timestamp('2021-11-18 00:00:00'), Timestamp('2022-10-06 00:00:00'), Timestamp('2022-09-21 00:00:00'), 307, 2, 3],
 [35724, Timestamp('2018-08-22 00:00:00'), Timestamp('2018-09-08 00:00:00'), Timestamp('2018-08-26 00:00:00'), 4, 1, 1],
 [35724, Timestamp('2018-08-22 00:00:00'), Timestamp('2018-09-08 00:00:00'), Timestamp('2018-09-08 00:00:00'), 17, 2, 2],
 [35724, Timestamp('2018-08-22 00:00:00'), Timestamp('2018-11-13 00:00:00'), Timestamp('2018-08-26 00:00:00'), 4, 1, 2],
 [35724, Timestamp('2018-08-22 00:00:00'), Timestamp('2018-11-13 00:00:00'), Timestamp('2018-09-08 00:00:00'), 17, 2, 3],
 [35724, Timestamp('2018-08-22 00:00:00'), Timestamp('2018-11-13 00:00:00'), Timestamp('2018-10-05 00:00:00'), 44, 3, 4],
 [10295, Timestamp('2005-01-19 00:00:00'), Timestamp('2006-03-07 00:00:00'), Timestamp('2006-03-03 00:00:00'), 408, 1, 1],
 [10295, Timestamp('2008-04-30 00:00:00'), Timestamp('2009-01-08 00:00:00'), Timestamp('2008-06-12 00:00:00'), 43, 1, 2],
 [10295, Timestamp('2008-04-30 00:00:00'), Timestamp('2009-01-08 00:00:00'), Timestamp('2008-08-28 00:00:00'), 120, 2, 3],
 [10295, Timestamp('2008-04-30 00:00:00'), Timestamp('2009-01-08 00:00:00'), Timestamp('2008-12-01 00:00:00'), 215, 3, 4],
 [10295, Timestamp('2008-04-30 00:00:00'), Timestamp('2009-04-09 00:00:00'), Timestamp('2008-06-12 00:00:00'), 43, 1, 2],
 [10295, Timestamp('2008-04-30 00:00:00'), Timestamp('2009-04-09 00:00:00'), Timestamp('2008-08-28 00:00:00'), 120, 2, 3],
 [10295, Timestamp('2008-04-30 00:00:00'), Timestamp('2009-04-09 00:00:00'), Timestamp('2008-12-01 00:00:00'), 215, 3, 4],
 [10295, Timestamp('2008-04-30 00:00:00'), Timestamp('2009-04-09 00:00:00'), Timestamp('2009-02-09 00:00:00'), 285, 4, 5]]

This produces the following dataframe:

	NID	     Fdat	     RxDate	    LameDate	DIM	Lact_xLame	Life_xLame
0	52316	2021-11-18	2022-03-04	2022-03-04	106	   1	       1
1	52316	2021-11-18	2022-09-21	2022-03-04	106	   1	       1
2	52316	2021-11-18	2022-09-21	2022-09-21	307	   2	       2
3	52316	2021-11-18	2022-10-04	2022-03-04	106	   1	       2
4	52316	2021-11-18	2022-10-04	2022-09-21	307	   2	       3
5	52316	2021-11-18	2022-10-05	2022-03-04	106	   1	       2
6	52316	2021-11-18	2022-10-05	2022-09-21	307	   2	       3
7	52316	2021-11-18	2022-10-06	2022-03-04	106	   1	       2
8	52316	2021-11-18	2022-10-06	2022-09-21	307	   2	       3
9	35724	2018-08-22	2018-09-08	2018-08-26	4	   1	       1
10	35724	2018-08-22	2018-09-08	2018-09-08	17	   2	       2
11	35724	2018-08-22	2018-11-13	2018-08-26	4	   1	       2
12	35724	2018-08-22	2018-11-13	2018-09-08	17	   2	       3
13	35724	2018-08-22	2018-11-13	2018-10-05	44	   3	       4
14	10295	2005-01-19	2006-03-07	2006-03-03	408	   1	       1
15	10295	2008-04-30	2009-01-08	2008-06-12	43	   1	       2
16	10295	2008-04-30	2009-01-08	2008-08-28	120	   2	       3
17	10295	2008-04-30	2009-01-08	2008-12-01	215	   3	       4
18	10295	2008-04-30	2009-04-09	2008-06-12	43	   1	       2
19	10295	2008-04-30	2009-04-09	2008-08-28	120	   2	       3
20	10295	2008-04-30	2009-04-09	2008-12-01	215	   3	       4
21	10295	2008-04-30	2009-04-09	2009-02-09	285	   4	       5
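
For anyone wanting to reproduce this, a minimal sketch of loading the list into pandas might look like the following. The column names are taken from the table header above; the variable names rows, columns, and dim_check are only illustrative, and only the first three rows are included to keep it short. The last line checks an observation about the sample (not something stated in the post): DIM appears to equal the number of days from Fdat to LameDate.

```python
import pandas as pd
from pandas import Timestamp

# First three rows of the sample data, copied from the list above
rows = [
    [52316, Timestamp('2021-11-18'), Timestamp('2022-03-04'), Timestamp('2022-03-04'), 106, 1, 1],
    [52316, Timestamp('2021-11-18'), Timestamp('2022-09-21'), Timestamp('2022-03-04'), 106, 1, 1],
    [52316, Timestamp('2021-11-18'), Timestamp('2022-09-21'), Timestamp('2022-09-21'), 307, 2, 2],
]

# Column names as they appear in the table header
columns = ['NID', 'Fdat', 'RxDate', 'LameDate', 'DIM', 'Lact_xLame', 'Life_xLame']
df = pd.DataFrame(rows, columns=columns)

# Observation from the sample: DIM matches days between Fdat and LameDate
dim_check = (df['LameDate'] - df['Fdat']).dt.days
print(dim_check.tolist())  # [106, 106, 307]
```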

I have prepared the following code to count two things: Lact_xLame, the number of cases of lameness (cases more than 7 days apart) within the lactation that precede the RxDate (the time of a mastitis event), and Life_xLame, the cumulative number of cases of lameness (again, cases more than 7 days apart) over the life of the animal that also precede the RxDate.

df['Lact_xLame'] = (df.groupby(['NID', 'RxDate', 'Fdat'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['RxDate'], df['Fdat']])
                    .cumsum()+1
                 )
df['Life_xLame'] = (df.groupby(['NID'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['RxDate'], df['NID']])
                    .cumsum()+1
                 )
df

The Lact_xLame is calculating correctly. The output of Life_xLame is unexpected.

The output that I am looking for is:

	NID	     Fdat	     RxDate	    LameDate	DIM	Lact_xLame	Life_xLame
0	52316	2021-11-18	2022-03-04	2022-03-04	106	   1	       1
1	52316	2021-11-18	2022-09-21	2022-03-04	106	   1	       1
2	52316	2021-11-18	2022-09-21	2022-09-21	307	   2	       2
3	52316	2021-11-18	2022-10-04	2022-03-04	106	   1	       1
4	52316	2021-11-18	2022-10-04	2022-09-21	307	   2	       2
5	52316	2021-11-18	2022-10-05	2022-03-04	106	   1	       1
6	52316	2021-11-18	2022-10-05	2022-09-21	307	   2	       2
7	52316	2021-11-18	2022-10-06	2022-03-04	106	   1	       1
8	52316	2021-11-18	2022-10-06	2022-09-21	307	   2	       2
9	35724	2018-08-22	2018-09-08	2018-08-26	4	   1	       1
10	35724	2018-08-22	2018-09-08	2018-09-08	17	   2	       2
11	35724	2018-08-22	2018-11-13	2018-08-26	4	   1	       1
12	35724	2018-08-22	2018-11-13	2018-09-08	17	   2	       2
13	35724	2018-08-22	2018-11-13	2018-10-05	44	   3	       3
14	10295	2005-01-19	2006-03-07	2006-03-03	408	   1	       1
15	10295	2008-04-30	2009-01-08	2008-06-12	43	   1	       2
16	10295	2008-04-30	2009-01-08	2008-08-28	120	   2	       3
17	10295	2008-04-30	2009-01-08	2008-12-01	215	   3	       4
18	10295	2008-04-30	2009-04-09	2008-06-12	43	   1	       2
19	10295	2008-04-30	2009-04-09	2008-08-28	120	   2	       3
20	10295	2008-04-30	2009-04-09	2008-12-01	215	   3	       4
21	10295	2008-04-30	2009-04-09	2009-02-09	285	   4	       5

For Life_xLame I am looking for a cumulative count that includes cases from a previous lactation (an earlier Fdat), as illustrated by NID = 10295. The output for NID == 10295 is correct. For the other NID examples, the count is not resetting to 1 for the first case of lameness preceding the RxDate (the time of the mastitis event). These other animals did not have a case of lameness in a preceding lactation (i.e. there is only one Fdat for 35724 and 52316).

My understanding is that the df.groupby(['NID']) statement focuses attention on rows with the same NID. The ['DIM'].diff().abs().gt(7) statement checks whether the difference in DIM between consecutive rows is greater than 7. The .groupby([df['RxDate'], df['NID']]).cumsum()+1 statement groups the records by RxDate and NID and counts them, returning the count as ['Life_xLame']. What confuses me is that when it moves to the next RxDate, it starts the count at 2 rather than 1.
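
One way to see where the 2 comes from is to run the two steps separately on a few rows of animal 52316 in their original order. This is a small sketch rather than a fix; the variable names new_case and life are illustrative. The diff() is taken down the whole NID group, so the first row of a new RxDate block is compared with the last row of the previous block (106 against 307), which flags it as a new case before the per-RxDate cumsum() even starts.

```python
import pandas as pd

# Animal 52316, first five rows in the original order (by RxDate, then LameDate)
df = pd.DataFrame({
    'NID': [52316] * 5,
    'RxDate': pd.to_datetime(['2022-03-04', '2022-09-21', '2022-09-21',
                              '2022-10-04', '2022-10-04']),
    'DIM': [106, 106, 307, 106, 307],
})

# Step 1: diff() runs across the whole NID group and ignores RxDate boundaries
new_case = df.groupby('NID')['DIM'].diff().abs().gt(7)
print(new_case.tolist())  # [False, False, True, True, True]

# Step 2: cumulative sum within each (RxDate, NID) block.
# Row 3 is already True (|106 - 307| > 7), so its block starts at 1 + 1 = 2.
life = new_case.groupby([df['RxDate'], df['NID']]).cumsum() + 1
print(life.tolist())  # [1, 1, 2, 2, 3]
```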

Answer 1

Score: 0


Finally figured it out!!

The problem was how the data was sorted relative to how the groupby statements were set up. I needed to include DIM in the sort.

# Sort so that rows for the same animal and lactation are ordered by DIM
df = df.sort_values(['NID', 'Fdat', 'DIM'])
# Cases within a lactation, counted separately for each RxDate
df['Lact_xLame'] = (df.groupby(['NID', 'Fdat', 'RxDate'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['NID'], df['Fdat'], df['RxDate']])
                    .cumsum()+1
                 )
# Cases over the animal's life: a new case starts when DIM changes by more than 7
df['Life_xLame'] = (df.groupby(['NID'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['NID']])
                    .cumsum()+1
                 )
df

Produced the output I was looking for.
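
As a quick check, here is a self-contained sketch that runs the same sort-then-count approach on a subset of the sample rows (animal 52316 and part of 10295) and restores the original row order with sort_index(). The last step is an assumption that goes beyond the answer above: since DIM restarts at each Fdat, two cases in different lactations could in principle fall within 7 DIM of each other, so differencing LameDate may be a more robust way to separate cases. The names s, out, s2, gap, and Life_byDate are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'NID': [52316] * 5 + [10295] * 4,
    'Fdat': pd.to_datetime(['2021-11-18'] * 5 + ['2005-01-19'] + ['2008-04-30'] * 3),
    'RxDate': pd.to_datetime(['2022-03-04', '2022-09-21', '2022-09-21',
                              '2022-10-04', '2022-10-04', '2006-03-07',
                              '2009-01-08', '2009-01-08', '2009-01-08']),
    'LameDate': pd.to_datetime(['2022-03-04', '2022-03-04', '2022-09-21',
                                '2022-03-04', '2022-09-21', '2006-03-03',
                                '2008-06-12', '2008-08-28', '2008-12-01']),
    'DIM': [106, 106, 307, 106, 307, 408, 43, 120, 215],
})

# Approach from the answer: sort first, then diff and count
s = df.sort_values(['NID', 'Fdat', 'DIM'])
keys = [s['NID'], s['Fdat'], s['RxDate']]
s['Lact_xLame'] = (s.groupby(['NID', 'Fdat', 'RxDate'])['DIM']
                   .diff().abs().gt(7).groupby(keys).cumsum() + 1)
s['Life_xLame'] = (s.groupby('NID')['DIM']
                   .diff().abs().gt(7).groupby(s['NID']).cumsum() + 1)

out = s.sort_index()  # back to the original row order
print(out['Lact_xLame'].tolist())  # [1, 1, 2, 1, 2, 1, 1, 2, 3]
print(out['Life_xLame'].tolist())  # [1, 1, 2, 1, 2, 1, 2, 3, 4]

# Assumed variant: separate cases by calendar gap between LameDate values
s2 = df.sort_values(['NID', 'LameDate'])
gap = s2.groupby('NID')['LameDate'].diff().dt.days
out['Life_byDate'] = gap.gt(7).groupby(s2['NID']).cumsum() + 1  # aligns on index
print(out['Life_byDate'].tolist())  # [1, 1, 2, 1, 2, 1, 2, 3, 4]
```

On this subset both versions agree with the expected Life_xLame column; they would only differ if cases from different lactations happened to have similar DIM values.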
