根据分组和时间约束条件计数的方法

huangapple go评论77阅读模式
英文:

How to conditionally count based on grouping and time constraints

问题

我明白你的问题。你想要在Life_xLame中获得一个累积计数,包括前一泌乳期的疾病案例。你已经正确地使用了df.groupby(['NID'])以关注相同NID的行,但在计算Life_xLame时,你需要在不同的RxDate之间重置计数。为了实现这一点,你可以进一步分组,以确保在不同的RxDate之间重新开始计数。

以下是修改后的代码示例:

df['Lact_xLame'] = (df.groupby(['NID', 'RxDate', 'Fdat'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['RxDate'], df['Fdat'], df['NID']])
                    .cumsum() + 1
                 )

df['Life_xLame'] = (df.groupby(['NID'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['RxDate'], df['NID'], df.groupby(['NID']).cumcount()])
                    .cumsum() + 1
                 )

df

这个修改后的代码在Life_xLame的计算中使用了df.groupby(['NID']).cumcount(),以确保在不同的RxDate之间重置计数。这应该会得到你期望的输出。希望这对你有所帮助。

英文:

I am counting the number of cases of lameness in cattle. I want to determine the number of cases in the current lactation and over the animals lifetime.

The following sample data is provided as an example.

[[52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-03-04 00:00:00'),
  Timestamp('2022-03-04 00:00:00'),
  106,
  1,
  1],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-09-21 00:00:00'),
  Timestamp('2022-03-04 00:00:00'),
  106,
  1,
  1],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-09-21 00:00:00'),
  Timestamp('2022-09-21 00:00:00'),
  307,
  2,
  2],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-10-04 00:00:00'),
  Timestamp('2022-03-04 00:00:00'),
  106,
  1,
  2],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-10-04 00:00:00'),
  Timestamp('2022-09-21 00:00:00'),
  307,
  2,
  3],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-10-05 00:00:00'),
  Timestamp('2022-03-04 00:00:00'),
  106,
  1,
  2],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-10-05 00:00:00'),
  Timestamp('2022-09-21 00:00:00'),
  307,
  2,
  3],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-10-06 00:00:00'),
  Timestamp('2022-03-04 00:00:00'),
  106,
  1,
  2],
 [52316,
  Timestamp('2021-11-18 00:00:00'),
  Timestamp('2022-10-06 00:00:00'),
  Timestamp('2022-09-21 00:00:00'),
  307,
  2,
  3],
 [35724,
  Timestamp('2018-08-22 00:00:00'),
  Timestamp('2018-09-08 00:00:00'),
  Timestamp('2018-08-26 00:00:00'),
  4,
  1,
  1],
 [35724,
  Timestamp('2018-08-22 00:00:00'),
  Timestamp('2018-09-08 00:00:00'),
  Timestamp('2018-09-08 00:00:00'),
  17,
  2,
  2],
 [35724,
  Timestamp('2018-08-22 00:00:00'),
  Timestamp('2018-11-13 00:00:00'),
  Timestamp('2018-08-26 00:00:00'),
  4,
  1,
  2],
 [35724,
  Timestamp('2018-08-22 00:00:00'),
  Timestamp('2018-11-13 00:00:00'),
  Timestamp('2018-09-08 00:00:00'),
  17,
  2,
  3],
 [35724,
  Timestamp('2018-08-22 00:00:00'),
  Timestamp('2018-11-13 00:00:00'),
  Timestamp('2018-10-05 00:00:00'),
  44,
  3,
  4],
 [10295,
  Timestamp('2005-01-19 00:00:00'),
  Timestamp('2006-03-07 00:00:00'),
  Timestamp('2006-03-03 00:00:00'),
  408,
  1,
  1],
 [10295,
  Timestamp('2008-04-30 00:00:00'),
  Timestamp('2009-01-08 00:00:00'),
  Timestamp('2008-06-12 00:00:00'),
  43,
  1,
  2],
 [10295,
  Timestamp('2008-04-30 00:00:00'),
  Timestamp('2009-01-08 00:00:00'),
  Timestamp('2008-08-28 00:00:00'),
  120,
  2,
  3],
 [10295,
  Timestamp('2008-04-30 00:00:00'),
  Timestamp('2009-01-08 00:00:00'),
  Timestamp('2008-12-01 00:00:00'),
  215,
  3,
  4],
 [10295,
  Timestamp('2008-04-30 00:00:00'),
  Timestamp('2009-04-09 00:00:00'),
  Timestamp('2008-06-12 00:00:00'),
  43,
  1,
  2],
 [10295,
  Timestamp('2008-04-30 00:00:00'),
  Timestamp('2009-04-09 00:00:00'),
  Timestamp('2008-08-28 00:00:00'),
  120,
  2,
  3],
 [10295,
  Timestamp('2008-04-30 00:00:00'),
  Timestamp('2009-04-09 00:00:00'),
  Timestamp('2008-12-01 00:00:00'),
  215,
  3,
  4],
 [10295,
  Timestamp('2008-04-30 00:00:00'),
  Timestamp('2009-04-09 00:00:00'),
  Timestamp('2009-02-09 00:00:00'),
  285,
  4,
  5]]

This produces the following dataframe

	NID	     Fdat	     RxDate	    LameDate	DIM	Lact_xLame	Life_xLame
0	52316	2021-11-18	2022-03-04	2022-03-04	106	   1	       1
1	52316	2021-11-18	2022-09-21	2022-03-04	106	   1	       1
2	52316	2021-11-18	2022-09-21	2022-09-21	307	   2	       2
3	52316	2021-11-18	2022-10-04	2022-03-04	106	   1	       2
4	52316	2021-11-18	2022-10-04	2022-09-21	307	   2	       3
5	52316	2021-11-18	2022-10-05	2022-03-04	106	   1	       2
6	52316	2021-11-18	2022-10-05	2022-09-21	307	   2	       3
7	52316	2021-11-18	2022-10-06	2022-03-04	106	   1	       2
8	52316	2021-11-18	2022-10-06	2022-09-21	307	   2	       3
9	35724	2018-08-22	2018-09-08	2018-08-26	4	   1	       1
10	35724	2018-08-22	2018-09-08	2018-09-08	17	   2           2
11	35724	2018-08-22	2018-11-13	2018-08-26	4	   1	       2
12	35724	2018-08-22	2018-11-13	2018-09-08	17	   2	       3
13	35724	2018-08-22	2018-11-13	2018-10-05	44	   3	       4
14	10295	2005-01-19	2006-03-07	2006-03-03	408	   1	       1
15	10295	2008-04-30	2009-01-08	2008-06-12	43	   1	       2
16	10295	2008-04-30	2009-01-08	2008-08-28	120	   2	       3
17	10295	2008-04-30	2009-01-08	2008-12-01	215	   3	       4
18	10295	2008-04-30	2009-04-09	2008-06-12	43	   1	       2
19	10295	2008-04-30	2009-04-09	2008-08-28	120	   2	       3
20	10295	2008-04-30	2009-04-09	2008-12-01	215	   3	       4
21	10295	2008-04-30	2009-04-09	2009-02-09	285	   4	       5

I have prepared the following code to count the number of Lact_xLame (cases of lameness that are more than 7 days apart within the lactation that preceed the RxDate (Time of a mastitis event) and Life_xLame that counts the cumulative number of cases (cases of lameness that are more than 7 days apart) of lameness over the life of the animal that also preced the RxDate (time of mastitis event)

df['Lact_xLame'] = (df.groupby(['NID', 'RxDate', 'Fdat'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['RxDate'], df['Fdat']])
                    .cumsum()+1
                 )

df['Life_xLame'] = (df.groupby(['NID'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['RxDate'], df['NID']])
                    .cumsum()+1
                 )

df

The Lact_xLame is calculating correctly. The output of Life_xLame is unexpected.

The output that I am looking for is

	NID	     Fdat	     RxDate	    LameDate	DIM	Lact_xLame	Life_xLame
0	52316	2021-11-18	2022-03-04	2022-03-04	106	   1	       1
1	52316	2021-11-18	2022-09-21	2022-03-04	106	   1	       1
2	52316	2021-11-18	2022-09-21	2022-09-21	307	   2	       2
3	52316	2021-11-18	2022-10-04	2022-03-04	106	   1	       1
4	52316	2021-11-18	2022-10-04	2022-09-21	307	   2	       2
5	52316	2021-11-18	2022-10-05	2022-03-04	106	   1	       1
6	52316	2021-11-18	2022-10-05	2022-09-21	307	   2	       2
7	52316	2021-11-18	2022-10-06	2022-03-04	106	   1	       1
8	52316	2021-11-18	2022-10-06	2022-09-21	307	   2	       2
9	35724	2018-08-22	2018-09-08	2018-08-26	4	   1	       1
10	35724	2018-08-22	2018-09-08	2018-09-08	17	   2           2
11	35724	2018-08-22	2018-11-13	2018-08-26	4	   1	       1
12	35724	2018-08-22	2018-11-13	2018-09-08	17	   2	       2
13	35724	2018-08-22	2018-11-13	2018-10-05	44	   3	       3
14	10295	2005-01-19	2006-03-07	2006-03-03	408	   1	       1
15	**10295	2008-04-30**	2009-01-08	2008-06-12	43	   1	       2
16	10295	2008-04-30	2009-01-08	2008-08-28	120	   2	       3
17	10295	2008-04-30	2009-01-08	2008-12-01	215	   3	       4
18	10295	2008-04-30	2009-04-09	2008-06-12	43	   1	       2
19	10295	2008-04-30	2009-04-09	2008-08-28	120	   2	       3
20	10295	2008-04-30	2009-04-09	2008-12-01	215	   3	       4
21	10295	2008-04-30	2009-04-09	2009-02-09	285	   4	       5

For Life_xLame I am looking for a cumulative count that includes cases from a previous lactation Fdat as illustrated by NID = 10295. The output for NID == 10295 is correct. For the other NID examples it is not resetting back to 1 for the first case of lameness preceding RxDate (Time of mastitis event. These other NID numbers did not have a case of lameness in the preceding lactation (ie there is only one Fdat for 35725 and 52316).

My understanding is that the df.groupby(['NID'] statement focuses attention on rows with the same NID. The ['DIM'].diff().abs().gt(7) statement checks to make sure that the difference in DIM between rows is greater than 7. The .groupby([df['RxDate'], df['NID']]).cumsum()+1 statement groups the records by RxDate and NID and counts them returning the count number as ['Life_xLame']. What confuses me is that when it is going to the next RxDate it is starting the count at 2 rather than 1.

答案1

得分: 0

终于弄清楚了!!

问题与数据排序方式与groupby语句设置有关。需要在排序中包括DIM。

df = df.sort_values(['NID', 'Fdat', 'DIM'])

df['Lact_xLame'] = (df.groupby(['NID', 'Fdat', 'RxDate'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['NID'], df['Fdat'], df['RxDate']])
                    .cumsum()+1
                 )

df['Life_xLame'] = (df.groupby(['NID'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['NID']])
                    .cumsum()+1
                 )

df

产生了我期望的输出。

英文:

Finally figured it out!!

The problem was associated with the way the data was sorted relative to how the groupby statements were set up. Needed to include DIM in the sort.

df = df.sort_values(['NID', 'Fdat', 'DIM'])


df['Lact_xLame'] = (df.groupby(['NID', 'Fdat', 'RxDate'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['NID'], df['Fdat'], df['RxDate']])
                    .cumsum()+1
                 )

df['Life_xLame'] = (df.groupby(['NID'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['NID']])
                    .cumsum()+1
                 )

df

Produced the output I was looking for.

huangapple
  • 本文由 发表于 2023年3月7日 08:57:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/75657162.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定