英文:
How to conditionally count based on grouping and time constraints
问题
我明白你的问题。你想要在Life_xLame
中获得一个累积计数,包括前一泌乳期的疾病案例。你已经正确地使用了df.groupby(['NID'])
以关注相同NID的行,但在计算Life_xLame
时,你需要在不同的RxDate
之间重置计数。为了实现这一点,你可以进一步分组,以确保在不同的RxDate
之间重新开始计数。
以下是修改后的代码示例:
df['Lact_xLame'] = (df.groupby(['NID', 'RxDate', 'Fdat'])
['DIM'].diff().abs().gt(7)
.groupby([df['RxDate'], df['Fdat'], df['NID']])
.cumsum() + 1
)
df['Life_xLame'] = (df.groupby(['NID'])
['DIM'].diff().abs().gt(7)
.groupby([df['RxDate'], df['NID'], df.groupby(['NID']).cumcount()])
.cumsum() + 1
)
df
这个修改后的代码在Life_xLame
的计算中使用了df.groupby(['NID']).cumcount()
,以确保在不同的RxDate
之间重置计数。这应该会得到你期望的输出。希望这对你有所帮助。
英文:
I am counting the number of cases of lameness in cattle. I want to determine the number of cases in the current lactation and over the animals lifetime.
The following sample data is provided as an example.
[[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-03-04 00:00:00'),
Timestamp('2022-03-04 00:00:00'),
106,
1,
1],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-09-21 00:00:00'),
Timestamp('2022-03-04 00:00:00'),
106,
1,
1],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-09-21 00:00:00'),
Timestamp('2022-09-21 00:00:00'),
307,
2,
2],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-10-04 00:00:00'),
Timestamp('2022-03-04 00:00:00'),
106,
1,
2],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-10-04 00:00:00'),
Timestamp('2022-09-21 00:00:00'),
307,
2,
3],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-10-05 00:00:00'),
Timestamp('2022-03-04 00:00:00'),
106,
1,
2],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-10-05 00:00:00'),
Timestamp('2022-09-21 00:00:00'),
307,
2,
3],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-10-06 00:00:00'),
Timestamp('2022-03-04 00:00:00'),
106,
1,
2],
[52316,
Timestamp('2021-11-18 00:00:00'),
Timestamp('2022-10-06 00:00:00'),
Timestamp('2022-09-21 00:00:00'),
307,
2,
3],
[35724,
Timestamp('2018-08-22 00:00:00'),
Timestamp('2018-09-08 00:00:00'),
Timestamp('2018-08-26 00:00:00'),
4,
1,
1],
[35724,
Timestamp('2018-08-22 00:00:00'),
Timestamp('2018-09-08 00:00:00'),
Timestamp('2018-09-08 00:00:00'),
17,
2,
2],
[35724,
Timestamp('2018-08-22 00:00:00'),
Timestamp('2018-11-13 00:00:00'),
Timestamp('2018-08-26 00:00:00'),
4,
1,
2],
[35724,
Timestamp('2018-08-22 00:00:00'),
Timestamp('2018-11-13 00:00:00'),
Timestamp('2018-09-08 00:00:00'),
17,
2,
3],
[35724,
Timestamp('2018-08-22 00:00:00'),
Timestamp('2018-11-13 00:00:00'),
Timestamp('2018-10-05 00:00:00'),
44,
3,
4],
[10295,
Timestamp('2005-01-19 00:00:00'),
Timestamp('2006-03-07 00:00:00'),
Timestamp('2006-03-03 00:00:00'),
408,
1,
1],
[10295,
Timestamp('2008-04-30 00:00:00'),
Timestamp('2009-01-08 00:00:00'),
Timestamp('2008-06-12 00:00:00'),
43,
1,
2],
[10295,
Timestamp('2008-04-30 00:00:00'),
Timestamp('2009-01-08 00:00:00'),
Timestamp('2008-08-28 00:00:00'),
120,
2,
3],
[10295,
Timestamp('2008-04-30 00:00:00'),
Timestamp('2009-01-08 00:00:00'),
Timestamp('2008-12-01 00:00:00'),
215,
3,
4],
[10295,
Timestamp('2008-04-30 00:00:00'),
Timestamp('2009-04-09 00:00:00'),
Timestamp('2008-06-12 00:00:00'),
43,
1,
2],
[10295,
Timestamp('2008-04-30 00:00:00'),
Timestamp('2009-04-09 00:00:00'),
Timestamp('2008-08-28 00:00:00'),
120,
2,
3],
[10295,
Timestamp('2008-04-30 00:00:00'),
Timestamp('2009-04-09 00:00:00'),
Timestamp('2008-12-01 00:00:00'),
215,
3,
4],
[10295,
Timestamp('2008-04-30 00:00:00'),
Timestamp('2009-04-09 00:00:00'),
Timestamp('2009-02-09 00:00:00'),
285,
4,
5]]
This produces the following dataframe
NID Fdat RxDate LameDate DIM Lact_xLame Life_xLame
0 52316 2021-11-18 2022-03-04 2022-03-04 106 1 1
1 52316 2021-11-18 2022-09-21 2022-03-04 106 1 1
2 52316 2021-11-18 2022-09-21 2022-09-21 307 2 2
3 52316 2021-11-18 2022-10-04 2022-03-04 106 1 2
4 52316 2021-11-18 2022-10-04 2022-09-21 307 2 3
5 52316 2021-11-18 2022-10-05 2022-03-04 106 1 2
6 52316 2021-11-18 2022-10-05 2022-09-21 307 2 3
7 52316 2021-11-18 2022-10-06 2022-03-04 106 1 2
8 52316 2021-11-18 2022-10-06 2022-09-21 307 2 3
9 35724 2018-08-22 2018-09-08 2018-08-26 4 1 1
10 35724 2018-08-22 2018-09-08 2018-09-08 17 2 2
11 35724 2018-08-22 2018-11-13 2018-08-26 4 1 2
12 35724 2018-08-22 2018-11-13 2018-09-08 17 2 3
13 35724 2018-08-22 2018-11-13 2018-10-05 44 3 4
14 10295 2005-01-19 2006-03-07 2006-03-03 408 1 1
15 10295 2008-04-30 2009-01-08 2008-06-12 43 1 2
16 10295 2008-04-30 2009-01-08 2008-08-28 120 2 3
17 10295 2008-04-30 2009-01-08 2008-12-01 215 3 4
18 10295 2008-04-30 2009-04-09 2008-06-12 43 1 2
19 10295 2008-04-30 2009-04-09 2008-08-28 120 2 3
20 10295 2008-04-30 2009-04-09 2008-12-01 215 3 4
21 10295 2008-04-30 2009-04-09 2009-02-09 285 4 5
I have prepared the following code to count the number of Lact_xLame
(cases of lameness that are more than 7 days apart within the lactation that preceed the RxDate
(Time of a mastitis event) and Life_xLame
that counts the cumulative number of cases (cases of lameness that are more than 7 days apart) of lameness over the life of the animal that also preced the RxDate
(time of mastitis event)
df['Lact_xLame'] = (df.groupby(['NID', 'RxDate', 'Fdat'])
['DIM'].diff().abs().gt(7)
.groupby([df['RxDate'], df['Fdat']])
.cumsum()+1
)
df['Life_xLame'] = (df.groupby(['NID'])
['DIM'].diff().abs().gt(7)
.groupby([df['RxDate'], df['NID']])
.cumsum()+1
)
df
The Lact_xLame
is calculating correctly. The output of Life_xLame
is unexpected.
The output that I am looking for is
NID Fdat RxDate LameDate DIM Lact_xLame Life_xLame
0 52316 2021-11-18 2022-03-04 2022-03-04 106 1 1
1 52316 2021-11-18 2022-09-21 2022-03-04 106 1 1
2 52316 2021-11-18 2022-09-21 2022-09-21 307 2 2
3 52316 2021-11-18 2022-10-04 2022-03-04 106 1 1
4 52316 2021-11-18 2022-10-04 2022-09-21 307 2 2
5 52316 2021-11-18 2022-10-05 2022-03-04 106 1 1
6 52316 2021-11-18 2022-10-05 2022-09-21 307 2 2
7 52316 2021-11-18 2022-10-06 2022-03-04 106 1 1
8 52316 2021-11-18 2022-10-06 2022-09-21 307 2 2
9 35724 2018-08-22 2018-09-08 2018-08-26 4 1 1
10 35724 2018-08-22 2018-09-08 2018-09-08 17 2 2
11 35724 2018-08-22 2018-11-13 2018-08-26 4 1 1
12 35724 2018-08-22 2018-11-13 2018-09-08 17 2 2
13 35724 2018-08-22 2018-11-13 2018-10-05 44 3 3
14 10295 2005-01-19 2006-03-07 2006-03-03 408 1 1
15 **10295 2008-04-30** 2009-01-08 2008-06-12 43 1 2
16 10295 2008-04-30 2009-01-08 2008-08-28 120 2 3
17 10295 2008-04-30 2009-01-08 2008-12-01 215 3 4
18 10295 2008-04-30 2009-04-09 2008-06-12 43 1 2
19 10295 2008-04-30 2009-04-09 2008-08-28 120 2 3
20 10295 2008-04-30 2009-04-09 2008-12-01 215 3 4
21 10295 2008-04-30 2009-04-09 2009-02-09 285 4 5
For Life_xLame
I am looking for a cumulative count that includes cases from a previous lactation Fdat
as illustrated by NID = 10295
. The output for NID == 10295
is correct. For the other NID examples it is not resetting back to 1 for the first case of lameness preceding RxDate
(Time of mastitis event. These other NID numbers did not have a case of lameness in the preceding lactation (ie there is only one Fdat
for 35725 and 52316).
My understanding is that the df.groupby(['NID']
statement focuses attention on rows with the same NID. The ['DIM'].diff().abs().gt(7)
statement checks to make sure that the difference in DIM between rows is greater than 7. The .groupby([df['RxDate']
, df['NID']]).cumsum()+1
statement groups the records by RxDate and NID and counts them returning the count number as ['Life_xLame']
. What confuses me is that when it is going to the next RxDate
it is starting the count at 2 rather than 1.
答案1
得分: 0
终于弄清楚了!!
问题与数据排序方式与groupby语句设置有关。需要在排序中包括DIM。
df = df.sort_values(['NID', 'Fdat', 'DIM'])
df['Lact_xLame'] = (df.groupby(['NID', 'Fdat', 'RxDate'])
['DIM'].diff().abs().gt(7)
.groupby([df['NID'], df['Fdat'], df['RxDate']])
.cumsum()+1
)
df['Life_xLame'] = (df.groupby(['NID'])
['DIM'].diff().abs().gt(7)
.groupby([df['NID']])
.cumsum()+1
)
df
产生了我期望的输出。
英文:
Finally figured it out!!
The problem was associated with the way the data was sorted relative to how the groupby statements were set up. Needed to include DIM in the sort.
df = df.sort_values(['NID', 'Fdat', 'DIM'])
df['Lact_xLame'] = (df.groupby(['NID', 'Fdat', 'RxDate'])
['DIM'].diff().abs().gt(7)
.groupby([df['NID'], df['Fdat'], df['RxDate']])
.cumsum()+1
)
df['Life_xLame'] = (df.groupby(['NID'])
['DIM'].diff().abs().gt(7)
.groupby([df['NID']])
.cumsum()+1
)
df
Produced the output I was looking for.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论