R: Identifying distinct rows on dates within intervals and scoring rows based on occurrence
Question
Dear geniuses of Stack,
I need help writing some rather advanced code. I have a data frame with three columns: a unique ID column (IDcol), a result column (Result), and a date for each result (Date).
There are three types of results: N1, N2 and N4. These can be measured over time for each ID. My data could look something like this:
IDcol Result Date
1 ID1 N1 2018-06-11
2 ID1 N2 2018-06-11
3 ID1 N4 2018-01-01
4 ID1 N1 2012-06-11
5 ID1 N2 2012-06-11
6 ID2 N1 2016-04-23
7 ID2 N2 2016-04-23
8 ID2 N4 2015-01-05
9 ID3 N1 2015-11-24
10 ID3 N2 2015-11-24
11 ID3 N4 2013-05-05
12 ID4 N1 2015-01-01
13 ID4 N2 2015-01-01
14 ID4 N1 2014-05-06
15 ID4 N2 2014-05-06
16 ID5 N1 2016-04-04
17 ID5 N2 2014-04-04
18 ID5 N4 2012-04-04
As you can see, for ID1, N1 and N2 are measured together on two dates, and N4 is measured in between. The code needs to "scan" for the earliest date on which both N1 and N2 are measured, and the scoring system could give such an occurrence a score of 2. However, if N4 occurs within the year before an N1 and N2 measurement, the code should detect this and prioritize it with a higher score of 3. So for ID1, the desired output would look like this:
IDcol Result Date score
1 ID1 N1 2018-06-11 3
2 ID1 N2 2018-06-11 3
3 ID1 N4 2018-01-01 3
This means that even though N1 and N2 were also measured earlier, those rows are dropped because N4 was measured within one year before the later N1 and N2 measurements. ID2 is another example where all rows should be kept. For ID3, N4 is measured but falls outside the minus-one-year interval, so that row should be dropped. For ID4, the earliest measurements should be prioritized over the latest. Lastly, for ID5, the dates on which the results are measured differ too much, so none of its rows are kept. Effectively, the desired output should look like this:
IDcol Result Date score
1 ID1 N1 2018-06-11 3
2 ID1 N2 2018-06-11 3
3 ID1 N4 2018-01-01 3
4 ID2 N1 2016-04-23 3
5 ID2 N2 2016-04-23 3
6 ID2 N4 2015-01-05 3
7 ID3 N1 2015-11-24 2
8 ID3 N2 2015-11-24 2
9 ID4 N1 2014-05-06 2
10 ID4 N2 2014-05-06 2
I have tried the following with dplyr and lubridate:
library(dplyr)
library(lubridate)

IDcol <- c("ID1", "ID1", "ID1",
           "ID1", "ID1",
           "ID2", "ID2", "ID2",
           "ID3", "ID3", "ID3",
           "ID4", "ID4",
           "ID4", "ID4",
           "ID5", "ID5", "ID5")
Result <- c("N1", "N2", "N4",
            "N1", "N2",
            "N1", "N2", "N4",
            "N1", "N2", "N4",
            "N1", "N2",
            "N1", "N2",
            "N1", "N2", "N4")
Date <- c("2018-06-11", "2018-06-11", "2018-01-01",
          "2012-06-11", "2012-06-11",
          "2016-04-23", "2016-04-23", "2015-01-05",
          "2015-11-24", "2015-11-24", "2013-05-05",
          "2015-01-01", "2015-01-01",
          "2014-05-06", "2014-05-06",
          "2016-04-04", "2014-04-04", "2012-04-04")
df <- as.data.frame(cbind(IDcol, Result, Date))
df$Date <- ymd(df$Date)
df1 <- df %>%
  group_by(IDcol, Date) %>%
  mutate(score = case_when(
    length(Result) > 3 & Date[Result == "N4"] %within% interval(Date[Result == "N1"] - duration(1, units = years)) ~ 3,
    "N1" %in% Result & "N2" %in% Result ~ 2,
    TRUE ~ 0
  )) %>%
  filter(score > 0) %>%
  group_by(IDcol) %>%
  arrange(desc(score), Date, .by_group = TRUE) %>%
  filter(Date == first(Date)) %>%
  ungroup()
Any help is greatly appreciated.
Answer 1
Score: 1
I believe this gives you your desired output. With problems like this I like to simplify the logic/code a bit by separating some information into a different df first and using filter.
# Earliest N1/N2 date per ID, shifted back one year to get the N4 cutoff
cutoff.data <- df %>%
  filter(Result != 'N4') %>%
  group_by(IDcol) %>%
  summarise(earliest = min(Date)) %>%
  mutate(n4.cutoff = earliest - dyears(1)) %>%
  select(-earliest)

df %>%
  left_join(cutoff.data, by = "IDcol") %>%
  group_by(IDcol) %>%
  mutate(n4.row = ifelse(Result == 'N4', 1, 0),
         n4.check = sum(n4.row),
         # any() keeps the test length-one per ID, even for an ID without an N4 row;
         # as.Date() puts the date-time cutoff on the same class as Date
         score = ifelse(n4.check == 1 &
                          any(Date[Result == 'N4'] >= as.Date(n4.cutoff)), 3, 2))
# A tibble: 18 x 7
# Groups: IDcol [5]
IDcol Result Date n4.cutoff n4.row n4.check score
<fct> <fct> <date> <dttm> <dbl> <dbl> <dbl>
1 ID1 N1 2018-06-11 2011-06-11 18:00:00 0 1 3
2 ID1 N2 2018-06-11 2011-06-11 18:00:00 0 1 3
3 ID1 N4 2018-01-01 2011-06-11 18:00:00 1 1 3
4 ID1 N1 2012-06-11 2011-06-11 18:00:00 0 1 3
5 ID1 N2 2012-06-11 2011-06-11 18:00:00 0 1 3
6 ID2 N1 2016-04-23 2015-04-23 18:00:00 0 1 2
7 ID2 N2 2016-04-23 2015-04-23 18:00:00 0 1 2
8 ID2 N4 2015-01-05 2015-04-23 18:00:00 1 1 2
9 ID3 N1 2015-11-24 2014-11-23 18:00:00 0 1 2
10 ID3 N2 2015-11-24 2014-11-23 18:00:00 0 1 2
11 ID3 N4 2013-05-05 2014-11-23 18:00:00 1 1 2
12 ID4 N1 2015-01-01 2013-05-05 18:00:00 0 0 2
13 ID4 N2 2015-01-01 2013-05-05 18:00:00 0 0 2
14 ID4 N1 2014-05-06 2013-05-05 18:00:00 0 0 2
15 ID4 N2 2014-05-06 2013-05-05 18:00:00 0 0 2
16 ID5 N1 2016-04-04 2013-04-03 18:00:00 0 1 2
17 ID5 N2 2014-04-04 2013-04-03 18:00:00 0 1 2
18 ID5 N4 2012-04-04 2013-04-03 18:00:00 1 1 2
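The `18:00:00` times in `n4.cutoff` arise because `dyears(1)` is a fixed-length duration rather than a calendar year (the values above are consistent with 365.25 days), so subtracting it from a `Date` produces a date-time. A small sketch, assuming a calendar year is what is wanted, of how a period keeps the cutoff a plain `Date`; the variable names here are only illustrative and not part of the code above:

```r
library(lubridate)

# Earliest N1/N2 date for ID1 in the question's data
earliest <- ymd("2012-06-11")

# Duration: a fixed number of seconds, so the result is a date-time
cutoff_duration <- earliest - dyears(1)

# Period: one calendar year back, so the result stays a Date
cutoff_period <- earliest %m-% years(1)

class(cutoff_duration)
class(cutoff_period)
cutoff_period
```

With the period version, the cutoff can be compared directly against the `Date` column without any class conversion.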
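The rule described in the question can also be read as a per-date check: score every date on which an ID has both N1 and N2, give it 3 when an N4 falls in the window before that date, then keep the best-scoring and earliest date per ID. The following is only a sketch under that reading, not the method used above. `score_rows()`, `window_years` and the intermediate column names are hypothetical, and it assumes the N4 must fall on or before the N1/N2 date.

```r
library(dplyr)
library(lubridate)

# Data from the question
df <- data.frame(
  IDcol  = rep(c("ID1", "ID2", "ID3", "ID4", "ID5"), times = c(5, 3, 3, 4, 3)),
  Result = c("N1", "N2", "N4", "N1", "N2",
             "N1", "N2", "N4",
             "N1", "N2", "N4",
             "N1", "N2", "N1", "N2",
             "N1", "N2", "N4"),
  Date   = ymd(c("2018-06-11", "2018-06-11", "2018-01-01", "2012-06-11", "2012-06-11",
                 "2016-04-23", "2016-04-23", "2015-01-05",
                 "2015-11-24", "2015-11-24", "2013-05-05",
                 "2015-01-01", "2015-01-01", "2014-05-06", "2014-05-06",
                 "2016-04-04", "2014-04-04", "2012-04-04"))
)

# score_rows() and window_years are hypothetical names, not from the original post
score_rows <- function(data, window_years = 1) {
  # Dates on which an ID has both N1 and N2
  pairs <- data %>%
    filter(Result %in% c("N1", "N2")) %>%
    group_by(IDcol, Date) %>%
    summarise(n_types = n_distinct(Result), .groups = "drop") %>%
    filter(n_types == 2) %>%
    select(IDcol, pair_date = Date)

  n4 <- data %>%
    filter(Result == "N4") %>%
    select(IDcol, n4_date = Date)

  # Score each N1/N2 date: 3 if an N4 lies in [pair_date - window, pair_date], else 2
  scored <- pairs %>%
    left_join(n4, by = "IDcol") %>%
    mutate(window_start = pair_date %m-% years(window_years),
           in_window = !is.na(n4_date) &
             n4_date >= window_start & n4_date <= pair_date) %>%
    group_by(IDcol, pair_date, window_start) %>%
    summarise(score = if (any(in_window)) 3 else 2, .groups = "drop")

  # Per ID: highest score first, then the earliest date
  best <- scored %>%
    arrange(IDcol, desc(score), pair_date) %>%
    distinct(IDcol, .keep_all = TRUE)

  # Keep the chosen N1/N2 rows, plus N4 rows inside the window when the score is 3
  data %>%
    inner_join(best, by = "IDcol") %>%
    filter((Result %in% c("N1", "N2") & Date == pair_date) |
             (Result == "N4" & score == 3 &
                Date >= window_start & Date <= pair_date)) %>%
    select(IDcol, Result, Date, score)
}

score_rows(df)                    # strict one-year window
score_rows(df, window_years = 2)  # wider window
```

With a strict one-year window, ID2's N4 (2015-01-05) lies more than a year before 2016-04-23, so this sketch gives ID2 a score of 2 and drops its N4 row, which differs from the desired output in the question. On this sample data, `window_years = 2` returns the same ten rows and scores as the desired output, since ID3's N4 (2013-05-05) is still outside that window.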