R: Identifying distinct rows on dates within intervals and scoring rows based on occurrence

Question

Dear geniuses of Stack

I need help writing some fairly advanced code. I have a data frame with three columns: an ID column (IDcol), a result column (Result), and a date for each result (Date).
There are three types of results: N1, N2 and N4. Each can be measured repeatedly over time for each ID. My data could look something like this:

   IDcol Result       Date
1    ID1     N1 2018-06-11
2    ID1     N2 2018-06-11
3    ID1     N4 2018-01-01
4    ID1     N1 2012-06-11
5    ID1     N2 2012-06-11
6    ID2     N1 2016-04-23
7    ID2     N2 2016-04-23
8    ID2     N4 2015-01-05
9    ID3     N1 2015-11-24
10   ID3     N2 2015-11-24
11   ID3     N4 2013-05-05
12   ID4     N1 2015-01-01
13   ID4     N2 2015-01-01
14   ID4     N1 2014-05-06
15   ID4     N2 2014-05-06
16   ID5     N1 2016-04-04
17   ID5     N2 2014-04-04
18   ID5     N4 2012-04-04

As you can see, for ID1, N1 and N2 are measured together on two dates, and N4 is measured in between. The code needs to "scan" for the earliest date on which N1 and N2 are both measured. The scoring system could give such an occurrence a score of 2. However, if N4 occurs within the year before the N1 and N2 measurements, the code should "scan" for this and prioritize it with a higher score. So for ID1, the desired output would look like this:

  IDcol Result       Date score
1   ID1     N1 2018-06-11     3
2   ID1     N2 2018-06-11     3
3   ID1     N4 2018-01-01     3

That is, even though earlier N1 and N2 results were measured, they are dropped because N4 was also measured within one year of the later N1 and N2 measurements. ID2 is an example where the code should keep all of the rows. For ID3, N4 is measured but falls outside the minus-one-year interval, so that row should be dropped. For ID4, the earliest measurements should be prioritized over the latest. Lastly, for ID5, there are too many discrepancies between the dates on which the results were measured, so it is dropped entirely. Effectively, the desired output should look like this:


   IDcol Result       Date score
1    ID1     N1 2018-06-11     3
2    ID1     N2 2018-06-11     3
3    ID1     N4 2018-01-01     3
4    ID2     N1 2016-04-23     3
5    ID2     N2 2016-04-23     3
6    ID2     N4 2015-01-05     3
7    ID3     N1 2015-11-24     2
8    ID3     N2 2015-11-24     2
9    ID4     N1 2014-05-06     2
10   ID4     N2 2014-05-06     2

I have tried the following with dplyr:

library(dplyr)
library(lubridate)

IDcol <- c("ID1", "ID1", "ID1",
           "ID1", "ID1",
           "ID2", "ID2", "ID2",
           "ID3", "ID3", "ID3",
           "ID4", "ID4",
           "ID4", "ID4",
           "ID5", "ID5", "ID5")

Result <- c("N1", "N2", "N4",
            "N1", "N2",
            "N1", "N2", "N4",
            "N1", "N2", "N4",
            "N1", "N2",
            "N1", "N2",
            "N1", "N2", "N4")
Date <- c("2018-06-11", "2018-06-11", "2018-01-01",
          "2012-06-11", "2012-06-11",
          "2016-04-23", "2016-04-23", "2015-01-05",
          "2015-11-24", "2015-11-24", "2013-05-05",
          "2015-01-01", "2015-01-01",
          "2014-05-06", "2014-05-06",
          "2016-04-04", "2014-04-04", "2012-04-04")
df <- as.data.frame(cbind(IDcol, Result, Date))
df$Date <- ymd(df$Date)

df1 <- df %>% group_by(IDcol, Date) %>%
  mutate(score = case_when(
    length(Result) > 3 & Date[Result == "N4"] %within% interval(Date[Result == "N1"] - duration(1, units = years)) ~ 3,
    "N1" %in% Result & "N2" %in% Result ~ 2,
    TRUE ~ 0,
  )) %>%
  filter(score > 0) %>%
  group_by(IDcol) %>%
  arrange(desc(score), Date, .by_group = TRUE) %>%
  filter(Date == first(Date)) %>%
  ungroup()

Any help is greatly appreciated
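
The "within minus 1 year" rule in the question can be read as a closed interval that ends on the N1/N2 date. The following is a minimal sketch of that test, assuming lubridate's `interval()` (which takes both a start and an end) together with `%within%`. The names `pair_date`, `n4_date`, `window` and `in_window` are made up for illustration and do not appear in the original attempt.

```r
library(lubridate)

# ID1: N1 and N2 were both measured on 2018-06-11, N4 on 2018-01-01.
pair_date <- ymd("2018-06-11")
n4_date   <- ymd("2018-01-01")

# Window covering the year up to and including the N1/N2 date.
# %m-% subtracts a calendar year without rolling over month ends.
window <- interval(pair_date %m-% years(1), pair_date)

# TRUE when the N4 date lies inside the window (bounds included).
in_window <- n4_date %within% window

# ID3 for comparison: N4 on 2013-05-05, N1/N2 on 2015-11-24.
id3_pair      <- ymd("2015-11-24")
id3_in_window <- ymd("2013-05-05") %within%
  interval(id3_pair %m-% years(1), id3_pair)
```

Whether the window should be a calendar year (as here) or a fixed 365-day duration is a design choice that the question leaves open.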

Answer 1

Score: 1

I believe this gives you your desired output. With problems like this, I like to simplify the logic a bit by first separating some information into a different data frame and using filter.

    library(dplyr)
    library(lubridate)

    # One cutoff per ID: one year before the earliest N1/N2 date
    cutoff.data <- df %>%
      filter(Result != 'N4') %>%
      group_by(IDcol) %>%
      summarise(earliest = min(Date)) %>%
      mutate(n4.cutoff = earliest - dyears(1)) %>%
      select(-earliest)

    df %>%
      left_join(cutoff.data, by = "IDcol") %>%
      group_by(IDcol) %>%
      # flag N4 rows and count them per ID
      mutate(n4.row = ifelse(Result == 'N4', 1, 0),
             n4.check = sum(n4.row)) %>%
      ungroup() %>% # grouping messes up the following ifelse step
      # note: Date[Result == 'N4'] holds one value per N4 row and is recycled across all rows
      mutate(score = ifelse(n4.check == 1 & Date[Result == 'N4'] >= n4.cutoff, 3, 2)) %>%
      group_by(IDcol) %>% # but then grouping is required to get lowest score per ID
      mutate(score = min(score))

    # A tibble: 18 x 7
    # Groups:   IDcol [5]
       IDcol Result Date       n4.cutoff           n4.row n4.check score
       <fct> <fct>  <date>     <dttm>               <dbl>    <dbl> <dbl>
     1 ID1   N1     2018-06-11 2011-06-11 18:00:00      0        1     3
     2 ID1   N2     2018-06-11 2011-06-11 18:00:00      0        1     3
     3 ID1   N4     2018-01-01 2011-06-11 18:00:00      1        1     3
     4 ID1   N1     2012-06-11 2011-06-11 18:00:00      0        1     3
     5 ID1   N2     2012-06-11 2011-06-11 18:00:00      0        1     3
     6 ID2   N1     2016-04-23 2015-04-23 18:00:00      0        1     2
     7 ID2   N2     2016-04-23 2015-04-23 18:00:00      0        1     2
     8 ID2   N4     2015-01-05 2015-04-23 18:00:00      1        1     2
     9 ID3   N1     2015-11-24 2014-11-23 18:00:00      0        1     2
    10 ID3   N2     2015-11-24 2014-11-23 18:00:00      0        1     2
    11 ID3   N4     2013-05-05 2014-11-23 18:00:00      1        1     2
    12 ID4   N1     2015-01-01 2013-05-05 18:00:00      0        0     2
    13 ID4   N2     2015-01-01 2013-05-05 18:00:00      0        0     2
    14 ID4   N1     2014-05-06 2013-05-05 18:00:00      0        0     2
    15 ID4   N2     2014-05-06 2013-05-05 18:00:00      0        0     2
    16 ID5   N1     2016-04-04 2013-04-03 18:00:00      0        1     2
    17 ID5   N2     2014-04-04 2013-04-03 18:00:00      0        1     2
    18 ID5   N4     2012-04-04 2013-04-03 18:00:00      1        1     2
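
The output above scores every row but does not drop any. A possible way to also select rows is sketched below, under a literal reading of the question's rules: keep only dates on which N1 and N2 are both measured, give such a date a score of 3 when an N4 for the same ID falls in the calendar year up to that date, then keep the highest-scoring, earliest date per ID. This is an illustrative sketch, not the answer's method; the names `pairs`, `n4`, `scored`, `chosen` and `result` are assumptions introduced here.

```r
library(dplyr)
library(lubridate)

df <- data.frame(
  IDcol  = rep(c("ID1", "ID2", "ID3", "ID4", "ID5"), times = c(5, 3, 3, 4, 3)),
  Result = c("N1", "N2", "N4", "N1", "N2",
             "N1", "N2", "N4",
             "N1", "N2", "N4",
             "N1", "N2", "N1", "N2",
             "N1", "N2", "N4"),
  Date   = ymd(c("2018-06-11", "2018-06-11", "2018-01-01", "2012-06-11", "2012-06-11",
                 "2016-04-23", "2016-04-23", "2015-01-05",
                 "2015-11-24", "2015-11-24", "2013-05-05",
                 "2015-01-01", "2015-01-01", "2014-05-06", "2014-05-06",
                 "2016-04-04", "2014-04-04", "2012-04-04"))
)

# Dates on which both N1 and N2 were measured for the same ID.
pairs <- df %>%
  filter(Result %in% c("N1", "N2")) %>%
  group_by(IDcol, pair_date = Date) %>%
  summarise(has_both = all(c("N1", "N2") %in% Result), .groups = "drop") %>%
  filter(has_both) %>%
  select(-has_both)

# N4 dates per ID.
n4 <- df %>%
  filter(Result == "N4") %>%
  select(IDcol, n4_date = Date)

# Score each pair date: 3 if an N4 lies in the year up to it, else 2.
scored <- pairs %>%
  left_join(n4, by = "IDcol") %>%
  mutate(in_window = !is.na(n4_date) &
           n4_date >= pair_date %m-% years(1) &
           n4_date <= pair_date) %>%
  group_by(IDcol, pair_date) %>%
  summarise(score = if (any(in_window)) 3 else 2, .groups = "drop")

# One pair date per ID: highest score first, then the earliest date.
chosen <- scored %>%
  group_by(IDcol) %>%
  arrange(desc(score), pair_date, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

# Keep the chosen N1/N2 rows, plus the N4 rows that earned a score of 3.
result <- df %>%
  inner_join(chosen, by = "IDcol") %>%
  filter(
    (Result %in% c("N1", "N2") & Date == pair_date) |
      (Result == "N4" & score == 3 &
         Date >= pair_date %m-% years(1) & Date <= pair_date)
  ) %>%
  select(IDcol, Result, Date, score)
```

With the sample data, this keeps ID1 with its N4 row (score 3), keeps only the N1/N2 rows for ID2, ID3 and ID4 (score 2), and drops ID5. Note that ID2 differs from the question's desired table: its N4 (2015-01-05) is more than one year before 2016-04-23, so a literal one-year window does not give it a score of 3.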





huangapple
  • Published on 2023-02-14 19:20:08
  • When reposting, please retain the link to this article: https://go.coder-hub.com/75447074.html