2023-02-14 19:20:08

R: Identifying distinct rows on dates within intervals and scoring rows based on occurrence

Question

Dear geniuses of Stack

I need help writing some fairly advanced code. I have a data frame with three columns: a unique ID column (IDcol), a result column (Result), and the date of each result (Date).
There are three types of results: N1, N2 and N4. Each can be measured repeatedly over time for each ID. My data could look something like this:

   IDcol Result       Date
1    ID1     N1 2018-06-11
2    ID1     N2 2018-06-11
3    ID1     N4 2018-01-01
4    ID1     N1 2012-06-11
5    ID1     N2 2012-06-11
6    ID2     N1 2016-04-23
7    ID2     N2 2016-04-23
8    ID2     N4 2015-01-05
9    ID3     N1 2015-11-24
10   ID3     N2 2015-11-24
11   ID3     N4 2013-05-05
12   ID4     N1 2015-01-01
13   ID4     N2 2015-01-01
14   ID4     N1 2014-05-06
15   ID4     N2 2014-05-06
16   ID5     N1 2016-04-04
17   ID5     N2 2014-04-04
18   ID5     N4 2012-04-04

As you can see for ID1, N1 and N2 are measured on two dates, and N4 is measured in between. The code needs to "scan" for the earliest date on which both N1 and N2 are measured; the scoring system could give such an occurrence a score of 2. However, if N4 occurs within the year before an N1 and N2 measurement, the code should detect this and prioritize it with a higher score of 3. So for ID1, the desired output would look like this:

  IDcol Result       Date score
1   ID1     N1 2018-06-11     3
2   ID1     N2 2018-06-11     3
3   ID1     N4 2018-01-01     3

This means that the earlier N1 and N2 results were dropped, because N4 was also measured within one year before the later N1 and N2 measurements. For ID2, all three rows should be kept. For ID3, N4 is measured but falls outside the minus-one-year interval, so that row should be dropped. For ID4, the earliest measurements should be prioritized over the latest. Lastly, ID5 is dropped entirely because the dates on which its results were measured differ too much. Effectively, the desired output should look like this:


   IDcol Result       Date score
1    ID1     N1 2018-06-11     3
2    ID1     N2 2018-06-11     3
3    ID1     N4 2018-01-01     3
4    ID2     N1 2016-04-23     3
5    ID2     N2 2016-04-23     3
6    ID2     N4 2015-01-05     3
7    ID3     N1 2015-11-24     2
8    ID3     N2 2015-11-24     2
9    ID4     N1 2014-05-06     2
10   ID4     N2 2014-05-06     2

I have tried the following with dplyr:

library(dplyr)
library(lubridate)

IDcol <- c("ID1", "ID1", "ID1",
           "ID1", "ID1",
           "ID2", "ID2", "ID2",
           "ID3", "ID3", "ID3",
           "ID4", "ID4",
           "ID4", "ID4",
           "ID5", "ID5", "ID5")
Result <- c("N1", "N2", "N4",
            "N1", "N2",
            "N1", "N2", "N4",
            "N1", "N2", "N4",
            "N1", "N2",
            "N1", "N2",
            "N1", "N2",  "N4")
Date <- c("2018-06-11","2018-06-11", "2018-01-01",
          "2012-06-11", "2012-06-11",
          "2016-04-23", "2016-04-23", "2015-01-05",
          "2015-11-24", "2015-11-24", "2013-05-05",
          "2015-01-01", "2015-01-01",
          "2014-05-06", "2014-05-06",
          "2016-04-04", "2014-04-04", "2012-04-04")
df <- as.data.frame(cbind(IDcol, Result, Date))
df$Date <- ymd(df$Date)
df1 <- df %>% group_by(IDcol, Date) %>%
  mutate(score=case_when(length(Result) > 3 & Date[Result=="N4"] %within% interval (Date[Result=="N1"]-duration(1,units=years)) ~ 3,
                         "N1" %in% Result & "N2" %in% Result ~ 2,
                         TRUE ~ 0,
  )) %>%
  filter(score > 0) %>%
  group_by(IDcol) %>%
  arrange(desc(score), Date,.by_group=TRUE) %>%
  filter(Date == first(Date)) %>%
  ungroup()

Any help is greatly appreciated

Answer 1

Score: 1

I believe this gives you your desired output. With problems like this I like to simplify the logic/code a bit by separating some information into a different df first and using filter.

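library(dplyr)
library(lubridate)

# Per ID: earliest N1/N2 date minus one year.
# dyears(1) is a fixed duration (365.25 days here, judging by the 18:00:00 in the output),
# so n4.cutoff becomes a date-time rather than a date.
cutoff.data <- df %>%
  filter(Result != 'N4') %>%
  group_by(IDcol) %>%
  summarise(earliest = min(Date)) %>%
  mutate(n4.cutoff = earliest - dyears(1)) %>%
  select(-earliest)

df %>%
  left_join(cutoff.data, by = "IDcol") %>%
  group_by(IDcol) %>%
  mutate(n4.row = ifelse(Result == 'N4',1,0),
         n4.check = sum(n4.row)) %>%
  ungroup() %>% # grouping messes up the following ifelse step
  # Caution: Date[Result == 'N4'] has one element per N4 row and is recycled across all rows here
  mutate(score = ifelse(n4.check == 1 & Date[Result == 'N4'] >= n4.cutoff, 3, 2)) %>%
  group_by(IDcol) %>% # but then grouping is required to get lowest score per ID
  mutate(score = min(score))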

# A tibble: 18 x 7
# Groups:   IDcol [5]
   IDcol Result Date       n4.cutoff           n4.row n4.check score
   <fct> <fct>  <date>     <dttm>               <dbl>    <dbl> <dbl>
 1 ID1   N1     2018-06-11 2011-06-11 18:00:00      0        1     3
 2 ID1   N2     2018-06-11 2011-06-11 18:00:00      0        1     3
 3 ID1   N4     2018-01-01 2011-06-11 18:00:00      1        1     3
 4 ID1   N1     2012-06-11 2011-06-11 18:00:00      0        1     3
 5 ID1   N2     2012-06-11 2011-06-11 18:00:00      0        1     3
 6 ID2   N1     2016-04-23 2015-04-23 18:00:00      0        1     2
 7 ID2   N2     2016-04-23 2015-04-23 18:00:00      0        1     2
 8 ID2   N4     2015-01-05 2015-04-23 18:00:00      1        1     2
 9 ID3   N1     2015-11-24 2014-11-23 18:00:00      0        1     2
10 ID3   N2     2015-11-24 2014-11-23 18:00:00      0        1     2
11 ID3   N4     2013-05-05 2014-11-23 18:00:00      1        1     2
12 ID4   N1     2015-01-01 2013-05-05 18:00:00      0        0     2
13 ID4   N2     2015-01-01 2013-05-05 18:00:00      0        0     2
14 ID4   N1     2014-05-06 2013-05-05 18:00:00      0        0     2
15 ID4   N2     2014-05-06 2013-05-05 18:00:00      0        0     2
16 ID5   N1     2016-04-04 2013-04-03 18:00:00      0        1     2
17 ID5   N2     2014-04-04 2013-04-03 18:00:00      0        1     2
18 ID5   N4     2012-04-04 2013-04-03 18:00:00      1        1     2
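
The output above still contains all 18 rows and gives ID2 a score of 2, so it does not yet match the table in the question. As a rough sketch of how the rule from the question could be applied per ID, here is a base R version that needs no packages. The names `score_ids`, `pick_one` and `window_days` are made up for this illustration, and "one year" is assumed to mean 365 days; neither assumption comes from the question or the answer.

```r
# Demo data copied from the question
df <- data.frame(
  IDcol  = rep(c("ID1", "ID2", "ID3", "ID4", "ID5"), times = c(5, 3, 3, 4, 3)),
  Result = c("N1", "N2", "N4", "N1", "N2",
             "N1", "N2", "N4",
             "N1", "N2", "N4",
             "N1", "N2", "N1", "N2",
             "N1", "N2", "N4"),
  Date   = as.Date(c("2018-06-11", "2018-06-11", "2018-01-01",
                     "2012-06-11", "2012-06-11",
                     "2016-04-23", "2016-04-23", "2015-01-05",
                     "2015-11-24", "2015-11-24", "2013-05-05",
                     "2015-01-01", "2015-01-01",
                     "2014-05-06", "2014-05-06",
                     "2016-04-04", "2014-04-04", "2012-04-04"))
)

# Hypothetical helper: window_days is an assumed reading of "minus one year"
score_ids <- function(df, window_days = 365) {
  pick_one <- function(g) {
    n1 <- g$Date[g$Result == "N1"]
    n2 <- g$Date[g$Result == "N2"]
    n4 <- g$Date[g$Result == "N4"]
    # Candidate dates: those on which both N1 and N2 were measured
    both <- sort(unique(n1[n1 %in% n2]))
    if (length(both) == 0) return(NULL)  # e.g. ID5: N1 and N2 never share a date
    # Does any N4 fall in [candidate - window_days, candidate]?
    has_n4 <- vapply(seq_along(both), function(i) {
      any(n4 <= both[i] & n4 >= both[i] - window_days)
    }, logical(1))
    score <- ifelse(has_n4, 3, 2)
    # Highest score first, earliest date as the tie-breaker
    best <- order(-score, both)[1]
    d <- both[best]
    keep <- (g$Result %in% c("N1", "N2") & g$Date == d) |
      (score[best] == 3 & g$Result == "N4" &
         g$Date <= d & g$Date >= d - window_days)
    res <- g[keep, , drop = FALSE]
    res$score <- score[best]
    res
  }
  pieces <- Filter(Negate(is.null), lapply(split(df, df$IDcol), pick_one))
  if (length(pieces) == 0) return(NULL)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

print(score_ids(df))
```

Because every check runs inside one ID's rows, nothing is recycled across IDs. With a strict 365-day window, ID2 comes out with a score of 2 and loses its N4 row, since that N4 was measured 474 days before N1 and N2. Calling `score_ids(df, window_days = 500)` (an arbitrary wider window) would give ID2 a score of 3 as in the question's table, so the exact window the asker intends probably needs to be confirmed.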


