Using data.table to filter
Question

I have this dataset (example):

library(data.table)

dt <- data.table(ID = c(1,1,1,2,2,3,4,5,5,5),
                 diagnosis = c("cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer"),
                 Date = c(2008,2001,2013,2008,2013,2013,2013,2001,2002,2013))

I ONLY want patients with a first diagnosis in 2013, so rows from any other year should be dropped from the dataset.

However, a patient should not be included in the new dataset if they have a diagnosis in 2008.
If the patient has had a diagnosis before 2008, then we will keep them, with their 2013 diagnosis.

So the final dataset will look like this:

 ID diagnosis Date
  3    cancer 2013
  4    cancer 2013
  5    cancer 2013

How can I do this using data.table?

Answer 1

Score: 1

Updated code:

library(data.table)

dt <- data.table(ID = c(1,1,1,2,2,3,4,5,5,5),
                 diagnosis = c("cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer"),
                 Date = c(2008,2001,2013,2008,2013,2013,2013,2001,2002,2013))

# Keep the 2013 rows of patients whose ID never appears with a 2008 diagnosis
dt[diagnosis=="cancer" & Date == 2013 & !(ID %in% dt[diagnosis=="cancer" & Date == 2008, ID]),]

Output:

   ID diagnosis Date
1:  3    cancer 2013
2:  4    cancer 2013
3:  5    cancer 2013
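
The one-liner above can be easier to follow when split into two steps. This is a minimal sketch of the same logic, not a different method; the names `ids_2008` and `res` are introduced here only for illustration.

```r
library(data.table)

dt <- data.table(ID = c(1,1,1,2,2,3,4,5,5,5),
                 diagnosis = rep("cancer", 10),
                 Date = c(2008,2001,2013,2008,2013,2013,2013,2001,2002,2013))

# Step 1: IDs of patients with a cancer diagnosis recorded in 2008
ids_2008 <- dt[diagnosis == "cancer" & Date == 2008, unique(ID)]

# Step 2: keep the 2013 rows whose ID is not in that set
res <- dt[diagnosis == "cancer" & Date == 2013 & !(ID %in% ids_2008)]
print(res)
```

Here `ids_2008` holds patients 1 and 2, so only the 2013 rows of patients 3, 4 and 5 remain.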

Answer 2

Score: 1

Using a not-join (see ?data.table):

dt[Date == 2013][!dt[Date == 2008], on=.(ID)]

Output:

      ID diagnosis  Date
   <num>    <char> <num>
1:     3    cancer  2013
2:     4    cancer  2013
3:     5    cancer  2013

I guess it is more efficient to use a filter than an aggregate condition like any().
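
The efficiency guess above is easy to check on your own data. The sketch below builds a synthetic table (the size, the seed and the names `big`, `t_join` and `t_group` are assumptions made for illustration), confirms that the not-join and a grouped `any()` filter select the same patients, and prints the elapsed times. No timing result is claimed here; it will vary with the data and the machine.

```r
library(data.table)

set.seed(1)   # arbitrary seed, only so the synthetic data is reproducible
n <- 20000    # assumed number of rows; adjust to resemble your data
big <- data.table(ID = sample(2000, n, replace = TRUE),
                  diagnosis = "cancer",
                  Date = sample(2001:2013, n, replace = TRUE))

# Not-join: 2013 rows whose ID has no matching 2008 row
t_join <- system.time(
  res_join <- big[Date == 2013][!big[Date == 2008], on = .(ID)]
)

# Grouped filter using any(), evaluated once per patient
t_group <- system.time(
  res_group <- big[, .SD[!any(Date == 2008) & Date == 2013], by = ID]
)

# Both approaches should select the same patients and the same number of rows
stopifnot(setequal(res_join$ID, res_group$ID),
          nrow(res_join) == nrow(res_group))

cat("not-join elapsed:", t_join[["elapsed"]], "\n")
cat("grouped  elapsed:", t_group[["elapsed"]], "\n")
```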

Answer 3

Score: 0

dt[, .SD[!2008 %in% unlist(.SD) & Date == 2013], ID]

Or, probably a bit better after seeing Waldi's answer:

dt[, .SD[!any(Date == 2008) & Date == 2013], ID]

Result:

# ID diagnosis Date
# 1:  3    cancer 2013
# 2:  4    cancer 2013
# 3:  5    cancer 2013
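
A common variant of the grouped filter collects row numbers with `.I` and subsets the table once, instead of subsetting `.SD` inside every group. The sketch below is an alternative formulation and not part of the answer above; the names `idx` and `res` are illustrative.

```r
library(data.table)

dt <- data.table(ID = c(1,1,1,2,2,3,4,5,5,5),
                 diagnosis = rep("cancer", 10),
                 Date = c(2008,2001,2013,2008,2013,2013,2013,2001,2002,2013))

# Per patient, collect the row numbers of 2013 diagnoses,
# but only if that patient has no 2008 diagnosis
idx <- dt[, .I[!any(Date == 2008) & Date == 2013], by = ID]$V1

# Subset the original table once using those row numbers
res <- dt[idx]
print(res)
```

This also avoids `unlist(.SD)`, which coerces every column to character and would match the value 2008 in any column, not just `Date`.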

Answer 4

Score: 0

dt[, .SD[Date == 2013 & !any(between(Date, 2008, 2012)),], ID]
#       ID diagnosis  Date
#    <num>    <char> <num>
# 1:     3    cancer  2013
# 2:     4    cancer  2013
# 3:     5    cancer  2013
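
Note that `between(Date, 2008, 2012)` excludes patients with any diagnosis from 2008 through 2012 (bounds are inclusive by default), which is broader than testing for 2008 alone. On the example data the two rules agree. The sketch below adds a hypothetical patient (ID 6, diagnosed in 2010 and 2013, not in the original data) to show where they differ.

```r
library(data.table)

dt <- data.table(ID = c(1,1,1,2,2,3,4,5,5,5),
                 diagnosis = rep("cancer", 10),
                 Date = c(2008,2001,2013,2008,2013,2013,2013,2001,2002,2013))

# Hypothetical extra patient, added only to illustrate the difference
dt2 <- rbind(dt, data.table(ID = c(6, 6), diagnosis = "cancer", Date = c(2010, 2013)))

# Rule A: exclude patients with a diagnosis in 2008 only
only_2008 <- dt2[, .SD[Date == 2013 & !any(Date == 2008)], by = ID]

# Rule B: exclude patients with a diagnosis anywhere in 2008-2012
range_2008_2012 <- dt2[, .SD[Date == 2013 & !any(between(Date, 2008, 2012))], by = ID]

print(only_2008$ID)        # patient 6 is kept under rule A
print(range_2008_2012$ID)  # patient 6 is dropped under rule B
```

Which rule is right depends on how "a diagnosis in 2008" in the question is meant to be read.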

huangapple
  • Posted on February 8, 2023, 19:36:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75385221.html