Using data.table to filter
Question
I have this dataset (example):
library(data.table)
dt <- data.table(ID = c(1,1,1,2,2,3,4,5,5,5),
diagnosis = c("cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer"),
Date = c(2008,2001,2013,2008,2013,2013,2013,2001,2002,2013))
I only want patients with a diagnosis in 2013. Rows from any other year should be dropped from the dataset.
However, a patient should not be counted in the new dataset if the patient has a diagnosis in 2008.
If the patient has had a diagnosis before 2008, we will keep them, with their 2013 diagnosis.
So the final dataset will look like this:
ID diagnosis Date
3 cancer 2013
4 cancer 2013
5 cancer 2013
How can I do this using data.table?
Answer 1
Score: 1
Updated code:
dt <- data.table(ID = c(1,1,1,2,2,3,4,5,5,5),
diagnosis = c("cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer", "cancer"),
Date = c(2008,2001,2013,2008,2013,2013,2013,2001,2002,2013))
dt[diagnosis=="cancer" & Date == 2013 & !(ID %in% dt[diagnosis=="cancer" & Date == 2008, ID]),]
Output:
ID diagnosis Date
1: 3 cancer 2013
2: 4 cancer 2013
3: 5 cancer 2013
Answer 2
Score: 1
Using a not-join (see ?data.table):
dt[Date == 2013][!dt[Date == 2008], on=.(ID)]
Output:
ID diagnosis Date
<num> <char> <num>
1: 3 cancer 2013
2: 4 cancer 2013
3: 5 cancer 2013
I guess it is more efficient to use a filter than an aggregate condition like any.
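To make the not-join easier to follow, here is a minimal sketch that splits it into two named steps. The names in_2013, had_2008 and result are illustrative and not part of the original answer; the logic is meant to match the one-liner above.

```r
library(data.table)

dt <- data.table(
  ID = c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5),
  diagnosis = rep("cancer", 10),
  Date = c(2008, 2001, 2013, 2008, 2013, 2013, 2013, 2001, 2002, 2013)
)

# Step 1: candidate rows, i.e. every diagnosis recorded in 2013
in_2013 <- dt[Date == 2013]

# Step 2: IDs that must be excluded because they have a 2008 diagnosis
had_2008 <- unique(dt[Date == 2008, .(ID)])

# Not-join: keep the rows of in_2013 whose ID has no match in had_2008
result <- in_2013[!had_2008, on = .(ID)]
print(result)
```

The `!` in front of the inner table turns the join into an anti-join, so no per-group aggregation is needed.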
Answer 3
Score: 0
dt[, .SD[!2008 %in% unlist(.SD) & Date == 2013], ID]
Or, probably a bit better after seeing Waldi's answer:
dt[, .SD[!any(Date == 2008) & Date == 2013], ID]
Result:
# ID diagnosis Date
# 1: 3 cancer 2013
# 2: 4 cancer 2013
# 3: 5 cancer 2013
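Subsetting .SD inside each group is often reported to be slow when there are many groups. A commonly used alternative, shown here as a sketch rather than as part of the original answer, is to collect row numbers with .I per group and subset once. The names idx and result are illustrative.

```r
library(data.table)

dt <- data.table(
  ID = c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5),
  diagnosis = rep("cancer", 10),
  Date = c(2008, 2001, 2013, 2008, 2013, 2013, 2013, 2001, 2002, 2013)
)

# .I holds the row numbers of the current group within dt.
# Keep the 2013 rows of groups that contain no 2008 diagnosis.
idx <- dt[, .I[Date == 2013 & !any(Date == 2008)], by = ID]$V1

# Subset the original table once with the collected row numbers
result <- dt[idx]
print(result)
```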
Answer 4
Score: 0
dt[, .SD[Date == 2013 & !any(between(Date, 2008, 2012)),], ID]
# ID diagnosis Date
# <num> <char> <num>
# 1: 3 cancer 2013
# 2: 4 cancer 2013
# 3: 5 cancer 2013
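Note that between(Date, 2008, 2012) includes both bounds by default, so this answer excludes patients with any diagnosis from 2008 through 2012, not only 2008. On the example data the result is the same as in the other answers. The sketch below uses a made-up patient 6 (not in the original data) with a 2010 diagnosis to show where the two rules differ.

```r
library(data.table)

# Hypothetical data: patient 6 does not appear in the question
dt2 <- data.table(
  ID = c(5, 5, 5, 6, 6),
  diagnosis = "cancer",
  Date = c(2001, 2002, 2013, 2010, 2013)
)

# Rule from the other answers: exclude only patients diagnosed in 2008
only_2008 <- dt2[, .SD[Date == 2013 & !any(Date == 2008)], by = ID]

# Rule from this answer: exclude patients diagnosed anywhere in 2008 to 2012
range_2008_2012 <- dt2[, .SD[Date == 2013 & !any(between(Date, 2008, 2012))], by = ID]

print(only_2008)        # keeps patients 5 and 6
print(range_2008_2012)  # keeps patient 5 only
```

Which rule is right depends on whether a diagnosis between 2009 and 2012 should also disqualify a patient; the question only mentions 2008 explicitly.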