How can I use the MatchIt package from R to match control and case patients on age and multiple diagnosis codes (ICD10)?
Question
I have case patients and am trying to match controls based on age (the easy part) and ICD10 codes using the MatchIt package in R. My problem is that a given patient can have multiple ICD codes. For example, a 20-year-old case patient may have 2 ICD codes, and I want to find a control patient who is also 20 years old and has at least the same two ICD10 codes (the control patient may have more ICD10 codes, which is fine).
patient_id diagnosis age status
<dbl> <chr> <dbl> <chr>
1 1001 Z34 20 case
2 1001 A24 20 case
3 1002 N39 22 case
4 1002 Z3A 22 case
5 1003 N89 23 case
6 1003 Z34 23 case
7 1004 Z34 20 control
8 1004 A24 20 control
9 1005 D50 23 control
10 1005 F41 23 control
11 1005 N89 23 control
12 1005 Z11 23 control
13 1005 Z34 23 control
14 1006 Z12 22 control
15 1006 Z34 22 control
16 1006 N39 22 control
17 1007 E66 20 control
18 1007 Z11 20 control
19 1007 Z12 20 control
20 1007 Z34 20 control
Here is what I have tried:
library(MatchIt)
library(dplyr)
m.out <- matchit(I(status == "case") ~ age, data = df,
                 exact = ~age + diagnosis,
                 method = "optimal",
                 distance = "glm", ratio = 1)
m.data <- match.data(m.out, subclass = "matched_id")
print(m.data)
patient_id diagnosis age status distance weights matched_id
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <fct>
1 1001 Z34 20 case 0.269 1 1
2 1001 A24 20 case 0.269 1 2
3 1002 N39 22 case 0.308 1 3
4 1003 N89 23 case 0.329 1 4
5 1003 Z34 23 case 0.329 1 5
6 1004 A24 20 control 0.269 1 2
7 1005 N89 23 control 0.329 1 4
8 1005 Z34 23 control 0.329 1 5
9 1006 N39 22 control 0.308 1 3
10 1007 Z34 20 control 0.269 1 1
As you can see, patient 1001 matched with 1007, but also matched with 1004. I only want 1001 to match with 1004 since they are both 20 years old and have the ICD codes Z34 & A24. Any help would be much appreciated.
Answer 1
Score: 1
I think the problem is that your dataset is in long format, so `matchit()` will try to make a match for each row. The solution is to reshape the data to wide format and dummy-code all of the diagnoses, then match on that. Keep in mind that you won't be able to match every row if you request an exact match on diagnosis.
library(MatchIt)
library(dplyr, warn.conflicts = FALSE)
df <- tibble::tribble(
~patient_id, ~diagnosis, ~age, ~status,
1001, "Z34", 20, "case",
1001, "A24", 20, "case",
1002, "N39", 22, "case",
1002, "Z3A", 22, "case",
1003, "N89", 23, "case",
1003, "Z34", 23, "case",
1004, "Z34", 20, "control",
1004, "A24", 20, "control",
1005, "D50", 23, "control",
1005, "F41", 23, "control",
1005, "N89", 23, "control",
1005, "Z11", 23, "control",
1005, "Z34", 23, "control",
1006, "Z12", 22, "control",
1006, "Z34", 22, "control",
1006, "N39", 22, "control",
1007, "E66", 20, "control",
1007, "Z11", 20, "control",
1007, "Z12", 20, "control",
1007, "Z34", 20, "control"
)
df_wide <- tidyr::pivot_wider(df, names_from = diagnosis, values_from = diagnosis, values_fn = length, values_fill = 0)
df_wide
#> # A tibble: 7 × 13
#> patient_id age status Z34 A24 N39 Z3A N89 D50 F41 Z11 Z12
#> <dbl> <dbl> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1001 20 case 1 1 0 0 0 0 0 0 0
#> 2 1002 22 case 0 0 1 1 0 0 0 0 0
#> 3 1003 23 case 1 0 0 0 1 0 0 0 0
#> 4 1004 20 control 1 1 0 0 0 0 0 0 0
#> 5 1005 23 control 1 0 0 0 1 1 1 1 0
#> 6 1006 22 control 1 0 1 0 0 0 0 0 1
#> 7 1007 20 control 1 0 0 0 0 0 0 1 1
#> # ℹ 1 more variable: E66 <int>
m.out <- matchit(I(status == "case") ~ age,
data = df_wide,
exact = ~ age - patient_id,
method = "optimal",
distance = "glm", ratio = 1
)
m.data <- match.data(m.out, subclass = "matched_id")
print(m.data)
#> # A tibble: 6 × 16
#> patient_id age status Z34 A24 N39 Z3A N89 D50 F41 Z11 Z12
#> <dbl> <dbl> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1001 20 case 1 1 0 0 0 0 0 0 0
#> 2 1002 22 case 0 0 1 1 0 0 0 0 0
#> 3 1003 23 case 1 0 0 0 1 0 0 0 0
#> 4 1005 23 control 1 0 0 0 1 1 1 1 0
#> 5 1006 22 control 1 0 1 0 0 0 0 0 1
#> 6 1007 20 control 1 0 0 0 0 0 0 1 1
#> # ℹ 4 more variables: E66 <int>, distance <dbl>, weights <dbl>,
#> # matched_id <fct>
<sup>Created on 2023-07-11 with reprex v2.0.2</sup>
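One thing worth checking in the call above: in an R formula, `- patient_id` only removes a term, so `~ age - patient_id` keeps `age` alone and the dummy columns never enter the exact-matching constraint. The sketch below uses base R to inspect the terms, and shows one possible way to build a formula that does include the dummy columns. `code_cols` is an illustrative subset of the columns, not something from the original code, and exact matching on those columns would require identical 0/1 values, not just shared codes.

```r
# Which variables does the exact-matching formula from the answer contain?
f_answer <- ~ age - patient_id
attr(terms(f_answer), "term.labels")

# Hypothetical alternative: build a one-sided formula from age plus dummy columns.
# code_cols is an illustrative subset; in practice it could be taken from
# setdiff(names(df_wide), c("patient_id", "age", "status")).
code_cols <- c("Z34", "A24", "N39")
f_all <- reformulate(c("age", code_cols))
attr(terms(f_all), "term.labels")
```

The first call should list only `age`; the second lists `age` followed by the three code columns. A formula like `f_all` could then be passed as the `exact` argument of `matchit()`.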
> I only want 1001 to match with 1004 since they are both 20 years old and have the ICD codes Z34 & A24.
I've assumed this is a typo, and you mean 1007.
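If the pairing of 1001 with 1004 was in fact what the question intended (a control of the same age who has at least the case's codes), exact matching on dummy columns would be too strict, because it demands identical code sets. Below is a minimal base-R sketch of the "at least the same codes" rule, assuming the long-format data from the question. `eligible_controls()` is a hypothetical helper written for this illustration, not part of MatchIt, and the result is a list of eligible controls per case, not a final 1:1 assignment.

```r
# Same long-format data as in the question (base R only).
df <- data.frame(
  patient_id = c(1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004,
                 1005, 1005, 1005, 1005, 1005, 1006, 1006, 1006,
                 1007, 1007, 1007, 1007),
  diagnosis = c("Z34", "A24", "N39", "Z3A", "N89", "Z34", "Z34", "A24",
                "D50", "F41", "N89", "Z11", "Z34", "Z12", "Z34", "N39",
                "E66", "Z11", "Z12", "Z34"),
  age = c(20, 20, 22, 22, 23, 23, 20, 20, 23, 23, 23, 23, 23,
          22, 22, 22, 20, 20, 20, 20),
  status = c(rep("case", 6), rep("control", 14)),
  stringsAsFactors = FALSE
)

# One row per patient, plus each patient's set of codes keyed by patient_id.
patients <- unique(df[, c("patient_id", "age", "status")])
codes <- split(df$diagnosis, df$patient_id)

cases <- patients[patients$status == "case", ]
controls <- patients[patients$status == "control", ]

# Hypothetical helper: controls of the same age whose codes
# include every code of the given case.
eligible_controls <- function(case_id, case_age) {
  same_age <- controls$patient_id[controls$age == case_age]
  needed <- codes[[as.character(case_id)]]
  keep <- vapply(
    same_age,
    function(id) all(needed %in% codes[[as.character(id)]]),
    logical(1)
  )
  same_age[keep]
}

candidates <- Map(eligible_controls, cases$patient_id, cases$age)
names(candidates) <- cases$patient_id
candidates
```

On the sample data this lists 1004 for case 1001, 1005 for case 1003, and no eligible control for case 1002 (no 22-year-old control has both N39 and Z3A). If several cases share an eligible control, a separate assignment step would still be needed to enforce a 1:1 ratio.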
Also, it's a good idea when you post your question to make it easy for others to copy the data. The `dput()` function can be useful for this.
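As a small illustration of that suggestion, here is a sketch with a made-up three-row data frame (`df_small` is only an example name). `dput()` prints R code that recreates the object, so readers can paste it straight into their session.

```r
df_small <- data.frame(
  patient_id = c(1001, 1001, 1004),
  diagnosis = c("Z34", "A24", "Z34"),
  age = c(20, 20, 20),
  status = c("case", "case", "control"),
  stringsAsFactors = FALSE
)

# Print R code that recreates the object; paste this text into the question.
dput(df_small)

# Round trip: evaluating the printed code rebuilds an equivalent data frame.
rebuilt <- eval(parse(text = capture.output(dput(df_small))))
all.equal(rebuilt, df_small)
```

For a large dataset, posting `dput(head(df, 20))` keeps the question short while still giving others something they can run.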