How can I use the MatchIt package from R to match control and case patients on age and multiple diagnosis codes (ICD10)?

Question

I have case patients and am trying to match controls on age (the easy part) and ICD10 codes using the MatchIt package in R. My problem is that a given patient can have multiple ICD codes. For example, a 20-year-old case patient may have two ICD codes, and I want to find a control patient who is also 20 years old and has at least the same two ICD10 codes (it is fine if the control patient has more).

   patient_id diagnosis   age status
        <dbl> <chr>     <dbl> <chr>
 1       1001 Z34          20 case
 2       1001 A24          20 case
 3       1002 N39          22 case
 4       1002 Z3A          22 case
 5       1003 N89          23 case
 6       1003 Z34          23 case
 7       1004 Z34          20 control
 8       1004 A24          20 control
 9       1005 D50          23 control
10       1005 F41          23 control
11       1005 N89          23 control
12       1005 Z11          23 control
13       1005 Z34          23 control
14       1006 Z12          22 control
15       1006 Z34          22 control
16       1006 N39          22 control
17       1007 E66          20 control
18       1007 Z11          20 control
19       1007 Z12          20 control
20       1007 Z34          20 control

Here is what I have tried:

library(MatchIt)
library(dplyr)

m.out <- matchit(I(status == "case") ~ age, data = df,
                 exact = ~age + diagnosis,
                 method = "optimal",
                 distance = "glm", ratio = 1)

m.data <- match.data(m.out, subclass = "matched_id")

print(m.data)
   patient_id diagnosis   age status  distance weights matched_id
        <dbl> <chr>     <dbl> <chr>      <dbl>   <dbl> <fct>
 1       1001 Z34          20 case       0.269       1 1         
 2       1001 A24          20 case       0.269       1 2         
 3       1002 N39          22 case       0.308       1 3         
 4       1003 N89          23 case       0.329       1 4         
 5       1003 Z34          23 case       0.329       1 5         
 6       1004 A24          20 control    0.269       1 2         
 7       1005 N89          23 control    0.329       1 4         
 8       1005 Z34          23 control    0.329       1 5         
 9       1006 N39          22 control    0.308       1 3         
10       1007 Z34          20 control    0.269       1 1   

As you can see, patient 1001 matched with 1007, but also matched with 1004. I only want 1001 to match with 1004 since they are both 20 years old and have the ICD codes Z34 & A24. Any help would be much appreciated.

Answer 1

Score: 1

I think the problem is that your dataset is in long format, with one row per patient and diagnosis, so matchit treats every row as a separate unit and tries to find a match for each row. The solution is to reshape the data to wide format, with one row per patient and a dummy column for each disease, then match on that. Keep in mind that you're not going to be able to match every patient if you're requesting an exact match on disease.

library(MatchIt)
library(dplyr, warn.conflicts = FALSE)

df <- tibble::tribble(
  ~patient_id, ~diagnosis, ~age, ~status,
  1001, "Z34", 20, "case",
  1001, "A24", 20, "case",
  1002, "N39", 22, "case",
  1002, "Z3A", 22, "case",
  1003, "N89", 23, "case",
  1003, "Z34", 23, "case",
  1004, "Z34", 20, "control",
  1004, "A24", 20, "control",
  1005, "D50", 23, "control",
  1005, "F41", 23, "control",
  1005, "N89", 23, "control",
  1005, "Z11", 23, "control",
  1005, "Z34", 23, "control",
  1006, "Z12", 22, "control",
  1006, "Z34", 22, "control",
  1006, "N39", 22, "control",
  1007, "E66", 20, "control",
  1007, "Z11", 20, "control",
  1007, "Z12", 20, "control",
  1007, "Z34", 20, "control"
)

df_wide <- tidyr::pivot_wider(df, names_from = diagnosis, values_from = diagnosis, values_fn = length, values_fill = 0)

df_wide
#> # A tibble: 7 × 13
#>   patient_id   age status    Z34   A24   N39   Z3A   N89   D50   F41   Z11   Z12
#>        <dbl> <dbl> <chr>   <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1       1001    20 case        1     1     0     0     0     0     0     0     0
#> 2       1002    22 case        0     0     1     1     0     0     0     0     0
#> 3       1003    23 case        1     0     0     0     1     0     0     0     0
#> 4       1004    20 control     1     1     0     0     0     0     0     0     0
#> 5       1005    23 control     1     0     0     0     1     1     1     1     0
#> 6       1006    22 control     1     0     1     0     0     0     0     0     1
#> 7       1007    20 control     1     0     0     0     0     0     0     1     1
#> # ℹ 1 more variable: E66 <int>

m.out <- matchit(I(status == "case") ~ age,
  data = df_wide,
  exact = ~ age - patient_id,
  method = "optimal",
  distance = "glm", ratio = 1
)

m.data <- match.data(m.out, subclass = "matched_id")

print(m.data)
#> # A tibble: 6 × 16
#>   patient_id   age status    Z34   A24   N39   Z3A   N89   D50   F41   Z11   Z12
#>        <dbl> <dbl> <chr>   <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1       1001    20 case        1     1     0     0     0     0     0     0     0
#> 2       1002    22 case        0     0     1     1     0     0     0     0     0
#> 3       1003    23 case        1     0     0     0     1     0     0     0     0
#> 4       1005    23 control     1     0     0     0     1     1     1     1     0
#> 5       1006    22 control     1     0     1     0     0     0     0     0     1
#> 6       1007    20 control     1     0     0     0     0     0     0     1     1
#> # ℹ 4 more variables: E66 <int>, distance <dbl>, weights <dbl>,
#> #   matched_id <fct>

Created on 2023-07-11 with reprex v2.0.2
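
One detail in the matchit call is worth checking. In R's formula syntax, subtracting a term that is not in the formula has no effect, so ~ age - patient_id reduces to ~ age and the dummy columns take no part in the exact match. That is consistent with the printed output, where case 1001 (Z34, A24) ends up alongside control 1007 (no A24). Below is a minimal sketch of how an exact-matching formula could be built from the dummy columns instead. The small df_wide here is a stand-in for the real one, and dx_cols, age_only_terms and exact_formula are illustrative names, not part of MatchIt.

```r
# Small stand-in for the wide data (a few patients and columns only).
df_wide <- data.frame(
  patient_id = c(1001, 1004, 1007),
  age        = c(20, 20, 20),
  status     = c("case", "control", "control"),
  Z34        = c(1L, 1L, 1L),
  A24        = c(1L, 1L, 0L),
  Z11        = c(0L, 0L, 1L)
)

# Subtracting a term that is not in the formula leaves it unchanged,
# so this formula only contains age.
age_only_terms <- attr(terms(~ age - patient_id), "term.labels")
age_only_terms

# Build a one-sided formula from age plus every diagnosis dummy column.
dx_cols <- setdiff(names(df_wide), c("patient_id", "age", "status"))
exact_formula <- reformulate(c("age", dx_cols))
exact_formula

# exact_formula could then be supplied as matchit(..., exact = exact_formula).
```

Exact matching on the dummy columns requires identical code sets, so a control with extra codes (such as 1005 for case 1003) would not qualify. That is stricter than the "at least the same codes" rule described in the question.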

> I only want 1001 to match with 1004 since they are both 20 years old and have the ICD codes Z34 & A24.

I've assumed this is a typo, and you mean 1007.
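
If the requirement is instead taken as written in the question (same age, and the control carries at least all of the case's codes), then in the sample data 1004 is eligible for 1001, since both are 20 and both have Z34 and A24. As far as I know, the exact argument cannot express an "at least these codes" rule directly, so here is a rough base R sketch of that rule outside MatchIt. It uses the sample data from the question, a hypothetical helper is_eligible(), and a simple greedy 1:1 assignment without replacement, which is not an optimal matching.

```r
# Sample data from the question, in long format (one row per patient/code).
df <- data.frame(
  patient_id = c(1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004,
                 1005, 1005, 1005, 1005, 1005, 1006, 1006, 1006,
                 1007, 1007, 1007, 1007),
  diagnosis = c("Z34", "A24", "N39", "Z3A", "N89", "Z34", "Z34", "A24",
                "D50", "F41", "N89", "Z11", "Z34", "Z12", "Z34", "N39",
                "E66", "Z11", "Z12", "Z34"),
  age = c(20, 20, 22, 22, 23, 23, 20, 20, 23, 23, 23, 23, 23, 22, 22, 22,
          20, 20, 20, 20),
  status = c(rep("case", 6), rep("control", 14)),
  stringsAsFactors = FALSE
)

# One row per patient, plus a named list of each patient's codes.
patients <- unique(df[, c("patient_id", "age", "status")])
codes <- split(df$diagnosis, df$patient_id)

cases <- patients[patients$status == "case", ]
controls <- patients[patients$status == "control", ]

# Hypothetical helper: TRUE if the control has the same age as the case
# and its code set contains every code the case has.
is_eligible <- function(case_id, control_id) {
  same_age <- patients$age[patients$patient_id == case_id] ==
    patients$age[patients$patient_id == control_id]
  same_age &&
    all(codes[[as.character(case_id)]] %in% codes[[as.character(control_id)]])
}

# Greedy 1:1 assignment: each case takes the first unused eligible control.
matched <- data.frame(case_id = cases$patient_id, control_id = NA_real_)
used <- c()
for (i in seq_len(nrow(matched))) {
  for (ctrl in controls$patient_id) {
    if (!(ctrl %in% used) && is_eligible(matched$case_id[i], ctrl)) {
      matched$control_id[i] <- ctrl
      used <- c(used, ctrl)
      break
    }
  }
}

matched
```

On this data the sketch pairs 1001 with 1004 and 1003 with 1005, and leaves 1002 unmatched because no 22-year-old control has both N39 and Z3A. With real data, where several controls may be eligible for the same case, the order of assignment matters, so a greedy pass like this may leave cases unmatched that a smarter assignment would cover.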

Also, when you post a question, it's a good idea to make it easy for others to copy the data. The dput() function can be useful for this.
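
As a small illustration of the dput() suggestion, the sketch below prints a data frame as R code that someone else can paste into their session to recreate it. The three-row example_df is made up for the example, and the round trip through capture.output() is only there to show that the printed code rebuilds the same object.

```r
# A tiny made-up data frame standing in for the real data.
example_df <- data.frame(
  patient_id = c(1001, 1001, 1004),
  diagnosis  = c("Z34", "A24", "Z34"),
  age        = c(20, 20, 20),
  status     = c("case", "case", "control")
)

# Prints R code (a structure(...) call) that recreates example_df;
# this text is what you would paste into the question.
dput(example_df)

# Round trip: evaluating the printed code gives back an equal data frame.
dput_text <- paste(capture.output(dput(example_df)), collapse = "\n")
example_copy <- eval(parse(text = dput_text))
```

For large data sets, dput(head(example_df, 20)) keeps the pasted text short.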

huangapple
  • Posted on 2023-07-11 06:41:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76657747.html