R: 根据多个模式对多列进行数据透视

huangapple go评论51阅读模式
英文:

R: pivot multiple columns based on more than one pattern

问题

我有一个数据集,其中有一个名为Patient_ID的列,以及与每个出生事件的婴儿相关的多个列。由于有多个婴儿(双胞胎、三胞胎等),数据库决定以宽格式的方式工作。

所以,我有以下列:

Patient_ID *代表母亲;
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks

等等。

pofid_1 是每个婴儿的唯一标识符,并且是唯一不遵循pofnvarname格式的变量。每个婴儿有大约50个列,我在这里仅列出了三个以示示例。是否有一种方法可以基于pof后面的数字将整个数据集透视,以便我有以下列名,并且每个出生的婴儿有一行:

Patient_ID
babynumber
pofid *婴儿ID;
pofcompleteddate
pofpregendweeks

所以,我从以下数据开始:

data.frame(
  Patient_ID = c(1, 2, 3, 4),
  pofid_1 = c(1, 2, 3, 4),
  pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
  pof1pregendweeks = c(40, 39, 41, 40),
  pofid_2 = c(NA, NA, 5, 6),
  pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
  pof2pregendweeks = c(NA, NA, 41, 40)
)

和想要的结果如下:

  Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1          1     1          1      2022-11-12             40
2          2     2          1      2022-12-11             39
3          3     3          1      2022-10-10             41
4          3     5          2      2022-10-10             41
5          4     4          1      2022-01-01             40
6          4     6          2      2022-01-01             40
英文:

I have a dataset where there is an ID columns Patient_ID, and multiple columns relating to each baby of a birth event. There are more than one set of each column, as there have been multiple births (twins, triplets etc), and the database decided to work in a wide format.

So, I have the columns:

Patient_ID *for the mother;
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks

etc, etc.

pofid_1 refers to a unique identifier for each baby, and is the only variable that doesnt follow the format of pofnvarname (pof - pregnancy outcome form). There are ~50 columns for each baby, I have only listed three here for demonstration. Is there a way I can pivot the whole dataset based on the number after pof so I have the following column names, and one row for each baby born:

Patient_ID
babynumber
pofid *baby ID;
pofcompleteddate
pofpregendweeks

So, I am starting off with:

data.frame(
  Patient_ID = c(1, 2, 3, 4),
  pofid_1 = c(1, 2, 3, 4),
  pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
  pof1pregendweeks = c(40, 39, 41, 40),
  pofid_2 = c(NA, NA, 5, 6),
  pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
  pof2pregendweeks = c(NA, NA, 41, 40)
)

Patient_ID pofid_1 pof1completeddate pof1pregendweeks pofid_2 pof2completeddate pof2pregendweeks
1          1       1        2022-11-12               40      NA              <NA>               NA
2          2       2        2022-12-11               39      NA              <NA>               NA
3          3       3        2022-10-10               41       5        2022-10-10               41
4          4       4        2022-01-01               40       6        2022-01-01               40

And want

  Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1          1     1          1       2022-11-12              40
2          2     2          1       2022-12-11              39
3          3     3          1       2022-10-10              41
4          3     5          2       2022-10-10              41
5          4     4          1       2022-01-01              40
6          4     6          2       2022-01-01              40

答案1

得分: 2

更改列名以确保列名一致性最好的方法是将pofid_1pof_id2更改为pof1idpof2id。您可以使用rename_with一次性完成此操作。然后,只需将数据转换为长格式并筛选以保留完整的案例:

library(tidyverse)

df %>% 
  rename_with(~gsub('pofid_(\\d+)', 'pof\id', .x)) %>%
  pivot_longer(-Patient_ID, names_sep = '(?<=pof\\d)',
               names_to = c('babynumber', '.value')) %>%
  filter(complete.cases(.)) %>%
  mutate(babynumber = as.numeric(gsub('\\D', '', babynumber))) %>%
  rename(pofid = id)
#> # A tibble: 6 x 5
#>   Patient_ID babynumber pofid completeddate pregendweeks
#>        <int>      <dbl> <int> <chr>                <int>
#> 1          1          1     1 2022-11-12              40
#> 2          2          1     2 2022-12-11              39
#> 3          3          1     3 2022-10-10              41
#> 4          3          2     5 2022-10-10              41
#> 5          4          1     4 2022-01-01              40
#> 6          4          2     6 2022-01-01               4

创建于2023-02-13,使用reprex v2.0.2


可复制格式的数据

df <- structure(list(Patient_ID = 1:4, pofid_1 = 1:4, 
                     pof1completeddate = c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01"), pof1pregendweeks = c(40L, 
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA, 
NA, "2022-10-10", "2022-01-01"), pof2pregendweeks = c(NA, NA, 
41L, 4L)), class = "data.frame", row.names = c("1", "2", "3", 
"4"))
英文:

It's best to ensure you have consistent naming across your columns by changing pofid_1 and pof_id2 to pof1id and pof2id. You can do this in one gulp using rename_with. Then, it's just a case of pivoting to long format and filtering to retain complete cases:

library(tidyverse)

df %&gt;% 
  rename_with(~gsub(&#39;pofid_(\\d+)&#39;, &#39;pof\id&#39;, .x)) %&gt;%
  pivot_longer(-Patient_ID, names_sep = &#39;(?&lt;=pof\\d)&#39;,
               names_to = c(&#39;babynumber&#39;, &#39;.value&#39;)) %&gt;%
  filter(complete.cases(.)) %&gt;%
  mutate(babynumber = as.numeric(gsub(&#39;\\D&#39;, &#39;&#39;, babynumber))) %&gt;%
  rename(pofid = id)
#&gt; # A tibble: 6 x 5
#&gt;   Patient_ID babynumber pofid completeddate pregendweeks
#&gt;        &lt;int&gt;      &lt;dbl&gt; &lt;int&gt; &lt;chr&gt;                &lt;int&gt;
#&gt; 1          1          1     1 2022-11-12              40
#&gt; 2          2          1     2 2022-12-11              39
#&gt; 3          3          1     3 2022-10-10              41
#&gt; 4          3          2     5 2022-10-10              41
#&gt; 5          4          1     4 2022-01-01              40
#&gt; 6          4          2     6 2022-01-01               4

<sup>Created on 2023-02-13 with reprex v2.0.2</sup>


Data in reproducible format

df &lt;- structure(list(Patient_ID = 1:4, pofid_1 = 1:4, 
                     pof1completeddate = c(&quot;2022-11-12&quot;, 
&quot;2022-12-11&quot;, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;), pof1pregendweeks = c(40L, 
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA, 
NA, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;), pof2pregendweeks = c(NA, NA, 
41L, 4L)), class = &quot;data.frame&quot;, row.names = c(&quot;1&quot;, &quot;2&quot;, &quot;3&quot;, 
&quot;4&quot;))

huangapple
  • 本文由 发表于 2023年2月13日 23:15:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/75437766.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定