R: 根据多个模式对多列进行数据透视

huangapple go评论92阅读模式
英文:

R: pivot multiple columns based on more than one pattern

问题

我有一个数据集,其中有一个名为Patient_ID的列,以及与每个出生事件的婴儿相关的多个列。由于有多个婴儿(双胞胎、三胞胎等),数据库决定以宽格式的方式工作。

所以,我有以下列:

  1. Patient_ID *代表母亲;
  2. pofid_1
  3. pof1completeddate
  4. pof1pregendweeks
  5. pofid_2
  6. pof2completeddate
  7. pof2pregendweeks

等等。

pofid_1 是每个婴儿的唯一标识符,并且是唯一不遵循pofnvarname格式的变量。每个婴儿有大约50个列,我在这里仅列出了三个以示示例。是否有一种方法可以基于pof后面的数字将整个数据集透视,以便我有以下列名,并且每个出生的婴儿有一行:

  1. Patient_ID
  2. babynumber
  3. pofid *婴儿ID;
  4. pofcompleteddate
  5. pofpregendweeks

所以,我从以下数据开始:

  1. data.frame(
  2. Patient_ID = c(1, 2, 3, 4),
  3. pofid_1 = c(1, 2, 3, 4),
  4. pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
  5. pof1pregendweeks = c(40, 39, 41, 40),
  6. pofid_2 = c(NA, NA, 5, 6),
  7. pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
  8. pof2pregendweeks = c(NA, NA, 41, 40)
  9. )

和想要的结果如下:

  1. Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
  2. 1 1 1 1 2022-11-12 40
  3. 2 2 2 1 2022-12-11 39
  4. 3 3 3 1 2022-10-10 41
  5. 4 3 5 2 2022-10-10 41
  6. 5 4 4 1 2022-01-01 40
  7. 6 4 6 2 2022-01-01 40
英文:

I have a dataset where there is an ID columns Patient_ID, and multiple columns relating to each baby of a birth event. There are more than one set of each column, as there have been multiple births (twins, triplets etc), and the database decided to work in a wide format.

So, I have the columns:

  1. Patient_ID *for the mother;
  2. pofid_1
  3. pof1completeddate
  4. pof1pregendweeks
  5. pofid_2
  6. pof2completeddate
  7. pof2pregendweeks

etc, etc.

pofid_1 refers to a unique identifier for each baby, and is the only variable that doesnt follow the format of pofnvarname (pof - pregnancy outcome form). There are ~50 columns for each baby, I have only listed three here for demonstration. Is there a way I can pivot the whole dataset based on the number after pof so I have the following column names, and one row for each baby born:

  1. Patient_ID
  2. babynumber
  3. pofid *baby ID;
  4. pofcompleteddate
  5. pofpregendweeks

So, I am starting off with:

  1. data.frame(
  2. Patient_ID = c(1, 2, 3, 4),
  3. pofid_1 = c(1, 2, 3, 4),
  4. pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
  5. pof1pregendweeks = c(40, 39, 41, 40),
  6. pofid_2 = c(NA, NA, 5, 6),
  7. pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
  8. pof2pregendweeks = c(NA, NA, 41, 40)
  9. )
  10. Patient_ID pofid_1 pof1completeddate pof1pregendweeks pofid_2 pof2completeddate pof2pregendweeks
  11. 1 1 1 2022-11-12 40 NA <NA> NA
  12. 2 2 2 2022-12-11 39 NA <NA> NA
  13. 3 3 3 2022-10-10 41 5 2022-10-10 41
  14. 4 4 4 2022-01-01 40 6 2022-01-01 40

And want

  1. Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
  2. 1 1 1 1 2022-11-12 40
  3. 2 2 2 1 2022-12-11 39
  4. 3 3 3 1 2022-10-10 41
  5. 4 3 5 2 2022-10-10 41
  6. 5 4 4 1 2022-01-01 40
  7. 6 4 6 2 2022-01-01 40

答案1

得分: 2

更改列名以确保列名一致性最好的方法是将pofid_1pof_id2更改为pof1idpof2id。您可以使用rename_with一次性完成此操作。然后,只需将数据转换为长格式并筛选以保留完整的案例:

  1. library(tidyverse)
  2. df %>%
  3. rename_with(~gsub('pofid_(\\d+)', 'pof\id', .x)) %>%
  4. pivot_longer(-Patient_ID, names_sep = '(?<=pof\\d)',
  5. names_to = c('babynumber', '.value')) %>%
  6. filter(complete.cases(.)) %>%
  7. mutate(babynumber = as.numeric(gsub('\\D', '', babynumber))) %>%
  8. rename(pofid = id)
  9. #> # A tibble: 6 x 5
  10. #> Patient_ID babynumber pofid completeddate pregendweeks
  11. #> <int> <dbl> <int> <chr> <int>
  12. #> 1 1 1 1 2022-11-12 40
  13. #> 2 2 1 2 2022-12-11 39
  14. #> 3 3 1 3 2022-10-10 41
  15. #> 4 3 2 5 2022-10-10 41
  16. #> 5 4 1 4 2022-01-01 40
  17. #> 6 4 2 6 2022-01-01 4

创建于2023-02-13,使用reprex v2.0.2


可复制格式的数据

  1. df <- structure(list(Patient_ID = 1:4, pofid_1 = 1:4,
  2. pof1completeddate = c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01"), pof1pregendweeks = c(40L,
  3. 39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA,
  4. NA, "2022-10-10", "2022-01-01"), pof2pregendweeks = c(NA, NA,
  5. 41L, 4L)), class = "data.frame", row.names = c("1", "2", "3",
  6. "4"))
英文:

It's best to ensure you have consistent naming across your columns by changing pofid_1 and pof_id2 to pof1id and pof2id. You can do this in one gulp using rename_with. Then, it's just a case of pivoting to long format and filtering to retain complete cases:

  1. library(tidyverse)
  2. df %&gt;%
  3. rename_with(~gsub(&#39;pofid_(\\d+)&#39;, &#39;pof\id&#39;, .x)) %&gt;%
  4. pivot_longer(-Patient_ID, names_sep = &#39;(?&lt;=pof\\d)&#39;,
  5. names_to = c(&#39;babynumber&#39;, &#39;.value&#39;)) %&gt;%
  6. filter(complete.cases(.)) %&gt;%
  7. mutate(babynumber = as.numeric(gsub(&#39;\\D&#39;, &#39;&#39;, babynumber))) %&gt;%
  8. rename(pofid = id)
  9. #&gt; # A tibble: 6 x 5
  10. #&gt; Patient_ID babynumber pofid completeddate pregendweeks
  11. #&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;chr&gt; &lt;int&gt;
  12. #&gt; 1 1 1 1 2022-11-12 40
  13. #&gt; 2 2 1 2 2022-12-11 39
  14. #&gt; 3 3 1 3 2022-10-10 41
  15. #&gt; 4 3 2 5 2022-10-10 41
  16. #&gt; 5 4 1 4 2022-01-01 40
  17. #&gt; 6 4 2 6 2022-01-01 4

<sup>Created on 2023-02-13 with reprex v2.0.2</sup>


Data in reproducible format

  1. df &lt;- structure(list(Patient_ID = 1:4, pofid_1 = 1:4,
  2. pof1completeddate = c(&quot;2022-11-12&quot;,
  3. &quot;2022-12-11&quot;, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;), pof1pregendweeks = c(40L,
  4. 39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA,
  5. NA, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;), pof2pregendweeks = c(NA, NA,
  6. 41L, 4L)), class = &quot;data.frame&quot;, row.names = c(&quot;1&quot;, &quot;2&quot;, &quot;3&quot;,
  7. &quot;4&quot;))

huangapple
  • 本文由 发表于 2023年2月13日 23:15:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/75437766.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定