英文:
R: pivot multiple columns based on more than one pattern
问题
我有一个数据集,其中有一个名为Patient_ID
的列,以及与每个出生事件的婴儿相关的多个列。由于有多个婴儿(双胞胎、三胞胎等),数据库决定以宽格式的方式工作。
所以,我有以下列:
Patient_ID *代表母亲;
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks
等等。
pofid_1
是每个婴儿的唯一标识符,并且是唯一不遵循pofnvarname
格式的变量。每个婴儿有大约50个列,我在这里仅列出了三个以示示例。是否有一种方法可以基于pof
后面的数字将整个数据集透视,以便我有以下列名,并且每个出生的婴儿有一行:
Patient_ID
babynumber
pofid *婴儿ID;
pofcompleteddate
pofpregendweeks
所以,我从以下数据开始:
data.frame(
Patient_ID = c(1, 2, 3, 4),
pofid_1 = c(1, 2, 3, 4),
pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
pof1pregendweeks = c(40, 39, 41, 40),
pofid_2 = c(NA, NA, 5, 6),
pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
pof2pregendweeks = c(NA, NA, 41, 40)
)
和想要的结果如下:
Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1 1 1 1 2022-11-12 40
2 2 2 1 2022-12-11 39
3 3 3 1 2022-10-10 41
4 3 5 2 2022-10-10 41
5 4 4 1 2022-01-01 40
6 4 6 2 2022-01-01 40
英文:
I have a dataset where there is an ID columns Patient_ID
, and multiple columns relating to each baby of a birth event. There are more than one set of each column, as there have been multiple births (twins, triplets etc), and the database decided to work in a wide format.
So, I have the columns:
Patient_ID *for the mother;
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks
etc, etc.
pofid_1
refers to a unique identifier for each baby, and is the only variable that doesnt follow the format of pofnvarname
(pof - pregnancy outcome form). There are ~50 columns for each baby, I have only listed three here for demonstration. Is there a way I can pivot the whole dataset based on the number after pof
so I have the following column names, and one row for each baby born:
Patient_ID
babynumber
pofid *baby ID;
pofcompleteddate
pofpregendweeks
So, I am starting off with:
data.frame(
Patient_ID = c(1, 2, 3, 4),
pofid_1 = c(1, 2, 3, 4),
pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
pof1pregendweeks = c(40, 39, 41, 40),
pofid_2 = c(NA, NA, 5, 6),
pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
pof2pregendweeks = c(NA, NA, 41, 40)
)
Patient_ID pofid_1 pof1completeddate pof1pregendweeks pofid_2 pof2completeddate pof2pregendweeks
1 1 1 2022-11-12 40 NA <NA> NA
2 2 2 2022-12-11 39 NA <NA> NA
3 3 3 2022-10-10 41 5 2022-10-10 41
4 4 4 2022-01-01 40 6 2022-01-01 40
And want
Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1 1 1 1 2022-11-12 40
2 2 2 1 2022-12-11 39
3 3 3 1 2022-10-10 41
4 3 5 2 2022-10-10 41
5 4 4 1 2022-01-01 40
6 4 6 2 2022-01-01 40
答案1
得分: 2
更改列名以确保列名一致性最好的方法是将pofid_1
和pof_id2
更改为pof1id
和pof2id
。您可以使用rename_with
一次性完成此操作。然后,只需将数据转换为长格式并筛选以保留完整的案例:
library(tidyverse)
df %>%
rename_with(~gsub('pofid_(\\d+)', 'pof\id', .x)) %>%
pivot_longer(-Patient_ID, names_sep = '(?<=pof\\d)',
names_to = c('babynumber', '.value')) %>%
filter(complete.cases(.)) %>%
mutate(babynumber = as.numeric(gsub('\\D', '', babynumber))) %>%
rename(pofid = id)
#> # A tibble: 6 x 5
#> Patient_ID babynumber pofid completeddate pregendweeks
#> <int> <dbl> <int> <chr> <int>
#> 1 1 1 1 2022-11-12 40
#> 2 2 1 2 2022-12-11 39
#> 3 3 1 3 2022-10-10 41
#> 4 3 2 5 2022-10-10 41
#> 5 4 1 4 2022-01-01 40
#> 6 4 2 6 2022-01-01 4
创建于2023-02-13,使用reprex v2.0.2
可复制格式的数据
df <- structure(list(Patient_ID = 1:4, pofid_1 = 1:4,
pof1completeddate = c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01"), pof1pregendweeks = c(40L,
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA,
NA, "2022-10-10", "2022-01-01"), pof2pregendweeks = c(NA, NA,
41L, 4L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))
英文:
It's best to ensure you have consistent naming across your columns by changing pofid_1
and pof_id2
to pof1id
and pof2id
. You can do this in one gulp using rename_with
. Then, it's just a case of pivoting to long format and filtering to retain complete cases:
library(tidyverse)
df %>%
rename_with(~gsub('pofid_(\\d+)', 'pof\id', .x)) %>%
pivot_longer(-Patient_ID, names_sep = '(?<=pof\\d)',
names_to = c('babynumber', '.value')) %>%
filter(complete.cases(.)) %>%
mutate(babynumber = as.numeric(gsub('\\D', '', babynumber))) %>%
rename(pofid = id)
#> # A tibble: 6 x 5
#> Patient_ID babynumber pofid completeddate pregendweeks
#> <int> <dbl> <int> <chr> <int>
#> 1 1 1 1 2022-11-12 40
#> 2 2 1 2 2022-12-11 39
#> 3 3 1 3 2022-10-10 41
#> 4 3 2 5 2022-10-10 41
#> 5 4 1 4 2022-01-01 40
#> 6 4 2 6 2022-01-01 4
<sup>Created on 2023-02-13 with reprex v2.0.2</sup>
Data in reproducible format
df <- structure(list(Patient_ID = 1:4, pofid_1 = 1:4,
pof1completeddate = c("2022-11-12",
"2022-12-11", "2022-10-10", "2022-01-01"), pof1pregendweeks = c(40L,
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA,
NA, "2022-10-10", "2022-01-01"), pof2pregendweeks = c(NA, NA,
41L, 4L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论