2023年2月13日 23:15:49go评论95阅读模式

英文:

R: pivot multiple columns based on more than one pattern

问题

我有一个数据集，其中有一个名为Patient_ID的列，以及与每个出生事件的婴儿相关的多个列。由于有多个婴儿（双胞胎、三胞胎等），数据库决定以宽格式的方式工作。

所以，我有以下列：

Patient_ID *代表母亲;
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks

等等。

pofid_1 是每个婴儿的唯一标识符，并且是唯一不遵循pofnvarname格式的变量。每个婴儿有大约50个列，我在这里仅列出了三个以示示例。是否有一种方法可以基于pof后面的数字将整个数据集透视，以便我有以下列名，并且每个出生的婴儿有一行：

Patient_ID
babynumber
pofid *婴儿ID;
pofcompleteddate
pofpregendweeks

所以，我从以下数据开始：

data.frame(
  Patient_ID = c(1, 2, 3, 4),
  pofid_1 = c(1, 2, 3, 4),
  pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
  pof1pregendweeks = c(40, 39, 41, 40),
  pofid_2 = c(NA, NA, 5, 6),
  pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
  pof2pregendweeks = c(NA, NA, 41, 40)
)

和想要的结果如下：

  Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1          1     1          1      2022-11-12             40
2          2     2          1      2022-12-11             39
3          3     3          1      2022-10-10             41
4          3     5          2      2022-10-10             41
5          4     4          1      2022-01-01             40
6          4     6          2      2022-01-01             40

英文:

I have a dataset where there is an ID columns Patient_ID, and multiple columns relating to each baby of a birth event. There are more than one set of each column, as there have been multiple births (twins, triplets etc), and the database decided to work in a wide format.

So, I have the columns:

Patient_ID *for the mother;
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks

etc, etc.

pofid_1 refers to a unique identifier for each baby, and is the only variable that doesnt follow the format of pofnvarname (pof - pregnancy outcome form). There are ~50 columns for each baby, I have only listed three here for demonstration. Is there a way I can pivot the whole dataset based on the number after pof so I have the following column names, and one row for each baby born:

Patient_ID
babynumber
pofid *baby ID;
pofcompleteddate
pofpregendweeks

So, I am starting off with:

data.frame(
  Patient_ID = c(1, 2, 3, 4),
  pofid_1 = c(1, 2, 3, 4),
  pof1completeddate = as.Date(c(&quot;2022-11-12&quot;, &quot;2022-12-11&quot;, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;)),
  pof1pregendweeks = c(40, 39, 41, 40),
  pofid_2 = c(NA, NA, 5, 6),
  pof2completeddate = as.Date(c(NA, NA, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;)),
  pof2pregendweeks = c(NA, NA, 41, 40)
)
Patient_ID pofid_1 pof1completeddate pof1pregendweeks pofid_2 pof2completeddate pof2pregendweeks
1          1       1        2022-11-12               40      NA              &lt;NA&gt;               NA
2          2       2        2022-12-11               39      NA              &lt;NA&gt;               NA
3          3       3        2022-10-10               41       5        2022-10-10               41
4          4       4        2022-01-01               40       6        2022-01-01               40

And want

  Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1          1     1          1       2022-11-12              40
2          2     2          1       2022-12-11              39
3          3     3          1       2022-10-10              41
4          3     5          2       2022-10-10              41
5          4     4          1       2022-01-01              40
6          4     6          2       2022-01-01              40

答案1

得分: 2

更改列名以确保列名一致性最好的方法是将pofid_1和pof_id2更改为pof1id和pof2id。您可以使用rename_with一次性完成此操作。然后，只需将数据转换为长格式并筛选以保留完整的案例：

library(tidyverse)
df %>% 
  rename_with(~gsub('pofid_(\\d+)', 'pof\id', .x)) %>%
  pivot_longer(-Patient_ID, names_sep = '(?<=pof\\d)',
               names_to = c('babynumber', '.value')) %>%
  filter(complete.cases(.)) %>%
  mutate(babynumber = as.numeric(gsub('\\D', '', babynumber))) %>%
  rename(pofid = id)
#> # A tibble: 6 x 5
#>   Patient_ID babynumber pofid completeddate pregendweeks
#>        <int>      <dbl> <int> <chr>                <int>
#> 1          1          1     1 2022-11-12              40
#> 2          2          1     2 2022-12-11              39
#> 3          3          1     3 2022-10-10              41
#> 4          3          2     5 2022-10-10              41
#> 5          4          1     4 2022-01-01              40
#> 6          4          2     6 2022-01-01               4

^{创建于2023-02-13，使用reprex v2.0.2}

可复制格式的数据

df <- structure(list(Patient_ID = 1:4, pofid_1 = 1:4, 
                     pof1completeddate = c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01"), pof1pregendweeks = c(40L, 
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA, 
NA, "2022-10-10", "2022-01-01"), pof2pregendweeks = c(NA, NA, 
41L, 4L)), class = "data.frame", row.names = c("1", "2", "3", 
"4"))

英文:

It's best to ensure you have consistent naming across your columns by changing pofid_1 and pof_id2 to pof1id and pof2id. You can do this in one gulp using rename_with. Then, it's just a case of pivoting to long format and filtering to retain complete cases:

library(tidyverse)
df %&gt;% 
  rename_with(~gsub(&#39;pofid_(\\d+)&#39;, &#39;pof\id&#39;, .x)) %&gt;%
  pivot_longer(-Patient_ID, names_sep = &#39;(?&lt;=pof\\d)&#39;,
               names_to = c(&#39;babynumber&#39;, &#39;.value&#39;)) %&gt;%
  filter(complete.cases(.)) %&gt;%
  mutate(babynumber = as.numeric(gsub(&#39;\\D&#39;, &#39;&#39;, babynumber))) %&gt;%
  rename(pofid = id)
#&gt; # A tibble: 6 x 5
#&gt;   Patient_ID babynumber pofid completeddate pregendweeks
#&gt;        &lt;int&gt;      &lt;dbl&gt; &lt;int&gt; &lt;chr&gt;                &lt;int&gt;
#&gt; 1          1          1     1 2022-11-12              40
#&gt; 2          2          1     2 2022-12-11              39
#&gt; 3          3          1     3 2022-10-10              41
#&gt; 4          3          2     5 2022-10-10              41
#&gt; 5          4          1     4 2022-01-01              40
#&gt; 6          4          2     6 2022-01-01               4

<sup>Created on 2023-02-13 with reprex v2.0.2</sup>

Data in reproducible format

df &lt;- structure(list(Patient_ID = 1:4, pofid_1 = 1:4, 
                     pof1completeddate = c(&quot;2022-11-12&quot;, 
&quot;2022-12-11&quot;, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;), pof1pregendweeks = c(40L, 
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA, 
NA, &quot;2022-10-10&quot;, &quot;2022-01-01&quot;), pof2pregendweeks = c(NA, NA, 
41L, 4L)), class = &quot;data.frame&quot;, row.names = c(&quot;1&quot;, &quot;2&quot;, &quot;3&quot;, 
&quot;4&quot;))

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

R: 根据多个模式对多列进行数据透视

问题

答案1

Optimization with multiple inequality constraints

如何在列范围下联合更改值，并在其他列中分别更改。

如何基于一个起始数字在一列中创建重复数字序列？

如何在PowerPoint中使用OfficeR搜索和替换文本。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。