How to pivot_wider/melt a matrix of values with multiple header rows in Excel sheet?

Question

I've got a number of Excel sheets, each with a matrix of about 120 lab results and 150 patients. However, the lab results are stored as rows starting at row 4, and the patients are stored as columns with extra header data in rows 1 to 3. So the Excel sheet looks like this:

[screenshot of the Excel sheet]

And I want to read it and convert it to a longer/normalised data frame, basically like this mockup:

[mockup of the desired long-format data frame]

I think it could be processed with pivot_longer or melt, but the empty cells in the top left are causing trouble, i.e. the column headers of the lab results start at row 4, and the patient columns start at column C.

By the way, the patient column names always start at column D, but the actual names (EXA_2665, EXA_2707, etc.) are not known beforehand; they will be newly issued patient IDs.

Here is the code I've got so far.

library(tidyverse)
library(readxl)
library(dplyr)
library(tidyr)

##xls_data <- read_excel(path = "lab2_exafinal.xlsx", sheet = "Labata")
## TEST hardcoded data for testing purposes
xls_data2 <- read.table(text = "
...1,...2,Patient-Id,EXA_2665,EXA_2707,EXA_2971
,,Tube,87019054,87065976,87093024
,,Tube-Pos,C12,D01,D02
LabId,LabDescr,WeightFact,,,
9,HbA1c,1.00,1.784,0.419,1.045
37,HDL-Chol,1.00,0.684,2.172,1.954
38,LDL-Chol,1.00,2.438,1.995,0.806
41,Glucose,1.00,1.366,1.47,0.978
", header = TRUE, sep = ",", na.strings = "")  

library(reshape)
df_melt <- melt(xls_data, id.vars=c("...1", "...2"))
# error: names do not match previous names

pivoted_data <- pivot_longer(
  xls_data,
  cols = names(xls_data[3,]),
  names_to = "patient",
  values_to = "value" 
  )
#no error but not quite the desired result

EDIT: @Jon-Spring suggested using dput to share the exact structure of the data as returned by read_excel when reading the Excel file (by the way, it also reads 1.045 as 1.044999 etc., but that is a separate issue for now). Also note that the patient label names may include the - character.

#xls_data <- read_excel(path = "lab2_exafinal.xlsx", sheet = "Labata")
#dput(xls_data)
xls_data <- structure(
  list(
    ...1 = c(NA, NA, "LabId", "9", "37", "38", "41"),
    ...2 = c(NA, NA, "LabDescr", "HbA1c", "HDL-Chol", "LDL-Chol", "Glucose"),
    ...3 = c("Tube", "Tube-Pos", "WeightFact", "1,00", "1,00", "1,00", "1,00"),
    EXA_2665 = c("87019054", "C12", NA, "1.784", "0.68400000000000005", "2.4380000000000002", "1.3660000000000001"),
    EXA_2707 = c("87065976", "D01", NA, "0.41899999999999998", "2.1720000000000002", "1.9950000000000001", "1.47"),
    EXA_2971 = c("87093024", "D02", NA, "1.0449999999999999", "1.954", "0.80600000000000005", "0.97799999999999998")
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -7L)
  )

When I read the Excel sheet using readxl::read_excel, the first columns are named ...1, ...2. So when using melt I think those should be the id.vars parameters(?), but it gives an error message. The result of pivot_longer looks closer to what I'm trying to do, but the problem is that there are extra header rows and I don't know how to use (or just remove?) those.

I can think of other ways to do it, but I don't know how to do that in Rstudio:

  • Somehow try to read it as two separate data frames, the patients (C1..F3) and the lab values (A4..F8), and then merge them.
  • Or first copy the patient header names EXA_2665 etc. to the empty row 4, and then re-interpret just the A4..F8 region.

Any other idea how to best approach this?

Answer 1

Score: 2

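The unheadr package offers some tools for reshaping tables with these sorts of features, which might make them more compact for human readers but trickier to parse with code.

Here, along the lines of your 2nd idea, we can combine the four rows of headers (the first of which comes in as a header, the next three as data rows). I specified - as the separator because _ already appears as a non-separator in PatientId.

Then we can pivot_longer those combined headers into multiple names columns alongside each value.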

library(unheadr)
xls_data2 |>
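  # take the first three rows and combine them with the header
  mash_colnames(n_name_rows = 3, sep = "-") |>
  # reshape everything besides the first three columns, separating the
  # combined headers into new columns named
  # 1) the third original header
  # 2-4) the first three data rows in the third column
  pivot_longer(-c(1:3), names_sep = "-", 
               names_to = c(colnames(xls_data2)[3], xls_data2[1:3, 3])) |>
  dplyr::rename(LabId = 1, LabDescr = 2, WeightFact2 = 3) 
  # dplyr::rename to avoid conflict with `reshape::rename`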

Result:

# A tibble: 12 × 8
   LabId LabDescr WeightFact2 PatientId Tube     TubePos WeightFact value
   <chr> <chr>    <chr>       <chr>     <chr>    <chr>   <chr>      <chr>
 1 9     HbA1c    1.00        EXA_2665  87019054 C12     NA         1.784
 2 9     HbA1c    1.00        EXA_2707  87065976 D01     NA         0.419
 3 9     HbA1c    1.00        EXA_2971  87093024 D02     NA         1.045
 4 37    HDL-Chol 1.00        EXA_2665  87019054 C12     NA         0.684
 5 37    HDL-Chol 1.00        EXA_2707  87065976 D01     NA         2.172
 6 37    HDL-Chol 1.00        EXA_2971  87093024 D02     NA         1.954
 7 38    LDL-Chol 1.00        EXA_2665  87019054 C12     NA         2.438
 8 38    LDL-Chol 1.00        EXA_2707  87065976 D01     NA         1.995
 9 38    LDL-Chol 1.00        EXA_2971  87093024 D02     NA         0.806
10 41    Glucose  1.00        EXA_2665  87019054 C12     NA         1.366
11 41    Glucose  1.00        EXA_2707  87065976 D01     NA         1.47 
12 41    Glucose  1.00        EXA_2971  87093024 D02     NA         0.978
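
To make the header-mashing step more concrete, here is a base R sketch of the idea on a cut-down copy of the data. It is only an illustration and not how unheadr implements mash_colnames; the names raw, n_hdr, mashed and body are made up for this example, and it assumes that @ never appears in any header cell.

```r
# Cut-down stand-in for the sheet as read_excel returns it (all character)
raw <- data.frame(
  c1 = c(NA, NA, "LabId", "9", "37"),
  c2 = c(NA, NA, "LabDescr", "HbA1c", "HDL-Chol"),
  c3 = c("Tube", "Tube-Pos", "WeightFact", "1,00", "1,00"),
  EXA_2665 = c("87019054", "C12", NA, "1.784", "0.684"),
  EXA_2707 = c("87065976", "D01", NA, "0.419", "2.172"),
  stringsAsFactors = FALSE
)
n_hdr <- 3  # number of header rows that arrived as data rows

# Paste each column name and its first n_hdr cells into one combined name
# (paste() turns a missing cell into the string "NA")
mashed <- vapply(
  seq_along(raw),
  function(j) paste(c(names(raw)[j], raw[[j]][seq_len(n_hdr)]), collapse = "@"),
  character(1)
)

# Drop the header rows and use the combined names instead
body <- raw[-seq_len(n_hdr), ]
names(body) <- mashed

# Each combined name can later be split back into its parts
parts <- strsplit(names(body)[4], "@", fixed = TRUE)[[1]]
parts
```

Splitting the combined names again is what names_sep does inside pivot_longer, which is why the separator must not occur inside any of the header values themselves.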

Edit with dput(xls_data) example:

xls_data |>
  mash_colnames(n_name_rows = 3, sep = "@") |>
  pivot_longer(-c(1:3), names_sep = "@", 
               names_to = c(colnames(xls_data2)[3], xls_data2[1:3, 3])) |>
  dplyr::rename(LabId = 1, LabDescr = 2, WeightFact2 = 3) |>
  dplyr::mutate(across(value, as.numeric))

Result:

# A tibble: 12 × 8
   LabId LabDescr WeightFact2 PatientId Tube     TubePos WeightFact value
   <chr> <chr>    <chr>       <chr>     <chr>    <chr>   <chr>      <dbl>
 1 9     HbA1c    1,00        EXA_2665  87019054 C12     NA         1.78 
 2 9     HbA1c    1,00        EXA_2707  87065976 D01     NA         0.419
 3 9     HbA1c    1,00        EXA_2971  87093024 D02     NA         1.04 
 4 37    HDL-Chol 1,00        EXA_2665  87019054 C12     NA         0.684
 5 37    HDL-Chol 1,00        EXA_2707  87065976 D01     NA         2.17 
 6 37    HDL-Chol 1,00        EXA_2971  87093024 D02     NA         1.95 
 7 38    LDL-Chol 1,00        EXA_2665  87019054 C12     NA         2.44 
 8 38    LDL-Chol 1,00        EXA_2707  87065976 D01     NA         2.00 
 9 38    LDL-Chol 1,00        EXA_2971  87093024 D02     NA         0.806
10 41    Glucose  1,00        EXA_2665  87019054 C12     NA         1.37 
11 41    Glucose  1,00        EXA_2707  87065976 D01     NA         1.47 
12 41    Glucose  1,00        EXA_2971  87093024 D02     NA         0.978
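
As a follow-up to the as.numeric step, here is a small base R sketch of cleaning the two kinds of number strings visible in this output: the long decimals produced by read_excel and the decimal-comma WeightFact values. It is my own illustration rather than part of the pipeline above; the helper name to_number is hypothetical, and it assumes the strings contain no thousands separators.

```r
# Hypothetical helper: convert number-like strings to numeric,
# accepting either "." or "," as the decimal mark
to_number <- function(x) {
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

values  <- c("1.0449999999999999", "0.68400000000000005", "1.47")
weights <- c("1,00", "1,00")

# Rounding to the precision of the lab values hides the long tails
rounded_values <- round(to_number(values), 3)
numeric_weights <- to_number(weights)

rounded_values
numeric_weights
```

Inside the pipeline this could be applied with dplyr::mutate(across(c(value, WeightFact2), to_number)), again assuming those are the columns that hold numbers.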

Answer 2

Score: 1

If you feel like it, there is a Power Query method, shown below. It is dynamic and would accommodate any number of PatientID data columns.

Bring the data in without any headers:

```
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
// Part1: the three patient header rows, transposed to one row per patient
#"Keep Rows" = Table.FirstN(Table.RemoveColumns(Source,{"Column1", "Column2"}),3),
#"Promoted Headers" = Table.PromoteHeaders(Table.Transpose(#"Keep Rows"), [PromoteAllScalars=true]),
#"Added Custom" = Table.AddColumn(#"Promoted Headers", "Custom", each {1 .. Table.RowCount(Source)-4}),
Part1 = Table.ExpandListColumn(#"Added Custom", "Custom"),
// Part2: the lab rows, with the first `leading` columns kept as identifiers
#"Promoted Headers2" = Table.PromoteHeaders(Table.Skip(Source,3), [PromoteAllScalars=true]),
leading=3, groupsof=1,
#"Added Custom1" = Table.AddColumn(#"Promoted Headers2", "Custom", each List.Split( List.RemoveFirstN(Record.ToList( _),leading), groupsof) ),
#"Added Custom2" = Table.AddColumn(#"Added Custom1", "Custom0", each Text.Combine(List.Transform(List.FirstN(Record.ToList(_),leading), each Text.From(_)),"|")),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom2",{"Custom0", "Custom"}),
#"Expanded Custom" = Table.ExpandListColumn( #"Removed Other Columns", "Custom"),
#"Extracted Values" = Table.TransformColumns(#"Expanded Custom", {"Custom", each Text.Combine(List.Transform(_, Text.From), "|"), type text}),
#"Merged Columns" = Table.CombineColumns(#"Extracted Values",{"Custom0", "Custom"},Combiner.CombineTextByDelimiter("|", QuoteStyle.None),"Custom"),
#"Split Column by Delimiter" = Table.SplitColumn(#"Merged Columns", "Custom", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), List.FirstN(Table.ColumnNames(Source),leading+groupsof)),
Part2 = Table.RenameColumns(#"Split Column by Delimiter",{{"Column1", "LabID"}, {"Column2", "LabDescr"}, {"Column3", "WeightFact"}, {"Column4", "Value"}}),
// Put the columns of both parts side by side and drop the helper column
Combine=Table.FromColumns(Table.ToColumns(Part2) & Table.ToColumns(Part1),Table.ColumnNames(Part2)&Table.ColumnNames(Part1)),
#"Removed Columns2" = Table.RemoveColumns(Combine,{"Custom"})
in #"Removed Columns2"
```


Answer 3

Score: 0

If you want a "pure" tidyverse approach, you can use the following:

library(dplyr)
library(tidyr)
library(purrr)  # set_names() used below is not exported by dplyr or tidyr

xls_data <- structure(
   list(
      ...1 = c(NA, NA, "LabId", "9", "37", "38", "41"),
      ...2 = c(NA, NA, "LabDescr", "HbA1c", "HDL-Chol", "LDL-Chol", "Glucose"),
      ...3 = c("Tube", "TubePos", "WeightFact", "1,00", "1,00", "1,00", "1,00"),
      EXA_2665 = c("87019054", "C12", NA, "1.784", "0.68400000000000005", "2.4380000000000002", "1.3660000000000001"),
      EXA_2707 = c("87065976", "D01", NA, "0.41899999999999998", "2.1720000000000002", "1.9950000000000001", "1.47"),
      EXA_2971 = c("87093024", "D02", NA, "1.0449999999999999", "1.954", "0.80600000000000005", "0.97799999999999998")
   ),
   class = c("tbl_df", "tbl", "data.frame"),
   row.names = c(NA, -7L)
)

# 1. Get start of the data
data_start <- xls_data %>%
   select(first_col = 1) %>%
   summarize(start = which(!is.na(first_col))[1L] + 1L) %>%
   pull(start)

# 2. Extract the 2 data blocks

values <- xls_data %>%
   slice(data_start:n()) %>%
   set_names(
      c(
         xls_data %>%
            slice(data_start - 1L) %>%
            select(1:3) %>%
            unlist(),
         names(.)[-(1:3)]
      )
   ) %>%
   mutate(across(-(2:3), as.numeric))

meta <-  xls_data %>%
   slice(1:(data_start - 2L)) %>%
   select(-(1:2)) %>%
   set_names(c("PatientId", names(.)[-1L])) %>%
   bind_rows(names(.) %>%
                as.list() %>%
                set_names(.) %>%
                as.data.frame(),
             .
   ) %>%
   t() %>%
   `colnames<-`(.[1L, ]) %>%
   as_tibble() %>%
   slice(-1L)   

# 3. Reshape and join
values %>%
   pivot_longer(-(1:3)) %>%
   inner_join(meta,
              c(name = "PatientId")) %>%
   rename(PatientId = name) %>%
   select(LabId:PatientId, Tube:TubePos, Value = value) %>%
   arrange(PatientId, LabId)

Admittedly, this is much longer code than the excellent proposal of John, but it is a good (and verbose) exercise for mastering the tidyverse.
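
The same split-and-join idea can also be sketched in base R on a cut-down copy of the data. This is a minimal illustration, assuming the layout is exactly as in the question (two metadata rows, then the LabId header row, with patients from the fourth column onward); the names raw, meta, vals, long and result are made up for this example.

```r
# Cut-down stand-in for the sheet as read_excel returns it (all character)
raw <- data.frame(
  c1 = c(NA, NA, "LabId", "9", "37"),
  c2 = c(NA, NA, "LabDescr", "HbA1c", "HDL-Chol"),
  c3 = c("Tube", "TubePos", "WeightFact", "1,00", "1,00"),
  EXA_2665 = c("87019054", "C12", NA, "1.784", "0.684"),
  EXA_2707 = c("87065976", "D01", NA, "0.419", "2.172"),
  stringsAsFactors = FALSE
)

hdr_row  <- which(!is.na(raw[[1]]))[1]   # row holding LabId/LabDescr/WeightFact
patients <- names(raw)[-(1:3)]           # patient IDs are the column names

# Block 1: one row of metadata per patient
meta <- data.frame(
  PatientId = patients,
  Tube      = unlist(raw[1, patients], use.names = FALSE),
  TubePos   = unlist(raw[2, patients], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Block 2: the lab values, with the header row promoted to column names
vals <- raw[(hdr_row + 1):nrow(raw), ]
names(vals)[1:3] <- unlist(raw[hdr_row, 1:3], use.names = FALSE)

# Stack one patient column at a time, then attach the metadata by patient ID
long <- do.call(rbind, lapply(patients, function(p) {
  data.frame(vals[1:3], PatientId = p, Value = as.numeric(vals[[p]]),
             stringsAsFactors = FALSE)
}))
result <- merge(long, meta, by = "PatientId")
result
```

Because the patient IDs are taken from names(raw) at run time, nothing here depends on knowing the IDs beforehand, which was one of the concerns in the question.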


huangapple
  • Posted on June 15, 2023, 00:40:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76475800-2.html