Error while splitting into new rows with comma as delimiter

Question

I have the following data frame:

temp = structure(list(pid = c("s1", "s1", "s1"), LEFT_GENE = c("PTPRO", "EPS8", "DPY19L2,AC084357.2,AC027667.1"
), RIGHT_GENE = c("", "FOx,D", "DPY19L2P2,S100A11P1")), row.names = c(1L, 2L, 3L), class = "data.frame")


  pid                     LEFT_GENE          RIGHT_GENE
1  s1                         PTPRO                    
2  s1                          EPS8               FOx,D
3  s1 DPY19L2,AC084357.2,AC027667.1 DPY19L2P2,S100A11P1

I want to split each comma-delimited item into its own row and create every combination of the left and right genes.
For example, the last row (3 left genes, 2 right genes) should produce 6 rows. However, I'm getting an error I don't understand.

temp %>%
  separate_rows(LEFT_GENE:RIGHT_GENE, sep = ",") %>%
  data.frame(stringsAsFactors = FALSE)

Error in `fn()`:
! In row 3, can't recycle input of size 3 to size 2.
Run `rlang::last_error()` to see where the error occurred.

However, the error seems to come from row 3, since rows 1:2 work fine:

> temp[1:2, ] %>%
+   separate_rows(LEFT_GENE:RIGHT_GENE, sep = ",") %>%
+   data.frame(stringsAsFactors = FALSE)
  pid LEFT_GENE RIGHT_GENE
1  s1     PTPRO           
2  s1      EPS8        FOx
3  s1      EPS8          D

Does anyone know what the issue is?

Answer 1

Score: 3

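Split one column at a time. When several columns are passed in a single call, separate_rows() splits them in parallel, so within each row every column must yield the same number of pieces (a single piece is recycled). Row 3 has 3 pieces in LEFT_GENE and 2 in RIGHT_GENE, hence the error.

library(dplyr)
library(tidyr)
temp %>%
  separate_rows(RIGHT_GENE) %>%
  separate_rows(LEFT_GENE)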

# A tibble: 9 × 3
  pid   LEFT_GENE  RIGHT_GENE 
  <chr> <chr>      <chr>      
1 s1    PTPRO      ""         
2 s1    EPS8       "FOx"      
3 s1    EPS8       "D"        
4 s1    DPY19L2    "DPY19L2P2"
5 s1    AC084357.2 "DPY19L2P2"
6 s1    AC027667.1 "DPY19L2P2"
7 s1    DPY19L2    "S100A11P1"
8 s1    AC084357.2 "S100A11P1"
9 s1    AC027667.1 "S100A11P1"
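
The mismatch behind the error can be checked by counting the pieces per row. The sketch below is a minimal illustration and not part of the original answer: it rebuilds `temp`, flags rows where both columns hold more than one piece but the counts differ, and then applies the two chained calls with an explicit `sep = ","`. The explicit separator is an assumption made here so that only commas split the values; the name `conflict` is hypothetical.

```r
library(tidyr)

# Same data as in the question, rebuilt with data.frame().
temp <- data.frame(
  pid = c("s1", "s1", "s1"),
  LEFT_GENE = c("PTPRO", "EPS8", "DPY19L2,AC084357.2,AC027667.1"),
  RIGHT_GENE = c("", "FOx,D", "DPY19L2P2,S100A11P1"),
  stringsAsFactors = FALSE
)

# Pieces per row in each column. Base strsplit() reports 0 pieces for an
# empty string, which is fine for this check.
n_left  <- lengths(strsplit(temp$LEFT_GENE, ",", fixed = TRUE))
n_right <- lengths(strsplit(temp$RIGHT_GENE, ",", fixed = TRUE))

# Rows that cannot be split in parallel: both sides have several pieces
# and the counts differ. Only row 3 (3 vs 2) qualifies.
conflict <- n_left > 1 & n_right > 1 & n_left != n_right

# Splitting one column per call sidesteps the size mismatch.
out <- separate_rows(temp, RIGHT_GENE, sep = ",")
out <- separate_rows(out, LEFT_GENE, sep = ",")
```

Here `which(conflict)` points at row 3, the row named in the error message, and `out` should contain the same 9 rows as the output above.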

Answer 2

Score: 1

If we need 6 rows in total (pairing the pieces by position instead of forming every combination), one option is:

library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
temp %>%
  mutate(across(ends_with("_GENE"), ~ strsplit(.x, split = ",")),
         cnt = pmax(lengths(LEFT_GENE), lengths(RIGHT_GENE))) %>%
  mutate(across(ends_with("_GENE"),
                ~ map2(.x, cnt, ~ `length<-`(.x, .y)))) %>%
  select(-cnt) %>%
  unnest_longer(where(is.list))

Output:

# A tibble: 6 × 3
  pid   LEFT_GENE  RIGHT_GENE
  <chr> <chr>      <chr>     
1 s1    PTPRO      <NA>      
2 s1    EPS8       FOx       
3 s1    <NA>       D         
4 s1    DPY19L2    DPY19L2P2 
5 s1    AC084357.2 S100A11P1 
6 s1    AC027667.1 <NA>      
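
The NA values in that output come from the padding step. As a small base R illustration (the variable names `pieces`, `padded`, `empty` and `padded_empty` are hypothetical), assigning a longer length to a vector with `length<-` extends it with NA, which is how the shorter side of each row is brought up to `cnt`:

```r
# A two-piece value padded to length 3 gains one trailing NA.
pieces <- c("FOx", "D")
padded <- `length<-`(pieces, 3)

# An empty string splits into a zero-length vector in base R ...
empty <- strsplit("", ",")[[1]]

# ... so padding it to length 1 yields a single NA.
padded_empty <- `length<-`(empty, 1)
```

This is why row 1 shows `<NA>` in RIGHT_GENE here, while the separate_rows() approach keeps an empty string.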

If the NAs should be replaced by the previous non-NA value, add fill() at the end:

...
%>% fill(ends_with("_GENE"))
# A tibble: 6 × 3
  pid   LEFT_GENE  RIGHT_GENE
  <chr> <chr>      <chr>     
1 s1    PTPRO      <NA>      
2 s1    EPS8       FOx       
3 s1    EPS8       D         
4 s1    DPY19L2    DPY19L2P2 
5 s1    AC084357.2 S100A11P1 
6 s1    AC027667.1 S100A11P1 
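
One caveat, offered as a sketch and not as part of the original answer: an ungrouped fill() runs down the whole table, so an NA at the start of one original row would be filled from the previous original row. Grouping by a row identifier keeps the fill inside each original row. The data and the `row_id` column below are hypothetical; in the pipeline above such a column could be added with `mutate(row_id = row_number())` before splitting.

```r
library(dplyr)
library(tidyr)

# Hypothetical padded result: original row 2 had an empty RIGHT_GENE.
padded <- tibble(
  row_id     = c(1L, 1L, 2L),
  LEFT_GENE  = c("A", NA, "B"),
  RIGHT_GENE = c("X", "Y", NA)
)

# Ungrouped: the NA belonging to row_id 2 is filled with "Y" from row_id 1.
leaky <- fill(padded, ends_with("_GENE"))

# Grouped: filling stays inside each original row, so that NA is kept.
scoped <- padded %>%
  group_by(row_id) %>%
  fill(ends_with("_GENE")) %>%
  ungroup()
```

With the data from the question both variants give the same result, because no original row starts with a missing value that has a predecessor to copy from.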

huangapple
  • Published on 2023-02-14 02:04:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75439645.html