Is there a way to collapse and convert a dataframe based upon multiple criteria?

Question

I have a dataset (matrix) of genotypes at different marker loci.

    df <- structure(list(
      Genotype = 1:5,
      Locus_1a = c(0L, 1L, 1L, 1L, 1L),
      Locus_1b = c(1L, 1L, 0L, 1L, 0L),
      Locus_1c = c(1L, 0L, 1L, 1L, 0L),
      Locus_2a = c(0L, 1L, 1L, 0L, 0L),
      Locus_2b = c(1L, 1L, 0L, 1L, 1L),
      Locus_2c = c(1L, 0L, 1L, 1L, 0L)
    ), class = "data.frame", row.names = c(NA, -5L))

I am trying to convert the dataset from 0/1 notation to the following:

    Genotype  Locus_1_1  Locus_1_2  Locus_1_3
    1         b          c
    2         a          b
    3         a          c
    4         a          b          c
    5         a          a

Most genotypes (1 through 3 in this example) are diploid (2n) and have two distinct alleles, each of which is represented by the string at the end of the column name.

Individual 4 is a triploid (3n) individual and has three distinct alleles.

Individual 5 is diploid (2n) but homozygous for a single allele (Locus_1a), so that allele should appear twice in the output.

The data are presented with the marker locus name in the columns, with a variable string at the end of each name indicating which allele (a, b, c, etc.) was detected for that individual.

I'm not sure how to write the code for this task.

Answer 1

Score: 1

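    library(dplyr)  # filter(), distinct(), mutate(.by = ) (needs dplyr >= 1.1.0)
    library(tidyr)  # pivot_longer(), extract(), pivot_wider()

    df %>%
       pivot_longer(-Genotype) %>%                      # one row per genotype/column pair
       filter(value > 0) %>%                            # keep detected alleles only
       extract(name, "value", "_\\d(.*)") %>%           # keep the allele letter; the locus number is dropped
       distinct() %>%                                   # alleles are pooled across loci, so drop repeats
       mutate(name = row_number(), .by = Genotype) %>%  # number the alleles within each genotype
       pivot_wider(names_prefix = 'Locus_')

    # A tibble: 5 × 4
      Genotype Locus_1 Locus_2 Locus_3
         <int> <chr>   <chr>   <chr>
    1        1 b       c       NA
    2        2 a       b       NA
    3        3 a       c       NA
    4        4 a       b       c
    5        5 a       b       NA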
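The pipeline above discards the locus number, so alleles from Locus_1 and Locus_2 are pooled per genotype. That is why Genotype 5 comes out as a b (a from Locus_1a, b from Locus_2b) rather than the a a requested in the question. What follows is a minimal sketch of a per-locus variant, not part of the original answer. It assumes that a locus with exactly one detected allele is a homozygous diploid, so that allele is repeated once. It makes no attempt to infer dosage for a triploid with fewer than three distinct alleles. The names `per_locus`, `copies`, and `slot` are illustrative only.

```r
library(dplyr)  # mutate(.by = ) needs dplyr >= 1.1.0
library(tidyr)

df <- structure(list(
  Genotype = 1:5,
  Locus_1a = c(0L, 1L, 1L, 1L, 1L),
  Locus_1b = c(1L, 1L, 0L, 1L, 0L),
  Locus_1c = c(1L, 0L, 1L, 1L, 0L),
  Locus_2a = c(0L, 1L, 1L, 0L, 0L),
  Locus_2b = c(1L, 1L, 0L, 1L, 1L),
  Locus_2c = c(1L, 0L, 1L, 1L, 0L)
), class = "data.frame", row.names = c(NA, -5L))

per_locus <- df %>%
  # Split each column name into the locus ("Locus_1") and the allele ("a")
  pivot_longer(-Genotype,
               names_to = c("locus", "allele"),
               names_pattern = "(.*_\\d+)([a-z]+)$") %>%
  # Keep only the alleles that were detected
  filter(value > 0) %>%
  # ASSUMPTION: a single detected allele means a homozygous diploid,
  # so that allele is written twice
  mutate(copies = if_else(n() == 1, 2L, 1L), .by = c(Genotype, locus)) %>%
  uncount(copies) %>%
  # Number the allele slots within each genotype/locus combination
  mutate(slot = row_number(), .by = c(Genotype, locus)) %>%
  select(-value) %>%
  # Produces Locus_1_1, Locus_1_2, ... with NA where a slot is unused
  pivot_wider(names_from = c(locus, slot),
              values_from = allele,
              names_sort = TRUE)

print(per_locus)
```

With the example data, the Locus_1 columns reproduce the table requested in the question (b c, a b, a c, a b c, a a), and Locus_2 gets its own Locus_2_1 and Locus_2_2 columns. If the true ploidy of each individual is known, passing it in explicitly would be more reliable than the single-allele assumption used here.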




huangapple
  • Posted on June 6, 2023, 04:17:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76409729.html