June 6, 2023 04:17:07

Is there a way to collapse and convert a dataframe based upon multiple criteria?

Question

I have a dataset (matrix) of genotypes at different marker loci.

    df <- structure(list(
      Genotype = 1:5,
      Locus_1a = c(0L, 1L, 1L, 1L, 1L),
      Locus_1b = c(1L, 1L, 0L, 1L, 0L),
      Locus_1c = c(1L, 0L, 1L, 1L, 0L),
      Locus_2a = c(0L, 1L, 1L, 0L, 0L),
      Locus_2b = c(1L, 1L, 0L, 1L, 1L),
      Locus_2c = c(1L, 0L, 1L, 1L, 0L)
    ), class = "data.frame", row.names = c(NA, -5L))

I am trying to convert the dataset from 0/1 notation to the following:

Genotype	Locus_1_1	Locus_1_2	Locus_1_3
1	b	c
2	a	b
3	a	c
4	a	b	c
5	a	a

Most genotypes (1 through 3 in this example) are diploid (2n) and have two distinct alleles, each identified by the letter at the end of the column name.

Individual 4 is a triploid (3n) individual and has three distinct alleles.

Individual 5 is diploid (2n) but homozygous for a single allele (Locus_1a), so that allele should appear twice in the output.

The input columns are named by marker locus, with a suffix letter (a, b, c, etc.) identifying the allele; the 0/1 value records whether that allele was detected in that individual.

I'm not sure how to write the code for this.

Answer 1

Score: 1

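    library(dplyr)   # `.by` in mutate() needs dplyr >= 1.1.0
    library(tidyr)

    df %>%
       pivot_longer(-Genotype) %>%                       # one row per Genotype/column pair
       filter(value > 0) %>%                             # keep detected alleles only
       extract(name, "value", "_\\d(.*)") %>%            # pull the allele letter out of the column name
       distinct() %>%                                    # drop repeats (alleles are pooled across loci)
       mutate(name = row_number(), .by = Genotype) %>%   # number the alleles within each genotype
       pivot_wider(names_prefix = 'Locus_')              # spread into Locus_1, Locus_2, ...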

Output:

    # A tibble: 5 × 4
      Genotype Locus_1 Locus_2 Locus_3
         <int> <chr>   <chr>   <chr>
    1        1 b       c       NA
    2        2 a       b       NA
    3        3 a       c       NA
    4        4 a       b       c
    5        5 a       b       NA
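
Note that this result pools alleles across both loci, so genotype 5 comes out as a, b (Locus_1a plus Locus_2b) rather than the a, a requested in the question. A minimal per-locus sketch follows. It assumes that exactly one detected allele at a locus means a homozygous diploid, and that column names follow the `Locus_<number><letter>` pattern. The names `copies`, `slot`, and `result` are illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- structure(list(
  Genotype = 1:5,
  Locus_1a = c(0L, 1L, 1L, 1L, 1L),
  Locus_1b = c(1L, 1L, 0L, 1L, 0L),
  Locus_1c = c(1L, 0L, 1L, 1L, 0L),
  Locus_2a = c(0L, 1L, 1L, 0L, 0L),
  Locus_2b = c(1L, 1L, 0L, 1L, 1L),
  Locus_2c = c(1L, 0L, 1L, 1L, 0L)
), class = "data.frame", row.names = c(NA, -5L))

result <- df %>%
  # Split "Locus_1a" into locus = "Locus_1" and allele = "a"
  pivot_longer(-Genotype,
               names_to = c("locus", "allele"),
               names_pattern = "^(Locus_\\d+)([a-z]+)$") %>%
  filter(value > 0) %>%          # keep detected alleles only
  select(-value) %>%
  # Assumption: a single allele at a locus is a homozygote, so list it twice
  mutate(copies = if_else(n() == 1, 2L, 1L), .by = c(Genotype, locus)) %>%
  uncount(copies) %>%            # repeat each row `copies` times
  # Number the alleles within each genotype/locus pair
  mutate(slot = row_number(), .by = c(Genotype, locus)) %>%
  # Column names become Locus_1_1, Locus_1_2, ... in sorted order
  pivot_wider(names_from = c(locus, slot),
              values_from = allele,
              names_sort = TRUE)

result
```

With the sample data, the `Locus_1_*` columns should match the table in the question, including a, a for genotype 5. A genotype with no detected allele at a locus gets `NA` in every slot for that locus, and a genotype with no detected alleles at all would drop out of the result.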


