Is there a way to collapse and convert a dataframe based upon multiple criteria?
Question
I have a dataset (matrix) of genotypes at different marker loci.
structure(list(Genotype = 1:5, Locus_1a = c(0L, 1L, 1L, 1L, 1L ), Locus_1b = c(1L, 1L, 0L, 1L, 0L), Locus_1c = c(1L, 0L, 1L, 1L, 0L), Locus_2a = c(0L, 1L, 1L, 0L, 0L), Locus_2b = c(1L, 1L, 0L, 1L, 1L), Locus_2c = c(1L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))
I am trying to convert the dataset from 0/1 notation to the following:
Genotype | Locus_1_1 | Locus_1_2 | Locus_1_3 |
---|---|---|---|
1 | b | c | |
2 | a | b | |
3 | a | c | |
4 | a | b | c |
5 | a | a | |
Most genotypes (1 through 3 in this example) are diploid (2n) and have two distinct alleles; each allele is identified by the letter at the end of the column name.
Individual 4 is triploid (3n) and has three distinct alleles.
Individual 5 is diploid (2n) but homozygous for a single allele (Locus_1a), which should appear twice in the converted dataset.
Each column name is the marker locus name followed by a variable suffix (a, b, c, etc.) for the allele detected in that individual.
I'm not sure how to write the code for this conversion.
Answer 1
Score: 1
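library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-Genotype) %>%                      # one row per genotype and allele column
  filter(value > 0) %>%                            # keep only detected alleles
  extract(name, "value", "_\\d(.*)") %>%           # keep the allele letter after the locus number
  distinct() %>%                                   # note: letters are pooled across all loci
  mutate(name = row_number(), .by = Genotype) %>%  # `.by` requires dplyr >= 1.1.0
  pivot_wider(names_prefix = 'Locus_')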
Resulting table:
# A tibble: 5 × 4
  Genotype Locus_1 Locus_2 Locus_3
     <int> <chr>   <chr>   <chr>
1        1 b       c       NA
2        2 a       b       NA
3        3 a       c       NA
4        4 a       b       c
5        5 a       b       NA
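The pipeline above pools the allele letters from every locus for each genotype, which is why individual 5 comes out as `a b` rather than the `a a` the question asks for. Here is a minimal base R sketch of a per-locus variant. It is not part of the original answer. The helper name `calls_for_locus` is hypothetical, and the rule "a single detected allele means a homozygous diploid, so write it twice" is an assumption taken from the question's description of individual 5. The sketch also assumes column names of the form locus name plus a lowercase allele suffix.

```r
# Example data from the question
df <- structure(list(
  Genotype = 1:5,
  Locus_1a = c(0L, 1L, 1L, 1L, 1L),
  Locus_1b = c(1L, 1L, 0L, 1L, 0L),
  Locus_1c = c(1L, 0L, 1L, 1L, 0L),
  Locus_2a = c(0L, 1L, 1L, 0L, 0L),
  Locus_2b = c(1L, 1L, 0L, 1L, 1L),
  Locus_2c = c(1L, 0L, 1L, 1L, 0L)
), class = "data.frame", row.names = c(NA, -5L))

allele_cols <- setdiff(names(df), "Genotype")
locus  <- sub("[a-z]+$", "", allele_cols)    # "Locus_1a" -> "Locus_1"
allele <- sub("^.*[0-9]", "", allele_cols)   # "Locus_1a" -> "a"

# Hypothetical helper: build the allele-call columns for one locus
calls_for_locus <- function(loc) {
  cols     <- allele_cols[locus == loc]
  suffixes <- allele[locus == loc]
  calls <- lapply(seq_len(nrow(df)), function(i) {
    present <- suffixes[which(unlist(df[i, cols]) > 0)]
    # Assumption: one detected allele = homozygous diploid, so repeat it
    if (length(present) == 1) present <- rep(present, 2)
    present
  })
  # Pad every individual to the widest call (at least two columns) with NA
  width  <- max(2, lengths(calls))
  padded <- lapply(calls, function(x) c(x, rep(NA_character_, width - length(x))))
  out <- do.call(rbind, padded)
  colnames(out) <- paste(loc, seq_len(width), sep = "_")
  as.data.frame(out, stringsAsFactors = FALSE)
}

result <- cbind(df["Genotype"],
                do.call(cbind, lapply(unique(locus), calls_for_locus)))
print(result)
```

With the example data this should give columns `Locus_1_1` to `Locus_1_3` plus `Locus_2_1` and `Locus_2_2`, with individual 5 shown as `a a` at Locus_1. If a single detected allele could also mean a missing call, the duplication rule would need to change.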