How to split a column with multiple patterns and select specific fields into new columns in R
Question
I have a data frame with a feature column (df1$feature) that looks like this:
head(df1)
variant chr position source type
1: rs10738606 9 22088090 HAVANA gene
2: rs10738606 9 22088090 HAVANA transcript
3: rs10738606 9 22088090 HAVANA transcript
4: rs10738606 9 22088090 HAVANA transcript
5: rs10738606 9 22088090 HAVANA transcript
6: rs10738606 9 22088090 HAVANA transcript
feature
1: gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; havana_gene OTTHUMG00000019689.7;
2: gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445870.1;
dput(head(df1$feature))
c("gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; havana_gene OTTHUMG00000019689.7;",
"gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445870.1;",
"gene_id ENSG00000240498.9; transcript_id ENST00000580576.6; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-208; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445871.2;",
"gene_id ENSG00000240498.9; transcript_id ENST00000428597.6; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-203; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000334290.2;",
"gene_id ENSG00000240498.9; transcript_id ENST00000577551.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-206; level 2; transcript_support_level 1; hgnc_id HGNC:34341; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445875.1;",
"gene_id ENSG00000240498.9; transcript_id ENST00000581051.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-209; level 2; transcript_support_level 1; hgnc_id HGNC:34341; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445877.1;"
)
I would like to split this column into multiple columns and keep only a subset of them, like this (first row shown as an example):
variant chr position source type gene_id gene_type gene_name
rs10738606 9 22088090 HAVANA gene ENSG00000240498.9 lncRNA CDKN2B-AS1
Do you know how to do this in R or Python?
Answer 1
Score: 1
In base R, consider the following (here feature stands for df1$feature):
read.dcf(textConnection(gsub(" ", ":", gsub("; *", "\n", feature))),
fields = c('gene_id', 'gene_type', 'gene_name'))
gene_id gene_type gene_name
[1,] "ENSG00000240498.9" "lncRNA" "CDKN2B-AS1"
[2,] "ENSG00000240498.9" "lncRNA" "CDKN2B-AS1"
[3,] "ENSG00000240498.9" "lncRNA" "CDKN2B-AS1"
[4,] "ENSG00000240498.9" "lncRNA" "CDKN2B-AS1"
[5,] "ENSG00000240498.9" "lncRNA" "CDKN2B-AS1"
[6,] "ENSG00000240498.9" "lncRNA" "CDKN2B-AS1"
Note that instead of fields you could use all = TRUE, which returns every field found in your strings.
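To get from that result to the table asked for in the question, the parsed fields still have to be attached to the other columns. A minimal sketch follows, assuming df1 is a plain data.frame (the printout in the question looks like a data.table, in which case as.data.frame(df1) first would be one option). The two-row df1 built here is only a stand-in made from the question's own values, and the names dcf_text, parsed, result and all_fields are made up for illustration.

```r
# Stand-in for df1: the first two rows shown in the question.
df1 <- data.frame(
  variant  = c("rs10738606", "rs10738606"),
  chr      = c(9, 9),
  position = c(22088090, 22088090),
  source   = c("HAVANA", "HAVANA"),
  type     = c("gene", "transcript"),
  feature  = c(
    "gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; havana_gene OTTHUMG00000019689.7;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445870.1;"
  ),
  stringsAsFactors = FALSE
)

# Same transformation as in the answer: "key value; " pairs become
# "key:value" lines, and the trailing ";" leaves a blank line that
# separates one record from the next.
dcf_text <- gsub(" ", ":", gsub("; *", "\n", df1$feature))

# Parse only the three fields of interest (returns a character matrix).
con <- textConnection(dcf_text)
parsed <- read.dcf(con, fields = c("gene_id", "gene_type", "gene_name"))
close(con)

# Drop the raw feature column and attach the parsed fields.
result <- cbind(
  df1[, setdiff(names(df1), "feature")],
  as.data.frame(parsed, stringsAsFactors = FALSE)
)
print(result)

# Variant with all = TRUE: keeps every field, including repeated ones
# such as "tag".
con <- textConnection(dcf_text)
all_fields <- read.dcf(con, all = TRUE)
close(con)
```

Because read.dcf keeps one record per input string in the original order, cbind lines the new columns up with the existing rows without needing a key.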