How to split a column with multiple patterns and select specific fields into new columns in R

Question


I have a data frame with a column df$feature like this:

head(df1)
      variant chr position source       type
1: rs10738606   9 22088090 HAVANA       gene
2: rs10738606   9 22088090 HAVANA transcript
3: rs10738606   9 22088090 HAVANA transcript
4: rs10738606   9 22088090 HAVANA transcript
5: rs10738606   9 22088090 HAVANA transcript
6: rs10738606   9 22088090 HAVANA transcript

feature
1: gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; havana_gene OTTHUMG00000019689.7;
2: gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445870.1;

dput(head(df1$feature))
c("gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; havana_gene OTTHUMG00000019689.7;", 
"gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445870.1;", 
"gene_id ENSG00000240498.9; transcript_id ENST00000580576.6; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-208; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445871.2;", 
"gene_id ENSG00000240498.9; transcript_id ENST00000428597.6; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-203; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000334290.2;", 
"gene_id ENSG00000240498.9; transcript_id ENST00000577551.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-206; level 2; transcript_support_level 1; hgnc_id HGNC:34341; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445875.1;", 
"gene_id ENSG00000240498.9; transcript_id ENST00000581051.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-209; level 2; transcript_support_level 1; hgnc_id HGNC:34341; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445877.1;"
)

I would like to split this column into multiple columns and keep only some of them, like this (using the first row as an example):

variant    chr position source type gene_id           gene_type gene_name
rs10738606   9 22088090 HAVANA gene ENSG00000240498.9 lncRNA    CDKN2B-AS1

Do you know how to do this in R or Python?

Answer 1

Score: 1

In base R, consider:

# "; " becomes a newline and " " becomes ":", so each row reads as one DCF record
read.dcf(textConnection(gsub(" ", ":", gsub("; *", "\n", df1$feature))),
         fields = c('gene_id', 'gene_type', 'gene_name'))

     gene_id             gene_type gene_name   
[1,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
[2,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
[3,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
[4,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
[5,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
[6,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
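
Since the question also asks about Python, the same idea (split on ";", then split each piece into a key and a value, then keep only the wanted keys) might be sketched as follows. This is a minimal standard-library sketch, not a port of read.dcf; the names FEATURES, WANTED and pick_fields are hypothetical, and the two strings are copied from the dput output in the question.

```python
# Hypothetical helper names; standard library only.
FEATURES = [
    "gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; "
    "level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; "
    "havana_gene OTTHUMG00000019689.7;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; "
    "gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; "
    "transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; "
    "hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; "
    "havana_transcript OTTHUMT00000445870.1;",
]

WANTED = ["gene_id", "gene_type", "gene_name"]


def pick_fields(feature, wanted):
    """Return {key: value} for the wanted keys; missing keys map to None."""
    found = dict.fromkeys(wanted)
    for item in feature.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing ';'
        key, _, value = item.partition(" ")
        if key in found and found[key] is None:
            found[key] = value  # keep the first occurrence of a key
    return found


rows = [pick_fields(f, WANTED) for f in FEATURES]
for row in rows:
    print(row)
```

If pandas is available, a list of dicts like rows can be turned into a data frame with pandas.DataFrame(rows) and placed next to the existing columns.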

Note that instead of fields you could use all = TRUE, which gives you all the fields in your strings.
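
For the "all fields" case, a comparable Python sketch could collect every key it finds. Because a key such as tag can appear more than once in one string, this version stores the values of each key in a list. The name all_fields is hypothetical, and how repeated keys are stored here is a choice made for this sketch, not a description of what read.dcf returns.

```python
from collections import defaultdict


def all_fields(feature):
    """Map every key in a 'key value; key value;' string to a list of its values."""
    fields = defaultdict(list)
    for item in feature.split(";"):
        key, _, value = item.strip().partition(" ")
        if key:  # ignore the empty piece after the trailing ';'
            fields[key].append(value)
    return dict(fields)


# First string from the dput output in the question.
example = (
    "gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; "
    "level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; "
    "havana_gene OTTHUMG00000019689.7;"
)

parsed = all_fields(example)
print(parsed["tag"])
```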

Posted by huangapple on 2023-07-20 21:46:57
Source: https://go.coder-hub.com/76730550.html