How to split a column with multiple patterns and select specific fields into new columns in R

Question

I have a data frame `df1` with a `feature` column like this:

    head(df1)
          variant chr position source       type
    1: rs10738606   9 22088090 HAVANA       gene
    2: rs10738606   9 22088090 HAVANA transcript
    3: rs10738606   9 22088090 HAVANA transcript
    4: rs10738606   9 22088090 HAVANA transcript
    5: rs10738606   9 22088090 HAVANA transcript
    6: rs10738606   9 22088090 HAVANA transcript
       feature
    1: gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; havana_gene OTTHUMG00000019689.7;
    2: gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445870.1;

    dput(head(df1$feature))
    c("gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; havana_gene OTTHUMG00000019689.7;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445870.1;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000580576.6; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-208; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445871.2;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000428597.6; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-203; level 2; transcript_support_level 1; hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000334290.2;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000577551.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-206; level 2; transcript_support_level 1; hgnc_id HGNC:34341; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445875.1;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000581051.5; gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; transcript_name CDKN2B-AS1-209; level 2; transcript_support_level 1; hgnc_id HGNC:34341; havana_gene OTTHUMG00000019689.7; havana_transcript OTTHUMT00000445877.1;"
    )

I would like to split this column into multiple columns and keep only a subset of them, like this (using the first row as an example):

    variant    chr position source type gene_id           gene_type gene_name
    rs10738606 9   22088090 HAVANA gene ENSG00000240498.9 lncRNA    CDKN2B-AS1

Do you know how to do this in R or Python?

Answer 1

Score: 1

In base R, consider:

    # `feature` is the character vector from the question, e.g. feature <- df1$feature
    # Inner gsub: put each "key value" pair on its own line.
    # Outer gsub: turn "key value" into "key:value" so read.dcf can parse it.
    read.dcf(textConnection(gsub(" ", ":", gsub("; *", "\n", feature))),
             fields = c('gene_id', 'gene_type', 'gene_name'))
         gene_id             gene_type gene_name
    [1,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
    [2,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
    [3,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
    [4,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
    [5,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"
    [6,] "ENSG00000240498.9" "lncRNA"  "CDKN2B-AS1"

Note that instead of `fields` you could use `all = TRUE`, which returns all the fields present in your strings.
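Since the question also asks about Python, here is a minimal sketch of the same key-value parsing using only the standard library. It is not part of the original answer: the helper name `parse_feature` and its `fields` argument are hypothetical, and joining repeated keys such as `tag` with commas is an assumption made for illustration.

```python
def parse_feature(record, fields=None):
    """Parse a 'key value; key value; ...' string into a dict.

    Repeated keys (such as 'tag') are joined with commas. If `fields` is
    given, only those keys are returned, with None for any that are missing.
    """
    parsed = {}
    for item in record.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece left by the trailing ';'
        # Split on the first space only: key on the left, value on the right.
        key, _, value = item.partition(" ")
        parsed[key] = f"{parsed[key]},{value}" if key in parsed else value
    if fields is None:
        return parsed
    return {name: parsed.get(name) for name in fields}


# First two strings from the question's dput() output.
features = [
    "gene_id ENSG00000240498.9; gene_type lncRNA; gene_name CDKN2B-AS1; "
    "level 1; hgnc_id HGNC:34341; tag ncRNA_host; tag overlapping_locus; "
    "havana_gene OTTHUMG00000019689.7;",
    "gene_id ENSG00000240498.9; transcript_id ENST00000585267.5; "
    "gene_type lncRNA; gene_name CDKN2B-AS1; transcript_type lncRNA; "
    "transcript_name CDKN2B-AS1-217; level 2; transcript_support_level 1; "
    "hgnc_id HGNC:34341; tag basic; havana_gene OTTHUMG00000019689.7; "
    "havana_transcript OTTHUMT00000445870.1;",
]

wanted = ["gene_id", "gene_type", "gene_name"]
rows = [parse_feature(f, wanted) for f in features]
print(rows[0])
# {'gene_id': 'ENSG00000240498.9', 'gene_type': 'lncRNA', 'gene_name': 'CDKN2B-AS1'}
```

Calling `parse_feature(f)` without `fields` returns every key, roughly analogous to `all = TRUE` above. If the data lives in a pandas data frame, the list of dicts could be passed to `pd.DataFrame(rows)` and joined to the other columns.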

huangapple
  • Posted on July 20, 2023, 21:46:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76730550.html