Parse text files with R

Question

I am trying to parse a text file with lines like this:

    QUERY Query_3 Peptide 528 AT1G01110.2
    DOMAINS
    1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
    1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45
    ENDDOMAINS
    SITES
    ENDSITES
    MOTIFS
    1 Query_3 globin helix H G93 101P 412094
    1 Query_3 IQ motif V125 143L 412094
    1 Query_3 globin helix A Q161 173V 412094
    ENDMOTIFS
    ENDQUERY
    QUERY Query_4 Peptide 196 AT1G01160.1
    DOMAINS
    1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45
    ENDDOMAINS
    ENDQUERY
    QUERY Query_5 Peptide 308 AT1G01180.1
    DOMAINS
    1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167
    ENDDOMAINS
    ENDQUERY

The file is essentially tab-delimited rows grouped into sections by marker lines (e.g. QUERY, DOMAINS, ENDDOMAINS ...). I want to make two data frames, one for the QUERY rows and one for the rows inside DOMAINS, like:

    # data frame 1 ("QUERY" rows):
    QUERY Query_3 Peptide 528 AT1G01110.2
    QUERY Query_4 Peptide 196 AT1G01160.1
    QUERY Query_5 Peptide 308 AT1G01180.1
    # data frame 2 (rows after "DOMAINS"):
    1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
    1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45
    1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45
    1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167

Is there a way to do this in R? Thanks!
(BTW, this is an output from rpsbproc, a bioinformatics tool for parsing RPS-BLAST output, just in case someone also needs to parse the output.)

Answer 1

Score: 2

Try these:

    txt <- readLines("text.txt")

    # Data frame 1: the "QUERY" rows.
    # (The `_` placeholder of the native pipe needs R >= 4.2.)
    grep("^QUERY", txt, value = TRUE) |>
      paste(collapse = "\n") |>
      read.table(text = _, header = FALSE)
    #      V1      V2      V3  V4          V5
    # 1 QUERY Query_3 Peptide 528 AT1G01110.2
    # 2 QUERY Query_4 Peptide 196 AT1G01160.1
    # 3 QUERY Query_5 Peptide 308 AT1G01180.1

    # Data frame 2: start a new chunk at every "DOMAINS" line, then keep the
    # lines between "DOMAINS" and the first "ENDDOMAINS" of each chunk.
    # match() returns NA when there is no "ENDDOMAINS", and seq_len() keeps
    # an empty DOMAINS section from selecting the marker lines themselves.
    split(txt, cumsum(txt == "DOMAINS")) |>
      lapply(function(z) {
        end <- match("ENDDOMAINS", z)
        if (z[1] == "DOMAINS" && !is.na(end)) z[seq_len(end - 2) + 1]
      }) |>
      unlist() |>
      paste(collapse = "\n") |>
      read.table(text = _, header = FALSE)
    #   V1      V2           V3     V4  V5  V6          V7      V8        V9             V10 V11    V12
    # 1  1 Query_3     Specific 404128 374 470 8.74687e-20 84.2155 pfam13178         DUF4005   C     45
    # 2  1 Query_3 Non-specific 412094  93 173 6.07039e-04 42.1551   cd22307 Adgb_C_mid-like  NC     45
    # 3  1 Query_4     Specific 428268  22  73 8.80840e-19 76.1579 pfam05030            SSXT   -     45
    # 4  1 Query_5     Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24   - 450167
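The same idea can be written as a small reusable helper that works for any section (DOMAINS, SITES, MOTIFS). This is only a sketch: `section_lines` is a hypothetical name, the sample lines are inlined so the block runs on its own, and it assumes the fields really are separated by tabs and that every section marker has a matching `END...` line.

```r
# Sample input inlined for the sketch; fields are assumed to be tab-separated.
txt <- c(
  "QUERY\tQuery_3\tPeptide\t528\tAT1G01110.2",
  "DOMAINS",
  "1\tQuery_3\tSpecific\t404128\t374\t470\t8.74687e-20\t84.2155\tpfam13178\tDUF4005\tC\t45",
  "ENDDOMAINS",
  "SITES",
  "ENDSITES",
  "ENDQUERY",
  "QUERY\tQuery_4\tPeptide\t196\tAT1G01160.1",
  "DOMAINS",
  "ENDDOMAINS",
  "ENDQUERY"
)

# Hypothetical helper: return the lines strictly between `name` and
# `END<name>`. A line is inside a section when more start markers than
# end markers have been seen so far.
section_lines <- function(txt, name) {
  opened <- cumsum(txt == name)
  closed <- cumsum(txt == paste0("END", name))
  txt[opened > closed & txt != name]
}

domain_lines <- section_lines(txt, "DOMAINS")

# read.table() accepts a character vector of lines via `text`.
# It stops with an error when there are no lines at all, so check
# length(domain_lines) first on real data.
domains <- read.table(text = domain_lines, sep = "\t", header = FALSE,
                      quote = "", comment.char = "")
```

Passing `sep = "\t"` explicitly keeps fields that contain spaces (such as the motif names in the MOTIFS section) in a single column, which the default whitespace splitting would not.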

Answer 2

Score: 1

You could try this.

    rl <- readLines('foo.dat')
    lapply(c('Query.*[Ss]pecific', '^QUERY'), \(x) rl[grep(x, rl)]) |> setNames(c('Domains', 'QUERY'))
    # $Domains
    # [1] "1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45"
    # [2] "1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45"
    # [3] "1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45"
    # [4] "1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167"
    #
    # $QUERY
    # [1] "QUERY Query_3 Peptide 528 AT1G01110.2" "QUERY Query_4 Peptide 196 AT1G01160.1"
    # [3] "QUERY Query_5 Peptide 308 AT1G01180.1"

If you really want data frames with just one column, do this:

    lapply(c('Query.*[Ss]pecific', '^QUERY'), \(x) data.frame(v = rl[grep(x, rl)])) |> setNames(c('Domains', 'QUERY'))
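If one column per field is wanted instead of one string per row, the matched lines can be split on tabs. The following is a rough sketch rather than part of the answer above: the sample lines are inlined, the fields are assumed to be tab-separated, and `patterns` and `tables` are names made up for the illustration. Because the selection is by text pattern and not by section, any line elsewhere in the file that contains "Query" followed by "Specific" or "specific" would be picked up as well.

```r
# Sample input inlined for the sketch; fields are assumed to be tab-separated.
rl <- c(
  "QUERY\tQuery_3\tPeptide\t528\tAT1G01110.2",
  "DOMAINS",
  "1\tQuery_3\tSpecific\t404128\t374\t470\t8.74687e-20\t84.2155\tpfam13178\tDUF4005\tC\t45",
  "1\tQuery_3\tNon-specific\t412094\t93\t173\t0.000607039\t42.1551\tcd22307\tAdgb_C_mid-like\tNC\t45",
  "ENDDOMAINS",
  "ENDQUERY"
)

# A named vector, so lapply() carries the names over to the result.
patterns <- c(QUERY = "^QUERY", Domains = "Query.*[Ss]pecific")

tables <- lapply(patterns, function(p) {
  # One character vector of fields per matched line.
  fields <- strsplit(grep(p, rl, value = TRUE), "\t", fixed = TRUE)
  # Stack the rows into a character matrix, then into a data frame (V1, V2, ...).
  df <- as.data.frame(do.call(rbind, fields))
  # Let R guess numeric/integer columns; keep strings as character.
  type.convert(df, as.is = TRUE)
})
```

`tables$QUERY` and `tables$Domains` are then ordinary data frames, with numeric columns such as the E-value stored as numbers rather than text.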

huangapple
  • Published on February 18, 2023, 11:31:16
  • Please keep this link when reposting: https://go.coder-hub.com/75491003.html