Parse text files with R

Question

I am trying to parse a text file with lines like this:

QUERY   Query_3 Peptide 528 AT1G01110.2
DOMAINS
1   Query_3 Specific    404128  374 470 8.74687e-20 84.2155 pfam13178   DUF4005 C   45
1   Query_3 Non-specific    412094  93  173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC  45
ENDDOMAINS
SITES
ENDSITES
MOTIFS
1   Query_3 globin helix H  G93 101P    412094
1   Query_3 IQ motif    V125    143L    412094
1   Query_3 globin helix A  Q161    173V    412094
ENDMOTIFS
ENDQUERY
QUERY   Query_4 Peptide 196 AT1G01160.1
DOMAINS
1   Query_4 Specific    428268  22  73  8.8084e-19  76.1579 pfam05030   SSXT    -   45
ENDDOMAINS
ENDQUERY
QUERY   Query_5 Peptide 308 AT1G01180.1
DOMAINS
1   Query_5 Specific    433324  139 268 3.13921e-13 64.6367 pfam13578   Methyltransf_24 -   450167
ENDDOMAINS
ENDQUERY

The file is essentially tab-delimited rows grouped under section markers (e.g. QUERY, DOMAINS, ENDDOMAINS, ...). I want to build two data frames, one from the QUERY rows and one from the rows inside each DOMAINS block, like this:

#data frame 1 ("QUERY" rows):
QUERY   Query_3 Peptide 528 AT1G01110.2
QUERY   Query_4 Peptide 196 AT1G01160.1
QUERY   Query_5 Peptide 308 AT1G01180.1

#data frame 2 (rows after "DOMAINS"):
1   Query_3 Specific    404128  374 470 8.74687e-20 84.2155 pfam13178   DUF4005 C   45
1   Query_3 Non-specific    412094  93  173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC  45
1   Query_4 Specific    428268  22  73  8.8084e-19  76.1579 pfam05030   SSXT    -   45
1   Query_5 Specific    433324  139 268 3.13921e-13 64.6367 pfam13578   Methyltransf_24 -   450167

Is there a way to do this in R? Thanks!
(BTW, this is an output from rpsbproc, a bioinformatics tool for parsing RPS-BLAST output, just in case someone also needs to parse the output.)

Answer 1

Score: 2

Try these:

txt <- readLines("text.txt")

grep("^QUERY", txt, value = TRUE) |>
  paste(collapse = "\n") |>
  read.table(text = _, header = FALSE)
#      V1      V2      V3  V4          V5
# 1 QUERY Query_3 Peptide 528 AT1G01110.2
# 2 QUERY Query_4 Peptide 196 AT1G01160.1
# 3 QUERY Query_5 Peptide 308 AT1G01180.1

split(txt, cumsum(txt == "DOMAINS")) |>
  lapply(function(z) if (z[1] == "DOMAINS" && !is.na(end <- which(z[-1] == "ENDDOMAINS"))) z[2:end]) |>
  unlist() |>
  paste(collapse = "\n") |>
  read.table(text = _, header = FALSE)
#   V1      V2           V3     V4  V5  V6          V7      V8        V9             V10 V11    V12
# 1  1 Query_3     Specific 404128 374 470 8.74687e-20 84.2155 pfam13178         DUF4005   C     45
# 2  1 Query_3 Non-specific 412094  93 173 6.07039e-04 42.1551   cd22307 Adgb_C_mid-like  NC     45
# 3  1 Query_4     Specific 428268  22  73 8.80840e-19 76.1579 pfam05030            SSXT   -     45
# 4  1 Query_5     Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24   - 450167
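
The second pipeline above relies on every `DOMAINS` chunk containing exactly one `ENDDOMAINS` line with at least one row before it. As a sketch of a more defensive variant, the lines can instead be selected by tracking whether each one lies between the two markers. The helper name `between_markers` and the inline sample (including the made-up `Query_9` record with an empty block) are assumptions for illustration only.

```r
# Hypothetical helper: return the lines strictly between `start` and `end` markers.
between_markers <- function(lines, start, end) {
  # Count start markers seen so far minus end markers seen so far:
  # the difference is 1 while we are inside a block, 0 outside.
  depth <- cumsum(lines == start) - cumsum(lines == end)
  lines[depth == 1 & lines != start]
}

# Small inline sample; Query_9 is invented to show an empty DOMAINS block.
txt <- c(
  "QUERY\tQuery_3\tPeptide\t528\tAT1G01110.2",
  "DOMAINS",
  "1\tQuery_3\tSpecific\t404128\t374\t470\t8.74687e-20\t84.2155\tpfam13178\tDUF4005\tC\t45",
  "ENDDOMAINS",
  "MOTIFS",
  "1\tQuery_3\tIQ motif\tV125\t143L\t412094",
  "ENDMOTIFS",
  "ENDQUERY",
  "QUERY\tQuery_9\tPeptide\t100\tHYPOTHETICAL.1",
  "DOMAINS",
  "ENDDOMAINS",
  "ENDQUERY"
)

# Split on tabs only, so fields containing spaces stay in one column.
domains <- read.delim(text = between_markers(txt, "DOMAINS", "ENDDOMAINS"),
                      header = FALSE, sep = "\t")
```

Note that `read.delim()` still fails if no domain rows are found at all, so a file without any hits would need a separate check.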

Answer 2

Score: 1

You could try this.

rl <- readLines('foo.dat')

lapply(c('^QUERY', 'Query.*[Ss]pecific'), \(x) rl[grep(x, rl)]) |> setNames(c('QUERY', 'Domains'))
# $QUERY
# [1] "QUERY   Query_3 Peptide 528 AT1G01110.2" "QUERY   Query_4 Peptide 196 AT1G01160.1"
# [3] "QUERY   Query_5 Peptide 308 AT1G01180.1"
#
# $Domains
# [1] "1   Query_3 Specific    404128  374 470 8.74687e-20 84.2155 pfam13178   DUF4005 C   45"
# [2] "1   Query_3 Non-specific    412094  93  173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC  45"
# [3] "1   Query_4 Specific    428268  22  73  8.8084e-19  76.1579 pfam05030   SSXT    -   45"
# [4] "1   Query_5 Specific    433324  139 268 3.13921e-13 64.6367 pfam13578   Methyltransf_24 -   450167"

If you really want data frames with just one column, do this:

lapply(c('^QUERY', 'Query.*[Ss]pecific'), \(x) data.frame(v=rl[grep(x, rl)])) |> setNames(c('QUERY', 'Domains'))
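
If separate columns are wanted rather than one string per row, the matched lines can be passed to `read.delim()`. This is only a sketch: it assumes the file really is tab-separated, as the question describes, and it uses a few inline sample lines in place of the file contents. The variable names `patterns` and `res` are made up for the example.

```r
# Inline stand-in for readLines('foo.dat'); tabs written explicitly as \t.
rl <- c(
  "QUERY\tQuery_3\tPeptide\t528\tAT1G01110.2",
  "DOMAINS",
  "1\tQuery_3\tSpecific\t404128\t374\t470\t8.74687e-20\t84.2155\tpfam13178\tDUF4005\tC\t45",
  "ENDDOMAINS",
  "ENDQUERY"
)

# Naming the patterns lets lapply() carry the names over to the result list.
patterns <- c(QUERY = "^QUERY", Domains = "Query.*[Ss]pecific")

# For each pattern, keep the matching lines and parse them as tab-separated rows.
res <- lapply(patterns, \(p) read.delim(text = grep(p, rl, value = TRUE),
                                        header = FALSE, sep = "\t"))

str(res$QUERY)
```

Because the second pattern keys on the word "specific", rows with any other hit type in the third column would not be picked up; selecting by the `DOMAINS`/`ENDDOMAINS` markers avoids that dependency.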

huangapple
  • Posted on February 18, 2023 at 11:31:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75491003.html