Parse text files with R
Question
I am trying to parse a text file with lines like this:
QUERY Query_3 Peptide 528 AT1G01110.2
DOMAINS
1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45
ENDDOMAINS
SITES
ENDSITES
MOTIFS
1 Query_3 globin helix H G93 101P 412094
1 Query_3 IQ motif V125 143L 412094
1 Query_3 globin helix A Q161 173V 412094
ENDMOTIFS
ENDQUERY
QUERY Query_4 Peptide 196 AT1G01160.1
DOMAINS
1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45
ENDDOMAINS
ENDQUERY
QUERY Query_5 Peptide 308 AT1G01180.1
DOMAINS
1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167
ENDDOMAINS
ENDQUERY
It is essentially tab-delimited rows grouped between section markers (e.g. QUERY, DOMAINS, ENDDOMAINS, ...). I want to make two data frames, one from the QUERY rows and one from the rows inside each DOMAINS block, like:
#data frame 1 ("QUERY" rows):
QUERY Query_3 Peptide 528 AT1G01110.2
QUERY Query_4 Peptide 196 AT1G01160.1
QUERY Query_5 Peptide 308 AT1G01180.1
#data frame 2 (rows after "DOMAINS"):
1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45
1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45
1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167
Is there a way to do this in R? Thanks!
(BTW, this is an output from rpsbproc, a bioinformatics tool for parsing RPS-BLAST output, just in case someone also needs to parse the output.)
Answer 1
Score: 2
Try these:
txt <- readLines("text.txt")
grep("^QUERY", txt, value = TRUE) |>
paste(collapse = "\n") |>
read.table(text = _, header = FALSE)
# V1 V2 V3 V4 V5
# 1 QUERY Query_3 Peptide 528 AT1G01110.2
# 2 QUERY Query_4 Peptide 196 AT1G01160.1
# 3 QUERY Query_5 Peptide 308 AT1G01180.1
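# split into chunks that each start at a DOMAINS line, then keep only the rows before the first ENDDOMAINS
split(txt, cumsum(txt == "DOMAINS")) |>
lapply(function(z) if (z[1] == "DOMAINS" && length(end <- which(z[-1] == "ENDDOMAINS")) > 0 && end[1] > 1) z[2:end[1]]) |>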
unlist() |>
paste(collapse = "\n") |>
read.table(text = _, header = FALSE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
# 1 1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
# 2 1 Query_3 Non-specific 412094 93 173 6.07039e-04 42.1551 cd22307 Adgb_C_mid-like NC 45
# 3 1 Query_4 Specific 428268 22 73 8.80840e-19 76.1579 pfam05030 SSXT - 45
# 4 1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167
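The same idea can be written by pairing marker positions directly, which also works for the other sections (SITES, MOTIFS). This is only a sketch: `between()` is a hypothetical helper name, the inline `txt` stands in for `readLines("text.txt")`, and it assumes the fields really are tab-separated and that every start marker has a matching end marker.

```r
# Inline sample standing in for readLines("text.txt"); tab-separated fields
# are an assumption taken from the question's description.
txt <- c(
  "QUERY\tQuery_3\tPeptide\t528\tAT1G01110.2",
  "DOMAINS",
  "1\tQuery_3\tSpecific\t404128\t374\t470\t8.74687e-20\t84.2155\tpfam13178\tDUF4005\tC\t45",
  "ENDDOMAINS",
  "MOTIFS",
  "1\tQuery_3\tglobin helix H\tG93\t101P\t412094",
  "ENDMOTIFS",
  "ENDQUERY",
  "QUERY\tQuery_4\tPeptide\t196\tAT1G01160.1",
  "DOMAINS",
  "ENDDOMAINS",
  "ENDQUERY"
)

# Hypothetical helper: return the lines strictly between each start/end pair.
between <- function(lines, start, end) {
  s <- which(lines == start)
  e <- which(lines == end)
  # Fail early if the markers are unbalanced or out of order.
  stopifnot(length(s) == length(e), all(s < e))
  # seq_len() yields nothing for an empty block, so empty blocks are skipped.
  unlist(Map(function(i, j) lines[seq_len(j - i - 1) + i], s, e))
}

# sep = "\t" keeps fields that contain spaces (e.g. "globin helix H") intact.
domains <- read.table(text = between(txt, "DOMAINS", "ENDDOMAINS"),
                      sep = "\t", header = FALSE)
motifs <- read.table(text = between(txt, "MOTIFS", "ENDMOTIFS"),
                     sep = "\t", header = FALSE)
```

If every block of a section is empty, `between()` returns `NULL` and `read.table()` will fail, so a length check before reading may be needed on real files.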
Answer 2
Score: 1
You could try this.
rl <- readLines('foo.dat')
# patterns and names are given in the same order: QUERY rows first, then domain rows
lapply(c('^QUERY', 'Query.*[Ss]pecific'), \(x) rl[grep(x, rl)]) |> setNames(c('QUERY', 'Domains'))
# $QUERY
# [1] "QUERY Query_3 Peptide 528 AT1G01110.2" "QUERY Query_4 Peptide 196 AT1G01160.1"
# [3] "QUERY Query_5 Peptide 308 AT1G01180.1"
#
# $Domains
# [1] "1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45"
# [2] "1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45"
# [3] "1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45"
# [4] "1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167"
If you really want data frames with just one column, do this:
lapply(c('^QUERY', 'Query.*[Ss]pecific'), \(x) data.frame(v=rl[grep(x, rl)])) |> setNames(c('QUERY', 'Domains'))
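If one column per field is wanted instead, the lines picked out by `grep()` can be handed to `read.table()`. A minimal sketch, assuming tab-separated fields; the inline `rl` stands in for the file, and `patterns` and `frames` are illustrative names only.

```r
# Stand-in for readLines('foo.dat'); tab-separated fields are assumed.
rl <- c(
  "QUERY\tQuery_4\tPeptide\t196\tAT1G01160.1",
  "DOMAINS",
  "1\tQuery_4\tSpecific\t428268\t22\t73\t8.8084e-19\t76.1579\tpfam05030\tSSXT\t-\t45",
  "ENDDOMAINS",
  "ENDQUERY"
)

# A named vector keeps each pattern next to the name of its result.
patterns <- c(QUERY = "^QUERY\t", Domains = "Query.*[Ss]pecific")

# lapply() carries the names over, so frames$QUERY and frames$Domains exist.
frames <- lapply(patterns, function(p) {
  read.table(text = grep(p, rl, value = TRUE), sep = "\t", header = FALSE)
})
```

Matching domain rows by the words "Specific"/"Non-specific" is fragile: rows in other sections could also match on real files, so selecting by the DOMAINS/ENDDOMAINS markers is safer.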