Can I use Entrez Direct to query multiple nucleotide accession version identifiers against a database without using epost?

Question

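I downloaded a hit table from NCBI BLAST (a nucleotide BLAST against the nucleotide collection database, using the megablast program) and ordered it by accession version identifier with awk and sort:

```bash
awk -F "\t" 'NF>1{print}' unsorted_input.txt | sort -k2 > sorted_output.txt
```

I then used Entrez Direct with the accession version identifiers to extract the subject organism of each alignment:

```bash
awk -F "\t" 'NF>1{print $2}' unsorted_input.txt | epost -db nucleotide | efetch -format docsum | xtract -pattern DocumentSummary -element Organism | sort | paste sorted_output.txt - > final_output.txt
```

This command extracted the subject organism for some alignments but not all. I noticed that for the alignments where epost did not work, querying them individually with esearch did work:

```bash
esearch -db nucleotide -query "accession_version_identifier" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism
```

So I tried this approach in a loop, using the accession version identifier (second column) of each line to extract the subject organism name:

```bash
while IFS=$'\t' read -r -a myArray
do
  esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism > "output.txt"
done < input.txt
```

However, this only returned the subject organism of the first row. How can I apply this to every row and store all subject organisms in the same file?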

The first few lines of the input file are shown below. It is tab-delimited:

ce1e013e-c4c5-47f9-b041-521ee293c4f0	AB002282.1	91.217	649	24	22	41	676	8	636	0.0	854
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	17668	17781	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	20740	20853	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	23812	23925	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	26884	26997	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	29956	30069	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	85.345	116	9	6	118	228	33027	33139	1.11e-20	113
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	87.000	100	7	5	132	228	14613	14709	5.16e-19	108
8e8ac3f3-63f6-4519-ad25-287a25169f87	AB850654.1	88.262	4660	175	260	16	4401	103840	108401	0.0	5232
c4233926-9f23-46c4-bc4d-5702f47885bd	AB850654.1	89.958	4272	119	235	1	4042	104203	108394	0.0	5227
876d8f20-9d36-4207-8754-0924d99a6c46	AC019188.6	91.855	221	4	7	3	210	78509	78290	1.39e-75	296
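
The sample above repeats the same accession version identifier on many rows, so a per-row lookup would query NCBI for the same record several times. As a minimal sketch (not part of the original post), the identifiers can be reduced to a unique list before any lookup. The file name `hits_sample.txt` is hypothetical, and the rows are taken from the sample above, trimmed to their first three columns for brevity.

```bash
#!/usr/bin/env bash
# Sketch only: build a small tab-delimited sample (rows from the table above,
# trimmed to three columns) and list each accession version identifier once.
tmpdir=$(mktemp -d)
{
  printf 'ce1e013e-c4c5-47f9-b041-521ee293c4f0\tAB002282.1\t91.217\n'
  printf 'c10d7882-cc00-4ee2-8643-9b27fef66e83\tAB828191.1\t84.615\n'
  printf 'c10d7882-cc00-4ee2-8643-9b27fef66e83\tAB828191.1\t85.345\n'
  printf '8e8ac3f3-63f6-4519-ad25-287a25169f87\tAB850654.1\t88.262\n'
  printf 'c4233926-9f23-46c4-bc4d-5702f47885bd\tAB850654.1\t89.958\n'
} > "$tmpdir/hits_sample.txt"

# Column 2 holds the accession version identifier; sort -u drops the repeats.
unique_accessions=$(awk -F '\t' 'NF>1{print $2}' "$tmpdir/hits_sample.txt" | sort -u)
echo "$unique_accessions"
```

The resulting list has one line per distinct identifier, which could then drive whichever lookup is used.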



# Answer 1
**Score**: 0

I have fixed the problem:

```bash
while IFS=$'\t' read -r -a myArray
do
  echo | esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Title,Organism >> output.txt
done < input.txt
```
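
The answer does not say why the change works. One plausible reading, offered here as an assumption: a command inside a `while read ... done < input.txt` loop inherits the loop's standard input, and if that command reads stdin (as esearch appears to), it swallows the remaining lines so the loop ends after the first row. `echo |` gives esearch its own input, and `>>` appends to `output.txt` instead of overwriting it on each pass. The sketch below reproduces the stdin effect without contacting NCBI, with `cat > /dev/null` standing in for the stdin-reading command. The three-line input file is made up for the demonstration.

```bash
#!/usr/bin/env bash
# Stand-in demo: `cat > /dev/null` plays the role of a command that drains stdin
# (assumed to be what esearch does inside the loop); no NCBI calls are made.
tmpdir=$(mktemp -d)
printf 'q1\tAB002282.1\nq2\tAB828191.1\nq3\tAB850654.1\n' > "$tmpdir/input.txt"

# Loop 1: the inner command shares the loop's stdin and consumes the remaining lines.
while IFS=$'\t' read -r -a myArray
do
  echo "${myArray[1]}" >> "$tmpdir/shared_stdin.txt"
  cat > /dev/null
done < "$tmpdir/input.txt"

# Loop 2: the inner command gets its own (empty) stdin, so `read` sees every line.
while IFS=$'\t' read -r -a myArray
do
  echo "${myArray[1]}" >> "$tmpdir/own_stdin.txt"
  cat > /dev/null < /dev/null
done < "$tmpdir/input.txt"

# Count the rows each loop actually processed.
shared_count=$(grep -c '' "$tmpdir/shared_stdin.txt")
own_count=$(grep -c '' "$tmpdir/own_stdin.txt")
echo "shared stdin: $shared_count row(s); separate stdin: $own_count row(s)"
```

If this reading is right, redirecting esearch's input with `< /dev/null` should have the same effect as the `echo |` prefix used in the answer.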

Posted by huangapple on February 9, 2023, 00:44:44
Original link: https://go.coder-hub.com/75389016.html