Can I use Entrez Direct to query multiple nucleotide accession version identifiers against a database without using epost?
Question
I downloaded a hit table from NCBI BLAST (nucleotide BLAST against the nucleotide collection database, using the megablast program) and sorted it by accession.version identifier:

```bash
awk -F "\t" 'NF>1{print}' unsorted_input.txt | sort -k2 > sorted_output.txt
```

I then used Entrez Direct to look up the subject organism of each alignment from its accession.version identifier:

```bash
awk -F "\t" 'NF>1{print $2}' unsorted_input.txt | epost -db nucleotide | efetch -format docsum | xtract -pattern DocumentSummary -element Organism | sort | paste sorted_output.txt - > final_output.txt
```

This command extracted the subject organism for some alignments but not all of them. For the alignments where epost failed, querying the identifier individually with esearch did work:

```bash
esearch -db nucleotide -query "accession_version_identifier" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism
```

So I tried the same approach in a loop, using the accession.version identifier (second column) of each line to extract the subject organism name:

```bash
while IFS=$'\t' read -r -a myArray
do
esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism > "output.txt"
done < input.txt
```

However, this only returned the subject organism of the first row. How can I apply it to every row and store all subject organisms in the same file?

The first few lines of the input file are shown below. It is tab-delimited:

```
ce1e013e-c4c5-47f9-b041-521ee293c4f0 AB002282.1 91.217 649 24 22 41 676 8 636 0.0 854
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 17668 17781 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 20740 20853 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 23812 23925 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 26884 26997 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 29956 30069 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 85.345 116 9 6 118 228 33027 33139 1.11e-20 113
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 87.000 100 7 5 132 228 14613 14709 5.16e-19 108
8e8ac3f3-63f6-4519-ad25-287a25169f87 AB850654.1 88.262 4660 175 260 16 4401 103840 108401 0.0 5232
c4233926-9f23-46c4-bc4d-5702f47885bd AB850654.1 89.958 4272 119 235 1 4042 104203 108394 0.0 5227
876d8f20-9d36-4207-8754-0924d99a6c46 AC019188.6 91.855 221 4 7 3 210 78509 78290 1.39e-75 296
```
# Answer 1
**Score**: 0
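I have fixed the problem:

```bash
# Read the hit table row by row; myArray[1] holds the accession.version (column 2).
while IFS=$'\t' read -r -a myArray
do
    # "echo |" gives esearch its own stdin; ">>" appends instead of overwriting.
    echo | esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Title,Organism >> output.txt
done < input.txt
```

Two things changed from the loop in the question. `>>` appends each result to `output.txt` instead of overwriting the file on every iteration. `echo |` gives `esearch` its own standard input, which most likely stops it from consuming the remaining lines of `input.txt` that `read` is meant to process.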
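Why a loop like this can stop after one row is reproducible without EDirect. The sketch below is an illustration, not the poster's code: `greedy` is a hypothetical stand-in for any command that reads standard input (as `esearch` appears to do inside the loop), and the three-row sample file is made up for the demonstration. It also shows `< /dev/null` as an alternative to `echo |`.

```shell
# Hypothetical three-row, tab-delimited sample (not real BLAST output).
tmp=$(mktemp)
printf 'q1\tAB002282.1\nq2\tAB828191.1\nq3\tAB850654.1\n' > "$tmp"
tab=$(printf '\t')

# Stand-in for a stdin-reading command (NOT a real EDirect call):
# it drains whatever is on stdin, then prints its argument.
greedy() { cat > /dev/null; printf '%s\n' "$1"; }

# Unprotected loop: greedy inherits the loop's stdin and swallows the
# remaining rows, so read hits end-of-file after the first iteration.
broken=$(while IFS="$tab" read -r qid acc; do greedy "$acc"; done < "$tmp")

# Protected loop: greedy reads from /dev/null, so the file is left for read.
fixed=$(while IFS="$tab" read -r qid acc; do greedy "$acc" < /dev/null; done < "$tmp")

printf 'without redirect:\n%s\n' "$broken"
printf 'with redirect:\n%s\n' "$fixed"
rm -f "$tmp"
```

The first loop prints only `AB002282.1`, while the second prints all three accessions. Assuming `esearch` behaves like `greedy` here, the same idea would apply to the real pipeline by writing `esearch -db nucleotide -query "$acc" < /dev/null | ...` inside the loop.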