February 9, 2023 00:44:44

Can I use Entrez Direct to query multiple nucleotide accession version identifiers against a database without using epost?

Question

I downloaded a hit table from NCBI BLAST (nucleotide BLAST against the nucleotide collection database with the megablast program) and used awk and sort to order it by accession version identifier:

```bash
awk -F "\t" 'NF>1{print}' unsorted_input.txt | sort -k2 > sorted_output.txt
```

I then used Entrez Direct to look up the subject organism of each alignment from its accession version identifier:

```bash
awk -F "\t" 'NF>1{print $2}' unsorted_input.txt | epost -db nucleotide | efetch -format docsum | xtract -pattern DocumentSummary -element Organism | sort | paste sorted_output.txt - > final_output.txt
```

This command retrieved the subject organism for some alignments but not all. I noticed that for the alignments where epost did not work, querying them individually with esearch did work:

```bash
esearch -db nucleotide -query "accession_version_identifier" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism
```

So I tried this approach in a loop, using the accession version identifier (second column) of each line to extract the subject organism name:

```bash
while IFS=$'\t' read -r -a myArray
do
  esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism > "output.txt"
done < input.txt
```

However, this returned the subject organism of only the first row. How can I apply this to every row and store all subject organisms in the same file?

The first few lines of the input file are shown below. It is tab-delimited:

```
ce1e013e-c4c5-47f9-b041-521ee293c4f0	AB002282.1	91.217	649	24	22	41	676	8	636	0.0	854
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	17668	17781	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	20740	20853	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	23812	23925	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	26884	26997	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	84.615	117	9	6	118	228	29956	30069	5.16e-19	108
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	85.345	116	9	6	118	228	33027	33139	1.11e-20	113
c10d7882-cc00-4ee2-8643-9b27fef66e83	AB828191.1	87.000	100	7	5	132	228	14613	14709	5.16e-19	108
8e8ac3f3-63f6-4519-ad25-287a25169f87	AB850654.1	88.262	4660	175	260	16	4401	103840	108401	0.0	5232
c4233926-9f23-46c4-bc4d-5702f47885bd	AB850654.1	89.958	4272	119	235	1	4042	104203	108394	0.0	5227
876d8f20-9d36-4207-8754-0924d99a6c46	AC019188.6	91.855	221	4	7	3	210	78509	78290	1.39e-75	296
```




# Answer 1
**Score**: 0

I have fixed the problem:

```bash
while IFS=$'\t' read -r -a myArray
do
  # "echo |" gives esearch its own stdin; ">>" appends instead of overwriting.
  echo | esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Title,Organism >> output.txt
done < input.txt
```
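
Two things differ from the loop in the question: `>>` appends to `output.txt` rather than replacing it on every pass, and `echo |` hands `esearch` a separate standard input. A likely explanation for the single-row result, assuming `esearch` reads standard input whenever it is not a terminal, is that the first call swallowed the rest of `input.txt` before `read` could see it. The sketch below reproduces that effect offline; `fake_lookup` and the `demo_input.txt`, `broken.txt`, and `fixed.txt` file names are hypothetical stand-ins, not part of Entrez Direct.

```bash
#!/usr/bin/env bash
# fake_lookup is a hypothetical stand-in for esearch:
# it drains whatever is on stdin, then prints one result line.
fake_lookup() {
  cat > /dev/null
  printf 'looked up %s\n' "$1"
}

# Three tab-separated rows; the accession is in column 2.
printf 'q1\tAB002282.1\nq2\tAB828191.1\nq3\tAC019188.6\n' > demo_input.txt

# Broken: the inner command shares the loop's stdin and eats the remaining rows.
while IFS=$'\t' read -r query_id accession rest; do
  fake_lookup "$accession"
done < demo_input.txt > broken.txt

# Fixed: give the inner command its own (empty) stdin.
while IFS=$'\t' read -r query_id accession rest; do
  fake_lookup "$accession" < /dev/null
done < demo_input.txt > fixed.txt

# Compare how many rows each loop processed.
wc -l < broken.txt
wc -l < fixed.txt
```

Redirecting from `/dev/null` plays the same role here as `echo |` in the answer. Sending the whole loop's output to one file, as this sketch does, is an alternative to appending with `>>` inside the loop.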

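
Because `output.txt` is built by appending, an accession that returns nothing leaves no line behind, so pasting the results next to the hit table by position could shift every later row. One way around that, offered as an assumption and not as part of the original answer, is to keep a two-column lookup (accession, organism) and attach it by key with awk. The file names `organisms.tsv`, `hits.tsv`, and `annotated.tsv`, the `NA` placeholder, and the organism values are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical lookup table: accession <TAB> organism (placeholder values).
printf 'AB002282.1\torganism_A\nAB850654.1\torganism_B\n' > organisms.tsv

# Hypothetical hit table trimmed to three columns; the accession is column 2.
printf 'q1\tAB002282.1\t91.217\nq2\tAB828191.1\t84.615\nq3\tAB850654.1\t88.262\n' > hits.tsv

# While reading the first file, fill the org[] map keyed by accession.
# While reading the second file, append the matching organism, or NA if none.
awk -F '\t' -v OFS='\t' '
  NR == FNR { org[$1] = $2; next }
  { print $0, (($2 in org) ? org[$2] : "NA") }
' organisms.tsv hits.tsv > annotated.tsv

cat annotated.tsv
```

With this layout each accession only needs to be looked up once, and repeated hits against the same subject all receive the same organism value regardless of row order.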
