How to retrieve NCBI Entrez summary using gene name with Biopython?

Question

I've explored a variety of options and solutions online, but I can't seem to quite figure this out. I'm new to using Entrez so I don't fully understand how it works, but below was my attempt.

My goal would be to print out the online summary, so for instance for Kat2a I'd want it to print out 'Enables H3 histone acetyltransferase activity; chromatin binding activity; and histone acetyltransferase activity (H4-K12 specific). Involved in several processes' ...etc, from the summary section on NCBI.

from Bio import Entrez

def get_summary(gene_name):
    Entrez.email = 'x'

    query = f'{gene_name}[Gene Name]'
    handle = Entrez.esearch(db='gene', term=query)
    record = Entrez.read(handle)
    handle.close()

    NCBI_ids = record['IdList']
    for id in NCBI_ids:
        handle = Entrez.esummary(db='gene', id=id)
        record = Entrez.read(handle)
        print(record['Summary'])
    return 0

Answer 1

Score: 1

<details>
<summary>English:</summary>
# Using Biopython to fetch all gene IDs associated with a provided gene name¹ and gathering all gene summaries per ID²

* [1]: Using `Bio.Entrez.esearch`
* [2]: Using `Bio.Entrez.efetch`

You were on the right track. Here is one example that fleshes out the approach you started in your question. The function below (which could of course be customized further) accounts for the default `Entrez.esearch` limit of 20 returned Gene IDs (raising it to 100 by default) and also filters the query by organism (unless the default 'human' is set to `None`).
~~~python
import time
import xmltodict
from collections import defaultdict
from Bio import Entrez


def get_entrez_gene_summary(
    gene_name, email, organism="human", max_gene_ids=100
):
    """Returns the 'Summary' contents for provided input
    gene from the Entrez Gene database. All gene IDs
    returned for input gene_name will have their docsum
    summaries 'fetched'.

    Args:
        gene_name (string): Official (HGNC) gene name
            (e.g., 'KAT2A')
        email (string): Required email for making requests
        organism (string, optional): defaults to human.
            Filters results only to match organism. Set to None
            to return all organisms unfiltered.
        max_gene_ids (int, optional): Sets the number of Gene
            ID results to return (absolute max allowed is 10K).

    Returns:
        dict: Summaries for all gene IDs associated with
            gene_name (where: keys → [orgn][gene name],
            values → gene summary)
    """
    Entrez.email = email

    # Restrict the search to one organism unless organism is None/empty
    query = (
        f"{gene_name}[Gene Name]"
        if not organism
        else f"({gene_name}[Gene Name]) AND {organism}[Organism]"
    )
    handle = Entrez.esearch(db="gene", term=query, retmax=max_gene_ids)
    record = Entrez.read(handle)
    handle.close()

    gene_summaries = defaultdict(dict)
    gene_ids = record["IdList"]

    print(
        f"{len(gene_ids)} gene IDs returned associated with gene {gene_name}."
    )
    for gene_id in gene_ids:
        print(f"\tRetrieving summary for {gene_id}...")
        handle = Entrez.efetch(db="gene", id=gene_id, rettype="docsum")
        # The handle yields bytes; decode and parse the docsum XML into a dict
        gene_dict = xmltodict.parse(
            "".join([x.decode(encoding="utf-8") for x in handle.readlines()]),
            dict_constructor=dict,
        )
        gene_docsum = gene_dict["eSummaryResult"]["DocumentSummarySet"][
            "DocumentSummary"
        ]
        name = gene_docsum.get("Name")
        summary = gene_docsum.get("Summary")
        gene_organism = gene_docsum.get("Organism")["CommonName"]
        gene_summaries[gene_organism][name] = summary
        handle.close()
        time.sleep(0.34)  # Requests to NCBI are rate limited to 3 per second

    return gene_summaries
~~~
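As an alternative to parsing the XML with `xmltodict`, the `Entrez.esummary` call from the question can probably be kept, since `Entrez.read` already returns a nested structure; the summary just sits a few levels down rather than at `record['Summary']`. The sketch below assumes the same docsum layout used in the function above (`DocumentSummarySet` → `DocumentSummary`, with `Name`, `Summary`, and `Organism`/`CommonName` fields). `extract_summaries` and `fetch_summaries` are hypothetical helper names, not part of Biopython.

```python
from collections import defaultdict


def extract_summaries(record):
    """Collect summaries from a parsed Gene esummary record.

    Assumes the layout record["DocumentSummarySet"]["DocumentSummary"],
    a list of entries with "Name", "Summary" and "Organism" fields.
    """
    summaries = defaultdict(dict)
    for docsum in record["DocumentSummarySet"]["DocumentSummary"]:
        organism = docsum["Organism"]["CommonName"]
        # Normalize an empty or missing summary to None
        summaries[organism][docsum["Name"]] = docsum.get("Summary") or None
    return summaries


def fetch_summaries(gene_ids, email):
    """Fetch and parse docsums for several Gene IDs (needs network access)."""
    from Bio import Entrez  # imported here so extract_summaries works offline

    Entrez.email = email
    handle = Entrez.esummary(db="gene", id=",".join(gene_ids))
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return extract_summaries(record)
```

Keeping the parsing in its own function means it can be checked against a hand-built dictionary without contacting NCBI.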
----------
## Example 1 – Fetching the gene summary for KAT2A
~~~python
>>> email = "..."  # [insert private email]
>>> gene_summaries = get_entrez_gene_summary("KAT2A", email)
~~~
This returns just one gene summary (remember that the default is `organism='human'`):
~~~none
1. KAT2A
KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator. It also functions as a repressor of NF-kappa-B (see MIM 164011) by promoting ubiquitination of the NF-kappa-B subunit RELA (MIM 164014) in a HAT-independent manner (Mao et al., 2009 [PubMed 19339690]).[supplied by OMIM, Sep 2009]
~~~
----------
## Example 2 – Using wildcards and receiving many genes for a single organism
For example, gene summaries for all human aldehyde dehydrogenase genes can be obtained using the query `ALDH*` (the asterisk representing a wildcard):
~~~python
>>> email = "..."  # enter private email
>>> gene_summaries = get_entrez_gene_summary("ALDH*", email, max_gene_ids=50)
28 gene IDs returned associated with gene ALDH*.
Retrieving summary for 217...
Retrieving summary for 216...
Retrieving summary for 501...
Retrieving summary for 220...
Retrieving summary for 224...
Retrieving summary for 7915...
Retrieving summary for 218...
Retrieving summary for 5832...
Retrieving summary for 219...
Retrieving summary for 10840...
Retrieving summary for 8854...
Retrieving summary for 8540...
Retrieving summary for 223...
Retrieving summary for 8659...
Retrieving summary for 4329...
Retrieving summary for 221...
Retrieving summary for 222...
Retrieving summary for 126133...
Retrieving summary for 160428...
Retrieving summary for 64577...
Retrieving summary for 541...
Retrieving summary for 100862662...
Retrieving summary for 544...
Retrieving summary for 543...
Retrieving summary for 542...
Retrieving summary for 101927751...
Retrieving summary for 283665...
Retrieving summary for 100874204...
>>> for i, (k, v) in enumerate(gene_summaries["human"].items()):
...    print(f"{i+1}. {k}")
...    print(v, end="\n\n")
~~~
~~~none
1. ALDH2
This protein belongs to the aldehyde dehydrogenase family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. Two major liver isoforms of aldehyde dehydrogenase, cytosolic and mitochondrial, can be distinguished by their electrophoretic mobilities, kinetic properties, and subcellular localizations. Most Caucasians have two major isozymes, while approximately 50% of East Asians have the cytosolic isozyme but not the mitochondrial isozyme. A remarkably higher frequency of acute alcohol intoxication among East Asians than among Caucasians could be related to the absence of a catalytically active form of the mitochondrial isozyme. The increased exposure to acetaldehyde in individuals with the catalytically inactive form may also confer greater susceptibility to many types of cancer. This gene encodes a mitochondrial isoform, which has a low Km for acetaldehydes, and is localized in mitochondrial matrix. Alternative splicing results in multiple transcript variants encoding distinct isoforms.[provided by RefSeq, Nov 2016]
2. ALDH1A1
The protein encoded by this gene belongs to the aldehyde dehydrogenase family. Aldehyde dehydrogenase is the next enzyme after alcohol dehydrogenase in the major pathway of alcohol metabolism. There are two major aldehyde dehydrogenase isozymes in the liver, cytosolic and mitochondrial, which are encoded by distinct genes, and can be distinguished by their electrophoretic mobility, kinetic properties, and subcellular localization. This gene encodes the cytosolic isozyme. Studies in mice show that through its role in retinol metabolism, this gene may also be involved in the regulation of the metabolic responses to high-fat diet. [provided by RefSeq, Mar 2011]
3. ALDH7A1
The protein encoded by this gene is a member of subfamily 7 in the aldehyde dehydrogenase gene family. These enzymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This particular member has homology to a previously described protein from the green garden pea, the 26g pea turgor protein. It is also involved in lysine catabolism that is known to occur in the mitochondrial matrix. Recent reports show that this protein is found both in the cytosol and the mitochondria, and the two forms likely arise from the use of alternative translation initiation sites. An additional variant encoding a different isoform has also been found for this gene. Mutations in this gene are associated with pyridoxine-dependent epilepsy. Several related pseudogenes have also been identified. [provided by RefSeq, Jan 2011]
4. ALDH1A3
This gene encodes an aldehyde dehydrogenase enzyme that uses retinal as a substrate. Mutations in this gene have been associated with microphthalmia, isolated 8, and expression changes have also been detected in tumor cells. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2014]
5. ALDH3A2
Aldehyde dehydrogenase isozymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This gene product catalyzes the oxidation of long-chain aliphatic aldehydes to fatty acid. Mutations in the gene cause Sjogren-Larsson syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
6. ALDH5A1
This protein belongs to the aldehyde dehydrogenase family of proteins. This gene encodes a mitochondrial NAD(+)-dependent succinic semialdehyde dehydrogenase. A deficiency of this enzyme, known as 4-hydroxybutyricaciduria, is a rare inborn error in the metabolism of the neurotransmitter 4-aminobutyric acid (GABA). In response to the defect, physiologic fluids from patients accumulate GHB, a compound with numerous neuromodulatory properties. Two transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, Jul 2008]
7. ALDH3A1
Aldehyde dehydrogenases oxidize various aldehydes to the corresponding acids. They are involved in the detoxification of alcohol-derived acetaldehyde and in the metabolism of corticosteroids, biogenic amines, neurotransmitters, and lipid peroxidation. The enzyme encoded by this gene forms a cytoplasmic homodimer that preferentially oxidizes aromatic and medium-chain (6 carbons or more) saturated and unsaturated aldehyde substrates. It is thought to promote resistance to UV and 4-hydroxy-2-nonenal-induced oxidative damage in the cornea. The gene is located within the Smith-Magenis syndrome region on chromosome 17. Multiple alternatively spliced variants, encoding the same protein, have been identified. [provided by RefSeq, Sep 2008]
8. ALDH18A1
This gene is a member of the aldehyde dehydrogenase family and encodes a bifunctional ATP- and NADPH-dependent mitochondrial enzyme with both gamma-glutamyl kinase and gamma-glutamyl phosphate reductase activities. The encoded protein catalyzes the reduction of glutamate to delta1-pyrroline-5-carboxylate, a critical step in the de novo biosynthesis of proline, ornithine and arginine. Mutations in this gene lead to hyperammonemia, hypoornithinemia, hypocitrullinemia, hypoargininemia and hypoprolinemia and may be associated with neurodegeneration, cataracts and connective tissue diseases. Alternatively spliced transcript variants, encoding different isoforms, have been described for this gene. [provided by RefSeq, Jul 2008]
9. ALDH1B1
This protein belongs to the aldehyde dehydrogenases family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. This gene does not contain introns in the coding sequence. The variation of this locus may affect the development of alcohol-related problems. [provided by RefSeq, Jul 2008]
10. ALDH1L1
The protein encoded by this gene catalyzes the conversion of 10-formyltetrahydrofolate, nicotinamide adenine dinucleotide phosphate (NADP+), and water to tetrahydrofolate, NADPH, and carbon dioxide. The encoded protein belongs to the aldehyde dehydrogenase family. Loss of function or expression of this gene is associated with decreased apoptosis, increased cell motility, and cancer progression. There is an antisense transcript that overlaps on the opposite strand with this gene locus. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2012]
11. ALDH1A2
This protein belongs to the aldehyde dehydrogenase family of proteins. The product of this gene is an enzyme that catalyzes the synthesis of retinoic acid (RA) from retinaldehyde. Retinoic acid, the active derivative of vitamin A (retinol), is a hormonal signaling molecule that functions in developing and adult tissues. The studies of a similar mouse gene suggest that this enzyme and the cytochrome CYP26A1, concurrently establish local embryonic retinoic acid levels which facilitate posterior organ development and prevent spina bifida. Four transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, May 2011]
12. AGPS
This gene is a member of the FAD-binding oxidoreductase/transferase type 4 family. It encodes a protein that catalyzes the second step of ether lipid biosynthesis in which acyl-dihydroxyacetonephosphate (DHAP) is converted to alkyl-DHAP by the addition of a long chain alcohol and the removal of a long-chain acid anion. The protein is localized to the inner aspect of the peroxisomal membrane and requires FAD as a cofactor. Mutations in this gene have been associated with rhizomelic chondrodysplasia punctata, type 3 and Zellweger syndrome. [provided by RefSeq, Jul 2008]
13. ALDH9A1
This protein belongs to the aldehyde dehydrogenase family of proteins. It has a high activity for oxidation of gamma-aminobutyraldehyde and other amino aldehydes. The enzyme catalyzes the dehydrogenation of gamma-aminobutyraldehyde to gamma-aminobutyric acid (GABA). This isozyme is a tetramer of identical 54-kD subunits. [provided by RefSeq, Jul 2008]
14. ALDH4A1
This protein belongs to the aldehyde dehydrogenase family of proteins. This enzyme is a mitochondrial matrix NAD-dependent dehydrogenase which catalyzes the second step of the proline degradation pathway, converting pyrroline-5-carboxylate to glutamate. Deficiency of this enzyme is associated with type II hyperprolinemia, an autosomal recessive disorder characterized by accumulation of delta-1-pyrroline-5-carboxylate (P5C) and proline. Alternatively spliced transcript variants encoding different isoforms have been identified for this gene. [provided by RefSeq, Jun 2009]
15. ALDH6A1
This gene encodes a member of the aldehyde dehydrogenase protein family. The encoded protein is a mitochondrial methylmalonate semialdehyde dehydrogenase that plays a role in the valine and pyrimidine catabolic pathways. This protein catalyzes the irreversible oxidative decarboxylation of malonate and methylmalonate semialdehydes to acetyl- and propionyl-CoA. Methylmalonate semialdehyde dehydrogenase deficiency is characterized by elevated beta-alanine, 3-hydroxypropionic acid, and both isomers of 3-amino and 3-hydroxyisobutyric acids in urine organic acids. Alternate splicing results in multiple transcript variants. [provided by RefSeq, Jun 2013]
16. ALDH3B1
This gene encodes a member of the aldehyde dehydrogenase protein family. Aldehyde dehydrogenases are a family of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The encoded protein is able to oxidize long-chain fatty aldehydes in vitro, and may play a role in protection from oxidative stress. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Feb 2014]
17. ALDH3B2
This gene encodes a member of the aldehyde dehydrogenase family, a group of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The gene of this particular family member is over 10 kb in length. Altered methylation patterns at this locus have been observed in spermatozoa derived from patients exhibiting reduced fecundity. [provided by RefSeq, Aug 2017]
18. ALDH16A1
This gene encodes a member of the aldehyde dehydrogenase superfamily. The family members act on aldehyde substrates and use nicotinamide adenine dinucleotide phosphate (NADP) as a cofactor. This gene is conserved in chimpanzee, dog, cow, mouse, rat, and zebrafish. The protein encoded by this gene interacts with maspardin, a protein that when truncated is responsible for Mast syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Apr 2010]
19. ALDH1L2
This gene encodes a member of both the aldehyde dehydrogenase superfamily and the formyl transferase superfamily. This member is the mitochondrial form of 10-formyltetrahydrofolate dehydrogenase (FDH), which converts 10-formyltetrahydrofolate to tetrahydrofolate and CO2 in an NADP(+)-dependent reaction, and plays an essential role in the distribution of one-carbon groups between the cytosolic and mitochondrial compartments of the cell. Alternatively spliced transcript variants have been found for this gene.[provided by RefSeq, Oct 2010]
20. ALDH8A1
This gene encodes a member of the aldehyde dehydrogenase family of proteins. The encoded protein has been implicated in the synthesis of 9-cis-retinoic acid and in the breakdown of the amino acid tryptophan. This enzyme converts 9-cis-retinal into the retinoid X receptor ligand 9-cis-retinoic acid, and has approximately 40-fold higher activity with 9-cis-retinal than with all-trans-retinal. In addition, this enzyme has been shown to catalyze the conversion of 2-aminomuconic semialdehyde to 2-aminomuconate in the kynurenine pathway of tryptophan catabolism. [provided by RefSeq, Jul 2018]
21. ALDH7A1P1
None
22. ALDH1L1-AS2
None
23. ALDH7A1P4
None
24. ALDH7A1P3
None
25. ALDH7A1P2
None
26. ALDH1A3-AS1
None
27. ALDH1A2-AS1
None
28. ALDH1L1-AS1
None
~~~
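Entries 21 to 28 above (pseudogenes and antisense RNAs, judging by their names) come back with no summary. If only genes with summary text are wanted, they can be filtered out afterwards. A minimal sketch, where `drop_missing_summaries` is a hypothetical helper and the sample dictionary stands in for the function's return value:

```python
def drop_missing_summaries(gene_summaries):
    """Return a copy keeping only genes whose summary is non-empty."""
    return {
        organism: {name: text for name, text in genes.items() if text}
        for organism, genes in gene_summaries.items()
    }


# Stand-in for the dictionary returned by get_entrez_gene_summary
sample = {"human": {"ALDH2": "This protein belongs to ...", "ALDH7A1P1": None}}
filtered = drop_missing_summaries(sample)
print(list(filtered["human"]))  # ['ALDH2']
```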
----------
## Example 3 – Receiving _thousands_ of genes across all organisms (unfiltered)
Setting `organism=None` and `max_gene_ids=10000` in the function above for the same query (`gene_name='ALDH*'`) returns 9010 Gene IDs (i.e., 9,010 matching genes across all organisms in the Entrez Gene database at the time of writing).
For example:
~~~python
>>> gene_summaries = get_entrez_gene_summary("ALDH*", email, organism=None, max_gene_ids=10000)
9010 gene IDs returned associated with gene ALDH*.
Retrieving summary for 217...
Retrieving summary for 216...
Retrieving summary for 19378...
Retrieving summary for 11669...
[...]
~~~
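With one request per Gene ID and a 0.34 s pause, 9,010 IDs take roughly 51 minutes of sleeping alone (9010 × 0.34 s ≈ 3063 s). The E-utilities accept several IDs in a single request as a comma-separated list, so grouping IDs cuts the request count considerably. A rough sketch, assuming a batch size of 200 (an arbitrary choice for illustration) and a hypothetical `chunked` helper:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


gene_ids = [str(i) for i in range(1, 9011)]  # placeholder IDs, not real Gene IDs
batches = list(chunked(gene_ids, 200))
print(len(batches))  # 46

# Each batch becomes one comma-separated `id` value, e.g.
# Entrez.esummary(db="gene", id=",".join(batch))
id_param = ",".join(batches[0][:3])
print(id_param)  # 1,2,3
```

With several IDs per response, `DocumentSummary` holds multiple entries, so the parsing step in the function would need to loop over them rather than treat it as a single dictionary.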
</details>

huangapple
  • Posted on July 20, 2023, 15:09:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76727448.html