Python: How to retrieve NCBI summary using gene name with Entrez?

Posted by 2ledvvac on 2023-08-02 in Python

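I have explored a wide variety of options and solutions online, but I can't quite figure this out. I'm new to Entrez, so I don't fully understand how it works, but below is my attempt.
My goal is to print out the online summary. For example, for Kat2a I would like it to print 'Enables H3 histone acetyltransferase activity; chromatin binding activity; and histone acetyltransferase activity (H4-K12 specific). Involved in several processes'... etc., taken from the Summary section on NCBI.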

def get_summary(gene_name):
    Entrez.email = 'x'

    query = f'{gene_name}[Gene Name]'
    handle = Entrez.esearch(db='gene', term=query)
    record = Entrez.read(handle)
    handle.close()

    NCBI_ids = record['IdList']
    for id in NCBI_ids:
        handle = Entrez.esummary(db='gene', id=id)
        record = Entrez.read(handle)
        print(record['Summary'])
    return 0


Source: https://stackoverflow.com/questions/76727448/how-to-retrieve-ncbi-summary-using-gene-name-with-entrez

2 Answers

rxztt3cl1#

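Using Biopython to get all gene IDs associated with the provided gene name¹ and to collect the gene summary for each ID²

[1]: using Bio.Entrez.esearch
[2]: using Bio.Entrez.efetch

You are on the right track. Here is an example that further fleshes out the approach you proposed in your question: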

import time
import xmltodict

from Bio import Entrez

def get_entrez_gene_summary(gene_name, email):
    """Returns the 'Summary' contents for provided input
    gene from the Entrez Gene database. All gene IDs 
    returned for input gene_name will have their docsum
    summaries 'fetched'.
    
    Args:
        gene_name (string): Official (HGNC) gene name 
        (e.g., 'KAT2A')
        email (string): Required email for making requests
    
    Returns:
        dict: Summaries for all gene IDs associated with 
        gene_name (where: keys → gene_ids, values → summary)
    """
    Entrez.email = email

    query = f"{gene_name}[Gene Name]"
    handle = Entrez.esearch(db="gene", term=query)
    record = Entrez.read(handle)
    handle.close()

    gene_summaries = {}
    gene_ids = record["IdList"]

    print(
        f"{len(gene_ids)} gene IDs returned associated with gene {gene_name}."
    )
    for gene_id in gene_ids:
        print(f"\tRetrieving summary for {gene_id}...")
        handle = Entrez.efetch(db="gene", id=gene_id, rettype="docsum")
        gene_dict = xmltodict.parse(
            "".join([x.decode(encoding="utf-8") for x in handle.readlines()]),
            dict_constructor=dict,
        )
        gene_docsum = gene_dict["eSummaryResult"]["DocumentSummarySet"][
            "DocumentSummary"
        ]
        summary = gene_docsum.get("Summary")
        gene_summaries[gene_id] = summary
        handle.close()
        time.sleep(0.34)  # Requests to NCBI are rate limited to 3 per second

    return gene_summaries

This results in the following behavior:

>>> email = # [insert private email here]
>>> gene_summaries = get_entrez_gene_summary("Kat2a", email)
20 gene IDs returned associated with gene Kat2a.
    Retrieving summary for 131367786...
    Retrieving summary for 2648...
    Retrieving summary for 14534...
    Retrieving summary for 303539...
    Retrieving summary for 374232...
    Retrieving summary for 555517...
    Retrieving summary for 514420...
    Retrieving summary for 454677...
    Retrieving summary for 100492735...
    Retrieving summary for 490971...
    Retrieving summary for 106047988...
    Retrieving summary for 552646...
    Retrieving summary for 100404275...
    Retrieving summary for 101670315...
    Retrieving summary for 108901253...
    Retrieving summary for 102311953...
    Retrieving summary for 102480159...
    Retrieving summary for 118289508...
    Retrieving summary for 103189181...
    Retrieving summary for 100774478...
>>> gene_summaries
{'131367786': None,
 '2648': 'KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator. It also functions as a repressor of NF-kappa-B (see MIM 164011) by promoting ubiquitination of the NF-kappa-B subunit RELA (MIM 164014) in a HAT-independent manner (Mao et al., 2009 [PubMed 19339690]).[supplied by OMIM, Sep 2009]',
 '14534': 'Enables H3 histone acetyltransferase activity; chromatin binding activity; and histone acetyltransferase activity (H4-K12 specific). Involved in several processes, including long-term memory; positive regulation of macromolecule metabolic process; and regulation of regulatory T cell differentiation. Acts upstream of or within several processes, including brain development; chordate embryonic development; and histone acetylation. Located in mitotic spindle and nucleus. Part of ATAC complex and SAGA complex. Is expressed in several structures, including alimentary system; central nervous system; early conceptus; genitourinary system; and hemolymphoid system gland. Orthologous to human KAT2A (lysine acetyltransferase 2A). [provided by Alliance of Genome Resources, Apr 2022]',
 '303539': 'Enables chromatin binding activity and protein phosphatase binding activity. Involved in several processes, including alpha-tubulin acetylation; intracellular distribution of mitochondria; and positive regulation of cardiac muscle cell differentiation. Located in chromatin and nucleus. Orthologous to human KAT2A (lysine acetyltransferase 2A). [provided by Alliance of Genome Resources, Apr 2022]',
 '374232': None,
 '555517': 'Predicted to enable N-acyltransferase activity; chromatin binding activity; and transcription coactivator activity. Involved in several processes, including histone acetylation; regulation of bone development; and regulation of cartilage development. Acts upstream of or within bone morphogenesis. Predicted to be located in centrosome and nucleus. Predicted to be part of histone acetyltransferase complex. Is expressed in brain; fin; head; heart; and otic vesicle. Orthologous to human KAT2A (lysine acetyltransferase 2A). [provided by Alliance of Genome Resources, Apr 2022]',
 '514420': None,
 '454677': None,
 '100492735': None,
 '490971': None,
 '106047988': None,
 '552646': None,
 '100404275': None,
 '101670315': None,
 '108901253': None,
 '102311953': None,
 '102480159': None,
 '118289508': None,
 '103189181': None,
 '100774478': None}
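
The `None` values above correspond to records whose `Summary` element is empty. As a rough illustration of the structure being parsed, here is a minimal sketch that extracts the same information with only the standard library instead of xmltodict. The XML below is a hypothetical, heavily trimmed payload written for this example (a real docsum response carries many more fields), and `summaries_from_docsum_xml` is an illustrative name, not a Biopython function.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed docsum payload for illustration only;
# a real response carries many more fields per DocumentSummary.
SAMPLE_DOCSUM_XML = """<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DocumentSummary uid="2648">
      <Name>KAT2A</Name>
      <Summary>KAT2A, or GCN5, is a histone acetyltransferase (HAT).</Summary>
    </DocumentSummary>
    <DocumentSummary uid="374232">
      <Name>KAT2A</Name>
      <Summary></Summary>
    </DocumentSummary>
  </DocumentSummarySet>
</eSummaryResult>
"""


def summaries_from_docsum_xml(xml_text):
    """Map each DocumentSummary uid to its Summary text (None when empty)."""
    root = ET.fromstring(xml_text)
    summaries = {}
    for docsum in root.iter("DocumentSummary"):
        text = docsum.findtext("Summary")
        # Empty or missing <Summary> elements are normalised to None,
        # mirroring the None values in the output above.
        cleaned = text.strip() if text else ""
        summaries[docsum.get("uid")] = cleaned or None
    return summaries


print(summaries_from_docsum_xml(SAMPLE_DOCSUM_XML))
```

Because `root.iter("DocumentSummary")` visits every matching element, the same helper works whether the response holds one record or several.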

Viewing the summaries

For example, the following additional code:

for k,v in gene_summaries.items():
    if v is not None:
        print(k)
        print(v, end="\n\n")

gives a more readable output of the gene summaries:

KAT2A

2648

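KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator. It also functions as a repressor of NF-kappa-B (see MIM 164011) by promoting ubiquitination of the NF-kappa-B subunit RELA (MIM 164014) in a HAT-independent manner (Mao et al., 2009 [PubMed 19339690]).[supplied by OMIM, Sep 2009]

14534

Enables H3 histone acetyltransferase activity; chromatin binding activity; and histone acetyltransferase activity (H4-K12 specific). Involved in several processes, including long-term memory; positive regulation of macromolecule metabolic process; and regulation of regulatory T cell differentiation. Acts upstream of or within several processes, including brain development; chordate embryonic development; and histone acetylation. Located in mitotic spindle and nucleus. Part of ATAC complex and SAGA complex. Is expressed in several structures, including alimentary system; central nervous system; early conceptus; genitourinary system; and hemolymphoid system gland. Orthologous to human KAT2A (lysine acetyltransferase 2A). [provided by Alliance of Genome Resources, Apr 2022]

303539

Enables chromatin binding activity and protein phosphatase binding activity. Involved in several processes, including alpha-tubulin acetylation; intracellular distribution of mitochondria; and positive regulation of cardiac muscle cell differentiation. Located in chromatin and nucleus. Orthologous to human KAT2A (lysine acetyltransferase 2A). [provided by Alliance of Genome Resources, Apr 2022]

555517

Predicted to enable N-acyltransferase activity; chromatin binding activity; and transcription coactivator activity. Involved in several processes, including histone acetylation; regulation of bone development; and regulation of cartilage development. Acts upstream of or within bone morphogenesis. Predicted to be located in centrosome and nucleus. Predicted to be part of histone acetyltransferase complex. Is expressed in brain; fin; head; heart; and otic vesicle. Orthologous to human KAT2A (lysine acetyltransferase 2A). [provided by Alliance of Genome Resources, Apr 2022]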

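Edit: more details on the data structure of the Entrez Gene database & how to filter results by organism (e.g., human only)

For example, here is the complete data contained under the ['eSummaryResult']['DocumentSummarySet']['DocumentSummary'] branch of the XML data tree (converted to a Python dict) returned for each Gene ID (there are no other significant indexed branches):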

@uid
2648

ChrSort
17

ChrStart
42113110

Chromosome
17

CurrentID
0

Description
lysine acetyltransferase 2A

GeneWeight
11383

GeneticSource
genomic

GenomicInfo
{'GenomicInfoType': {'ChrLoc': '17', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110', 'ExonCount': '19'}}

LocationHist
{'LocationHistType': [{'AnnotationRelease': 'RS_2023_03', 'AssemblyAccVer': 'GCF_000001405.40', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': 'RS_2023_03', 'AssemblyAccVer': 'GCF_009914755.1', 'ChrAccVer': 'NC_060941.1', 'ChrStart': '42977869', 'ChrStop': '42969612'}, {'AnnotationRelease': '110', 'AssemblyAccVer': 'GCF_000001405.40', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '110', 'AssemblyAccVer': 'GCF_009914755.1', 'ChrAccVer': 'NC_060941.1', 'ChrStart': '42977869', 'ChrStop': '42969612'}, {'AnnotationRelease': '109.20211119', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20210514', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20210226', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20201120', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20200815', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20200522', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20200228', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20191205', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121366', 'ChrStop': '42113110'}, {'AnnotationRelease': '109.20190905', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121408', 'ChrStop': '42113110'}, 
{'AnnotationRelease': '109.20190607', 'AssemblyAccVer': 'GCF_000001405.39', 'ChrAccVer': 'NC_000017.11', 'ChrStart': '42121408', 'ChrStop': '42113110'}, {'AnnotationRelease': '105.20220307', 'AssemblyAccVer': 'GCF_000001405.25', 'ChrAccVer': 'NC_000017.10', 'ChrStart': '40273384', 'ChrStop': '40265128'}, {'AnnotationRelease': '105.20220307', 'AssemblyAccVer': 'GCF_000001405.25', 'ChrAccVer': 'NW_003571052.1', 'ChrStart': '408008', 'ChrStop': '399752'}, {'AnnotationRelease': '105.20201022', 'AssemblyAccVer': 'GCF_000001405.25', 'ChrAccVer': 'NC_000017.10', 'ChrStart': '40273384', 'ChrStop': '40265128'}, {'AnnotationRelease': '105.20201022', 'AssemblyAccVer': 'GCF_000001405.25', 'ChrAccVer': 'NW_003571052.1', 'ChrStart': '408008', 'ChrStop': '399752'}, {'AnnotationRelease': '105', 'AssemblyAccVer': 'GCF_000001405.25', 'ChrAccVer': 'NC_000017.10', 'ChrStart': '40273381', 'ChrStop': '40265128'}, {'AnnotationRelease': '105', 'AssemblyAccVer': 'GCF_000001405.25', 'ChrAccVer': 'NW_003571052.1', 'ChrStart': '408005', 'ChrStop': '399752'}, {'AnnotationRelease': '105', 'AssemblyAccVer': 'GCF_000002125.1', 'ChrAccVer': 'AC_000149.1', 'ChrStart': '36038463', 'ChrStop': '36030208'}, {'AnnotationRelease': '105', 'AssemblyAccVer': 'GCF_000306695.2', 'ChrAccVer': 'NC_018928.2', 'ChrStart': '40509156', 'ChrStop': '40500903'}]}

MapLocation
17q21.2

Mim
{'int': '602301'}

Name
KAT2A

NomenclatureName
lysine acetyltransferase 2A

NomenclatureStatus
Official

NomenclatureSymbol
KAT2A

Organism
{'ScientificName': 'Homo sapiens', 'CommonName': 'human', 'TaxID': '9606'}

OtherAliases
GCN5, GCN5L2, PCAF-b, hGCN5

OtherDesignations
histone acetyltransferase KAT2A|GCN5 (general control of amino-acid synthesis, yeast, homolog)-like 2|General control of amino acid synthesis, yeast, homolog-like 2|K(lysine) acetyltransferase 2A|STAF97|general control of amino acid synthesis protein 5-like 2|histone acetyltransferase GCN5|histone glutaryltransferase KAT2A|histone succinyltransferase KAT2A|hsGCN5

Status
0

Summary
KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator. It also functions as a repressor of NF-kappa-B (see MIM 164011) by promoting ubiquitination of the NF-kappa-B subunit RELA (MIM 164014) in a HAT-independent manner (Mao et al., 2009 [PubMed 19339690]).[supplied by OMIM, Sep 2009]

This listing is printed when the following code is executed:

gene_keys = sorted(
    list(
        gene_dicts[gene_ids[0]]["eSummaryResult"]["DocumentSummarySet"][
            "DocumentSummary"
        ].keys()
    )
)
for k in gene_keys:
    print(k)
    print(
        gene_dicts[gene_ids[0]]["eSummaryResult"]["DocumentSummarySet"][
            "DocumentSummary"
        ][k],
        end="\n\n",
    )
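
Note that the snippet above reads from `gene_dicts`, which the function as posted does not populate (the relevant lines appear only as comments). As a self-contained sketch of the same idea, the helper below walks the fields of a single parsed record. `iter_docsum_fields` is a hypothetical name, and `example_gene_dict` is a trimmed stand-in for one xmltodict-parsed response, not real output.

```python
def iter_docsum_fields(gene_dict):
    """Yield (field, value) pairs of one parsed docsum, sorted by field name."""
    docsum = gene_dict["eSummaryResult"]["DocumentSummarySet"]["DocumentSummary"]
    for key in sorted(docsum):
        yield key, docsum[key]


# Hypothetical, trimmed stand-in for one xmltodict-parsed efetch response.
example_gene_dict = {
    "eSummaryResult": {
        "DocumentSummarySet": {
            "DocumentSummary": {
                "Name": "KAT2A",
                "@uid": "2648",
                "Chromosome": "17",
            }
        }
    }
}

for field, value in iter_docsum_fields(example_gene_dict):
    print(field)
    print(value, end="\n\n")
```

To run this against real data, the parsed `gene_dict` for each ID would have to be kept, for instance by enabling the commented-out `gene_dicts` lines in the function.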

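Modifying the function to filter by organism

So, suppose you only want results returned for human genes. You could modify the function shown above like this: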

def get_entrez_gene_summary(gene_name, email, organism="human"):
    """Returns the 'Summary' contents for provided input
    gene from the Entrez Gene database. All gene IDs 
    returned for input gene_name will have their docsum
    summaries 'fetched'.
    
    Args:
        gene_name (string): Official (HGNC) gene name 
           (e.g., 'KAT2A')
        email (string): Required email for making requests
        organism (string): Optional; common name of organism;
           defaults to human. Filters results only to match 
           organism; set to None to return all organisms 
           unfiltered.
    
    Returns:
        dict: Summaries for all gene IDs associated with 
           gene_name (where: keys → gene_ids, values → summary)
    """
    Entrez.email = email

    query = f"{gene_name}[Gene Name]"
    handle = Entrez.esearch(db="gene", term=query)
    record = Entrez.read(handle)
    handle.close()
    
    gene_summaries = {}
#     gene_dicts = {}
    gene_ids = record["IdList"]

    print(
        f"{len(gene_ids)} gene IDs returned associated with gene {gene_name}."
    )
    for gene_id in gene_ids:
        print(f"\tRetrieving summary for {gene_id}...")
        handle = Entrez.efetch(db="gene", id=gene_id, rettype="docsum")
        gene_dict = xmltodict.parse(
            "".join([x.decode(encoding="utf-8") for x in handle.readlines()]),
            dict_constructor=dict,
        )
#         gene_dicts[gene_id] = gene_dict
        gene_docsum = gene_dict["eSummaryResult"]["DocumentSummarySet"][
            "DocumentSummary"
        ]
        handle.close()
        time.sleep(0.34)  # Requests to NCBI are rate limited to 3 per second
        summary = gene_docsum.get("Summary")
        gene_organism = gene_docsum.get("Organism")["CommonName"]
        if organism and gene_organism != organism:
            print(f"\t\tSkipping {gene_id} | ⚠️ Not {organism}: {gene_organism}")
            continue  # handle is already closed and the pause already taken
        gene_summaries[gene_id] = {gene_organism: summary}

    return gene_summaries

This results in the following modified behavior:

20 gene IDs returned associated with gene Kat2a.
    Retrieving summary for 2648...
    Retrieving summary for 14534...
        Skipping 14534 | ⚠️ Not human: house mouse
    Retrieving summary for 303539...
        Skipping 303539 | ⚠️ Not human: Norway rat
    Retrieving summary for 374232...
        Skipping 374232 | ⚠️ Not human: chicken
    Retrieving summary for 555517...
        Skipping 555517 | ⚠️ Not human: zebrafish
    Retrieving summary for 514420...
        Skipping 514420 | ⚠️ Not human: cattle
    Retrieving summary for 454677...
        Skipping 454677 | ⚠️ Not human: chimpanzee
    Retrieving summary for 100492735...
        Skipping 100492735 | ⚠️ Not human: tropical clawed frog
    Retrieving summary for 490971...
        Skipping 490971 | ⚠️ Not human: dog
    Retrieving summary for 106047988...
        Skipping 106047988 | ⚠️ Not human: Swan goose
    Retrieving summary for 552646...
        Skipping 552646 | ⚠️ Not human: honey bee
    Retrieving summary for 100404275...
        Skipping 100404275 | ⚠️ Not human: white-tufted-ear marmoset
    Retrieving summary for 101670315...
        Skipping 101670315 | ⚠️ Not human: domestic ferret
    Retrieving summary for 108901253...
        Skipping 108901253 | ⚠️ Not human: barramundi perch
    Retrieving summary for 102311953...
        Skipping 102311953 | ⚠️ Not human: Burton's mouthbrooder
    Retrieving summary for 103189181...
        Skipping 103189181 | ⚠️ Not human: elephant shark
    Retrieving summary for 102480159...
        Skipping 102480159 | ⚠️ Not human: Chinese tree shrew
    Retrieving summary for 100774478...
        Skipping 100774478 | ⚠️ Not human: Chinese hamster
    Retrieving summary for 129098420...
        Skipping 129098420 | ⚠️ Not human: sablefish
    Retrieving summary for 122967192...
        Skipping 122967192 | ⚠️ Not human: yellowfin tuna

>>> gene_summaries # Human-only filtered
{'2648': {'human': 'KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator. It also functions as a repressor of NF-kappa-B (see MIM 164011) by promoting ubiquitination of the NF-kappa-B subunit RELA (MIM 164014) in a HAT-independent manner (Mao et al., 2009 [PubMed 19339690]).[supplied by OMIM, Sep 2009]'}}
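
The filter above compares the `CommonName` string. Since the `Organism` block shown earlier also carries a `TaxID` ('9606' for Homo sapiens), one possible variation is to match on that identifier instead, which avoids depending on the exact spelling of a common name. This is only a sketch: `matches_taxid` is a hypothetical helper, and the two docsum dicts are trimmed examples rather than full records.

```python
HUMAN_TAX_ID = "9606"  # TaxID reported for Homo sapiens in the docsum above


def matches_taxid(gene_docsum, tax_id):
    """Return True if the docsum's Organism block carries the given TaxID."""
    organism = gene_docsum.get("Organism") or {}
    # TaxID values arrive as strings after XML parsing, so compare as strings.
    return organism.get("TaxID") == str(tax_id)


# Trimmed, illustrative docsum dicts (only the Organism branch is kept).
human_docsum = {
    "Organism": {
        "ScientificName": "Homo sapiens",
        "CommonName": "human",
        "TaxID": "9606",
    }
}
mouse_docsum = {
    "Organism": {
        "ScientificName": "Mus musculus",
        "CommonName": "house mouse",
        "TaxID": "10090",
    }
}

print(matches_taxid(human_docsum, HUMAN_TAX_ID))
print(matches_taxid(mouse_docsum, HUMAN_TAX_ID))
```

Inside the function, the `gene_organism != organism` test could be swapped for a call like this if a `tax_id` argument were added.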


t2a7ltrp2#

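Finally, here is a further enhanced version of the same function (and this is the recommended code to use; I am leaving the previous answer up for demonstration purposes).

This final version (which could of course be customized further) accounts for the default Entrez.esearch limit of 20 returned gene IDs (overriding it with a default of 100), and also performs the organism filtering in the query itself (unless the default of "human" is set to None).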

import time
import xmltodict

from collections import defaultdict

from Bio import Entrez

def get_entrez_gene_summary(
    gene_name, email, organism="human", max_gene_ids=100
):
    """Returns the 'Summary' contents for provided input
    gene from the Entrez Gene database. All gene IDs 
    returned for input gene_name will have their docsum
    summaries 'fetched'.
    
    Args:
        gene_name (string): Official (HGNC) gene name 
           (e.g., 'KAT2A')
        email (string): Required email for making requests
        organism (string, optional): defaults to human. 
           Filters results only to match organism. Set to None
           to return all organism unfiltered.
        max_gene_ids (int, optional): Sets the number of Gene
           ID results to return (absolute max allowed is 10K).
        
    Returns:
        dict: Summaries for all gene IDs associated with 
           gene_name (where: keys → [orgn][gene name],
                      values → gene summary)
    """
    Entrez.email = email

    query = (
        f"{gene_name}[Gene Name]"
        if not organism
        else f"({gene_name}[Gene Name]) AND {organism}[Organism]"
    )
    handle = Entrez.esearch(db="gene", term=query, retmax=max_gene_ids)
    record = Entrez.read(handle)
    handle.close()

    gene_summaries = defaultdict(dict)
    gene_ids = record["IdList"]

    print(
        f"{len(gene_ids)} gene IDs returned associated with gene {gene_name}."
    )
    for gene_id in gene_ids:
        print(f"\tRetrieving summary for {gene_id}...")
        handle = Entrez.efetch(db="gene", id=gene_id, rettype="docsum")
        gene_dict = xmltodict.parse(
            "".join([x.decode(encoding="utf-8") for x in handle.readlines()]),
            dict_constructor=dict,
        )
        gene_docsum = gene_dict["eSummaryResult"]["DocumentSummarySet"][
            "DocumentSummary"
        ]
        name = gene_docsum.get("Name")
        summary = gene_docsum.get("Summary")
        gene_organism = gene_docsum.get("Organism")["CommonName"]
        gene_summaries[gene_organism][name] = summary
        handle.close()
        time.sleep(0.34)  # Requests to NCBI are rate limited to 3 per second

    return gene_summaries
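
The search-term logic of the function above can be pulled out into a small helper so the term can be inspected before any request is sent. This is only a sketch: `build_gene_query` is a hypothetical name, not part of Biopython, and it simply reproduces the term format the function builds inline.

```python
def build_gene_query(gene_name, organism="human"):
    """Build the Entrez Gene search term used by the function above."""
    base = f"{gene_name}[Gene Name]"
    if not organism:
        # No organism restriction: search across all organisms.
        return base
    return f"({base}) AND {organism}[Organism]"


print(build_gene_query("KAT2A"))
print(build_gene_query("ALDH*", organism=None))
```

Printing the term first makes it easy to paste it into the NCBI Gene search box and compare the hit count with what `Entrez.esearch` returns.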

For example, the query ALDH* (the asterisk denotes a wildcard) can be used to obtain gene summaries for all human aldehyde dehydrogenase genes:

>>> email = # enter private email
>>> gene_summaries = get_entrez_gene_summary("ALDH*", email, max_gene_ids=50)
28 gene IDs returned associated with gene ALDH*.
    Retrieving summary for 217...
    Retrieving summary for 216...
    Retrieving summary for 501...
    Retrieving summary for 220...
    Retrieving summary for 224...
    Retrieving summary for 7915...
    Retrieving summary for 218...
    Retrieving summary for 5832...
    Retrieving summary for 219...
    Retrieving summary for 10840...
    Retrieving summary for 8854...
    Retrieving summary for 8540...
    Retrieving summary for 223...
    Retrieving summary for 8659...
    Retrieving summary for 4329...
    Retrieving summary for 221...
    Retrieving summary for 222...
    Retrieving summary for 126133...
    Retrieving summary for 160428...
    Retrieving summary for 64577...
    Retrieving summary for 541...
    Retrieving summary for 100862662...
    Retrieving summary for 544...
    Retrieving summary for 543...
    Retrieving summary for 542...
    Retrieving summary for 101927751...
    Retrieving summary for 283665...
    Retrieving summary for 100874204...
>>> for i, (k, v) in enumerate(gene_summaries["human"].items()):
...    print(f"{i+1}. {k}")
...    print(v, end="\n\n")

1. ALDH2
This protein belongs to the aldehyde dehydrogenase family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. Two major liver isoforms of aldehyde dehydrogenase, cytosolic and mitochondrial, can be distinguished by their electrophoretic mobilities, kinetic properties, and subcellular localizations. Most Caucasians have two major isozymes, while approximately 50% of East Asians have the cytosolic isozyme but not the mitochondrial isozyme. A remarkably higher frequency of acute alcohol intoxication among East Asians than among Caucasians could be related to the absence of a catalytically active form of the mitochondrial isozyme. The increased exposure to acetaldehyde in individuals with the catalytically inactive form may also confer greater susceptibility to many types of cancer. This gene encodes a mitochondrial isoform, which has a low Km for acetaldehydes, and is localized in mitochondrial matrix. Alternative splicing results in multiple transcript variants encoding distinct isoforms.[provided by RefSeq, Nov 2016]

2. ALDH1A1
The protein encoded by this gene belongs to the aldehyde dehydrogenase family. Aldehyde dehydrogenase is the next enzyme after alcohol dehydrogenase in the major pathway of alcohol metabolism. There are two major aldehyde dehydrogenase isozymes in the liver, cytosolic and mitochondrial, which are encoded by distinct genes, and can be distinguished by their electrophoretic mobility, kinetic properties, and subcellular localization. This gene encodes the cytosolic isozyme. Studies in mice show that through its role in retinol metabolism, this gene may also be involved in the regulation of the metabolic responses to high-fat diet. [provided by RefSeq, Mar 2011]

3. ALDH7A1
The protein encoded by this gene is a member of subfamily 7 in the aldehyde dehydrogenase gene family. These enzymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This particular member has homology to a previously described protein from the green garden pea, the 26g pea turgor protein. It is also involved in lysine catabolism that is known to occur in the mitochondrial matrix. Recent reports show that this protein is found both in the cytosol and the mitochondria, and the two forms likely arise from the use of alternative translation initiation sites. An additional variant encoding a different isoform has also been found for this gene. Mutations in this gene are associated with pyridoxine-dependent epilepsy. Several related pseudogenes have also been identified. [provided by RefSeq, Jan 2011]

4. ALDH1A3
This gene encodes an aldehyde dehydrogenase enzyme that uses retinal as a substrate. Mutations in this gene have been associated with microphthalmia, isolated 8, and expression changes have also been detected in tumor cells. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2014]

5. ALDH3A2
Aldehyde dehydrogenase isozymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This gene product catalyzes the oxidation of long-chain aliphatic aldehydes to fatty acid. Mutations in the gene cause Sjogren-Larsson syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]

6. ALDH5A1
This protein belongs to the aldehyde dehydrogenase family of proteins. This gene encodes a mitochondrial NAD(+)-dependent succinic semialdehyde dehydrogenase. A deficiency of this enzyme, known as 4-hydroxybutyricaciduria, is a rare inborn error in the metabolism of the neurotransmitter 4-aminobutyric acid (GABA). In response to the defect, physiologic fluids from patients accumulate GHB, a compound with numerous neuromodulatory properties. Two transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, Jul 2008]

7. ALDH3A1
Aldehyde dehydrogenases oxidize various aldehydes to the corresponding acids. They are involved in the detoxification of alcohol-derived acetaldehyde and in the metabolism of corticosteroids, biogenic amines, neurotransmitters, and lipid peroxidation. The enzyme encoded by this gene forms a cytoplasmic homodimer that preferentially oxidizes aromatic and medium-chain (6 carbons or more) saturated and unsaturated aldehyde substrates. It is thought to promote resistance to UV and 4-hydroxy-2-nonenal-induced oxidative damage in the cornea. The gene is located within the Smith-Magenis syndrome region on chromosome 17. Multiple alternatively spliced variants, encoding the same protein, have been identified. [provided by RefSeq, Sep 2008]

8. ALDH18A1
This gene is a member of the aldehyde dehydrogenase family and encodes a bifunctional ATP- and NADPH-dependent mitochondrial enzyme with both gamma-glutamyl kinase and gamma-glutamyl phosphate reductase activities. The encoded protein catalyzes the reduction of glutamate to delta1-pyrroline-5-carboxylate, a critical step in the de novo biosynthesis of proline, ornithine and arginine. Mutations in this gene lead to hyperammonemia, hypoornithinemia, hypocitrullinemia, hypoargininemia and hypoprolinemia and may be associated with neurodegeneration, cataracts and connective tissue diseases. Alternatively spliced transcript variants, encoding different isoforms, have been described for this gene. [provided by RefSeq, Jul 2008]

9. ALDH1B1
This protein belongs to the aldehyde dehydrogenases family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. This gene does not contain introns in the coding sequence. The variation of this locus may affect the development of alcohol-related problems. [provided by RefSeq, Jul 2008]

10. ALDH1L1
The protein encoded by this gene catalyzes the conversion of 10-formyltetrahydrofolate, nicotinamide adenine dinucleotide phosphate (NADP+), and water to tetrahydrofolate, NADPH, and carbon dioxide. The encoded protein belongs to the aldehyde dehydrogenase family. Loss of function or expression of this gene is associated with decreased apoptosis, increased cell motility, and cancer progression. There is an antisense transcript that overlaps on the opposite strand with this gene locus. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2012]

11. ALDH1A2
This protein belongs to the aldehyde dehydrogenase family of proteins. The product of this gene is an enzyme that catalyzes the synthesis of retinoic acid (RA) from retinaldehyde. Retinoic acid, the active derivative of vitamin A (retinol), is a hormonal signaling molecule that functions in developing and adult tissues. The studies of a similar mouse gene suggest that this enzyme and the cytochrome CYP26A1, concurrently establish local embryonic retinoic acid levels which facilitate posterior organ development and prevent spina bifida. Four transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, May 2011]

12. AGPS
This gene is a member of the FAD-binding oxidoreductase/transferase type 4 family. It encodes a protein that catalyzes the second step of ether lipid biosynthesis in which acyl-dihydroxyacetonephosphate (DHAP) is converted to alkyl-DHAP by the addition of a long chain alcohol and the removal of a long-chain acid anion. The protein is localized to the inner aspect of the peroxisomal membrane and requires FAD as a cofactor. Mutations in this gene have been associated with rhizomelic chondrodysplasia punctata, type 3 and Zellweger syndrome. [provided by RefSeq, Jul 2008]

13. ALDH9A1
This protein belongs to the aldehyde dehydrogenase family of proteins. It has a high activity for oxidation of gamma-aminobutyraldehyde and other amino aldehydes. The enzyme catalyzes the dehydrogenation of gamma-aminobutyraldehyde to gamma-aminobutyric acid (GABA). This isozyme is a tetramer of identical 54-kD subunits. [provided by RefSeq, Jul 2008]

14. ALDH4A1
This protein belongs to the aldehyde dehydrogenase family of proteins. This enzyme is a mitochondrial matrix NAD-dependent dehydrogenase which catalyzes the second step of the proline degradation pathway, converting pyrroline-5-carboxylate to glutamate. Deficiency of this enzyme is associated with type II hyperprolinemia, an autosomal recessive disorder characterized by accumulation of delta-1-pyrroline-5-carboxylate (P5C) and proline. Alternatively spliced transcript variants encoding different isoforms have been identified for this gene. [provided by RefSeq, Jun 2009]

15. ALDH6A1
This gene encodes a member of the aldehyde dehydrogenase protein family. The encoded protein is a mitochondrial methylmalonate semialdehyde dehydrogenase that plays a role in the valine and pyrimidine catabolic pathways. This protein catalyzes the irreversible oxidative decarboxylation of malonate and methylmalonate semialdehydes to acetyl- and propionyl-CoA. Methylmalonate semialdehyde dehydrogenase deficiency is characterized by elevated beta-alanine, 3-hydroxypropionic acid, and both isomers of 3-amino and 3-hydroxyisobutyric acids in urine organic acids. Alternate splicing results in multiple transcript variants. [provided by RefSeq, Jun 2013]

16. ALDH3B1
This gene encodes a member of the aldehyde dehydrogenase protein family. Aldehyde dehydrogenases are a family of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The encoded protein is able to oxidize long-chain fatty aldehydes in vitro, and may play a role in protection from oxidative stress. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Feb 2014]

17. ALDH3B2
This gene encodes a member of the aldehyde dehydrogenase family, a group of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The gene of this particular family member is over 10 kb in length. Altered methylation patterns at this locus have been observed in spermatozoa derived from patients exhibiting reduced fecundity. [provided by RefSeq, Aug 2017]

18. ALDH16A1
This gene encodes a member of the aldehyde dehydrogenase superfamily. The family members act on aldehyde substrates and use nicotinamide adenine dinucleotide phosphate (NADP) as a cofactor. This gene is conserved in chimpanzee, dog, cow, mouse, rat, and zebrafish. The protein encoded by this gene interacts with maspardin, a protein that when truncated is responsible for Mast syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Apr 2010]

19. ALDH1L2
This gene encodes a member of both the aldehyde dehydrogenase superfamily and the formyl transferase superfamily. This member is the mitochondrial form of 10-formyltetrahydrofolate dehydrogenase (FDH), which converts 10-formyltetrahydrofolate to tetrahydrofolate and CO2 in an NADP(+)-dependent reaction, and plays an essential role in the distribution of one-carbon groups between the cytosolic and mitochondrial compartments of the cell. Alternatively spliced transcript variants have been found for this gene.[provided by RefSeq, Oct 2010]

20. ALDH8A1
This gene encodes a member of the aldehyde dehydrogenase family of proteins. The encoded protein has been implicated in the synthesis of 9-cis-retinoic acid and in the breakdown of the amino acid tryptophan. This enzyme converts 9-cis-retinal into the retinoid X receptor ligand 9-cis-retinoic acid, and has approximately 40-fold higher activity with 9-cis-retinal than with all-trans-retinal. In addition, this enzyme has been shown to catalyze the conversion of 2-aminomuconic semialdehyde to 2-aminomuconate in the kynurenine pathway of tryptophan catabolism. [provided by RefSeq, Jul 2018]

21. ALDH7A1P1
None

22. ALDH1L1-AS2
None

23. ALDH7A1P4
None

24. ALDH7A1P3
None

25. ALDH7A1P2
None

26. ALDH1A3-AS1
None

27. ALDH1A2-AS1
None

28. ALDH1L1-AS1
None

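Setting organism=None in the provided Python function, together with max_gene_ids=10000 for the same query (gene_name='ALDH*'), returns 9,010 gene IDs (i.e., 9,010 ALDH-family genes across all organisms currently in the Entrez Gene database).
For example: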

>>> gene_summaries = get_entrez_gene_summary("ALDH*", email, organism=None, max_gene_ids=10000)
9010 gene IDs returned associated with gene ALDH*.
    Retrieving summary for 217...
    Retrieving summary for 216...
    Retrieving summary for 19378...
    Retrieving summary for 11669...
[...]
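
With one efetch request per gene ID and a 0.34 s pause after each, 9,010 IDs mean roughly 51 minutes spent in `time.sleep` alone. The E-utilities accept a comma-separated list of IDs in a single request, so one possible mitigation is to fetch in batches. The sketch below covers only the batching step; `chunked` and the batch size of 200 are assumptions made for illustration, and the IDs are placeholders rather than real Gene IDs.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs standing in for the 9,010 IDs returned by esearch.
gene_ids = [str(n) for n in range(1, 9011)]

# One comma-separated string per request instead of one request per ID.
id_batches = [",".join(batch) for batch in chunked(gene_ids, 200)]

print(len(id_batches))
print(id_batches[-1].count(",") + 1)  # number of IDs in the final batch
```

Each joined string could then be passed as `id=` to `Entrez.efetch`. Keep in mind that when a response holds several records, xmltodict returns a list under `DocumentSummary` rather than a single dict, so the parsing step in the function would need to loop over that list.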
