Rentrez is pulling the wrong data from NCBI in R?

Posted by lymnna71 on 2023-01-28 in Other

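I'm trying to download sequence data for E. coli samples from the state of Washington. There are about 1283 sequences, which I know is a lot. The problem I'm running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the R code below pulls 1283 IDs, but when I use entrez_fetch on those IDs, I get sequence data from chickens, corn, and other things that are not E. coli.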

search <- entrez_search(db = "biosample", 
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = TRUE)

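Similarly, I tried pulling the sequence for a single sample by hand as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample, but when I use the code below, I see metadata for a chicken: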

library(rentrez)
library(XML)  # provides xmlToList()

search <- entrez_search(db = "biosample", 
                        term = "SAMN30954130[ACCN]",
                        retmax = 9999, use_history = TRUE)
fetch_test <- entrez_fetch(db = "nucleotide",
                           id = search$ids,
                           rettype = "xml")
fetch_list <- xmlToList(fetch_test)

Source: https://stackoverflow.com/questions/75254015/rentrez-is-pulling-the-wrong-data-from-ncbi-in-r

1 Answer

3j86kqsm1#

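The problem here is that you are using a BioSample UID to query the nucleotide database. That UID is then interpreted as a nucleotide UID, so the sequence record you get back is unrelated to your original BioSample query.
What you need here is entrez_link, which uses UIDs to link records between two databases.
For example, your BioSample accession SAMN30954130 has the BioSample UID 30954130. You can link it to the nucleotide database like this: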

nuc_links <- entrez_link(dbfrom='biosample', id=30954130, db='nuccore')

You can then get the corresponding nucleotide UID(s) like this:

nuc_links$links$biosample_nuccore

[1] "2307876014"

Then:

fetch_test <- entrez_fetch(db = "nucleotide",
                           id = 2307876014,
                           rettype = "xml")

This is covered in the "Finding cross-references" section of the rentrez tutorial.
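
The same linking step can be applied to the full set of BioSample UIDs from the original search. What follows is only a rough sketch under a few assumptions: the helper names chunk_ids and link_biosample_to_nuccore are made up for illustration, the chunk size of 200 is an arbitrary choice to keep each request small, and fetching only the first 50 records is just a sanity check. It assumes the rentrez package is installed; the network calls only run in an interactive session.

```r
# Hypothetical helper: split a vector of IDs into chunks of at most `size`.
chunk_ids <- function(ids, size = 200) {
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Hypothetical helper: map BioSample UIDs to linked nuccore UIDs,
# one chunk per entrez_link() request.
link_biosample_to_nuccore <- function(biosample_ids, size = 200) {
  nuc_ids <- character(0)
  for (chunk in chunk_ids(biosample_ids, size)) {
    links <- rentrez::entrez_link(dbfrom = "biosample", id = chunk, db = "nuccore")
    # biosample_nuccore is NULL when a chunk has no linked sequences
    nuc_ids <- c(nuc_ids, links$links$biosample_nuccore)
  }
  unique(nuc_ids)
}

# Example usage (needs network access, so it is skipped in batch mode).
if (interactive() && requireNamespace("rentrez", quietly = TRUE)) {
  search <- rentrez::entrez_search(
    db = "biosample",
    term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
    retmax = 9999
  )
  nuc_ids <- link_biosample_to_nuccore(search$ids)
  # Fetch a small subset first to confirm the records are E. coli.
  fasta <- rentrez::entrez_fetch(db = "nuccore", id = head(nuc_ids, 50),
                                 rettype = "fasta")
}
```

Note that one BioSample can link to many nucleotide records (or none), so the number of nuccore UIDs will generally differ from the number of BioSample UIDs.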

Answered 2023-01-28
