regex: Extract DOIs (digital object identifiers) from text

Posted by 7cwmlq89 on 2022-12-27 in Other

I have a block of text, and thousands more like it, containing references to research studies. One example looks like this:

# Adjacent string literals inside the parentheses are concatenated into one string.
txt = (
    '<div>1. <em>Nationella riktlinjer för rörelseorganens sjukdomar</em> (Swedish National Guidelines). 2012, The National Board of Health and Welfare. doi:10.1097/BRS.0b013e31829ff095 https://www.socialstyrelsen.se/publikationer2012/2012-5-1</a></div>'
    '<div>2. Jevsevar, D.S., et al., <em>The American Academy of Orthopaedic Surgeons evidence-based guideline on: treatment of osteoarthritis of the knee, 2nd edition.</em> J Bone Joint Surg Am, 2013. <strong>95</strong>(20): p. 1885-6. <a href="http://www.ncbi.nlm.nih.gov/pubmed/24288804" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/24288804</a></div>'
    '<div>3. Namba, R.S., et al., <em>Obesity and perioperative morbidity in total hip and total knee arthroplasty patients.</em> J Arthroplasty, 2005. <strong>20</strong>(7 Suppl 3): p. 46-50. <a href="https://dx.doi.org/10.1016/j.arth.2005.04.023" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1016/j.arth.2005.04.023</a></div>'
    '<div>4. Peter, W.F., et al., <em>Physiotherapy in hip and knee osteoarthritis: development of a practice guideline concerning initial assessment, treatment and evaluation.</em> Acta Reumatol Port, 2011. <strong>36</strong>(3): p. 268-81. <a href="http://www.ncbi.nlm.nih.gov/pubmed/22113602" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/22113602</a></div>'
    '<div>5. Santoso, M.B. and L. Wu, <em>Unicompartmental knee arthroplasty, is it superior to high tibial osteotomy in treating unicompartmental osteoarthritis? A meta-analysis and systemic review.</em>&nbsp;J Orthop Surg Res, 2017. <strong>12</strong>(1): p. 50.&nbsp;<a href="https://dx.doi.org/10.1186/s13018-017-0552-9" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1186/s13018-017-0552-9</a></div>'
    '<div>6. Management of osteoarthritis. NICE guidelines. NICE Pathway last updated: 22 January 2019. <a href="https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf</a></div><div>&nbsp;</div>'
)

This text contains several DOIs, some as links and some as plain "doi:" keys. How can I get all of them, perhaps in a list such as:

['doi:10.1097/BRS.0b013e31829ff095',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1186/s13018-017-0552-9',
]

I have looked up several regular expressions for this, but to no avail. For example:

import re
exp = "10.\\d{4,9}/[-._;()/:a-z0-9A-Z]+"
pattern = re.compile(exp)

pattern.findall(txt)

This returns an empty list.


Answer #1 (djmepvbi)

Thanks to @wiktor-stribiew, I got it working.

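import re

# "10." prefix, a 4-9 digit registrant code, a slash, then the DOI suffix.
# The raw string and the escaped dot make "10\." match a literal period only.
exp = r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+"
pattern = re.compile(exp)

print(pattern.findall(txt))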

['10.1097/BRS.0b013e31829ff095', '10.1016/j.arth.2005.04.023', '10.1016/j.arth.2005.04.023', '10.1186/s13018-017-0552-9', '10.1186/s13018-017-0552-9']
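The result above drops the `doi:` label and the resolver URL, and it lists a DOI twice when it appears in both the `href` and the link text. If the goal is a list closer to the one in the question, a minimal sketch could look like the following. The optional prefix alternatives (`doi:` and `http(s)://(dx.)doi.org/`) are assumptions based only on the sample text, and `extract_dois` and `sample` are hypothetical names used for illustration.

```python
import re

# Optional "doi:" label or doi.org resolver URL, followed by the DOI itself.
# The prefix alternatives are assumptions drawn from the sample text only.
DOI_WITH_PREFIX = re.compile(
    r"(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+",
    re.IGNORECASE,
)


def extract_dois(text):
    """Return DOI references in order of first appearance, without duplicates."""
    # The pattern has no capturing groups, so findall returns whole matches.
    # dict.fromkeys keeps insertion order (Python 3.7+) and drops repeats.
    return list(dict.fromkeys(DOI_WITH_PREFIX.findall(text)))


# A shortened stand-in for the txt string from the question.
sample = (
    'Welfare. doi:10.1097/BRS.0b013e31829ff095 https://www.socialstyrelsen.se/x '
    '<a href="https://dx.doi.org/10.1016/j.arth.2005.04.023">'
    'https://dx.doi.org/10.1016/j.arth.2005.04.023</a>'
)

print(extract_dois(sample))
# ['doi:10.1097/BRS.0b013e31829ff095', 'https://dx.doi.org/10.1016/j.arth.2005.04.023']
```

Note that the suffix character class includes `.`, `;` and `)`, so a DOI followed directly by sentence punctuation would pick up that trailing character. Stripping it afterwards may be needed for free-form text.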
