Extracting URL and anchor text from Markdown using Python

Posted by fcg9iug3 on 2023-11-20 in Python


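I am trying to extract anchor text and the associated URLs from Markdown. I found an existing question on this, but unfortunately its answer does not fully cover what I need.
In Markdown, there are two ways to insert a link: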

Example 1:

[anchor text](http://my.url)


Example 2:

[anchor text][1]

   [1]: http://my.url

My script looks like this (note that I am using regex, not re):

import regex

body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"

# Raw string so the backslashes reach the regex engine unchanged.
rex = r"""(?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])"""
pattern = regex.compile(rex)
matches = regex.findall(pattern, body_markdown, overlapped=True)
for m in matches:
    print(m)

This produces the output:

('http://google.com', 'http://google.com')
('http://yahoo.com', 'http://yahoo.com')

My expected output is:

('inline link', 'http://google.com')
('non inline link', 'http://yahoo.com')

How do I correctly capture the anchor text from Markdown?


Source: https://stackoverflow.com/questions/30734682/extracting-url-and-anchor-text-from-markdown-using-python

3 Answers


3lxsmp7m #1

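"How do I correctly capture the anchor text from Markdown?"
Parse it into a structured format (for example, HTML), then use the appropriate tools to extract the link labels and addresses.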

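import markdown
from lxml import etree

body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"

# Render the Markdown to HTML, then parse it so the links can be queried with XPath.
# etree.fromstring expects a single root element; this sample renders to one <p>.
doc = etree.fromstring(markdown.markdown(body_markdown))
for link in doc.xpath('//a'):
    print(link.text, link.get('href'))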

This gives me:

inline link http://google.com
non inline link http://yahoo.com

The alternative is to write your own Markdown parser, which seems like the wrong place to focus your effort.
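
As a variation on the parse-then-extract approach above, the extraction step can also be sketched with the standard library's html.parser instead of lxml. This is only an illustration: the LinkExtractor class and extract_links helper are hypothetical names, and the HTML string is hand-written to stand in for what markdown.markdown would return. Unlike etree.fromstring, this parser does not require a single root element, so several paragraphs are not a problem.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (anchor text, href) pairs from <a> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hand-written HTML standing in for the output of markdown.markdown(body_markdown).
html = (
    '<p>This is an <a href="http://google.com">inline link</a>. '
    'This is a <a href="http://yahoo.com">non inline link</a></p>'
    '<p>A second paragraph.</p>'
)
for text, href in extract_links(html):
    print(text, href)
```

Anchors without an href attribute are ignored in this sketch; whether that is desirable depends on the input.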


wribegjk #2

A modification of @mreinhardt's solution that returns a list of all (text, link) pairs (rather than a dict):

import re

INLINE_LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
FOOTNOTE_LINK_TEXT_RE = re.compile(r'\[([^\]]+)\]\[(\d+)\]')
FOOTNOTE_LINK_URL_RE = re.compile(r'\[(\d+)\]:\s+(\S+)')

def find_md_links(md):
    """ Return list of (text, url) pairs for the links in markdown """

    # Inline links: keep every match, including repeated anchor texts.
    links = list(INLINE_LINK_RE.findall(md))
    # Footnote links: anchor text -> reference id, and reference id -> URL.
    footnote_links = dict(FOOTNOTE_LINK_TEXT_RE.findall(md))
    footnote_urls = dict(FOOTNOTE_LINK_URL_RE.findall(md))

    for text, ref in footnote_links.items():
        links.append((text, footnote_urls[ref]))

    return links

I tested it in Python 3 with duplicated links:

[h](http://google.com) and [h](https://goog.e.com)
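
To illustrate why a list of pairs matters for this input, here is a small sketch that runs the inline pattern from the answer on the test string and compares the list with a dict built from it. The variable names pairs and as_dict are introduced here only for the illustration.

```python
import re

INLINE_LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')

md = "[h](http://google.com) and [h](https://goog.e.com)"

pairs = INLINE_LINK_RE.findall(md)   # list keeps every match, in order
as_dict = dict(pairs)                # dict keeps only the last URL per anchor text

print(pairs)
print(as_dict)
```

Both links share the anchor text "h", so the dict ends up with a single entry while the list keeps both.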


bweufnob #3

You can do it with a few simple re patterns:

import re

INLINE_LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
FOOTNOTE_LINK_TEXT_RE = re.compile(r'\[([^\]]+)\]\[(\d+)\]')
FOOTNOTE_LINK_URL_RE = re.compile(r'\[(\d+)\]:\s+(\S+)')

def find_md_links(md):
    """ Return dict of links in markdown """

    links = dict(INLINE_LINK_RE.findall(md))
    footnote_links = dict(FOOTNOTE_LINK_TEXT_RE.findall(md))
    footnote_urls = dict(FOOTNOTE_LINK_URL_RE.findall(md))

    # Resolve each footnote reference id to its URL.
    for key, value in footnote_links.items():
        footnote_links[key] = footnote_urls[value]
    links.update(footnote_links)

    return links

Then you can use it like this:

>>> body_markdown = """
... This is an [inline link](http://google.com).
... This is a [footnote link][1].
...
... [1]: http://yahoo.com
... """
>>> links = find_md_links(body_markdown)
>>> links
{'inline link': 'http://google.com', 'footnote link': 'http://yahoo.com'}
>>> list(links.values())
['http://google.com', 'http://yahoo.com']
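
Two limits of the patterns above are worth noting: the footnote patterns only accept numeric labels, and a reference whose definition is missing raises a KeyError. The following is a rough sketch of a more tolerant variant, not part of the original answer. The names REFERENCE_LINK_RE, REFERENCE_DEF_RE and find_md_links_tolerant are hypothetical, the example.com URL is a placeholder, and the sketch assumes labels are matched exactly (Markdown itself treats them case-insensitively).

```python
import re

INLINE_LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
# Assumption: a label is any bracket-free text, not only digits.
REFERENCE_LINK_RE = re.compile(r'\[([^\]]+)\]\[([^\]]+)\]')
# Definitions are expected at the start of a line, optionally indented.
REFERENCE_DEF_RE = re.compile(r'^\s*\[([^\]]+)\]:\s+(\S+)', re.MULTILINE)


def find_md_links_tolerant(md):
    """Return (text, url) pairs; references without a definition are skipped."""
    links = list(INLINE_LINK_RE.findall(md))
    definitions = dict(REFERENCE_DEF_RE.findall(md))
    for text, label in REFERENCE_LINK_RE.findall(md):
        url = definitions.get(label)
        if url is not None:
            links.append((text, url))
    return links


sample = """
An [inline link](http://google.com), a [footnote link][1],
a [named link][docs] and a [dangling link][9].

[1]: http://yahoo.com
[docs]: http://example.com/docs
"""
print(find_md_links_tolerant(sample))
```

Skipping undefined references silently is a design choice for this sketch; raising an error or collecting them separately may suit other uses better.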
