Regex that excludes groups preceded by curly brackets and matches only the text in the first part of the brackets

Posted by ef1yzkbh on 2023-10-22 in Other

I'm writing a Python script to parse Wikipedia articles, and part of that process is parsing links. I'm trying to write a regex that matches like this:

[[:Category:Anarchism by country|Anarchism by country]] -> :Category:Anarchism by country
[[Squatting|squat]] -> Squatting
[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right) -> John Zerzan
* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich |title=[[Anarchist Voices: An Oral History of Anarchism in America]] |year=1996 |publisher=[[Princeton University Press]] |isbn=978-0-691-04494-1 -> Unmatched, begins with * {{ (citation)

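I've gotten as far as \[\[([^|\]]+)(?:\|[^|\]]+)?\]\], which works for the first 3 examples above, but in the citation it matches the title and the publisher. I know (I think) that I need a negative lookahead to prevent any matches in the last example. I'm pretty bad at regex, so any advice would be greatly appreciated.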
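To make the problem concrete, here is a minimal sketch that runs the pattern above against two of the sample inputs using Python's standard `re` module. The names `LINK_RE`, `link_targets`, and `citation` are made up for this illustration, and the citation string has its closing `}}` added so that it is a complete template.

```python
import re

# The pattern from the question: capture the link target and
# optionally skip a single "|label" part before the closing "]]".
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|[^|\]]+)?\]\]")


def link_targets(wikitext):
    """Return every link target captured by the question's pattern."""
    return LINK_RE.findall(wikitext)


citation = (
    "* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich "
    "|title=[[Anarchist Voices: An Oral History of Anarchism in America]] "
    "|year=1996 |publisher=[[Princeton University Press]] "
    "|isbn=978-0-691-04494-1}}"
)

# A plain piped link behaves as intended.
print(link_targets("[[Squatting|squat]]"))

# The pattern has no notion of the surrounding {{...}} template,
# so the links inside the citation are captured as well.
print(link_targets(citation))
```

The second call returns the title and the publisher, which is exactly the unwanted behaviour described above: the pattern only looks at the `[[...]]` pair and knows nothing about the enclosing template.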

regex

Source: https://stackoverflow.com/questions/77031024/regexpr-which-excludes-groups-if-they-are-precedeeded-by-curly-brackets-and-only

1 Answer

ru9i0ody #1

Wikitext is quite complex and should not be parsed with regex alone. Instead, use a full-fledged parser such as mwparserfromhell:

import mwparserfromhell as mph

def get_links_outside_of_templates(text):
    tree = mph.parse(text)
    # Lazily iterate over top-level wikilinks only; links nested inside
    # templates such as {{cite book}} are skipped because recursive=False.
    links = tree.ifilter_wikilinks(recursive=False)

    for link in links:
        if link.title.startswith('File'):
            # A File link's "text" (its caption) may itself contain links,
            # so parse it recursively.
            yield from get_links_outside_of_templates(link.text)
        else:
            yield link.title

# `text` holds the wikitext to analyse.
print([*get_links_outside_of_templates(text)])

For the following wikitext (partially generated by ChatGPT):

'''Squatting''' may refer to [[Squatting|squat]], the act of occupying an abandoned or unused property without legal permission.

== Foo ==

[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right)]]

Lorem ipsum dolor sit [[amet]], consectetur adipiscing elit. Vestibulum interdum, neque nec aliquet venenatis, tortor erat commodo nulla, id imperdiet mi urna eget nunc.

== References ==
* {{cite book
  |last=Avrich |first=Paul |author-link=Paul Avrich
  |title=[[Anarchist Voices: An Oral History of Anarchism in America]]
  |year=1996 |publisher=[[Princeton University Press]]
  |isbn=978-0-691-04494-1
  }}

[[:Category:Anarchism by country|Anarchism by country]]

it outputs:

['Squatting', 'John Zerzan', 'amet', ':Category:Anarchism by country']

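Unfortunately, mwparserfromhell doesn't recognize namespaces, so if you use it you will have to check for File links yourself. I used a crude .startswith('File') in the function above, but you may want a more robust check, since namespace names are case-insensitive: file and fIlE are both valid and mean the same as File.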
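One possible way to make that check case-insensitive is sketched below. It is only an illustration: the helper name `is_file_title` and the `FILE_NAMESPACES` set are made up here, the set lists just `file` and its `image` alias (localized namespace names on non-English wikis are not covered), and stripping whitespace around the namespace is an assumption about how lenient the check should be.

```python
# Hypothetical helper, not part of mwparserfromhell.
# Namespace names to treat as "File"; "image" is an alias of "File".
FILE_NAMESPACES = {"file", "image"}


def is_file_title(title):
    """Return True if a link title is in the File namespace, ignoring case."""
    # str() lets the helper accept a plain string or a parsed title object.
    namespace, colon, _rest = str(title).partition(":")
    # Without a colon there is no namespace prefix at all. A leading colon,
    # as in ":Category:Foo", yields an empty namespace and is not a match.
    return bool(colon) and namespace.strip().casefold() in FILE_NAMESPACES


print(is_file_title("fIlE:Jarach and Zerzan.JPG"))
print(is_file_title(":Category:Anarchism by country"))
```

In the function above, `link.title.startswith('File')` could then be replaced with `is_file_title(link.title)`.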

2023-10-22