Regex: exclude groups that are preceded by curly braces, and match only the text in the first part of the brackets

ef1yzkbh  posted on 2023-10-22  in  Other

I'm writing a Python script to parse Wikipedia articles, and part of that is parsing links. I'm trying to write a regex that matches like this:

  • [[:Category:Anarchism by country|Anarchism by country]] -> :Category:Anarchism by country
  • [[Squatting|squat]] -> Squatting
  • [[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right) -> John Zerzan
  • * {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich |title=[[Anarchist Voices: An Oral History of Anarchism in America]] |year=1996 |publisher=[[Princeton University Press]] |isbn=978-0-691-04494-1 -> Unmatched, begins with * {{ (citation)

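I've got as far as \[\[([^|\]]+)(?:\|[^|\]]+)?\]\], which works on the first 3 examples above, but in the citation it matches the title and the publisher. I know (I think) that I need a negative lookahead to prevent any matches in the last example. I'm pretty bad at regex, so any advice would be much appreciated.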
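The behaviour described above can be reproduced with a minimal sketch like the following, assuming Python's built-in re module. The names LINK_RE, samples and results are illustrative only, and the sample strings are copied from the examples in the list.

```python
import re

# The pattern from the question: capture the link target, optionally
# followed by a single "|label" part, then the closing brackets.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|[^|\]]+)?\]\]")

samples = [
    "[[:Category:Anarchism by country|Anarchism by country]]",
    "[[Squatting|squat]]",
    "[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) "
    "and [[John Zerzan]] (right)",
    "* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich "
    "|title=[[Anarchist Voices: An Oral History of Anarchism in America]] "
    "|year=1996 |publisher=[[Princeton University Press]] "
    "|isbn=978-0-691-04494-1",
]

# findall returns the captured group (the link target) for every match.
results = [LINK_RE.findall(s) for s in samples]
for found in results:
    print(found)
```

The last list printed contains the title and the publisher from the citation, which are the unwanted matches.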

ru9i0ody


Wikitext is quite complex and shouldn't be parsed with regular expressions alone. Use a full-fledged parser instead, such as mwparserfromhell:

import mwparserfromhell as mph

def get_links_outside_of_templates(text):
    tree = mph.parse(text)
    # Lazily iterate over top-level wikilinks only; with recursive=False,
    # links nested inside templates (e.g. {{cite book}}) are skipped.
    links = tree.ifilter_wikilinks(recursive=False)

    for link in links:
        if link.title.startswith('File'):
            # For a File link, recursively parse its "text" (the caption),
            # which may itself contain links.
            yield from get_links_outside_of_templates(link.text)
        else:
            yield link.title

# `text` is the wikitext to parse, e.g. the sample shown below.
print([*get_links_outside_of_templates(text)])

For the following wikitext (partly generated by ChatGPT):

'''Squatting''' may refer to [[Squatting|squat]], the act of occupying an abandoned or unused property without legal permission.

== Foo ==

[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right)]]

Lorem ipsum dolor sit [[amet]], consectetur adipiscing elit. Vestibulum interdum, neque nec aliquet venenatis, tortor erat commodo nulla, id imperdiet mi urna eget nunc.

== References ==
* {{cite book
  |last=Avrich |first=Paul |author-link=Paul Avrich
  |title=[[Anarchist Voices: An Oral History of Anarchism in America]]
  |year=1996 |publisher=[[Princeton University Press]]
  |isbn=978-0-691-04494-1
  }}

[[:Category:Anarchism by country|Anarchism by country]]

...it outputs:

['Squatting', 'John Zerzan', 'amet', ':Category:Anarchism by country']

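Unfortunately, mwparserfromhell doesn't recognize namespaces, so if you use it you'll have to check for File links yourself. I used a crude .startswith('File') in the function above, but you'll probably want a better check, since namespace names are case-insensitive: file and fIlE are both valid and mean the same as File.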
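One possible shape for such a check is sketched below. The helper is_file_link and the FILE_NAMESPACES set are hypothetical names, not part of mwparserfromhell, and the set assumes English Wikipedia, where Image is a legacy alias for File; other wikis use localized namespace names that would need to be added.

```python
# Assumption: namespace names for English Wikipedia only.
# "image" is a legacy alias of "file" there.
FILE_NAMESPACES = {"file", "image"}

def is_file_link(title):
    """Return True if a link title is in the File namespace."""
    title = str(title).strip()
    # A leading colon, as in [[:File:X.jpg]], makes an ordinary link
    # to the file page, so it is not treated as a File link here.
    if title.startswith(":"):
        return False
    namespace, sep, _ = title.partition(":")
    # Require an actual "Namespace:" prefix, so an article title such as
    # "Filesystem" is not mistaken for a File link.
    return bool(sep) and namespace.strip().lower() in FILE_NAMESPACES
```

In the function above, link.title.startswith('File') could then be replaced with is_file_link(link.title).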
