JavaScript regex: scanning the text that corresponds to a specific label

aurhwmvo  posted on 2023-03-13 in Java

I have the following text:

This is a code update

* Official Name:  Noner

* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021

* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055

* Draft Doc Title: 

 WSR 23-02-055 (#1)

* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)

* Final Doc Title: 

   IECC Com Update(#1)

   IECC Res Update (#2)

   IECC Res Update (#3)

* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)

* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)

https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)

* Effective Date: January 4, 2023

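I want to extract the information that corresponds to the label 'Reference:', but the code below gives me only one line. I want to scan all of the text until it reaches an asterisk.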

//Extract Reference    
var reference = description.search("Reference:");
if(reference != -1){
  reference = description.match(/(?<=^\* Reference\s*:)[\s]*[\n]*[^\n\r]*/m);  
  reference  = reference?.[0].trim();   
}else{
  reference = '';
}
console.log('Reference: ' + reference);

Expected output:

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

z9smfwbn1#

You can use the following regular expression:

(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)

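It matches:

  • (?:^|\n)\*\s*Reference:\s*: "* Reference:" at the start of a line
  • ([\s\S]*?): as few characters as possible, captured in group 1
  • (?=\s*\n\*|$): positive lookahead for whitespace, a newline and *, or the end of the string

Regex demo on regex101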

const text = `* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055
`

const reference = text.match(/(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)/)?.[1] ?? ''

console.log(reference)
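
The same pattern can be built for any label rather than only "Reference". The following is a minimal sketch under that assumption; the helper name `extractField` and the example.com URLs are hypothetical and only serve as an illustration.

```javascript
// Hypothetical helper (not part of the answer above): extract the body of one "* Label:" item.
function extractField(text, label) {
  // Escape regex metacharacters so the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Same structure as the pattern above, with the label substituted in.
  const pattern = new RegExp(
    '(?:^|\\n)\\*\\s*' + escaped + ':\\s*([\\s\\S]*?)(?=\\s*\\n\\*|$)'
  );
  // Group 1 holds the item body; fall back to '' when the label is absent.
  return text.match(pattern)?.[1] ?? '';
}

const sample = [
  '* Agency:  Agency Ni',
  '',
  '* Reference: ',
  '',
  'https://example.com/a',
  '',
  'https://example.com/b (#1)',
  '',
  '* Citation: WAC 51-52 / WSR 23-02-055',
].join('\n');

console.log(extractField(sample, 'Reference')); // both URL lines
console.log(extractField(sample, 'Citation'));  // last item, ends at the end of the string
console.log(extractField(sample, 'Missing'));   // label not present, empty string
```

As with the original pattern, an item with empty content is not special-cased: the `\s*` after the colon consumes the blank lines, so the capture can run on into the next item.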

fquxozlt2#

You can simply use lookarounds:

(?<=Reference: )(.|\n)*?(?=\*)

Then trim the output.
Code example for a playground:

const text = `This is a code update

* Official Name:  Noner

* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021

* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055

* Draft Doc Title: 

 WSR 23-02-055 (#1)

* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)

* Final Doc Title: 

   IECC Com Update(#1)

   IECC Res Update (#2)

   IECC Res Update (#3)

* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)

* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)

https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)

* Effective Date: January 4, 2023`

const patt = /(?<=Reference: )(.|\n)*?(?=\*)/gm
console.log(text.match(patt)?.[0].trim())
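
The `(.|\n)` alternation can also be written as a plain `.` with the `s` (dotAll) flag. The sketch below assumes an engine with ES2018 support (lookbehind and dotAll); it also drops the single space required after "Reference:" and leaves whitespace handling to `trim()`. The sample text and its example.com URLs are made up for illustration.

```javascript
// Sketch: the lookaround idea written with the `s` (dotAll) flag,
// so that `.` also matches line breaks.
const dotAllSample = [
  '* Agency:  Agency Ni',
  '',
  '* Reference: ',
  '',
  'https://example.com/a',
  '',
  'https://example.com/b (#1)',
  '',
  '* Citation: WAC 51-52 / WSR 23-02-055',
].join('\n');

// Lazily take everything after "Reference:" up to the next asterisk.
const dotAllPattern = /(?<=Reference:).*?(?=\*)/s;

// trim() removes the whitespace left around the captured block.
const dotAllReference = dotAllSample.match(dotAllPattern)?.[0].trim() ?? '';
console.log(dotAllReference);
```

Like the pattern above, this stops at the first asterisk it meets, wherever that asterisk sits on its line.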

sxissh063#

Try this:

const match = description.match(/(?<=\* Reference\s*:)[\s\S]*?\*/)[0].slice(0, -1);
console.log(match)

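This regex uses the lookbehind from your question, but it matches any character (including newlines) up to the next * character. I use .slice() to drop the last character, because the regex also matches that final *.
I benchmarked this code against the other answers and found it to be the fastest, by roughly 50% (see benchmark here).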
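
The one-liner above throws a TypeError when `match()` returns null. A guarded variant might look like the sketch below; the function name `extractReference` is hypothetical, and the added `trim()` is an extra step that the original line does not perform.

```javascript
// Hypothetical wrapper around the pattern above; the name is illustrative only.
function extractReference(description) {
  const match = description.match(/(?<=\* Reference\s*:)[\s\S]*?\*/);
  // match is null when there is no "* Reference:" item,
  // or when no asterisk follows it (for example, when it is the last item).
  if (!match) return '';
  // Drop the trailing "*" that the pattern consumes, then trim whitespace.
  return match[0].slice(0, -1).trim();
}

console.log(extractReference('* Reference: \n\nhttps://example.com/a\n\n* Citation: X'));
console.log(extractReference('no labelled items here')); // empty string instead of an exception
```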

ngynwnxp4#

Using JavaScript:

(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)

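Explanation

  • (?<= Positive lookbehind, asserts that to the left of the current position is:
  • ^\* Reference\s*:\s* Match "* Reference" at the start of a line (the pattern is used with the m flag), then ":" with optional whitespace characters on either side
  • ) Close the lookbehind
  • \S Match a non-whitespace character
  • [^]*? Match any character, including newlines, as few times as possible
  • (?= Positive lookahead, asserts that to the right is:
  • ^\s*\* Match optional whitespace characters at the start of a line, followed by *
  • ) Close the lookahead

See the regex demo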

const regex = /(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)/gm;
const s = `This is a code update

* Official Name:  Noner

* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021

* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055

* Draft Doc Title: 

 WSR 23-02-055 (#1)

* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)

* Final Doc Title: 

   IECC Com Update(#1)

   IECC Res Update (#2)

   IECC Res Update (#3)

* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)

* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)

https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)

* Effective Date: January 4, 2023
`;

const m = s.match(regex);
if (m) console.log(m[0]);
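
Because the lookahead is anchored with `^`, an asterisk in the middle of a content line does not end the match. The sketch below contrasts this with an unanchored `(?=\*)` lookahead; the sample text is made up for illustration.

```javascript
// Made-up sample: the Reference content itself contains an asterisk mid-line.
const starSample = [
  '* Reference: ',
  '',
  'see note * on page 3',
  'https://example.com/a',
  '',
  '* Citation: WAC 51-52',
].join('\n');

// Line-anchored lookahead: only a "*" that starts a line ends the match.
const anchored = /(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)/m;
// Unanchored lookahead: any "*" ends the match.
const unanchored = /(?<=Reference: )(.|\n)*?(?=\*)/;

const anchoredResult = starSample.match(anchored)?.[0].trim() ?? '';
const unanchoredResult = starSample.match(unanchored)?.[0].trim() ?? '';

console.log(anchoredResult);   // keeps the whole content, including the inner "*"
console.log(unanchoredResult); // cut off at the inner "*"
```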

i86rm4rw5#

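I decided to follow @Nick's idea and make no assumptions about the "subject" string.
I propose two more tolerant approaches, in the sense that they also work:

  • when there is no Reference item (they return an empty string),
  • when the Reference item has empty content,
  • and when the Reference item's content is at the end of the string (so that no other item follows it).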

The first approach works in all cases, whatever the content:

let ref_pat = /^\* Reference:\s*(.*\S(?:\s+.*\S)*?)??\s*(?:^\*|(?![\s\S]))/m;
let reference = description.match(ref_pat)?.[1] ?? '';

If you assume that the content contains no asterisk characters, you can use a second, more efficient pattern:

let ref_pat = /^\* Reference:\s*([^*]*[^*\s])/m;
let reference = description.match(ref_pat)?.[1] ?? '';

That assumption is its only limitation, but this pattern is much simpler.
Whichever one you choose, the result is already trimmed.
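
The edge cases listed above can be checked with a few short strings. This is only a sketch: the names `strictPat`, `simplePat` and `extract`, and the sample strings, are made up for illustration.

```javascript
// The two patterns from this answer, under illustrative names.
const strictPat = /^\* Reference:\s*(.*\S(?:\s+.*\S)*?)??\s*(?:^\*|(?![\s\S]))/m;
const simplePat = /^\* Reference:\s*([^*]*[^*\s])/m;

// Group 1 is undefined (or the match is null) when there is no content.
const extract = (pat, text) => text.match(pat)?.[1] ?? '';

const cases = {
  missing: '* Agency: Agency Ni\n\n* Citation: WAC 51-52',
  empty: '* Reference: \n\n* Citation: WAC 51-52',
  last: '* Agency: Agency Ni\n\n* Reference: \n\nhttps://example.com/a\n',
  withStar: '* Reference: see note * below\n\n* Citation: WAC 51-52',
};

for (const [name, text] of Object.entries(cases)) {
  console.log(name, JSON.stringify(extract(strictPat, text)), JSON.stringify(extract(simplePat, text)));
}
```

The two patterns only differ on the `withStar` case, where the second one stops before the asterisk inside the content.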
