Unicode flag not working with RegEx in JavaScript

unftdfkk  posted on 2022-12-01  in  Java

My code fails to detect the operators when they are used alongside non-English characters:

const OPERATOR_REGEX = new RegExp(
  /(?!\B"[^"|“|”]*)\b(and|or|not|exclude)(?=.*[\s])\b(?![^"|“|”]*"\B)/,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));

https://codepen.io/thewebtud/pen/vYraavd?editors=1111
The same pattern with the unicode flag, however, detects all the operators on regex101.com: https://regex101.com/r/FC84BH/1
How can I fix this for JS?

waxmsbnn1#

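In JavaScript, \b and \w stay ASCII-only even with the u flag, so a character such as 化 counts as a non-word character. Keep in mind that

  • \b (word boundary) can be written as (?:(?<=^)(?=\w)|(?<=\w)(?=$)|(?<=\W)(?=\w)|(?<=\w)(?=\W)), and
  • \B (non-word boundary) can be written as (?:(?<=^)(?=\W)|(?<=\W)(?=$)|(?<=\W)(?=\W)|(?<=\w)(?=\w))

and that the Unicode-aware \w pattern is [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see "Replace certain arabic words in text string using Javascript"). With those building blocks, here is an ECMAScript 2018+ solution: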

const w = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const nw = String.raw`[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const uwb = String.raw`(?:(?<=^)(?=${w})|(?<=${w})(?=$)|(?<=${nw})(?=${w})|(?<=${w})(?=${nw}))`;
const unwb = String.raw`(?:(?<=^)(?=${nw})|(?<=${nw})(?=$)|(?<=${nw})(?=${nw})|(?<=${w})(?=${w}))`;

const OPERATOR_REGEX = new RegExp(
  String.raw`(?!${unwb}"[^"“”]*)${uwb}(and|or|not|exclude)(?=.*\s)${uwb}(?![^"“”]*"${unwb})`,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));
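To see why the original pattern misfires, it may help to check how \w, \b and \B treat a CJK character on their own. The snippet below is only a small sketch: the names W, UWB and unicodeOr are hypothetical, and UWB is a compact lookaround form of the Unicode-aware word boundary that should behave like the four-branch version above, not the exact expression used in the answer.

```javascript
// Sketch: \w, \b and \B are ASCII-based in JavaScript, even with the u flag.
const cjk = '化';

// 化 is not a word character for \w ...
const matchesAsciiWord = /\w/u.test(cjk);
// ... but a Unicode property escape recognises it as a letter.
const matchesUnicodeLetter = /\p{Alphabetic}/u.test(cjk);

// Right after an opening quote, \B matches before 化 (both sides are "non-word"),
// but not before J. This is why the quote check in the question goes wrong.
const nonBoundaryBeforeCjk = /^"\B/u.test('"化粧"');
const nonBoundaryBeforeLatin = /^"\B/u.test('"Java"');

// Hypothetical helper names, used only for this illustration.
const W = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
// Compact Unicode-aware boundary: a word character on exactly one side.
const UWB = String.raw`(?:(?<!${W})(?=${W})|(?<=${W})(?!${W}))`;
const unicodeOr = new RegExp(`${UWB}or${UWB}`, 'u');

// ASCII \b sees a boundary between 化 and "or"; the Unicode-aware one does not.
const asciiFindsOr = /\bor\b/u.test('化or化');
const unicodeFindsGluedOr = unicodeOr.test('化or化');
const unicodeFindsSpacedOr = unicodeOr.test('化 or 化');

console.log({
  matchesAsciiWord,
  matchesUnicodeLetter,
  nonBoundaryBeforeCjk,
  nonBoundaryBeforeLatin,
  asciiFindsOr,
  unicodeFindsGluedOr,
  unicodeFindsSpacedOr,
});
```

Lookbehind and \p{...} property escapes both require an ES2018-capable engine, and \p{...} only works when the u flag is set.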
