javascript 确定正则表达式的最短可能匹配

x6h2sr28  于 2023-01-24  发布在  Java
关注(0)|答案(2)|浏览(205)
    • bounty将在5天后过期**。回答此问题可获得+50声望奖励。cael ras希望引起更多人关注此问题。

我有一个随机排列的正则表达式数组,如下所示:

let patterns = [
    /foo+ba+r/,
    /foo/,
    /foo+bar/,
    /foobar/,
    /m[eo]{4,}w/,
    /boo/,
    /fooo*/,
    /meow/
]

我不确定这是否可行,但我想写一个算法,将正则表达式从最不贪婪到最贪婪排序,如下所示:

[
    /foo/,
    /boo/,
    /fooo*/,
    /meow/,
    /foobar/,
    /foo+bar/,
    /m[eo]{4,}w/,
    /foo+ba+r/
]

我可以想象这样的排序可以这样实现:

patterns.sort((p1, p2) { return p1.greediness() - p2.greediness() });

但是在RegExpr类中不存在名为greediness的方法。
理想情况下,greediness方法将返回最少可能匹配的字符数,即:

/foo/.greediness() == 3
/boo/.greediness() == 3
/fooo*/.greediness() == 3
/meow/.greediness() == 4
/foobar/.greediness() == 6
/foo+bar/.greediness() == 6
/m[eo]{4,}w/.greediness() == 6
/foo+ba+r/.greediness() == 6

你对这个问题的解决方案是什么?

u5rb5r59

u5rb5r591#

这确实是一个非常困难的问题,它需要解析正则表达式的能力(如果正则表达式的规则被简化,只允许输入"正则"字符和特殊的regex特殊字符()[]|*+?,那么构造这样一个解析器就不会太困难了)。
1.将正则表达式转换为非确定性有限自动机(NFA)。这一步需要一个好的正则表达式解析器,但一旦你有了它,NFA的构造就很简单了。当然,如果你能找到一个现成的正则表达式来实现NFA,那就太理想了。
1.构造NFA的有向加权图表示,对表示字符转换的边赋予权重1,对表示ε转换的边赋予权重0。
1.使用Dijkstra算法找到从NFA的初始状态到最终状态的最短路径。
让我们以正则表达式m[eo]{2,}w为例,将其转换为一个NFA,并在适当的边上标记权重,在边下标记导致状态转换的字符,我们得到:

如果一条边由[from-state, to-state, weight]组成的元素的长度为3的数组定义,则上述有向图的边的数组将是:

const edges = [
    [0, 1, 1],
    [1, 2, 0],
    [1, 3, 0],
    [2, 4, 1],
    [3, 5, 1],
    [4, 6, 0],
    [5, 6, 0],
    [6, 7, 0],
    [6, 8, 0],
    [7, 9, 1],
    [8, 10, 1],
    [9, 11, 0],
    [10, 11, 0],
    [11, 6, 0],
    [11, 12, 1]
];

应用Dijkstra算法以获得从状态0到状态12的最短路径产生长度为4的以下路径:
0 -> 1 -> 3 -> 5 -> 6 -> 8 -> 10 -> 11 -> 12
因此正则表达式识别的最短字符串将是4。
因此,现在您需要做的就是找到或编写一个JavaScript正则表达式,用于NFA算法和Dijkstra算法。

    • 更新**

如果你正在创建自己的正则表达式解析器,那么你实际上可以跳过创建NFA和Dijkstra算法,而是计算长度。下面的内容并不意味着是一个完整的解析器。例如,它不支持命名组,并且它只识别基本的"stuff"。
See Demo

/*
Grammar (my own extended BNF notation):

E -> T E'
E' -> '|' E | epsilon
T -> START F END
START -> '^' | epsilon
END -> '$' | epsilon
F -> (SINGLE_CHAR FOLLOWER) | '('['?:']? E ')' FOLLOWER))+
SINGLE_CHAR -> CHAR | [ BRACKET_CHARS ]
FOLLOWER -> EXPRESSION_FOLLOWER NON-GREEDY
BRACKET_CHARS -> BRACKET_CHAR BRACKET_CHARS | epsilon
BRACKET_CHAR -> CHAR | HYPHEN
EXPRESSION_FOLLOWER -> '*' | '+' | '?' | {number[,number]} | epsilon
GREEDY -> '?' | epsilon
*/

const EOF = 0;
const CHAR = 1;
const ASTERISK = 2;
const PLUS = 3;
const CARAT = 4;
const LEFT_BRACE = 5;
const RIGHT_BRACE = 6;
const COMMA = 7;
const QUESTION_MARK = 8;
const LEFT_BRACKET = 9;
const RIGHT_BRACKET = 10;
const LEFT_PARENTHESIS = 11;
const RIGHT_PARENTHESIS = 12;
const DOLLAR_SIGN = 13;
const OR = 14;

let current_char;
let current_token = null;
let tokenizer;

function* lexer(s) {
    const l = s.length;
    let i = 0;
    while(i < l) {
        current_char = s[i];
        //console.log(actual_char);
        switch(current_char) {
            case '?': yield QUESTION_MARK; break;
            case '*': yield ASTERISK; break;
            case '+': yield PLUS; break;
            case '{': yield LEFT_BRACE; break;
            case '}': yield RIGHT_BRACE; break;
            case '[': yield LEFT_BRACKET; break;
            case ']': yield RIGHT_BRACKET; break;
            case '(': yield LEFT_PARENTHESIS; break;
            case ')': yield RIGHT_PARENTHESIS; break;
            case ',': yield COMMA; break;
            case '^': yield CARAT; break;
            case '$': yield DOLLAR_SIGN; break;
            case '|': yield OR; break;
            case '\\': yield CHAR; i += 1; break;
            default: yield CHAR; break;
        }
        i += 1;
    }
    yield EOF;
}

function START() {
    if (current_token === CARAT) {
        current_token = tokenizer.next().value;
    }
}

function END() {
    if (current_token === DOLLAR_SIGN) {
        current_token = tokenizer.next().value;
    }
}

function FOLLOWER() {
    // return a multiplier

    if (current_token === QUESTION_MARK || current_token === ASTERISK || current_token === PLUS) {
        const l = current_token === PLUS ? 1 : 0;
        current_token = tokenizer.next().value;
        if (current_token === QUESTION_MARK) { // non-greedy
            current_token = tokenizer.next().value;
        }
        return l;
    }

    if (current_token === LEFT_BRACE) {
        current_token = tokenizer.next().value;
        let s = '';
        while (current_token !== RIGHT_BRACE && current_token !== EOF) {
            if (current_token === EOF) {
                throw 'syntax error';
            }
            s += current_char;
            current_token = tokenizer.next().value;
        }
        current_token = tokenizer.next().value;
        const matches = s.match(/^(\d)+(,\d*)?$/);
        if (matches === null) {
            throw 'synatx error';
        }
        return parseInt(matches[0]);
    }

    return 1;
}

function F() {
    let l = 0;

    do {
        if (current_token === CHAR || current_token === LEFT_BRACKET) {
            if (current_token == CHAR) {
                current_token = tokenizer.next().value;
            }
            else {
                current_token = tokenizer.next().value;
                if (current_token === CARAT) {
                    current_token = tokenizer.next().value;
                }
                while (current_token !== RIGHT_BRACKET && current_token !== EOF) {
                    current_token = tokenizer.next().value;
                }
                if (current_token !== RIGHT_BRACKET) {
                    throw 'syntax error';
                }
                current_token = tokenizer.next().value;
            }
            const multiplier = FOLLOWER();
            l += multiplier;
        }
        else if (current_token === LEFT_PARENTHESIS) {
            current_token = tokenizer.next().value;
            if (current_token === QUESTION_MARK) { // non-capturing group
                current_token = tokenizer.next().value;
                if (current_token !== CHAR || current_char !== ':') {
                    throw 'syntax error';
                }
                current_token = tokenizer.next().value;
            }
            const this_l = E();
            if (current_token !== RIGHT_PARENTHESIS) {
                throw 'synatx error';
            }
            current_token = tokenizer.next().value;
            const multiplier = FOLLOWER();
            l += this_l * multiplier;
        }
        else {
            throw 'syntax error';
        }
    } while (current_token === CHAR || current_token === LEFT_BRACKET || current_token === LEFT_PARENTHESIS);

    return l;
}

function T() {
    START();
    const l = F();
    END();
    return l;
}

function E() {
    let min_l = T();
    while (current_token === OR) {
        current_token = tokenizer.next().value;
        const l = T();
        if (l < min_l) {
            min_l = l;
        }
    }
    return min_l;
}

function parse(s) {
    tokenizer = lexer(s);
    current_token = tokenizer.next().value;
    const l = E();
    if (current_token !== EOF) {
        throw 'syntax error';
    }
    return l;
}

let patterns = [
  /foo+ba+r/,
  /foo+ba+r/,
  /foo/,
  /foo+bar/,
  /foobar/,
  /m([eo]{4,})w/,
  /m(?:[eo]{4,})w/,
  /boo/,
  /fooo*/,
  /meow/
];

RegExp.prototype.greediness = function () {
  return parse(this.source);
};

for (const pattern of patterns) {
    console.log(`pattern = ${pattern.source}, greediness = ${pattern.greediness()}`);
}

图纸:

pattern = foo+ba+r, greediness = 6
pattern = foo+ba+r, greediness = 6
pattern = foo, greediness = 3
pattern = foo+bar, greediness = 6
pattern = foobar, greediness = 6
pattern = m([eo]{4,})w, greediness = 6
pattern = m(?:[eo]{4,})w, greediness = 6
pattern = boo, greediness = 3
pattern = fooo*, greediness = 3
pattern = meow, greediness = 4
sg3maiej

sg3maiej2#

正如Pointy在评论中所说,这是一个难题。
下面是解决方案的开始:

const greediness = (s) =>
  s .toString () .slice (1, -1)
    .replace (/\[[^\]]+]/g, 'X')
    .replace (/.\{((\d+)(,\d*))\}/g, (s, a, ds, _) => 'X' .repeat (Number (ds)))
    .replace (/.\+/g, 'X')
    .replace (/.\?/g, '')
    .replace (/.\*/g, '')
    .length

const sortByGreediness = (patterns) =>
  [...patterns] .sort ((a, b) => greediness (a) - greediness (b))
    // .map (s => [s, greediness (s)])  // to show sizes

const patterns = [/foo+ba+r/, /foo/, /foo+bar/, /foobar/, /m[eo]{4,}w/, /boo/, /fooo*/, /meow/]

console .log (sortByGreediness (patterns))

我们只需要将正则表达式的文本用可能匹配的最少字符数替换量词及其前面的字符,对于[eo]X{4,}这样的块,我们做了类似的操作。
我们可能会经历这样的步骤:

m[eo]{4,}wp+u+r?r*
mX{4,}wp+u+r?r*
mXXXXwp+u+r?r*
mXXXXXwXr?r*
mXXXXXwXr*
mXXXXXwX
  - length 7

但这并没有触及正则表达式内部的复杂性,甚至没有尝试处理捕获组。我认为这几乎不可能在完整的正则表达式规范中实现,但也许这可以扩展到您所需要的。
(If你会变得更复杂,你可能想用类似下面的代码重复做这个,或者用一个while循环代替它的递归。

const greediness = (s, next = s .toString () .slice (1, -1), prev = null) =>
  next === prev
    ? next .length
    : greediness (s, next
        .replace (/\[[^\]]+]/g, 'X')
        .replace (/.\{((\d+)(,\d*))\}/g, (s, a, ds, _) => 'X' .repeat (Number (ds)))
        .replace (/.\+/g, 'X')
        .replace (/.\?/g, '')
        .replace (/.\*/g, '')
      , next)

相关问题