regex: Regular expression performance problem when trying to get matching snippets from a long text

yzckvree  posted on 2023-08-08  in  Other

I'm trying to get a matched word along with some of the words before and after it (up to 5 on each side) using the regex below:

const start = Date.now();
console.log("There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).".match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu));

console.log(Date.now() - start);

This seems to be very slow: it takes hundreds of milliseconds. Is there something I'm missing that would improve the performance?

z9smfwbn

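It seems the OP wants to take up to 5 words before and after lorem. Both \S* and \s* can match the empty string, so the engine has many ways to split the same text across the repetitions and has to backtrack through them wherever lorem does not follow. If we change \S*\s* to \S+\s+, the regex becomes about 100 times faster. The OP's regex also fails because of the single quote: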

and a search for 'lorem ipsum' will uncover many web

My regex failed on it as well, so I added \S*lorem\S*.
We can also skip capturing the words by using non-capturing groups.
The final regex:

/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu
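
As a quick sanity check of the final pattern, here is a minimal sketch that runs it against the quoted fragment above. The helper name `snippetsAround` and the second sample string are made up for illustration and are not part of the original answer.

```javascript
// Hypothetical helper (not from the original answer): returns every occurrence
// of "lorem" together with up to five whitespace-separated words on each side.
function snippetsAround(text) {
  const re = /(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu;
  // String.prototype.match returns null when nothing matches, so fall back to [].
  return text.match(re) ?? [];
}

// The fragment with the single quote directly in front of "lorem".
const quoted = "and a search for 'lorem ipsum' will uncover many web";
console.log(snippetsAround(quoted));
// → [ "and a search for 'lorem ipsum' will uncover many web" ]

// Made-up sample with more than five words on each side of the match.
const longer = "one two three four five six seven Lorem a b c d e f g";
console.log(snippetsAround(longer));
// → [ "three four five six seven Lorem a b c d e" ]
```

The `\S*` on both sides of `lorem` is what lets the match succeed when punctuation is glued to the word, as in `'lorem`.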


<script benchmark data-count="100">

const str = "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like). and lorem again.";

// @benchmark original
str.match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu)

// @benchmark Alexander
str.match(/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu)

</script>
<script src="https://cdn.jsdelivr.net/gh/silentmantra/benchmark/loader.js"></script>
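
The snippet above relies on an external benchmark loader. If that isn't available, a rough comparison can be sketched with plain `Date.now()`, the same way the question times its call. The helper `timeIt`, the run count, and the shortened sample text are assumptions for illustration; the numbers will vary by engine and machine, so treat them as a rough indication only.

```javascript
// Hypothetical timing helper (not from the original answer): calls `fn`
// `runs` times and returns the elapsed wall-clock time in milliseconds.
function timeIt(fn, runs = 5) {
  const start = Date.now();
  for (let i = 0; i < runs; i++) fn();
  return Date.now() - start;
}

// Sample text: one sentence from the question, repeated (an arbitrary choice).
const sample = "It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. ".repeat(5);

const original = /(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu;
const revised = /(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu;

// match() with a global regex starts from index 0 on every call,
// so reusing the same regex objects across runs is safe here.
console.log("original ms:", timeIt(() => sample.match(original)));
console.log("revised ms:", timeIt(() => sample.match(revised)));
```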
