regex 比较golang和node.js之间的正则表达式性能

rfbsl7qr 于 2023-04-13 发布在 Go

关注(0)|答案(2)|浏览(205)

我想比较使用非贪婪匹配器.*?和负查找（例如[^/]+/）之间的正则表达式性能，并编写了以下node.js脚本：

const URL = "https://this-is-some-hostname-test/login?ReturnUrl=/"

const regex1 = new RegExp("^https?://.*?/(.*)");
const regex2 = new RegExp("^https?://[^/]+/(.*)$");

const iterations = 10000000;

let start = new Date();
for (let i = 0; i < iterations; i++) {
  regex1.test(URL);
}
console.log("Time taken for non-greedy match: %s", (new Date()).getTime() - start);

start = new Date();
for (let i = 0; i < iterations; i++) {
  regex2.test(URL);
}
console.log("Time taken for negative match: %s", (new Date()).getTime() - start);

这将报告以下结果：

node regexTest.js 
Time taken for non-greedy match: 386
Time taken for negative match: 305

然后我想，让我们检查一下，仅仅是为了它，在golang中也是一样的。我期待golang比node.js快得多，但令我惊讶的是，它实际上慢了很多-事实上慢了2个数量级！
下面是我的golang实现：

package main

import (
    "fmt"
    "regexp"
    "time"
)

func main() {
    URL := "https://this-is-some-hostname-test/login?ReturnUrl=/"
    URLBytes := []byte(URL)

    regex1 := regexp.MustCompile("^https?://.*?/(.*)")
    regex2 := regexp.MustCompile("^https?://[^/]+/(.*)$")

    iterations := 10000000

    start := time.Now()
    for i := 0; i < iterations; i++ {
        regex1.Match(URLBytes)
    }
    fmt.Printf("Time taken for non-greedy match: %s\n", time.Since(start))

    start = time.Now()
    for i := 0; i < iterations; i++ {
        regex2.Match(URLBytes)
    }
    fmt.Printf("Time taken for negative match: %s\n", time.Since(start))
}

结果是：

go run main.go 
Time taken for non-greedy match: 8.041351917s
Time taken for negative match: 7.905017458s

我一开始尝试了regex.Compile和regex.MatchString，但后来决定使用MustCompile和Match([]byte)来消除任何错误检查和类型转换。

regex

来源：https://stackoverflow.com/questions/75998348/compare-regex-performance-between-golang-and-node-js

2条答案

按热度按时间

h22fl7wq1#

在Go中，使用Go标准库测试包进行基准测试（ns是纳秒）。

$ go test rx_test.go -run=! -bench=. -benchmem
Benchmark1-8  1546998  779.8 ns/op  0 B/op  0 allocs/op
Benchmark2-8  1378963  867.9 ns/op  0 B/op  0 allocs/op
$

与Node.js的时间相当。

526.2 ns/op
512.7 ns/op

rx_test.go：

package main

import (
    "regexp"
    "testing"
)

var regex1 = regexp.MustCompile("^https?://.*?/(.*)")
var regex2 = regexp.MustCompile("^https?://[^/]+/(.*)$")

var URL = []byte(`https://this-is-some-hostname-test/login?ReturnUrl=/`)

var sinkMatch bool

func Benchmark1(b *testing.B) {
    var match bool
    for i := 0; i < b.N; i++ {
        match = regex1.Match(URL)
    }
    sinkMatch = match
}

func Benchmark2(b *testing.B) {
    var match bool
    for i := 0; i < b.N; i++ {
        match = regex2.Match(URL)
    }
    sinkMatch = match
}

赞(0）回复(0）举报 2023-04-13

e5nqia272#

Go正则表达式很慢，但并没有那么慢。
你的基准测试有很多问题可以解释这一点。主要的一个是确保编译器不会编译掉你的代码！
因为regex1、regex2和URL都是Node.js中的常量...

for (let i = 0; i < iterations; i++) {
  regex1.test(URL);
}

。。。可以优化为。。

match := regex1.test(URL);
for (let i = 0; i < iterations; i++) {
  match
}

而且因为没有对匹配做任何事情，所以整个循环可以被优化掉。
它们在Go基准测试中不是常量。
为了避免这些问题。

使用非常量输入，例如来自文件的输入。
对结果执行某些操作，例如递增计数器。

我使用this list of URLs作为输入。

const fs = require('fs');

const regex1 = new RegExp("^https?://.*?/(.*)");
const regex2 = new RegExp("^https?://[^/]+/(.*)$");

const iterations = 1000;

var lines = fs.readFileSync('sample.txt').toString().split("\n");

let start = new Date();
let counter = 0;
for (let i = 0; i < iterations; i++) {
  for(j in lines) {
    if( regex1.test(lines[j]) ) {
      counter++;
    }
  }
}
let end = new Date();
console.log("Time taken for non-greedy match to match %d: %s ms", counter, end - start);

start = new Date();
counter = 0;
for (let i = 0; i < iterations; i++) {
  for(j in lines) {
    if( regex1.test(lines[j]) ) {
      counter++;
    }
  }
}
end = new Date();
console.log("Time taken for greedy match to match %d: %s ms", counter, end - start);

package main

import (
    "fmt"
    "time"
    "bufio"
    "os"
    "regexp"
)

func main() {
    file, _ := os.Open("sample.txt")
    defer file.Close()

    var lines []string
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }

    regex1 := regexp.MustCompile("^https?://.*?/(.*)")
    regex2 := regexp.MustCompile("^https?://[^/]+/(.*)$")

    const iterations = 1000

    start := time.Now()
    counter := 0
    for i := 0; i < iterations; i++ {
      for _, line := range lines {
        if regex1.MatchString(line) {
          counter++;
        }
      }
    }
    fmt.Printf("Time taken for non-greedy to match %d: %s\n", counter, time.Since(start))

    start = time.Now()
    counter = 0
    for i := 0; i < iterations; i++ {
      for _, line := range lines {
        if regex2.MatchString(line) {
          counter++;
        }
      }
    }
    fmt.Printf("Time taken for negative match %d: %s\n", counter, time.Since(start))
}

这使得150 ms比950 ms更合理。仍然令人惊讶，但在置信范围内。正则表达式引擎通常是用C编写和优化的，所以这与语言无关，而与正则表达式引擎的优化程度有关。
此外，“挂钟”时间不适合用于基准测试。使用单时钟。您的Go基准测试使用单时钟，而您的Node使用挂钟。我在这里没有解决这个问题。

赞(0）回复(0）举报 2023-04-13

我来回答

regex 比较golang和node.js之间的正则表达式性能

2条答案

相关问题

热门标签

最新问答