runtime: parallel JSON benchmark has higher CPU usage on ARM64 with Go versions > 1.15

azpvetkf · posted 23 days ago in Go

I have found a CPU performance regression on the ARM64 platform: compared with the old Go 1.15, the parallel JSON benchmarks on recent Go releases show more than 50% higher CPU usage.

The problem is easy to reproduce on ARM64 (a sketch of what these parallel benchmarks look like is given after the steps):

  1. Install the golang package
  2. Download go-json-bench: https://github.com/kpango/go-json-bench
  3. cd go-json-bench && make bench and watch the CPU usage
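
These are ordinary testing.B benchmarks, and the "Parallel" variants are presumably driven by b.RunParallel, so the test process fans out across all of GOMAXPROCS (the "-128" suffix in the result names). The sketch below only illustrates that shape; the struct and input data are placeholders, not code taken from go-json-bench.

// bench_sketch_test.go — rough shape of a parallel JSON unmarshal benchmark.
// The payload type and input are placeholders, not go-json-bench's own code.
package bench

import (
	"encoding/json"
	"testing"
)

var input = []byte(`{"name":"bench","id":1,"flag":true}`)

type payload struct {
	Name string `json:"name"`
	ID   int    `json:"id"`
	Flag bool   `json:"flag"`
}

func BenchmarkJSONUnmarshalParallel(b *testing.B) {
	b.ReportAllocs()
	// RunParallel splits b.N iterations across GOMAXPROCS goroutines,
	// which is why the test process can occupy all 128 logical CPUs.
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			var v payload
			if err := json.Unmarshal(input, &v); err != nil {
				b.Error(err)
			}
		}
	})
}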

My machine configuration:

Architecture:           aarch64
  CPU op-mode(s):       64-bit
  Byte Order:           Little Endian
CPU(s):                 128
  On-line CPU(s) list:  0-127
Vendor ID:              HiSilicon
  BIOS Vendor ID:       HiSilicon
  Model name:           Kunpeng-920
    BIOS Model name:    Kunpeng 920-6426
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 64
    Socket(s):          2
    Stepping:           0x1
    Frequency boost:    disabled
    CPU max MHz:        2600.0000
    CPU min MHz:        200.0000
    BogoMIPS:           200.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
  L1d:                  8 MiB (128 instances)
  L1i:                  8 MiB (128 instances)
  L2:                   64 MiB (128 instances)
  L3:                   128 MiB (4 instances)
NUMA:
  NUMA node(s):         4
  NUMA node0 CPU(s):    0-31
  NUMA node1 CPU(s):    32-63
  NUMA node2 CPU(s):    64-95
  NUMA node3 CPU(s):    96-127

The benchmark results are as follows:

With Go 1.15.7, CPU usage is roughly 2000%.

With the latest Go, CPU usage is roughly 3000%.

Finally, a git bisect search shows that fe7ff71185cf is the first commit that introduces the regression, but I do not understand why this commit has such a large impact on CPU usage.
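
For reference, below is a rough, self-contained way to put a number on CPU utilization (process CPU-seconds divided by wall-clock seconds) instead of reading it off top. It is only a sketch: it assumes Linux (syscall.Getrusage), and the unmarshal loop is a stand-in for the real benchmark, not code from go-json-bench.

// cpuusage.go — rough CPU-utilization measurement around a JSON workload.
// A sketch only: assumes Linux and uses a stand-in workload.
package main

import (
	"encoding/json"
	"fmt"
	"sync"
	"syscall"
	"time"
)

// cpuSeconds returns user+system CPU time consumed by this process so far.
func cpuSeconds() float64 {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0
	}
	user := time.Duration(ru.Utime.Sec)*time.Second + time.Duration(ru.Utime.Usec)*time.Microsecond
	sys := time.Duration(ru.Stime.Sec)*time.Second + time.Duration(ru.Stime.Usec)*time.Microsecond
	return (user + sys).Seconds()
}

func main() {
	data := []byte(`{"name":"bench","id":1,"flag":true}`)

	cpu0, wall0 := cpuSeconds(), time.Now()

	// Unmarshal from many goroutines, similar in spirit to the parallel
	// benchmarks; on a 128-CPU box this keeps most cores busy.
	var wg sync.WaitGroup
	for i := 0; i < 128; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				var v map[string]interface{}
				_ = json.Unmarshal(data, &v)
			}
		}()
	}
	wg.Wait()

	cpu := cpuSeconds() - cpu0
	wall := time.Since(wall0).Seconds()
	// 100% per fully busy core, so ~12800% is the ceiling on this machine.
	fmt.Printf("CPU utilization: %.0f%%\n", 100*cpu/wall)
}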

cclgggtu

cclgggtu1#

Could you show the benchmark results using benchstat? Does the benchmark execution time change? Thanks.

6g8kf2rb

6g8kf2rb2#

Could you show the benchmark results using benchstat format? Does the benchmark execution time change? Thanks.
The benchmark result for go 1.15.7:

go test -count=10 -run=NONE -bench . -benchmem
goos: linux
goarch: arm64
pkg: github.com/kpango/go-json-bench
BenchmarkJSONUnmarshalParallel-128       1281096               833 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1486346               778 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1454241               719 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1371498               882 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1004 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128        726432              1708 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1335273               923 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1290433               938 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1275134               944 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1136780               909 ns/op             896 B/op          9 allocs/op
BenchmarkJSONDecodeParallel-128           988424              1171 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1118308              1120 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1030258              1236 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1040060              1128 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128           994029              1089 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1099704              1137 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1095230              1121 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1045125              1048 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1054767              1144 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1021688              1063 ns/op            1080 B/op         10 allocs/op

The benchmark results for the latest Go version:

go test -count=10 -run=NONE -bench . -benchmem
goos: linux
goarch: arm64
pkg: github.com/kpango/go-json-bench
BenchmarkJSONUnmarshalParallel-128        814312              3953 ns/op             913 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1026 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1221180               993.7 ns/op           912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1024 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1220218               986.9 ns/op           912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1021 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1006 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1205456              1022 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1232067              1000 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1204812               958.3 ns/op           912 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1261 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1210 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1220 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1235 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1271 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1278 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1265 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1260 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1247 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1262 ns/op            1096 B/op         11 allocs/op

The change in benchmark execution time, as shown by benchstat:

goos: linux
goarch: arm64
pkg: github.com/kpango/go-json-bench
                          │ go-1.15.7.txt │            go-latest.txt             │
                          │    sec/op     │    sec/op     vs base                │
JSONUnmarshalParallel-128    916.0n ± 15%   1013.5n ± 3%  +10.64% (p=0.004 n=10)
JSONDecodeParallel-128       1.125µ ±  5%    1.261µ ± 3%  +12.09% (p=0.000 n=10)
geomean                      1.015µ          1.130µ       +11.37%

                          │ go-1.15.7.txt │            go-latest.txt            │
                          │     B/op      │     B/op      vs base               │
JSONUnmarshalParallel-128      896.0 ± 0%     912.0 ± 0%  +1.79% (p=0.000 n=10)
JSONDecodeParallel-128       1.055Ki ± 0%   1.070Ki ± 0%  +1.48% (p=0.000 n=10)
geomean                        983.7          999.8       +1.63%

                          │ go-1.15.7.txt │            go-latest.txt            │
                          │   allocs/op   │  allocs/op   vs base                │
JSONUnmarshalParallel-128      9.000 ± 0%   10.000 ± 0%  +11.11% (p=0.000 n=10)
JSONDecodeParallel-128         10.00 ± 0%    11.00 ± 0%  +10.00% (p=0.000 n=10)
geomean                        9.487         10.49       +10.55%
sg2wtvxw

sg2wtvxw3#

Can anyone help explain this issue?

46scxncf

46scxncf4#

cc @golang/arm @golang/compiler

o2rvlv0m

o2rvlv0m5#

My test results are slightly different from yours.

  1. CPU utilization is very high whether these benchmarks run with go1.15 or tip go. With tip go it is about 50% higher than with go1.15, which is consistent with what you observed.
    go1.15: 1697%
    tip go: 2294%
  2. tip go's per-op performance is better than go1.15's.
# benchstat go115 tip
goos: linux
goarch: arm64
                         │    go115    │                 tip                  │
                         │   sec/op    │    sec/op      vs base               │
JSONUnmarshal-96           4.810µ ± 3%   3.613µ ±   4%  -24.89% (p=0.002 n=6)
JSONDecode-96              4.265µ ± 1%   3.894µ ±   4%   -8.70% (p=0.002 n=6)
JSONUnmarshalParallel-96   2.294µ ± 5%   1.118µ ± 130%        ~ (p=0.065 n=6)
JSONDecodeParallel-96      1.268µ ± 4%   1.281µ ±   4%        ~ (p=0.310 n=6)
geomean                    2.779µ        2.118µ         -23.79%

                         │    go115     │                 tip                 │
                         │     B/op     │     B/op      vs base               │
JSONUnmarshal-96            2432.0 ± 0%     912.0 ± 0%  -62.50% (p=0.002 n=6)
JSONDecode-96              1.055Ki ± 0%   1.070Ki ± 0%   +1.48% (p=0.002 n=6)
JSONUnmarshalParallel-96    2432.0 ± 0%     912.0 ± 0%  -62.50% (p=0.002 n=6)
JSONDecodeParallel-96      1.055Ki ± 0%   1.070Ki ± 0%   +1.48% (p=0.002 n=6)
geomean                    1.583Ki          999.8       -38.31%

                         │   go115    │                 tip                 │
                         │ allocs/op  │ allocs/op   vs base                 │
JSONUnmarshal-96           10.00 ± 0%   10.00 ± 0%        ~ (p=1.000 n=6) ¹
JSONDecode-96              10.00 ± 0%   11.00 ± 0%  +10.00% (p=0.002 n=6)
JSONUnmarshalParallel-96   10.00 ± 0%   10.00 ± 0%        ~ (p=1.000 n=6) ¹
JSONDecodeParallel-96      10.00 ± 0%   11.00 ± 0%  +10.00% (p=0.002 n=6)
geomean                    10.00        10.49        +4.88%
¹ all samples are equal

I have not dug into why the CPU utilization is so high, but the benchmark data itself looks reasonable?
My test machine is somewhat different from yours.

Architecture:           aarch64
  CPU op-mode(s):       64-bit
  Byte Order:           Little Endian
CPU(s):                 96
  On-line CPU(s) list:  0-95
Vendor ID:              HiSilicon
  Model name:           Kunpeng-920
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 48
    Socket(s):          2
    Stepping:           0x1
    BogoMIPS:           200.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
  L1d:                  6 MiB (96 instances)
  L1i:                  6 MiB (96 instances)
  L2:                   48 MiB (96 instances)
  L3:                   96 MiB (4 instances)
NUMA:
  NUMA node(s):         4
  NUMA node0 CPU(s):    0-23
  NUMA node1 CPU(s):    24-47
  NUMA node2 CPU(s):    48-71
  NUMA node3 CPU(s):    72-95
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec store bypass:    Not affected
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected
q3qa4bjr

q3qa4bjr6#

Maybe you can use git checkout to go back to the commit just before "runtime: add consistent heap statistics" and test again. With that build, CPU utilization is lower, but performance is almost the same as tip go.
