Improve Performance for SRS

jjjwad0x  posted on 2022-12-31  in Other

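Performance optimization is a never-ending topic and requires continuous work. SRS2 went through one major round of performance optimization, going from 3k to 7k. More optimization is still needed, and the process and data will be posted in this Issue.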

SRS2 was partially optimized earlier; for reference:

Play RTMP benchmark

The data for playing RTMP was benchmarked by [SB][srs-bench]:

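| Update | SRS | Clients | Type | CPU | Memory | Commit |
| --- | --- | --- | --- | --- | --- | --- |
| 2014-12-07 | 2.0.67 | 10k(10000) | players | 95% | 656MB | code |
| 2014-12-05 | 2.0.57 | 9.0k(9000) | players | 90% | 468MB | code |
| 2014-12-05 | 2.0.55 | 8.0k(8000) | players | 89% | 360MB | code |
| 2014-11-22 | 2.0.30 | 7.5k(7500) | players | 87% | 320MB | code |
| 2014-11-13 | 2.0.15 | 6.0k(6000) | players | 82% | 203MB | code |
| 2014-11-12 | 2.0.14 | 3.5k(3500) | players | 95% | 78MB | code |
| 2014-11-12 | 2.0.14 | 2.7k(2700) | players | 69% | 59MB | - |
| 2014-11-11 | 2.0.12 | 2.7k(2700) | players | 85% | 66MB | - |
| 2014-11-11 | 1.0.5 | 2.7k(2700) | players | 85% | 66MB | - |
| 2014-07-12 | 0.9.156 | 2.7k(2700) | players | 89% | 61MB | code |
| 2014-07-12 | 0.9.156 | 1.8k(1800) | players | 68% | 38MB | - |
| 2013-11-28 | 0.5.0 | 1.8k(1800) | players | 90% | 41M | - |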

Publish RTMP benchmark

The data for publishing RTMP was benchmarked by [SB][srs-bench]:

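| Update | SRS | Clients | Type | CPU | Memory | Commit |
| --- | --- | --- | --- | --- | --- | --- |
| 2014-12-04 | 2.0.52 | 4.0k(4000) | publishers | 80% | 331MB | code |
| 2014-12-04 | 2.0.51 | 2.5k(2500) | publishers | 91% | 259MB | code |
| 2014-12-04 | 2.0.49 | 2.5k(2500) | publishers | 95% | 404MB | code |
| 2014-12-04 | 2.0.49 | 1.4k(1400) | publishers | 68% | 144MB | - |
| 2014-12-03 | 2.0.48 | 1.4k(1400) | publishers | 95% | 140MB | code |
| 2014-12-03 | 2.0.47 | 1.4k(1400) | publishers | 95% | 140MB | - |
| 2014-12-03 | 2.0.47 | 1.2k(1200) | publishers | 84% | 76MB | code |
| 2014-12-03 | 2.0.12 | 1.2k(1200) | publishers | 96% | 43MB | - |
| 2014-12-03 | 1.0.10 | 1.2k(1200) | publishers | 96% | 43MB | - |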

Play HTTP FLV benchmark

The data for playing HTTP FLV was benchmarked by [SB][srs-bench]:

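| Update | SRS | Clients | Type | CPU | Memory | Commit |
| --- | --- | --- | --- | --- | --- | --- |
| 2014-05-25 | 2.0.171 | 6.0k(6000) | players | 84% | 297MB | code |
| 2014-05-24 | 2.0.170 | 3.0k(3000) | players | 89% | 96MB | code |
| 2014-05-24 | 2.0.169 | 3.0k(3000) | players | 94% | 188MB | code |
| 2014-05-24 | 2.0.168 | 2.3k(2300) | players | 92% | 276MB | code |
| 2014-05-24 | 2.0.167 | 1.0k(1000) | players | 82% | 86MB | - |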

Latency benchmark

The latency between encoder and player with realtime config([CN][v3_CN_LowLatency], [EN][v3_EN_LowLatency]):

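| Update | SRS | VP6 | H.264 | VP6+MP3 | H.264+MP3 |
| --- | --- | --- | --- | --- | --- |
| 2014-12-16 | 2.0.72 | 0.1s | 0.4s | 0.8s | 0.6s |
| 2014-12-12 | 2.0.70 | 0.1s | 0.4s | 1.0s | 0.9s |
| 2014-12-03 | 1.0.10 | 0.4s | 0.4s | 0.9s | 1.2s |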

r6l8ljro1#

SRS4: Refine ST Iterate Coroutines Performance

ST has an optimization that may improve performance by 5% to 10%. It mainly addresses a problem in how coroutines are iterated. For data, see: ossrs/state-threads#5 (comment)

This optimization is a fairly large change, so it will not go into SRS3; it is expected to land in SRS4.

MacPro information:

  • macOS Mojave
  • Version 10.14.6 (18G3020)
  • MacBook Pro (Retina, 15-inch, Mid 2015)
  • Processor: 2.2 GHz Intel Core i7
  • Memory: 16 GB 1600 MHz DDR3

Docker information:

  • Docker Desktop 2.2.0.3(42716)
  • Engine: 19.03.5
  • Resources: CPUs 4, Memory 2GB, Swap 1GB
    Note: SRS is bound to CPU0, and SB is bound to CPU2-3.
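
The note above says the server and the load generator run on separate CPUs, but the post does not say how the binding was done. As a minimal sketch, assuming a Linux host where Python's os.sched_setaffinity is available, the CPU lists "0" and "2-3" could be expanded and applied like this (parse_cpu_list and pin_process are hypothetical helper names, not part of SRS or SB):

```python
import os


def parse_cpu_list(spec):
    """Expand a CPU list such as "0", "2-3" or "0,2-3" into a set of CPU indices."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_process(pid, spec):
    """Pin a process (pid 0 means the caller) to the CPUs in spec, where supported."""
    cpus = parse_cpu_list(spec)
    # os.sched_setaffinity exists only on some platforms (notably Linux),
    # so this sketch silently skips the call elsewhere.
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(pid, cpus)
    return cpus
```

Keeping the server and the load generator on disjoint CPUs means the per-CPU lines in top can be read as "CPU0 is SRS" and "CPU2-3 is SB", which is how the numbers in the following sections are interpreted.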

SRS3 for Playing Baseline

SRS3 without this optimization serves as the performance baseline, to see how much this PR improves things by comparison.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 03:44:38 up 14:03,  0 users,  load average: 1.72, 1.71, 1.74
Tasks:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu0  : 44.7 us, 14.9 sy,  0.0 ni, 32.5 id,  0.0 wa,  0.0 hi,  7.8 si,  0.0 st
%Cpu1  :  1.5 us,  2.9 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 21.2 us, 11.2 sy,  0.0 ni, 67.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.0 us,  8.4 sy,  0.0 ni, 75.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  2037260 total,   490352 free,  1188940 used,   357968 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   704796 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6654 root      20   0  463540 331388   2960 S  24.6 16.3  21:00.42 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6606 root      20   0  449600 317332   2824 S  20.6 15.6  20:56.26 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11191 root      20   0 1339072 194020   5440 S  64.1  9.5   1:43.16 ./gprof.srs_3_baseline -c console.conf 

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4002

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 19   9  70   0   0   2|   0     0 | 134M  134M|   0     0 |4500  6374 
 24  14  58   0   0   4|   0     0 | 184M  184M|   0     0 |4829  5833 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_baseline gmon.out |more
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 19.71      8.35     8.35                             _st_epoll_dispatch
 16.91     15.52     7.17 45118865     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 10.29     19.88     4.36  1857259     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  9.33     23.83     3.96 45118865     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
  4.65     25.80     1.97     4000     0.49     3.17  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  3.54     27.30     1.50     7295     0.21     1.47  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  3.42     28.75     1.45  1857259     0.00     0.00  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.16     30.09     1.34 45086840     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.36     31.09     1.00 45118865     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)

Interpretation:

  • CPU usage is 64%, with 44% in user space and 14% in system space.
  • In user space, the main functions are _st_epoll_dispatch and the RTMP message handling logic.
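
The hotspot ranking above was read off the gprof flat profile by eye. A minimal sketch of doing the same reading programmatically is shown here, assuming the flat-profile layout seen in this post (rows without a call count, such as _st_epoll_dispatch, leave the calls and ms/call columns empty). The names ROW, parse_flat_profile and hotspots are hypothetical helpers, and SAMPLE copies three rows from the baseline profile:

```python
import re

# percent, cumulative seconds, self seconds, then optionally
# calls / self ms-per-call / total ms-per-call, then the function name.
ROW = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)\s+(?P<cum>\d+\.\d+)\s+(?P<self>\d+\.\d+)"
    r"(?:\s+(?P<calls>\d+)\s+\d+\.\d+\s+\d+\.\d+)?"
    r"\s+(?P<name>\S.*?)\s*$"
)

SAMPLE = """\
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 19.71      8.35     8.35                             _st_epoll_dispatch
 16.91     15.52     7.17 45118865     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 10.29     19.88     4.36  1857259     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
"""


def parse_flat_profile(text):
    """Return one dict per data row; header lines do not match and are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        calls = match.group("calls")
        rows.append({
            "percent": float(match.group("pct")),
            "self_seconds": float(match.group("self")),
            "calls": int(calls) if calls else None,
            "name": match.group("name"),
        })
    return rows


def hotspots(rows, threshold):
    """Names of functions whose self time is at least threshold percent."""
    return [row["name"] for row in rows if row["percent"] >= threshold]
```

Calling hotspots(parse_flat_profile(SAMPLE), 15.0) on the sample keeps only _st_epoll_dispatch and SrsConsumer::enqueue, the two entries above 15% in the baseline run.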

SRS3 for Playing with ST Refined

SRS3 with this PR merged, which refines the ST iteration logic.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 04:00:43 up 14:19,  0 users,  load average: 1.47, 1.57, 1.62
Tasks:  13 total,   3 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu0  : 40.6 us, 10.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  5.8 si,  0.0 st
%Cpu1  :  1.0 us,  2.1 sy,  0.0 ni, 96.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 17.7 us, 11.8 sy,  0.0 ni, 70.1 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.8 us,  9.5 sy,  0.0 ni, 73.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   429264 free,  1226620 used,   381376 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   667064 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6606 root      20   0  449356 317088   2824 S  19.3 15.6  24:59.70 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6654 root      20   0  448304 316176   2960 R  19.9 15.5  25:11.48 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11352 root      20   0 1357608 241384   5344 R  54.8 11.8   2:25.22 ./gprof.srs_3_st -c console.conf

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4003

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 21  10  67   0   0   2|   0     0 | 111M  111M|   0     0 |4563  6364 
 23   9  66   0   0   2|   0     0 | 121M  121M|   0     0 |4505  6306 
 20   9  69   0   0   2|   0     0 | 130M  130M|   0     0 |4812  6843 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_st gmon.out |more
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 22.33     14.96    14.96 82024549     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 13.08     23.73     8.77 82024549     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
 12.30     31.97     8.24  3312993     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  5.25     35.49     3.52     4001     0.88     5.96  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  5.07     38.89     3.40    13188     0.26     1.73  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  4.54     41.93     3.04  3312993     0.00     0.01  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.49     44.27     2.34 82013595     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.63     46.03     1.76 82024549     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)
  2.28     47.56     1.53     7656     0.20     1.68  SrsSource::on_video_imp(SrsSharedPtrMessage*)
  2.13     48.99     1.43                             st_writev

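Interpretation:

  • CPU usage is 54%, with 40% in user space and 10% in system space.
  • In user space, the main functions are the RTMP message handling logic.
    Note: After the ST optimization, performance does improve somewhat, and _st_epoll_dispatch is no longer a hotspot function.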
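
To put the two top readings side by side, the drop can be expressed both in percentage points and relative to the baseline. The sketch below uses the SRS process %CPU values from the two top outputs in this answer (64.1 and 54.8); the function names are hypothetical, and each figure comes from a single run, so the result is only a rough indication:

```python
def point_drop(baseline, optimized):
    """Absolute difference in percentage points."""
    return baseline - optimized


def relative_reduction(baseline, optimized):
    """Reduction expressed as a percentage of the baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


# SRS process %CPU from top: baseline run vs. run with the ST refinement.
srs_points = point_drop(64.1, 54.8)            # about 9.3 points
srs_relative = relative_reduction(64.1, 54.8)  # about 14.5% of the baseline
```

The distinction matters when comparing against a claim like "5% to 10%": a drop of about 9.3 points on CPU0 is a relative reduction of about 14.5% of the baseline load.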

up9lanfz2#

SRS3: Use Compiler O2 To Improve Performance

SRS1, 2, and 3 have always used O0 by default, which disables compiler optimization. We can enable optimization and compare the data.

MacPro information:

  • macOS Mojave
  • Version 10.14.6 (18G3020)
  • MacBook Pro (Retina, 15-inch, Mid 2015)
  • Processor: 2.2 GHz Intel Core i7
  • Memory: 16 GB 1600 MHz DDR3

Docker information:

  • Docker Desktop 2.2.0.3(42716)
  • Engine: 19.03.5
  • Resources: CPUs 4, Memory 2GB, Swap 1GB
    Note: SRS is bound to CPU0, and SB is bound to CPU2-3.

SRS3 Play Baseline

First, the baseline data: CPU usage averages 66%, with 39% in user space and 22% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:03:30 up 1 day, 14 min,  0 users,  load average: 1.53, 1.39, 1.12
Tasks:   5 total,   3 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu0  : 39.6 us, 22.9 sy,  0.0 ni, 28.7 id,  0.0 wa,  0.0 hi,  8.9 si,  0.0 st
%Cpu1  :  0.3 us,  1.7 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 21.3 us, 11.8 sy,  0.0 ni, 66.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 26.7 us, 15.2 sy,  0.0 ni, 58.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   412404 free,  1260192 used,   364664 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   640028 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  555112 393012   3056 S  26.7 19.3   4:58.08 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  555004 392828   3000 R  35.3 19.3   5:34.46 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88034 root      20   0 1651656 218748   5484 R  66.3 10.7  12:38.10 ./srs_3_baseline -c console.conf                                          
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.46 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.51 bash

SRS3 Play with Compiler O2

With the O2 compiler option enabled, SRS3 gains roughly 10% in performance: CPU usage is about 52%, with 26% in user space and 17% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:09:24 up 1 day, 20 min,  0 users,  load average: 1.23, 1.38, 1.20
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu0  : 26.7 us, 17.8 sy,  0.0 ni, 46.2 id,  0.0 wa,  0.0 hi,  9.2 si,  0.0 st
%Cpu1  :  1.8 us,  4.8 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 24.3 us, 11.4 sy,  0.0 ni, 64.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 20.6 us, 10.7 sy,  0.0 ni, 68.4 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem :  2037260 total,   375336 free,  1307788 used,   354136 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   594752 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  550440 388408   3056 S  31.2 19.1   6:55.76 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  545716 383624   3000 S  24.6 18.8   7:27.84 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88085 root      20   0 1713060 290732   5040 S  52.5 14.3   2:38.46 ./srs_3_o2 -c console.conf                                                
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.60 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.54 bash
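
The user-space and system-space figures quoted in this answer appear to correspond to the us and sy columns of the %Cpu0 line, since SRS is bound to CPU0. A minimal sketch for pulling those columns out of a top per-CPU line follows; CPU_LINE, FIELD, parse_cpu_line and busy_percent are hypothetical helper names, and the two sample lines are copied from the top outputs above:

```python
import re

CPU_LINE = re.compile(r"^%Cpu(?P<cpu>\d+)\s*:\s*(?P<fields>.*)$")
FIELD = re.compile(r"(\d+(?:\.\d+)?)\s+([a-z]{2})")

BASELINE_LINE = "%Cpu0  : 39.6 us, 22.9 sy,  0.0 ni, 28.7 id,  0.0 wa,  0.0 hi,  8.9 si,  0.0 st"
O2_LINE = "%Cpu0  : 26.7 us, 17.8 sy,  0.0 ni, 46.2 id,  0.0 wa,  0.0 hi,  9.2 si,  0.0 st"


def parse_cpu_line(line):
    """Return (cpu_index, {column: percent}) for a top per-CPU line, or None."""
    match = CPU_LINE.match(line.strip())
    if not match:
        return None
    fields = {name: float(value) for value, name in FIELD.findall(match.group("fields"))}
    return int(match.group("cpu")), fields


def busy_percent(fields):
    """Everything that is not idle time on this CPU."""
    return 100.0 - fields["id"]
```

Note that the busy share of CPU0 (100 minus id) also includes softirq time (si), so it is somewhat higher than us plus sy alone.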

c47b9e46


0g0grzrc3#

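The Docker environment may have an unstable baseline: sometimes it is high and sometimes low, and the difference is very large, as shown in the figure below: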

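I made some optimizations, and some of them, such as enabling O2, were expected to improve performance. But because the baseline is unstable, I will set them aside for now and test on a physical machine later. The optimization branch is below: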


x0fgdtte4#

For ST, the points that can be optimized are:

  1. The use of timer and cond, see Refine SRS timer and cond for performance issue. #1711
  2. IO event handling needs to traverse io_q, see Support MSG_ZEROCOPY for streaming server. state-threads#13 (comment)
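
To illustrate why traversing a wait queue on every dispatch is costly, here is a simplified model in Python. It is not ST's actual code or data structures: ScanDispatcher and IndexedDispatcher are hypothetical classes that only count how many waiter entries each approach touches when a few file descriptors out of thousands become ready.

```python
from collections import defaultdict


class ScanDispatcher:
    """Keeps all waiters in one queue and walks the whole queue per dispatch."""

    def __init__(self):
        self.waiters = []  # (coroutine_id, fd) pairs, in arrival order
        self.visited = 0   # number of waiter entries examined so far

    def wait(self, coroutine_id, fd):
        self.waiters.append((coroutine_id, fd))

    def dispatch(self, ready_fds):
        ready = set(ready_fds)
        woken, still_waiting = [], []
        for coroutine_id, fd in self.waiters:
            self.visited += 1
            if fd in ready:
                woken.append(coroutine_id)
            else:
                still_waiting.append((coroutine_id, fd))
        self.waiters = still_waiting
        return woken


class IndexedDispatcher:
    """Indexes waiters by fd and touches only entries for ready fds."""

    def __init__(self):
        self.by_fd = defaultdict(list)
        self.visited = 0

    def wait(self, coroutine_id, fd):
        self.by_fd[fd].append(coroutine_id)

    def dispatch(self, ready_fds):
        woken = []
        for fd in ready_fds:
            for coroutine_id in self.by_fd.pop(fd, []):
                self.visited += 1
                woken.append(coroutine_id)
        return woken


def compare(n_waiters, ready_fds):
    """Run one dispatch round on both models and report the work done."""
    scan, indexed = ScanDispatcher(), IndexedDispatcher()
    for i in range(n_waiters):
        scan.wait(i, i)      # in this model, coroutine i waits on fd i
        indexed.wait(i, i)
    woken_scan = sorted(scan.dispatch(ready_fds))
    woken_indexed = sorted(indexed.dispatch(ready_fds))
    return woken_scan, woken_indexed, scan.visited, indexed.visited
```

With 4000 waiters and two ready descriptors, compare(4000, [10, 20]) wakes the same two coroutines in both models, but the scanning model examines all 4000 entries while the indexed one examines 2. The cost of the scan grows with the number of connections rather than with the number of ready events.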

For an analysis of ST, see: https://github.com/ossrs/state-threads/tree/srs#analysis

  1. About setjmp and longjmp, read setjmp.
  2. About the stack structure, read stack.
  3. About asm code comments, read #91d530e.
  4. About the scheduler, read #13-scheduler.
  5. About the IO event system, read #13-IO.
