Improve Performance for SRS

Posted by jjjwad0x on 2022-12-31 in Other


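Performance optimization is a never-ending topic that needs continuous work. SRS2 went through one major round of optimization that raised capacity from 3k to 7k. More optimization will follow, and the process and data will be posted in this issue.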

SRS2 was partly optimized earlier; for reference:

Play RTMP benchmark

The data for playing RTMP was benchmarked by SB (srs-bench):

Update	SRS	Clients	Type	CPU	Memory	Commit
2014-12-07	2.0.67	10k(10000)	players	95%	656MB	code
2014-12-05	2.0.57	9.0k(9000)	players	90%	468MB	code
2014-12-05	2.0.55	8.0k(8000)	players	89%	360MB	code
2014-11-22	2.0.30	7.5k(7500)	players	87%	320MB	code
2014-11-13	2.0.15	6.0k(6000)	players	82%	203MB	code
2014-11-12	2.0.14	3.5k(3500)	players	95%	78MB	code
2014-11-12	2.0.14	2.7k(2700)	players	69%	59MB	-
2014-11-11	2.0.12	2.7k(2700)	players	85%	66MB	-
2014-11-11	1.0.5	2.7k(2700)	players	85%	66MB	-
2014-07-12	0.9.156	2.7k(2700)	players	89%	61MB	code
2014-07-12	0.9.156	1.8k(1800)	players	68%	38MB	-
2013-11-28	0.5.0	1.8k(1800)	players	90%	41MB	-

Publish RTMP benchmark

The data for publishing RTMP was benchmarked by SB (srs-bench):

Update	SRS	Clients	Type	CPU	Memory	Commit
2014-12-04	2.0.52	4.0k(4000)	publishers	80%	331MB	code
2014-12-04	2.0.51	2.5k(2500)	publishers	91%	259MB	code
2014-12-04	2.0.49	2.5k(2500)	publishers	95%	404MB	code
2014-12-04	2.0.49	1.4k(1400)	publishers	68%	144MB	-
2014-12-03	2.0.48	1.4k(1400)	publishers	95%	140MB	code
2014-12-03	2.0.47	1.4k(1400)	publishers	95%	140MB	-
2014-12-03	2.0.47	1.2k(1200)	publishers	84%	76MB	code
2014-12-03	2.0.12	1.2k(1200)	publishers	96%	43MB	-
2014-12-03	1.0.10	1.2k(1200)	publishers	96%	43MB	-

Play HTTP FLV benchmark

The data for playing HTTP FLV was benchmarked by SB (srs-bench):

Update	SRS	Clients	Type	CPU	Memory	Commit
2014-05-25	2.0.171	6.0k(6000)	players	84%	297MB	code
2014-05-24	2.0.170	3.0k(3000)	players	89%	96MB	code
2014-05-24	2.0.169	3.0k(3000)	players	94%	188MB	code
2014-05-24	2.0.168	2.3k(2300)	players	92%	276MB	code
2014-05-24	2.0.167	1.0k(1000)	players	82%	86MB	-

Latency benchmark

The latency between encoder and player with the realtime config (CN: v3_CN_LowLatency, EN: v3_EN_LowLatency):

Update	SRS	VP6	H.264	VP6+MP3	H.264+MP3
2014-12-16	2.0.72	0.1s	0.4s	0.8s	0.6s
2014-12-12	2.0.70	0.1s	0.4s	1.0s	0.9s
2014-12-03	1.0.10	0.4s	0.4s	0.9s	1.2s


Source: https://github.com/ossrs/srs/issues/1673

4 Answers


r6l8ljro1#

SRS4: Refine ST Iterate Coroutines Performance

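There is an ST optimization that may improve performance by 5% to 10%. It mainly addresses a problem in how coroutines are iterated; for data, see ossrs/state-threads#5 (comment).

The change is fairly large, so it will not go into SRS3; it is expected to land in SRS4.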

MacPro info:

macOS Mojave
Version 10.14.6 (18G3020)
MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3

Docker info:

Docker Desktop 2.2.0.3(42716)
Engine: 19.03.5
Resources: CPUs 4, Memory 2GB, Swap 1GB
Note: SRS is bound to CPU0, and SB is bound to CPU2-3.
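
The post does not say how the CPU binding was done. On Linux it is commonly done with `taskset`, and the sketch below shows the same idea from Python via `os.sched_setaffinity`. The helper name `pin_current_process` is hypothetical, and the code is only an illustration of pinning, not the setup used for these benchmarks.

```python
import os

def pin_current_process(cpus):
    """Restrict the calling process to the given CPU ids (Linux only).

    Returns the resulting affinity set, or None on platforms that do not
    provide sched_setaffinity (for example macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Keep only the CPUs this process is actually allowed to run on.
    allowed = os.sched_getaffinity(0)
    wanted = set(cpus) & allowed
    if not wanted:
        raise ValueError(f"none of {sorted(cpus)} is available: {sorted(allowed)}")
    os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    # Pin to CPU0, as the note above describes for SRS.
    print(pin_current_process({0}))
```

Keeping the server and the load generator on separate cores means the per-core `%Cpu0` line in `top` reflects the server alone.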

SRS3 for Playing Baseline

SRS3 without this optimization serves as the performance baseline, to show how much this PR improves on it.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 03:44:38 up 14:03,  0 users,  load average: 1.72, 1.71, 1.74
Tasks:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu0  : 44.7 us, 14.9 sy,  0.0 ni, 32.5 id,  0.0 wa,  0.0 hi,  7.8 si,  0.0 st
%Cpu1  :  1.5 us,  2.9 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 21.2 us, 11.2 sy,  0.0 ni, 67.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.0 us,  8.4 sy,  0.0 ni, 75.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  2037260 total,   490352 free,  1188940 used,   357968 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   704796 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6654 root      20   0  463540 331388   2960 S  24.6 16.3  21:00.42 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6606 root      20   0  449600 317332   2824 S  20.6 15.6  20:56.26 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11191 root      20   0 1339072 194020   5440 S  64.1  9.5   1:43.16 ./gprof.srs_3_baseline -c console.conf 

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4002

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 19   9  70   0   0   2|   0     0 | 134M  134M|   0     0 |4500  6374 
 24  14  58   0   0   4|   0     0 | 184M  184M|   0     0 |4829  5833 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_baseline gmon.out |more
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 19.71      8.35     8.35                             _st_epoll_dispatch
 16.91     15.52     7.17 45118865     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 10.29     19.88     4.36  1857259     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  9.33     23.83     3.96 45118865     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
  4.65     25.80     1.97     4000     0.49     3.17  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  3.54     27.30     1.50     7295     0.21     1.47  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  3.42     28.75     1.45  1857259     0.00     0.00  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.16     30.09     1.34 45086840     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.36     31.09     1.00 45118865     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)

Analysis:

CPU usage is 64%, with 44% in user space and 14% in system space.
In user space, the hot functions are mainly _st_epoll_dispatch and the RTMP message handling logic.
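
The user and system figures come from the `%Cpu0` line of `top`, since SRS is pinned to CPU0. As a minimal sketch, that line can be split into its fields like this (the helper name `parse_cpu_line` is an assumption made for illustration):

```python
import re

def parse_cpu_line(line):
    """Parse one per-core line from `top` into (core name, {field: percent})."""
    name, _, rest = line.partition(":")
    fields = {}
    # Each entry looks like "44.7 us": a number followed by a short label.
    for value, key in re.findall(r"([\d.]+)\s+(\w+)", rest):
        fields[key] = float(value)
    return name.strip(), fields

cpu0 = "%Cpu0  : 44.7 us, 14.9 sy,  0.0 ni, 32.5 id,  0.0 wa,  0.0 hi,  7.8 si,  0.0 st"
name, f = parse_cpu_line(cpu0)
busy = 100.0 - f["id"]  # everything that is not idle
print(name, f["us"], f["sy"], f["si"], round(busy, 1))  # %Cpu0 44.7 14.9 7.8 67.5
```

Besides `us` and `sy`, the core also spends 7.8% in `si` (software interrupts), which is why the non-idle total on the core is higher than user plus system.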

SRS3 for Playing with ST Refined

SRS3 with this PR merged, which optimizes the ST iteration logic.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 04:00:43 up 14:19,  0 users,  load average: 1.47, 1.57, 1.62
Tasks:  13 total,   3 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu0  : 40.6 us, 10.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  5.8 si,  0.0 st
%Cpu1  :  1.0 us,  2.1 sy,  0.0 ni, 96.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 17.7 us, 11.8 sy,  0.0 ni, 70.1 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.8 us,  9.5 sy,  0.0 ni, 73.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   429264 free,  1226620 used,   381376 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   667064 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6606 root      20   0  449356 317088   2824 S  19.3 15.6  24:59.70 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6654 root      20   0  448304 316176   2960 R  19.9 15.5  25:11.48 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11352 root      20   0 1357608 241384   5344 R  54.8 11.8   2:25.22 ./gprof.srs_3_st -c console.conf

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4003

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 21  10  67   0   0   2|   0     0 | 111M  111M|   0     0 |4563  6364 
 23   9  66   0   0   2|   0     0 | 121M  121M|   0     0 |4505  6306 
 20   9  69   0   0   2|   0     0 | 130M  130M|   0     0 |4812  6843 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_st gmon.out |more
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 22.33     14.96    14.96 82024549     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 13.08     23.73     8.77 82024549     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
 12.30     31.97     8.24  3312993     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  5.25     35.49     3.52     4001     0.88     5.96  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  5.07     38.89     3.40    13188     0.26     1.73  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  4.54     41.93     3.04  3312993     0.00     0.01  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.49     44.27     2.34 82013595     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.63     46.03     1.76 82024549     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)
  2.28     47.56     1.53     7656     0.20     1.68  SrsSource::on_video_imp(SrsSharedPtrMessage*)
  2.13     48.99     1.43                             st_writev

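Analysis:

CPU usage is 54%, with 40% in user space and 10% in system space.
In user space, the hot functions are mainly the RTMP message handling logic.
Note: Optimizing ST does improve performance somewhat; _st_epoll_dispatch is no longer a hot function.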
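
To compare the two gprof flat profiles without reading them by eye, the rows can be parsed and searched by name. The following is a rough sketch: the regular expression and the helper `parse_flat_profile` are assumptions written for the row layout quoted above, and only two rows from each run are pasted in.

```python
import re

# A row is: %time, cumulative s, self s, then optionally calls and two
# ms/call columns, then the function name (which may contain spaces).
FLAT_ROW = re.compile(
    r"^\s*(?P<pct>[\d.]+)\s+(?P<cum>[\d.]+)\s+(?P<self>[\d.]+)"
    r"(?:\s+(?P<calls>\d+)\s+[\d.]+\s+[\d.]+)?\s+(?P<name>\S.*)$"
)

def parse_flat_profile(text):
    """Return a list of (name, percent, self_seconds, calls or None)."""
    rows = []
    for line in text.splitlines():
        m = FLAT_ROW.match(line)
        if m:
            calls = int(m["calls"]) if m["calls"] else None
            rows.append((m["name"].strip(), float(m["pct"]), float(m["self"]), calls))
    return rows

baseline = """\
 19.71      8.35     8.35                             _st_epoll_dispatch
 16.91     15.52     7.17 45118865     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
"""
refined = """\
 22.33     14.96    14.96 82024549     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 13.08     23.73     8.77 82024549     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
"""

for label, text in (("baseline", baseline), ("refined", refined)):
    rows = parse_flat_profile(text)
    hottest = max(rows, key=lambda r: r[1])
    has_dispatch = any(r[0] == "_st_epoll_dispatch" for r in rows)
    print(label, hottest[0].split("(")[0], hottest[1], has_dispatch)
```

Because the quoted output was cut off by `more`, a function missing from the excerpt only means it fell below the last visible row (about 2% here), not that it costs nothing.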


up9lanfz2#

SRS3: Use Compiler O2 To Improve Performance

SRS1, 2 and 3 have always used O0 by default, which disables compiler optimization. Enabling optimization lets us compare the data.

MacPro info:

macOS Mojave
Version 10.14.6 (18G3020)
MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3

Docker info:

Docker Desktop 2.2.0.3(42716)
Engine: 19.03.5
Resources: CPUs 4, Memory 2GB, Swap 1GB
Note: SRS is bound to CPU0, and SB is bound to CPU2-3.

SRS3 Play Baseline

First the baseline data: CPU usage averages 66%, with 39% in user space and 22% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:03:30 up 1 day, 14 min,  0 users,  load average: 1.53, 1.39, 1.12
Tasks:   5 total,   3 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu0  : 39.6 us, 22.9 sy,  0.0 ni, 28.7 id,  0.0 wa,  0.0 hi,  8.9 si,  0.0 st
%Cpu1  :  0.3 us,  1.7 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 21.3 us, 11.8 sy,  0.0 ni, 66.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 26.7 us, 15.2 sy,  0.0 ni, 58.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   412404 free,  1260192 used,   364664 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   640028 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  555112 393012   3056 S  26.7 19.3   4:58.08 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  555004 392828   3000 R  35.3 19.3   5:34.46 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88034 root      20   0 1651656 218748   5484 R  66.3 10.7  12:38.10 ./srs_3_baseline -c console.conf                                          
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.46 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.51 bash

SRS3 Play with Compiler O2

With the O2 compiler option enabled, SRS3 performance improves by about 10%: CPU usage is around 52%, with 26% in user space and 17% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:09:24 up 1 day, 20 min,  0 users,  load average: 1.23, 1.38, 1.20
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu0  : 26.7 us, 17.8 sy,  0.0 ni, 46.2 id,  0.0 wa,  0.0 hi,  9.2 si,  0.0 st
%Cpu1  :  1.8 us,  4.8 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 24.3 us, 11.4 sy,  0.0 ni, 64.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 20.6 us, 10.7 sy,  0.0 ni, 68.4 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem :  2037260 total,   375336 free,  1307788 used,   354136 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   594752 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  550440 388408   3056 S  31.2 19.1   6:55.76 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  545716 383624   3000 S  24.6 18.8   7:27.84 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88085 root      20   0 1713060 290732   5040 S  52.5 14.3   2:38.46 ./srs_3_o2 -c console.conf                                                
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.60 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.54 bash

c47b9e46
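
An improvement figure depends on whether it is stated in percentage points or as a relative change. A small worked example, using the SRS process `%CPU` values from the `top` snapshots in this thread (66.3 and 52.5 for O2, 64.1 and 54.8 for the ST change), shows both. The helper name `cpu_drop` is made up for this sketch.

```python
def cpu_drop(before, after):
    """Return (drop in percentage points, relative drop in percent)."""
    points = before - after
    return round(points, 1), round(points / before * 100, 1)

print(cpu_drop(66.3, 52.5))  # O2 build:  (13.8, 20.8)
print(cpu_drop(64.1, 54.8))  # ST change: (9.3, 14.5)
```

These are single snapshots under the same client load, so they are rough indications rather than precise measurements.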


0g0grzrc3#

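The baseline in the Docker environment appears to be unstable: sometimes high, sometimes low, with a very large gap between runs (the original post illustrates this with a figure).

Several optimizations were made. Some, such as enabling O2, are expected to help, but because the baseline is unstable they are on hold until a physical machine is available for testing. The optimization branches are: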

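compiler O2: build with O2 optimization enabled.
inline: inline the hot functions.
tcmalloc: use tcmalloc for memory allocation.
st: merge ST improvement #5, which optimizes scheduling performance for busy coroutines.
large iovs: increase the number of messages that mw_msgs merges into one write.
perf stat: collect statistics on the number of mw messages.
fast vector: optimize each consumer's queue.
mr always: always enable mr read waiting.
mr buffer: always read a fixed length of data.
small buffer: a smaller buffer may perform better.
vector queue: using a vector directly is also an option.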
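
To illustrate the idea behind the "large iovs" item, the sketch below sends several messages with one `writev` call per batch instead of one write per message. It is a toy example on a pipe, not SRS code: `send_merged` and `max_iovs` are hypothetical names, and `max_iovs` only stands in for the notion of how many messages get merged.

```python
import os

def send_merged(fd, messages, max_iovs=128):
    """Write messages using as few writev() calls as possible.

    Returns the number of syscalls issued. Partial writes are not handled,
    which is fine for this small pipe demo but not for real sockets.
    """
    syscalls = 0
    for i in range(0, len(messages), max_iovs):
        os.writev(fd, messages[i:i + max_iovs])
        syscalls += 1
    return syscalls

r, w = os.pipe()
msgs = [b"msg%d;" % i for i in range(10)]
n = send_merged(w, msgs, max_iovs=4)  # batches of 4, 4 and 2 messages
os.close(w)
data = os.read(r, 4096)
os.close(r)
print(n, data)
```

A larger batch size means fewer syscalls for the same number of messages, at the cost of holding more messages before each write.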


x0fgdtte4#

For ST, the areas that can be optimized are:

The use of timer and cond; see Refine SRS timer and cond for performance issue. #1711
IO event handling has to traverse io_q; see Support MSG_ZEROCOPY for streaming server. state-threads#13 (comment)
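
The cost of traversing a wait queue on every wakeup can be shown with a toy model. This is not ST's code: the queue is modeled as a plain list of (waiter, fd) pairs, and the two function names are hypothetical. It only contrasts scanning all waiters with looking up the ready fds directly.

```python
def wake_by_scan(io_q, ready_fds):
    """Walk every waiter and test its fd: work grows with len(io_q)."""
    visited, woken = 0, []
    for waiter, fd in io_q:
        visited += 1
        if fd in ready_fds:
            woken.append(waiter)
    return woken, visited

def wake_by_lookup(waiters_by_fd, ready_fds):
    """Look up only the ready fds: work grows with len(ready_fds)."""
    visited, woken = 0, []
    for fd in sorted(ready_fds):
        visited += 1
        waiter = waiters_by_fd.get(fd)
        if waiter is not None:
            woken.append(waiter)
    return woken, visited

# 4000 waiting coroutines, matching the 4000 players in the test above.
io_q = [(f"coroutine-{fd}", fd) for fd in range(4000)]
by_fd = {fd: name for name, fd in io_q}
ready = {7, 1234}

print(wake_by_scan(io_q, ready))    # (['coroutine-7', 'coroutine-1234'], 4000)
print(wake_by_lookup(by_fd, ready)) # (['coroutine-7', 'coroutine-1234'], 2)
```

With thousands of connections and only a few ready at a time, the scan does most of its work on waiters that have nothing to do.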

For an analysis of ST, see: https://github.com/ossrs/state-threads/tree/srs#analysis

About setjmp and longjmp, read setjmp.
About the stack structure, read stack.
About asm code comments, read #91d530e.
About the scheduler, read #13-scheduler.
About the IO event system, read #13-IO.

