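This is a piece of code I wrote to test multithreading performance. In short, it runs a long calculation in a loop, accumulates the results, and measures how long that takes. Accumulating the results requires a lock in exactly one place. The problem is that taking the lock on that one line kills the multithreading performance. Why?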
I have also measured the time it takes to lock and unlock the mutex. I compile the code with g++ using the `-O3` option.

    #include <chrono>
    #include <cmath>
    #include <functional>
    #include <iomanip>
    #include <iostream>
    #include <mutex>
    #include <vector>
    #include <thread>

    long double store;
    std::mutex lock;

    using ftype = std::function<long double(long int)>;
    using loop_type = std::function<void(long int, long int, ftype)>;

    ///simple class to time the execution and print result.
    struct time_n_print
    {
        time_n_print() :
            start(std::chrono::high_resolution_clock::now())
        {}
        ~time_n_print()
        {
            auto elapsed = std::chrono::high_resolution_clock::now() - start;
            // Note: the value is in microseconds, although the label below says "ms".
            auto ms = std::chrono::duration_cast<std::chrono::microseconds>(elapsed);
            std::cout << "Elapsed(ms)=" << std::setw(7) << ms.count();
            std::cout << "; Result: " << (long int)(store);
        }
        std::chrono::high_resolution_clock::time_point start;
    };//class time_n_print

    ///do long and pointless calculations which result in 1.0
    long double slow(long int i)
    {
        long double pi = 3.1415926536;
        long double i_rad = (long double)(i) * pi / 180;
        long double sin_i = std::sin(i_rad);
        long double cos_i = std::cos(i_rad);
        long double sin_sq = sin_i * sin_i;
        long double cos_sq = cos_i * cos_i;
        long double log_sin_sq = std::log(sin_sq);
        long double log_cos_sq = std::log(cos_sq);
        sin_sq = std::exp(log_sin_sq);
        cos_sq = std::exp(log_cos_sq);
        long double sum_sq = sin_sq + cos_sq;
        long double result = std::sqrt(sum_sq);
        return result;
    }

    ///just return 1
    long double fast(long int)
    {
        return 1.0;
    }

    ///sum everything up with mutex
    void loop_guarded(long int a, long int b, ftype increment)
    {
        for(long int i = a; i < b; ++i)
        {
            long double inc = increment(i);
            {
                std::lock_guard<std::mutex> guard(lock);
                store += inc;
            }
        }
    }//loop_guarded

    ///sum everything up without locks
    ///(deliberate data race on `store`, kept only for comparison)
    void loop_unguarded(long int a, long int b, ftype increment)
    {
        for(long int i = a; i < b; ++i)
        {
            long double inc = increment(i);
            {
                store += inc;
            }
        }
    }//loop_unguarded

    //run calculations on multiple threads.
    void run_calculations(int size,
                          int nthreads,
                          loop_type loop,
                          ftype increment)
    {
        store = 0.0;
        std::vector<std::thread> tv;
        long a(0), b(0);
        for(int n = 0; n < nthreads; ++n)
        {
            a = b;
            // the last thread takes whatever remains of the range
            b = n < nthreads - 1 ? a + size / nthreads : size;
            tv.push_back(std::thread(loop, a, b, increment));
        }
        //Wait, until all threads finish
        for(auto& t : tv)
        {
            t.join();
        }
    }//run_calculations

    int main()
    {
        long int size = 10000000;
        {
            std::cout << "\n1 thread - fast, unguarded : ";
            time_n_print t;
            run_calculations(size, 1, loop_unguarded, fast);
        }
        {
            std::cout << "\n1 thread - fast, guarded : ";
            time_n_print t;
            run_calculations(size, 1, loop_guarded, fast);
        }
        std::cout << std::endl;
        {
            std::cout << "\n1 thread - slow, unguarded : ";
            time_n_print t;
            run_calculations(size, 1, loop_unguarded, slow);
        }
        {
            std::cout << "\n2 threads - slow, unguarded : ";
            time_n_print t;
            run_calculations(size, 2, loop_unguarded, slow);
        }
        {
            std::cout << "\n3 threads - slow, unguarded : ";
            time_n_print t;
            run_calculations(size, 3, loop_unguarded, slow);
        }
        {
            std::cout << "\n4 threads - slow, unguarded : ";
            time_n_print t;
            run_calculations(size, 4, loop_unguarded, slow);
        }
        std::cout << std::endl;
        {
            std::cout << "\n1 thread - slow, guarded : ";
            time_n_print t;
            run_calculations(size, 1, loop_guarded, slow);
        }
        {
            std::cout << "\n2 threads - slow, guarded : ";
            time_n_print t;
            run_calculations(size, 2, loop_guarded, slow);
        }
        {
            std::cout << "\n3 threads - slow, guarded : ";
            time_n_print t;
            run_calculations(size, 3, loop_guarded, slow);
        }
        {
            std::cout << "\n4 threads - slow, guarded : ";
            time_n_print t;
            run_calculations(size, 4, loop_guarded, slow);
        }
        std::cout << std::endl;
        return 0;
    }

Here is typical output on a 4-core Linux machine (the times are in microseconds, despite the `ms` label):
>1 thread - fast, unguarded : Elapsed(ms)= 32826; Result: 10000000
>1 thread - fast, guarded : Elapsed(ms)= 172208; Result: 10000000
>
>1 thread - slow, unguarded : Elapsed(ms)=2131659; Result: 10000000
>2 threads - slow, unguarded : Elapsed(ms)=1079671; Result: 9079646
>3 threads - slow, unguarded : Elapsed(ms)= 739284; Result: 8059758
>4 threads - slow, unguarded : Elapsed(ms)= 564641; Result: 7137484
>
>1 thread - slow, guarded : Elapsed(ms)=2198650; Result: 10000000
>2 threads - slow, guarded : Elapsed(ms)=1468137; Result: 10000000
>3 threads - slow, guarded : Elapsed(ms)=1306659; Result: 10000000
>4 threads - slow, guarded : Elapsed(ms)=1549214; Result: 10000000
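So we can see that:
- locking and unlocking the mutex actually takes quite a long time compared to, say, incrementing a long double value;
- without the mutex, the gain from multithreading is very good, as expected. Also as expected, we lose quite a lot of increments to the race;
- with the mutex, there is no gain beyond 2 threads.
The main question: why does a part of the code that takes less than 10% of the execution time reduce performance so dramatically?
I understand that I can work around this by accumulating the results separately in each thread and summing them at the end. But why does the problem appear in the first place?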
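**UPDATE:** Thanks for the answers and comments. Basically, if each thread spends 7-8% of its time in the locked state, we cannot get a good performance gain. If I add a loop of 10 iterations to the `slow` function in the code above, the guarded and unguarded versions show the same performance gain up to 4 threads. So my rule of thumb for now is that the time spent in the locked state should not exceed 1% of the total execution time.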
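The 7-8% figure can be checked against the timings reported above. The following is a minimal sketch, not part of the original program: the helper names (`lock_fraction`, `speedup`, `amdahl_bound`) are hypothetical, and the comparison with Amdahl's law is an added illustration that assumes the locked region is the only serialized part.

```cpp
#include <cassert>

// Timings copied from the output above (microseconds, one run each).
constexpr double fast_guarded_1 = 172208.0;   // 1 thread, fast, guarded
constexpr double slow_guarded_1 = 2198650.0;  // 1 thread, slow, guarded
constexpr double slow_guarded_4 = 1549214.0;  // 4 threads, slow, guarded

// Rough share of the single-threaded guarded run spent locking and adding.
// The fast/guarded run does almost nothing else, so it approximates that cost.
double lock_fraction()
{
    return fast_guarded_1 / slow_guarded_1;
}

// Measured speedup of a run taking tn relative to a run taking t1.
double speedup(double t1, double tn)
{
    assert(tn > 0.0);
    return t1 / tn;
}

// Idealized upper bound on speedup with n threads when a fraction s of the
// work is serialized (Amdahl's law). It ignores the cost of contention itself.
double amdahl_bound(double s, int n)
{
    assert(n > 0);
    return 1.0 / (s + (1.0 - s) / n);
}
```

With these numbers, `lock_fraction()` is about 0.078, `speedup(slow_guarded_1, slow_guarded_4)` is about 1.42, and `amdahl_bound(lock_fraction(), 4)` is about 3.24. The measured gain is far below the idealized bound, which suggests that the loss comes from more than the serialized region alone.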
1 Answer
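Locking a contested [1] mutex requires a system call and everything that entails: a context switch into the operating system, which may schedule some other process, so that all your caches are invalidated by the time you come back, and so on. It is an inherently expensive operation. Because your `slow` function is not *that* expensive, all things considered, there appears to be enough contention on the mutex that the code has to go to the operating system relatively often, and this causes a significant performance drag. It would be good practice to have each thread aggregate its results in its own variable and then do one batch update at the end, so that each thread needs only a single mutex lock. In general, if you intend to synchronize with mutexes and care about performance, you need to find ways to split the work into chunks coarse enough that the mutex does not become a significant drag.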
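The per-thread aggregation described above could look roughly like the sketch below. It is a minimal illustration under assumptions, not the original program: the function name `sum_with_local_accumulators` and its signature are hypothetical, and it splits the range the same way `run_calculations` does.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: sums increment(i) for i in [0, n) on nthreads threads.
// Each thread accumulates privately and takes the mutex exactly once.
long double sum_with_local_accumulators(
    long n, int nthreads,
    const std::function<long double(long)>& increment)
{
    assert(nthreads > 0);
    long double total = 0.0L;
    std::mutex total_mutex;
    std::vector<std::thread> workers;

    long begin = 0;
    for (int t = 0; t < nthreads; ++t) {
        // The last thread takes whatever remains of the range.
        long end = (t < nthreads - 1) ? begin + n / nthreads : n;
        workers.emplace_back([&total, &total_mutex, &increment, begin, end] {
            long double local = 0.0L;  // private to this thread, no lock needed
            for (long i = begin; i < end; ++i) {
                local += increment(i);
            }
            // One lock per thread instead of one lock per iteration.
            std::lock_guard<std::mutex> guard(total_mutex);
            total += local;
        });
        begin = end;
    }
    for (auto& w : workers) {
        w.join();
    }
    return total;
}
```

For example, `sum_with_local_accumulators(size, 4, slow)` would take the mutex four times in total rather than ten million times. The callable passed as `increment` has to be safe to call from several threads at once, which holds for `slow` and `fast` since they touch no shared state.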
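Otherwise, lock-free data structures offer an alternative. They avoid locking through the operating system, but many of them end up busy-waiting on each other if contention becomes too heavy. If that is not the case, they are well worth a look when you are after performance.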
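As a rough illustration of the lock-free approach, the sketch below adds to a shared `std::atomic<double>` with a compare-and-swap retry loop. The names `atomic_add` and `sum_with_atomic` are hypothetical, and `double` is used instead of `long double` on the assumption that `std::atomic<long double>` may not be lock-free on a given platform (this can be checked with `is_lock_free()`). The retry loop is also where the busy-waiting under heavy contention shows up.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: adds value to target without a mutex.
void atomic_add(std::atomic<double>& target, double value)
{
    double expected = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak stores the current value in `expected`,
    // so the loop simply retries with the fresh value. Under heavy contention
    // threads spin here instead of sleeping in the kernel.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
}

// Hypothetical driver: adds 1.0 to a shared total n times across nthreads.
double sum_with_atomic(long n, int nthreads)
{
    assert(nthreads > 0);
    std::atomic<double> total{0.0};
    std::vector<std::thread> workers;

    long begin = 0;
    for (int t = 0; t < nthreads; ++t) {
        long end = (t < nthreads - 1) ? begin + n / nthreads : n;
        workers.emplace_back([&total, begin, end] {
            for (long i = begin; i < end; ++i) {
                atomic_add(total, 1.0);
            }
        });
        begin = end;
    }
    for (auto& w : workers) {
        w.join();  // join makes all the relaxed updates visible to the load below
    }
    return total.load();
}
```

Unlike the unguarded loop in the question, this version loses no increments, but every addition still hits the same shared cache line, so it is not a substitute for accumulating per thread.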
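[1] As Solomon points out in the comments, if the mutex is uncontended, the lock operation can (on modern CPUs and operating systems) be done entirely in user space and is therefore much cheaper.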