Multithreaded C program is much slower on OS X than on Linux

Posted by wljmcqd8 on 2023-06-05 in Linux

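I wrote this for an OS class assignment that I have already completed and handed in. I posted this question yesterday, but because of "academic honesty" regulations I took it down until after the submission deadline.
The objective was to learn how to use critical sections. There is a data array holding 100 monotonically increasing numbers, 0...99, and 40 threads that each randomly swap two elements 2,000,000 times. Once a second a Checker goes through the array and makes sure there is exactly one of each number (which means no unsynchronized parallel access happened).
Here are the Linux times: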

real    0m5.102s
user    0m5.087s
sys     0m0.000s

And the OS X times:

real    6m54.139s
user    0m41.873s
sys     6m43.792s

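I am running a vagrant box with ubuntu/trusty64 on the same machine that runs OS X. It is a quad-core i7 2.3GHz (up to 3.2GHz) 2012 rMBP.
If I understand correctly, sys is system overhead that I have no control over, and even then, 41s of user time suggests that the threads may be running serially.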
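One way to check the "running serially" suspicion is to divide total CPU time (user + sys) by wall-clock time, which gives the average number of cores kept busy. The helper below is only an illustrative sketch; the name busy_cores is hypothetical and not part of the assignment code.

```c
/* Hypothetical helper: average number of cores kept busy during a run,
   computed as CPU time (user + sys) divided by wall-clock time.
   All arguments are in seconds. */
static double busy_cores(double real_s, double user_s, double sys_s)
{
    return (user_s + sys_s) / real_s;
}
```

With the timings above, busy_cores(5.102, 5.087, 0.000) is about 1.0 for Linux and busy_cores(414.139, 41.873, 403.792) is about 1.08 for OS X, so on both systems roughly one core's worth of work is in flight at any time, which is what a single global mutex would lead one to expect.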
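I can post all the code if needed, but I will post the bits I think are relevant. I am using pthreads because that is what Linux provides, but I assumed they work on OS X.
Creating the swapper threads to run the swapManyTimes routine: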

for (int i = 0; i < NUM_THREADS; i++) {
    int err = pthread_create(&(threads[i]), NULL, swapManyTimes, NULL);
}

Swapper thread critical section, which runs in a for loop 2 million times:

pthread_mutex_lock(&mutex);    // begin critical section
int tmpFirst = data[first];
data[first] = data[second];
data[second] = tmpFirst;
pthread_mutex_unlock(&mutex);  // end critical section

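Only one Checker thread is created, in the same way as the Swappers. It operates by going over the data array and marking the index corresponding to each value with true. Afterwards, it checks how many indices are still empty. Like so: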

pthread_mutex_lock(&mutex);
for (int i = 0; i < DATA_SIZE; i++) {
    int value = data[i];
    consistency[value] = 1;
}
pthread_mutex_unlock(&mutex);

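It runs once a second by calling sleep(1) at the end of each pass through its while(1) loop. After all the swapper threads are joined, this thread is cancelled and joined as well.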
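For readers who want the whole check in one place, here is a minimal sketch of a single Checker pass, including the "how many indices are empty" step that the snippet above leaves out. The names data, mutex and DATA_SIZE follow the snippets above; the function count_missing and the bounds check are assumptions made for illustration, not the original assignment code.

```c
#include <pthread.h>
#include <string.h>

#define DATA_SIZE 100

static int data[DATA_SIZE];
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: returns how many values in 0..DATA_SIZE-1 do not
   appear in data. A result of 0 means every value is present exactly once. */
static int count_missing(void)
{
    int consistency[DATA_SIZE];
    int missing = 0;

    memset(consistency, 0, sizeof consistency);

    /* Hold the lock only while reading the shared array. */
    pthread_mutex_lock(&mutex);
    for (int i = 0; i < DATA_SIZE; i++) {
        int value = data[i];
        if (value >= 0 && value < DATA_SIZE)   /* guard against corrupted values */
            consistency[value] = 1;
    }
    pthread_mutex_unlock(&mutex);

    /* Any index left unmarked is a value that was lost (and another duplicated). */
    for (int i = 0; i < DATA_SIZE; i++)
        if (!consistency[i])
            missing++;

    return missing;
}
```

The counting loop runs after the unlock because it only touches the local consistency array, which keeps the critical section as short as the one in the snippet above.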
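I would be happy to provide any more information that can help figure out why this performs so badly on the Mac. I am not really looking for help with code optimization, unless that is what is tripping up OS X. I have tried building it with both clang and gcc-4.9 on OS X.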

Answer #1 (a6b3iqyw)

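OS X and Linux implement pthreads differently, which causes this slow behavior. In particular, OS X does not use spin locks (pthread spin locks are a POSIX feature, originally an optional one, rather than part of ISO C). This can lead to very, very slow performance for code like this example.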
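To make the idea concrete, here is a minimal sketch of a "spin briefly, then block" acquire built only from standard pthread mutex calls. The helper name lock_with_spin and the spins parameter are hypothetical, and whether this actually helps on a given OS is an assumption that would need measuring; it is not how either system's pthread library is implemented.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: try the lock up to `spins` times without blocking,
   then fall back to an ordinary blocking lock. Returns 0 on success or a
   pthread error code. */
static int lock_with_spin(pthread_mutex_t *m, int spins)
{
    for (int i = 0; i < spins; i++) {
        int rc = pthread_mutex_trylock(m);
        if (rc == 0)
            return 0;       /* acquired without going to sleep */
        if (rc != EBUSY)
            return rc;      /* a real error, not just contention */
    }
    return pthread_mutex_lock(m);   /* give up spinning and block */
}
```

A swapper loop could call lock_with_spin(&mutex, 100) in place of pthread_mutex_lock(&mutex); the value 100 is an arbitrary starting point for experiments.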

Answer #2 (k3bvogb1)

I have largely reproduced your result (without the checker thread):

#include <stdlib.h>
#include <stdio.h>

#include <pthread.h>

pthread_mutex_t Lock;
pthread_t       LastThread;
int             Array[100];

void *foo(void *arg)
{
  pthread_t self  = pthread_self();
  int num_in_row  = 1;
  int num_streaks = 0;
  double avg_strk = 0.0;
  int i;

  for (i = 0; i < 1000000; ++i)
  {
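    /* Divide by RAND_MAX + 1.0 so the index stays in 0..99; dividing by
       RAND_MAX - 1 can yield 100 and index past the end of Array. */
    int p1 = (int) (100.0 * rand() / (RAND_MAX + 1.0));
    int p2 = (int) (100.0 * rand() / (RAND_MAX + 1.0));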

    pthread_mutex_lock(&Lock);
    {
      int tmp   = Array[p1];
      Array[p1] = Array[p2];
      Array[p2] = tmp;

      if (pthread_equal(LastThread, self))
        ++num_in_row;

      else
      {
        ++num_streaks;
        avg_strk += (num_in_row - avg_strk) / num_streaks;
        num_in_row = 1;
        LastThread = self;
      }
    }
    pthread_mutex_unlock(&Lock);
  }

  fprintf(stdout, "Thread exiting with avg streak length %lf\n", avg_strk);

  return NULL;
}

int main(int argc, char **argv)
{
  int       num_threads = (argc > 1 ? atoi(argv[1]) : 40);
  pthread_t thrs[num_threads];
  void     *ret;
  int       i;

  if (pthread_mutex_init(&Lock, NULL))
  {
    perror("pthread_mutex_init failed!");
    return 1;
  }

  for (i = 0; i < 100; ++i)
    Array[i] = i;

  for (i = 0; i < num_threads; ++i)
    if (pthread_create(&thrs[i], NULL, foo, NULL))
    {
      perror("pthread create failed!");
      return 1;
    }

  for (i = 0; i < num_threads; ++i)
    if (pthread_join(thrs[i], &ret))
    {
      perror("pthread join failed!");
      return 1;
    }

  /*
  for (i = 0; i < 100; ++i)
    printf("%d\n", Array[i]);

  printf("Goodbye!\n");
  */

  return 0;
}

On a Linux (2.6.18-308.24.1.el5) server, Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz:

[ltn@svg-dc60-t1 ~]$ time ./a.out 1

real    0m0.068s
user    0m0.068s
sys     0m0.001s
[ltn@svg-dc60-t1 ~]$ time ./a.out 2

real    0m0.378s
user    0m0.443s
sys     0m0.135s
[ltn@svg-dc60-t1 ~]$ time ./a.out 3

real    0m0.899s
user    0m0.956s
sys     0m0.941s
[ltn@svg-dc60-t1 ~]$ time ./a.out 4

real    0m1.472s
user    0m1.472s
sys     0m2.686s
[ltn@svg-dc60-t1 ~]$ time ./a.out 5

real    0m1.720s
user    0m1.660s
sys     0m4.591s

[ltn@svg-dc60-t1 ~]$ time ./a.out 40

real    0m11.245s
user    0m13.716s
sys     1m14.896s

On my MacBook Pro (Yosemite 10.10.2), 2.6 GHz i7, 16 GB memory:

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 1

real    0m0.057s
user    0m0.054s
sys     0m0.002s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 2

real    0m5.684s
user    0m1.148s
sys     0m5.353s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 3

real    0m8.946s
user    0m1.967s
sys     0m8.034s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 4

real    0m11.980s
user    0m2.274s
sys     0m10.801s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 5

real    0m15.680s
user    0m3.307s
sys     0m14.158s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 40

real    2m7.377s
user    0m23.926s
sys     2m2.434s

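My Mac takes roughly 11x as much wall-clock time to complete the 40-thread run (2m7.377s versus 0m11.245s), and that is against a very old version of Linux + gcc.
NOTE: I changed my code to do 1M swaps per thread.
It looks like, under contention, OS X is doing a lot more work than Linux. Maybe it interleaves the threads much more finely than Linux does?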

EDIT: Updated the code to record the average number of times a thread immediately re-acquires the lock:

Linux

[ltn@svg-dc60-t1 ~]$ time ./a.out 10
Thread exiting with avg streak length 2.103567
Thread exiting with avg streak length 2.156641
Thread exiting with avg streak length 2.101194
Thread exiting with avg streak length 2.068383
Thread exiting with avg streak length 2.110132
Thread exiting with avg streak length 2.046878
Thread exiting with avg streak length 2.087338
Thread exiting with avg streak length 2.049701
Thread exiting with avg streak length 2.041052
Thread exiting with avg streak length 2.048456

real    0m2.837s
user    0m3.012s
sys     0m16.040s

Mac OSX

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 10
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000

real    0m34.163s
user    0m5.902s
sys     0m30.329s

So OS X is sharing its lock much more evenly between threads, and therefore has many more thread suspensions and resumptions.
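As a rough worked example of what the streak numbers imply: each of the 10 threads takes the lock 1,000,000 times, so there are 10,000,000 acquisitions in total, and dividing that by the average streak length estimates how often the lock changes owner. The helper name estimated_handoffs is hypothetical, and this is only a back-of-the-envelope model of the output above.

```c
/* Hypothetical helper: estimated number of times the lock passes from one
   thread to another, given the total number of acquisitions and the average
   number of consecutive acquisitions by the same thread. */
static double estimated_handoffs(double total_acquisitions, double avg_streak)
{
    return total_acquisitions / avg_streak;
}
```

With a streak length of 1.0 (OS X) that is about 10 million handoffs, while a streak length around 2.08 (Linux) gives about 4.8 million, so the OS X run pays for roughly twice as many owner changes, each of which can involve suspending one thread and resuming another.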

Answer #3 (tyky79it)

The OP does not mention/show any code that indicates the thread(s) sleep, wait, give up execution, etc and all the threads are at the same 'nice' level.

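So an individual thread may well get the CPU and not release it until it has completed all 2 million iterations.
That would result in minimal time spent on context switches on Linux.
On Mac OS, however, a thread is only given a "time slice" to execute before another "ready to execute" thread/process is allowed to run.
This means many more context switches.
Context switches are accounted for in "sys" time.
The result is that Mac OS takes much longer to execute.
To level the playing field, you could force context switches by inserting a nanosleep() call, or by yielding the CPU via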

#include <sched.h>

then calling

int sched_yield(void);
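
A minimal sketch of what that could look like in a worker loop follows. The names yielding_worker and counter are hypothetical stand-ins for the swapper routine and the shared data array, and whether yielding actually evens out the two systems is an assumption, not something measured here.

```c
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter;   /* hypothetical shared state standing in for the data array */

/* Hypothetical worker: performs *arg locked updates and offers the CPU to
   other runnable threads after every critical section. */
static void *yielding_worker(void *arg)
{
    long iterations = *(long *)arg;

    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&mutex);
        counter++;                      /* the swap would go here */
        pthread_mutex_unlock(&mutex);
        sched_yield();                  /* voluntarily give up the CPU */
    }
    return NULL;
}
```

Each thread would be started with pthread_create(&t, NULL, yielding_worker, &n), where n is a long holding the iteration count and must outlive the threads.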

Answer #4 (gcmastyq)

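Just bumped into this question. Eight years on, OS X and MacBooks now run on Apple Silicon M2 chips. I took the program posted above and compiled it with cc -O2 thread1.cc -o thread1.
The machine is a 2023 MBP with an Apple M2 Max chip: 12 cores, 3.49GHz, no hyperthreading, 64GB RAM, OS X Ventura 13.2.
The compiler is: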

$ cc --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.3.0
Thread model: posix

Here are the results:

$ time  ./thread1 10
Thread exiting with avg streak length 43.901835
Thread exiting with avg streak length 46.534646
Thread exiting with avg streak length 42.759162
Thread exiting with avg streak length 49.811208
Thread exiting with avg streak length 45.170582
Thread exiting with avg streak length 40.739703
Thread exiting with avg streak length 43.436409
Thread exiting with avg streak length 41.932822
Thread exiting with avg streak length 46.498628
Thread exiting with avg streak length 49.501871

real    0m0.301s
user    0m0.244s
sys     0m2.476s

For comparison, the old OS X run from the answer above:

real    0m34.163s
user    0m5.902s
sys     0m30.329s

So basically about 100x faster. Here is the 40-thread run:

$ time  ./thread1 40
Thread exiting with avg streak length 40.080895
Thread exiting with avg streak length 40.563484
Thread exiting with avg streak length 41.033526
Thread exiting with avg streak length 40.101704
Thread exiting with avg streak length 40.923921
Thread exiting with avg streak length 39.206587
Thread exiting with avg streak length 41.269448
Thread exiting with avg streak length 40.074901
Thread exiting with avg streak length 39.983886
Thread exiting with avg streak length 39.298998
Thread exiting with avg streak length 41.183559
Thread exiting with avg streak length 40.596663
Thread exiting with avg streak length 40.342142
Thread exiting with avg streak length 40.017688
Thread exiting with avg streak length 40.077315
Thread exiting with avg streak length 42.075402
Thread exiting with avg streak length 40.006241
Thread exiting with avg streak length 39.011589
Thread exiting with avg streak length 39.169611
Thread exiting with avg streak length 40.141985
Thread exiting with avg streak length 38.546334
Thread exiting with avg streak length 40.282509
Thread exiting with avg streak length 39.643092
Thread exiting with avg streak length 39.955526
Thread exiting with avg streak length 40.576165
Thread exiting with avg streak length 40.821767
Thread exiting with avg streak length 40.840747
Thread exiting with avg streak length 40.670436
Thread exiting with avg streak length 40.204157
Thread exiting with avg streak length 38.819978
Thread exiting with avg streak length 40.324757
Thread exiting with avg streak length 39.573690
Thread exiting with avg streak length 40.386526
Thread exiting with avg streak length 39.140447
Thread exiting with avg streak length 38.662529
Thread exiting with avg streak length 40.751783
Thread exiting with avg streak length 37.928655
Thread exiting with avg streak length 38.996100
Thread exiting with avg streak length 41.535742
Thread exiting with avg streak length 39.928289

real    0m1.222s
user    0m1.019s
sys     0m12.889s

Very fast! Then I tried it on a fresh AWS Linux c5.4xlarge instance:

16 procs like this:
model name      : Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
stepping        : 7
microcode       : 0x5003501
cpu MHz         : 3621.221

32GB RAM

Compiled with:

$ cc --version
cc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15)

$ cc -O2 thread1.cc -lpthread -o thread1

And got this result:

$ time ./thread1 10
Thread exiting with avg streak length 2.319996
Thread exiting with avg streak length 2.286782
Thread exiting with avg streak length 2.281775
Thread exiting with avg streak length 2.316133
Thread exiting with avg streak length 2.266562
Thread exiting with avg streak length 2.268253
Thread exiting with avg streak length 2.270898
Thread exiting with avg streak length 2.243067
Thread exiting with avg streak length 2.249166
Thread exiting with avg streak length 2.221897

real    0m4.604s
user    0m11.333s
sys     0m31.416s

So the new Linux/16-proc machine is *slower* than the "old" Linux one. Interesting. Thanks to @jschultz410 for posting such a compact and highly portable piece of C code.
