Clang optimizes out RDTSC asm blocks, thinking the repeated block yields the same result as the previous block. Is this legal?

Posted by 9udxz4iz on 2023-10-19 in Other


Suppose we have several repetitions of the same asm containing RDTSC, for example:

volatile size_t tick1;
    asm ( "rdtsc\n"           // Returns the time in EDX:EAX.
          "shl $32, %%rdx\n"  // Shift the upper bits left.
          "or %%rdx, %q0"     // 'Or' in the lower bits.
          : "=a" (tick1)
          : 
          : "rdx");
    
    this_thread::sleep_for(1s);

    volatile size_t tick2;    
    asm ( "rdtsc\n"          // clang's optimizer just thinks this asm yields
          "shl $32, %%rdx\n" // the same bits as above, so it just loads
          "or %%rdx, %q0"    // the result to qword ptr [rsp + 8]
          : "=a" (tick2)     // 
          :                  //   mov     qword ptr [rsp + 8], rbx
          : "rdx");

    printf("tick2 - tick1 diff : %zu cycles\n", tick2 - tick1);
    printf("CPU Clock Speed    : %.2f GHz\n\n", (double) (tick2 - tick1) / 1'000'000'000.);

Clang++'s optimizer (even at -O1) assumes that the two asm blocks yield the same result:

tick2 - tick1 diff : 0 cycles
CPU Clock Speed    : 0.00 GHz

tick1              : bd806adf8b2
this_thread::sleep_for(1s)
tick2              : bd806adf8b2

With Clang's optimizer turned off, the second block yields advancing ticks as expected:

tick2 - tick1 diff : 2900160778 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 14ab6ab3391c
this_thread::sleep_for(1s)
tick2              : 14ac17902a26

At first glance, GCC's g++ "seems" to be unaffected by this:

tick2 - tick1 diff : 2900226898 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 20e40010d8a8
this_thread::sleep_for(1s)
tick2              : 20e4aceecbfa

However, let's add tick3 right after tick2, with exactly the same asm as tick2:

volatile size_t tick1;
    asm ( "rdtsc\n"           // Returns the time in EDX:EAX.
          "shl $32, %%rdx\n"  // Shift the upper bits left.
          "or %%rdx, %q0"     // 'Or' in the lower bits.
          : "=a" (tick1)
          : 
          : "rdx");
    
    this_thread::sleep_for(1s);

    volatile size_t tick2;    
    asm ( "rdtsc\n"          // clang's optimizer just thinks this asm yields
          "shl $32, %%rdx\n" // the same bits as above, so it just loads
          "or %%rdx, %q0"    // the result to qword ptr [rsp + 8]
          : "=a" (tick2)     // 
          :                  //   mov     qword ptr [rsp + 8], rbx
          : "rdx");

    volatile size_t tick3;
    asm ( "rdtsc\n"          
          "shl $32, %%rdx\n"   
          "or %%rdx, %q0"    
          : "=a" (tick3)
          : 
          : "rdx");

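It turns out that GCC assumes the asm for tick3 must yield the same value as the asm for tick2, because there are "obviously" no external side effects, so it simply reloads the value from tick2. That is wrong here, although the reasoning behind it is quite strong.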

tick2 - tick1 diff : 2900209182 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 5670bd15088e
this_thread::sleep_for(1s)
tick2              : 567169f2b6ac
tick3              : 567169f2b6ac

In C mode, the optimizers of both GCC and Clang are affected. In other words, even at -O1, both of them optimize away the repetitions of the asm block containing rdtsc:

tick2 - tick1 diff : 0 cycles
CPU Clock Speed    : 0.00 GHz

tick1              : 324ab8f5dd2a
thrd_sleep(&(struct timespec){.tv_sec=1}, nullptr)
tick2              : 324ab8f5dd2a
tick3_rdx          : 324b65d3368c

As it turns out, all of these optimizers may perform common-subexpression elimination on identical non-volatile asm statements, so the asm statement for RDTSC needs to be volatile.
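A minimal sketch of that fix, assuming x86-64 and GCC or Clang: the asm body is the one from the question, and the only change is the added volatile qualifier. The wrapper name read_tsc is made up for illustration and does not appear in the question.

```cpp
#include <cstdint>

// Hypothetical helper (not from the question): reads the time-stamp counter.
// `volatile` marks the statement as having side effects, so the optimizer may
// neither merge two identical statements into one nor drop one of them.
static inline std::uint64_t read_tsc() {
    std::uint64_t tick;
    asm volatile("rdtsc\n\t"           // Returns the time in EDX:EAX.
                 "shl $32, %%rdx\n\t"  // Shift the upper bits left.
                 "or %%rdx, %q0"       // 'Or' in the lower bits.
                 : "=a"(tick)
                 :
                 : "rdx");
    return tick;
}
```

Calling read_tsc() twice in a row, as with tick2 and tick3 in the question, should then execute RDTSC twice even with optimization enabled.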


Source: https://stackoverflow.com/questions/76903792/clang-optimizes-out-rdtsc-asm-blocks-thinking-the-repeated-block-yields-the-same

2 Answers


#1 fzsnzjdm

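The C++ standard doesn't cover inline assembly, so I'm not sure what your definition of "legal" is. The behavior you're seeing makes sense to me: you are running inline assembly for its side effects (i.e. your assembly doesn't implement a pure function), and you forgot to use the volatile keyword. From the GCC inline assembly documentation:
"The typical use of extended asm statements is to manipulate input values to produce output values. However, your asm statements may also produce side effects. If so, you may need to use the volatile qualifier to disable certain optimizations."
And also:
"GCC's optimizers sometimes discard asm statements if they determine there is no need for the output variables. Also, the optimizers may move code out of loops if they believe that the code will always return the same result (i.e. none of its input values change between calls). Using the volatile qualifier disables these optimizations."
If you insert the volatile keyword right after asm, the problem goes away.
P.S. Don't use inline assembly; just include x86intrin.h and use the __rdtsc() function.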
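As a rough sketch of the intrinsic-based approach, assuming an x86 target and GCC or Clang (both provide __rdtsc() through x86intrin.h), the measurement from the question could look like the following. The helper name estimate_tsc_ghz and the use of std::chrono::steady_clock for the wall-clock interval are assumptions made for illustration, not part of the original post.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <x86intrin.h>  // __rdtsc() on GCC and Clang (x86 only)

// Hypothetical helper: estimates the TSC frequency in GHz by counting
// ticks across a sleep of the requested length.
inline double estimate_tsc_ghz(std::chrono::milliseconds pause) {
    using clock = std::chrono::steady_clock;

    const auto wall_start = clock::now();
    const std::uint64_t tick_start = __rdtsc();

    std::this_thread::sleep_for(pause);

    const std::uint64_t tick_end = __rdtsc();
    const auto wall_end = clock::now();

    // Divide by the measured interval, since sleep_for may oversleep.
    const double seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    return static_cast<double>(tick_end - tick_start) / seconds / 1e9;
}
```

For example, estimate_tsc_ghz(std::chrono::seconds(1)) mirrors the one-second sleep in the question. The result is the rate at which the TSC advances, which on many CPUs is a fixed reference frequency rather than the current core clock.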


#2 wgeznvg7

Update

Thanks to @DavidGrayson for the great answer.

TL;DR

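Nothing beats the granularity of intrinsics.
GCC's optimizer is fully entitled (per the documentation) to make assumptions about asm, even if those assumptions are sometimes wrong.
The optimizer is not entitled to make assumptions about volatile asm blocks, other than moving them.
That said, the optimizer is fully entitled to move even volatile asm blocks, so two consecutive volatile asm blocks may be compiled into non-consecutive code.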

auto tick1  = __rdtsc();
    
    this_thread::sleep_for(1s);

    auto tick2  = __rdtsc();
    auto tick3  = __rdtsc();

    printf("tick2 - tick1 diff : %llu cycles\n", tick2 - tick1);
    printf("CPU Clock Speed    : %.2f GHz\n\n", (double) (tick2 - tick1) / 1'000'000'000.);

It just works.

tick2 - tick1 diff : 2900206596 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 3ee4e9f13612
this_thread::sleep_for(1s)
tick2              : 3ee596ceda16
tick3              : 3ee596ceda32

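Moreover, looking at the generated code, the compiler simply has more freedom to optimize: it can rearrange and even interleave the instructions in ways that hand-written asm cannot.
Clang: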

rdtsc
        mov     r14, rdx
        mov     rcx, rax
        rdtsc
        mov     r15, rdx
        shl     r14, 32
        or      r14, rcx
        shl     r15, 32
        or      r15, rax

GCC：

rdtsc
        mov     rbx, rax
        sal     rdx, 32
        or      rbx, rdx
        rdtsc
        mov     edi, OFFSET FLAT:.LC1
        mov     r13, rbx
        sal     rdx, 32
        mov     rbp, rax
        xor     eax, eax
        sub     r13, r12
        or      rbp, rdx
