assembly: Generating good add-with-carry code from Clang

piztneat · posted 2022-11-13 · in: Other
Follow (0) | Answers (4) | Views (318)

I'm trying to generate code (currently using clang++-3.8) that adds two numbers consisting of multiple machine words. To simplify things, for now I'm only adding 128-bit numbers, but I'd like to be able to generalize this.
First, some typedefs:

typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;

and a "result" type:

struct Result
{
  unsigned_word lo;
  unsigned_word hi;
};

The first function, f, takes two pairs of unsigned words and returns a Result. As an intermediate step, it puts each pair of 64-bit words into a 128-bit word before adding them, like this:

Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
  unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
  unsigned_128 r1 = n1 + n2;
  x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
  x.hi = r1 >> 64;
  return x;
}

It actually inlines quite nicely, like this:

movq    8(%rsp), %rsi
movq    (%rsp), %rbx
addq    24(%rsp), %rsi
adcq    16(%rsp), %rbx

Now, here is a simpler function written using Clang's multiprecision primitives:

static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_word carryout;
  x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
  x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
  return x;
}

This produces the following assembly:

movq    24(%rsp), %rsi
movq    (%rsp), %rbx
addq    16(%rsp), %rbx
addq    8(%rsp), %rsi
adcq    $0, %rbx

In this case there is an extra add. Instead of doing an ordinary add for the low words followed by an adc for the high words, it does a plain add for the high words, then a plain add for the low words, and then another adc on the high word with an argument of zero.
That might not look too bad, but when you try this with larger words (say 192 bits or 256 bits) you quickly get a mess of ors and other instructions dealing with the carries, rather than a simple chain of add, adc, adc, ... adc.
The multiprecision primitives seem to do a terrible job at exactly what they are intended to do.
So what I'm looking for is code that I could generalize to any length (no need to actually do so, just enough that I can figure out how), and which Clang compiles to additions as efficient as what it does with the built-in 128-bit type (which, unfortunately, I can't easily generalize). I presume this should just be a chain of adcs, but I welcome arguments and code saying it should be something else.


c90pui9n1#

There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015 and ICC 13 and ICC 15) do this efficiently. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.
Clang in addition has a built-in which one would think does this, __builtin_addcll , but it does not produce efficient code either.
The reason Visual Studio does this is that it does not allow inline assembly in 64-bit mode so the compiler should provide a way to do this with an intrinsic (though Microsoft took their time implementing this).
Therefore, with Visual Studio use _addcarry_u64 . With ICC use _addcarry_u64 or inline assembly. With Clang and GCC use inline assembly.
Note that since the Broadwell microarchitecture there are two new instructions, adcx and adox, which you can access with the _addcarryx_u64 intrinsic. Intel's documentation for these intrinsics used to be different from the assembly produced by the compiler, but it appears their documentation is correct now. However, Visual Studio still only appears to produce adcx with _addcarryx_u64, whereas ICC produces both adcx and adox with this intrinsic. But even though ICC produces both instructions, it does not produce the most optimal code (ICC 15), and so inline assembly is still necessary.
Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++ but others might disagree. The adc instruction has been in the x86 instruction set since 1979. I would not hold my breath on C/C++ compilers being able to optimally figure out when you want adc . Sure they can have built-in types such as __int128 but the moment you want a larger type that's not built-in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.
In terms of inline assembly code to do this, I already posted a solution for 256-bit addition of eight 64-bit integers in registers at multi-word addition using the carry flag.
Here is that code reposted.

#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
 __asm__ __volatile__ ( \
 "addq %[v1], %[u1] \n" \
 "adcq %[v2], %[u2] \n" \
 "adcq %[v3], %[u3] \n" \
 "adcq %[v4], %[u4] \n" \
 : [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
 : [v1]  "r" (Y1), [v2]  "r" (Y2), [v3]  "r" (Y3), [v4]  "r" (Y4))

If you want to explicitly load the values from memory you can do something like this

//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
     "movq (%[in]), %%rax\n"
     "addq %%rax, %[out]\n"
     "movq 8(%[in]), %%rax\n"
     "adcq %%rax, 8%[out]\n"
     "movq 16(%[in]), %%rax\n"
     "adcq %%rax, 16%[out]\n"
     "movq 24(%[in]), %%rax\n"
     "adcq %%rax, 24%[out]\n"
     : [out] "=m" (dst)
     : [in]"r" (src)
     : "%rax"
     );

That produces nearly identical assembly to the following function compiled with ICC:

void add256(uint256 *x, uint256 *y) {
    unsigned char c = 0;
    c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
    c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
    c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
        _addcarry_u64(c, x->x4, y->x4, &x->x4);
}

I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM) so maybe there are better inline assembly solutions.
"So what I'm looking for is code that I could generalize to any length"
To answer this part of the question, here is another solution using template metaprogramming. I used this same trick for loop unrolling. This produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64 efficiently, this would be a good general solution.

#include <x86intrin.h>
#include <inttypes.h>

#define LEN 4  // N = N*64-bit add e.g. 4=256-bit add, 3=192-bit add, ...

static unsigned char c = 0;

template<int START, int N>
struct Repeat {
    static void add (uint64_t *x, uint64_t *y) {
        c = _addcarry_u64(c, x[START], y[START], &x[START]);
        Repeat<START+1, N>::add(x,y);
    }
};

template<int N>
struct Repeat<LEN, N> {
    static void add (uint64_t *x, uint64_t *y) {}
};

void sum_unroll(uint64_t *x, uint64_t *y) {
    Repeat<0,LEN>::add(x,y);
}

Assembly from ICC

xorl      %r10d, %r10d                                  #12.13
movzbl    c(%rip), %eax                                 #12.13
cmpl      %eax, %r10d                                   #12.13
movq      (%rsi), %rdx                                  #12.13
adcq      %rdx, (%rdi)                                  #12.13
movq      8(%rsi), %rcx                                 #12.13
adcq      %rcx, 8(%rdi)                                 #12.13
movq      16(%rsi), %r8                                 #12.13
adcq      %r8, 16(%rdi)                                 #12.13
movq      24(%rsi), %r9                                 #12.13
adcq      %r9, 24(%rdi)                                 #12.13
setb      %r10b

Metaprogramming is a basic feature of assemblers, so it's too bad that C and C++ (except through template metaprogramming hacks) have no solution for this either (the D language does).
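As a side note, if the recursion-with-partial-specialization (and the file-scope carry variable) feels heavy, the same compile-time unrolling can be sketched with a C++17 fold expression. This variant is my own addition, not from the answer, and it uses the portable __builtin_add_overflow (GCC/Clang) instead of _addcarry_u64, so codegen quality will depend on the compiler version:

```cpp
#include <cstdint>
#include <cstddef>
#include <utility>

// One limb of a ripple-carry add: x += y + cin, returning the carry out.
// Two overflow checks are needed because adding the carry can wrap again,
// but at most one of the two additions can actually carry.
static inline bool addc_limb(uint64_t &x, uint64_t y, bool cin)
{
    bool c1 = __builtin_add_overflow(x, y, &x);
    bool c2 = __builtin_add_overflow(x, static_cast<uint64_t>(cin), &x);
    return c1 | c2;
}

// Unroll over the limbs at compile time; the comma fold evaluates
// left to right, so the carry ripples from limb 0 upward.
template<std::size_t... I>
static bool add_impl(uint64_t *x, const uint64_t *y, std::index_sequence<I...>)
{
    bool carry = false;
    ((carry = addc_limb(x[I], y[I], carry)), ...);
    return carry;
}

template<std::size_t N>
bool add_n(uint64_t (&x)[N], const uint64_t (&y)[N])
{
    return add_impl(x, y, std::make_index_sequence<N>{});
}
```

This avoids the hidden global state, but whether it actually compiles down to a single add/adc chain varies by compiler and version.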
The inline assembly I used above, which referenced memory, was causing some problems in a function. Here is a new version which seems to work better:

void foo(uint64_t *dst, uint64_t *src)
{
    __asm (
        "movq (%[in]), %%rax\n"
        "addq %%rax, (%[out])\n"
        "movq 8(%[in]), %%rax\n"
        "adcq %%rax, 8(%[out])\n"
        "movq 16(%[in]), %%rax\n"
        "adcq %%rax, 16(%[out])\n"
        "movq 24(%[in]), %%rax\n"
        "adcq %%rax, 24(%[out])\n"
        :
        : [in] "r" (src), [out] "r" (dst)
        : "%rax", "memory"  // "memory" clobber: the asm reads *src and writes *dst
    );
}

cld4siwp2#

On Clang 6, both __builtin_addcll and __builtin_add_overflow produce the same, optimal disassembly:

Result g(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_word carryout;
  x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
  x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
  return x;
}

Result h(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_word carryout;
  carryout = __builtin_add_overflow(lo1, lo2, &x.lo);
  carryout = __builtin_add_overflow(hi1, carryout, &hi1);
  __builtin_add_overflow(hi1, hi2, &x.hi);
  return x;
}

Assembly for both:

add rdi, rdx
adc rsi, rcx
mov rax, rdi
mov rdx, rsi
ret
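As a quick sanity check (my addition, not part of the answer), h can be verified against plain __uint128_t arithmetic, since both compute the sum modulo 2^128; the helper h_ref below is a hypothetical reference, and h is repeated so the snippet is self-contained:

```cpp
#include <cstdint>

typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;

struct Result { unsigned_word lo; unsigned_word hi; };

// h exactly as in the answer above.
Result h(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
    Result x;
    unsigned_word carryout;
    carryout = __builtin_add_overflow(lo1, lo2, &x.lo);
    carryout = __builtin_add_overflow(hi1, carryout, &hi1);
    __builtin_add_overflow(hi1, hi2, &x.hi);
    return x;
}

// Reference: do the same addition in one __uint128_t operation.
Result h_ref(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
    unsigned_128 r = ((unsigned_128(hi1) << 64) | lo1) + ((unsigned_128(hi2) << 64) | lo2);
    return Result{ unsigned_word(r), unsigned_word(r >> 64) };
}
```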

pdtvr36n3#

Starting with clang 5.0, you can get good results using __uint128_t addition and obtaining the carry bit by shifting:

inline uint64_t add_with_carry(uint64_t &a, const uint64_t &b, const uint64_t &c)
{
    __uint128_t s = __uint128_t(a) + b + c;
    a = s;
    return s >> 64;
}

In many cases clang still does strange things (I assume because of possible aliasing?), but usually copying one of the variables into a temporary helps.
Example usage:

template<int size> struct LongInt
{
    uint64_t data[size];
};

Manual usage:

void test(LongInt<3> &a, const LongInt<3> &b_)
{
    const LongInt<3> b = b_; // need to copy b_ into local temporary
    uint64_t c0 = add_with_carry(a.data[0], b.data[0], 0);
    uint64_t c1 = add_with_carry(a.data[1], b.data[1], c0);
    uint64_t c2 = add_with_carry(a.data[2], b.data[2], c1);
}

Generic solution:

template<int size>
void addTo(LongInt<size> &a, const LongInt<size> b)
{
    __uint128_t c = __uint128_t(a.data[0]) + b.data[0];
    a.data[0] = c;
    for(int i=1; i<size; ++i)
    {
        c = __uint128_t(a.data[i]) + b.data[i] + (c >> 64);
        a.data[i] = c;
    }
}

Godbolt link: all of the examples above compile down to only mov, add and adc instructions (starting with clang 5.0, and at least -O2).
The examples don't produce good code with gcc (up to 8.1, which is currently the newest version on godbolt), and I haven't yet managed to get anything usable out of __builtin_addcll ...
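The addTo loop above discards the final carry out of the most significant limb. A small variant (my own addition, not part of the answer) returns it instead, which is handy when the result needs to grow by a limb; it also relies on the GCC/Clang-specific __uint128_t:

```cpp
#include <cstdint>

template<int size> struct LongInt
{
    uint64_t data[size];
};

// Same __uint128_t ripple-carry loop as addTo, but the carry out of
// the most significant limb is returned instead of being dropped.
template<int size>
uint64_t addToWithCarry(LongInt<size> &a, const LongInt<size> b)
{
    __uint128_t c = __uint128_t(a.data[0]) + b.data[0];
    a.data[0] = static_cast<uint64_t>(c);
    for (int i = 1; i < size; ++i)
    {
        c = __uint128_t(a.data[i]) + b.data[i] + static_cast<uint64_t>(c >> 64);
        a.data[i] = static_cast<uint64_t>(c);
    }
    return static_cast<uint64_t>(c >> 64);
}
```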


rbpvctlc4#

Code using __builtin_addcll is fully optimized from Clang 10 onward, for chains of at least 3 (which require an adc with a variable carry input that also produces a carry output). Godbolt shows Clang 9 making a mess of setc/movzx for that case.
Clang 6 and later handle the 2-chain case well, as in @zneak's answer, where no carry-out from the adc is needed.
The idiomatic code without builtins is good too. Moreover, it works in every compiler and is also fully optimized by GCC 5+ for chains of 2 (add/adc, without using the carry-out of the adc). Writing correct C that produces a carry-out when there is also a carry-in is tricky, so this doesn't extend easily.

Result h (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  unsigned_word lo = lo1 + lo2;
  bool carry = lo < lo1;
  unsigned_word hi = hi1 + hi2 + carry;
  return Result{lo, hi};
}

https://godbolt.org/z/ThxGj1WGK
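For reference, the tricky part mentioned above, producing a carry-out when there is also a carry-in, can be written without builtins as below. This sketch is my own illustration, not from the answer, and compilers are not guaranteed to turn it into a single adc:

```cpp
#include <cstdint>

// Full-adder limb in plain C++: returns a + b + carry_in, setting *carry_out.
// Two wraparound checks are needed: a + b can wrap, and adding the carry
// can wrap again; at most one of the two additions can actually carry.
uint64_t addc64(uint64_t a, uint64_t b, bool carry_in, bool *carry_out)
{
    uint64_t s = a + b;
    bool c = s < a;        // did a + b wrap?
    uint64_t r = s + carry_in;
    c = c | (r < s);       // did adding the carry wrap?
    *carry_out = c;
    return r;
}
```

Chaining addc64 over an array of limbs generalizes the pattern to any width, at the cost of the extra comparison per limb that the compiler may or may not optimize away.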
