Do GCC and MSVC handle non-ASCII characters differently, or is the behavior undefined?

pftdvrlh  posted on 2022-11-12  in  Other

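Since I live in a non-English-speaking country, I wanted to run a test with char arrays and non-ASCII characters.
I compiled this code with both MSVC and MinGW GCC: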

#include <iostream>

int main()
{
    constexpr char const* c = "é";
    int i = 0;

    char const* s;

    for (s = c; *s; s++)
    {
        i++;
    }

    std::cout << "Size: " << i << std::endl;

    std::cout << "Char size: " << sizeof(char) << std::endl;
}

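Both print Char size: 1, but MSVC prints Size: 1 while MinGW GCC prints Size: 2.
Is this undefined behavior caused by the non-ASCII character, or is there another reason (perhaps GCC encodes in UTF-8 and MSVC in UTF-16)?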

deikduxw1#

The encoding used to map ordinary string literals to a sequence of code units is (mostly) implementation-defined.
GCC defaults to UTF-8, in which the character é takes two code units, and my guess is that MSVC uses code page 1252, in which the same character takes only one code unit. (That encoding uses a single code unit per character anyway.)
Compilers typically have switches to change the ordinary literal and execution character set encoding, e.g. the -fexec-charset option for GCC, or /execution-charset and /utf-8 for MSVC.
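One way to check which encoding a given compiler and switch combination actually used is to dump the bytes of the literal. The sketch below is only an illustration; the helper name hex_bytes is hypothetical and not part of any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (name chosen for this sketch): returns the code units
// of a narrow string as lowercase hex values separated by spaces.
std::string hex_bytes(const char* s)
{
    std::string out;
    for (; *s; ++s)
    {
        char buf[4];
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(*s)));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_bytes("é") should give "c3 a9" when the ordinary literal encoding is UTF-8 and "e9" when it is code page 1252, provided the compiler read the source file in the encoding it was saved in.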
Also be careful that the source file is encoded in an encoding that the compiler expects. If the file is UTF-8 encoded but the compiler expects it to be something else, then it is going to interpret the bytes in the file corresponding to the intended character é as a different (sequence of) characters. That is however independent of the ordinary literal encoding mentioned above. GCC for example has the -finput-charset option to explicitly choose the source encoding and defaults to UTF-8.
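To make the source-encoding mismatch concrete, here is a rough simulation of what happens when the two UTF-8 bytes of é are read as a single-byte encoding. It assumes ISO-8859-1, which agrees with code page 1252 for the bytes 0xA0 to 0xFF involved here; the function name latin1_to_utf8 is made up for this sketch.

```cpp
#include <string>

// Hypothetical helper: treats every input byte as one ISO-8859-1 character
// and re-encodes that character as UTF-8. This mimics a compiler that reads
// a UTF-8 file as a single-byte encoding and then emits UTF-8 literals.
std::string latin1_to_utf8(const std::string& bytes)
{
    std::string out;
    for (unsigned char b : bytes)
    {
        if (b < 0x80)
        {
            out += static_cast<char>(b);                 // ASCII stays one byte
        }
        else
        {
            out += static_cast<char>(0xC0 | (b >> 6));   // lead byte
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation byte
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes C3 A9 of é yields the four bytes C3 83 C2 A9, i.e. the two characters Ã and ©, so a loop like the one in the question would count 4 instead of 2 in that scenario.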
If you intend the literal to be UTF-8 encoded into bytes, then you should use u8-prefixed literals, which are guaranteed to use this encoding:

constexpr auto c = u8"é";

Note that auto here deduces to const char* in C++17, but to const char8_t* since C++20, so the type of s must be adjusted accordingly. This then guarantees an output of 2 for the length (number of code units). Similarly, there are the u and U prefixes for UTF-16 and UTF-32, in both of which é takes only one code unit, but the code units are 2 or 4 bytes in size (assuming CHAR_BIT == 8), with types char16_t and char32_t respectively.
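The code unit counts can be checked at compile time. The following is a minimal sketch, assuming C++14 or later; the template name code_units is invented for illustration, and é is written as the universal character name \u00E9 so that the result does not depend on how the source file is encoded.

```cpp
#include <cstddef>

// Hypothetical helper: counts code units up to the terminating null.
// As a template it accepts char (u8 literals in C++17), char8_t (C++20),
// char16_t and char32_t alike.
template <typename CharT>
constexpr std::size_t code_units(const CharT* s)
{
    std::size_t n = 0;
    while (s[n] != CharT{})
        ++n;
    return n;
}

// U+00E9 is é.
static_assert(code_units(u8"\u00E9") == 2, "UTF-8 uses two code units for U+00E9");
static_assert(code_units(u"\u00E9") == 1, "UTF-16 uses one code unit for U+00E9");
static_assert(code_units(U"\u00E9") == 1, "UTF-32 uses one code unit for U+00E9");
```

Unlike the ordinary literal in the question, these results should be the same on GCC and MSVC, because the encoding of prefixed literals is fixed by the standard.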
