Linux: How to read from the console into a char16_t buffer

qjp7pelc, posted 2023-11-17 in Linux

I'm working on Linux and have to read from the console into a char16_t buffer. Currently my code looks like this:

char tempBuf[1024] = {0};
int readBytes = read(STDIN_FILENO, tempBuf, 1024);
char16_t* buf = convertToChar16(tempBuf, readBytes);

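In the conversion function I use the mbrtoc16 standard library function to convert each character individually. Is this the only way to read from the console into a char16_t buffer? Do you know of any other solutions?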

Answer #1 (eiee3dmh)

Multi-byte characters

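The main thing to be careful about when reading into a fixed-length buffer is accidentally truncating a "multi-byte character" inside a "multi-byte string".
What is a multi-byte character, you ask? In my environment they are UTF-8 characters. For example, if I run echo $LANG, I get en_US.UTF-8. They are exactly what they sound like: characters that can be stored across multiple bytes. Any character outside the 7-bit ASCII set is stored in 2 or more bytes, laid out consecutively. If you read only part of a multi-byte character (truncating it), you end up with garbage on both sides of the read.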
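To make the truncation problem concrete, here is a minimal sketch of how a buffer's tail could be checked for a cut-off character. It assumes the input is UTF-8; the helper names `utf8_seq_len` and `utf8_incomplete_tail` are hypothetical and not part of any library, and the code inspects bit patterns only, without validating the text.

```c
#include <stddef.h>

/* Total length of a UTF-8 sequence as announced by its lead byte,
 * or 0 if the byte cannot start a sequence (continuation or invalid).
 * Hypothetical helper; it does not reject overlong encodings. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;           /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* 10xxxxxx or invalid */
}

/* Number of bytes at the end of buf that begin a character whose
 * remaining bytes have not arrived yet; 0 if buf ends on a character
 * boundary. Hypothetical helper, assumes UTF-8 input. */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back = 0;

    /* Walk backward over at most 4 bytes looking for a lead byte. */
    while (back < len && back < 4) {
        unsigned char b = buf[len - 1 - back];
        back++;
        if ((b & 0xC0) != 0x80) {        /* not a continuation byte */
            size_t need = utf8_seq_len(b);
            return (need > back) ? back : 0;
        }
    }
    return 0; /* no lead byte found: malformed input, nothing to hold back */
}
```

For the five bytes `a b c d 0xC3` this returns 1, because 0xC3 announces a two-byte character and only one byte is present. The caller can then hold those trailing bytes back until more input arrives.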
Let's look at a concrete example:

Example code

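In the complete, runnable file below I deliberately shortened the buffer to only 5 characters wide, just enough to hold one full 4-byte UTF-8 multi-byte character plus a null terminator.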

#include <stdio.h>
#include <unistd.h>
#include <string.h>

#define BUF_LEN 5

int main()
{
    /* you do your read assuming some byte length */
    char tempBuf[BUF_LEN] = {0};
    ssize_t readBytes = read(STDIN_FILENO, tempBuf, BUF_LEN);
    if (readBytes < 0)
    {
        perror("read");
        return 1;
    }

    /* If you try to print tempBuf with %s you'll overrun the buffer, since
     * a full read leaves no room for a null terminator, so we'll look at it
     * byte by byte */
    printf("Printing bytes:\n");
    for(size_t i = 0; i < (size_t)readBytes; i++)
    {
        printf( "\t%zu) 0x%02x -- %c\n",
                i,
                (unsigned char)tempBuf[i], 
                (unsigned char)tempBuf[i]);
        /* we cast to unsigned char because the bytes of a multi-byte UTF-8
         * character have the high bit set: as a signed char they are
         * negative and would be sign-extended, printing as 0xffffffc3 */
    }

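    /* so what do we do if the buffer ends in the middle of a character? we
     * push the offending bytes back into stdin */
    /* start at the last byte read and walk backward over every non-ASCII
     * byte (> 127) until we hit an ASCII byte or the start of the buffer.
     * note: the C standard only guarantees one byte of pushback per stream;
     * pushing back more than one relies on the implementation */
    printf("\nlet's back up\n");
    for (ssize_t k = (ssize_t)readBytes - 1;
         k >= 0 && (unsigned char)tempBuf[k] > 127; k--)
    {
        ungetc((unsigned char)tempBuf[k], stdin);
    }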
    printf("try again on that character\n");
    memset(tempBuf, 0, BUF_LEN); // set the buffer to zero again so what we 
                                 // read makes sense
    fgets(tempBuf, BUF_LEN, stdin);
    printf("Printing bytes again:\n");
    /* fgets does not report how many bytes it stored, so dump the whole
     * zero-filled buffer */
    for(size_t i = 0; i < BUF_LEN; i++)
    {
        printf( "\t%zu) 0x%02x -- %c\n",
                i,
                (unsigned char)tempBuf[i], 
                (unsigned char)tempBuf[i]);
        /* same unsigned char cast as in the first loop */
    }
    printf("Multi-byte string all at once: \"%s\"", tempBuf);
    
    return 0;
}


Running the example

With the code above I can construct an input that I know will deliberately split (truncate) a character, like this, and see what happens.

scott@scott-G3:~/tmp$ g++ -o stackoverflow_example stackoverflow_example.cpp 
scott@scott-G3:~/tmp$ ./stackoverflow_example 
abcdé
Printing bytes:
    0) 0x61 -- a
    1) 0x62 -- b
    2) 0x63 -- c
    3) 0x64 -- d
    4) 0xc3 -- �

let's back up
try again on that character
Printing bytes again:
    0) 0xc3 -- �
    1) 0xa9 -- �
    2) 0x0a -- 

    3) 0x00 -- 
    4) 0x00 -- 
Multi-byte string all at once: "é


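What happened?
In the example above I deliberately positioned the UTF-8 character "é", which expands to the two bytes 0xC3 0xA9, so that it would be cut off by your read call. I then used ungetc to put 0xC3 back into standard input and read it again together with its partner 0xA9. Only when the two sit next to each other do they make sense. You'll see a 0x0a following it, which we know and love as '\n', because the read also captured my Return key.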
