linux: How to read from the console into a char16_t buffer

Posted by qjp7pelc on 2023-11-17 in Linux


I am working on Linux. I need to read from the console into a char16_t buffer. Currently my code looks like this:

char tempBuf[1024] = {0};
int readBytes = read(STDIN_FILENO, tempBuf, 1024);
char16_t* buf = convertToChar16(tempBuf, readBytes);

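In the conversion function I use the standard library function mbrtoc16 to convert each character separately. Is this the only way to read from the console into a char16_t buf? Do you know of any other solution?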

linux

Source: https://stackoverflow.com/questions/76698205/how-to-read-from-the-console-to-char16-t-buffer

1 Answer

eiee3dmh1#

Multi-byte characters

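The main thing to be careful about when reading into a fixed-length buffer is accidentally truncating a "multi-byte character" inside a "multi-byte string".
What is a multi-byte character, you ask? In my environment they are UTF-8 characters; for example, if I run echo $LANG I get en_US.UTF-8. They are exactly what the name says: characters that can be stored in more than one byte. Any character outside the 7-bit ASCII set is stored in 2 or more consecutive bytes. If you read only some of the bytes of a multi-byte character (truncating it), you end up with garbage on both sides of the read.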
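The byte-count rule described above can be sketched in code. This is a minimal illustration, assuming the input is UTF-8; the helper names utf8SequenceLength and incompleteTailLength are hypothetical and do not come from the original answer. The first function classifies a byte by its high bits, and the second reports how many bytes at the end of a buffer belong to a character whose remaining bytes have not arrived yet. It is not a full UTF-8 validator (for instance, it does not reject overlong encodings).

```cpp
#include <cstddef>

// Hypothetical helper: total length of a UTF-8 sequence given its first
// byte. Returns 0 for a continuation byte or a byte that cannot start a
// sequence.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: 7-bit ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid lead byte
}

// Hypothetical helper: number of bytes at the end of buf[0..len) that
// start a character which is cut off by the end of the buffer.
std::size_t incompleteTailLength(const char* buf, std::size_t len)
{
    std::size_t back = 0;
    // A UTF-8 character is at most 4 bytes, so look back at most 4 bytes.
    while (back < len && back < 4) {
        unsigned char b = static_cast<unsigned char>(buf[len - 1 - back]);
        ++back;
        if ((b & 0xC0) != 0x80) {         // first non-continuation byte
            std::size_t need = utf8SequenceLength(b);
            return (need > back) ? back : 0;
        }
    }
    return 0;  // empty input or only continuation bytes: nothing to hold back
}
```

A caller could keep the reported tail bytes and prepend them to the next chunk instead of pushing them back into stdin.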
Let's look at a concrete example:

Example code

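In the complete, runnable file below I deliberately shortened the buffer to only 5 bytes, just enough to hold one complete 4-byte UTF-8 multi-byte character plus a null terminator.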

#include <stdio.h>
#include <unistd.h>
#include <string.h>

#define BUF_LEN 5

int main()
{
    /* you do your read assuming some byte length */
    char tempBuf[BUF_LEN] = {0};
    int readBytes = read(STDIN_FILENO, tempBuf, BUF_LEN);
    if (readBytes <= 0)
    {
        return 1; /* read error or end of input: nothing to inspect */
    }

    /* If you try to read from this tempBuffer with %s you'll overrun your
     * buffer since it doesn't have a null terminator, so we'll look at it
     * character by character */
    printf("Printing bytes:\n");
    for(size_t i = 0; i < (size_t)readBytes; i++)
    {
        printf( "\t%zu) 0x%02x -- %c\n",
                i,
                (unsigned char)tempBuf[i], 
                (unsigned char)tempBuf[i]);
        /* cast to unsigned char first: with a signed char, bytes >= 0x80
         * are negative and would be sign-extended when promoted to int,
         * so %02x would print e.g. 0xffffffc3 instead of 0xc3 */
    }

    /* so what do we do if we identify a bad byte? we put it back into stdin */
    /* start at the last byte actually read and walk backward until we hit
     * an ASCII byte; ISO C only guarantees one byte of pushback per
     * stream, so pushing back more relies on the C library */
    printf("\nlet's back up\n");
    int last = readBytes - 1;
    while(last >= 0 && ((unsigned char)tempBuf[last]) > 127)
    {
        ungetc((unsigned char)tempBuf[last--], stdin);
    }
    printf("try again on that character\n");
    memset(tempBuf, 0, BUF_LEN); // set the buffer to zero again so what we 
                                 // read makes sense
    fgets(tempBuf, BUF_LEN, stdin);
    printf("Printing bytes again:\n");
    for(size_t i = 0; i < (size_t)readBytes; i++)
    {
        printf( "\t%zu) 0x%02x -- %c\n",
                i,
                (unsigned char)tempBuf[i], 
                (unsigned char)tempBuf[i]);
        /* same unsigned char cast as in the first loop */
    }
    printf("Multi-byte string all at once: \"%s\"", tempBuf);
    
    return 0;
}


Running the example

With the code above, I can construct an input that I know will deliberately break (truncate) a character, like this, and see what happens.

scott@scott-G3:~/tmp$ g++ -o stackoverflow_example stackoverflow_example.cpp 
scott@scott-G3:~/tmp$ ./stackoverflow_example 
abcdé
Printing bytes:
    0) 0x61 -- a
    1) 0x62 -- b
    2) 0x63 -- c
    3) 0x64 -- d
    4) 0xc3 -- �

let's back up
try again on that character
Printing bytes again:
    0) 0xc3 -- �
    1) 0xa9 -- �
    2) 0x0a -- 

    3) 0x00 -- 
    4) 0x00 -- 
Multi-byte string all at once: "é

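What happened?
In the example above I deliberately positioned the UTF-8 character "é", which expands to the two bytes 0xC3, 0xA9, so that it would be cut off by your read call. I then used ungetc to put 0xC3 back into standard input and read it again together with its partner 0xA9. The two bytes only make sense when they sit next to each other. You will see a 0x0a following them, which we know and love as '\n', because the read also picked up my press of the Return key.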
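Coming back to the char16_t buffer from the question: a possible alternative to pushing bytes back with ungetc is to let mbrtoc16 carry the partial character itself. The sketch below assumes a UTF-8 locale has been selected with setlocale, and the helper name appendChunkAsChar16 is hypothetical (it is not the asker's convertToChar16). When a chunk ends in the middle of a character, mbrtoc16 returns (size_t)-2 and keeps the partial bytes in the mbstate_t, so passing the same state with the next chunk completes the character.

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: converts one chunk of locale-encoded bytes to UTF-16
// and appends the result to `out`. `state` must be zero-initialized before
// the first chunk and reused for every later chunk of the same stream.
// Returns false if the bytes are not valid in the current locale.
bool appendChunkAsChar16(const char* data, std::size_t len,
                         std::mbstate_t& state, std::u16string& out)
{
    std::size_t pos = 0;
    bool pendingLow = false;  // a low surrogate is still waiting to be fetched
    while (pos < len || pendingLow) {
        char16_t c16 = 0;
        std::size_t rc = std::mbrtoc16(&c16, data + pos, len - pos, &state);
        pendingLow = false;
        if (rc == static_cast<std::size_t>(-1)) {
            return false;     // invalid byte sequence
        }
        if (rc == static_cast<std::size_t>(-2)) {
            break;            // chunk ends mid-character; bytes kept in state
        }
        if (rc == static_cast<std::size_t>(-3)) {
            out.push_back(c16);  // second half of a surrogate pair, no input used
            continue;
        }
        out.push_back(c16);
        pos += (rc == 0) ? 1 : rc;  // rc == 0 means a null character was decoded
        pendingLow = (c16 >= 0xD800 && c16 <= 0xDBFF);
    }
    return true;
}
```

A read loop could then call setlocale(LC_ALL, "") once, zero-initialize a single std::mbstate_t, and pass each buffer returned by read to this helper, without special handling for a character split between two reads.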
