How to check if a string has valid UTF-8 characters in C++?

Posted by 67up9zun on 2023-04-01 in Other


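I am trying to use the ICU library to test whether a string contains invalid UTF-8 characters. I created a UTF-8 converter, but no invalid data produces a conversion error. I would appreciate your help.
Thanks, Prashanth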

#include <unicode/ucnv.h>
#include <cassert>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    string str("AP1120 CorNet-IP v5.0 v5.0.1.22 òÀ MIB 1.5.3.50 Profile EN-C5000");
    //  string str("example string here");
    //  string str(" ����������");
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    int32_t sourceLength = static_cast<int32_t>(str.length());
    int32_t targetLimit = 2 * sourceLength;
    UChar *target = new UChar[targetLimit];

    // Convert the bytes to UTF-16 and print the resulting status name.
    ucnv_toUChars(cnv, target, targetLimit, str.c_str(), sourceLength, &status);
    cout << u_errorName(status) << endl;
    assert(U_SUCCESS(status));

    // Release the buffer and the converter.
    delete[] target;
    ucnv_close(cnv);
}

c++

Source: https://stackoverflow.com/questions/9539661/how-to-check-if-a-string-has-valid-utf-8-characters-in-c

2 Answers


qcbq4gxm1#

I modified your program to print the actual strings, before and after conversion:

#include <unicode/ucnv.h>
#include <string>
#include <iostream>
#include <cassert>
#include <cstdio>

int main()
{
    std::string str("22 òÀ MIB 1");
    UErrorCode status = U_ZERO_ERROR;
    UConverter * const cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    const int32_t targetLimit = static_cast<int32_t>(2 * str.size());
    UChar *target = new UChar[targetLimit];

    // A source length of -1 tells ICU that the input is NUL-terminated.
    ucnv_toUChars(cnv, target, targetLimit, str.c_str(), -1, &status);

    // Dump the converted UTF-16 code units.
    for (int32_t i = 0; i != targetLimit && target[i] != 0; ++i)
        std::printf("0x%04X ", static_cast<unsigned int>(target[i]));
    std::cout << std::endl;
    // Dump the raw input bytes.
    for (char c : str)
        std::printf("0x%02X ", static_cast<unsigned char>(c));
    std::cout << std::endl << "Status: " << status << std::endl;

    // Release the buffer and the converter.
    delete[] target;
    ucnv_close(cnv);
}

Now, with the default compiler settings, I get:

0x0032 0x0032 0x0020 0x00F2 0x00C0 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031
0x32 0x32 0x20 0xC3 0xB2 0xC3 0x80 0x20 0x4D 0x49 0x42 0x20 0x31

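In other words, the input is already UTF-8. This is a conspiracy between my editor, which saves the file as UTF-8 (verifiable in a hex editor), and GCC, which sets the execution character set to UTF-8.
You can force GCC to change these parameters. For example, forcing the execution character set to ISO-8859-1 (via -fexec-charset=iso-8859-1) produces the following: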

0x0032 0x0032 0x0020 0xFFFD 0xFFFD 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031
0x32 0x32 0x20 0xF2 0xC0 0x20 0x4D 0x49 0x42 0x20 0x31

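As you can see, the input is now ISO-8859-1 encoded, and the conversion promptly fails and produces the "invalid character" code point U+FFFD.
However, the conversion operation still returns a "success" status. It appears that the library does not treat a conversion error in the user data as an error of the function call. Instead, the error status seems to be reserved for things like running out of space.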

2023-04-01

nqwrtyyt2#

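I use this code. It detects all candidate charsets for the string and compares each charset name against "UTF-8". As soon as one matches, the loop breaks and the function returns true; it returns false if no charset is detected or none of the charset names equals "UTF-8".

References

ICU documentation: CharsetDetector

Code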

#include <cstdlib>
#include <iostream>
#include <string>
#include <unicode/ucsdet.h>

#define UTF8_CHARSET_NAME_STRING ("UTF-8"s)

using namespace std::string_literals;

bool IsValidUTF8(const std::string &data)
{
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector *detector = ucsdet_open(&status);
    if (U_FAILURE(status))
    {
        return false;
    }
    ucsdet_setText(detector, data.c_str(), static_cast<int32_t>(data.length()), &status);
    int32_t detectedNumber = 0;
    auto matches = ucsdet_detectAll(detector, &detectedNumber, &status);
    bool valid = false;
    if (U_SUCCESS(status) && matches)
    {
        // Walk through every candidate charset and look for "UTF-8".
        for (int32_t i = 0; i < detectedNumber; i++)
        {
            const char *charsetName = ucsdet_getName(matches[i], &status);
            if (charsetName && UTF8_CHARSET_NAME_STRING == charsetName)
            {
                valid = true;
                break;
            }
        }
    }
    // Close the detector on every path so it is not leaked.
    ucsdet_close(detector);
    return valid;
}

int main()
{
    std::string strData = {(char)0xff, 0x25, 0x00, (char)0xfa, (char)0xff, (char)0xff, (char)0xff};
    std::cout << "Result: " << (IsValidUTF8(strData) ? ("true; Original String : \"" + strData + "\"") : "false") << std::endl;

    strData = "🤣😀HelloWorld!!!😀🤣";
    std::cout << "Result: " << (IsValidUTF8(strData) ? ("true; Original String : \"" + strData + "\"") : "false") << std::endl;
    return EXIT_SUCCESS; // 0
}

Output

Result: false
Result: true; Original String : "🤣😀HelloWorld!!!😀🤣"

2023-04-01
