How do I capitalize Polish special letters in C++?

Posted by xwmevbvl on 2023-08-09 in Other


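I have a string that I want to capitalize, but it may contain Polish special letters (ą, ć, ę, ł, ń, ó, ś, ż, ź). The call transform(string.begin(), string.end(), string.begin(), ::toupper); only capitalizes the Latin letters, so I wrote a function like this: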

string to_upper(string nazwa)
    {
        transform(nazwa.begin(), nazwa.end(), nazwa.begin(), ::toupper);

        for (int i = 0; i < (int)nazwa.size(); i++)
        {
            switch(nazwa[i])
            {
                case u'ą':
                {
                    nazwa[i] = u'Ą';
                    break;
                }
                case u'ć':
                {
                    nazwa[i] = u'Ć';
                    break;
                }
                case u'ę':
                {
                    nazwa[i] = u'Ę';
                    break;
                }
                case u'ó':
                {
                    nazwa[i] = u'Ó';
                    break;
                }
                case u'ł':
                {
                    nazwa[i] = u'Ł';
                    break;
                }
                case u'ń':
                {
                    nazwa[i] = u'Ń';
                    break;
                }
                case u'ś':
                {
                    nazwa[i] = u'Ś';
                    break;
                }
                case u'ż':
                {
                    nazwa[i] = u'Ż';
                    break;
                }
                case u'ź':
                {
                    nazwa[i] = u'Ź';
                    break;
                }
            }
        }

        return nazwa;
    }

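I also tried using if instead of switch, but it didn't change anything. In Qt Creator, next to every capital letter to be inserted except u'Ó', an error like this appears: Implicit conversion from 'char16_t' to 'std::basic_string<char>::value_type' (aka 'char') changes value from 260 to 4 (this one is from u'Ą'). After running the program, the characters in the string are not swapped.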

c++

Source: https://stackoverflow.com/questions/76818615/how-to-capitalize-polish-special-letters-in-c

3 answers


mrfwxfqh1#

Source of the problem

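std::string stores its characters as chars, which are one byte long, so their values can only range from 0 to 255.
That makes it impossible to store u'ą' in a single char, for example, because the Unicode value of ą is 0x105 (261 in decimal, which is above 255).
To get around this problem, UTF-8 was invented: a character encoding standard that can encode any Unicode character as bytes. Characters with higher values of course need more than one byte.
Your std::string most likely holds UTF-8-encoded characters. (I say most likely because your code doesn't state it directly, but it is almost 100% certain, since UTF-8 is the only universal way to encode accented letters in a string of chars. To be 100% sure you would need to check Qt's code, since that seems to be what you are using.)
The consequence is that you can't simply iterate over the chars of your std::string with a for loop, because that assumes one char equals one letter, which is simply not true.
In the case of ą, for example, it is encoded as the bytes C4 85, so you get one char with value 0xC4 (= 196) followed by another char with value 0x85 (= 133).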
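Both effects described above can be checked in a few lines. This is only a minimal sketch: the helper names narrow_to_byte and utf8_length_of_a_ogonek are made up for illustration, and the UTF-8 bytes are written as escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <string>

// Narrowing a UTF-16 code unit to a single byte keeps only its low 8 bits.
// unsigned char is used so that the conversion is well defined.
// (Hypothetical helper, for illustration only.)
inline unsigned char narrow_to_byte(char16_t c)
{
    return static_cast<unsigned char>(c);
}

// Number of chars a std::string needs to hold a UTF-8 encoded "ą".
// (Hypothetical helper, for illustration only.)
inline std::size_t utf8_length_of_a_ogonek()
{
    const std::string s = "\xC4\x85"; // UTF-8 bytes for U+0105 (ą)
    return s.size();
}
```

Narrowing u'Ą' (260, i.e. 0x104) leaves 4, which matches the value in the Qt Creator warning, and the single letter ą occupies two chars in the string.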

The specific case of uppercase characters

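Fortunately, the Latin Extended-A part of the Unicode table (archive) shows that each of these special uppercase letters appears right before its lowercase counterpart.
Not only that, we can also see that:

From Unicode index 0x100 to 0x137 (both inclusive), lowercase letters are at odd indices.
From 0x139 to 0x148 (both inclusive), lowercase letters are at even indices.
From 0x14A to 0x177 (both inclusive), lowercase letters are at odd indices.
From 0x179 to 0x17E (both inclusive), lowercase letters are at even indices.

This makes converting a lowercase code point to uppercase easier, because all we need to do is check whether the character's index corresponds to a lowercase letter and, if so, subtract 1 to make it uppercase.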
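The parity rule above can be sketched as a small function on code points. The name upper_latin_extended_a is an assumption made for this example; the ranges are the four listed above.

```cpp
#include <cstdint>

// Map a lowercase Latin Extended-A code point to the code point just before
// it, following the odd/even ranges listed above. Anything else is returned
// unchanged. (Hypothetical helper name, for illustration only.)
constexpr std::uint32_t upper_latin_extended_a(std::uint32_t cp)
{
    const bool lower_is_odd  = (cp >= 0x100 && cp <= 0x137) || (cp >= 0x14A && cp <= 0x177);
    const bool lower_is_even = (cp >= 0x139 && cp <= 0x148) || (cp >= 0x179 && cp <= 0x17E);
    const bool is_odd = (cp % 2) == 1;

    return ((lower_is_odd && is_odd) || (lower_is_even && !is_odd)) ? cp - 1 : cp;
}
```

For example, ą (0x105) becomes Ą (0x104), ł (0x142) becomes Ł (0x141) and ż (0x17C) becomes Ż (0x17B). One caveat of such a purely positional rule: the dotless ı (0x131) is mapped to İ (0x130), which is its table neighbour rather than its usual uppercase I, so the rule is best treated as a shortcut for Polish letters and not as a general case mapping.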

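Encoding one of these characters in UTF-8

To encode in UTF-8 (source):

Convert the code point (the Unicode value, if you prefer to call it that) to binary.
The first byte of the UTF-8-encoded character has the binary value 110xxxxx; replace xxxxx with the high five bits of the character's binary code point.
The second byte has the binary value 10xxxxxx; replace xxxxxx with the low six bits of the character's binary code point.

So for ą, the value is 0x105 in hexadecimal, which is 00100 000101 in binary.
The first byte is then 11000100 (= 0xC4).
The second byte is then 10000101 (= 0x85).
Note that this encoding "technique" only works because the values (code points) of the characters to capitalize are between 0x80 and 0x7FF. The scheme varies depending on how large the value is; see the UTF-8 documentation.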
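As a sketch of the two-byte scheme described above (valid only for code points from 0x80 to 0x7FF), with hypothetical helper names encode_utf8_2 and decode_utf8_2:

```cpp
#include <array>
#include <cstdint>

// Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes:
// 110xxxxx carries the high five bits, 10xxxxxx the low six bits.
// (Hypothetical helper, for illustration only.)
inline std::array<unsigned char, 2> encode_utf8_2(std::uint16_t cp)
{
    return {{ static_cast<unsigned char>(0xC0 | ((cp >> 6) & 0x1F)),
              static_cast<unsigned char>(0x80 | (cp & 0x3F)) }};
}

// Inverse operation: rebuild the code point from the two bytes.
// (Hypothetical helper, for illustration only.)
inline std::uint16_t decode_utf8_2(unsigned char b1, unsigned char b2)
{
    return static_cast<std::uint16_t>(((b1 & 0x1F) << 6) | (b2 & 0x3F));
}
```

With ą (0x105) this yields the bytes 0xC4 and 0x85, and decoding those two bytes gives 0x105 back.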

Fixing the code

I have rewritten your to_upper function based on what I've written so far:

string to_upper(string nazwa)
{
    for (int i = 0; i < (int)nazwa.size(); i++)
    {
        // Getting the current character we are working with
        char chr1 = nazwa[i];

        // We want to find UTF-8-encoded Polish letters here
        // So we are looking for a character that has first three bits set to 110,
        // as all Polish letters encoded in UTF-8 are in UTF-8 Class 1 and therefore
        // are two bytes long, the first byte being of binary value 110xxxxx
        if(((chr1 >> 5) & 0b111) != 0b110) {
            // Cast to unsigned char first: passing a negative char to toupper is undefined behavior
            nazwa[i] = toupper(static_cast<unsigned char>(chr1)); // Do the std toupper here for regular characters
            continue;
        }

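        // If we are here, then the character we are dealing with is two bytes long, so get its value.
        // We won't need to check for that second byte during next iteration, so we increment i
        i++;
        if (i >= (int)nazwa.size())
            break; // Truncated sequence at the end of the string: nothing left to convert
        char chr2 = nazwa[i];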

        // Get the unicode value of the encoded character
        uint16_t fullChr = ((chr1 & 0b11111) << 6) | (chr2 & 0b111111);

        // Get the various conditions to check for lowercase code points
        bool lowercaseIsOdd =  (fullChr >= 0x100 && fullChr <= 0x137) || (fullChr >= 0x14A && fullChr <= 0x177);
        bool lowercaseIsEven = (fullChr >= 0x139 && fullChr <= 0x148) || (fullChr >= 0x179 && fullChr <= 0x17E);
        bool chrIndexIsOdd =   (fullChr % 2) == 1;

        // Depending on whether the code point needs to be odd or even to be lowercase, and on whether
        // the code point is odd or even, decrease it by one to make it uppercase
        if((lowercaseIsOdd && chrIndexIsOdd)
        || (lowercaseIsEven && !chrIndexIsOdd))
            fullChr--;

        // Support for some additional, more commonly used accented letters
        if(fullChr >= 0xE0 && fullChr <= 0xF6)
            fullChr -= 0x20;

        // Re-encode the character point in UTF-8
        nazwa[i-1] = (0b110 << 5) | ((fullChr >> 6) & 0b11111); // We incremented i earlier, so subtract one to edit the first byte of the letter we're encoding
        nazwa[i] = (0b10 << 6) | (fullChr & 0b111111);
    }

    return nazwa;
}


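Note: don't forget to #include <cstdint> for uint16_t to work.
Note 2: I added support for some Latin-1 Supplement (archive) letters, since you asked for it in the comments. Here we subtract 0x20 from the lowercase code point to get the uppercase one, but the principle is almost the same as for the other letters mentioned in this answer.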
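The Latin-1 Supplement rule can be sketched on its own. The answer's code handles 0xE0 to 0xF6; the version below also covers 0xF8 to 0xFE, which is an extension not present in the original code, and skips 0xF7 because that code point is the division sign. The function name upper_latin1 is hypothetical.

```cpp
#include <cstdint>

// Latin-1 Supplement: lowercase letters sit 0x20 above their uppercase forms.
// 0xF7 (the division sign) is excluded, and 0xDF (ß) and 0xFF (ÿ) are left
// alone because their uppercase forms are not 0x20 below them.
// (Hypothetical helper; the 0xF8..0xFE range is an addition to the answer's code.)
constexpr std::uint32_t upper_latin1(std::uint32_t cp)
{
    const bool is_lower = (cp >= 0xE0 && cp <= 0xFE) && cp != 0xF7;
    return is_lower ? cp - 0x20 : cp;
}
```

For the Polish alphabet only ó (0xF3, giving Ó at 0xD3) lives in this block; the remaining special letters are in Latin Extended-A.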

I have included a lot of comments in my code; please consider reading them for a better understanding.
I tested it with the string "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž", and it converted it to "ĀĀĂĂĄĄĆĆĈĈĊĊČČĎĎĐĐĒĒĔĔĖĖĘĘĚĚĜĜĞĞĠĠĢĢĤĤĦĦĨĨĪĪĬĬĮĮİİĲĲĴĴĶĶĸĹĹĻĻĽĽĿĿŁŁŃŃŅŅŇŇŊŊŌŌŎŎŐŐŒŒŔŔŖŖŘŘŚŚŜŜŞŞŠŠŢŢŤŤŦŦŨŨŪŪŬŬŮŮŰŰŲŲŴŴŶŶŸŹŹŻŻŽŽ", so it works well:

int main() {
    string str1 = "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž";
    string str2 = to_upper(str1);

    printf("str1: %s\n", str1.c_str());
    printf("str2: %s\n", str2.c_str());
}


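Note: most terminals use UTF-8 by default, and so do Qt labels; basically everything uses UTF-8 except Windows CMD. So if you test the code above in Windows CMD or PowerShell, you need to switch them to UTF-8 with the command chcp 65001, or add a Windows API call that changes the CMD encoding when the program runs.
Note 2: when you write raw strings directly in your code, your compiler will encode them in UTF-8 by default. That is why my version of to_upper works with Polish letters written directly in the code, without further modification. When I say everything uses UTF-8, I mean it.
Note 3: I kept it to avoid causing problems with your current code, but you use string instead of std::string, which means your code has a using namespace std; somewhere. In that case, see Why is "using namespace std;" considered bad practice?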
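The Windows API call mentioned in the first note could look like the sketch below. SetConsoleOutputCP and SetConsoleCP are Win32 console functions; the wrapper name use_utf8_console and the assumption that non-Windows terminals are already UTF-8 are mine, for illustration only.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Switch the Windows console to UTF-8 (code page 65001), the programmatic
// equivalent of running "chcp 65001" before starting the program.
// On other platforms this does nothing and reports success, on the
// assumption that the terminal already uses UTF-8.
// (Hypothetical wrapper, for illustration only.)
inline bool use_utf8_console()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0 && SetConsoleCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

Calling it once at the start of main, before any output is printed, should be enough for the test program above.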

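A note on the other answers

Keep in mind that my answer is very specific. Its purpose is, as you asked, to capitalize Polish letters.
The other answers rely on std features, which are clearly more general and work for all languages, so I invite you to give them a look.
Relying on existing functionality is always better than reinventing the wheel, but I think it is also good to have a homemade alternative, which may be easier to understand and sometimes more efficient.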

Posted 2023-08-09

ccrfmcuu2#

The simplest way to handle this is to use wide strings. The only pitfall is handling the encoding/locale correctly.
Try this:

#include <algorithm>
#include <iostream>
#include <locale>
#include <string>

int main()
try {
    std::locale cLocale{ "C.UTF-8" };
    std::locale::global(cLocale);

    std::locale sys { "" };
    std::wcin.imbue(sys);
    std::wcout.imbue(sys);

    std::wstring line;
    while (getline(std::wcin, line)) {
        std::transform(line.begin(), line.end(), line.begin(), [&cLocale](auto ch) { return std::toupper(ch, cLocale); });
        std::wcout << line << L'\n';
    }
} catch (const std::exception& e) {
    std::cerr << e.what() << '\n';
}

https://godbolt.org/z/3cKaEeW3z
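Now:

cLocale defines the locale the standard library uses when interacting with your program.
sys is the system locale; it defines which encoding the input and output streams should use. Note which overload of toupper is used.

The same code works with std::string and std::cin/std::cout only if you use a single-byte encoding that is suitable for Polish. In that case, change the string in cLocale to: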

#include <algorithm>
#include <iostream>
#include <locale>
#include <string>

int main()
try {
    std::locale cLocale{ ".1250" };
    std::locale::global(cLocale);

    std::locale sys { "" };
    std::cin.imbue(sys);
    std::cout.imbue(sys);

    std::string line;
    while (getline(std::cin, line)) {
        std::transform(line.begin(), line.end(), line.begin(), [&cLocale](auto ch) { return std::toupper(ch, cLocale); });
        std::cout << line << '\n';
    }
} catch (const std::exception& e) {
    std::cerr << e.what() << '\n';
}

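Note that this locale name is platform- and compiler-dependent, and the system must be configured for it to work. The code above works on Windows with MSVC (I have tested it). I can't provide a demo, because no online compiler supports a Polish locale.
If a multi-byte encoding is used, the conversion fails, because toupper sees one char at a time and cannot handle a multi-byte character.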
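The multi-byte limitation is easy to reproduce. The sketch below applies the same per-char std::toupper to a UTF-8 string; the function name upper_bytewise is hypothetical, and the classic "C" locale is used in the example because it is available everywhere.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Uppercase a string one char at a time with the given locale, exactly like
// the std::string variant above. Each byte of a UTF-8 sequence is looked at
// in isolation. (Hypothetical helper, for illustration only.)
inline std::string upper_bytewise(std::string s, const std::locale& loc)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [&loc](char ch) { return std::toupper(ch, loc); });
    return s;
}
```

With std::locale::classic(), the input "a\xC4\x85" (the letter a followed by UTF-8 ą) comes back as "A\xC4\x85": the ASCII letter is capitalized while the two bytes of ą pass through untouched.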

Posted 2023-08-09

3npbholx3#

This should work on most Unix-y systems, apart from odd cases such as the Turkish I and the German ß.

#include <clocale>
#include <locale>
#include <iostream>
#include <string>
#include <cwctype>
#include <codecvt>

inline std::wstring stow(const std::string& p)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> wconv;
    return wconv.from_bytes(p);
}

inline std::string wtos(const std::wstring& p)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> wconv;
    return wconv.to_bytes(p);
}

int main()
{
    std::locale loc("");

    // AFAICT the calls below are optional on a Mac 
    // for this particular task but it could be a 
    // good idea to use them anyway
    // std::setlocale(LC_ALL, "");
    // std::locale::global(loc);
    // std::cin.imbue(loc);
    // std::cout.imbue(loc);

    std::string s;
    std::getline(std::cin, s);

    std::wstring w = stow(s);
    for (auto& c: w)
    {
        c = std::toupper(c, loc);
    }

    std::cout << wtos(w) << "\n";
}

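Note that it uses deprecated C++ facilities for the UTF-8 conversion. If that bothers you, substitute any UTF-8 to UTF-32 converter and its inverse in stow and wtos. You can also swap in any locale that exists on your system (for example "pl_PL.UTF-8" or similar).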
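One possible replacement for the deprecated converters is a small hand-written UTF-8/UTF-32 pair. The sketch below is only an illustration: the names utf8_to_utf32 and utf32_to_utf8 are made up, and the decoder checks the sequence structure but does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed or truncated input.
inline std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode UTF-32 code points as UTF-8 bytes.
inline std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x110000) {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}
```

To plug this into the code above, the code points would still have to be copied into a std::wstring before calling std::toupper(c, loc), which assumes a 32-bit wchar_t as found on most Unix-like systems.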

Posted 2023-08-09
