How do I truncate a Java string to fit in a given number of bytes, once UTF-8 encoded?

Posted by 2hh7jdfx on 2023-05-21 in Java

How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

Source: https://stackoverflow.com/questions/119328/how-do-i-truncate-a-java-string-to-fit-in-a-given-number-of-bytes-once-utf-8-en

8 Answers

dvtswwa31#

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when the limit would be exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

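This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated string may be shorter than it needs to be.
I haven't done a lot of testing on this code, but here are some preliminary tests: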

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Updated: modified the code example; it now handles surrogate pairs.
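
As a quick sanity check of the byte widths the loop relies on, here is a small sketch that simply asks the JDK how many bytes each kind of character takes. The class name Utf8Widths and the helper width() are made up for illustration; the only API assumed is String.getBytes with StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Widths {
    // Number of bytes the JDK produces for s when encoding to UTF-8.
    static int width(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(width("a"));             // U+0061, ASCII range
        System.out.println(width("\u0080"));        // U+0080, first 2-byte code point
        System.out.println(width("\u0800"));        // U+0800, first 3-byte code point
        System.out.println(width("\uD834\uDD1E"));  // U+1D11E, one surrogate pair
    }
}
```

The surrogate pair is encoded as one 4-byte sequence rather than two 3-byte sequences, which is why the loop has to count the pair as a unit.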

1u4esq0p2#

You should use CharsetEncoder; the simple approach of getBytes() plus copying as many bytes as will fit can cut a UTF-8 character in half.
Something like this:

public static int truncateUtf8(String input, byte[] output) {
    
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
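
If what you need back is a String rather than a filled byte array, the same idea can be sketched as below. This is only a sketch: the class EncoderTruncate and the method truncate are hypothetical names, and it assumes that setting both error actions to REPLACE is acceptable (this mirrors what String.getBytes does with unpaired surrogates). It relies on the encoder stopping with an overflow before it writes any character that does not fit completely.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    // Returns the longest prefix of input whose UTF-8 encoding fits in maxBytes.
    static String truncate(String input, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        ByteBuffer outBuf = ByteBuffer.allocate(maxBytes);
        CharBuffer inBuf = CharBuffer.wrap(input);
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Encoding stops when outBuf is full; inBuf.position() then marks
        // how many chars were consumed, always on a character boundary.
        enc.encode(inBuf, outBuf, true);
        return input.substring(0, inBuf.position());
    }

    public static void main(String[] args) {
        System.out.println(truncate("a\u0800b", 3)); // only "a" fits: U+0800 needs 3 more bytes
    }
}
```

Taking the substring by the input buffer's position avoids decoding the bytes again.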

eeq64g8w3#

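Here is what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs and so on. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the, with checks added for null and for skipping the decode step when the string already takes no more than maxBytes bytes.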

/**
 * Truncates a string to the number of characters that fit in X bytes avoiding multi byte characters being cut in
 * half at the cut off point. Also handles surrogate pairs where 2 characters in the string is actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}

ldxq2e6h4#

UTF-8 encoding has a neat trait that lets you see where in a byte sequence you are.
Check the stream at the character limit you want.

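If its high bit is 0, it's a single-byte char; just replace it with 0 and you're fine.
If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
If the high bit is 1 but the next bit is 0, then you're in the middle of a character; travel back along the buffer until you hit a byte that has 2 or more 1s in the high bits, and replace that byte with 0.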

Example: if your stream is 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C1, which is the start of a multi-byte char.
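
In Java there is no terminating 0 byte, so the three rules above reduce to finding the index where the cut is allowed to happen. A minimal sketch, assuming the input is already a UTF-8 byte array (the names Utf8Boundary and cutPoint are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

final class Utf8Boundary {
    // Returns the largest length <= limit at which utf8 can be cut without
    // splitting a multi-byte sequence. Only the top two bits are inspected.
    static int cutPoint(byte[] utf8, int limit) {
        if (limit >= utf8.length) {
            return utf8.length;
        }
        if (limit <= 0) {
            return 0;
        }
        int i = limit; // index of the first byte that would be dropped
        while (i > 0 && (utf8[i] & 0xC0) == 0x80) {
            i--;       // 10xxxxxx is a continuation byte: keep walking back
        }
        return i;      // utf8[i] is now a single-byte char or a lead byte
    }

    public static void main(String[] args) {
        // Encodes to the bytes 31 33 31 C2 A3 32 33
        byte[] utf8 = "131\u00a323".getBytes(StandardCharsets.UTF_8);
        for (int limit = 0; limit <= utf8.length; limit++) {
            int cut = cutPoint(utf8, limit);
            System.out.println(limit + " -> " + new String(utf8, 0, cut, StandardCharsets.UTF_8));
        }
    }
}
```

The loop runs at most three times per call, because a UTF-8 sequence has at most three continuation bytes.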

pbwdgjma5#

You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
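
It may be worth checking what this one-liner does when maxLen lands inside a multi-byte sequence. The sketch below wraps it in a hypothetical helper, naiveCut, and adds a Math.min guard that is not in the one-liner (without it the String constructor throws when maxLen exceeds the byte count). The String constructor replaces malformed input with U+FFFD instead of dropping it.

```java
import java.nio.charset.StandardCharsets;

final class NaiveCutDemo {
    // The one-liner, plus a bounds guard (an addition made for this demo).
    static String naiveCut(String data, int maxLen) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        int len = Math.max(0, Math.min(maxLen, bytes.length));
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" + U+00E9 encodes to 61 C3 A9; cutting at 2 keeps 61 C3.
        String cut = naiveCut("a\u00e9", 2);
        System.out.println(Integer.toHexString(cut.charAt(1)));           // fffd
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```

So the result ends in a replacement character and, once encoded again, can take more bytes than the limit that was asked for.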

uqzxnwby6#

You can calculate the number of bytes without doing any conversion.

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else { // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

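You have to detect surrogate pairs (U+D800-U+DBFF followed by U+DC00-U+DFFF) and count 4 bytes for each valid surrogate pair. If you get the first value in the first range and the second in the second range, everything is fine: skip them and add 4. If not, it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.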
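
The counting rules above could be sketched in Java as follows. The names Utf8Length and utf8Length are hypothetical. For the invalid case, this sketch assumes you want to match String.getBytes(StandardCharsets.UTF_8), which replaces an unpaired surrogate with a single '?' byte, so a lone surrogate is counted as 1 byte here.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {
    // Counts the UTF-8 bytes of s without encoding it.
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // unpaired surrogate: getBytes writes '?'
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\u0080\u0800\uD834\uDD1E";
        System.out.println(utf8Length(s));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

If the bytes are produced by an encoder configured to report errors rather than replace them, the lone-surrogate branch would need to change accordingly.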

pwuypxnk7#

Based on billjamesdev's answer, I've come up with the following method which, as far as I can tell, is the simplest and still works OK with surrogate pairs:

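public static String utf8ByteTrim(String s, int trimSize) {
    final byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if (trimSize >= bytes.length) {
        return s; // already fits; also avoids indexing past the end of the array
    }
    if (trimSize <= 0) {
        return ""; // nothing fits; also avoids reading bytes[-1]
    }
    if ((bytes[trimSize-1] & 0x80) != 0) { // inside a multibyte sequence
        while ((bytes[trimSize-1] & 0x40) == 0) { // 2nd, 3rd, 4th bytes
            trimSize--;
        }
        trimSize--;
    }
    return new String(bytes, 0, trimSize, StandardCharsets.UTF_8);
}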

Some tests:

String test = "Aæ😂尝试";
IntStream.range(1, 16).forEachOrdered(i ->
        System.out.println("Size " + i + ": " + utf8ByteTrim(test, i))
);

---

Size 1: A
Size 2: A
Size 3: A
Size 4: Aæ
Size 5: Aæ
Size 6: Aæ
Size 7: Aæ
Size 8: Aæ😂
Size 9: Aæ😂
Size 10: Aæ😂
Size 11: Aæ😂尝
Size 12: Aæ😂尝
Size 13: Aæ😂尝试
Size 14: Aæ😂尝试
Size 15: Aæ😂尝试

mbzjlibv8#

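It's much more efficient to scan from the tail end of the string than from the beginning, especially on very long strings. So walen was on the right track; unfortunately, that answer does not provide the correct truncation.
If you would like a solution that scans backwards only a few characters, this is the best option.
Using the data in billjamesdev's answer, we can effectively scan backwards and correctly get the truncation on a character boundary.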

public static String utf8ByteTrim(String s, int requestedTrimSize) {
    final byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
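    int maxTrimSize = Integer.min(requestedTrimSize, bytes.length);
    if (maxTrimSize <= 0) {
        return ""; // empty input or non-positive limit; avoids reading bytes[-1]
    }
    int trimSize = maxTrimSize;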
    if ((bytes[trimSize-1] & 0x80) != 0) { // inside a multibyte sequence
        while ((bytes[trimSize - 1] & 0x40) == 0) { // 2nd, 3rd, 4th bytes
            trimSize--;
        }
        trimSize--;  // Get to the start of the UTF-8
        // Now see if that final UTF-8 character fits.
        // Assume the UTF-8 starts with binary 110xxxxx and is 2 bytes
        int numBytes = 2;  
        if ((bytes[trimSize] & 0xF0) == 0xE0) {
            // If the UTF-8 starts with binary 1110xxxx it is 3 bytes
            numBytes = 3;
        } else if ((bytes[trimSize] & 0xF8) == 0xF0) {
            // If the UTF-8 starts with binary 11110xxx it is 4 bytes
            numBytes = 4;
        }
        if( (trimSize + numBytes) == maxTrimSize)  {
            // The entire last UTF-8 character fits
            trimSize = maxTrimSize; 
        }
    }
    return new String(bytes, 0, trimSize, StandardCharsets.UTF_8);
}

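There is only one while loop, and it executes at most 3 iterations as it walks backwards. A few if statements then determine which character to truncate at.
Some tests: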

String test = "Aæ😂尝试"; // Sizes: (1,2,4,3,3) = 13 bytes
IntStream.range(1, 16).forEachOrdered(i ->
        System.out.println("Size " + i + ": " + utf8ByteTrim(test, i))
);

---

Size 1: A
Size 2: A
Size 3: Aæ
Size 4: Aæ
Size 5: Aæ
Size 6: Aæ
Size 7: Aæ😂
Size 8: Aæ😂
Size 9: Aæ😂
Size 10: Aæ😂尝
Size 11: Aæ😂尝
Size 12: Aæ😂尝
Size 13: Aæ😂尝试
Size 14: Aæ😂尝试
Size 15: Aæ😂尝试
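
When comparing backward-scanning versions like the one above, a slow but obvious reference can help. The sketch below is not from any of the answers: it tries prefixes one code point at a time and keeps the longest one whose encoding still fits. The names TruncateOracle and longestPrefixThatFits are made up, and the test string is written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

final class TruncateOracle {
    // Brute force: grow the prefix code point by code point while it still fits.
    static String longestPrefixThatFits(String s, int maxBytes) {
        int end = 0;
        while (end < s.length()) {
            int next = s.offsetByCodePoints(end, 1); // never splits a surrogate pair
            int size = s.substring(0, next).getBytes(StandardCharsets.UTF_8).length;
            if (size > maxBytes) {
                break;
            }
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // Same characters as the test string above: byte sizes 1, 2, 4, 3, 3
        String test = "A\u00e6\uD83D\uDE02\u5c1d\u8bd5";
        for (int i = 1; i < 16; i++) {
            System.out.println("Size " + i + ": " + longestPrefixThatFits(test, i));
        }
    }
}
```

It re-encodes every prefix, so it is only suitable for tests, but its output can be compared line by line with the table above.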
