Oracle: How to do a substring for a UTF-8 string in Java?

Posted by ldxq2e6h on 2022-11-03 in Oracle


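Suppose I have the following string: Rückruf ins Ausland. I need to insert it into a database column with a maximum size of 10. I did a normal substring in Java, which extracted the string Rückruf in, 10 characters long. When I try to insert it into the column, I get the following Oracle error:
java.sql.SQLException: ORA-12899: value too large for column "WAEL"."TESTTBL"."DESC" (actual: 11, maximum: 10)
This is because the database uses the AL32UTF8 character set, so ü takes 2 bytes.
I need to write a Java function that does this substring while taking into account that ü needs 2 bytes, so in this case the returned substring should be Rückruf i (9 characters). Any suggestions?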

oracle

Source: https://stackoverflow.com/questions/31455706/how-to-do-substring-for-utf8-string-in-java

8 Answers


kcwpcxri1#

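If you want to trim the data in Java, you have to write a function that trims the string using the character set of the database, similar to the following test case: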

package test;

import java.io.UnsupportedEncodingException;

public class TrimField {

    public static void main(String[] args) {
        //UTF-8 is the db charset
        System.out.println(trim("Rückruf ins Ausland",10,"UTF-8"));
        System.out.println(trim("Rüückruf ins Ausland",10,"UTF-8"));
    }

    public static String trim(String value, int numBytes, String charset) {
        do {
            byte[] valueInBytes;
            try {
                // Encode with the db charset to get the real byte length
                valueInBytes = value.getBytes(charset);
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e.getMessage(), e);
            }
            if (valueInBytes.length > numBytes) {
                // Too long: drop the last code point (1 or 2 chars, so a
                // surrogate pair is never split) and try again
                value = value.substring(0, value.offsetByCodePoints(value.length(), -1));
            } else {
                return value;
            }
        } while (value.length() > 0);
        return "";
    }

}


holgip5t2#

If it has to be Java, you can convert the string to bytes and trim the array to the required length.

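String s = "Rückruf ins Ausland";
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // java.nio.charset.StandardCharsets
// Never take more bytes than the string actually has
int len = Math.min(10, bytes.length);
// Back up while the first dropped byte is a UTF-8 continuation byte (10xxxxxx),
// so that a multi-byte character is never cut in half
while (len > 0 && len < bytes.length && (bytes[len] & 0xC0) == 0x80) {
    len--;
}
String trim = new String(bytes, 0, len, StandardCharsets.UTF_8);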


qv7cva1a3#

You can compute the real length of a String in Java by converting it to a byte array.
For example, see the code below:

System.out.println("Rückruf i".length()); // prints 9
System.out.println("Rückruf i".getBytes().length); // prints 10 when the default charset is UTF-8

If the current default charset is not UTF-8, replace the code with:

System.out.println("Rückruf i".length()); // prints 9 
System.out.println("Rückruf i".getBytes("UTF-8").length); // prints 10

If needed, you can replace UTF-8 with whichever charset you want to test, to find the length of the string in that charset.
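To illustrate the point about swapping the charset, here is a minimal sketch that measures the same text in three encodings. The class name ByteLengthDemo and the helper byteLength are made up for this example, and the ü is written as the escape \u00fc so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteLengthDemo {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "R\u00fcckruf i"; // u-umlaut written as an escape
        System.out.println(text.length());                                 // 9 chars
        System.out.println(byteLength(text, StandardCharsets.ISO_8859_1)); // 9 bytes
        System.out.println(byteLength(text, StandardCharsets.UTF_8));      // 10 bytes
        System.out.println(byteLength(text, StandardCharsets.UTF_16BE));   // 18 bytes
    }
}
```

Passing a Charset constant instead of a charset name also avoids the checked UnsupportedEncodingException.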


syqv5f0l4#

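The following code, rather inelegantly, walks through the whole string one full Unicode code point at a time, so char pairs (surrogate pairs) are handled as well.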

public String trim(String s, int length) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= length) {
        return s; // already fits
    }
    int totalByteCount = 0;
    for (int i = 0; i < s.length(); ) {
        // Read one full code point; n is 2 for a surrogate pair, otherwise 1
        int cp = s.codePointAt(i);
        int n = Character.charCount(cp);
        int byteCount = s.substring(i, i + n)
                .getBytes(StandardCharsets.UTF_8).length;
        if (totalByteCount + byteCount > length) {
            break;
        }
        totalByteCount += byteCount;
        i += n;
    }
    // Decode with the same charset that produced the bytes
    return new String(bytes, 0, totalByteCount, StandardCharsets.UTF_8);
}

It can still be optimized a bit.
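One possible direction for such an optimization, offered only as a sketch and not as the answer author's method: let a java.nio CharsetEncoder write into a ByteBuffer that is exactly as large as the byte limit. The encoder stops before any character that no longer fits, and the position of the input buffer then gives the number of chars to keep. The names Utf8Truncate and truncate are invented for this example, and UTF-8 is assumed to be the database charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Truncate {
    // Longest prefix of s whose UTF-8 encoding fits in maxBytes bytes.
    static String truncate(String s, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                // Replace lone surrogates instead of stopping on them
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        // Encoding stops when the input is exhausted or the next character
        // would overflow the output; partial characters are never written
        encoder.encode(in, out, true);
        // in.position() is the number of chars that were fully encoded
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncate("R\u00fcckruf ins Ausland", 10));
        System.out.println(truncate("R\u00fc\u00fcckruf ins Ausland", 10));
    }
}
```

This encodes the text once and does not allocate a substring per code point. Surrogate pairs are consumed as a unit by the encoder, so they are either kept whole or dropped whole.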


2exbekwf5#

I think the best bet in this case would be to take the substring at the database level, using the Oracle SUBSTR function directly in the SQL query.
For example:

INSERT INTO ttable (colname) VALUES (SUBSTR( ?, 1, 10 ))

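Here the question mark stands for the SQL parameter sent through JDBC. Note that SUBSTR counts characters; for a column limited to 10 bytes, Oracle's SUBSTRB, which counts bytes, is probably the closer fit.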


sq1bmfud6#

You need to have the encoding in the database match the encoding for java strings. Alternatively, you can convert the string using something like this and get the length that matches the encoding in the database. This will give you an accurate byte count. Otherwise, you're still just hoping that the encodings match.

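String string = "Rückruf ins Ausland";
int maxBytes = 10; // column limit in bytes (name chosen here for illustration)

int curByteCount = 0;
String nextChar;
// Add one char at a time while the running UTF-8 byte count stays within the limit.
// Note: this counts char by char, so it assumes the text has no surrogate pairs.
for (int index = 0; index < string.length()
        && curByteCount + (nextChar = string.substring(index, index + 1)).getBytes("UTF-8").length <= maxBytes; index++) {
    curByteCount += nextChar.getBytes("UTF-8").length;
}
// Copy only the bytes counted above, so no multi-byte character is cut
byte[] subStringBytes = new byte[curByteCount];
System.arraycopy(string.getBytes("UTF-8"), 0, subStringBytes, 0, curByteCount);
String trimed = new String(subStringBytes, "UTF-8");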

This should do it. It also shouldn't truncate a multi-byte character in the process. The assumption here is that the database uses UTF-8 encoding. Another assumption is that the string actually needs to be trimmed.


kxeu7u2r7#

Hey, all ASCII characters are less than 128. You can use the code below.

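public class Test {
    public static void main(String[] args) {
        String s = "Rückruf ins Ausland";
        int maxBytes = 10;
        int bytes = 0;
        int end = 0;
        // Count 1 byte per ASCII char and 2 per non-ASCII char; this assumes
        // no char needs more than 2 bytes in UTF-8 (true for ü, not for e.g. €)
        for (int i = 0; i < s.length(); i++) {
            int width = s.charAt(i) < 128 ? 1 : 2;
            if (bytes + width > maxBytes) {
                break; // the next char would not fit
            }
            bytes += width;
            end = i + 1;
        }
        System.out.println(s.substring(0, end));
    }
}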

You can copy and paste it and check whether it meets your needs or breaks anywhere.
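The idea of deriving the byte count from the character value can be carried beyond the ASCII boundary. The sketch below uses the standard UTF-8 ranges (1 byte below U+0080, 2 below U+0800, 3 below U+10000, 4 otherwise) and walks the string by code point. The names Utf8Width, utf8Width and truncate are invented for this example, and it assumes well-formed text: a lone surrogate is counted as 3 bytes, which over-estimates but never overflows the limit.

```java
final class Utf8Width {
    // Bytes UTF-8 needs for one Unicode code point.
    static int utf8Width(int codePoint) {
        if (codePoint < 0x80) return 1;     // ASCII
        if (codePoint < 0x800) return 2;    // e.g. u-umlaut (U+00FC)
        if (codePoint < 0x10000) return 3;  // e.g. euro sign (U+20AC)
        return 4;                           // supplementary planes
    }

    // Longest prefix of s that fits in maxBytes UTF-8 bytes.
    static String truncate(String s, int maxBytes) {
        int bytes = 0;
        int end = 0;
        while (end < s.length()) {
            int cp = s.codePointAt(end);
            int width = utf8Width(cp);
            if (bytes + width > maxBytes) {
                break; // the next code point would not fit
            }
            bytes += width;
            end += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("R\u00fcckruf ins Ausland", 10));
    }
}
```

No byte arrays are created here, at the price of hard-coding UTF-8; for any other database charset an encoder-based approach would be needed.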


ggazkfy88#

Here is the best solution; it takes only about 1 millisecond to execute because it has no loops.

/**
     * This function trims the text by requested max byte size
     *
     * @param text   text string
     * @param length maximum byte size
     * @return trimmed text
     */
    public static String trim(String text, int length) {
        byte[] inputBytes = text.getBytes(StandardCharsets.UTF_8);
        if (inputBytes.length <= length) {
            // already fits; also keeps arraycopy from reading past the end
            return text;
        }
        byte[] outputBytes = new byte[length];

        System.arraycopy(inputBytes, 0, outputBytes, 0, length);
        String result = new String(outputBytes, StandardCharsets.UTF_8);

        // check if last character is truncated
        int lastIndex = result.length() - 1;

        if (lastIndex >= 0 && result.charAt(lastIndex) != text.charAt(lastIndex)) {
            // last character is truncated so remove the last character
            return result.substring(0, lastIndex);
        }

        return result;
    }
