Detect non-Latin characters with a regex pattern in Java

Posted by snvhrwxg on 2021-06-27 in Java

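I think Latin characters are what I mean in my question, but I'm not entirely sure what the correct classification is. I'm trying to use a regex Pattern to test whether a string contains non-Latin characters. I'm expecting the following results: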

"abcDE 123";  // Yes, this should match
"!@#$%^&*";   // Yes, this should match
"aaàààäää";   // Yes, this should match
"ベビードラ";   // No, this shouldn't match
"????";  // No, this shouldn't match

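My understanding is that the built-in {IsLatin} preset only detects whether any of the characters is Latin. I want to detect whether any of the characters is not Latin.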

Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
    System.out.println("is NON latin");
    return;
}
System.out.println("is latin");

Java regex latin

Source: https://stackoverflow.com/questions/65620673/detect-non-latin-characters-with-regex-pattern-in-java

2 Answers

jexiocij1#

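TL;DR: Use the regex ^[\p{Print}\p{IsLatin}]*$

You want a regex that matches if the string consists only of:
Spaces
Digits
Punctuation
Latin characters (Unicode script "Latin")

The easiest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:

\p{Print} - A printable character: [\p{Graph}\x20]
    \p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
        \p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
            \p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
                \p{Lower} - A lower-case alphabetic character: [a-z]
                \p{Upper} - An upper-case alphabetic character: [A-Z]
            \p{Digit} - A decimal digit: [0-9]
        \p{Punct} - Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    \x20 - A space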

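That makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. the ASCII characters that are not control characters.
The \p{Alpha} part overlaps with \p{IsLatin}, but that is fine, since a character class eliminates duplicates.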
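
The claimed equivalence can be checked mechanically. The following is a minimal sketch, assuming the default mode of java.util.regex (no UNICODE_CHARACTER_CLASS flag); the class name PrintEquivalence and the helper agreeUpTo are illustrative names, not part of the answer.

```java
import java.util.regex.Pattern;

class PrintEquivalence {
    private static final Pattern PRINT = Pattern.compile("\\p{Print}");
    private static final Pattern ASCII_NOT_CNTRL = Pattern.compile("[\\p{ASCII}&&\\P{Cntrl}]");

    // Returns true if both patterns accept exactly the same
    // single code points in the range [0, limit].
    static boolean agreeUpTo(int limit) {
        for (int cp = 0; cp <= limit; cp++) {
            String s = new String(Character.toChars(cp));
            boolean print = PRINT.matcher(s).matches();
            boolean asciiNotCntrl = ASCII_NOT_CNTRL.matcher(s).matches();
            if (print != asciiNotCntrl) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Compare the two classes over U+0000-U+024F.
        System.out.println(agreeUpTo(0x024F));
    }
}
```
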
So, the regex is: ^[\p{Print}\p{IsLatin}]*$

Test

Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");

String[] inputs = { "abcDE 123", "!@#$%^&*", "aaàààäää", "ベビードラ", "????" };
for (String input : inputs) {
    System.out.print("\"" + input + "\": ");
    Matcher matcher = latinPattern.matcher(input);
    if (! matcher.find()) {
        System.out.println("is NON latin");
    } else {
        System.out.println("is latin");
    }
}

Output

"abcDE 123": is latin
"!@#$%^&*": is latin
"aaàààäää": is latin
"ベビードラ": is NON latin
"????": is NON latin

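The question as asked ("is any character not Latin?") can also be answered directly by negating the same character class and searching with find(). This is a minimal sketch under that assumption, not the answerer's code: NonLatinDetector and containsNonLatin are hypothetical names, and the inputs use \u escapes only to keep the source file ASCII.

```java
import java.util.regex.Pattern;

class NonLatinDetector {
    // Negated form of the class used above: any character that is neither
    // printable ASCII nor in the Unicode script "Latin".
    private static final Pattern NON_LATIN =
            Pattern.compile("[^\\p{Print}\\p{IsLatin}]");

    // find() succeeds as soon as one offending character exists anywhere in the input.
    static boolean containsNonLatin(String s) {
        return NON_LATIN.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "abcDE 123",
            "!@#$%^&*",
            "aa\u00E0\u00E4",                  // Latin letters with diacritics
            "\u30D9\u30D3\u30FC\u30C9\u30E9"   // Katakana
        };
        for (String input : inputs) {
            System.out.println("\"" + input + "\": "
                    + (containsNonLatin(input) ? "is NON latin" : "is latin"));
        }
    }
}
```

Because \p{Print} excludes control characters, this sketch also reports tabs and line breaks as non-Latin.
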
2021-06-27

n3schb8v2#

The Latin Unicode blocks covering U+0000–U+024F are:

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
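
These block ranges can be cross-checked against the JDK's own Unicode tables. The following is a minimal sketch using Character.UnicodeBlock.of; the class name LatinBlocks and the helper inLatinBlocks are hypothetical names introduced for illustration.

```java
class LatinBlocks {
    // True if the code point lies in one of the four blocks listed above (U+0000-U+024F).
    static boolean inLatinBlocks(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.BASIC_LATIN
                || block == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
                || block == Character.UnicodeBlock.LATIN_EXTENDED_A
                || block == Character.UnicodeBlock.LATIN_EXTENDED_B;
    }

    public static void main(String[] args) {
        // 'a', a-grave, last code point of Latin Extended-B,
        // first code point after it, and a Katakana letter.
        int[] samples = { 'a', 0x00E0, 0x024F, 0x0250, 0x30D9 };
        for (int cp : samples) {
            System.out.printf("U+%04X block=%s inLatinBlocks=%b%n",
                    cp, Character.UnicodeBlock.of(cp), inLatinBlocks(cp));
        }
    }
}
```

A block is a contiguous code point range, so these four blocks also contain digits, punctuation and control characters. That is a different notion from the script property tested by \p{IsLatin}.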

So, the answer is

Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
// or, equivalently, as an explicit code point range:
// Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); // U+0000-U+024F

Note that in the Java patterns, the underscores are removed from the Unicode block names (for example \p{InBasicLatin}).
See the Java demo:

List<String> strs = Arrays.asList(
        "abcDE 123",  // Yes, this should match
        "!@#$%^&*",   // Yes, this should match
        "aaàààäää",   // Yes, this should match
        "ベビードラ", // No, this shouldn't match
        "????");     // No, this shouldn't match  
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
//Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
for (String str : strs) {
    Matcher matcher = LatinPattern.matcher(str);
    if (!matcher.find()) {
        System.out.println(str + " => is NON Latin");
        //return;
    } else {
        System.out.println(str + " => is Latin");
    }
}

Note: if you replace .find() with .matches(), you can drop the ^ and $ from the pattern.
Output:

abcDE 123 => is Latin
!@#$%^&* => is Latin
aaàààäää => is Latin
ベビードラ => is NON Latin
???? => is NON Latin
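
The note about .matches() can be sketched as follows. This is an illustration rather than the answerer's code: the class name MatchesDemo and the helper isLatin are assumed names, the pattern is the code point range alternative with the anchors removed, and the inputs use \u escapes only to keep the source file ASCII.

```java
import java.util.regex.Pattern;

class MatchesDemo {
    // No ^ or $ here: matches() already requires the entire input to match.
    private static final Pattern LATIN = Pattern.compile("[\\x00-\\x{024F}]+");

    static boolean isLatin(String s) {
        return LATIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = { "abcDE 123", "aa\u00E0\u00E4", "\u30D9\u30D3\u30FC\u30C9\u30E9" };
        for (String input : inputs) {
            System.out.println(input + " => " + (isLatin(input) ? "is Latin" : "is NON Latin"));
        }
    }
}
```

With find(), the same unanchored pattern would succeed on any input that contains at least one character in the range, which is why the anchors are needed there.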

2021-06-27
