Solr: How to split words that are connected together in Java? [closed]

Posted by aamkag61 on 2022-11-05 in Solr


Closed. This question needs details or clarity. It is not currently accepting answers.
Closed last year.
For example: HelloWorld
Expected: Hello World
I tried using Solr's tokenizers, but could not find a suitable one. What should I do?

solr

Source: https://stackoverflow.com/questions/68800157/how-to-split-words-that-are-connected-together-in-java

3 Answers


rdrgkggo1#

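In Solr, the Dictionary Compound Word Token Filter (solr.DictionaryCompoundWordTokenFilterFactory) is built for this. It is not a tokenizer; it runs as a filter after the tokenizer and emits the known words it finds as substrings of a token as separate tokens. This is particularly useful in many languages other than English, but it is valuable here as well.
You give it a dictionary of valid words for the chosen language (in your example, those words would be hello and world), and the filter extracts them into separate tokens.
The following example assumes that germanwords.txt contains at least these words: dumm kopf donau dampf schiff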

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>

In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
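
To make the idea concrete outside of Solr, here is a minimal plain-Java sketch of dictionary-based decomposition. It is not Solr's or Lucene's implementation: the class name CompoundSplitSketch and the method decompose are hypothetical, and size limits and other filter options are deliberately left out. It only illustrates the principle of keeping the original token and adding every dictionary word found inside it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CompoundSplitSketch {

    // Hypothetical helper (not a Solr API): returns the original token
    // followed by every dictionary word found inside it, left to right.
    // Assumes lower-casing does not change the token's length, which
    // holds for plain ASCII input such as the examples here.
    static List<String> decompose(String token, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        out.add(token); // the original token is kept
        String lower = token.toLowerCase(Locale.ROOT);
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                // Dictionary lookup is case-insensitive; the emitted
                // subword keeps the casing of the original token.
                if (dictionary.contains(lower.substring(start, end))) {
                    out.add(token.substring(start, end));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("hello", "world");
        System.out.println(decompose("HelloWorld", dictionary));
        // prints [HelloWorld, Hello, World]
    }
}
```

With the German dictionary from the example above, decompose("Donaudampfschiff", ...) returns Donaudampfschiff, Donau, dampf, schiff, which mirrors the filter output shown for position 1.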


tvmytwxo2#

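If the tokenizer accepts a regular expression, you can use the following pattern as the delimiter. It matches the zero-width position between a lowercase letter and the uppercase letter that follows it: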

(?<=[a-z])(?=[A-Z])

Sample Java code:

String input = "HelloWorld";
String[] words = input.split("(?<=[a-z])(?=[A-Z])");
System.out.println(Arrays.toString(words));  // [Hello, World]
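
For completeness, here is the same idea as a self-contained class, offered as a sketch rather than a definitive solution. The names CamelCaseSplit, BOUNDARY and splitWords are made up for illustration. It also joins the pieces with a space, since the question expects the string "Hello World" rather than an array.

```java
import java.util.Arrays;

public class CamelCaseSplit {

    // Zero-width boundary: a lowercase letter behind, an uppercase letter ahead.
    private static final String BOUNDARY = "(?<=[a-z])(?=[A-Z])";

    // Hypothetical helper: splits at every lowercase-to-uppercase transition.
    static String[] splitWords(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String[] words = splitWords("HelloWorld");
        System.out.println(Arrays.toString(words));  // [Hello, World]
        System.out.println(String.join(" ", words)); // Hello World
    }
}
```

Because the pattern consumes no characters, nothing is lost from the input. It only recognizes ASCII letters and treats a run of capitals such as "HTTPServer" as one word, so adjust the character classes if your data needs more.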


gt0wga4j3#

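You can use

String.split(regex);

Example:

String words = "HelloWorldHi";
// "regex" is a placeholder: with a pattern that matches the boundaries between words,
// this gives you an array of words ["Hello", "World", "Hi"]
String[] parts = words.split("regex");

Example regular expression (from RegExr), which matches each capitalized word. Because it matches the words themselves rather than the boundaries between them, use it with Pattern and Matcher to collect the matches; passed to split() as the delimiter, it would consume the words and return an empty array: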

[A-Z][a-z]{1,}

Details:

[A-Z]: Matches any character in the set (from A to Z).
[a-z]: Matches any character in the set (from a to z).
{1,}: Matches the specified quantity of the previous token. {1,3} will match 1 to 3. {3} will match exactly 3. {1,} will match 1 or more.
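
The pattern above matches each word rather than the gap between words, so one way to apply it is to iterate over the matches with java.util.regex.Pattern and Matcher. The following is a minimal sketch; the class name WordMatcher and the method findWords are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatcher {

    // One uppercase letter followed by one or more lowercase letters.
    private static final Pattern WORD = Pattern.compile("[A-Z][a-z]{1,}");

    // Hypothetical helper: collects every match of WORD, in order.
    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("HelloWorldHi")); // [Hello, World, Hi]
    }
}
```

Anything that does not fit the pattern, such as a leading lowercase word or digits, is skipped silently, so this approach suits input that is known to consist only of capitalized words.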

