Elasticsearch: tokenize each word starting from any start_offset

Posted by j2cgzkjk on 2023-06-21 in ElasticSearch

I want to tokenize the following text:

"text": "king martin"

into

[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]

or, more specifically, into:

[kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]

Is there a way to get these tokens? I have tried the tokenizer below, but how do I tell it to start at any start_offset?

"ngram_tokenizer": {
        "type": "edge_ngram",
        "min_gram": "3",
        "max_gram": "15",
        "token_chars": [
          "letter",
          "digit"
        ]
      }

Thank you!

qmelpv7a (answer #1)

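You can use the ngram tokenizer instead of edge_ngram. The edge_ngram tokenizer only emits n-grams anchored at the start of each word, whereas ngram emits them from every character position. Because max_gram minus min_gram is larger than the default limit of 1, index.max_ngram_diff also has to be raised, as in the settings below.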

PUT test_ngram_stack
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  }
}

POST test_ngram_stack/_analyze
{
  "analyzer": "my_analyzer",
  "text": "king martin"
}
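
To see why switching tokenizers helps, here is a rough Python sketch of the difference between the two. It is not Elasticsearch code: the helper names split_words, edge_ngrams and ngrams are made up for illustration, and the regex only approximates token_chars ["letter", "digit"]. The token order assumed here is by start position, then by length. Run the _analyze request above to confirm the real output.

```python
import re


def split_words(text):
    # Approximation of token_chars ["letter", "digit"]:
    # keep runs of letters/digits, split on everything else.
    return re.findall(r"[^\W_]+", text)


def edge_ngrams(text, min_gram, max_gram):
    # edge_ngram-like behaviour: only prefixes of each word.
    tokens = []
    for word in split_words(text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


def ngrams(text, min_gram, max_gram):
    # ngram-like behaviour: grams from every start position in each word.
    tokens = []
    for word in split_words(text):
        for start in range(len(word)):
            for n in range(min_gram, max_gram + 1):
                if start + n > len(word):
                    break
                tokens.append(word[start:start + n])
    return tokens


print(edge_ngrams("king martin", 3, 15))
# ['kin', 'king', 'mar', 'mart', 'marti', 'martin']

print(ngrams("king martin", 3, 10))
# ['kin', 'king', 'ing', 'mar', 'mart', 'marti', 'martin',
#  'art', 'arti', 'artin', 'rti', 'rtin', 'tin']
```

Note that the ngram result also contains "marti", which is missing from the list in the question; with min_gram 3 every substring of length 3 or more inside a word is produced.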
