Delimiters in the Elasticsearch standard analyzer

roqulrg3 asked on 2022-11-02 in ElasticSearch

I know that Elasticsearch's standard analyzer uses the standard tokenizer to generate tokens, and the Elasticsearch docs say it provides grammar-based tokenization, but it is not clear which delimiters the standard tokenizer actually uses.
My use case is as follows:
1. In my Elasticsearch index, some fields use the default analyzer, i.e. the standard analyzer.
2. In those fields I want the # character to be searchable, and I also want . to act as an additional delimiter. Can I achieve this with the standard analyzer?
I checked which tokens it generates for the string "hey john.s #100 is a test name":

POST _analyze
{
  "text": "hey john.s #100 is a test name",
  "analyzer": "standard"
}

It generated the following tokens:

{
  "tokens": [
    {
      "token": "hey",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "john.s",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "100",
      "start_offset": 12,
      "end_offset": 15,
      "type": "<NUM>",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 16,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "a",
      "start_offset": 19,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "test",
      "start_offset": 21,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "name",
      "start_offset": 26,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}

So I suspect that only whitespace is used as a delimiter by the standard tokenizer?
Thanks in advance.

uhry853o 1#

Let's first look at why it did not break the token on . for some of the words:

The standard analyzer uses only the standard tokenizer, and the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm (UAX #29); it does not use the whitespace tokenizer. Under that algorithm, a full stop between two letters (as in john.s) does not cause a word break, and a symbol like # is simply discarded, which is why your output contains john.s and 100 but no # token.

Now let's look at how to tokenize on the . (dot) but not on #:

You can use the Character Group tokenizer and provide the list of characters on which tokenization should be applied:

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      ".",
      "\n"
    ]
  },
  "text": "hey john.s #100 is a test name"
}

Response:

{
  "tokens": [
    {
      "token": "hey",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "john",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "s",
      "start_offset": 9,
      "end_offset": 10,
      "type": "word",
      "position": 2
    },
    {
      "token": "#100",
      "start_offset": 11,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "is",
      "start_offset": 16,
      "end_offset": 18,
      "type": "word",
      "position": 4
    },
    {
      "token": "a",
      "start_offset": 19,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "test",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 6
    },
    {
      "token": "name",
      "start_offset": 26,
      "end_offset": 30,
      "type": "word",
      "position": 7
    }
  ]
}
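
To actually use this on the fields from your question, the tokenizer has to be wired into your index settings as a custom analyzer and assigned to the relevant fields. Below is a minimal sketch; the index name my-index, the tokenizer/analyzer names my_delimiter_tokenizer and my_delimiter_analyzer, and the field name name are illustrative assumptions, not taken from your question. The lowercase filter is included to keep the case-insensitive behavior you already get from the standard analyzer.

PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_delimiter_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            ".",
            "\n"
          ]
        }
      },
      "analyzer": {
        "my_delimiter_analyzer": {
          "type": "custom",
          "tokenizer": "my_delimiter_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_delimiter_analyzer"
      }
    }
  }
}

With this mapping, a match query on the name field is analyzed with the same analyzer, so searching for "#100" produces the single token #100 and matches the indexed document, while "john" and "s" become separately searchable tokens because . is now a delimiter.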
