ElasticSearch synonym filter placed after a stemmer sometimes does not work

jhiyze9q  posted on 2023-04-29 in ElasticSearch

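With a simple analyzer that applies a synonym filter after a stemmer, synonyms sometimes fail to apply for certain words, even when the synonym rule uses the exact stemmed token.
First, I created an analyzer that applies a synonym filter after the French snowball filter.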

curl -XPUT "http://localhost:9200/my_index" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "my_snow": {
          "type": "snowball",
          "language": "French"
        },
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "autr => synonym_1",
            "journali => synonym_2",
            "journalier => synonym_3"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_snow",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}'

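Because my synonym filter comes after the stemmer, I have to know what each word is stemmed to. To find the stemmed tokens to put in the synonym filter, I run a /my_index/_analyze query with "explain": "true" against an analyzer that has no synonym filter. The stemmed tokens it returns are what I put in the synonym rules.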
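
The probing step described above can be sketched in Python. This is only an illustration under assumptions: `build_stem_probe` and `run_probe` are hypothetical helper names, the base URL is taken from the curl commands in this question, and the inline snowball definition is assumed to mirror the `my_snow` filter.

```python
import json
import urllib.request


def build_stem_probe(text):
    """Build an _analyze body that runs only the tokenizer and the stemmer."""
    return {
        "tokenizer": "standard",
        # Inline filter definition, assumed equivalent to "my_snow" above.
        "filter": [{"type": "snowball", "language": "French"}],
        "text": text,
        "explain": True,
    }


def run_probe(text, base_url="http://localhost:9200"):
    """Send the probe to a running Elasticsearch node and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=json.dumps(build_stem_probe(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_probe("journalière")` needs a reachable node; `build_stem_probe` alone can be used to inspect the request body.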
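Then I tested the analyzer with the text "journalière". As shown below, it is stemmed to "journali", and the synonym filter turns it into "synonym_3" instead of "synonym_2"! Without the "journalier => synonym_3" line in the filter, it is not converted at all! Here are the query and the response: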

curl -XGET "http://localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer" : "my_analyzer",
  "text" : "journalière",
  "explain" : "true"
}' | json_pp
{
   "detail" : {
      "charfilters" : [],
      "custom_analyzer" : true,
      "tokenfilters" : [
         {
            "name" : "my_snow",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journali",
                  "type" : "<ALPHANUM>"
               }
            ]
         },
         {
            "name" : "my_synonym_filter",
            "tokens" : [
               {
                  "bytes" : "[73 79 6e 6f 6e 79 6d 5f 33]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "synonym_3",
                  "type" : "SYNONYM"
               }
            ]
         }
      ],
      "tokenizer" : {
         "name" : "standard",
         "tokens" : [
            {
               "bytes" : "[6a 6f 75 72 6e 61 6c 69 c3 a8 72 65]",
               "end_offset" : 11,
               "position" : 0,
               "positionLength" : 1,
               "start_offset" : 0,
               "termFrequency" : 1,
               "token" : "journalière",
               "type" : "<ALPHANUM>"
            }
         ]
      }
   }
}
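
A response like this is easier to compare stage by stage once it is reduced to token strings. The following is a minimal sketch, assuming the response has the `detail` layout shown above; `tokens_by_stage` is a hypothetical helper name, not part of any Elasticsearch client.

```python
def tokens_by_stage(response):
    """Return [(stage_name, [token, ...]), ...] in pipeline order."""
    detail = response["detail"]
    # The tokenizer runs first, then each token filter in order.
    stages = [(detail["tokenizer"]["name"], detail["tokenizer"]["tokens"])]
    stages += [(f["name"], f["tokens"]) for f in detail.get("tokenfilters", [])]
    return [(name, [t["token"] for t in tokens]) for name, tokens in stages]


# Trimmed copy of the response above, keeping only the fields used here.
sample = {
    "detail": {
        "tokenizer": {"name": "standard", "tokens": [{"token": "journalière"}]},
        "tokenfilters": [
            {"name": "my_snow", "tokens": [{"token": "journali"}]},
            {"name": "my_synonym_filter", "tokens": [{"token": "synonym_3"}]},
        ],
    }
}

for name, tokens in tokens_by_stage(sample):
    print(name, tokens)
```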

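I also tested the analyzer with the word "journaliere" to see whether the accent had anything to do with this bug. It is stemmed to "journalier", and then the synonym filter does nothing. See the query and response below: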

curl -XGET "http://localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer" : "my_analyzer",
  "text" : "journaliere",
  "explain" : "true"
}' | json_pp
{
   "detail" : {
      "charfilters" : [],
      "custom_analyzer" : true,
      "tokenfilters" : [
         {
            "name" : "my_snow",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journalier",
                  "type" : "<ALPHANUM>"
               }
            ]
         },
         {
            "name" : "my_synonym_filter",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journalier",
                  "type" : "<ALPHANUM>"
               }
            ]
         }
      ],
      "tokenizer" : {
         "name" : "standard",
         "tokens" : [
            {
               "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72 65]",
               "end_offset" : 11,
               "position" : 0,
               "positionLength" : 1,
               "start_offset" : 0,
               "termFrequency" : 1,
               "token" : "journaliere",
               "type" : "<ALPHANUM>"
            }
         ]
      }
   }
}

Finally, to make sure other words work, I tested "autre". It is stemmed to "autr" and then yields "synonym_1", which is correct.
I am using Elasticsearch 7.17.9; here is my docker-compose configuration:

version: '3.7'
services:
    elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9
        container_name: elasticsearch
        environment:
            - discovery.type=single-node
            - bootstrap.memory_lock=true
            - "ES_JAVA_OPTS=-Xms1000m -Xmx2000m"
        ulimits:
            memlock:
                soft: -1
                hard: -1
        volumes:
            - elasticsearch-data:/usr/share/elasticsearch/data
        ports:
            - 9200:9200
volumes:
    elasticsearch-data:
        driver: local

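It looks like the tokens shown in the explain output of _analyze are not always the same words the synonym filter uses. Is there a way to find out what "journaliere" looks like to the synonym filter after the stemmer, or is there a bug somewhere?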

1bqhqjot  #1

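I believe there is a bug, so I suggest opening an issue on GitHub.
Another interesting point is that I only get synonym_2 when the input term is "journal".
I ran another test, using stemmer_override to force "journalière => journal". It seems "journal" matches synonym_2, whereas "journali", the stem of "journalière", does not.
It would look like this: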

{
  "settings": {
    "analysis": {
      "filter": {
        "my_snow": {
          "type": "snowball",
          "language": "French"
        },
        "my_override_stemmer": {
          "type": "stemmer_override",
          "rules": [
            "journalière => journal"
          ]
        },
        "my_synonym_filter": {
          "type": "synonym_graph",
          "synonyms": [
            "autr => synonym_1",
            "journali => synonym_2",
            "journalier => synonym_3"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_override_stemmer",
            "my_snow",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}

nszi6y05  #2

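I opened an issue, and it turns out this is not a bug. When a synonym filter is used after a stemmer, we should not put stemmed tokens in the synonym rules.
Here is how I should have defined my synonym filter: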

"my_synonym_filter": {
  "type": "synonym_graph",
  "synonyms": [
    "autre => synonym_1",
    "journalière => synonym_2",
    "journaliere => synonym_3"
  ]
}
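
Why the unstemmed rules work can be illustrated with a toy model. Elasticsearch parses the synonym rules with the tokenizer and token filters that precede the synonym filter, so a rule written with an already stemmed word gets stemmed a second time. The sketch below is not the real snowball algorithm: `STEMS` is a hypothetical lookup table whose first three entries are outputs reported in this thread, while the last two are inferred from the observed behaviour. It also assumes the right-hand side of each rule passes through unchanged.

```python
# Hypothetical stand-in for the French snowball stemmer (lookup table only).
STEMS = {
    "journalière": "journali",   # reported in the question
    "journaliere": "journalier", # reported in the question
    "autre": "autr",             # reported in the question
    "journalier": "journali",    # inferred, not verified
    "journali": "journal",       # inferred, not verified
}


def stem(token):
    """Return the table entry for a token, or the token itself if unknown."""
    return STEMS.get(token, token)


def compile_rules(rules):
    """Parse 'lhs => rhs' rules, stemming each left-hand side like the preceding filter."""
    table = {}
    for rule in rules:
        lhs, rhs = (part.strip() for part in rule.split("=>"))
        table[stem(lhs)] = rhs
    return table


def analyze(text, table):
    """Stem the input token, then apply the compiled synonym table."""
    stemmed = stem(text)
    return table.get(stemmed, stemmed)


stemmed_rules = compile_rules(
    ["autr => synonym_1", "journali => synonym_2", "journalier => synonym_3"]
)
surface_rules = compile_rules(
    ["autre => synonym_1", "journalière => synonym_2", "journaliere => synonym_3"]
)

for word in ["autre", "journalière", "journaliere"]:
    print(word, analyze(word, stemmed_rules), analyze(word, surface_rules))
```

Under these assumptions the model reproduces the results from the question with the stemmed rules, and gives the intended mapping with the unstemmed rules.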
