Match phrase query not working as expected

Posted by xsuvu9jc on 2021-06-15 in ElasticSearch


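From the Elastic documentation:
The match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other.
I have configured my analyzer to use an edge_ngram filter with the keyword tokenizer: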

{
    "index": {
        "number_of_shards": 1,
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter"
                    ]
                }
            }
        }
    }
}

Here is the Java class used for indexing:

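import lombok.Getter;
import lombok.Setter;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;
import org.springframework.data.elasticsearch.annotations.Setting;

@Document(indexName = "myindex", type = "program")
@Getter
@Setter
@Setting(settingPath = "/elasticsearch/settings.json")
public class Program {

    @org.springframework.data.annotation.Id
    private Long instanceId;

    // The same edge n-gram analyzer is applied at index time and at search time
    @Field(analyzer = "autocomplete", searchAnalyzer = "autocomplete", type = FieldType.String)
    private String name;
}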

If a document contains the phrase "hello world", the following query matches it:

{
  "match" : {
    "name" : {
      "query" : "ho",
      "type" : "phrase"
    }
  }
}
result : "hello world"

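This is not what I expected, because the document does not contain all of the search terms.
My questions:
1- Shouldn't the edge_ngram/autocomplete analysis produce two search terms for the query "ho"? (The terms should be "h" and "ho".)
2- Why does "ho" match "hello world" when, by the definition of the phrase query, not all of the terms match? (The "ho" term should not match.)
Update:
In case the question is not clear: the match phrase query should analyze the string ho into a list of terms. That gives two terms, because this is an edge n-gram with a min_gram of 1. The two terms are h and ho. According to Elasticsearch, the document must contain all of the search terms. However, hello world contains only h, not ho, so why do I get a match here?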

elasticsearch elasticsearch-2.0 spring-data-elasticsearch

Source: https://stackoverflow.com/questions/53945046/match-phrase-query-not-working-as-expected

3 Answers


inn6fuwd1#

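I got the answer from the Elasticsearch forum:
You are using the edge_ngram token filter. Let's see how the analyzer processes the query string "ho". Assuming your index is called my_index: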

GET my_index/_analyze
{
  "text": "ho",
  "analyzer": "autocomplete"
}

The response shows that the analyzer outputs two tokens, both at position 0:

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "ho",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
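
To make the stacked positions concrete, here is a minimal Python sketch that approximates this analyzer chain (keyword tokenizer, lowercase, edge_ngram filter). The function name edge_ngram_filter is hypothetical, and the model is an illustration of the output above, not Elasticsearch's implementation:

```python
def edge_ngram_filter(text, min_gram=1, max_gram=20):
    """Approximate keyword tokenizer + lowercase + edge_ngram token filter.

    Returns (token, position) pairs. The keyword tokenizer emits the whole
    input as a single token at position 0, and every gram the filter
    derives from it stays stacked at that same position.
    """
    token = text.lower()
    upper = min(max_gram, len(token))
    return [(token[:n], 0) for n in range(min_gram, upper + 1)]


query_tokens = edge_ngram_filter("ho")
doc_tokens = edge_ngram_filter("hello world")
print(query_tokens)    # [('h', 0), ('ho', 0)]
print(doc_tokens[:3])  # [('h', 0), ('he', 0), ('hel', 0)]
```

Both the query and the document end up with "h" at position 0, while "ho" never appears among the document's grams.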

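What does Elasticsearch do with a query that has two tokens at the same position? It treats the query as an "OR", even with type "phrase". You can see this in the output of the validate API (which shows the Lucene query that your query is rewritten into):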

GET my_index/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}

Because both your query and your document have an h at position 0, the document is a hit.
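
The "OR" behaviour can be sketched in Python under a simplifying assumption: terms that share a query position are treated as alternatives, and a phrase matches when every query position finds at least one alternative at the corresponding document position. The helpers analyze, by_position and phrase_matches are hypothetical names for this illustration and do not mirror Lucene's actual classes:

```python
from collections import defaultdict


def analyze(text, min_gram=1, max_gram=20):
    """Keyword tokenizer + lowercase + edge_ngram filter: every gram at position 0."""
    token = text.lower()
    upper = min(max_gram, len(token))
    return [(token[:n], 0) for n in range(min_gram, upper + 1)]


def by_position(tokens):
    """Group (term, position) pairs into {position: set of terms}."""
    grouped = defaultdict(set)
    for term, position in tokens:
        grouped[position].add(term)
    return dict(grouped)


def phrase_matches(query_tokens, doc_tokens):
    """Simplified phrase match: terms sharing a query position are alternatives."""
    query = by_position(query_tokens)
    doc = by_position(doc_tokens)
    if not query:
        return False
    first = min(query)
    for start in doc:
        # Every query position needs at least one alternative at the same offset.
        if all(terms & doc.get(start + pos - first, set())
               for pos, terms in query.items()):
            return True
    return False


doc = analyze("hello world")
print(phrase_matches(analyze("ho"), doc))  # True: "h" is shared at position 0
print(phrase_matches(analyze("xo"), doc))  # False: neither "x" nor "xo" is indexed
```

With only one position in play, the positional part of the phrase check has nothing to constrain, so a single shared gram is enough for a hit.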
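Now, how do you fix this? Instead of the edge_ngram token filter, you can use the edge_ngram tokenizer. This tokenizer increments the position of each token it outputs.
So, if you create your index like this: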

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

you will see that this query is no longer a hit:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}

But this one, for example, is a hit:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "he",
        "type": "phrase"
      }
    }
  }
}
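
The difference between the two queries can be reproduced with a small Python sketch. It assumes, as described in this answer, that the edge_ngram tokenizer gives each gram its own increasing position; the list index stands in for that position, and edge_ngram_tokenizer and contains_phrase are hypothetical helper names:

```python
def edge_ngram_tokenizer(text, min_gram=1, max_gram=20):
    """Approximate the edge_ngram tokenizer followed by the lowercase filter.

    The index of each gram in the returned list stands in for its token
    position, which grows by one per emitted gram.
    """
    lowered = text.lower()
    upper = min(max_gram, len(lowered))
    return [lowered[:n] for n in range(min_gram, upper + 1)]


def contains_phrase(doc_terms, query_terms):
    """True if query_terms occur in doc_terms at consecutive positions."""
    size = len(query_terms)
    if size == 0:
        return False
    return any(doc_terms[i:i + size] == query_terms
               for i in range(len(doc_terms) - size + 1))


doc_terms = edge_ngram_tokenizer("hello world")
print(edge_ngram_tokenizer("ho"))                              # ['h', 'ho']
print(contains_phrase(doc_terms, edge_ngram_tokenizer("ho")))  # False
print(contains_phrase(doc_terms, edge_ngram_tokenizer("he")))  # True
```

Because "h" and "ho" now sit at positions 0 and 1, the phrase needs both terms in sequence, and "hello world" never produces "ho".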

Posted 2021-06-16

r3i60tvu2#

If you could provide a complete, runnable example of your problem, it would be much easier to help you. For example:

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "name": "Hello world"
}

GET test/_search
{
  "query": {
    "match_phrase": {
      "name": "hello foo"
    }
  }
}

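Judging from your search query, you are using Elasticsearch 2.x or earlier. That is a dead version; you really should upgrade.
I'm not sure that phrase search on edge n-grams makes much sense as a combination. What are you trying to achieve here?
Why does it match? The search query is run through the same analyzer as the stored field. Since you have defined min_gram: 1, your ho is searched as h and ho. That h matches the h from hello. With this analyzer, match and match_phrase make no difference.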
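
As a rough illustration of why the two query types coincide here: when every gram sits at position 0 there is no ordering left to check, so in this simplified model both reduce to "does the query share at least one gram with the document". The grams helper below is a hypothetical name, and the sketch ignores scoring entirely:

```python
def grams(text, min_gram=1, max_gram=20):
    """Set of lowercase edge n-grams of the whole input (keyword tokenizer)."""
    lowered = text.lower()
    upper = min(max_gram, len(lowered))
    return {lowered[:n] for n in range(min_gram, upper + 1)}


indexed = grams("Hello world")
print(sorted(grams("ho") & indexed))      # ['h']
print(len(grams("hello foo") & indexed))  # 6
```

Any non-empty overlap is enough for a hit, which is why both "ho" and the "hello foo" query from the example above find the "Hello world" document.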