Elasticsearch text analyzer breaks keyword filter/normalizer

Posted by oknrviil on 2023-10-17 in ElasticSearch

We are using Elasticsearch 8.4.0, in case that is relevant.
We use the following normalizer on our index.

{
    "analysis": {
        "filter": {
            "preserved_ascii_folding": {
                "preserve_original": true,
                "type": "asciifolding"
            }
        },
        "normalizer": {
            "preserved_ascii_keyword_normalizer": {
                "filter": [
                    "lowercase",
                    "trim",
                    "preserved_ascii_folding"
                ],
                "type": "custom"
            }
        }
    }
}

We use it exclusively on keyword fields, like this:

"keywords_in_fr_fr": {
                    "normalizer": "preserved_ascii_keyword_normalizer",
                    "type": "keyword"
                },

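We are trying to fix a problem where a name that repeats a word, such as "bear bear bear", ranks higher in a search for the keyword "bear" than a plain "bear" does. To address this, we added the following analyzer: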

"analysis": {
    "analyzer": {
        "unique_lowercase": {
            "filter": [
                "lowercase",
                "unique"
            ],
            "tokenizer": "whitespace",
            "type": "custom"
        }
    }
},

Similarly, we apply it only to text fields, like this:

"name_in_fr_fr": {
                    "analyzer": "unique_lowercase",
                    "type": "text"
                },

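However, this caused a number of records to stop being indexed with errors like the one below. Elasticsearch now objects to keywords that contain non-ASCII characters.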
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse field [keywords_in_fr_fr] of type [keyword] in document with id '1421'. Preview of field's value: 'mot-clé'"}],"type":"mapper_parsing_exception","reason":"failed to parse field [keywords_in_fr_fr] of type [keyword] in document with id '1421'. Preview of field's value: 'mot-clé'","caused_by":{"type":"illegal_state_exception","reason":"The normalization token stream is expected to produce exactly 1 token, but got 2+ for analyzer analyzer name[preserved_ascii_keyword_normalizer], analyzer [org.elasticsearch.index.analysis.CustomAnalyzer@6d8a4c15], analysisMode [ALL] and input \"mot-clé\"\n"}},"status":400}
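The only useful thing I found on Google suggests that we should drop preserve_original: true, but our search consultant wrote the original code, and I assume he knew what he was doing. (Our contract with him ended a year ago, so I can't contact him directly.)
Please let me know if you need more information or code.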

Answer #1, by zc0qhyus

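There are several issues here. First, your keyword normalizer plays no part in handling the bear case, and the error you are getting has nothing to do with the bear change. As you correctly noted, the error is caused by the preserve_original flag. While it is useful in analyzers, you cannot use it in a normalizer, because a normalizer must not produce more than 1 token, and that is exactly what preserve_original does: it emits the original token followed by the token with the diacritics stripped. The preserve_original flag has a few specific use cases, but without seeing your queries I cannot tell you whether you need it.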
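To make the token count concrete, here is a minimal plain-Python model of the lowercase, trim, asciifolding chain. It is not Elasticsearch code: `ascii_fold` and `normalize` are hypothetical helper names, the folding is approximated by stripping Unicode combining marks, and the order of the two emitted tokens is illustrative only.

```python
import unicodedata
from typing import List


def ascii_fold(token: str) -> str:
    """Rough stand-in for asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(value: str, preserve_original: bool) -> List[str]:
    """Model the tokens a keyword value yields after lowercase -> trim -> folding."""
    token = value.lower().strip()
    folded = ascii_fold(token)
    if preserve_original and folded != token:
        # Two tokens for one keyword value: this is what a normalizer rejects.
        return [token, folded]
    return [folded]


with_flag = normalize("Mot-cl\u00e9 ", preserve_original=True)
without_flag = normalize("Mot-cl\u00e9 ", preserve_original=False)
print(len(with_flag), with_flag)
print(len(without_flag), without_flag)
```

In this model a pure-ASCII value still yields a single token even with the flag set, which would be consistent with only the keywords containing non-ASCII characters failing to index.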
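Your way of dealing with the bear problem is a bit heavy-handed, but it works. It does have some problems, though. If a user searches for "bear bear bear", the search will not behave as intended: it returns documents with any number of bears. It also does not handle "bear? bear. bear!" correctly. The whitespace tokenizer keeps the punctuation, so this produces 3 distinct "bear" tokens, none of which will match a search for "bear". Again, I am not sure what your searches look like, but I would at least replace the whitespace tokenizer with the standard tokenizer.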
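The punctuation problem can be sketched in plain Python. This is only a model: `whitespace_tokens`, `word_tokens`, and `unique_lowercase` are hypothetical helpers, and the `\w+` regular expression is a crude approximation of the standard tokenizer, not its actual word-boundary rules.

```python
import re
from typing import Callable, List


def whitespace_tokens(text: str) -> List[str]:
    """Split on whitespace only, so punctuation stays attached to the word."""
    return text.split()


def word_tokens(text: str) -> List[str]:
    """Crude stand-in for a standard-style tokenizer: keep runs of word characters."""
    return re.findall(r"\w+", text)


def unique_lowercase(text: str, tokenizer: Callable[[str], List[str]]) -> List[str]:
    """Model of tokenizer -> lowercase -> unique (first occurrence of each token kept)."""
    seen = set()
    result = []
    for token in tokenizer(text):
        token = token.lower()
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


name = "bear? bear. bear!"
print(unique_lowercase(name, whitespace_tokens))
print(unique_lowercase(name, word_tokens))
print("bear" in unique_lowercase(name, whitespace_tokens))
```

With whitespace splitting, the three tokens differ by their trailing punctuation, so deduplication removes nothing and the plain term "bear" is not among them.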
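Alternatively, you can build your own similarity that simply ignores term frequency. It will be a bit slower, but it will work with phrases that contain repeated words: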

DELETE test
PUT test
{
  "settings": {
    "similarity": {
      "scripted_idf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double norm = 1/Math.sqrt(doc.length); return weight * norm;"
        }
      }
    },
    "analysis": {
      "filter": {
        "ascii_folding": {
          "type": "asciifolding"
        }
      },
      "normalizer": {
        "preserved_ascii_keyword_normalizer": {
          "filter": [
            "lowercase",
            "trim",
            "ascii_folding"
          ],
          "type": "custom"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords_in_fr_fr": {
        "normalizer": "preserved_ascii_keyword_normalizer",
        "type": "keyword"
      },
      "name_in_fr_fr": {
        "analyzer": "standard",
        "similarity": "scripted_idf",
        "type": "text"
      }
    }
  }
}

POST test/_doc
{
  "keywords_in_fr_fr": "mot-clé",
  "name_in_fr_fr": "bear? bear. bear!"
}

POST test/_doc?refresh
{
  "keywords_in_fr_fr": "mot-clé",
  "name_in_fr_fr": "just a bear"
}

POST test/_search 
{
  "query": {
    "match": {
      "name_in_fr_fr": "bear"
    }
  }
}

POST test/_search 
{
  "query": {
    "match": {
      "name_in_fr_fr": "just bear"
    }
  }
}

POST test/_search 
{
  "query": {
    "match_phrase": {
      "name_in_fr_fr": "bear bear bear"
    }
  }
}
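As a rough check of what the scripted similarity does, the two Painless scripts can be transcribed into Python and applied to the two test documents. This is a sketch under assumptions: both documents sit on a single shard (so docCount and docFreq are both 2 for "bear"), the standard analyzer yields 3 tokens for each name, and the field length is taken as exact. `score_with_tf` is a hypothetical variant that multiplies by sqrt(freq), included only for contrast.

```python
import math


def scripted_idf_score(doc_count: int, doc_freq: int, doc_length: int,
                       boost: float = 1.0) -> float:
    """Transcription of weight_script and script: boost * idf * 1/sqrt(length)."""
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    weight = boost * idf
    norm = 1 / math.sqrt(doc_length)
    return weight * norm


def score_with_tf(doc_count: int, doc_freq: int, doc_length: int,
                  term_freq: int) -> float:
    """Hypothetical contrast: the same score multiplied by sqrt(term frequency)."""
    return scripted_idf_score(doc_count, doc_freq, doc_length) * math.sqrt(term_freq)


# "bear? bear. bear!" and "just a bear" are both assumed to be 3 tokens long.
repeated = scripted_idf_score(doc_count=2, doc_freq=2, doc_length=3)
single = scripted_idf_score(doc_count=2, doc_freq=2, doc_length=3)
print(repeated == single)

# With a term-frequency factor, the document that repeats "bear" pulls ahead.
print(score_with_tf(2, 2, 3, term_freq=3) > score_with_tf(2, 2, 3, term_freq=1))
```

Because the term frequency never enters the first formula, repeating "bear" cannot raise a document's score there; only the field length and the index-wide statistics matter.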
