Customizing ElasticSearch search results so that HTML tags are not displayed

23c0lvtd  posted on 2022-10-06  in ElasticSearch

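I want to build a search API for my blog. I store all the data in ElasticSearch as HTML so that I can run full-text search over it as quickly as possible, but the HTML tags get in the way when I search my content. After a lot of searching I found an answer on how to ignore the tags during search, but I cannot filter them out so that they do not appear in the results. Is there any way to do this?

Currently I run this search and get the following result: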

POST /test/_search HTTP/1.1
Content-Type: application/json
Content-Length: 68

{
  "query": {
    "match": {
      "html": "more"
    }
  }
}

Response:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "html": "<html><body><h1 style=\"font-family: Arial\">Test</h1> <span>More test</span></body></html>"
                }
            }
        ]
    }
}

But I want something like this:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "html": "Test More test"
                }
            }
        ]
    }
}

zy1mlcev  #1

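You need to use the HTML strip character filter (html_strip) in your mapping. It removes the HTML elements from the text when the field is analyzed; the stored _source itself is not modified, so the stripped text is returned through a script field on a sub-field instead. I used this post to try to get close to your result.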

PUT idx_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "\\n",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "html_strip"
          ]
        },
        "parsed_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "html": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "raw": {
            "type": "text",
            "fielddata": true,
            "analyzer": "parsed_analyzer"
          }
        }
      }
    }
  }
}
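
The html.raw sub-field above keeps the whole value as a single token with the markup stripped. To get a feel for what that does to the sample document without creating an index, here is a rough Python approximation using only the standard library. It is not Elasticsearch's html_strip implementation: the `strip_html` helper is a hypothetical name, and collapsing whitespace runs into single spaces is an assumption made for this sketch.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of a document and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of `markup` with whitespace runs collapsed.

    Hypothetical helper: an approximation of tag stripping, not the
    Lucene/Elasticsearch html_strip character filter.
    """
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


sample = (
    '<html><body><h1 style="font-family: Arial">Test</h1> '
    "<span>More test</span></body></html>"
)
print(strip_html(sample))  # prints: Test More test
```

The same helper could also be applied on the client to `_source.html` if changing the mapping is not an option, at the cost of doing the stripping on every request.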

POST idx_test/_doc
{
  "html": """<html><body><h1 style="font-family: Arial">Test</h1> <span>More test</span></body></html>"""
}

GET idx_test/_search
{
  "script_fields": {
    "html_raw": {
      "script": "doc['html.raw']"
    }
  }, 
  "query": {
    "match": {
      "html": "more"
    }
  }
}

Result:

"hits": [
  {
    "_index": "idx_test",
    "_id": "0b-UqoMBCzQxtx05B-WH",
    "_score": 0.2876821,
    "fields": {
      "html_raw": [
        "Test More test"
      ]
    }
  }
]
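
The stripped text comes back under fields.html_raw rather than under _source.html as in the response the question asks for. If the API needs that exact shape, one option is to reshape the hits in the application layer. The following is a minimal sketch, assuming the response has already been parsed into a Python dict; `fields_to_source` and its parameter names are hypothetical and not part of any Elasticsearch client.

```python
import copy


def fields_to_source(response, script_field="html_raw", target="html"):
    """Return a copy of `response` where each hit exposes the first value of
    fields[script_field] as _source[target].

    Hypothetical helper for illustration; the input dict is left unchanged.
    """
    result = copy.deepcopy(response)
    for hit in result.get("hits", {}).get("hits", []):
        fields = hit.get("fields", {})
        values = fields.pop(script_field, None)
        if values:
            # Script fields are returned as lists; take the single value.
            hit.setdefault("_source", {})[target] = values[0]
        if not fields:
            # Drop the now-empty "fields" object, if there was one.
            hit.pop("fields", None)
    return result


response = {
    "hits": {
        "hits": [
            {
                "_index": "idx_test",
                "_id": "0b-UqoMBCzQxtx05B-WH",
                "_score": 0.2876821,
                "fields": {"html_raw": ["Test More test"]},
            }
        ]
    }
}
reshaped = fields_to_source(response)
print(reshaped["hits"]["hits"][0]["_source"])  # prints: {'html': 'Test More test'}
```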
