Elasticsearch: highlighting the result of script fields

Posted by jv4diomz on 2022-11-02 in ElasticSearch

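In the last question that I asked, I wanted to remove the HTML tags from my search results. After that I thought I could highlight the results with a normal query, but the highlight field still contains the HTML content that you removed with the script. Can you help me highlight the results without the HTML tags that I saved in the database?
My mapping and settings: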

{
  "settings": {
    "analysis": {
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "\n",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "html_strip"
          ]
        },
        "parsed_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "html": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "raw": {
            "type": "text",
            "fielddata": true,
            "analyzer": "parsed_analyzer"
          }
        }
      }
    }
  }
}

Search query:

POST idx_test/_search
{
  "script_fields": {
    "raw": {
      "script": "doc['html.raw']"
    }
  }, 
  "query": {
    "match": {
      "html": "more"
    }
  },
  "highlight": {
    "fields": {
      "*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
    }
  }
}

Result:

"hits": [
    {
        "_index": "idx_test2",
        "_type": "_doc",
        "_id": "GijDsYMBjgX3UBaguGxc",
        "_score": 0.2876821,
        "fields": {
            "raw": [
                "Test More test"
            ]
        },
        "highlight": {
            "html": [
                "<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"
            ]
        }
    }
]

The result I want:

"hits": [
    {
        "_index": "idx_test2",
        "_type": "_doc",
        "_id": "GijDsYMBjgX3UBaguGxc",
        "_score": 0.2876821,
        "fields": {
            "raw": [
                "Test <strong>More</strong> test"
            ]
            }
    }
]

pcrecxhr #1

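I thought of another solution. You can index two fields: the original html, and html_extract, which holds only the text. You have to use an ingest processor to extract the text from the HTML at index time; highlighting on html_extract then works as expected.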

Mapping

PUT idx_html_strip
{
  "mappings": {
    "properties": {
      "html": {
        "type": "text"
      },
      "html_extract": {
        "type": "text"
      }
    }
  }
}

Ingest pipeline

PUT /_ingest/pipeline/pipe_html_strip
{
  "description": "_description",
  "processors": [
    {
      "html_strip": {
        "field": "html",
        "target_field": "html_extract"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx['html_extract'] = ctx['html_extract'].replace('\n',' ').trim()"
      }
    }
  ]
}
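
Before indexing anything, the pipeline can be checked with the ingest simulate API. This is a minimal sketch, assuming the pipeline was created under the name pipe_html_strip as above; the sample document is only illustrative and is not stored anywhere.

POST /_ingest/pipeline/pipe_html_strip/_simulate
{
  "docs": [
    {
      "_source": {
        "html": "<html><body><h1>Test</h1> <span><strong>More</strong> test</span></body></html>"
      }
    }
  ]
}

The response shows each document as it would look after the processors have run, or the processor error if one of the steps fails.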

Indexing the data

Note the use of ?pipeline=pipe_html_strip:

POST idx_html_strip/_doc?pipeline=pipe_html_strip
{
  "html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"""
}

Query

GET idx_html_strip/_search?filter_path=hits.hits._source,hits.hits.highlight
{
  "query": {
    "multi_match": {
      "query": "More",
      "fields": ["html", "html_extract"]
    }
  },
  "highlight": {
    "fields": {
      "*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
    }
  }
}

Result

{
  "hits": {
    "hits": [
      {
        "_source": {
          "html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>""",
          "html_extract": "Test More test"
        },
        "highlight": {
          "html": [
            """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong><strong>More</strong></strong> test</span></body>"""
          ],
          "html_extract": [
            "Test <strong>More</strong> test"
          ]
        }
      }
    ]
  }
}
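
If only the tag-free highlight is wanted, as in the question, the query and the highlight can both be limited to html_extract. The request below is a sketch of that variation and is not part of the original answer; it assumes the same index and field names as above.

GET idx_html_strip/_search?filter_path=hits.hits._source,hits.hits.highlight
{
  "query": {
    "match": {
      "html_extract": "More"
    }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "html_extract": {}
    }
  }
}

Because html is no longer listed under highlight.fields, the highlight section of each hit should contain only the html_extract fragment.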
