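Here is a sample document with the following content: Pharmaceutical Marketing Building– responsibilities. â â Mass. – Aug. 13, 2020 –â
How can I remove special characters or non-ASCII Unicode characters from the content at index time? I am using ES 7.x and StormCrawler 1.17.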
hwamh0ep1#
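This looks like the charset was detected incorrectly. You can normalize the content before indexing by writing a custom parse filter that removes the unwanted characters.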
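The cleanup step such a parse filter would perform can be sketched as follows. This is only an illustration of the logic in Python, not StormCrawler code: the function name `strip_non_ascii` is hypothetical, and the same steps would have to be ported into the filter itself. It assumes you really do want to discard every non-ASCII character, which also drops legitimate accented letters.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Hypothetical helper: drop non-ASCII characters and tidy the spacing."""
    # Turn every whitespace run (including non-breaking spaces) into one plain space,
    # so that words separated only by U+00A0 are not glued together later.
    text = re.sub(r"\s+", " ", text)
    # Remove anything outside the printable ASCII range.
    text = re.sub(r"[^\x20-\x7E]", "", text)
    # Removing characters can leave doubled spaces behind; collapse them.
    return re.sub(r" {2,}", " ", text).strip()


# Sample modelled on the text in the question (stray characters written as escapes).
sample = (
    "Pharmaceutical Marketing Building \u00e2 responsibilities. "
    "\u00c2 Mass. \u00e2 Aug. 13, 2020 \u00e2\u00c2"
)
print(strip_non_ascii(sample))
```

If the underlying problem is charset detection, fixing the detection is preferable to stripping, since stripping only hides the symptom.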
agyaoht72#
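If writing a custom parse filter and doing the normalization yourself seems too difficult, you can simply add the asciifolding token filter to your analyzer definition. It converts non-ASCII characters to their ASCII equivalents where one exists, as shown below:
POST http://{{hostname}}:{{port}}/_analyze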
{ "tokenizer": "standard", "filter": [ "asciifolding" ], "text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ" }
This generates the following tokens for your text:
{
  "tokens": [
    { "token": "Pharmaceutical", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
    { "token": "Marketing", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
    { "token": "Building", "start_offset": 25, "end_offset": 33, "type": "<ALPHANUM>", "position": 2 },
    { "token": "a", "start_offset": 34, "end_offset": 35, "type": "<ALPHANUM>", "position": 3 },
    { "token": "responsibilities.A", "start_offset": 36, "end_offset": 54, "type": "<ALPHANUM>", "position": 4 },
    { "token": "A", "start_offset": 55, "end_offset": 56, "type": "<ALPHANUM>", "position": 5 },
    { "token": "Mass", "start_offset": 57, "end_offset": 61, "type": "<ALPHANUM>", "position": 6 },
    { "token": "a", "start_offset": 63, "end_offset": 64, "type": "<ALPHANUM>", "position": 7 },
    { "token": "Aug", "start_offset": 65, "end_offset": 68, "type": "<ALPHANUM>", "position": 8 },
    { "token": "13", "start_offset": 70, "end_offset": 72, "type": "<NUM>", "position": 9 },
    { "token": "2020", "start_offset": 74, "end_offset": 78, "type": "<NUM>", "position": 10 },
    { "token": "aA", "start_offset": 79, "end_offset": 81, "type": "<ALPHANUM>", "position": 11 }
  ]
}
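The _analyze call above only tests the filter. To apply it at index time, the filter has to be part of an analyzer that is assigned to the field. A minimal sketch of such an index-creation body, built in Python, might look like this. The analyzer name `folding_analyzer` and the field name `content` are assumptions for illustration, so adjust them to your own mapping. Note that asciifolding only changes the indexed tokens, not the stored `_source`, and, as the token output shows, it folds the stray characters to `a` and `A` rather than removing them.

```python
import json

# Hypothetical names: "folding_analyzer" and the "content" field are assumptions.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folding_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Same tokenizer/filter pair as in the _analyze request above.
                    "filter": ["asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Text field analyzed with the custom analyzer at index time.
            "content": {"type": "text", "analyzer": "folding_analyzer"}
        }
    },
}

# Serialize to the JSON body that would be sent when creating the index.
payload = json.dumps(index_body, indent=2)
print(payload)
```

The resulting JSON would be sent as the body of the index-creation request (PUT on the index name) before any documents are indexed, since the analyzer of an existing field cannot be changed in place.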