How to stop storing special characters in content while indexing

Posted by j1dl9f46 on 2021-06-14 in ElasticSearch

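This is a sample document with the following points: Pharmaceutical Marketing Building – responsibilities. â â Mass. – Aug. 13, 2020 –â
How can I remove special characters or non-ASCII Unicode characters from the content while indexing? I am using ES 7.x and StormCrawler 1.17.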


hwamh0ep  #1

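This looks like incorrect detection of the charset. You can normalise the content before indexing by writing a custom parse filter that removes the unwanted characters.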
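The clean-up step such a filter would perform can be sketched as follows. This is only an illustration in Python under stated assumptions: `strip_non_ascii` is a hypothetical helper, not part of StormCrawler, and in StormCrawler the equivalent logic would live in a custom parse filter written in Java. The sample string is an approximation of the text from the question, written with explicit escapes.

```python
import re

# Any run of characters outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7F]+")
# Any run of whitespace, used to tidy up the gaps left behind.
_WHITESPACE = re.compile(r"\s+")


def strip_non_ascii(text: str) -> str:
    """Hypothetical helper: drop non-ASCII characters, then collapse whitespace."""
    without_non_ascii = _NON_ASCII.sub(" ", text)
    return _WHITESPACE.sub(" ", without_non_ascii).strip()


# Assumed sample: \u00e2 is "â" and \u00c2 is "Â".
sample = (
    "Pharmaceutical Marketing Building \u00e2 responsibilities.\u00c2 \u00c2 "
    "Mass. \u00e2 Aug. 13, 2020 \u00e2\u00c2"
)
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Be aware that this removes every non-ASCII character, including legitimate accented letters, so it is a blunt tool. If the stray characters come from a wrongly detected charset, fixing the detection is the cleaner solution.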


agyaoht7  #2

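If writing a custom parse filter and normalising the content seems difficult, you can simply add the asciifolding token filter to your analyzer definition. It converts non-ASCII characters to their ASCII equivalents, as shown below.
POST http://{{hostname}}:{{port}}/_analyze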

{
    "tokenizer": "standard",
    "filter": [
        "asciifolding"
    ],
    "text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ"
}

This generates the following tokens for your text.

{
    "tokens": [
        {
            "token": "Pharmaceutical",
            "start_offset": 0,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "Marketing",
            "start_offset": 15,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "Building",
            "start_offset": 25,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 34,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "responsibilities.A",
            "start_offset": 36,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "A",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "Mass",
            "start_offset": 57,
            "end_offset": 61,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "a",
            "start_offset": 63,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "Aug",
            "start_offset": 65,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "13",
            "start_offset": 70,
            "end_offset": 72,
            "type": "<NUM>",
            "position": 9
        },
        {
            "token": "2020",
            "start_offset": 74,
            "end_offset": 78,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "aA",
            "start_offset": 79,
            "end_offset": 81,
            "type": "<ALPHANUM>",
            "position": 11
        }
    ]
}
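The request above only tries the filter out through the _analyze API. To apply it at index time, the filter has to be part of an analyzer that is attached to the field. A minimal sketch of such an index definition for ES 7.x is shown below, built as a Python dictionary and printed as JSON. The names `folding_analyzer` and `content` are assumptions chosen for illustration, and the resulting body would be sent with a `PUT` request to the index being created.

```python
import json

# Hypothetical names: "folding_analyzer" and the "content" field are
# placeholders, not taken from the original question.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folding_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Same token filter as in the _analyze request above.
                    "filter": ["asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Attach the analyzer to the text field that holds the page content.
            "content": {"type": "text", "analyzer": "folding_analyzer"}
        }
    },
}

request_json = json.dumps(index_body, indent=4)
print(request_json)
```

Note that an analyzer changes only the indexed tokens. The stored `_source` keeps the original characters, so removing them from the stored content still requires cleaning the text before it is indexed.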
