How to stop storing special characters in content while indexing

Posted by j1dl9f46 on 2021-06-14 in ElasticSearch


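This is a sample document with the following content: Pharmaceutical Marketing Building â€“ responsibilities.Â Â Mass. â€“ Aug. 13, 2020 â€“Â
How can I remove special characters or non-ASCII Unicode characters from the content while indexing? I'm using Elasticsearch 7.x and StormCrawler 1.17.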

elasticsearch stormcrawler elasticsearch-analyzers

Source: https://stackoverflow.com/questions/64384571/how-to-stop-storing-special-characters-in-content-while-indexing

2 Answers


hwamh0ep #1

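This looks like incorrect charset detection. You could normalize the content before indexing by writing a custom parse filter that removes the unwanted characters.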
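The normalization step can be sketched roughly as follows. This is only an illustration of the logic, written in Python for brevity; an actual StormCrawler parse filter would be written in Java, and the function names `repair_mojibake` and `strip_non_ascii` are hypothetical. It also assumes the garbling comes from UTF-8 bytes being decoded as Windows-1252, which is a common cause of sequences like the ones in the question but is not confirmed by the thread.

```python
import re


def repair_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was wrongly decoded as Windows-1252.

    Assumption: the wrong charset was cp1252. If the round trip fails,
    the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def strip_non_ascii(text: str) -> str:
    """Replace runs of non-ASCII characters with a space, then collapse whitespace."""
    ascii_only = re.sub(r"[^\x00-\x7F]+", " ", text)
    return re.sub(r"\s+", " ", ascii_only).strip()


if __name__ == "__main__":
    # Build a garbled sample by decoding UTF-8 bytes with the wrong charset.
    original = "Building \u2013 responsibilities"
    garbled = original.encode("utf-8").decode("cp1252")
    print(repair_mojibake(garbled) == original)
    print(strip_non_ascii(garbled))
```

Repairing the encoding keeps legitimate characters such as dashes, whereas stripping simply discards everything outside ASCII; which of the two is appropriate depends on whether the content is expected to contain non-ASCII text at all.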

2021-06-15

agyaoht7 #2

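If writing a custom parse filter and normalizing the content seems difficult, you can simply add the asciifolding token filter to your analyzer definition. It converts non-ASCII characters to their ASCII equivalents, as shown below:
POST http://{{hostname}}:{{port}}/_analyze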

{
    "tokenizer": "standard",
    "filter": [
        "asciifolding"
    ],
    "text": "Pharmaceutical Marketing Building â responsibilities.Â Â Mass. â Aug. 13, 2020 âÂ"
}

The tokens generated for your text are:

{
    "tokens": [
        {
            "token": "Pharmaceutical",
            "start_offset": 0,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "Marketing",
            "start_offset": 15,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "Building",
            "start_offset": 25,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 34,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "responsibilities.A",
            "start_offset": 36,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "A",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "Mass",
            "start_offset": 57,
            "end_offset": 61,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "a",
            "start_offset": 63,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "Aug",
            "start_offset": 65,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "13",
            "start_offset": 70,
            "end_offset": 72,
            "type": "<NUM>",
            "position": 9
        },
        {
            "token": "2020",
            "start_offset": 74,
            "end_offset": 78,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "aA",
            "start_offset": 79,
            "end_offset": 81,
            "type": "<ALPHANUM>",
            "position": 11
        }
    ]
}
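The answer mentions adding asciifolding to the analyzer definition but only shows an `_analyze` call. A minimal sketch of what such an index definition might look like is given below, built as a Python dictionary and printed as JSON. The analyzer name `folded_text`, the field name `content`, and the char filter name `drop_non_ascii` are hypothetical. The optional `pattern_replace` char filter is not part of the answer; it is included as an assumption because, as the token output above shows, asciifolding folds characters such as `â` to `a` rather than removing them. Note also that analysis only affects the indexed tokens, not the stored `_source`.

```python
import json


def build_index_settings(strip_non_ascii: bool = False) -> dict:
    """Build a request body for creating an index with an asciifolding analyzer."""
    analyzer = {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["asciifolding"],
    }
    analysis = {"analyzer": {"folded_text": analyzer}}

    if strip_non_ascii:
        # Assumption, not from the answer: drop non-ASCII characters
        # before tokenization instead of folding them.
        analysis["char_filter"] = {
            "drop_non_ascii": {
                "type": "pattern_replace",
                "pattern": r"[^\x00-\x7F]+",
                "replacement": " ",
            }
        }
        analyzer["char_filter"] = ["drop_non_ascii"]

    return {
        "settings": {"analysis": analysis},
        "mappings": {
            "properties": {
                # Hypothetical field name for the crawled page text.
                "content": {"type": "text", "analyzer": "folded_text"}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_settings(strip_non_ascii=True), indent=2))
```

The printed JSON could be sent as the body of an index creation request; it would need to be merged with whatever mapping the crawler's indexer already expects.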

2021-06-15
