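I'm using the uax_url_email tokenizer for the email field in our index. It works perfectly for ordinary addresses such as johndoe@yahoo.com and produces a single token. However, it produces multiple tokens when the address contains foreign or special characters. Is there a way to work around this? I don't want multiple tokens to be generated.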
PUT email-test-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "email_analyzer": {
            "filter": ["lowercase"],
            "tokenizer": "email_tokenizer"
          }
        },
        "tokenizer": {
          "email_tokenizer": {
            "type": "uax_url_email"
          }
        }
      }
    }
  },
  "mappings": {
    "date_detection": false,
    "numeric_detection": false,
    "properties": {
      "EMAIL": {
        "type": "text",
        "store": true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "email_analyzer"
      }
    }
  }
}
When it works:
GET email-test-index/_analyze
{
  "field": "EMAIL",
  "text": "johndoe@yahoo.com"
}

{
  "tokens" : [
    {
      "token" : "johndoe@yahoo.com",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "<EMAIL>",
      "position" : 0
    }
  ]
}
When it doesn't work:
GET email-test-index/_analyze
{
  "field": "EMAIL",
  "text": "johndoeó8@yahoo.com"
}

{
  "tokens" : [
    {
      "token" : "johndoeó8",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "yahoo.com",
      "start_offset" : 10,
      "end_offset" : 19,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}
1 Answer
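TL;DR
You can't, not without getting rid of the special characters. I may be wrong, but I don't think such characters are even allowed by the classic email standard (the address syntax in RFC 5322 is ASCII-only; internationalized addresses were only added later, by RFC 6531).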
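Solution
You can use a mapping character filter to catch the non-ASCII characters and map them to their ASCII equivalents before the text reaches the tokenizer.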
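The configuration from the original answer is not reproduced here, so the following is only a minimal sketch of the idea. The index name `email-test-index-mapped` and the char filter name `non_ascii_to_ascii` are hypothetical, and the `mappings` list covers just a few illustrative characters; a real setup would have to list every character expected in the data.

```
# Sketch only: index name, char filter name, and the character list are assumptions.
PUT email-test-index-mapped
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "non_ascii_to_ascii": {
            "type": "mapping",
            "mappings": [
              "ó => o",
              "Ó => O",
              "é => e",
              "ñ => n"
            ]
          }
        },
        "analyzer": {
          "email_analyzer": {
            "char_filter": ["non_ascii_to_ascii"],
            "tokenizer": "email_tokenizer",
            "filter": ["lowercase"]
          }
        },
        "tokenizer": {
          "email_tokenizer": {
            "type": "uax_url_email"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "EMAIL": {
        "type": "text",
        "analyzer": "email_analyzer"
      }
    }
  }
}

# Re-run the failing example against the new analyzer.
GET email-test-index-mapped/_analyze
{
  "field": "EMAIL",
  "text": "johndoeó8@yahoo.com"
}
```

Character filters run before the tokenizer, so with this analyzer the uax_url_email tokenizer should receive an ASCII-only address and be able to treat it as one email token. The trade-off is that the indexed token no longer carries the original spelling, so addresses that differ only by a mapped character become indistinguishable in searches on this field.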
Note that with such a mapping, ó is replaced with "o" in the resulting token.