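With a simple analyzer that applies a synonym filter after a stemmer, the synonyms sometimes fail to work for certain stemmed words, even when the synonym filter uses the exact stemmed token.
First, I created an analyzer that applies a synonym filter after the French snowball filter.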
curl -XPUT "http://localhost:9200/my_index" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "my_snow": {
          "type": "snowball",
          "language": "French"
        },
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "autr => synonym_1",
            "journali => synonym_2",
            "journalier => synonym_3"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_snow",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}'
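Because my synonym filter comes after the stemmer, I have to find out what each word stems to. To find the stemmed tokens to put in the synonym filter, I ran the /my_index/_analyze query with "explain": "true" and without the synonym filter. It gave me the stemmed tokens, which I then put in the synonym filter.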
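The stem lookup described above can also be done without creating an index, by passing the tokenizer and the stemmer inline to the _analyze API. This is a minimal sketch, assuming a node on localhost:9200; the helper names stem_payload and stem_word are made up for illustration.

```shell
# Build an _analyze request body that runs only the standard tokenizer and the
# French snowball stemmer (no synonym filter), so the raw stems are visible.
stem_payload() {
  printf '{"tokenizer":"standard","filter":[{"type":"snowball","language":"French"}],"text":"%s"}' "$1"
}

# Send the request to a local node (assumption: Elasticsearch on localhost:9200).
stem_word() {
  curl -s -XGET "http://localhost:9200/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(stem_payload "$1")"
}

# Usage: stem_word "journalière" | json_pp
```

The "token" values in the response are what the stemmer emits for the given text.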
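Then I tested the analyzer with the text “journalière”. As shown below, it stems to “journali”, and the synonym filter converts it to “synonym_3” instead of “synonym_2”! Without the "journalier => synonym_3" line in the filter, it is not converted at all! Here are the query and the response: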
curl -XGET "http://localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer" : "my_analyzer",
  "text" : "journalière",
  "explain" : "true"
}' | json_pp
{
   "detail" : {
      "charfilters" : [],
      "custom_analyzer" : true,
      "tokenfilters" : [
         {
            "name" : "my_snow",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journali",
                  "type" : "<ALPHANUM>"
               }
            ]
         },
         {
            "name" : "my_synonym_filter",
            "tokens" : [
               {
                  "bytes" : "[73 79 6e 6f 6e 79 6d 5f 33]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "synonym_3",
                  "type" : "SYNONYM"
               }
            ]
         }
      ],
      "tokenizer" : {
         "name" : "standard",
         "tokens" : [
            {
               "bytes" : "[6a 6f 75 72 6e 61 6c 69 c3 a8 72 65]",
               "end_offset" : 11,
               "position" : 0,
               "positionLength" : 1,
               "start_offset" : 0,
               "termFrequency" : 1,
               "token" : "journalière",
               "type" : "<ALPHANUM>"
            }
         ]
      }
   }
}
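I also tested the analyzer with the word “journaliere” to see whether the accent has anything to do with this bug. It stems to “journalier”, and then the synonym filter does nothing. See the query and response below: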
curl -XGET "http://localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer" : "my_analyzer",
  "text" : "journaliere",
  "explain" : "true"
}' | json_pp
{
   "detail" : {
      "charfilters" : [],
      "custom_analyzer" : true,
      "tokenfilters" : [
         {
            "name" : "my_snow",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journalier",
                  "type" : "<ALPHANUM>"
               }
            ]
         },
         {
            "name" : "my_synonym_filter",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journalier",
                  "type" : "<ALPHANUM>"
               }
            ]
         }
      ],
      "tokenizer" : {
         "name" : "standard",
         "tokens" : [
            {
               "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72 65]",
               "end_offset" : 11,
               "position" : 0,
               "positionLength" : 1,
               "start_offset" : 0,
               "termFrequency" : 1,
               "token" : "journaliere",
               "type" : "<ALPHANUM>"
            }
         ]
      }
   }
}
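Finally, to make sure that other words work, I tested “autre”. It stems to “autr” and then yields “synonym_1”, which is correct.
I am using Elasticsearch 7.17.9, and this is my docker-compose configuration: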
version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms1000m -Xmx2000m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
volumes:
  elasticsearch-data:
    driver: local
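It looks like the tokens shown in the explain output of _analyze are not always the same words that the synonym filter uses. Is there a way to find out which synonym entry matches “journaliere” after the stemmer, or is there a bug somewhere?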
2 Answers
1bqhqjot #1
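I believe there is a bug here, so I suggest opening an issue on GitHub.
Another interesting point is that I only get synonym_2 when the input term is “journal”.
I ran another test, using stemmer_override to force “journalière => journal”. It seems that “journal” matches synonym_2, whereas “journalière” stems to “journali”, which does not match synonym_2.
It would look like this: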
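One way such a test could be set up is sketched below. It assumes the analyzer from the question with a stemmer_override filter placed before the stemmer; the filter name my_override, the index name my_index_override, and the helper functions are hypothetical, and the settings may differ from the ones actually used in this test.

```shell
# Index settings with a stemmer_override filter in front of the stemmer.
# "my_override" and "my_index_override" are made-up names for illustration.
override_settings() {
  cat <<'EOF'
{
  "settings": {
    "analysis": {
      "filter": {
        "my_override": {
          "type": "stemmer_override",
          "rules": ["journalière => journal"]
        },
        "my_snow": { "type": "snowball", "language": "French" },
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "autr => synonym_1",
            "journali => synonym_2",
            "journalier => synonym_3"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["my_override", "my_snow", "my_synonym_filter"]
        }
      }
    }
  }
}
EOF
}

# Create the index on a local node (assumption: Elasticsearch on localhost:9200).
create_override_index() {
  curl -s -XPUT "http://localhost:9200/my_index_override" \
    -H 'Content-Type: application/json' \
    -d "$(override_settings)"
}

# Usage: create_override_index, then run _analyze on "journalière" as in the question.
```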
nszi6y05 #2
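I opened an issue, and in the end it is not a bug. When a synonym filter is used after a stemmer, the stemmed tokens should not be put in the synonym filter. Elasticsearch parses the synonym rules with the tokenizer and token filters that precede the synonym filter, so a rule entry that is already stemmed gets stemmed a second time. That would explain the behaviour in the question: the entry “journali” presumably becomes “journal”, and “journalier” becomes “journali”, which is why “journalière” ended up as “synonym_3”.
Here is how I should have defined my synonym filter: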
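A minimal sketch of such a definition, assuming the goal is to map the three words tested in the question and that the rules list the original, unstemmed words so that the stemmer in front of the synonym filter stems them itself. The index name my_index_fixed and the helper functions are hypothetical.

```shell
# Index settings whose synonym rules use the unstemmed words.
# "my_index_fixed" is a made-up index name for illustration.
fixed_settings() {
  cat <<'EOF'
{
  "settings": {
    "analysis": {
      "filter": {
        "my_snow": { "type": "snowball", "language": "French" },
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "autre => synonym_1",
            "journalière => synonym_2",
            "journaliere => synonym_3"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["my_snow", "my_synonym_filter"]
        }
      }
    }
  }
}
EOF
}

# Create the index on a local node (assumption: Elasticsearch on localhost:9200).
create_fixed_index() {
  curl -s -XPUT "http://localhost:9200/my_index_fixed" \
    -H 'Content-Type: application/json' \
    -d "$(fixed_settings)"
}

# Usage: create_fixed_index, then re-run the _analyze queries from the question
# against my_index_fixed to check which synonym each word maps to.
```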