Elasticsearch query_string: "abusifs" (plural) finds a document, but "abusif" (singular) does not. Did I ingest my PDF document incorrectly, or is my query wrong?

Posted by gajydyqb 12 months ago in Elasticsearch

I ingested a PDF into Elastic with this command:

curl -s -X PUT -H "Content-Type: application/json" -u "$user:$pwd" \
   -d "@$json_file" "$host/$index/_doc/$entree?pipeline=attachment"

pdfinfo reports the following for the PDF:

Title:           t416. Urbanisme : la loi ELAN
Subject:         
Keywords:        ELAN, construction, marchand de sommeil, lutte contre les recours abusifs
Author:          Marc Le Bihan
Creator:         LaTeX via pandoc
Producer:        pdfTeX-1.40.24
CreationDate:    Fri Nov 10 04:56:26 2023 CET
ModDate:         Fri Nov 10 04:56:26 2023 CET
Custom Metadata: yes
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           1
Encrypted:       no
Page size:       612 x 792 pts (letter)
Page rot:        0
File size:       90038 bytes
Optimized:       no
PDF version:     1.5


When I query the index with the word abusifs, the plural form of abusif in French:

GET apprentissage/_search
{
 "query": {
    "query_string": {
      "query": "abusifs"
    }
  },
  
  "_source": {
    "includes": [ "attachment.modified", "attachment.title", "attachment.content"]
  }    
}


it finds the entry:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 7.6852083,
    "hits": [
      {
        "_index": "apprentissage",
        "_id": "t416-urbanisme-la_loi_ELAN",
        "_score": 7.6852083,
        "_ignored": [
          "attachment.content.keyword",
          "data.keyword"
        ],
        "_source": {
          "attachment": {
            "modified": "2023-11-10T03:56:26Z",
            "title": "t416. Urbanisme : la loi ELAN",
            "content": """t416. Urbanisme : la loi ELAN
Loi portant Évolution du Logement, de l’Aménagement et du Numérique

Marc Le Bihan

23/11/2018 : Loi portant évolution du logement, de l’aménagement et du numérique
(ELAN) :
[...]
2) Lutte contre les recours abusifs
[...]"""
          }
        }
      }
    ]
  }
}


But if I query only for its singular form, abusif, it finds nothing:

GET apprentissage/_search
{
 "query": {
    "query_string": {
      "query": "abusif"
    }
  },
  
  "_source": {
    "includes": [ "attachment.modified", "attachment.title", "attachment.content"]
  }    
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

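I thought the ingest pipeline would detect the language of the document by itself. Did that fail?
Should I set the language more explicitly, either in my ingestion command or in the PDF?
My document does not appear to have been indexed as French.
Or perhaps my query is simply not the right one for this kind of search?

The /apprentissage index, into which the documents are ingested: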

{
  "apprentissage": {
    "aliases": {},
    "mappings": {
      "properties": {
        "attachment": {
          "properties": {
            "author": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "content": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "content_length": {
              "type": "long"
            },
            "content_type": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "creator_tool": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "date": {
              "type": "date"
            },
            "format": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "keywords": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "language": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "modified": {
              "type": "date"
            },
            "title": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "data": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "apprentissage",
        "creation_date": "1694840235250",
        "number_of_replicas": "1",
        "uuid": "yMn4iKJxT42s5gOX2rFZYw",
        "version": {
          "created": "8100099"
        }
      }
    }
  }
}


My ingestion script:

#!/bin/bash
export source=$1

# The source parameter must be provided
if [ -z "$source" ]; then
   echo "Le nom du fichier pdf à indexer dans Elastic est attendu en paramètre." >&2
   exit 1
fi

# If the source file has no extension, append .pdf
if [[ "$source" != *"."* ]]; then
   source=$source.pdf
fi

# It must have the .pdf extension
if [[ "$source" != *".pdf" ]]; then
   echo "Le fichier à indexer dans Elastic doit avoir l'extension .pdf" >&2
   exit 1
fi

host="http://localhost:9200"
user="elastic"
pwd="...."

index=apprentissage
entree=$(basename "${source%.*}")
json_file=$(mktemp)
cur_url="$host/$index/_doc/$entree?pipeline=attachment"

echo '{"data"  : "'"$( base64 "$source" -w 0    )"'"}' >"$json_file"
# echo "transfert via $json_file vers $cur_url"

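if ! ingest=$(curl -s -X PUT -H "Content-Type: application/json" -u "$user:$pwd" -d "@$json_file" "$cur_url"); then
  echo "Echec de l'ingestion dans Elastic de $source : $ingest" >&2
  rm -f "$json_file"
  # "exit $?" here would return the status of echo (0), so exit with an explicit error code
  exit 1
fi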

rm "$json_file"
echo "$source indexé dans Elastic"

Answer 1, by qqrboqgw

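According to your mapping, the attachment.content field is analyzed by the standard analyzer, because no other analyzer is specified. The standard analyzer is not French-aware, so it performs no French stemming, and abusif and abusifs are indexed as two different terms. That explains the results you see.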
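One way to check this is the `_analyze` API, which returns the tokens an analyzer produces for a given text. The sketch below is only an illustration and not part of the original answer. It assumes `user` and `pwd` are set as in the ingestion script from the question, and the helper names `analyze_body` and `analyze` are made up for this example.

```shell
#!/bin/bash
# Sketch: compare the tokens produced by two analyzers for the same text.
# Assumption: user and pwd are set as in the ingestion script of the question.
host="${ES_HOST:-http://localhost:9200}"

# Build the JSON body for the _analyze API.
# Note: the arguments are not JSON-escaped, so keep them free of quotes.
analyze_body() {
  local analyzer=$1 text=$2
  printf '{"analyzer": "%s", "text": "%s"}' "$analyzer" "$text"
}

# Send the body to Elasticsearch (requires a running cluster).
analyze() {
  curl -s -X POST -H "Content-Type: application/json" -u "$user:$pwd" \
       -d "$(analyze_body "$1" "$2")" "$host/_analyze"
}

# Example calls against a running cluster:
#   analyze standard "abusif abusifs"
#   analyze french   "abusif abusifs"
```

With the standard analyzer the two words come back as two distinct tokens. With the french analyzer they would be expected to reduce to a common stem, which is what lets a search for one form match the other.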
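If you know that you will only index French content, you can make the content field French-aware by using a French analyzer.
You need to recreate the index with the following mapping: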

"content": {
          "type": "text",
          "analyzer": "french",            <--- add this analyzer
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },

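Then you need to reindex your content. Once that is done, your search query will work as expected, and the document will be found whether you search for abusifs or abusif.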
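As a rough sketch of those two steps, one option is to create a second index with the French analyzer and copy the existing documents into it with the `_reindex` API. The details here are assumptions, not something the answer prescribes: the index name `apprentissage_fr` and the helper names are hypothetical, `user` and `pwd` are expected to be set as in the ingestion script, and only `attachment.content` is mapped explicitly while the other fields are left to dynamic mapping.

```shell
#!/bin/bash
# Sketch: create a French-aware index and copy existing documents into it.
# Assumption: user and pwd are set as in the ingestion script of the question.
host="${ES_HOST:-http://localhost:9200}"
old_index="apprentissage"
new_index="apprentissage_fr"   # hypothetical name for the new index

# Mapping for the new index: attachment.content uses the built-in french analyzer.
mapping_body() {
  cat <<'EOF'
{
  "mappings": {
    "properties": {
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "french",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      }
    }
  }
}
EOF
}

# Body for the _reindex API: copy every document from the old index to the new one.
reindex_body() {
  printf '{"source": {"index": "%s"}, "dest": {"index": "%s"}}' "$old_index" "$new_index"
}

# Create the new index, then copy the documents (requires a running cluster).
migrate() {
  curl -s -X PUT -H "Content-Type: application/json" -u "$user:$pwd" \
       -d "$(mapping_body)" "$host/$new_index"
  curl -s -X POST -H "Content-Type: application/json" -u "$user:$pwd" \
       -d "$(reindex_body)" "$host/_reindex"
}

# Example call against a running cluster:
#   migrate
```

After a migration like this, both the search requests and the `index` variable in the ingestion script would have to point at the new index name.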
c.q.f.d. ;-)
