I ingested a PDF into Elastic with this command:
curl -s -X PUT -H "Content-Type: application/json" -u "$user:$pwd" \
  -d "@$json_file" "$host/$index/_doc/$entree?pipeline=attachment"
pdfinfo reports the following for the PDF:
Title: t416. Urbanisme : la loi ELAN
Subject:
Keywords: ELAN, construction, marchand de sommeil, lutte contre les recours abusifs
Author: Marc Le Bihan
Creator: LaTeX via pandoc
Producer: pdfTeX-1.40.24
CreationDate: Fri Nov 10 04:56:26 2023 CET
ModDate: Fri Nov 10 04:56:26 2023 CET
Custom Metadata: yes
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 612 x 792 pts (letter)
Page rot: 0
File size: 90038 bytes
Optimized: no
PDF version: 1.5
When I query the index with the word abusifs, the plural form of the French word abusif:
GET apprentissage/_search
{
"query": {
"query_string": {
"query": "abusifs"
}
},
"_source": {
"includes": [ "attachment.modified", "attachment.title", "attachment.content"]
}
}
it finds the entry:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 7.6852083,
"hits": [
{
"_index": "apprentissage",
"_id": "t416-urbanisme-la_loi_ELAN",
"_score": 7.6852083,
"_ignored": [
"attachment.content.keyword",
"data.keyword"
],
"_source": {
"attachment": {
"modified": "2023-11-10T03:56:26Z",
"title": "t416. Urbanisme : la loi ELAN",
"content": """t416. Urbanisme : la loi ELAN
Loi portant Évolution du Logement, de l’Aménagement et du Numérique
Marc Le Bihan
23/11/2018 : Loi portant évolution du logement, de l’aménagement et du numérique
(ELAN) :
[...]
2) Lutte contre les recours abusifs
[...]
}
}
}
]
}
}
But if I query for the singular form only, abusif, it finds nothing:
GET apprentissage/_search
{
"query": {
"query_string": {
"query": "abusif"
}
},
"_source": {
"includes": [ "attachment.modified", "attachment.title", "attachment.content"]
}
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
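I thought the ingest pipeline would detect the language by itself. Did that fail?
Should I set the language more explicitly, either in my ingest command or in the PDF?
My document does not look like it was indexed as French.
Or maybe my query is not the right one for this kind of search?
The apprentissage index, where the documents are ingested: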
{
"apprentissage": {
"aliases": {},
"mappings": {
"properties": {
"attachment": {
"properties": {
"author": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content_length": {
"type": "long"
},
"content_type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"creator_tool": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date": {
"type": "date"
},
"format": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"keywords": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"language": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"modified": {
"type": "date"
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"data": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"number_of_shards": "1",
"provided_name": "apprentissage",
"creation_date": "1694840235250",
"number_of_replicas": "1",
"uuid": "yMn4iKJxT42s5gOX2rFZYw",
"version": {
"created": "8100099"
}
}
}
}
}
My ingestion script:
#!/bin/bash
export source=$1
# Le paramètre source doit être alimenté
if [ -z "$source" ]; then
echo "Le nom du fichier pdf à indexer dans Elastic est attendu en paramètre." >&2
exit 1
fi
# Si le fichier source n'a pas d'extension, lui rajouter celle .pdf
if [[ "$source" != *"."* ]]; then
source=$source.pdf
fi
# Il doit avoir l'extension pdf
if [[ "$source" != *".pdf" ]]; then
echo "Le fichier à indexer dans Elastic doit avoir l'extension .pdf" >&2
exit 1
fi
host="http://localhost:9200"
user="elastic"
pwd="...."
index=apprentissage
entree=$(basename "${source%.*}")
json_file=$(mktemp)
cur_url="$host/$index/_doc/$entree?pipeline=attachment"
echo '{"data" : "'"$( base64 "$source" -w 0 )"'"}' >"$json_file"
# echo "transfert via $json_file vers $cur_url"
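ingest=$(curl -s -X PUT -H "Content-Type: application/json" -u "$user:$pwd" -d "@$json_file" "$cur_url")
# Save curl's exit status now: after the echo below, $? would be 0
status=$?
if [ "$status" -ne 0 ]; then
echo "Echec de l'ingestion dans Elastic de $source : $ingest" >&2
exit "$status"
fi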
rm "$json_file"
echo "$source indexé dans Elastic"
1 answer
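According to your mapping, the attachment.content field is analyzed by the standard analyzer, because no other analyzer is specified. The standard analyzer is not French-aware, so it performs no French stemming, and abusif and abusifs are indexed as two different tokens. Hence the results you see. If you know you will only index French content, you can make your content field French-aware by using the french analyzer.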
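One way to see the difference is to run the same text through both analyzers with the _analyze API. This is only a sketch: the sample text is an arbitrary choice, and the exact tokens returned depend on your Elasticsearch version. With standard, the two forms should come back as two distinct tokens. With french, both should reduce to the same stem.

```console
POST _analyze
{
  "analyzer": "standard",
  "text": "abusif abusifs"
}

POST _analyze
{
  "analyzer": "french",
  "text": "abusif abusifs"
}
```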
You need to recreate the index with a mapping that assigns the french analyzer to the attachment.content field.
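A minimal sketch of such a mapping, assuming you recreate the index under the same name and let every other field be mapped dynamically as before. The analyzer of an existing field cannot be changed in place, which is why the index is deleted first. The DELETE is destructive, so only run it if you can re-ingest your PDFs.

```console
DELETE apprentissage

PUT apprentissage
{
  "mappings": {
    "properties": {
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "french"
          }
        }
      }
    }
  }
}
```

The keyword sub-field from your current mapping is left out here on purpose. The search response above shows it was already being ignored for this document (see the _ignored list), because the content exceeds ignore_above.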
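Then you need to reindex your content. Once that is done, your search query will work as expected and will find the document whether you search for abusifs or abusif. c.q.f.d. ;-)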
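If you would rather not delete the existing index, an alternative is to create the French-aware mapping under a new index name and copy the documents over with the _reindex API. In this sketch, apprentissage_fr is a hypothetical name and is assumed to already exist with the french analyzer on attachment.content. Since the extracted text is stored in _source, the attachment pipeline should not need to run again.

```console
POST _reindex
{
  "source": { "index": "apprentissage" },
  "dest": { "index": "apprentissage_fr" }
}

GET apprentissage_fr/_search
{
  "query": {
    "query_string": {
      "query": "abusif"
    }
  },
  "_source": {
    "includes": [ "attachment.title" ]
  }
}
```

Re-running your ingestion script against the recreated index for each PDF would work just as well.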