How to find similar tags from text using Elasticsearch

Posted by guicsvcw on 2022-11-07 in Lucene


I am trying to use Elasticsearch to find the most similar tags for a piece of text.
For example, I created test_index and inserted two documents:

POST test_index/_doc/17
{
  "id": 17,
  "tags": ["it", "devops", "server"]
}

POST test_index/_doc/20
{
  "id": 20,
  "tags": ["software", "hardware"]
}

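Given the text "I am using some software and applications", I would like to find the "software" tag (either its text or its ID).
I hope someone can provide an example of how to do this, or at least point me in the right direction.

Thank you.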

lucene

Source: https://stackoverflow.com/questions/60777533/how-to-find-similar-tags-from-text-using-elastic-search

2 Answers


Answer 1 (6jjcrrmo)

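What you are looking for is a concept called stemming. You need to create a custom analyzer that uses the stemmer token filter.
See the mapping, sample documents, query, and response below:

Mapping: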

PUT my_stem_index
{
  "settings": {
      "analysis" : {
          "analyzer" : {
              "my_analyzer" : {
                  "tokenizer" : "standard",
                  "filter" : ["lowercase", "my_stemmer"]
              }
          },
          "filter" : {
              "my_stemmer" : {
                  "type" : "stemmer",
                  "name" : "english"
              }
          }
      }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "tags":{
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword":{
            "type": "keyword"
          }
        }
      }
    }
  }
}

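From the comments, it appears you are using a version earlier than 7. In that case, you may need to add the type name (_doc below) under mappings: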

PUT my_stem_index
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "my_stemmer"
               ]
            }
         },
         "filter":{
            "my_stemmer":{
               "type":"stemmer",
               "name":"english"
            }
         }
      }
   },
   "mappings":{
      "_doc":{
         "properties":{
            "id":{
               "type":"keyword"
            },
            "tags":{
               "type":"text",
               "analyzer":"my_analyzer",
               "fields":{
                  "keyword":{
                     "type":"keyword"
                  }
               }
            }
         }
      }
   }
}

Sample documents:

POST my_stem_index/_doc/17
{
  "id": 17,
  "tags": ["it", "devops", "server"]
}

POST my_stem_index/_doc/20
{
  "id": 20,
  "tags": ["software", "hardware"]
}

POST my_stem_index/_doc/21
{
  "id": 21,
  "tags": ["softwares and applications", "hardwares and storage devices"]
}

Query:

POST my_stem_index/_search
{
  "query": {
    "match": {
      "tags": "software"
    }
  }
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5908618,
    "hits" : [
      {
        "_index" : "my_stem_index",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 0.5908618,
        "_source" : {
          "id" : 20,
          "tags" : [
            "software",
            "hardware"
          ]
        }
      },
      {
        "_index" : "my_stem_index",
        "_type" : "_doc",
        "_id" : "21",
        "_score" : 0.35965496,
        "_source" : {
          "id" : 21,
          "tags" : [
            "softwares and applications",             <--- Note that `softwares` was also matched.
            "hardwares and storage devices"
          ]
        }
      }
    ]
  }
}

In the response, note that both documents, those with _id 20 and 21, are returned.
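
Since the question starts from a whole sentence rather than a single word, the same match query can presumably be given the full text. The request below is a sketch under that assumption: it reuses my_stem_index and the tags field from above, and relies on the match query analyzing the query text with the field's analyzer (the default when no separate search analyzer is set), so each word of the sentence is stemmed before it is compared with the stemmed tags.

```console
# Sketch only: the index, field, and sentence come from the posts above.
# Words such as "and" are not removed, because my_analyzer has no stop filter.
POST my_stem_index/_search
{
  "_source": ["id", "tags"],
  "query": {
    "match": {
      "tags": "I am using some software and applications"
    }
  }
}
```

With the three sample documents, this should return documents 20 and 21 and leave out document 17, since none of its tags share a stem with the sentence. The ordering depends on scoring and may differ from the single-word query.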

Additional note:

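If you are new to Elasticsearch, I suggest spending some time on the concept of analysis and on how Elasticsearch implements it with analyzers.
This will help you understand why a document containing softwares and applications is returned when you query only for software, and vice versa.
Hope this helps!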
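
One way to see what the analyzer does to a piece of text is the _analyze API. The request below is a minimal sketch that assumes my_stem_index was created with the mapping above; the sample text is only an illustration.

```console
# Runs my_analyzer from my_stem_index against an arbitrary string
# and returns the tokens that would be indexed or searched.
GET my_stem_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "softwares and applications"
}
```

Running it once with softwares and once with software should show both reduced to the same stem, which is why either form finds the other.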

Posted 2022-11-07

Answer 2 (yqyhoc1h)

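If the words in the text you are searching have a base or root form, stemming is a good approach.
If you need to find the words most similar to those in the text, n-grams are a more suitable approach.
If you are looking for exact words from the text within multi-word tags, shingles are a better approach.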
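
To get a feel for the other two approaches, the tokens they produce can be inspected with the _analyze API before committing to a mapping. The requests below are sketches, not tuned settings: the gram size of 3 and the sample strings are assumptions chosen for illustration.

```console
# N-grams: split each word into overlapping 3-character pieces,
# so that partially similar words share tokens.
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "ngram", "min_gram": 3, "max_gram": 3 }
  ],
  "text": "software"
}

# Shingles: combine adjacent words into multi-word tokens,
# so that exact word sequences can be matched.
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "shingle" ],
  "text": "softwares and applications"
}
```

With these settings the first request should emit tokens such as "sof", "oft", and "ftw", and the second should emit the single words plus two-word shingles such as "softwares and" and "and applications". To use either approach for search, the same filter would be defined in the index settings and attached to the tags field, in the same way as my_stemmer in the first answer.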

Posted 2022-11-07
