Elasticsearch query to get the list of documents with some minimum occurrence of a property

Posted by rdrgkggo on 2023-04-29 in ElasticSearch

I have an index of documents like this:

[
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-23"
   ...
 },
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
   ...
 },
 {
   "customer_id" : "345",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-25"
   ...
 }
]

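I want to get the list of all documents from a specific country, e.g. USA, for which the same customer_id occurs at least 2 times within a given time range. With the data above, it should return: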

[
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
   ...
 }
]

So far, I have tried the following ES query:

POST /index_name/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "creation_date": {
              "gte": "2021-06-23",
              "lte": "2021-08-23"
            }
          }
        },
        {
          "match": {
            "country": "USA"
          }
        }
      ]
    }
  },
  "aggs": {
    "customer_agg": {
      "terms": {
        "field": "customer_id",
        "min_doc_count": 2
      }
    }
  }
}

The query above returns the following result:

"hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.5587491,
    "hits" : [...]
  },
  "aggregations" : {
    "person_agg" : {
      "doc_count_error_upper_bound" : 1,
      "sum_other_doc_count" : 1,
      "buckets" : [
        {
          "key" : "customer_id",
          "doc_count" : 2
        }
      ]
    }
  }

I don't need the list of buckets in the response, only the list of documents that satisfy the condition. How can I achieve that?

elasticsearch

Source: https://stackoverflow.com/questions/69222093/elasticsearch-query-to-get-the-list-of-documents-with-some-minimum-occurrence-of

2 Answers

fjnneemd1#

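At first glance, I noticed that your search query targets a field named creation_timestamp, whereas in the document mapping the date field you want to range-check is called creation_date.
I decided to test this on a local Elasticsearch 7.10. Here is the setup I used: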

PUT /test-index-v1

PUT /test-index-v1/_mapping
{
        "properties": {
            "customer_id": {
                "type": "keyword"
            },
            "country": {
                "type": "keyword"
            },
            "department": {
                "type": "keyword"
            },
            "creation_date": {
              "type": "date"
            }
        }
}

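As you can see, I used the keyword type for the fields so that they can be used for sorting, aggregations, and so on.
After creating the index, I indexed the documents you gave as an example: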

POST /test-index-v1/_doc
 {
   "customer_id" : "345",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-25"
}

POST /test-index-v1/_doc
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-25"
}

POST /test-index-v1/_doc
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
}

Then I ran this search query, which adds a must match clause on customer_id:

POST /test-index-v1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "creation_date": {
              "gte": "2021-06-23",
              "lte": "2021-08-23"
            }
          }
        },
        {
          "match": {
            "country": "USA"
          }
        },
        {
          "match": {
            "customer_id": "123"
          }
        }
      ]
    }
  },
  "aggs": {
    "customer_agg": {
      "terms": {
        "field": "customer_id",
        "min_doc_count": 2
      }
    }
  }
}

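This query returns the search hits as well. The aggregation by itself does not return search hits, only buckets.
Here is the response I received: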

"took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.6035349,
    "hits" : [
      {
        "_index" : "test-index-v1",
        "_type" : "_doc",
        "_id" : "vbVD9HsBRVWFAvvZTW-l",
        "_score" : 1.6035349,
        "_source" : {
          "customer_id" : "123",
          "country" : "USA",
          "department" : "IT",
          "creation_date" : "2021-06-25"
        }
      },
      {
        "_index" : "test-index-v1",
        "_type" : "_doc",
        "_id" : "vrVD9HsBRVWFAvvZU29q",
        "_score" : 1.6035349,
        "_source" : {
          "customer_id" : "123",
          "country" : "USA",
          "department" : "IT",
          "creation_date" : "2021-06-24"
        }
      }
    ]
  },
  "aggregations" : {
    "customer_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "123",
          "doc_count" : 2
        }
      ]
    }
  }
}
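
The query above hard-codes customer_id 123. One possible way to generalize it, sketched here in Python as an assumption rather than as part of the original answer, is a two-step approach: read the bucket keys from the customer_agg part of a response, then build a second search that filters on those keys with a terms query. The helper names are hypothetical, and the functions only build and read plain dictionaries; sending the body to Elasticsearch is left to whichever client you use.

```python
def qualifying_ids(response, agg_name="customer_agg"):
    """Collect the bucket keys of a terms aggregation from a search response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


def documents_query(customer_ids, country, gte, lte):
    """Build a search body that returns the documents of the given customers.

    Assumes country and customer_id are mapped as keyword, so exact-match
    term/terms clauses in filter context are appropriate.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"creation_date": {"gte": gte, "lte": lte}}},
                    {"term": {"country": country}},
                    {"terms": {"customer_id": customer_ids}},
                ]
            }
        }
    }


# Aggregation part of the response shown above.
sample_response = {
    "aggregations": {
        "customer_agg": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{"key": "123", "doc_count": 2}],
        }
    }
}

ids = qualifying_ids(sample_response)
body = documents_query(ids, "USA", "2021-06-23", "2021-08-23")
```

Note that a terms aggregation returns only 10 buckets unless its size is raised, so the first step may need a larger size (or pagination) when many customers qualify.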

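I hope this helps with your problem. If you have other questions about Elastic, feel free to leave a comment! :)
Edit:
To group by customer_id within a date range, I used the following query: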

POST /test-index-v1/_search
{
  "aggs": {
    "group_by_customer_id": {
      "terms": {
        "field": "customer_id"
      },
      "aggs": {
        "dates_between": {
          "filter": {
            "range": {
              "creation_date": {
                "gte": "2020-06-23",
                "lte": "2021-06-24"
              }
            }
          }
        }
      }
    }
  }
}

The response was:

"aggregations" : {
    "group_by_customer_id" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "123",
          "doc_count" : 2,
          "dates_between" : {
            "doc_count" : 1
          }
        },
        {
          "key" : "345",
          "doc_count" : 1,
          "dates_between" : {
            "doc_count" : 0
          }
        }
      ]
    }
  }
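
In the response above, dates_between only reports how many documents of each customer fall inside the range; it does not drop customers that are below a threshold. A minimal sketch, assuming a bucket_selector pipeline aggregation is acceptable here (it is not part of the original answer, and the names grouped_with_minimum and min_in_range are made up for illustration), keeps only the buckets whose in-range count reaches the minimum. The Python below just builds the request body as a dictionary.

```python
import json


def grouped_with_minimum(gte, lte, min_count=2):
    """Build an aggregation-only search body.

    Documents are grouped by customer_id, counted inside the date range, and
    a bucket is dropped when that count is below min_count.
    """
    return {
        "size": 0,  # aggregation results only, no search hits
        "aggs": {
            "group_by_customer_id": {
                "terms": {"field": "customer_id"},
                "aggs": {
                    "dates_between": {
                        "filter": {
                            "range": {"creation_date": {"gte": gte, "lte": lte}}
                        }
                    },
                    # Hypothetical name. "_count" refers to the doc_count of
                    # the dates_between filter aggregation.
                    "min_in_range": {
                        "bucket_selector": {
                            "buckets_path": {"in_range": "dates_between._count"},
                            "script": f"params.in_range >= {int(min_count)}",
                        }
                    },
                },
            }
        },
    }


body = grouped_with_minimum("2020-06-23", "2021-06-24")
print(json.dumps(body, indent=2))
```

With the counts shown in the response above (1 in-range document for 123 and 0 for 345), a minimum of 2 would remove both buckets, while a minimum of 1 would keep only 123.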

kgqe7b3p2#

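To get the list of documents, you can use a top_hits aggregation, which by default returns at most 100 documents per bucket.

The first size parameter, in the terms aggregation documents_count, specifies how many customer_id values (buckets) are returned. The sample data contains customer_id 123 and 345, so with size set to 1 only one of them is returned, which on its own could mean 2 documents (123) or 1 document (345). With min_doc_count set to 2, however, only the 2 documents of customer_id 123 are returned. Raising size to 2 still returns only the 2 documents of customer_id 123, because 345 has just 1 document.

The second size, in the top_hits aggregation, caps the number of documents returned for each bucket. If there were 50 documents with id 123 and 60 with id 345, with min_doc_count set to 2 and the terms size set to 2, then setting the top_hits size to 100 would return up to 100 documents per bucket. You can try combining this with the time-range filter.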

POST /index_name/_search
{
  "size": 0,
  "aggs": {
    "documents_count": {
      "terms": {
        "field": "customer_id",
        "min_doc_count": 2,
        "size": 1
      },
      "aggs": {
        "duplicate_documents": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
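
To tie this to the country and time-range conditions from the question, the aggregation can be placed under a filtered query. The Python sketch below is one way to do that under a few assumptions: the helper names are hypothetical, country is mapped as keyword (as in the mapping from the first answer), and the request is built as a plain dictionary rather than sent through a specific client. The second helper flattens the top_hits buckets of a response into a single document list. The sample response is hand-written to mirror the shape of a top_hits result; it is not real output.

```python
def duplicates_request(country, gte, lte, min_count=2,
                       max_customers=10, docs_per_customer=5):
    """Build a search body: filter first, then group and keep the documents."""
    return {
        "size": 0,  # the documents are read from the top_hits buckets
        "query": {
            "bool": {
                "filter": [
                    {"term": {"country": country}},
                    {"range": {"creation_date": {"gte": gte, "lte": lte}}},
                ]
            }
        },
        "aggs": {
            "documents_count": {
                "terms": {
                    "field": "customer_id",
                    "min_doc_count": min_count,
                    "size": max_customers,  # number of customer_id buckets
                },
                "aggs": {
                    "duplicate_documents": {
                        "top_hits": {"size": docs_per_customer}
                    }
                },
            }
        },
    }


def flatten_documents(response):
    """Turn the per-customer top_hits buckets into one flat list of sources."""
    documents = []
    for bucket in response["aggregations"]["documents_count"]["buckets"]:
        for hit in bucket["duplicate_documents"]["hits"]["hits"]:
            documents.append(hit["_source"])
    return documents


# Hand-written illustration of the response shape, not real output.
sample_response = {
    "aggregations": {
        "documents_count": {
            "buckets": [
                {
                    "key": "123",
                    "doc_count": 2,
                    "duplicate_documents": {
                        "hits": {
                            "hits": [
                                {"_source": {"customer_id": "123",
                                             "creation_date": "2021-06-23"}},
                                {"_source": {"customer_id": "123",
                                             "creation_date": "2021-06-24"}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

request_body = duplicates_request("USA", "2021-06-23", "2021-08-23")
documents = flatten_documents(sample_response)
```

Because the query filter runs before the aggregation, min_doc_count here counts only the documents inside the country and date range, which is the condition the question asks for.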
