Elasticsearch query to get the list of documents in which an attribute occurs at least a minimum number of times

rdrgkggo  posted on 2023-04-29  in  ElasticSearch

I have an index of documents like this:

[
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-23"
   ...
 },
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
   ...
 },
 {
   "customer_id" : "345",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-25"
   ...
 }
]

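I want to get the list of all documents from a particular country, e.g. USA, for which the same customer_id occurs at least 2 times within a given time range. With the data above, it should return: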

[
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
   ...
 }
]

So far, I have tried the following ES query:

POST /index_name/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "creation_date": {
              "gte": "2021-06-23",
              "lte": "2021-08-23"
            }
          }
        },
        {
          "match": {
            "country": "USA"
          }
        }
      ]
    }
  },
  "aggs": {
    "customer_agg": {
      "terms": {
        "field": "customer_id",
        "min_doc_count": 2
      }
    }
  }
}

The query above returns the following result:

"hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.5587491,
    "hits" : [...]
  },
  "aggregations" : {
    "person_agg" : {
      "doc_count_error_upper_bound" : 1,
      "sum_other_doc_count" : 1,
      "buckets" : [
        {
          "key" : "customer_id",
          "doc_count" : 2
        }
      ]
    }
  }

I don't need the list of buckets in the response, only the list of documents that satisfy the condition. How can I achieve that?

fjnneemd1#

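At first glance, I noticed that your search query filters on a field named creation_timestamp, whereas in the document mapping the date field you want to range-check is called creation_date.
I decided to test this locally on Elasticsearch 7.10. Here is the setup I used: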

PUT /test-index-v1

PUT /test-index-v1/_mapping
{
        "properties": {
            "customer_id": {
                "type": "keyword"
            },
            "country": {
                "type": "keyword"
            },
            "department": {
                "type": "keyword"
            },
            "creation_date": {
              "type": "date"
            }
        }
}

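As you can see, I mapped the string fields as keyword so that we can use sorting, aggregations, and so on.
After creating the index, I indexed the documents you gave as examples: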

POST /test-index-v1/_doc
 {
   "customer_id" : "345",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-25"
}

POST /test-index-v1/_doc
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-25"
}

POST /test-index-v1/_doc
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
}

Then I ran this search query, which adds a must match on customer_id:

POST /test-index-v1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "creation_date": {
              "gte": "2021-06-23",
              "lte": "2021-08-23"
            }
          }
        },
        {
          "match": {
            "country": "USA"
          }
        },
        {
          "match": {
            "customer_id": "123"
          }
        }
      ]
    }
  },
  "aggs": {
    "customer_agg": {
      "terms": {
        "field": "customer_id",
        "min_doc_count": 2
      }
    }
  }
}

This query returns the search hits as well. The aggregation on its own only yields buckets, not search hits.
Here is the response I received:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.6035349,
    "hits" : [
      {
        "_index" : "test-index-v1",
        "_type" : "_doc",
        "_id" : "vbVD9HsBRVWFAvvZTW-l",
        "_score" : 1.6035349,
        "_source" : {
          "customer_id" : "123",
          "country" : "USA",
          "department" : "IT",
          "creation_date" : "2021-06-25"
        }
      },
      {
        "_index" : "test-index-v1",
        "_type" : "_doc",
        "_id" : "vrVD9HsBRVWFAvvZU29q",
        "_score" : 1.6035349,
        "_source" : {
          "customer_id" : "123",
          "country" : "USA",
          "department" : "IT",
          "creation_date" : "2021-06-24"
        }
      }
    ]
  },
  "aggregations" : {
    "customer_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "123",
          "doc_count" : 2
        }
      ]
    }
  }
}
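
If you would rather not hard-code the customer_id in the query, one option is to keep the hits and the terms aggregation in a single response and filter the hits on the client by the bucket keys. The following is only a sketch under assumptions: the helper name hits_for_qualifying_customers is made up for illustration, the response is a trimmed copy of the one above, and the extra hit for customer 345 is a hypothetical addition to show the filtering.

```python
def hits_for_qualifying_customers(response, agg_name="customer_agg"):
    """Return the _source of every hit whose customer_id has a bucket
    in the given terms aggregation (i.e. passed min_doc_count)."""
    buckets = response["aggregations"][agg_name]["buckets"]
    qualifying = {bucket["key"] for bucket in buckets}
    return [
        hit["_source"]
        for hit in response["hits"]["hits"]
        if hit["_source"]["customer_id"] in qualifying
    ]


# Trimmed copy of the response above. The hit for customer 345 is a
# hypothetical extra, as it would appear without the customer_id match.
response = {
    "hits": {
        "hits": [
            {"_source": {"customer_id": "123", "country": "USA",
                         "department": "IT", "creation_date": "2021-06-25"}},
            {"_source": {"customer_id": "345", "country": "USA",
                         "department": "IT", "creation_date": "2021-06-25"}},
            {"_source": {"customer_id": "123", "country": "USA",
                         "department": "IT", "creation_date": "2021-06-24"}},
        ]
    },
    "aggregations": {
        "customer_agg": {"buckets": [{"key": "123", "doc_count": 2}]}
    },
}

docs = hits_for_qualifying_customers(response)
print([doc["creation_date"] for doc in docs])  # ['2021-06-25', '2021-06-24']
```

Keep in mind that this only sees the hits and buckets that are actually in the response. Both default to 10, so on a larger index the size values would have to be raised or the results paged.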

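I hope this helps with your question. If you have any other questions about Elastic, feel free to leave a comment! :)
Edit:
To group by customer_id within a date range, I used the following query: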

POST /test-index-v1/_search
{
  "aggs": {
    "group_by_customer_id": {
      "terms": {
        "field": "customer_id"
      },
      "aggs": {
        "dates_between": {
          "filter": {
            "range": {
              "creation_date": {
                "gte": "2020-06-23",
                "lte": "2021-06-24"
              }
            }
          }
        }
      }
    }
  }
}

The response was:

"aggregations" : {
    "group_by_customer_id" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "123",
          "doc_count" : 2,
          "dates_between" : {
            "doc_count" : 1
          }
        },
        {
          "key" : "345",
          "doc_count" : 1,
          "dates_between" : {
            "doc_count" : 0
          }
        }
      ]
    }
  }
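
To turn these buckets into the customers that occur at least N times within the date range, the dates_between count can be checked on the client. This is a minimal sketch, not part of the tested setup: the function name customers_with_min_in_range is hypothetical, and the response literal is copied from the output above.

```python
def customers_with_min_in_range(response, min_count,
                                agg_name="group_by_customer_id",
                                filter_name="dates_between"):
    """Return the customer_id keys whose in-range doc_count is >= min_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [
        bucket["key"]
        for bucket in buckets
        if bucket[filter_name]["doc_count"] >= min_count
    ]


# The aggregation part of the response shown above.
response = {
    "aggregations": {
        "group_by_customer_id": {
            "buckets": [
                {"key": "123", "doc_count": 2, "dates_between": {"doc_count": 1}},
                {"key": "345", "doc_count": 1, "dates_between": {"doc_count": 0}},
            ]
        }
    }
}

print(customers_with_min_in_range(response, 1))  # ['123']
print(customers_with_min_in_range(response, 2))  # []
```

With the range used in that query (2020-06-23 to 2021-06-24), only one document of customer 123 falls inside it, so no customer reaches a minimum of 2.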

kgqe7b3p2#

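To get the list of documents, you can use a top_hits aggregation, which can return up to 100 documents per bucket.
The first size parameter, the one in the documents_count terms aggregation, sets how many customer_id values (buckets) are returned. Here customer_id takes the values 123 and 345, so with size set to 1 only one customer's documents come back: either the 2 documents of 123 or the 1 document of 345 (terms orders buckets by document count by default, so 123 in this case). With min_doc_count set to 2, however, only the 2 documents of customer_id 123 are returned. Even if you raise size to 2, you still get just the 2 documents of customer_id 123, because 345 has only 1 document.
The size in the top_hits aggregation is the maximum number of documents returned per bucket. If there were 50 documents with id 123 and 60 with id 345, with min_doc_count set to 2 and the first size (in terms) set to 2, then setting the second size (in top_hits) to 100 would return up to 100 documents for each of them. You can try combining this with the time range filter.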

POST /index_name/_search
{
  "size": 0,
  "aggs": {
    "documents_count": {
      "terms": {
        "field": "customer_id",
        "min_doc_count": 2,
        "size": 1
      },
      "aggs": {
        "duplicate_documents": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
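
Combining this with the country and time range filter from the question could look roughly like the sketch below. It only builds the request body and flattens a response, without calling any client library. The function names build_request and flatten_top_hits are hypothetical, the term filter assumes country is mapped as keyword, and sample_response is a hand-written illustration of the top_hits response shape rather than real output.

```python
import json


def build_request(country, start, end, min_count=2,
                  max_customers=10, docs_per_customer=5):
    """Build a search body: filter first, then group by customer_id and
    keep the documents of customers with at least min_count matches."""
    return {
        "size": 0,  # no top-level hits; documents come from top_hits
        "query": {
            "bool": {
                "filter": [
                    # Assumes country is a keyword field.
                    {"term": {"country": country}},
                    {"range": {"creation_date": {"gte": start, "lte": end}}},
                ]
            }
        },
        "aggs": {
            "documents_count": {
                "terms": {
                    "field": "customer_id",
                    "min_doc_count": min_count,
                    "size": max_customers,
                },
                "aggs": {
                    "duplicate_documents": {
                        "top_hits": {"size": docs_per_customer}
                    }
                },
            }
        },
    }


def flatten_top_hits(response):
    """Collect the _source of every top_hits document across all buckets."""
    docs = []
    for bucket in response["aggregations"]["documents_count"]["buckets"]:
        for hit in bucket["duplicate_documents"]["hits"]["hits"]:
            docs.append(hit["_source"])
    return docs


body = build_request("USA", "2021-06-23", "2021-08-23")
print(json.dumps(body, indent=2))  # paste the output as the body of POST /index_name/_search

# Hand-written illustration of the response shape (not real output).
sample_response = {
    "aggregations": {
        "documents_count": {
            "buckets": [
                {
                    "key": "123",
                    "doc_count": 2,
                    "duplicate_documents": {
                        "hits": {
                            "hits": [
                                {"_source": {"customer_id": "123", "creation_date": "2021-06-23"}},
                                {"_source": {"customer_id": "123", "creation_date": "2021-06-24"}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

print(flatten_top_hits(sample_response))
```

Because the query runs before the aggregation, min_doc_count here counts only documents that are already inside the country and date filter, which is what the question asks for. max_customers and docs_per_customer still cap how much comes back, so they need to be sized for the real data.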
