Elasticsearch: multi-field aggregation on documents with multiple elements gives unexpected results

Posted by roqulrg3 on 2022-12-11 in ElasticSearch

I have some documents with the following structure (heavily simplified for the example):

"documents": [
    {
        "name": "Document 1",
        "collections" : [
            {
                "id": 30,
                "title" : "Research"
            },
            {
                "id": 45,
                "title" : "Events"
            },
            {
                "id" : 52,
                "title" : "International"
            }
        ]
    },
    {
        "name": "Document 2",
        "collections" : [
            {
                "id": 45,
                "title" : "Events"
            },
            {
                "id" : 63,
                "title" : "Development"
            }
        ]
    }
]

I need an aggregation over the collections. When I do it like this, it works well:

"aggs": {
        "collections": {
            "terms": {
                "field": "collections.title",
                "size": 30
            }
        }
    }

I get a good result, as expected:

"aggregations" : {
        "collections" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
                {
                    "key" : "Research",
                    "doc_count" : 18
                },
                {
                    "key" : "Events",
                    "doc_count" : 14
                },
                {
                    "key" : "International",
                    "doc_count" : 13
                },
                {
                    "key" : "Development",
                    "doc_count" : 8
                }
            ]
        }
    }

However, I also want to include the id. So I tried this:

"aggs": {
        "collections": {
            "terms": {
                "field": "collections.title",
                "size": 30
            },
            "aggs": {
                "id": {
                    "terms": {
                        "field": "collections.id",
                        "size": 1
                    }
                }
            }
        }
    }

This is the result:

"aggregations" : {
        "collections" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
                {
                    "key" : "Research",
                    "doc_count" : 18,
                    "id" : {
                        "doc_count_error_upper_bound" : 0,
                        "sum_other_doc_count" : 0,
                        "buckets" : [
                            {
                                "key" : "30",
                                "doc_count" : 1
                            }
                        ]
                    }
                },
                {
                    "key" : "Events",
                    "doc_count" : 14,
                    "id" : {
                        "doc_count_error_upper_bound" : 0,
                        "sum_other_doc_count" : 0,
                        "buckets" : [
                            {
                                "key" : "45",
                                "doc_count" : 1
                            }
                        ]
                    }
                },
                {
                    "key" : "International",
                    "doc_count" : 13,
                    "id" : {
                        "doc_count_error_upper_bound" : 0,
                        "sum_other_doc_count" : 0,
                        "buckets" : [
                            {
                                "key" : "52",
                                "doc_count" : 1
                            }
                        ]
                    }
                },
                {
                    "key" : "Development",
                    "doc_count" : 8,
                    "id" : {
                        "doc_count_error_upper_bound" : 0,
                        "sum_other_doc_count" : 0,
                        "buckets" : [
                            {
                                "key" : "45",
                                "doc_count" : 1
                            }
                        ]
                    }
                }
            ]
        }
    }

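At first glance it looks fine. But look closely at the last element, Development (scroll down): the id should be 63, but it is 45. I don't understand why this happens, and I can't find a way to fix it. I also tried multi_terms, but it gives a similar result. I think the problem is related to the fact that a document can have multiple collections. Does anyone know the right way to solve this?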

hsgswve4


The reason is that with an object-type mapping there is no relationship between "title" and "id". Elasticsearch flattens everything under the hood, so:

"collections" : [
            {
                "id": 30,
                "title" : "Research"
            },
            {
                "id": 45,
                "title" : "Events"
            },
            {
                "id" : 52,
                "title" : "International"
            }
        ]

becomes:

"collections.id": [30, 45, 52],
"collections.title": ["Research", "Events", "International"]

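Elasticsearch no longer knows that id 30 belongs to Research, or that id 45 belongs to Events.
You must use the "nested" type to preserve the relationship between the properties of each object. https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html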
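To make the flattening concrete, here is a small Python simulation of the two behaviours. It is only a sketch of the counting logic, not Elasticsearch code: the function names (`flatten`, `ids_per_title_flattened`, `ids_per_title_nested`) are made up for illustration, and the data is the two sample documents from the question.

```python
from collections import Counter, defaultdict

# The two sample documents from the question.
docs = [
    {"name": "Document 1", "collections": [
        {"id": 30, "title": "Research"},
        {"id": 45, "title": "Events"},
        {"id": 52, "title": "International"}]},
    {"name": "Document 2", "collections": [
        {"id": 45, "title": "Events"},
        {"id": 63, "title": "Development"}]},
]

def flatten(doc):
    """Mimic object-type indexing: one multi-valued list per leaf field."""
    return {
        "collections.id": [c["id"] for c in doc["collections"]],
        "collections.title": [c["title"] for c in doc["collections"]],
    }

def ids_per_title_flattened(docs):
    """Bucket whole documents by title, then count every id in those documents."""
    buckets = defaultdict(Counter)
    for doc in docs:
        flat = flatten(doc)
        for title in set(flat["collections.title"]):
            # Every id of the document lands in the bucket, related or not.
            buckets[title].update(set(flat["collections.id"]))
    return buckets

def ids_per_title_nested(docs):
    """Treat each inner object as its own unit, as the nested type does."""
    buckets = defaultdict(Counter)
    for doc in docs:
        for c in doc["collections"]:
            buckets[c["title"]][c["id"]] += 1
    return buckets

print(dict(ids_per_title_flattened(docs)["Development"]))
print(dict(ids_per_title_nested(docs)["Development"]))
```

In the flattened version the Development bucket sees both 45 and 63, each with a count of 1, because Document 2 contributes all of its ids. With "size": 1 only one of the tied ids is returned, which is how 45 can show up under Development. The nested version only ever pairs Development with 63.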

Solution: use the nested field type
Mapping

PUT test_nestedaggs
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "collections": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "keyword"
          },
          "id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Documents

POST test_nestedaggs/_doc
{
  "name": "Document 1",
  "collections": [
    {
      "id": 30,
      "title": "Research"
    },
    {
      "id": 45,
      "title": "Events"
    },
    {
      "id": 52,
      "title": "International"
    }
  ]
}
    
POST test_nestedaggs/_doc
{
  "name": "Document 2",
  "collections": [
    {
      "id": 45,
      "title": "Events"
    },
    {
      "id": 63,
      "title": "Development"
    }
  ]
}

Query

POST test_nestedaggs/_search?size=0
{
  "aggs": {
    "nested_collections": {
      "nested": {
        "path": "collections"
      },
      "aggs": {
        "collections": {
          "terms": {
            "field": "collections.title"
          },
          "aggs": {
            "ids": {
              "terms": {
                "field": "collections.id"
              }
            }
          }
        }
      }
    }
  }
}

Result

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "nested_collections": {
      "doc_count": 5,
      "collections": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "Events",
            "doc_count": 2,
            "ids": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "45",
                  "doc_count": 2
                }
              ]
            }
          },
          {
            "key": "Development",
            "doc_count": 1,
            "ids": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "63",
                  "doc_count": 1
                }
              ]
            }
          },
          {
            "key": "International",
            "doc_count": 1,
            "ids": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "52",
                  "doc_count": 1
                }
              ]
            }
          },
          {
            "key": "Research",
            "doc_count": 1,
            "ids": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "30",
                  "doc_count": 1
                }
              ]
            }
          }
        ]
      }
    }
  }
}
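If you consume this response from application code, the nested buckets can be turned into plain (title, id, count) tuples. The helper below is a minimal sketch in Python. `title_id_pairs` is a hypothetical name, the aggregation names default to the ones used in the query above, and `sample` is a trimmed copy of the response shown here rather than output from a live cluster.

```python
def title_id_pairs(response, outer="nested_collections",
                   inner="collections", sub="ids"):
    """Walk title buckets and their id sub-buckets in response order."""
    pairs = []
    title_buckets = response["aggregations"][outer][inner]["buckets"]
    for title_bucket in title_buckets:
        for id_bucket in title_bucket[sub]["buckets"]:
            pairs.append(
                (title_bucket["key"], id_bucket["key"], id_bucket["doc_count"])
            )
    return pairs

# Trimmed copy of the aggregation part of the response above.
sample = {
    "aggregations": {
        "nested_collections": {
            "doc_count": 5,
            "collections": {
                "buckets": [
                    {"key": "Events", "doc_count": 2,
                     "ids": {"buckets": [{"key": "45", "doc_count": 2}]}},
                    {"key": "Development", "doc_count": 1,
                     "ids": {"buckets": [{"key": "63", "doc_count": 1}]}},
                    {"key": "International", "doc_count": 1,
                     "ids": {"buckets": [{"key": "52", "doc_count": 1}]}},
                    {"key": "Research", "doc_count": 1,
                     "ids": {"buckets": [{"key": "30", "doc_count": 1}]}},
                ]
            },
        }
    }
}

for title, collection_id, count in title_id_pairs(sample):
    print(title, collection_id, count)
```

The ids come back as strings because the mapping declares "id" as keyword. Note also that the doc_count values here count nested objects, not top-level documents.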

You can read an article I wrote for more details:
https://opster.com/guides/elasticsearch/data-architecture/elasticsearch-nested-field-object-field/

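Note: if the number of nested objects is very large and you do a lot of updates, consider changing the data model. Each nested object is indexed as a separate document, and every time one of them is updated the whole structure is reindexed, which can hurt performance. There is also a limit on the number of nested documents you can add. If the number is small (as in the example), this is not a problem.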
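One possible alternative data model, which is my assumption and not something the answer prescribes, is to store each id/title pair as a single combined value at index time. A plain terms aggregation on that field then cannot mix up pairs. The Python sketch below only illustrates the idea: `collection_keys`, `with_pair_keys`, `count_pairs`, and the "|" separator are all hypothetical names and choices.

```python
from collections import Counter

# The two sample documents from the question.
docs = [
    {"name": "Document 1", "collections": [
        {"id": 30, "title": "Research"},
        {"id": 45, "title": "Events"},
        {"id": 52, "title": "International"}]},
    {"name": "Document 2", "collections": [
        {"id": 45, "title": "Events"},
        {"id": 63, "title": "Development"}]},
]

def with_pair_keys(doc, sep="|"):
    """Return a copy of doc with a hypothetical 'collection_keys' field.

    Each value joins id and title, so the pair survives flattening.
    """
    keys = [f'{c["id"]}{sep}{c["title"]}' for c in doc["collections"]]
    return {**doc, "collection_keys": keys}

def count_pairs(docs, sep="|"):
    """Count documents per combined key, like a terms aggregation would."""
    counts = Counter()
    for doc in docs:
        counts.update(set(with_pair_keys(doc, sep)["collection_keys"]))
    # Split on the first separator only; titles may contain the separator,
    # ids are assumed not to.
    return {tuple(key.split(sep, 1)): n for key, n in counts.items()}

for (collection_id, title), n in sorted(count_pairs(docs).items()):
    print(collection_id, title, n)
```

The trade-off is that the combined field has to be kept in sync whenever an id or title changes, and the separator must never appear in an id.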
