Athena SQL: distinct on one column, but return several?

Posted by tez616oj on 2021-07-24 in Java


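I can't seem to find a straightforward answer. I'm a beginner in SQL and I'm doing this in Amazon Athena. I want a distinct on one column, but I also want to return several other columns that aren't distinct. Here's my code: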

SELECT DISTINCT line_item_resource_id
FROM table
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
AND line_item_usage_account_id = '544934960'
AND line_item_usage_type LIKE '%BoxUsage%'
AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z';

I want the distinct to apply only to line_item_resource_id, but return all of these:

line_item_resource_id, line_item_usage_start_date, 
line_item_usage_end_date, line_item_usage_account_id, 
line_item_availability_zone, line_item_product_code, product_instance_type, 
pricing_term, product_operating_system, product_servicename, 
line_item_line_item_type, line_item_usage_type, line_item_operation, 
line_item_usage_amount

This code only returns line_item_resource_id. How do I get a distinct on that column only, while still returning the rest of the columns?

sql Distinct amazon-athena

Source: https://stackoverflow.com/questions/62785926/athena-sql-distinct-on-one-column-but-return-several

3 Answers


5gfr0r5j1#

Maryam's answer is correct. Here is a more elaborate version that uses the ARBITRARY function provided by Athena, as well as SUM:

SELECT 
  line_item_resource_id,
  MIN(line_item_usage_start_date) AS line_item_usage_start_date, 
  MAX(line_item_usage_end_date) AS line_item_usage_end_date,
  ARBITRARY(line_item_usage_account_id) AS line_item_usage_account_id,
  ARBITRARY(line_item_availability_zone) AS line_item_availability_zone,
  ARBITRARY(line_item_product_code) AS line_item_product_code,
  ARBITRARY(product_instance_type) AS product_instance_type,
  ARBITRARY(pricing_term) AS pricing_term,
  ARBITRARY(product_operating_system) AS product_operating_system,
  ARBITRARY(product_servicename) AS product_servicename,
  ARBITRARY(line_item_line_item_type) AS line_item_line_item_type,
  ARBITRARY(line_item_usage_type) AS line_item_usage_type,
  ARBITRARY(line_item_operation) AS line_item_operation, 
  SUM(line_item_usage_amount) AS line_item_usage_amount
FROM table
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
AND line_item_usage_account_id = '544934960'
AND line_item_usage_type LIKE '%BoxUsage%'
AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
GROUP BY line_item_resource_id

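What happens here is that each distinct line_item_resource_id ends up as one row in the result. Since each distinct value of that column appears on multiple rows in the data, we need to tell Athena how to flatten all those rows into one; otherwise Athena can't know how to produce the result you want.
The way to do that is through aggregate functions. They take multiple values and produce a single value. When a column is numeric you usually want to sum the values of the group, which is what I do above for the line_item_usage_amount column. I know this data set, and I know that this is a column you want to sum.
For other columns that contain string data, such as pricing_term, how you flatten them depends on what you want. Most of the other columns have only one value per resource ID, for example pricing_term and product_servicename. Athena has a function called ARBITRARY that does what it says: it picks an arbitrary (non-null) value from the group. When all values are the same it doesn't matter which one gets picked. This function is also the best choice when there are multiple values but you don't care which one you get.
In some cases a column can have multiple values within a group and those values have a natural order, for example line_item_usage_start_date and line_item_usage_end_date. In that case you can use MIN and MAX to get the first or last value.
For cases where there are multiple values and you want to pick a specific one, there are many aggregate functions to choose from, and you can make fairly sophisticated selections.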
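
As one illustration of picking a specific value, here is a small sketch using MAX_BY, which Athena inherits from Presto: MAX_BY(x, y) returns the value of x from the row where y is largest. The sample rows and the latest_pricing_term alias are made up for the example and are not from the real data set; they are inlined with VALUES so the query can be run on its own.

```sql
-- Hypothetical sample rows, inlined so that no real table is needed.
WITH sample AS (
  SELECT *
  FROM (
    VALUES
      ('i-aaa', TIMESTAMP '2020-06-29 00:00:00', 'OnDemand', 1.0),
      ('i-aaa', TIMESTAMP '2020-06-29 01:00:00', 'Reserved', 1.0),
      ('i-bbb', TIMESTAMP '2020-06-29 00:00:00', 'OnDemand', 0.5)
  ) AS t (line_item_resource_id, line_item_usage_start_date, pricing_term, line_item_usage_amount)
)
SELECT
  line_item_resource_id,
  -- pricing_term taken from the row with the latest start date in each group
  MAX_BY(pricing_term, line_item_usage_start_date) AS latest_pricing_term,
  SUM(line_item_usage_amount) AS line_item_usage_amount
FROM sample
GROUP BY line_item_resource_id
```

With these sample rows, the i-aaa group has two pricing terms and the one attached to the later start date is the one that gets picked. MIN_BY works the same way for the earliest row.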

2021-07-24

bcs8qyzn2#

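That's not possible as such, but you can group by line_item_resource_id and apply aggregate functions such as max or count to the other columns. That gives you each distinct line_item_resource_id together with, for example, the max of the other columns. If instead you only want the rows whose line_item_resource_id occurs exactly once, you can do it like this: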

with temporary_table as (
SELECT line_item_resource_id, count( line_item_resource_id ) as cnt
FROM table
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
AND line_item_usage_account_id = '544934960'
AND line_item_usage_type LIKE '%BoxUsage%'
AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
GROUP BY line_item_resource_id
) SELECT * FROM table
 WHERE line_item_resource_id in 
(select line_item_resource_id from temporary_table where cnt = 1)
AND product_servicename = 'Amazon Elastic Compute Cloud'
AND line_item_usage_account_id = '544934960'
AND line_item_usage_type LIKE '%BoxUsage%'
AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'

2021-07-24

sr4lhrrt3#

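I want to suggest another solution here, using ROW_NUMBER(). I'll show the basic version; ROW_NUMBER() of course offers more possibilities (such as an ORDER BY inside the partition).
With this solution you don't need to write an aggregate function in front of every column, you can simply use *. That makes the query much shorter and clearer.
So you can do: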

WITH tmp_table AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY line_item_resource_id) rn
  FROM table
  WHERE product_servicename = 'Amazon Elastic Compute Cloud'
    AND line_item_usage_account_id = '544934960'
    AND line_item_usage_type LIKE '%BoxUsage%'
    AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
)    
SELECT *
FROM tmp_table
WHERE rn = 1
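
Without an ORDER BY inside the window, which row gets rn = 1 in each partition is not defined. A minimal sketch of the "order by inside the partition" variant mentioned above follows; it keeps the most recent row per resource. The sample rows are made up for the example and inlined with VALUES, and the choice of ordering by line_item_usage_start_date descending is an assumption about what "the row you want" means.

```sql
-- Hypothetical sample rows, inlined so that no real table is needed.
WITH sample AS (
  SELECT *
  FROM (
    VALUES
      ('i-aaa', TIMESTAMP '2020-06-29 00:00:00', 'OnDemand'),
      ('i-aaa', TIMESTAMP '2020-06-29 01:00:00', 'Reserved'),
      ('i-bbb', TIMESTAMP '2020-06-29 00:00:00', 'OnDemand')
  ) AS t (line_item_resource_id, line_item_usage_start_date, pricing_term)
),
ranked AS (
  SELECT
    *,
    -- Number the rows of each resource, newest first
    ROW_NUMBER() OVER (
      PARTITION BY line_item_resource_id
      ORDER BY line_item_usage_start_date DESC
    ) AS rn
  FROM sample
)
-- Listing the columns explicitly keeps the helper column rn out of the result
SELECT line_item_resource_id, line_item_usage_start_date, pricing_term
FROM ranked
WHERE rn = 1
```

Note that with SELECT * in the outer query, as in the basic version, the rn column is part of the output too.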

2021-07-24
