SQL Server Best query to get multiple columns from a GroupBy of two JOIN-ed tables?

vuktfyat 于 2023-06-21 发布在其他

关注(0)|答案(4)|浏览(143)

I'm trying to write a SQL query to get multiple columns from a joined table, and I've seen there are two approaches, either to have multiple columns in the GROUPBY, or use an aggregator on the results, but I'm not sure which of these is the best approach, or if there is a performance benefit to either?

As an example, I have user and purchases tables:

dbo.[User]
| Id (PK) | Name |
| ------------ | ------------ |
| 1 | Name1 |
| 2 | Name2 |

dbo.Purchases

Id (PK)	UserId (FK)	Quantity
1	1	2
2	1	3
3	2	5

I would like to get the UserId, Name and total quantity for all the users, which I think can be done in two ways:

Approach A:

SELECT u.Id, u.[Name], SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id, u.[Name]

Approach B:

SELECT u.Id, MAX(u.[Name]), SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id

Both of these seem strange to me though, because it should be clear from User.Id being a primary key that if I am grouping by User.Id then any other fields from [User] must have a unique value, so I'm confused why something like the following doesn't work:

--- INCORRECT ---
SELECT u.Id, u.[Name], SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id

Which approach (A) or (B) - or a different approach - am I best to use?

sql-server

来源：https://stackoverflow.com/questions/76472762/best-query-to-get-multiple-columns-from-a-groupby-of-two-join-ed-tables

4条答案

按热度按时间

dwbf0jvd1#

Another way which might be more natural is the following:

SELECT u.Id, u.[Name], totalQuantity
FROM dbo.[User] u
LEFT JOIN (
   select  SUM(p.Quantity) as TotalQuantity
   , UserId
   FROM   dbo.Purchases p 
   GROUP BY UserId
  ) p
ON p.UserId = u.Id

This groups purchases separately and then joins them back. There's no performance hit usually since sql server can figure out and LEFT JOIN "through" the User table. Ie, it doesn't materialize whole purchase table unnecessarily.

Similar to this is the OUTER APPLY (cross apply should give same results in this case too i think) approach:

SELECT u.Id, u.[Name], totalQuantity
FROM dbo.[User] u
OUTER APPLY (
   select  SUM(p.Quantity) as TotalQuantity
   FROM   dbo.Purchases p 
   WHERE p.UserId = u.Id
  ) p

These should generate pretty identical plans in most cases, but of course the best is to test it

赞(0）回复(0）举报 2023-06-21

oiopk7p52#

Try with the below code to get the desired result set.

Create table [user]
(
ID int,
[Name] varchar(20)
)
insert into [user]
Select 1, 'Name1'
union all 
Select 2, 'Name2'

Create table [Purchases]
(
ID int,
userID int,
Quantity int
)
insert into [Purchases]
Select 1,1, 2
union all 
Select 2,1,3
Union all 
Select 3,2,5

--Select Query
Select p.userID, u.[Name], SUM(p.Quantity) AS TotalQuantity
from purchases p
JOIN [user] u
on p.userid = u.id
GROUP BY p.userid,u.Name

OUTPUT:
userID  Name    TotalQuantity
-------------------------
1       Name1       5
2       Name2       5

赞(0）回复(0）举报 2023-06-21

ddarikpa3#

The only difference between approach A and approach B is that there is no need to include the name column under the GROUP BY section as you are commanding the MAX value for name under approach B.

I have just attempted to replicate your results (using slightly different notation) and either approach can work.

Approach A

select t1.id, t1.name, sum(t2.quantity) from table1 as t1 inner join table2 as t2 on t1.id=t2.id group by t1.id, t1.name order by t1.id;

Approach B

select t1.id, max(t1.name) as name, sum(t2.quantity) from table1 as t1 inner join table2 as t2 on t1.id=t2.id group by t1.id order by t1.id;

For Approach B, the only change I would make is to specify max(t1.name) as name , otherwise the column that displays the name will be called max instead of name.

赞(0）回复(0）举报 2023-06-21

4si2a6ki4#

Combined answer based on the suggestions to run the queries to see how they perform.

Note: I ran this for my specific example, so the timings and query plan weightings apply to my use-case, but were helpful to me to understand the differences. (Timings are just approximate - they vary by 100-200ms between executions.)

Approach A

SELECT uAct.Id, uAct.PrimaryEmail, uAct.LatestActivity FROM
    (SELECT u.Id, u.PrimaryEmail, Max(ta.ActivityTime) as LatestActivity
        FROM dbo.[User] u
        LEFT JOIN dbo.TimedActivity ta ON ta.UserId = u.Id
        GROUP BY u.Id, u.PrimaryEmail) uAct
    WHERE [someFilter])

Time: 00:00:00.290

Query Plan:

Approach B

SELECT uAct.Id, PrimaryEmail, uAct.LatestActivity FROM
    (SELECT u.Id, MAX(u.PrimaryEmail) as PrimaryEmail, Max(ta.ActivityTime) as LatestActivity
        FROM dbo.[User] u
        LEFT JOIN dbo.TimedActivity ta ON ta.UserId = u.Id
        GROUP BY u.Id) uAct
    WHERE [someFilter])

Time: 00:00:00.247

Query Plan:

Approach C (suggested by @siggemannen)

SELECT u.Id, u.PrimaryEmail, uAct.LatestActivity FROM
    dbo.[User] u
    LEFT JOIN(
        SELECT MAX(ta.ActivityTime) as LatestActivity
        , UserId
        FROM dbo.TimedActivity ta
        GROUP BY UserId
    ) uAct
    ON uAct.UserId = u.Id
    WHERE [someFilter])

Time: 00:00:00.135

Query Plan:

Approach D (suggested by @siggemannen)

SELECT u.Id, u.PrimaryEmail, uAct.LatestActivity FROM
    dbo.[User] u
    OUTER APPLY (
        SELECT MAX(ta.ActivityTime) as LatestActivity
        FROM dbo.TimedActivity ta
        WHERE ta.UserId = u.Id
    ) uAct
    WHERE [someFilter])

Time: 00:00:01.707

Query Plan:

Conclusion Approaches A, B and C seem to have very similar performance and execution. Approach D is much slower for my specific case. I'm supposing is due to creating a whole table inside the query? The key difference is the "Index Spool" taking up 82% of the execution time there.

I don't know much about query plans to identify any key differences in performance between A, B and C. I suppose this is likely to come down to relative sizes of tables - in my case not much difference at all.

赞(0）回复(0）举报 2023-06-21

我来回答

SQL Server Best query to get multiple columns from a GroupBy of two JOIN-ed tables?

4条答案

相关问题

热门标签

最新问答