I'm trying to write a SQL query to get multiple columns from a joined table, and I've seen there are two approaches, either to have multiple columns in the GROUPBY, or use an aggregator on the results, but I'm not sure which of these is the best approach, or if there is a performance benefit to either?
As an example, I have user and purchases tables:
dbo.[User]
| Id (PK) | Name |
| ------------ | ------------ |
| 1 | Name1 |
| 2 | Name2 |
dbo.Purchases
Id (PK) | UserId (FK) | Quantity |
---|---|---|
1 | 1 | 2 |
2 | 1 | 3 |
3 | 2 | 5 |
I would like to get the UserId, Name and total quantity for all the users, which I think can be done in two ways:
Approach A:
SELECT u.Id, u.[Name], SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id, u.[Name]
Approach B:
SELECT u.Id, MAX(u.[Name]), SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id
Both of these seem strange to me though, because it should be clear from User.Id
being a primary key that if I am grouping by User.Id
then any other fields from [User]
must have a unique value, so I'm confused why something like the following doesn't work:
--- INCORRECT ---
SELECT u.Id, u.[Name], SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id
Which approach (A) or (B) - or a different approach - am I best to use?
4条答案
按热度按时间dwbf0jvd1#
Another way which might be more natural is the following:
This groups purchases separately and then joins them back. There's no performance hit usually since sql server can figure out and LEFT JOIN "through" the User table. Ie, it doesn't materialize whole purchase table unnecessarily.
Similar to this is the OUTER APPLY (cross apply should give same results in this case too i think) approach:
These should generate pretty identical plans in most cases, but of course the best is to test it
oiopk7p52#
Try with the below code to get the desired result set.
ddarikpa3#
The only difference between approach A and approach B is that there is no need to include the name column under the GROUP BY section as you are commanding the MAX value for name under approach B.
I have just attempted to replicate your results (using slightly different notation) and either approach can work.
Approach A
select t1.id, t1.name, sum(t2.quantity) from table1 as t1 inner join table2 as t2 on t1.id=t2.id group by t1.id, t1.name order by t1.id;
Approach B
select t1.id, max(t1.name) as name, sum(t2.quantity) from table1 as t1 inner join table2 as t2 on t1.id=t2.id group by t1.id order by t1.id;
For Approach B, the only change I would make is to specify
max(t1.name) as name
, otherwise the column that displays the name will be called max instead of name.4si2a6ki4#
Combined answer based on the suggestions to run the queries to see how they perform.
Note: I ran this for my specific example, so the timings and query plan weightings apply to my use-case, but were helpful to me to understand the differences. (Timings are just approximate - they vary by 100-200ms between executions.)
Approach A
Time: 00:00:00.290
Query Plan:
Approach B
Time: 00:00:00.247
Query Plan:
Approach C (suggested by @siggemannen)
Time: 00:00:00.135
Query Plan:
Approach D (suggested by @siggemannen)
Time: 00:00:01.707
Query Plan:
Conclusion Approaches A, B and C seem to have very similar performance and execution. Approach D is much slower for my specific case. I'm supposing is due to creating a whole table inside the query? The key difference is the "Index Spool" taking up 82% of the execution time there.
I don't know much about query plans to identify any key differences in performance between A, B and C. I suppose this is likely to come down to relative sizes of tables - in my case not much difference at all.