I have a table that contains some data. Some attributes are used to 'group' the rows, and there is a date column. I'm trying to improve the performance of a query that gets the TOP X rows from this table, returning, per group, only the latest row from the past plus all future rows.
My data:
| Id | GroupingId | Date | Whatever |
| ------------ | ------------ | ------------ | ------------ |
| 1 | 1 | 2023-01-01 | Value1 |
| 2 | 1 | 2023-01-02 | Value2 |
| 3 | 2 | 2023-01-03 | Value3 |
| 4 | 1 | 2040-01-01 | Value1 |
My query:
```sql
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY GroupingId
               ORDER BY Date DESC) AS rnk
    FROM myData
    WHERE Date <= SYSUTCDATETIME()

    UNION

    SELECT *, 1 AS rnk
    FROM myData
    WHERE Date > SYSUTCDATETIME()
)
SELECT *
FROM cte
WHERE rnk = 1
ORDER BY GroupingId
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY
```
My expected output:
| Id | GroupingId | Date | Whatever |
| ------------ | ------------ | ------------ | ------------ |
| 2 | 1 | 2023-01-02 | Value2 |
| 4 | 1 | 2040-01-01 | Value1 |
| 3 | 2 | 2023-01-03 | Value3 |
Note that the above is a huge simplification of the actual data structure, but this seems to be the culprit. Basically, due to the ORDER BY (and possibly filtering conditions that might show up), SQL Server needs to 'materialize' (i.e. read from disk) the whole first part of the CTE. I'm looking for a way to modify the query/index strategy/indexed view so that the query can be executed without loading everything into memory, reading just the top X rows from an index. In essence, this query needs to execute instantly, even if the table has many millions of rows.
EDIT 1:
Execution plan for the above here
Index that is already there:
```sql
CREATE NONCLUSTERED INDEX [idx_groupingId_date_test] ON [dbo].[myData]
(
    [GroupingId] ASC,
    [Date] DESC
)
```
Some other useful info:
Yes, the data is distributed so that much more data is expected in the past: for every GroupingId there might be a couple of rows in the future but dozens/hundreds in the past.
Removing the union makes the query super fast. This is actually the recent change that caused the performance problem that I'm trying to fix.
3 Answers

Answer 1 (nc1teljy):
I don't believe that the `MERGE JOIN` is a result of the `UNION ALL`. Usually, when the `MERGE JOIN` is the most expensive part, it is because one of the sides has a big residual, and in this case you have up to 100 more rows in the upper SELECT than in the lower one. I would try something like:

This should remove the `MERGE JOIN` and leave you with `CONCATENATION`.
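The answer's code snippet was not captured in this page. Since the two branches select disjoint date ranges (`Date <= now` vs. `Date > now`), the usual way to replace a distinct-union operator with a plain `Concatenation` is to switch `UNION` to `UNION ALL` — a sketch of what the suggested rewrite likely looked like:

```sql
-- Hedged reconstruction: UNION ALL is safe here because the two
-- branches cannot overlap, and it avoids the distinct-union step.
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY GroupingId
               ORDER BY Date DESC) AS rnk
    FROM myData
    WHERE Date <= SYSUTCDATETIME()

    UNION ALL

    SELECT *, 1 AS rnk
    FROM myData
    WHERE Date > SYSUTCDATETIME()
)
SELECT *
FROM cte
WHERE rnk = 1
ORDER BY GroupingId
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY;
```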
[Edit 1] Thank you for posting the query plan. Try creating the suggested index:
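The suggested index itself is missing from the scraped page; the following is only a plausible guess at what a covering version of the existing index would look like (the `INCLUDE` column list is hypothetical and would need to match the real table's columns):

```sql
-- Hypothetical covering index; the actual index suggested in the
-- execution plan is not shown in the post.
CREATE NONCLUSTERED INDEX [idx_groupingId_date_covering] ON [dbo].[myData]
(
    [GroupingId] ASC,
    [Date] DESC
)
INCLUDE ([Whatever]);
```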
Answer 2 (fhity93d):
You can try the following approach: instead of computing a ROW_NUMBER over the whole table, first gather the "last" row of every GroupingId, then take only one row per grouping in case there are multiple entries. As you only have n GroupingIds plus some baggage, you should get to your result faster.

For this query, only ([GroupingId], [Date]) need a combined index, which must be extended for your actual data.
fiddle
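The answer's query survives only in the fiddle link; a sketch of the approach it describes (find the latest past `Date` per `GroupingId` first, join back, then append the future rows) might look like this. Note that ties on `Date` within a group would still return more than one row and would need an extra tie-breaker:

```sql
-- Sketch of the described approach, not the answer's exact query.
WITH lastPast AS (
    SELECT GroupingId, MAX([Date]) AS LastDate
    FROM myData
    WHERE [Date] <= SYSUTCDATETIME()
    GROUP BY GroupingId
)
SELECT d.*
FROM lastPast lp
JOIN myData d
  ON d.GroupingId = lp.GroupingId
 AND d.[Date] = lp.LastDate

UNION ALL

SELECT *
FROM myData
WHERE [Date] > SYSUTCDATETIME()
ORDER BY GroupingId
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY;
```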
Answer 3 (vuv7lop3):
Ok, some comments got me thinking and on the right track. I figured I basically need a way to get rid of the UNION/merge join from the query plan and have nested loops instead, so the query can be executed sequentially all the way.
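The final query itself is only visible in the linked plan; a hedged reconstruction of the nested-loops shape described above would drive from the list of groups and seek into the `(GroupingId, Date DESC)` index once per group via `CROSS APPLY` (the `DISTINCT` derived table stands in for whatever enumerates the GroupingIds in the real schema):

```sql
-- Hypothetical reconstruction: one index seek per group for the
-- latest past row, one range seek per group for the future rows.
SELECT TOP (100) ca.*
FROM (SELECT DISTINCT GroupingId FROM myData) g
CROSS APPLY (
    SELECT past.*
    FROM (
        -- latest past row for this group
        SELECT TOP (1) *
        FROM myData d
        WHERE d.GroupingId = g.GroupingId
          AND d.[Date] <= SYSUTCDATETIME()
        ORDER BY d.[Date] DESC
    ) AS past

    UNION ALL

    -- all future rows for this group
    SELECT *
    FROM myData d
    WHERE d.GroupingId = g.GroupingId
      AND d.[Date] > SYSUTCDATETIME()
) AS ca
ORDER BY g.GroupingId;
```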
This gets me from 1 minute in the original query (or the other answers so far) to below 100ms. New query plan: https://www.brentozar.com/pastetheplan/?id=r1sQdhM0s