SQL Server: How to make this query faster: TOP N over (TOP 1 in the past + all in the future)

fnatzsnv · posted on 2023-02-28
Answers (3) | Views (130)

I have a table that contains some data. There are some attributes that are used to 'group' the rows, and there is a date column. I'm trying to improve the performance of a query that basically gets the TOP X rows from this table, where it returns only the latest row per group for data in the past, plus all future rows.

My data:
| Id | GroupingId | Date | Whatever |
| ------------ | ------------ | ------------ | ------------ |
| 1 | 1 | 2023-01-01 | Value1 |
| 2 | 1 | 2023-01-02 | Value2 |
| 3 | 2 | 2023-01-03 | Value3 |
| 4 | 1 | 2040-01-01 | Value1 |

My query:

WITH cte AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY GroupingId
                ORDER BY Date DESC) 
            as rnk
        FROM myData
        WHERE Date <= SYSUTCDATETIME()
        UNION
        SELECT *, 1 as rnk
        FROM myData
        WHERE Date > SYSUTCDATETIME()
    )
    SELECT *
    FROM cte
    WHERE rnk = 1
    ORDER BY GroupingId
    OFFSET 0 ROWS Fetch NEXT 100 ROWS ONLY

My expected output:

| Id | GroupingId | Date | Whatever |
| ------------ | ------------ | ------------ | ------------ |
| 2 | 1 | 2023-01-02 | Value2 |
| 4 | 1 | 2040-01-01 | Value1 |
| 3 | 2 | 2023-01-03 | Value3 |

Note that the above is a huge simplification of the actual data structure, but this seems to be the culprit. Basically, due to the ORDER BY (and possibly filtering conditions that might show up), SQL Server needs to 'materialize' (i.e. read from disk) the whole first part of the CTE. I'm looking for a way to modify the query/index strategy/indexed view so that the query can be executed without loading everything into memory, just reading the top X rows from an index. In essence, this query needs to execute almost instantly, even if the table has many millions of rows.
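(If you want to reproduce this at scale, here is a rough sketch of a data generator; the volumes and distribution are hypothetical, loosely matching the skew described under "Some other useful info" below: many past rows and only a few future rows per GroupingId.)

-- Rough repro sketch with hypothetical volumes: ~1M rows, ~10k groups,
-- ~98% of each group's rows in the past and ~2% in the future.
;WITH n AS (
    SELECT TOP (1000000)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO myData (Id, GroupingId, [Date], Whatever)
SELECT
    i,
    i % 10000 + 1,                                   -- GroupingId
    CASE WHEN i % 100 < 98                           -- mostly past dates
         THEN DATEADD(DAY, -(i % 3650) - 1, SYSUTCDATETIME())
         ELSE DATEADD(DAY,  (i % 365)  + 1, SYSUTCDATETIME())
    END,
    CONCAT('Value', i % 5 + 1)
FROM n;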

EDIT 1:

Execution plan for the above here

Index that is already there:

CREATE NONCLUSTERED INDEX [idx_groupingId_date_test] ON [dbo].[myData]
(
    [GroupingId] ASC,
    [Date] DESC
)

Some other useful info:

Yes, the data is distributed so that there is much more data in the past. E.g. for every GroupingId there might be a couple of rows in the future but dozens or hundreds in the past.

Removing the union makes the query super fast; adding it is the recent change that caused the performance problem I'm trying to fix.
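For reference, the union-free baseline being compared against is presumably just the first branch of the CTE (latest past row per group, with the future rows dropped); a sketch:

WITH cte AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY GroupingId
            ORDER BY [Date] DESC) AS rnk
    FROM myData
    WHERE [Date] <= SYSUTCDATETIME()
)
SELECT *
FROM cte
WHERE rnk = 1
ORDER BY GroupingId
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY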


nc1teljy1#

I don't believe that the MERGE JOIN is a result of the UNION ALL. Usually, when the MERGE JOIN is the most expensive part, it is because one of the sides has a big residual, and in this case you have up to 100 more rows in the upper SELECT than in the lower one.

I would try something like

CREATE TABLE myData (
    Id int,
    GroupingId  INT,
    [Date] DATE,
    Whatever varchar(100)
)
INSERT INTO myData(Id,GroupingId, Date,Whatever)
          SELECT 1 Id, 1    GroupingId, '2023-01-01' Date, 'Value1' Whatever
union all select 2,    1              , '2023-01-02'     , 'Value2'
union all select 3,    2              , '2023-01-03'     , 'Value3'
union all select 4,    1              , '2040-01-01'     , 'Value1'

WITH cte AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY GroupingId
            ORDER BY Date DESC) 
        as rnk
    FROM myData
    WHERE Date <= SYSUTCDATETIME()
)
SELECT *
FROM cte
WHERE rnk = 1
UNION ALL 
SELECT *, 1 as rnk
FROM myData
WHERE Date > SYSUTCDATETIME()
ORDER BY GroupingId
OFFSET 0 ROWS Fetch NEXT 100 ROWS ONLY

This should remove the MERGE JOIN and leave you with a CONCATENATION instead.

[Edit 1] Thank you for posting the query plan. Try creating the suggested index:

CREATE NONCLUSTERED INDEX [idx_date_test] ON [dbo].[myData]
(
    [Date] DESC
)
INCLUDE (
    groupingId, 
    id
)

fhity93d2#

You can try the following approach: instead of computing a ROW_NUMBER over the whole table, first gather the "last" row of every GroupingId, then keep only one row per group in case there are multiple entries. As you only have n GroupingIds plus some baggage, you should get to your result faster.

For this query, only ([GroupingId],[Date]) needs a combined index, which must be extended for your actual data.

CREATE TABLE myData
    ([Id] int, [GroupingId] int, [Date] Datetime, [Whatever] varchar(6)
  , Primary key(Id), INDEX IX_MyTable4 NONCLUSTERED ([GroupingId],[Date]) )
;
    
INSERT INTO myData
    ([Id], [GroupingId], [Date], [Whatever])
VALUES
    (1, 1, '2023-01-01', 'Value1'),
    (2, 1, '2023-01-02', 'Value2'),
    (3, 2, '2023-01-03', 'Value3'),
    (4, 1, '2040-01-01', 'Value1')
;
WITH CTE AS (SELECT m1.* FROM myData m1 
  JOIN (SELECT MAX([Date]) max_date, [GroupingId] FROM myData WHERE [Date]  <= SYSUTCDATETIME() GROUP BY [GroupingId]) m2
  ON m1.[GroupingId] = m2.[GroupingId] AND m1.[Date] = m2.max_date WHERE m1.[Date]  <= SYSUTCDATETIME()),
CTE2 as (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY GroupingId
                ORDER BY Date DESC) 
            as rnk
        FROM CTE)
SELECT * FROM CTE2 WHERE rnk = 1
UNION ALL
        SELECT *,1
        FROM myData
        WHERE Date > SYSUTCDATETIME()

ORDER BY GroupingId
OFFSET 0 ROWS Fetch NEXT 100 ROWS ONLY
| Id | GroupingId | Date | Whatever | rnk |
| ------------ | ------------ | ------------ | ------------ | ------------ |
| 2 | 1 | 2023-01-02 00:00:00.000 | Value2 | 1 |
| 4 | 1 | 2040-01-01 00:00:00.000 | Value1 | 1 |
| 3 | 2 | 2023-01-03 00:00:00.000 | Value3 | 1 |

fiddle


vuv7lop33#

OK, some comments got me thinking and put me on the right track. I figured I basically need a way to get rid of the union/merge join in the query plan and have nested loops instead, so the query can be executed sequentially all the way.

SELECT TOP 100 * 
FROM (
    SELECT *
    FROM mydata o
    WHERE (o.date > SYSUTCDATETIME() OR (
        o.id = (SELECT TOP 1 id FROM mydata s WHERE s.groupingid = o.groupingid AND s.date < SYSUTCDATETIME() ORDER BY DATE DESC)))
) r
ORDER BY r.groupingid

This gets me from about 1 minute for the original query (or the other answers so far) to below 100 ms. New query plan: https://www.brentozar.com/pastetheplan/?id=r1sQdhM0s
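For comparison, the same per-group "latest past row" test can also be written with NOT EXISTS instead of a correlated TOP 1. This is only a sketch and not verified against the real plan; note that on ties for the latest past Date it returns every tied row rather than a single id.

-- Sketch only: NOT EXISTS variant of the correlated TOP 1 above.
SELECT TOP (100) d.*
FROM mydata AS d
WHERE d.[Date] > SYSUTCDATETIME()
   OR (d.[Date] <= SYSUTCDATETIME()
       AND NOT EXISTS (
            SELECT 1
            FROM mydata AS newer
            WHERE newer.GroupingId = d.GroupingId
              AND newer.[Date] <= SYSUTCDATETIME()
              AND newer.[Date] > d.[Date]))
ORDER BY d.groupingid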
