SQL Server: How to make this query faster: TOP N over TOP 1 in the past + all in the future

fnatzsnv · asked 2023-02-28

I have a table that contains some data. There are some attributes that are used to 'group' the rows, and there is a date column. I'm trying to improve the performance of a query that gets the TOP X rows from this table, returning, per group, only the latest row in the past plus all rows in the future.

My data:
| Id | GroupingId | Date | Whatever |
| ------------ | ------------ | ------------ | ------------ |
| 1 | 1 | 2023-01-01 | Value1 |
| 2 | 1 | 2023-01-02 | Value2 |
| 3 | 2 | 2023-01-03 | Value3 |
| 4 | 1 | 2040-01-01 | Value1 |

My query:

WITH cte AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY GroupingId
                ORDER BY Date DESC) 
            as rnk
        FROM myData
        WHERE Date <= SYSUTCDATETIME()
        UNION
        SELECT *, 1 as rnk
        FROM myData
        WHERE Date > SYSUTCDATETIME()
    )
    SELECT *
    FROM cte
    WHERE rnk = 1
    ORDER BY GroupingId
    OFFSET 0 ROWS Fetch NEXT 100 ROWS ONLY

My expected output:

| Id | GroupingId | Date | Whatever |
| ------------ | ------------ | ------------ | ------------ |
| 2 | 1 | 2023-01-02 | Value2 |
| 4 | 1 | 2040-01-01 | Value1 |
| 3 | 2 | 2023-01-03 | Value3 |

Note that the above is a huge simplification of the actual data structure, but this query shape seems to be the culprit. Basically, due to the ORDER BY (and possible filtering conditions that might show up), SQL Server needs to 'materialize' (i.e. read from disk) the whole first part of the CTE. I'm looking for a way to modify the query/index strategy/indexed view so that the query can be executed without loading everything into memory, just reading the top X rows from an index. In essence, this query needs to execute instantly, even if the table has many millions of rows.
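For reference, one way to confirm how much of the table the plan actually reads (assuming you can run the query interactively, e.g. from SSMS):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the query above, then compare the 'logical reads' reported for myData
-- against the ~100 rows that are actually returned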

EDIT 1:

Execution plan for the above here

Index that is already there:

CREATE NONCLUSTERED INDEX [idx_groupingId_date_test] ON [dbo].[myData]
(
    [GroupingId] ASC,
    [Date] DESC
)

Some other useful info:

Yes, the data is distributed so that there is much more data in the past. E.g. for every GroupingId there might be a couple of rows in the future but dozens/hundreds in the past.

Removing the union makes the query super fast; adding it is actually the recent change that caused the performance problem I'm trying to fix.
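For comparison, the fast union-less variant is essentially just the historical half of the CTE (a sketch based on the simplified query above, not the real schema):

WITH cte AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY GroupingId
            ORDER BY Date DESC) AS rnk
    FROM myData
    WHERE Date <= SYSUTCDATETIME()
)
SELECT *
FROM cte
WHERE rnk = 1
ORDER BY GroupingId
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY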

nc1teljy1#

I don't believe that the MERGE JOIN is a result of the UNION ALL. Usually, when a MERGE JOIN is the most expensive part, it is because one of the sides has a big residual, and in this case you have up to 100 more rows in the upper SELECT than in the lower one.

I would try something like

CREATE TABLE myData (
    Id int,
    GroupingId  INT,
    [Date] DATE,
    Whatever varchar(100)
)
INSERT INTO myData(Id, GroupingId, Date, Whatever)
          SELECT 1 Id, 1 GroupingId, '2023-01-01' Date, 'Value1' Whatever
union all select 2,    1          , '2023-01-02'     , 'Value2'
union all select 3,    2          , '2023-01-03'     , 'Value3'
union all select 4,    1          , '2040-01-01'     , 'Value1'

WITH cte AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY GroupingId
            ORDER BY Date DESC) 
        as rnk
    FROM myData
    WHERE Date <= SYSUTCDATETIME()
)
SELECT *
FROM cte
WHERE rnk = 1
UNION ALL 
SELECT *, 1 as rnk
FROM myData
WHERE Date > SYSUTCDATETIME()
ORDER BY GroupingId
OFFSET 0 ROWS Fetch NEXT 100 ROWS ONLY

This should remove the MERGE JOIN and leave you with CONCATENATION

[Edit 1] Thank you for posting the query plan. Try creating the suggested index:

CREATE NONCLUSTERED INDEX [idx_date_test] ON [dbo].[myData]
(
    [Date] DESC
)
INCLUDE (
    [GroupingId],
    [Id]
)

fhity93d2#

You can try the following approach: instead of computing a row_number over the whole table, first gather the "last" row of every GroupingId, then keep only one row per grouping in case there are multiple entries. Since you only have n GroupingIds plus some baggage, you should get to your result faster.

For this query, only ([GroupingId],[Date]) needs a combined index, which must be extended for your actual data.

CREATE TABLE myData
    ([Id] int, [GroupingId] int, [Date] Datetime, [Whatever] varchar(6)
  , Primary key(Id), INDEX IX_MyTable4 NONCLUSTERED ([GroupingId],[Date]) )
;
    
INSERT INTO myData
    ([Id], [GroupingId], [Date], [Whatever])
VALUES
    (1, 1, '2023-01-01', 'Value1'),
    (2, 1, '2023-01-02', 'Value2'),
    (3, 2, '2023-01-03', 'Value3'),
    (4, 1, '2040-01-01', 'Value1')
;
WITH CTE AS (
    SELECT m1.*
    FROM myData m1
    JOIN (
        SELECT MAX([Date]) max_date, [GroupingId]
        FROM myData
        WHERE [Date] <= SYSUTCDATETIME()
        GROUP BY [GroupingId]
    ) m2
      ON m1.[GroupingId] = m2.[GroupingId] AND m1.[Date] = m2.max_date
    WHERE m1.[Date] <= SYSUTCDATETIME()
),
CTE2 AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY GroupingId
            ORDER BY Date DESC) AS rnk
    FROM CTE
)
SELECT * FROM CTE2 WHERE rnk = 1
UNION ALL
SELECT *, 1
FROM myData
WHERE Date > SYSUTCDATETIME()
ORDER BY GroupingId
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY
| Id | GroupingId | Date | Whatever | rnk |
| ------------ | ------------ | ------------ | ------------ | ------------ |
| 2 | 1 | 2023-01-02 00:00:00.000 | Value2 | 1 |
| 4 | 1 | 2040-01-01 00:00:00.000 | Value1 | 1 |
| 3 | 2 | 2023-01-03 00:00:00.000 | Value3 | 1 |

fiddle

vuv7lop33#

OK, some comments got me thinking and put me on the right track. I figured I basically need a way to get rid of the union/merge join in the query plan and get nested loops instead, so the query can be executed sequentially all the way.

SELECT TOP 100 * 
FROM (
    SELECT *
    FROM mydata o
    WHERE (o.date > SYSUTCDATETIME() OR (
        o.id = (SELECT TOP 1 id FROM mydata s WHERE s.groupingid = o.groupingid AND s.date < SYSUTCDATETIME() ORDER BY DATE DESC)))
) r
ORDER BY r.groupingid

This gets me from 1 minute for the original query (and the other answers so far) to below 100 ms. New query plan: https://www.brentozar.com/pastetheplan/?id=r1sQdhM0s
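An equivalent formulation using OUTER APPLY may give the same nested-loops shape; this is only a sketch against the simplified schema above and would need to be verified against the real plan:

SELECT TOP (100) o.*
FROM mydata o
OUTER APPLY (
    -- latest past row for this group, if any
    SELECT TOP (1) s.id
    FROM mydata s
    WHERE s.groupingid = o.groupingid
      AND s.date < SYSUTCDATETIME()
    ORDER BY s.date DESC
) latest
WHERE o.date > SYSUTCDATETIME()
   OR o.id = latest.id
ORDER BY o.groupingid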
