SQL Server How to delete data based on time interval

gg58donl  于 2023-03-28  发布在  其他
关注(0)|答案(4)|浏览(154)

I have a table in my database.The table has a field VehId(Int) and trackdt(Datetime)

My table has 3000 billion rows(yes, that's right 3000 billion). so i want to delete old data. but i want to delete data based on time interval.

I want to delete records of every 20 seconds for each vehicle.

Below is the table I have

VehId  Trackdt
1      2017-05-20 00:00:30.000
2      2017-05-20 00:00:32.000         
2      2017-05-20 00:00:42.000
1      2017-05-20 00:00:40.000
2      2017-05-20 00:00:52.000
1      2017-05-20 00:00:50.000
1      2017-05-20 00:01:00.000
2      2017-05-20 00:01:02.000
1      2017-05-20 00:01:10.000
1      2017-05-20 00:01:20.000
2      2017-05-20 00:01:12.000
1      2017-05-20 00:01:30.000
2      2017-05-20 00:01:22.000
2      2017-05-20 00:01:32.000

After delete data should be like below

VehId  TRackdt
1      2017-05-20 00:00:30.000
2      2017-05-20 00:00:32.000
1      2017-05-20 00:01:00.000
2      2017-05-20 00:01:02.000
1      2017-05-20 00:01:30.000
2      2017-05-20 00:01:32.000

I tried below query but it is taking too much time

ALTER PROCEDURE [dbo].[DELETEINTERVALDATA]

@FROMDATE DATETIME,
@TODATE DATETIME,
@INTERVAL INT,
@FLAG INT

AS
BEGIN   
SET NOCOUNT ON;

DECLARE @TRACKDT DATETIME
DECLARE @I INT =1
DECLARE @V INT =1
DECLARE @COUNT INT
DECLARE @VCOUNT INT
DECLARE @STARTDATE DATETIME = ''
DECLARE @VEHID INT
DECLARE @TIMEDIFF INT

 CREATE TABLE #TEMPVEHICLE
 (
    SNO INT IDENTITY(1,1),
    VEHID INT
 )

 CREATE TABLE #TEMPLOG
 (
    SNO INT IDENTITY(1,1),
    TRACKDT DATETIME
 )

 IF (@FLAG = 1 )
    BEGIN

        INSERT INTO #TEMPVEHICLE (VEHID) SELECT VEHID FROM VEHICLEMASTER ORDER BY VEHID         

        SELECT @VCOUNT = COUNT(SNO) FROM #TEMPVEHICLE 

        WHILE (@V <= @VCOUNT) 
            BEGIN               

                SELECT @VEHID = VEHID FROM #TEMPVEHICLE WHERE SNO = @V

                INSERT INTO #TEMPLOG(TRACKDT) SELECT TRACKDT 
                FROM TRACKINGLOG WITH(NOLOCK)
                WHERE TRACKDT BETWEEN @FROMDATE AND @TODATE AND VEHID = @VEHID
                ORDER BY TRACKDT ASC

                SELECT @COUNT = COUNT(SNO) FROM #TEMPLOG 

                WHILE (@I <= @COUNT)
                    BEGIN                           
                        SELECT @TRACKDT=TRACKDT FROM #TEMPLOG WHERE SNO = @I                            
                        IF (@I = 1)
                            BEGIN
                                SELECT @STARTDATE = @TRACKDT
                            END
                        ELSE
                            BEGIN
                                SELECT @TIMEDIFF = DATEDIFF(SECOND,@STARTDATE,@TRACKDT)
                                IF @TIMEDIFF <= 20
                                    BEGIN
                                        DELETE FROM TRACKINGLOG WHERE TRACKDT = @TRACKDT AND VEHID = @VEHID                                         
                                    END
                                ELSE
                                    BEGIN
                                        SELECT @STARTDATE = @TRACKDT
                                    END
                            END
                        SELECT  @I = @I + 1
                    END
                TRUNCATE TABLE #TEMPLOG
                SELECT  @V = @V + 1,@STARTDATE= '',@I=1
            END
        DROP TABLE #TEMPLOG
        DROP TABLE #TEMPVEHICLE
    END
   END

How can I write query which delete data based on time interval and should be quick??

Thanks in advance

6qqygrtg

6qqygrtg1#

Your desired output corresponds to a 30 second interval.

You can retrieve previous values in a result set using window functions. For example, LAG(trackdt,1) will return the previous value. LEAD the next. FIRST_VALUE will return the first value in a set.

The query :

select * ,
    FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0
from @mytable

Will return the first trackdt value per vehicle, when the rows are ordered by trackdt .

This way we can calculate the interval between the current and first value in the partition with datediff . The resulting query is a bit ugly:

select * ,
    FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0,
    datediff(s,FIRST_VALUE(trackdt) over (partition by vehid order by trackdt),
               trackdt) as interval
from @mytable

Dividing by 30 would give us the interval bucket to which each row belongs.

Calculating the remainder though, would return 0 for the first row in every 30-second bucket:

select * ,
    FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0,
    datediff(s,FIRST_VALUE(trackdt) over (partition by vehid order by trackdt),
               trackdt) %30 as remainder
from @mytable

1   2017-05-20 00:01:00.000 2017-05-20 00:00:30.000 0
1   2017-05-20 00:01:10.000 2017-05-20 00:00:30.000 10
1   2017-05-20 00:01:20.000 2017-05-20 00:00:30.000 20
1   2017-05-20 00:01:30.000 2017-05-20 00:00:30.000 0
2   2017-05-20 00:00:32.000 2017-05-20 00:00:32.000 0

We can use one or more CTEs to tidy-up this query and perform the deletion:

with start_times as
(
    select * ,
        FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0
    from @mytable
),
intervals as
(
    select * ,
        datediff(s,t0,trackdt) %30 as rem
    from start_times
)
delete 
from intervals
where rem<>0

This query depends on having values every 30 seconds.

In general, you can use the ROW_NUMBER() function to identify records within an interval bucket and simply pick the first row in a bucket. Since we want to retain the first row, we pick anything that has a ROW_NUMBER>1:

with start_times as
(
    select * ,
        FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0
    from #mytable
),
intervals as
(
    select * ,
        datediff(s,t0,trackdt) /30 as interval
    from start_times
),
ordered as
(
    select *, 
           ROW_NUMBER() over(partition by vehid,interval order by trackdt) row_num 
    from intervals
)
select vehid,trackdt
from ordered
where row_num>1

Even though this query is fast, I wouldn't use it on a 3 trillion row table. Windowing often results in spooling - temporary results are stored in tempdb to allow window calculations. It would be better to select the IDs of the rows that have to be deleted, insert them into a temporary table and then use a DELETE with a join clause:

DELETE HugeTable
From HugeTable
    INNER JOIN TempTable on TempTable.ID=HugeTable.ID

Even so, you probably want to batch deletions. In this case you could use the NTILE function to calculate a batch number for each row between 1 and N:

select ID,NTILE(100) over(order by vehid,trackdt) as batch_number
from ordered
where row_num=1

This will calculate a batch number between 1 and 100. You can store this in the temporary table and delete one batch of rows at a time.

bis0qfac

bis0qfac2#

You want to delete a record if the time was 20 seconds or less after the previous time. But if the previous time gets deleted you than you want to compare with the time before that etc. You will need some kind of loping for that. A recursive table expression or regular looping.

Declare @myTable table (vehid int, trackdt datetime)
insert into @mytable values
(1,      '2017-05-20 00:00:30.000'),
(2,      '2017-05-20 00:00:32.000'),         
(2,      '2017-05-20 00:00:42.000'),
(1,      '2017-05-20 00:00:40.000'),
(2,      '2017-05-20 00:00:52.000'),
(1,      '2017-05-20 00:00:50.000'),
(1,      '2017-05-20 00:01:00.000'),
(2,      '2017-05-20 00:01:02.000'),
(1,      '2017-05-20 00:01:10.000'),
(1,      '2017-05-20 00:01:20.000'),
(2,      '2017-05-20 00:01:12.000'),
(1,      '2017-05-20 00:01:30.000'),
(2,      '2017-05-20 00:01:22.000'),
(2,      '2017-05-20 00:01:32.000')

--set @@rowcount to 1
select 1

while @@ROWCOUNT > 0
begin
    DELETE T1
    from @mytable t1

    --previous time
    outer apply (select top 1 trackdt from  @mytable where vehid = t1.vehid and trackdt < t1.trackdt order by 1 desc)t2
    --previous time before that
    outer apply (select top 1 trackdt from  @mytable where vehid = t1.vehid and trackdt < t2.trackdt order by 1 desc)t3

    --previous time was less or equal to 20 seconds
    where DATEDIFF(second,t2.trackdt,t1.trackdt)<=20

    --previous time before that was more than 20 seconds or there is no time before
    and (DATEDIFF(second,t3.trackdt,t2.trackdt)>20 or t3.trackdt is null)

end

select * from @mytable
pgvzfuti

pgvzfuti3#

With 30TBs of records it will take a week or more even with the fastest method.

You could try

  1. Select and export records at intervals into CSVs - multiple batches
  2. Create a new table for the new records
  3. Import the CSV files with something like pg_bulkload
  4. Delete the old table and rename the new one

I'd first do the first 3 steps on one 10-million-record CSV file and calculate how long it will take to move the whole thing this way

yeotifhr

yeotifhr4#

Please read about Parameter sniffing . Assuming your stored procedure is slow BUT correct, I have updated your SP. Kindly Try :

ALTER PROCEDURE [dbo].[DELETEINTERVALDATA]

@FROMDATE1 DATETIME,
@TODATE1 DATETIME,
@INTERVAL1 INT,
@FLAG1 INT

AS
BEGIN   
SET NOCOUNT ON;

DECLARE @TRACKDT DATETIME
DECLARE @I INT =1
DECLARE @V INT =1
DECLARE @COUNT INT
DECLARE @VCOUNT INT
DECLARE @STARTDATE DATETIME = ''
DECLARE @VEHID INT
DECLARE @TIMEDIFF INT

DECLARE @FROMDATE DATETIME = @FROMDATE1
DECLARE @TODATE DATETIME = @TODATE1
DECLARE @INTERVAL INT = @INTERVAL1
DECLARE @FLAG INT = @FLAG1

 CREATE TABLE #TEMPVEHICLE
 (
    SNO INT IDENTITY(1,1),
    VEHID INT
 )

 CREATE TABLE #TEMPLOG
 (
    SNO INT IDENTITY(1,1),
    TRACKDT DATETIME
 )

 IF (@FLAG = 1 )
    BEGIN

        INSERT INTO #TEMPVEHICLE (VEHID) SELECT VEHID FROM VEHICLEMASTER ORDER BY VEHID         

        SELECT @VCOUNT = COUNT(SNO) FROM #TEMPVEHICLE 

        WHILE (@V <= @VCOUNT) 
            BEGIN               

                SELECT @VEHID = VEHID FROM #TEMPVEHICLE WHERE SNO = @V

                INSERT INTO #TEMPLOG(TRACKDT) SELECT TRACKDT 
                FROM TRACKINGLOG WITH(NOLOCK)
                WHERE TRACKDT BETWEEN @FROMDATE AND @TODATE AND VEHID = @VEHID
                ORDER BY TRACKDT ASC

                SELECT @COUNT = COUNT(SNO) FROM #TEMPLOG 

                WHILE (@I <= @COUNT)
                    BEGIN                           
                        SELECT @TRACKDT=TRACKDT FROM #TEMPLOG WHERE SNO = @I                            
                        IF (@I = 1)
                            BEGIN
                                SELECT @STARTDATE = @TRACKDT
                            END
                        ELSE
                            BEGIN
                                SELECT @TIMEDIFF = DATEDIFF(SECOND,@STARTDATE,@TRACKDT)
                                IF @TIMEDIFF <= 20
                                    BEGIN
                                        DELETE FROM TRACKINGLOG WHERE TRACKDT = @TRACKDT AND VEHID = @VEHID                                         
                                    END
                                ELSE
                                    BEGIN
                                        SELECT @STARTDATE = @TRACKDT
                                    END
                            END
                        SELECT  @I = @I + 1
                    END
                TRUNCATE TABLE #TEMPLOG
                SELECT  @V = @V + 1,@STARTDATE= '',@I=1
            END
        DROP TABLE #TEMPLOG
        DROP TABLE #TEMPVEHICLE
    END
   END

相关问题