I have a set of rows which looks like this, ordered by First_Date
Id First_Date Second_Date
ABC12345 2022-05-18 2022-08-16
ABC12345 2022-05-23 2022-08-29
ABC12345 2022-06-06 2022-11-23
ABC12345 2022-06-16 2022-11-23
ABC12345 2022-08-16 2022-11-23
ABC12345 2022-11-01 2022-11-29
After always keeping the first rows, I want to include all following rows where the first_date does not fall between (exclusive) the two dates in this first record.
Once there is a row which is not excluded (example above would be fifth record), then it would check the next row against this until the end.
So I only want to keep dates where there is no overlap at all - same dates are OK, just not between.
So I want the final result to look like this
Id First_Date Second_Date Flag
ABC12345 2022-05-18 2022-08-16 Keep
ABC12345 2022-05-23 2022-08-29 Exclude
ABC12345 2022-06-06 2022-11-23 Exclude
ABC12345 2022-06-16 2022-11-23 Exclude
ABC12345 2022-08-16 2022-11-23 Keep
ABC12345 2022-11-01 2022-11-29 Exclude
Is there any way of doing this. I've tried LAG function, but can't get it to work.
1条答案
按热度按时间pqwbnv8z1#
I'll assume that if multiple Id values are present, each should be handled separately.
The following will use several Common Table Expressions (CTEs) to (1) Define a well-defined row ordering, (2) Define a relationship between each row and the first eligible "following" row, and finally (3) Chain the linked rows together using a recursive CTE that starts with the first row for each Id (RowNum = 1) and follows the chain of linked rows.
Following that, you can either select the chained rows directly to get the filtered data, or join them back to the original data to generate the Keep/Exclude flag as you posted in your requested results.
The OrderedData step might be skipped if you have no duplicate First_Date for a given Id. In that case, you could replace most occurrences or RowNum with First_Date. If there can be duplicate First_Date values, the next question is which record to chose - the one with the earliest Second_Date or the latest?
For efficiency you should probably place an index on Data(Id, First_Date) at a minimum.
See this db<>fiddle for a working demo with additional test data.
Sample results:
| Id | First_Date | Second_Date | RowNum | Flag |
| ------------ | ------------ | ------------ | ------------ | ------------ |
| ABC12345 | 2022-05-18 | 2022-08-16 | 1 | Keep |
| ABC12345 | 2022-05-23 | 2022-08-29 | 2 | Exclude |
| ABC12345 | 2022-06-06 | 2022-11-23 | 3 | Exclude |
| ABC12345 | 2022-06-16 | 2022-11-23 | 4 | Exclude |
| ABC12345 | 2022-08-16 | 2022-11-23 | 5 | Keep |
| ABC12345 | 2022-11-01 | 2022-11-29 | 6 | Exclude |
| DEF00001 | 2022-01-01 | 2022-01-02 | 1 | Keep |
| DEF00001 | 2022-01-02 | 2022-01-05 | 2 | Keep |
| DEF00001 | 2022-01-03 | 2022-01-06 | 3 | Exclude |
| DEF00001 | 2022-01-04 | 2022-01-07 | 4 | Exclude |
| DEF00001 | 2022-01-05 | 2022-01-08 | 5 | Keep |
| DEF00001 | 2022-01-10 | 2022-01-20 | 6 | Keep |
| DEF00001 | 2022-01-15 | 2022-01-25 | 7 | Exclude |
| DEF00001 | 2022-01-22 | 2022-01-31 | 8 | Keep |