At my new job where I am NOT a SQL expert, I am trying to make a sql query in SSMS. I am trying to find all transactions that have 'insurance' or 'ins' in the Description field. So my code looks something like this (I have simplified/obfuscated it)
SELECT
a.Customer,
b.Amount,
c.Description
FROM
table1 as a
INNER JOIN table2 as b ON a.CustKey = b.AmtKey
INNER JOIN table 3 as c ON a.CustKey = c.DescKey
WHERE
(
(c.Description LIKE '%insurance%' OR c.Description LIKE '%ins%')
AND
(
b.Description NOT LIKE '%domains%' AND b.Description NOT LIKE '%vitamins%' AND
b.Description NOT LIKE '%Gainsville%' AND b.Description NOT LIKE '%circle k%'
b.Description NOT LIKE '%insane%'
))
So the Description field is an nvarchar(250) data type. It generally looks something like this "ALFA MUTUAL INS EFT PYMTS PPD SMITH JOHN" or "VISA/DEBIT-CARD POS STATE FARM INSURANCE 8009566310 IL 8675309 0102" but there are many other different ways that it looks. And some even have locations in the Description.
My problem is that '%ins%' is pulling in many results that are not relevant. For example, it will pull in all transactions that were made at a business called "Insane Cookies". So I add NOT LIKE '%insane%'. But than it will include ALL transactions that were made by John Smith Atkins who purchased insurance but also made purchases at Circle K gas stations. I do NOT want to exclude the 'insurance' purchases, but I do want to exclude his gas station purchases. So I add NOT LIKE '%circle k%'. This has made me create a very long list of NOT LIKE statements (about 48 NOT LIKE '%value%' statements)
Because the Description field is essentially a string, I cannot just use 'ins' or 'insurance'. I have to use the wildcard characters. My query actually works but I have sooooo many NOT LIKE statements and each NOT LIKE statements has two wildcard characters that it is bogging down the system. They (DBAs) are instructing me to modify my query. I knew this would happen. I tried to ask for help because so many NOT LIKE statements + wildcards = poor performance, but "apparently" everyone who could help is too busy. So I'm here.
This is my first time posting/asking for help on here. So I apologize if I overstated or understated my predicament. Feel free to ask and I will provide clarification. I'd appreciate ANY help.
TLDR: what are alternatives methods to a long list of NOT LIKE + wildcards?
Edit: change "or" in NOT LIKE area to "and". I have it as "and" in my query.
2条答案
按热度按时间rryofs0p1#
Parsing out text to get what you're after will never be 100% correct. If someone enters
INS
meaningImmigration & Naturalization Services
you're going to get a false positive you can't ever manage. The best you can do is "Default" an option; but rely on user intervention to set a flag/value to know it's insurance related.Knowing this there may be a way to make your query more performant using SET based processing; avoiding a Not like.
Essentially get all the possible INS records in a set; then exclude all the ones that match your not like criteria but using a like instead.
In a perfect world you'd have some sort of record indicator to indicate this was insurance related and key off it instead. Are you sure there's not another way to identify insurance related records besides scraping text?
Alternatively you could add a computed (persisted?) column to the database which evaluates the text for the desired string excluding the undesired strings in a case expression pushing the analysis/performance to the save of a record vs the read. This field would simply contain a 1 if insurance related 0 if not and have an evaluation using a case expression something like:
but again no matter what, you'll never hit 100% success.
sz81bmfz2#
To get rid of the
NOT LIKE
list, you need to think about word boundaries, as @siggemannen also pointed out:You could just put spaces around these words to eliminate false matches, and also around the original values not to exclude when they are the first or last words:
('
[space]' + c.Description + '
[space]' LIKE '%
[space]insurance
[space]%' OR '
[space]' + c.Description + '
[space]' LIKE '%
[space]ins
[space]%')
A more complicated but also potentially more performant way of expressing the equivalent is:
(c.Description LIKE 'insurance
[space]%' OR c.Description LIKE 'ins
[space]%' OR '
[space]' + c.Description + '
[space]' LIKE '%
[space]insurance
[space]%' OR '
[space]' + c.Description + '
[space]' LIKE '%
[space]ins
[space]%')
First, regarding
c.Description LIKE '%insurance%' OR c.Description LIKE '%ins%'
- as it isOR
, the first is redundant, asinsurance
containsins
. Thereforeins
remains, but no abbreviation can contain any character before this word, so there should be a space:'
[space]' + c.Description LIKE '%
[space]ins%'
a potentially more performant equivalent is
c.Description LIKE 'ins%' OR '
[space]' + c.Description LIKE '%
[space]ins%'
.This doesn't eliminate all
NOT LIKE
s, but many like "domains" or "Gainsville" will go away and only "insane" and the likes remain - however this approach should be avoided, as there are a ton of other words that you will have to exclude in the future. It is more reliable to include ALL possible abbreviations in point 1, as there are less variants.