SQL Server SQL: Sanatize & Parse String in SQL view

92vpleto  于 2023-06-21  发布在  其他
关注(0)|答案(2)|浏览(116)

I need to build a view that cleans and splits up some unsanatized data.

The Source-Field in question contains First and Middle-Names in the same field. I need those Names divided into the First name and the rest remaining string should get shown in a separate column for the middle names. If that first Name is surrounded with double quotes that shouldn't get split up.

At the same time, someone somehow managed to add control chars like CR, LF but NBSP or TAB into those fields, so i need to remove those first.

Here is the data i'm currently testing with:

create table #Names(
 [FirstName] [varchar](120)
)
insert into #Names (FirstName)
values
('"Le Firstname" MiddleName SomeName'),
(char(13)+char(10)+'Copy Paste'+char(9)+'King'),
('Paul'),
('King "foo bar" "the 3rd"')

And this is the result I'm supposed to get out of this:

FirstName       MiddleNames
--------------- ----------------------
"Le Firstname"  MiddleName SomeName
Copy            Paste King
Paul
King            "foo bar" "the 3rd"

My personal goal is to be performant (there are ~100k Records in that table) and to have the code at least to some degree readable for my future self in 2 years.

Oh and i'd prefer to keep the extra Objects at a minimum.

ty in advance, I'm tinkering with split_string & cross apply at the moment but I'm not sure how to get the "-enclosed strings into one at the moment

nxagd54h

nxagd54h1#

It might need some extra tweaking, but see this:

Tweaking like

  • Replacing CHAR(13) and/or CHAR(10) but a space before doing anything else.
  • A situation like only 1 " or more then 2 " might need more testing.
SELECT
   FirstName,
   SUBSTRING(FirstNameORG,LEN(FirstName)+1,LEN(FirstNameORG)) as MiddleNames
FROM (
   SELECT
      CASE WHEN firstSpace < FirstDoubleQuoot
           THEN SUBSTRING(FirstName,1,firstSpace-1)
           WHEN secondDoublequoot > firstDoublequoot
           THEN SUBSTRING(FirstName,firstDoublequoot-1,secondDoublequoot-firstDoublequoot+2)
           WHEN firstSpace>1
           THEN SUBSTRING(FirstName,1,firstSpace-1)
           ELSE FirstName
      END as FirstName,
     FirstName as FirstNameORG
   FROM (
      SELECT
         FirstName,
         CHARINDEX(' ',FirstName) firstSpace,
         CHARINDEX('"',FirstName) firstDoubleQuoot,
         CHARINDEX('"',FirstName,CHARINDEX('"',FirstName)+2) secondDoubleQuoot
      FROM #Names
   ) x
) y;

output:

FirstNameMiddleNames
"Le Firstname"  MiddleName SomeName
Copy  Paste King
Paul
King  "foo bar" "the 3rd"

see: DBFIDDLE

mzsu5hc0

mzsu5hc02#

I got to the solution thanks to @Luuk

  • I fogot i only needed to group up the "-Groups when the field starts with them
  • I was afraid i had to sanatize the FirstName-Field in each and every parsing, But the subquery solved that one for me.

Here is the Script I'm adapting for my productive solution now:

select ID ADR_ID, First_Name, trim(RIGHT(ORGFirstName, LEN(ORGFirstName) - len(First_Name))) as Middle_Name from
(
    SELECT
        Qua.ID ID,
        Qua.FirstName ORGFirstName,
        case
            when (FstQuote = 1) AND (SndQuote <> 0) then
                LEFT(Qua.FirstName, SndQuote)
            when FstBlank <> 0 then
                LEFT(Qua.FirstName, FstBlank - 1)
            else
                Qua.FirstName
        end as First_Name
    FROM (
        select
            ID,
            FirstName,
            CHARINDEX('"', San.FirstName) FstQuote,
            CHARINDEX('"', San.FirstName, 2) SndQuote,
            CHARINDEX(' ', San.FirstName) FstBlank
        from (
            SELECT
                ID,
                TRIM(REPLACE(REPLACE(REPLACE(REPLACE(FirstName, char(10), ''), char(13), ''), char(9), ' '), '  ', ' ')) FirstName
            FROM #Names) as San --Sanatized
        ) as Qua -- Quantified
) as FirstNamesParsed

There is a limit on how much you can save the users from faulty data. So i decided to ignore lonely quotes and if a user decides to copy-paste 15 tabstops into between two middle names, it will show as well.

相关问题