SQL Server - Remove all non-printable ASCII characters

deyfvvtc  于 2023-06-21  发布在  SQL Server
关注(0)|答案(4)|浏览(134)

We recently migrated from SQL Server 2012 to SQL Server 2014 and all our FOR XML code started throwing errors about non-printable ASCII characters. I wrote this horrible function to remove non-printable ASCII characters as a quick fix. I want to replace it with something cleaner. Is there a way to do this?

ALTER FUNCTION [dbo].[remove_non_printable_chars]
(@input_string nvarchar(max))
RETURNS nvarchar(max)
BEGIN
    RETURN
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(@input_string,
        CHAR(1), ''),CHAR(2), ''),CHAR(3), ''),CHAR(4), ''),CHAR(5), ''),CHAR(6), ''),CHAR(7), ''),CHAR(8), ''),CHAR(9), ''),CHAR(10), ''),
        CHAR(11), ''),CHAR(12), ''),CHAR(13), ''),CHAR(14), ''),CHAR(15), ''),CHAR(16), ''),CHAR(17), ''),CHAR(18), ''),CHAR(19), ''),CHAR(20), ''),
        CHAR(21), ''),CHAR(22), ''),CHAR(23), ''),CHAR(24), ''),CHAR(25), ''),CHAR(26), ''),CHAR(27), ''),CHAR(28), ''),CHAR(29), ''),CHAR(30), ''),
        CHAR(31), ''), NCHAR(0) COLLATE Latin1_General_100_BIN2, '')
END

Here's the FOR XML code that broke. (I did not write this. It was already in the code base).

SELECT @HTMLTableData =
(
    SELECT  HTMLRows 
    FROM (
        SELECT N'<tr>' + HTMLRow + N'</tr>' AS HTMLRows 
        FROM @HTMLRowData
    ) mi            
    FOR XML PATH(''), TYPE
).value('/', 'NVARCHAR(MAX)')
6rvt4ljy

6rvt4ljy1#

Another Option.

This function will replace control characters and correct any residual repeating spaces. For example Jane Smith{13}was here will not be returned as Jane Smithwas here , but rather Jane Smith was here

CREATE FUNCTION [dbo].[udf-Str-Strip-Control](@S varchar(max))
Returns varchar(max)
Begin
    ;with  cte1(N) As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
           cte2(C) As (Select Top (32) Char(Row_Number() over (Order By (Select NULL))-1) From cte1 a,cte1 b)
    Select @S = Replace(@S,C,' ')
     From  cte2

    Return ltrim(rtrim(replace(replace(replace(@S,' ','†‡'),'‡†',''),'†‡',' ')))
End
--Select [dbo].[udf-Str-Strip-Control]('Michael        '+char(13)+char(10)+'LastName')  --Returns: Michael LastName
olmpazwi

olmpazwi2#

In-line version:

create function [dbo].[remove_non_printable_chars] (@input_string nvarchar(max))
returns table with schemabinding as return (
  select 
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(@input_string collate latin1_general_100_bin2,
        char(1), ''),char(2), ''),char(3), ''),char(4), ''),char(5), ''),char(6), ''),char(7), ''),char(8), ''),char(9), ''),char(10), ''),
        char(11), ''),char(12), ''),char(13), ''),char(14), ''),char(15), ''),char(16), ''),char(17), ''),char(18), ''),char(19), ''),char(20), ''),
        char(21), ''),char(22), ''),char(23), ''),char(24), ''),char(25), ''),char(26), ''),char(27), ''),char(28), ''),char(29), ''),char(30), ''),
        char(31), ''), char(0) , '') 
     as clean_string
);
go

And use it like so:

select c.clean_string
from dbo.remove_non_printable_chars(@dirtystring) c

or

select ...
  , c.clean_string
from t
  cross apply dbo.remove_non_printable_chars(t.dirty_string) c

Reference:

lrl1mhuk

lrl1mhuk3#

Here is a similar answer of the previous inline-table valued function answer ( https://stackoverflow.com/a/43148897/2864740 ) for this question. The primary change is this uses a patindex guard first and is much faster when only a small percentage of the rows contain characters that need to be replaced.

Thus the two huge improvements from the original scalar function:

  • Use an inline table-valued function. This is much faster as it allows SQL Server to directly in-line the code in the query plan. I try to avoid scalar functions in queries designed to scale, as a normal scalar function can be a huge performance sap (even with schemabinding) and prevent optimizations such as parallelism.
  • Use patindex for an initial guard check. This changes the number of characters SQL must exmine when there are no control characters to replace from O(n * num_replace_calls) to ~O(n) . Since most of the data data (in my case) contains no control characters, this can result in a huge performance increase.
-- Only accepts VARCHAR(8000) to avoid a conversion to VARCHAR(MAX);
-- use the suitable input type, which might even be NVARCHAR(MAX).
CREATE FUNCTION DropControlCharactersTv(@str VARCHAR(8000))
RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT CleanedString = CASE
    -- No-op.
    WHEN @str IS NULL or @str = '' THEN @str
    -- If any of the non-wanted characters are found then go through the string and replace each occurrence of every character.
    -- This guard significantly improves the performance when very few strings need to be corrected.
    WHEN PATINDEX (
        '%[' + CHAR(0) + CHAR(1) + CHAR(2) + CHAR(3) + CHAR(4) + CHAR(5) + CHAR(6) + CHAR(7) + CHAR(8) + CHAR(11) + CHAR(12) + CHAR(14) + CHAR(15) + CHAR(16) + CHAR(17) + CHAR(18) + CHAR(19) + CHAR(20) + CHAR(21) + CHAR(22) + CHAR(23) + CHAR(24) + CHAR(25) + CHAR(26) + CHAR(27) + CHAR(28) + CHAR(29) + CHAR(30) + CHAR(31) + CHAR(127) + ']%',
        @str COLLATE Latin1_General_BIN) <> 0 THEN
        -- Replace, nested.
        -- See https://www.sqlservercentral.com/forums/topic/how-to-remove-characters-char0-to-char31
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        @str
        ,CHAR(0),'') COLLATE Latin1_General_BIN
        ,CHAR(1),'') COLLATE Latin1_General_BIN
        ,CHAR(2),'') COLLATE Latin1_General_BIN
        ,CHAR(3),'') COLLATE Latin1_General_BIN
        ,CHAR(4),'') COLLATE Latin1_General_BIN
        ,CHAR(5),'') COLLATE Latin1_General_BIN
        ,CHAR(6),'') COLLATE Latin1_General_BIN
        ,CHAR(7),'') COLLATE Latin1_General_BIN
        ,CHAR(8),'') COLLATE Latin1_General_BIN
        ,CHAR(9),'') COLLATE Latin1_General_BIN
        ,CHAR(10),'') COLLATE Latin1_General_BIN
        ,CHAR(11),'') COLLATE Latin1_General_BIN
        ,CHAR(12),'') COLLATE Latin1_General_BIN
        ,CHAR(13),'') COLLATE Latin1_General_BIN
        ,CHAR(14),'') COLLATE Latin1_General_BIN
        ,CHAR(15),'') COLLATE Latin1_General_BIN
        ,CHAR(16),'') COLLATE Latin1_General_BIN
        ,CHAR(17),'') COLLATE Latin1_General_BIN
        ,CHAR(18),'') COLLATE Latin1_General_BIN
        ,CHAR(19),'') COLLATE Latin1_General_BIN
        ,CHAR(20),'') COLLATE Latin1_General_BIN
        ,CHAR(21),'') COLLATE Latin1_General_BIN
        ,CHAR(22),'') COLLATE Latin1_General_BIN
        ,CHAR(23),'') COLLATE Latin1_General_BIN
        ,CHAR(24),'') COLLATE Latin1_General_BIN
        ,CHAR(25),'') COLLATE Latin1_General_BIN
        ,CHAR(26),'') COLLATE Latin1_General_BIN
        ,CHAR(27),'') COLLATE Latin1_General_BIN
        ,CHAR(28),'') COLLATE Latin1_General_BIN
        ,CHAR(29),'') COLLATE Latin1_General_BIN
        ,CHAR(30),'') COLLATE Latin1_General_BIN
        ,CHAR(31),'') COLLATE Latin1_General_BIN
        ,CHAR(127),'') COLLATE Latin1_General_BIN
    -- Did not match pattern: inherently valid
    ELSE @str END

In a query:

select
    Plucker.CleanedString
from Goose d
cross apply DropControlCharactersTv(d.Turkey) as Plucker
mlnl4t2r

mlnl4t2r4#

Just extending the previous answers a little bit

Below the white-list characters, all the others chars will be cleared

[ !`"#$%&'()\*+,\-\./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\]^``\\_abcdefghijklmnopqrstuvwxyz{|}~µº°¡¢£¤¥¦§¨©ª«¬­®¯±²³´¶·¸¹»¼½¾¿×÷ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]

I know it's ugly but it works.

CREATE FUNCTION [dbo].REPLACE_UNPRINT_CHARS(@VarString nvarchar(256))  
RETURNS nvarchar(256) 
AS    
BEGIN  
    RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@VarString, CHAR(0), ''), CHAR(1), ''), CHAR(2), ''), CHAR(3), ''), CHAR(4), ''), CHAR(5), ''), CHAR(6), ''), CHAR(7), ''), CHAR(8), ''), CHAR(9), ''), CHAR(10), ''), CHAR(11), ''), CHAR(12), ''), CHAR(13), ''), CHAR(14), ''), CHAR(15), ''), CHAR(16), ''), CHAR(17), ''), CHAR(18), ''), CHAR(19), ''), CHAR(20), ''), CHAR(21), ''), CHAR(22), ''), CHAR(23), ''), CHAR(24), ''), CHAR(25), ''), CHAR(26), ''), CHAR(27), ''), CHAR(28), ''), CHAR(29), ''), CHAR(30), ''), CHAR(31), ''), CHAR(127), ''), CHAR(128), ''), CHAR(129), ''), CHAR(130), ''), CHAR(131), ''), CHAR(132), ''), CHAR(133), ''), CHAR(134), ''), CHAR(135), ''), CHAR(136), ''), CHAR(137), ''), CHAR(138), ''), CHAR(139), ''), CHAR(140), ''), CHAR(141), ''), CHAR(142), ''), CHAR(143), ''), CHAR(144), ''), CHAR(145), ''), CHAR(146), ''), CHAR(147), ''), CHAR(148), ''), CHAR(149), ''), CHAR(150), ''), CHAR(151), ''), CHAR(152), ''), CHAR(153), ''), CHAR(154), ''), CHAR(155), ''), CHAR(156), ''), CHAR(157), ''), CHAR(158), ''), CHAR(159), ''), CHAR(160), ''); 
END;

Use for data clean up

UPDATE [dnName].[dbo].[tableName] 
SET FieldDirtyData= dbo.REPLACE_UNPRINT_CHARS(FieldDirtyData)
WHERE PATINDEX('%['+CHAR(1)+CHAR(2)+CHAR(3)+CHAR(4)+CHAR(5)+CHAR(6)+CHAR(7)+CHAR(8)+CHAR(9)+CHAR(10)+CHAR(11)+CHAR(12)+
CHAR(13)+CHAR(14)+CHAR(15)+CHAR(16)+CHAR(17)+CHAR(18)+CHAR(19)+CHAR(20)+
CHAR(21)+CHAR(22)+CHAR(23)+CHAR(24)+CHAR(25)+CHAR(26)+CHAR(27)+CHAR(28)+CHAR(29)+CHAR(30)+CHAR(31)+CHAR(127)+
CHAR(128)+CHAR(129)+CHAR(130)+CHAR(131)+CHAR(132)+CHAR(133)+CHAR(134)+CHAR(135)+CHAR(136)+CHAR(137)+CHAR(138)+
CHAR(139)+CHAR(140)+CHAR(141)+CHAR(142)+CHAR(143)+CHAR(144)+CHAR(145)+CHAR(146)+CHAR(147)+CHAR(148)+CHAR(149)+CHAR(150)+
CHAR(151)+CHAR(152)+CHAR(153)+CHAR(154)+CHAR(155)+CHAR(156)+CHAR(157)+CHAR(158)+CHAR(159)+CHAR(160)+']%', FieldDirtyData) <> 0

Adjust your datatype (nvarchar or varchar + max) as required

If you want to add more chars to clear use "select ASCII('char to remove here')" MSSQL command in order to get the ASCII code of the char and put it inside the replace instruction

i.g SELECT ASCII('¢') returns 162

so add one more "REPLACE(" after "RETURN" and "CHAR(162), '')" at the end of line but before the ";" sign.

相关问题