SQL Server Compare two strings with dash issue

cbjzeqam  于 2023-11-16  发布在  其他
关注(0)|答案(3)|浏览(123)

From the T-SQL code below you can see that dash - on the 3rd place of a nvarchar string does not equal dash on 3rd place of a varchar string.

if substring(N'BV‐D',3,1) <> substring('BV‐D',3,1)
    select substring(N'BV‐D',3,1) as nchar
            , substring('BV‐D',3,1) as char
            , ASCII(substring(N'BV‐D',3,1)) as ncode
            , ASCII(substring('BV‐D',3,1)) as code
            , 'Non-Equal' as status

output:

ncharcharncodecodestatus
-4545Non-Equal

If I will retype '-' or add N, or remove N in both cases strings will become equal. ASCII code is 45 in both cases.

Why are they not equal?

Initially it was not typed manually, but copied from different sources.

bqjvbblv

bqjvbblv1#

Because they aren't equal. You have a string which contains non-ASCII charcters that you're putting into a varchar and then (implicitly) converting that value back to an nvarchar . By that point it's too late and information has been lost.

Let's look at these values from a varbinary perspective:

SELECT CONVERT(varbinary(10),N'BV‐D') AS NvarcharBinary,
       CONVERT(varbinary(10),'BV‐D') AS VarcharBinary,
       CONVERT(varbinary(10),CONVERT(nvarchar(5),'BV‐D')) AS VarcharNvarcharBinary;

This gives the results:

NvarcharBinaryVarcharBinaryVarcharNvarcharBinary
0x42005600102044000x42562D440x420056002D004400

Notice the third character for the first string has the representation 0x1020 . Such a character cannot be represented in a varchar , and when you attempt to store it in one it is implicitly converted to a similar character, 0x2D . Then when you convert that back to an nvarchar you get 0x2D00 . (Note, not all characters are converted to similar characters. Those that aren't are replaced with ? .)

Unsurprisingly, 0x2D00 and 0x1020 are not the same, and so the strings are not equal.

sqyvllje

sqyvllje2#

This is the ASCII-45 HYPHEN-MINUS. May I comment that this is the more common hyphen. It's what appears if I press the hyphen key in my US keyboard.

This is the Unicode-8208 HYPHEN, well beyong the ASCII Range. It is what your 'BV‐D' string contains. If you don't use a unicode string with N , sql server converts it to the similar ASCII-45 character. Still, they differ, and that's the reason you get an inequality.

km0tfn4u

km0tfn4u3#

That's not the dash you assume it is, it's a different Unicode character. The query is using the wrong function though, ASCII , which does not return the actual Unicode code.

If you try

select UNICODE(N'‐') as Hyphen,
       UNICODE(N'-') as Minus,
       ASCII('‐') As Ascii_Dash,
       ASCII(N'-') as Ascii_Minus

You'll get

Hyphen  Minus   Ascii_Dash  Ascii_Minus
8208    45      45          45

8208 is the Hyphen character . That's the character generated by eg HTML's &dash; , used to separate words and syllable's. It's not the same as the minus character on our keyboards.

There are a lot of dash characters used in typography and hence Unicode. Full width, half width, breaking or non-breaking etc.

Fixing this

You can use TRANSLATE to map individual characters :

select translate(N'BV‐D', N'‐', N'-')

This trick is used in SQL Server to normalize equivalent Unicode characters because T-SQL doesn't have a NORMALIZE function. This wouldn't help here anyway, because HYPHEN-MINUS and HYPHEN aren't considered equivalent. You'd have to add all possible options though, as the example in the linked question shows :

select translate(last_name, 'éêëèàäçïîìôöòûù', 'eeeeaaciiiooouu') from employees

相关问题