How can rows with non-ASCII characters be returned using SQL Server?
If you can show how to do it for one column, that would be great.
I am doing something like this now, but it is not working:
select *
from Staging.APARMRE1 as ar
where ar.Line like '%[^!-~ ]%'
For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:
- The identity field for that record. (This will allow the whole record to be reviewed with another query.)
- The column name
- The text with the invalid character
Id  | FieldName | InvalidText       |
----+-----------+-------------------+
25  | LastName  | Solís             |
56  | FirstName | François          |
100 | Address1  | 123 Ümlaut street |
Invalid characters would be any outside the range of SPACE (32 decimal) through ~ (127 decimal).
9 Answers
Answer 1
Here is a solution for the single column search using PATINDEX.
It also displays the StartPosition, InvalidCharacter and ASCII code.
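A sketch along those lines (the binary COLLATE clause and the Id/Line column names are assumptions carried over from the question):

-- In a binary collation the range ' ' to '~' covers code points 32..126,
-- so [^ -~] matches any character outside SPACE..~.
SELECT ar.Id,
       ar.Line,
       PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, ar.Line) AS StartPosition,
       SUBSTRING(ar.Line, PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, ar.Line), 1) AS InvalidCharacter,
       ASCII(SUBSTRING(ar.Line, PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, ar.Line), 1)) AS ASCIICode
FROM   Staging.APARMRE1 AS ar
WHERE  PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, ar.Line) > 0;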
Answer 2
I've been running this bit of code with success
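A sketch of that kind of check (the round-trip VARCHAR conversion test is an assumption on my part, as are the table and column names):

-- Assumes Line is an NVARCHAR column: a value that changes when forced into
-- VARCHAR contains characters outside the code page, i.e. Unicode-only characters.
SELECT ar.Id, ar.Line
FROM   Staging.APARMRE1 AS ar
WHERE  ar.Line <> CONVERT(NVARCHAR(4000), CONVERT(VARCHAR(4000), ar.Line));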
Which works well for known columns.
For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.
I'm not a fan of dynamic SQL but it does have its uses for exploratory queries like this.
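A hedged sketch of such a dynamic-SQL script (the target table name and the conversion round-trip test are assumptions):

DECLARE @TableName sysname = N'Staging.APARMRE1';   -- table to inspect (assumed)
DECLARE @sql NVARCHAR(MAX) = N'';

-- Build one SELECT per NVARCHAR column, reporting rows whose value changes
-- when converted to VARCHAR (i.e. the value contains Unicode-only characters).
SELECT @sql = @sql
    + N' UNION ALL SELECT ''' + c.name + N''' AS FieldName, '
    + N'CAST(' + QUOTENAME(c.name) + N' AS NVARCHAR(MAX)) AS InvalidText '
    + N'FROM ' + @TableName
    + N' WHERE ' + QUOTENAME(c.name)
    + N' <> CONVERT(NVARCHAR(MAX), CONVERT(VARCHAR(MAX), ' + QUOTENAME(c.name) + N'))'
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(@TableName)
  AND c.system_type_id = TYPE_ID(N'nvarchar');

IF LEN(@sql) > 0
BEGIN
    SET @sql = STUFF(@sql, 1, 11, N'');   -- drop the leading ' UNION ALL ' (11 characters)
    EXEC sys.sp_executesql @sql;
END;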
Answer 3
try something like this:
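A sketch of one way to do it (the ad-hoc numbers-table approach is an assumption on my part; the question's sample values are reused as test data, so you would substitute your own table and column):

DECLARE @Sample TABLE (Id INT, Line VARCHAR(50));
INSERT INTO @Sample (Id, Line)
VALUES (25, 'Solís'), (56, 'François'), (100, '123 Ümlaut street');

WITH Numbers AS (
    -- ad-hoc numbers table: 1..100 is plenty for these short strings
    SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects
)
SELECT s.Id,
       s.Line,
       n.n                       AS Position,
       SUBSTRING(s.Line, n.n, 1) AS InvalidCharacter
FROM @Sample AS s
JOIN Numbers AS n
  ON n.n <= LEN(s.Line)
WHERE SUBSTRING(s.Line, n.n, 1) LIKE '[^ -~]' COLLATE Latin1_General_BIN
ORDER BY s.Id, n.n;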
OUTPUT (illustrative, for the sample values above):
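Id  | Line              | Position | InvalidCharacter |
----+-------------------+----------+------------------+
25  | Solís             | 4        | í                |
56  | François          | 5        | ç                |
100 | 123 Ümlaut street | 5        | Ü                |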
Answer 4
This script searches for non-ASCII characters in one column. It generates a string of all valid characters, here code points 32 to 127, then searches for rows that don't match the list:
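A sketch of the described approach (variable names are placeholders; an accent-sensitive or binary collation is assumed so accented letters are not treated as equal to their plain counterparts):

-- Build the list of allowed characters, code points 32..127.
DECLARE @ValidCharacters VARCHAR(256) = '';
DECLARE @cp INT = 32;

WHILE @cp <= 127
BEGIN
    -- Prefix every character with the escape character '|' so that characters
    -- like ] ^ - are treated literally inside the [^...] class below.
    SET @ValidCharacters = @ValidCharacters + '|' + CHAR(@cp);
    SET @cp = @cp + 1;
END;

-- Rows containing any character that is not in the list.
SELECT ar.Id, ar.Line
FROM   Staging.APARMRE1 AS ar
WHERE  ar.Line LIKE '%[^' + @ValidCharacters + ']%' ESCAPE '|';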
Answer 5
Running the various solutions on some real-world data (12M rows, varchar length ~30, around 9k dodgy rows, no full-text index in play), the PATINDEX solution is the fastest, and it also selects the most rows.
(I pre-ran KM.'s solution to get the cache to a known state, ran the 3 processes, and finally ran KM.'s again; the last 2 runs of it gave times within 2 seconds.)
- PATINDEX solution by Gerhard Weiss -- runtime 0:38, returns 9144 rows
- Substring-numbers solution by MT. -- runtime 1:16, returns 8996 rows
- UDF solution by Deon Robertson -- runtime 3:47, returns 7316 rows
Answer 6
There is a user-defined function available on the web called 'Parse Alphanumeric'. Google 'UDF parse alphanumeric' and you should find the code for it. This user-defined function removes all characters that don't fit between 0-9, a-z, and A-Z.
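A sketch of that query (dbo.ParseAlphaNumeric stands in for the web UDF and is an assumed name, as are the table and column names):

-- dbo.ParseAlphaNumeric: assumed name for the web UDF that strips everything
-- outside 0-9, a-z, A-Z. Rows whose value changes contain other characters.
SELECT ar.Id, ar.LastName
FROM   Staging.APARMRE1 AS ar
WHERE  ar.LastName <> dbo.ParseAlphaNumeric(ar.LastName);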
That should bring back any records that have a LastName with invalid chars for you... though your bonus-points question is a bit more of a challenge, but I think a CASE statement could handle it. This is a bit of pseudocode; I'm not entirely sure if it'd work.
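A hedged reconstruction of the CASE-statement idea, using the question's column names and the same assumed UDF:

-- The first column failing the check determines FieldName and InvalidText.
SELECT ar.Id,
       CASE
           WHEN ar.LastName  <> dbo.ParseAlphaNumeric(ar.LastName)  THEN 'LastName'
           WHEN ar.FirstName <> dbo.ParseAlphaNumeric(ar.FirstName) THEN 'FirstName'
           WHEN ar.Address1  <> dbo.ParseAlphaNumeric(ar.Address1)  THEN 'Address1'
       END AS FieldName,
       CASE
           WHEN ar.LastName  <> dbo.ParseAlphaNumeric(ar.LastName)  THEN ar.LastName
           WHEN ar.FirstName <> dbo.ParseAlphaNumeric(ar.FirstName) THEN ar.FirstName
           WHEN ar.Address1  <> dbo.ParseAlphaNumeric(ar.Address1)  THEN ar.Address1
       END AS InvalidText
FROM   Staging.APARMRE1 AS ar
WHERE  ar.LastName  <> dbo.ParseAlphaNumeric(ar.LastName)
    OR ar.FirstName <> dbo.ParseAlphaNumeric(ar.FirstName)
    OR ar.Address1  <> dbo.ParseAlphaNumeric(ar.Address1);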
I wrote this in the forum post box... so I'm not quite sure if that'll function as-is, but it should be close. I'm also not sure how it will behave if a single record has two fields with invalid chars.
As an alternative, you should be able to change the FROM clause away from a single table and into a subquery that looks something like this:
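A sketch of that subquery (same assumed UDF and column names as above):

-- Unpivot the columns via UNION ALL, then apply the check once.
SELECT t.Id, t.FieldName, t.InvalidText
FROM (
    SELECT Id, 'LastName'  AS FieldName, LastName  AS InvalidText FROM Staging.APARMRE1
    UNION ALL
    SELECT Id, 'FirstName' AS FieldName, FirstName AS InvalidText FROM Staging.APARMRE1
    UNION ALL
    SELECT Id, 'Address1'  AS FieldName, Address1  AS InvalidText FROM Staging.APARMRE1
) AS t
WHERE t.InvalidText <> dbo.ParseAlphaNumeric(t.InvalidText);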
The benefit here is that for every column you only need to extend the UNION statement, while in the CASE-statement version of this script you have to repeat that comparison three times for every column.
Answer 7
To find which field has invalid characters:
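A sketch of the idea (the exact query and the column names are assumptions): serialize the candidate columns with FOR XML; if any value contains a character that XML cannot represent (such as 0x001F), the query fails and the error message names the offending node, i.e. the field.

-- Each column becomes its own element, so the error tells you which field
-- holds the character that XML cannot serialize.
SELECT ar.Id,
       ar.LastName  AS [LastName],
       ar.FirstName AS [FirstName],
       ar.Address1  AS [Address1]
FROM   Staging.APARMRE1 AS ar
FOR XML PATH('row');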
You can test it with this query:
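For example (a hedged test case: CHAR(31) is 0x001F, which XML cannot represent, and the column alias [field] matches the node named in the error below):

SELECT 'abc' + CHAR(31) + 'def' AS [field]
FOR XML PATH('row');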
The result will be:
Msg 6841, Level 16, State 1, Line 3
FOR XML could not serialize the data for node 'field' because it contains a character (0x001F) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.
This is very useful when you write XML files and get invalid-character errors when validating them.
Answer 8
Here is a UDF I built to detect columns with extended ASCII characters. It is quick and you can extend the character set you want to check. The second parameter allows you to switch between checking anything outside the standard character set or allowing an extended set:
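A sketch of such a UDF (the function name, parameter, and exact character rules are assumptions):

-- dbo.ContainsInvalidCharacters: assumed name. Returns 1 when the input
-- contains a character outside the permitted set.
CREATE FUNCTION dbo.ContainsInvalidCharacters
(
    @Input            NVARCHAR(4000),
    @AllowExtendedSet BIT   -- 0 = only SPACE(32)..~(126) allowed; 1 = also allow 128..255
)
RETURNS BIT
AS
BEGIN
    DECLARE @i INT = 1, @code INT;

    WHILE @i <= LEN(@Input)
    BEGIN
        SET @code = UNICODE(SUBSTRING(@Input, @i, 1));

        IF @code < 32 OR @code > 126
        BEGIN
            -- Outside the standard set: reject unless the extended set is
            -- allowed and the character falls in 128..255.
            IF @AllowExtendedSet = 0 OR @code < 128 OR @code > 255
                RETURN 1;
        END;

        SET @i = @i + 1;
    END;

    RETURN 0;   -- only permitted characters found
END;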
USAGE:
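A hedged usage example, with the names assumed above:

SELECT ar.Id, ar.LastName
FROM   Staging.APARMRE1 AS ar
WHERE  dbo.ContainsInvalidCharacters(ar.LastName, 0) = 1;   -- strict: ASCII 32..126 only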
Answer 9
I took Gerhard Weiss's script and made it my own as my business scenario was pretty simple and only needed to update 10K rows.
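A hedged sketch of such an adaptation (the update logic is an assumption, reusing the PATINDEX pattern from that answer to strip the offending characters in place):

-- Repeat until no row contains a character outside SPACE..~;
-- each pass removes the first offending character from every affected row.
WHILE EXISTS (SELECT 1
              FROM   Staging.APARMRE1
              WHERE  PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, Line) > 0)
BEGIN
    UPDATE Staging.APARMRE1
    SET    Line = STUFF(Line,
                        PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, Line),
                        1, '')
    WHERE  PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, Line) > 0;
END;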