php 从字符串中删除非ASCII字符

6tdlim6h 于 2023-01-12 发布在 PHP

关注(0)|答案(9)|浏览(145)

从网站提取数据时，我得到奇怪的字符：

Â

如何删除非扩展ASCII字符？
更合适的问题可以在这里找到：PHP - replace all non-alphanumeric chars for all languages supported

php

来源：https://stackoverflow.com/questions/8781911/remove-non-ascii-characters-from-string

9条答案

按热度按时间

eqzww0vc1#

正则表达式替换是最好的选择，使用$str作为示例字符串，并使用:print:匹配它，:print:是POSIX Character Class：

$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

:print:所做的是查找所有可打印的字符。相反，:^print:查找所有不可打印的字符。任何不属于当前字符集的字符都将被删除。

**注意：**在使用这个方法之前，你必须确保你的当前字符集是ASCII。POSIX字符类支持ASCII和Unicode，并且只根据当前字符集匹配。从PHP 5.6开始，默认字符集是UTF-8。

赞(0）回复(0）举报 2023-01-12

ih99xse12#

你只想要ASCII printable characters？
使用此：

<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/','', $str);
echo "($str)($res)";

或者更好的方法是，将输入转换为utf8，然后使用phputf8 lib将“not normal”字符转换为它们的ascii表示：

require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

if(!utf8_is_valid($str))
{
  $str=utf8_bad_strip($str);
}

$str = utf8_to_ascii($str, '' );

赞(0）回复(0）举报 2023-01-12

368yc8dk3#

$clearstring=filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);

更新日期：

自PHP 8.1起不建议使用FILTER_SANITIZE_STRINGhttps://www.php.net/manual/en/migration81.deprecated.php#migration81.deprecated.filter

赞(0）回复(0）举报 2023-01-12

c9x0cxw04#

与此相关的是，我们有一个Web应用程序，它必须将数据发送到一个只能处理ASCII字符集前128个字符的遗留系统。
我们必须使用的解决方案是将尽可能多的字符“翻译”成紧密匹配的ASCII等价物，但留下任何不能翻译的内容。
通常我会这样做：

<?php
// transliterate
if (function_exists('iconv')) {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }
?>

......但它会替换所有不能翻译成问号（？）的内容。
所以我们最后做了以下工作：在这个函数的末尾检查（注解掉）php正则表达式，它只是去掉了非ASCII字符。

<?php
public function cleanNonAsciiCharactersInString($orig_text) {

    $text = $orig_text;

    // Single letters
    $text = preg_replace("/[∂άαáàâãªä]/u",      "a", $text);
    $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u",     "A", $text);
    $text = preg_replace("/[ЂЪЬБъь]/u",           "b", $text);
    $text = preg_replace("/[βвВ]/u",            "B", $text);
    $text = preg_replace("/[çς©с]/u",            "c", $text);
    $text = preg_replace("/[ÇС]/u",              "C", $text);        
    $text = preg_replace("/[δ]/u",             "d", $text);
    $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
    $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u",     "E", $text);
    $text = preg_replace("/[₣]/u",               "F", $text);
    $text = preg_replace("/[НнЊњ]/u",           "H", $text);
    $text = preg_replace("/[ђћЋ]/u",            "h", $text);
    $text = preg_replace("/[ÍÌÎÏ]/u",           "I", $text);
    $text = preg_replace("/[íìîïιίϊі]/u",       "i", $text);
    $text = preg_replace("/[Јј]/u",             "j", $text);
    $text = preg_replace("/[ΚЌК]/u",            'K', $text);
    $text = preg_replace("/[ќк]/u",             'k', $text);
    $text = preg_replace("/[ℓ∟]/u",             'l', $text);
    $text = preg_replace("/[Мм]/u",             "M", $text);
    $text = preg_replace("/[ñηήηπⁿ]/u",            "n", $text);
    $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u",       "N", $text);
    $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
    $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u",     "O", $text);
    $text = preg_replace("/[ρφрРф]/u",          "p", $text);
    $text = preg_replace("/[®яЯ]/u",              "R", $text); 
    $text = preg_replace("/[ГЃгѓ]/u",              "r", $text); 
    $text = preg_replace("/[Ѕ]/u",              "S", $text);
    $text = preg_replace("/[ѕ]/u",              "s", $text);
    $text = preg_replace("/[Тт]/u",              "T", $text);
    $text = preg_replace("/[τ†‡]/u",              "t", $text);
    $text = preg_replace("/[úùûüџμΰµυϋύ]/u",     "u", $text);
    $text = preg_replace("/[√]/u",               "v", $text);
    $text = preg_replace("/[ÚÙÛÜЏЦц]/u",         "U", $text);
    $text = preg_replace("/[Ψψωώẅẃẁщш]/u",      "w", $text);
    $text = preg_replace("/[ẀẄẂШЩ]/u",          "W", $text);
    $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
    $text = preg_replace("/[ỲΫ¥]/u",           "Y", $text);
    $text = preg_replace("/[ỳγўЎУуч]/u",       "y", $text);
    $text = preg_replace("/[ζ]/u",              "Z", $text);

    // Punctuation
    $text = preg_replace("/[‚‚]/u", ",", $text);        
    $text = preg_replace("/[`‛′’‘]/u", "'", $text);
    $text = preg_replace("/[″“”«»„]/u", '"', $text);
    $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
    $text = preg_replace("/[  ]/u", ' ', $text);

    $text = str_replace("…", "...", $text);
    $text = str_replace("≠", "!=", $text);
    $text = str_replace("≤", "<=", $text);
    $text = str_replace("≥", ">=", $text);
    $text = preg_replace("/[‗≈≡]/u", "=", $text);

    // Exciting combinations    
    $text = str_replace("ыЫ", "bl", $text);
    $text = str_replace("℅", "c/o", $text);
    $text = str_replace("₧", "Pts", $text);
    $text = str_replace("™", "tm", $text);
    $text = str_replace("№", "No", $text);        
    $text = str_replace("Ч", "4", $text);                
    $text = str_replace("‰", "%", $text);
    $text = preg_replace("/[∙•]/u", "*", $text);
    $text = str_replace("‹", "<", $text);
    $text = str_replace("›", ">", $text);
    $text = str_replace("‼", "!!", $text);
    $text = str_replace("⁄", "/", $text);
    $text = str_replace("∕", "/", $text);
    $text = str_replace("⅞", "7/8", $text);
    $text = str_replace("⅝", "5/8", $text);
    $text = str_replace("⅜", "3/8", $text);
    $text = str_replace("⅛", "1/8", $text);        
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[Љљ]/u", "Ab", $text);
    $text = preg_replace("/[Юю]/u", "IO", $text);
    $text = preg_replace("/[ﬁﬂ]/u", "fi", $text);
    $text = preg_replace("/[зЗ]/u", "3", $text); 
    $text = str_replace("£", "(pounds)", $text);
    $text = str_replace("₤", "(lira)", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
    $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);

    //2) Translation CP1252.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
    $trans['-'] = array(
        '&hellip;',     // Horizontal Ellipsis
        '&tilde;',      // Small Tilde
        '&ndash;'       // Dash
        );
    $trans["+"] = '&dagger;';    // Dagger
    $trans['#'] = '&Dagger;';    // Double Dagger         
    $trans['M'] = '&permil;';    // Per Mille Sign
    $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron        
    $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
    $trans["'"] = array(
        '&lsquo;',  // Left Single Quotation Mark
        '&rsquo;',  // Right Single Quotation Mark
        '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        '&sbquo;',  // Single Low-9 Quotation Mark
        '&circ;',   // Modifier Letter Circumflex Accent
        '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );

    $trans['"'] = array(
        '&ldquo;',  // Left Double Quotation Mark
        '&rdquo;',  // Right Double Quotation Mark
        '&bdquo;',  // Double Low-9 Quotation Mark
        );

    $trans['*'] = '&bull;';    // Bullet
    $trans['n'] = '&ndash;';    // En Dash
    $trans['m'] = '&mdash;';    // Em Dash        
    $trans['tm'] = '&trade;';    // Trade Mark Sign
    $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
    $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
    $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    $trans['euro'] = '&euro;';    // euro currency symbol
    ksort($trans);

    foreach ($trans as $k => $v) {
        $text = str_replace($v, $k, $text);
    }

    // 3) remove <p>, <br/> ...
    $text = strip_tags($text);

    // 4) &amp; => & &quot; => '
    $text = html_entity_decode($text);

    // transliterate
    // if (function_exists('iconv')) {
    // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    // }

    // remove non ascii characters
    // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);      

    return $text;
}

?>

赞(0）回复(0）举报 2023-01-12

csbfibhn5#

我还认为最好的解决方案可能是使用正则表达式。
我的建议是：

function convert_to_normal_text($text) {

    $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
    $normal_text = preg_replace("/[^$normal_characters]/", '', $text);

    return $normal_text;
}

然后，您可以像这样使用它：

$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;

显示：

Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

赞(0）回复(0）举报 2023-01-12

kx5bkwkv6#

我只需要加上标题

header('Content-Type: text/html; charset=UTF-8');

赞(0）回复(0）举报 2023-01-12

q8l4jmvw7#

这应该是非常直接的，不需要iconv函数：

// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-_9\s]+!', '', strtolower($string));
// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);

赞(0）回复(0）举报 2023-01-12

gdrx4gfi8#

我的问题解决了

$text = 'Châu Thái  Nhân 12/09/2022';
echo preg_replace('/[\x00-\x1F\x7F]/', '', $text);
//Châu Thái  Nhân 12/09/2022

赞(0）回复(0）举报 2023-01-12

vybvopom9#

我认为最好的方法是使用ord（）命令。这样你就可以保存任何语言的字符。只要记住先测试你文本的ord结果。这在unicode上不起作用。

$name="βγδεζηΘKgfgebhjrf!@#$%^&";    
//this function will clear all non greek and english characters on greek-iso charset        
function replace_characters($string)    
{    
   $str_length=strlen($string);    
   for ($x=0;$x<$str_length;$x++)    
      {    
          $character=$string[$x];    
          if ((ord($character)>64 && ord($character)<91) || (ord($character)>96 && ord($character)<123) || (ord($character)>192 && ord($character)<210) || (ord($character)>210 && ord($character)<218) || (ord($character)>219 && ord($character)<250) || ord($character)==252 || ord($character)==254)    
             {    
                 $new_string=$new_string.$character;     
             }    
      }    
      return $new_string;    
}    
//end function    

$name=replace_characters($name);    

echo $name;

赞(0）回复(0）举报 2023-01-12

我来回答

php 从字符串中删除非ASCII字符

9条答案

相关问题

热门标签

最新问答