update page now
PHP 8.1.34 Released!

Voting

: max(nine, eight)?
(Example: nine)

The Note You're Voting On

hfuecks at nospam dot org
20 years ago
Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of;

1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above - "UTF-8 validity of the pattern is checked since PHP 4.3.5"

2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

3. PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode ( see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO" - can be found at http://www.tldp.org/ and other places )

4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/

The following script should give you an idea of what works and what doesn't;

<?php
$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",

    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x28\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

echo "++Invalid UTF-8 in pattern\n";
foreach ( $examples as $name => $str ) {
    echo "$name\n";
    preg_match("/".$str."/u",'Testing');
}

echo "++ preg_match() examples\n";
foreach ( $examples as $name => $str ) {
    
    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";

    if ( count($ar) == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched {$ar[0]}\n";
    }
    
}

echo "++ preg_match_all() examples\n";
foreach ( $examples as $name => $str ) {
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";
    
    $num_utf8_chars = count($ar[0]);
    if ( $num_utf8_chars == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched $num_utf8_chars character\n";
    }
    
}
?>

<< Back to user notes page

To Top