Using Unicode with PHP

USING UNICODE WITH PHP
Translation, localization, and 100%
less mojibake guaranteed or your
users won’t come back!

The whole world uses the internet

Why is internationalization important?
Content language of websites

Percentage of Internet users by language

Worse than no internationalization?
Mojibake

Unicode is the solution!
Well – kind of

1. Different encodings
2. OS’s have different default implementations
3. All software encodings have to match or convert
Unicode Idea == simple
Unicode Implementation == hard

Back to Basics

WHAT IS UNICODE?

U·ni·code /ˈyo͞oniˈkōd/
noun, COMPUTING

1. an international encoding standard for use
with different languages and scripts, by which
each letter, digit, or symbol is assigned a unique
numeric value that applies across different
platforms and programs.

In the Beginning, there was ASCII

Code Pages
In which things get really weird…

Representing characters differently
• ASCII: one character maps directly to bits in memory
A -> 100 0001
• Unicode: one character maps to an abstract code point
A -> U+0041
But how do we represent this in memory?

Encoding Madness
UTF – Unicode Transformation Format
Maps a Code Point to a Byte Sequence

What is a character?
Å
U+212B ANGSTROM SIGN
U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE
How long is the string?
1. In bytes?
2. In code units?
3. In code points?
4. In graphemes?
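
The four questions above give four different answers for the same visible character. A minimal sketch using only core PHP and PCRE in UTF-8 mode; the helper names are hypothetical, and the input is assumed to be valid UTF-8:

```php
<?php
// Count Unicode code points: in /u mode "." matches one code point,
// and the /s flag lets it match newlines as well.
function count_code_points($utf8)
{
    return preg_match_all('/./us', $utf8, $unused);
}

// Count user-perceived characters: \X matches one grapheme cluster.
function count_graphemes($utf8)
{
    return preg_match_all('/\X/u', $utf8, $unused);
}

// Count UTF-16 code units: code points above U+FFFF (4 bytes in UTF-8)
// need a surrogate pair, everything else fits in a single unit.
function count_utf16_code_units($utf8)
{
    preg_match_all('/./us', $utf8, $matches);
    $units = 0;
    foreach ($matches[0] as $char) {
        $units += (strlen($char) === 4) ? 2 : 1;
    }
    return $units;
}

// The three spellings of the same character, written as raw UTF-8 bytes.
$forms = array(
    'U+212B'        => "\xE2\x84\xAB",
    'U+00C5'        => "\xC3\x85",
    'U+0041 U+030A' => "A\xCC\x8A",
);

foreach ($forms as $label => $s) {
    printf(
        "%-14s bytes=%d code units=%d code points=%d graphemes=%d\n",
        $label,
        strlen($s), // strlen() always counts bytes
        count_utf16_code_units($s),
        count_code_points($s),
        count_graphemes($s)
    );
}
```

All three spellings count as one grapheme, but they differ in bytes and code points, which is why "length" needs a unit before it means anything.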

Crash course in Computer Memory
Big endian systems store the most significant
byte of a number first (at the lowest address).
Decreasing significance.
Little endian systems store the least
significant byte of a number first.
Increasing significance.

Big Endian? Little Endian?
You’re hurting my brain
Hello -> U+0048 U+0065 U+006C U+006C U+006F
00 48 00 65 00 6C 00 6C 00 6F – Big Endian
48 00 65 00 6C 00 6C 00 6F 00 – Little Endian
But… it’s the same encoding (UTF-16), just a different byte order…
Now I have a headache!
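
A small sketch of the two byte orders, using only pack() from core PHP. The function name is hypothetical, and it only handles code points up to U+FFFF (no surrogate pairs):

```php
<?php
// Serialize code points as UTF-16 and return the bytes as spaced hex.
// 'n' packs an unsigned 16-bit value big endian, 'v' packs it little endian.
// Assumption: every code point is in the Basic Multilingual Plane.
function utf16_hex(array $codePoints, $bigEndian)
{
    $format = $bigEndian ? 'n' : 'v';
    $bytes = '';
    foreach ($codePoints as $codePoint) {
        $bytes .= pack($format, $codePoint);
    }
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

$hello = array(0x0048, 0x0065, 0x006C, 0x006C, 0x006F);

echo utf16_hex($hello, true), "\n";  // 00 48 00 65 00 6C 00 6C 00 6F
echo utf16_hex($hello, false), "\n"; // 48 00 65 00 6C 00 6C 00 6F 00
```

Big endian puts the high byte (00) first; little endian puts the low byte (48) first.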

UTF-8 to the rescue!

Hello in ANSI  -> 48 65 6C 6C 6F
Hello in UTF-8 -> 48 65 6C 6C 6F
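
To see why ASCII text is byte-for-byte identical in UTF-8 while everything else grows, here is a rough sketch of the code point to byte sequence mapping. The function name is hypothetical, and it skips validation (surrogates and values above U+10FFFF are not rejected):

```php
<?php
// Encode a single Unicode code point as a UTF-8 byte string.
// Assumption: $cp is a valid scalar value; no error checking is done.
function code_point_to_utf8($cp)
{
    if ($cp < 0x80) {
        // 1 byte: identical to ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(code_point_to_utf8(0x0041)), "\n"; // 41
echo bin2hex(code_point_to_utf8(0x00C5)), "\n"; // c385
echo bin2hex(code_point_to_utf8(0x212B)), "\n"; // e284ab
```

Code points below U+0080 come out as the same single byte ASCII uses, so there is no byte order to argue about.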

Moral of the story
Unicode is a standard, not an implementation
Text is never plain
Every string has an encoding
From a file
From a db
From an HTTP POST or GET (or PUT or file upload…)
Even Binary is an encoding!
No encoding? Start praying to the Mojibake gods…
If you do web – use UTF-8

Mojibake on rye with swiss.

WHY DO YOU NEED
UNICODE?

More than just UTF8

BEYOND STRINGS

I18n and L10N
• Internationalization – adaptation of products for potential use
virtually everywhere
• Localization - addition of special features for use in a specific locale

Date and Time Formats
30 juin 2009    fr_FR
30.06.2009      de_DE
Jun 30, 2009    en_US

And don’t forget the time zones!
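
A hedged sketch of locale-aware date formatting with the intl extension's IntlDateFormatter. The helper name is hypothetical, and the exact output text depends on the ICU version installed, so treat the dates listed above as typical rather than guaranteed:

```php
<?php
// Format a Unix timestamp as a medium-length date for one locale.
// The time zone is passed explicitly so the result does not depend
// on the server's default zone.
function format_date_for_locale($locale, $timestamp, $timezone = 'UTC')
{
    $formatter = new IntlDateFormatter(
        $locale,
        IntlDateFormatter::MEDIUM, // date style
        IntlDateFormatter::NONE,   // no time part
        $timezone
    );
    return $formatter->format($timestamp);
}

if (extension_loaded('intl')) {
    $when = gmmktime(12, 0, 0, 6, 30, 2009); // 30 June 2009, 12:00 UTC
    foreach (array('fr_FR', 'de_DE', 'en_US') as $locale) {
        echo $locale, ': ', format_date_for_locale($locale, $when), "\n";
    }
}
```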

Currency and Numbers
• 123 456        fr_FR
• 345 987,246    fr_FR
• 123.456        de_DE
• 345.987,246    de_DE
• 123,456        en_US
• 345,987.246    en_US
• French (France), Euro: 9 876 543,21 €
• German (Germany), Euro: 9.876.543,21 €
• English (United States), US Dollar: $9,876,543.21
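
The same numbers can be produced with intl's NumberFormatter. This is a sketch with hypothetical helper names; note that newer ICU releases use a narrow no-break space for French grouping, so the French output may not contain a plain space:

```php
<?php
// Format a plain number using the locale's grouping and decimal marks.
function format_number_for_locale($locale, $value)
{
    $formatter = new NumberFormatter($locale, NumberFormatter::DECIMAL);
    return $formatter->format($value);
}

// Format a currency amount; $currency is an ISO 4217 code such as "EUR".
function format_currency_for_locale($locale, $amount, $currency)
{
    $formatter = new NumberFormatter($locale, NumberFormatter::CURRENCY);
    return $formatter->formatCurrency($amount, $currency);
}

if (extension_loaded('intl')) {
    foreach (array('fr_FR', 'de_DE', 'en_US') as $locale) {
        echo $locale, ': ', format_number_for_locale($locale, 345987.246), "\n";
    }
    echo format_currency_for_locale('de_DE', 9876543.21, 'EUR'), "\n";
    echo format_currency_for_locale('en_US', 9876543.21, 'USD'), "\n";
}
```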

Collation (Sorting)
• The letters A-Z can be sorted in a different order than in English. For
example, in Lithuanian, "y" is sorted between "i" and "k"
• Combinations of letters can be treated as if they were one letter.
For example, in traditional Spanish "ch" is treated as a single letter,
and sorted between "c" and "d"
• Accented letters can be treated as minor variants of the unaccented
letter. For example, "é" can be treated equivalent to "e".
• Accented letters can be treated as distinct letters. For example, "Å"
in Danish is treated as a separate letter that sorts after "Z" (at the
end of the alphabet, following "Æ" and "Ø").
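
A minimal sketch of locale-dependent sorting with intl's Collator. The helper name and the sample words are made up for illustration:

```php
<?php
// Return a copy of $words sorted by the collation rules of $locale.
function sort_for_locale(array $words, $locale)
{
    $collator = new Collator($locale);
    $collator->sort($words); // sorts in place and reindexes
    return $words;
}

if (extension_loaded('intl')) {
    // "\xC3\x85se" is "Åse" written as UTF-8 bytes.
    $names = array('Zebra', "\xC3\x85se", 'Anna');

    // English treats Å as a variant of A: Anna, Åse, Zebra
    echo implode(', ', sort_for_locale($names, 'en_US')), "\n";

    // Danish treats Å as a separate letter after Z: Anna, Zebra, Åse
    echo implode(', ', sort_for_locale($names, 'da_DK')), "\n";
}
```

A plain sort() compares bytes, so it cannot produce either order reliably once non-ASCII letters are involved.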

String Translation
• Translation is never one to one, especially when inserting items like
numbers
• Some languages have different grammars and formats for the
strangest things
• Usually translated strings are separated into “messages” and
stored, then mapped depending on the locale
• Large amounts of text need even more – different tables in a
database, files in directories, or more
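
One way to picture the "messages mapped by locale" idea, with a number inserted, is intl's MessageFormatter and its plural rules. The catalog array, the message key and the function name below are all hypothetical; a real application would load messages from files or a database:

```php
<?php
// Look up a message pattern by locale and key, then let ICU fill in
// the arguments and pick the right plural form for that locale.
function format_message_for($locale, $key, array $args)
{
    // Hypothetical in-memory catalog, keyed by locale then message id.
    $catalog = array(
        'en_US' => array(
            'inbox.count' => '{0, plural, one{You have # new message} other{You have # new messages}}',
        ),
        'fr_FR' => array(
            'inbox.count' => '{0, plural, one{Vous avez # nouveau message} other{Vous avez # nouveaux messages}}',
        ),
    );
    return MessageFormatter::formatMessage($locale, $catalog[$locale][$key], $args);
}

if (extension_loaded('intl')) {
    echo format_message_for('en_US', 'inbox.count', array(1)), "\n";
    echo format_message_for('en_US', 'inbox.count', array(5)), "\n";
    echo format_message_for('fr_FR', 'inbox.count', array(5)), "\n";
}
```

The plural category comes from the locale's rules rather than from an if/else in application code, which is the part that is hard to get right by hand.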

Layout and Design
• Reading order
• Right to left
• Left to right
• Top to bottom

• Word order
• Cultural taboos (human images, for example)

3.5 extensions for triple the pain!

HOW TO UNICODE
WITH PHP

Upgrade to at least 5.3
• No, really, I’m entirely serious

• If you’re not on 5.3 you’re not ready for unicode
• At all

• You have far bigger issues to deal with – like no security updates
• (oh, and the extensions and apis you need either don’t exist or
won’t work right)

Install the bare minimum
• intl extension (bundled since PHP 5.3)
• mbstring (if you need zend_multibyte support or on-the-fly
conversion; for most anything else it can do, intl does it better)
• iconv extension (optional but excellent for dealing with files)
• pcre MUST have utf8 support (CHECK!)
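
A quick way to run that checklist on a given server. This is a sketch; the function name is hypothetical, and the PCRE probe simply tries to compile a pattern that needs UTF-8 mode and Unicode property support:

```php
<?php
// Report which of the pieces from the checklist are available.
function unicode_readiness()
{
    return array(
        'intl'      => extension_loaded('intl'),
        'mbstring'  => extension_loaded('mbstring'),
        'iconv'     => extension_loaded('iconv'),
        // preg_match() returns false (with a warning, silenced here)
        // if PCRE was built without UTF-8 / Unicode property support.
        'pcre_utf8' => @preg_match('/^\p{L}$/u', "\xC3\xA9") === 1,
    );
}

foreach (unicode_readiness() as $feature => $available) {
    printf("%-10s %s\n", $feature, $available ? 'ok' : 'MISSING');
}
```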

C strings and encoding
char - 1 byte (usually 8 bit)
char * - a pointer to an array of chars stored in memory
• Can handle Code Page encodings, although generally need special APIs for
dealing with multibyte code pages
• Usually null terminated… well unless it’s a binary string
• Unix cleverly supports utf8 with apis
• Windows … does not

Introducing a new type
wchar_t – C90 standard (horribly ambiguous)
• Windows set it at 16 – and defined A and W versions of everything
• Unix set it at 32

C11 and C++11 add char16_t and char32_t to fix the craziness
Non-portable and api support sketchy
• Libraries to fix this exist
• Few are cross-platform
• Except for ICU – which just rocks

Why do we care?
• PHP talks ONLY to ansi apis on windows
• PHP functions assume ascii or binary encodings (except for a few
special ones)
• Although most functions are now marked “binary safe” and don’t
flip out on null bytes within a string, some still assume a null
terminated string
• string handling functions treat strings as a sequence of single-byte
characters.
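
What "a sequence of single-byte characters" means in practice, sketched with core functions only. The two helper names are hypothetical, and the input is assumed to be UTF-8:

```php
<?php
// "naïve" as UTF-8: the ï takes two bytes (C3 AF).
$word = "na\xC3\xAFve";

// strlen() counts bytes, so this is 6, not 5.
$byteLength = strlen($word);

// substr() cuts on byte offsets: 3 bytes ends in the middle of the ï.
$broken = substr($word, 0, 3);

// True when $bytes is well-formed UTF-8 (an empty /u pattern only
// succeeds if the subject string is valid UTF-8).
function is_valid_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

// Take the first $count code points instead of the first $count bytes.
function utf8_prefix($utf8, $count)
{
    preg_match('/^.{0,' . (int) $count . '}/us', $utf8, $match);
    return $match[0];
}

var_dump($byteLength);                           // int(6)
var_dump(is_valid_utf8($broken));                // bool(false)
var_dump(is_valid_utf8(utf8_prefix($word, 3)));  // bool(true)
```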

Non-stupid PHP functionality (kinda)
• utf8_encode (only ISO-8859-1 to UTF8)
• utf8_decode (only UTF8 to ISO-8859-1)
• html_entity_decode
• htmlentities
• htmlspecialchars_decode
• htmlspecialchars

C locales or how to make servers cry
• setlocale is per process
• I will repeat that – setlocale sets PER PROCESS
• Locales are slightly different on different OS’s
• Windows does not support utf8 properly

What setlocale will break

• gettext extension
• strtoupper
• strtolower
• number_format
• money_format
• ucfirst
• ucwords
• strftime

INTL to the rescue!
• Wrapper around the excellent ICU library
• Standardized locales, set default locale per script
• Number formatting
• Currency formatting
• Message formatting (replaces gettext)
• Calendars, dates, timezones and time
• Transliterator
• Spoofchecker
• Resource Bundles
• Converters
• IDN support
• Graphemes
• Collation
• Iterators

Some intl caveats
• New stuff is only in newer PHP versions
• All strings in and out must be UTF-8 except for UConverter
• Intl doesn’t yet support zend_multibyte
• Intl doesn’t support HTTP input/output conversion
• Intl doesn’t support function “overloading”

mbstring
• enables zend_multibyte support
• supports transparent http in and out encoding
• provides some wrappers for functionality such as strtoupper
(including overloading the php function version…)

Iconv
• Primarily for charset conversion
• output buffer handler
• mime encoding functionality
• conversion
• some string helpers: iconv_strlen, iconv_substr, iconv_strpos, iconv_strrpos
• stream filter
stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP');
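
A slightly fuller sketch of the stream filter, assuming the iconv extension is loaded. It reads ISO-8859-1 bytes from a stream and hands back UTF-8; the helper name and the php://temp stream are just for illustration, a real file handle from fopen() works the same way:

```php
<?php
// Attach a read filter that converts from $sourceEncoding to UTF-8,
// then read the rest of the stream through it.
function read_stream_as_utf8($fp, $sourceEncoding)
{
    stream_filter_append(
        $fp,
        'convert.iconv.' . $sourceEncoding . '/UTF-8',
        STREAM_FILTER_READ
    );
    return stream_get_contents($fp);
}

if (extension_loaded('iconv')) {
    $fp = fopen('php://temp', 'w+b');
    fwrite($fp, "caf\xE9"); // "café" in ISO-8859-1: é is the single byte E9
    rewind($fp);

    $utf8 = read_stream_as_utf8($fp, 'ISO-8859-1');
    fclose($fp);

    echo bin2hex($utf8), "\n"; // the é should now be the two bytes c3 a9
}
```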

Stay away from:
• ctype (all of it)
• filter extension with string functionality: FILTER_VALIDATE_EMAIL,
FILTER_VALIDATE_URL, FILTER_VALIDATE_REGEXP, FILTER_SANITIZE_*

• some string functionality
• str_pad
• wordwrap
• others that might work only by looking at single bytes
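
To illustrate the str_pad problem, here is a rough grapheme-aware replacement built on PCRE. The function name is hypothetical; it only pads on the right, expects a one-character pad string, and does not account for double-width (for example East Asian) characters:

```php
<?php
// Right-pad $utf8 to $width user-perceived characters.
// Assumptions: valid UTF-8 input, $pad is a single character.
function utf8_str_pad($utf8, $width, $pad = ' ')
{
    // \X matches one grapheme cluster, so this counts what the user sees.
    $length = preg_match_all('/\X/u', $utf8, $unused);
    if ($length === false || $length >= $width) {
        return $utf8;
    }
    return $utf8 . str_repeat($pad, $width - $length);
}

$a = "\xC3\x85"; // "Å" is one character but two bytes

// str_pad() counts bytes, so it thinks "Å" is already 2 wide.
echo str_pad($a, 3, '.'), "\n";      // Å.
echo utf8_str_pad($a, 3, '.'), "\n"; // Å..
```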

What do you mean mysql is giving
me garbage?

BEYOND THE CODE

Browser Considerations
• Set Content-type AND charset
• use HTTP headers AND meta tags (not just meta)
• use accept-charset on forms to make sure your data is coming in
right
• Javascript: string literals, regular expression literals and any code
unit can also be expressed via a Unicode escape sequence \uHHHH
• Specify content-type AND charset headers for javascript!!
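
A small sketch of the header, meta tag and accept-charset advice together. The function names and the form markup are hypothetical:

```php
<?php
// Build a Content-Type header value that names the charset explicitly.
function content_type_header($mimeType, $charset = 'UTF-8')
{
    return 'Content-Type: ' . $mimeType . '; charset=' . $charset;
}

// A page that repeats the charset in a meta tag and asks the browser
// to submit form data as UTF-8.
function form_page_html()
{
    return '<!DOCTYPE html>' . "\n"
         . '<html><head><meta charset="UTF-8"><title>Form</title></head>' . "\n"
         . '<body><form method="post" accept-charset="UTF-8">' . "\n"
         . '<input type="text" name="comment"><input type="submit">' . "\n"
         . '</form></body></html>' . "\n";
}

// In a web request the header must be sent before any output:
//   header(content_type_header('text/html'));
//   echo form_page_html();
// and for a script served by PHP:
//   header(content_type_header('application/javascript'));
```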

Databases
Table/Schema encoding and connection
• MySQL: you need to set the charset right on the table
AND
• Set the charset right on the connection (NOT with SET NAMES, it does
not do enough)
AND
• Don’t use the mysql extension – use mysqli or PDO
• postgresql - pg_set_client_encoding
• oracle – passed in the connect
• sqlite(3) – make sure it was compiled with unicode and intl
extension is available
• sqlsrv/pdo_sqlsrv – CharacterSet in options
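
For MySQL, setting the charset on the connection could look roughly like this. The function names are hypothetical, the credentials are placeholders passed in by the caller, and utf8mb4 is an assumption on my part (it needs a reasonably recent MySQL server and matching table charsets):

```php
<?php
// Build a PDO DSN that sets the connection charset up front.
function build_mysql_dsn($host, $dbname, $charset = 'utf8mb4')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// PDO: the charset travels in the DSN, so the driver knows about it.
function connect_pdo($host, $dbname, $user, $password)
{
    return new PDO(
        build_mysql_dsn($host, $dbname),
        $user,
        $password,
        array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)
    );
}

// mysqli: set_charset() tells both the server and the client library,
// which a bare "SET NAMES" query does not.
function connect_mysqli($host, $dbname, $user, $password)
{
    $db = new mysqli($host, $user, $password, $dbname);
    $db->set_charset('utf8mb4');
    return $db;
}

echo build_mysql_dsn('localhost', 'app'), "\n";
```

The point of going through the driver is that escaping functions then use the same charset the server does.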

Other gotchas
• Plain text is not plain text, files will have encodings
• Files will be loaded as binary if you add the b flag to fopen (here’s a
hint, always use the b flag)
• You can convert files on the fly with the iconv filter
• You cannot use unicode file names with PHP and windows at all (no,
not even utf8) – unless you find a third-party php extension
• Beware of sending anything but ascii to exec, proc_open and other
command line calls

The best and worst in PHP apps

CASE STUDIES

Applications
• Wordpress
• gettext (sigh)
• Drupal
• gettext files but NOT gettext api

Frameworks
• ZF and ZF2
• http://framework.zend.com/manual/1.12/en/performance.localization.html
• multiple adapters
• “gettext” allows using fast .po files, but doesn’t use setlocale/gettext
extension
• Symfony 1 and 2
• http://symfony.com/doc/current/book/translation.html
• multiple formats to hold translations
• doesn’t use gettext

Resources
• http://www.joelonsoftware.com/articles/Unicode.html
• http://unicode.org
• http://www.slideshare.net/andreizm/the-good-the-bad-and-the-ugly-what-happened-to-unicode-and-php-6
• http://php.net
• http://www.2ality.com/2013/09/javascript-unicode.html
• http://htmlpurifier.org/docs/enduser-utf8.html

My Little Project
• Get everything needed into intl from mbstring and iconv so you
need only 1 solution
• stream filter from iconv
• output handler from iconv
• zend_multibyte support from mbstring
• http in and output conversion from mbstring
• Some simplified apis to make “overloading” doable

Contact
• auroraeosrose@gmail.com
• @auroraeosrose
• http://emsmith.net
• http://github.com/auroraeosrose
• Freenode
• #phpwomen
• #phpmentoring
• #php-gtk
