The following is a brief explanation of UTF-8, UTF-16 and UTF-32. All three can represent every Unicode code point; they differ in how each code point is turned into bytes (the encoding).

UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, and code points U+10000 to U+10FFFF take 4 bytes. This is the dominant encoding on the Web.
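The four length classes above can be checked with a short Python sketch. Python is only an assumption here (the text names no language), and the helper name utf8_width and the sample characters are illustrative choices, not part of any standard API.

```python
# One sample character from each UTF-8 length class described above,
# mapped to the number of bytes it is expected to occupy.
samples = {
    "A": 1,           # U+0041, ASCII range U+0000..U+007F
    "\u00e9": 2,      # U+00E9 (e with acute accent), range U+0080..U+07FF
    "\u20ac": 3,      # U+20AC (euro sign), range U+0800..U+FFFF
    "\U0001F600": 4,  # U+1F600 (an emoji), range U+10000..U+10FFFF
}


def utf8_width(ch: str) -> int:
    """Return how many bytes the single character ch occupies in UTF-8.

    Hypothetical helper for this example only.
    """
    return len(ch.encode("utf-8"))


for ch, expected in samples.items():
    print(f"U+{ord(ch):04X}: {utf8_width(ch)} byte(s), expected {expected}")
```

Because ASCII characters encode to the same single byte they have in ASCII, any pure-ASCII file is already valid UTF-8.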

Pros

Takes the least memory for mostly-ASCII text such as English (1 byte per character).

Cons

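String manipulation is a bit tedious because characters vary in width: the Nth character is not at a fixed byte offset, and cutting at an arbitrary byte can split a character.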
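A small Python sketch of why variable width makes manipulation awkward, assuming the text is handled as raw UTF-8 bytes. The helper cut_is_valid is a hypothetical name used only for this illustration.

```python
text = "na\u00efve"            # "naïve": the ï (U+00EF) takes 2 bytes in UTF-8
data = text.encode("utf-8")    # bytes: 6e 61 c3 af 76 65

char_count = len(text)         # 5 characters
byte_count = len(data)         # 6 bytes, so byte index != character index


def cut_is_valid(raw: bytes, n: int) -> bool:
    """Return True if the first n bytes of raw still decode as UTF-8.

    Hypothetical helper: a cut that lands inside a multi-byte
    character leaves an incomplete sequence and fails to decode.
    """
    try:
        raw[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(cut_is_valid(data, 2))   # cut after "na": fine
print(cut_is_valid(data, 3))   # cut in the middle of ï: broken
```

Code that truncates or indexes UTF-8 data therefore has to work on character boundaries rather than on arbitrary byte offsets.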

————————————————————————————————————————————————

UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF (the Basic Multilingual Plane) take 2 bytes, and code points U+10000 to U+10FFFF take 4 bytes, encoded as a pair of 2-byte surrogates. This is the encoding used natively by the Windows API.
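The 4-byte case works by splitting a code point into two 16-bit units, called a surrogate pair. The following Python sketch shows the arithmetic. The function name utf16_code_units is hypothetical, and the sketch assumes the input is a valid code point (it does not reject the reserved range U+D800 to U+DFFF).

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code units that encode code_point in UTF-16.

    Hypothetical helper; assumes a valid, non-surrogate code point.
    """
    if code_point < 0x10000:
        # Code points up to U+FFFF fit in a single 2-byte unit.
        return [code_point]
    # Above U+FFFF: subtract 0x10000, leaving a 20-bit value,
    # then split it into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return [high, low]


# U+1D11E is the musical G clef symbol.
units = utf16_code_units(0x1D11E)
print([f"{u:04X}" for u in units])     # ['D834', 'DD1E']
```

The result can be cross-checked against Python's built-in codec: "\U0001D11E".encode("utf-16-be") yields the same four bytes, D8 34 DD 1E.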

Pros

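Takes less memory than UTF-8 for text that is mostly East Asian (for example Chinese, Japanese or Korean): most of those characters take 2 bytes in UTF-16 but 3 bytes in UTF-8.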
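The memory trade-off can be measured directly. This Python sketch compares encoded sizes for one English and one Chinese sample. The sample strings and the helper encoded_sizes are illustrative assumptions, and the "-le" codec variants are used so that no byte order mark is added to the counts.

```python
english = "Hello, world"                   # 12 ASCII characters
chinese = "\u4f60\u597d\u4e16\u754c"       # "你好世界", 4 characters below U+FFFF


def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in each of the three encodings.

    Hypothetical helper for this comparison only.
    """
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


print(encoded_sizes(english))  # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
print(encoded_sizes(chinese))  # {'utf-8': 12, 'utf-16-le': 8, 'utf-32-le': 16}
```

For the English sample UTF-8 is half the size of UTF-16, while for the Chinese sample UTF-16 is the smaller of the two. Real documents that mix CJK text with ASCII markup or punctuation will fall somewhere in between.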

Cons

For code points outside U+0000 to U+FFFF, such as musical symbols, emoji and some rarer Chinese characters, string manipulation can be a little tedious since they take 4 bytes instead of 2.

————————————————————————————————————————————————

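UTF-32: Fixed-width encoding. All code points take 4 bytes. On most Unix-like systems the C wide-character type wchar_t is 32 bits wide and holds UTF-32 values, although files and text on those systems are usually stored as UTF-8.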

Pros

String manipulation is easier since the width is fixed: the Nth code point is always at byte offset 4 × N, so indexing is generally faster than in UTF-8/16.

Cons

Takes the most memory of the three (4 bytes for every code point, including ASCII).
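Both UTF-32 properties, constant-time indexing and the fixed 4-byte cost, show up in a short Python sketch. The helper code_point_at is a hypothetical name, and little-endian byte order without a byte order mark is assumed.

```python
import struct

text = "a\u20ac\U0001F600"          # 'a', euro sign, an emoji
data = text.encode("utf-32-le")     # 3 code points * 4 bytes = 12 bytes


def code_point_at(raw: bytes, index: int) -> int:
    """Return the code point at position index in UTF-32-LE data.

    Hypothetical helper: because every code point is exactly 4 bytes,
    the offset is simply index * 4 and no scanning is needed.
    """
    (value,) = struct.unpack_from("<I", raw, index * 4)
    return value


print(len(data))                        # 12
print(f"U+{code_point_at(data, 2):04X}")  # U+1F600
```

The same lookup in UTF-8 or UTF-16 would have to walk the data from the start, since earlier characters may be wider or narrower than later ones.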
