UTF8

Character Sets and Character Encodings : UTF-8 and ISO8859-1

1. Definitions

1.1 Character Sets

A character set is a collection of characters (letters, symbols, numbers, etc.).

Each character is represented by a number.
e.g. 65=A, 66=B, 67=C, ... 1234=Ӓ, ...
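
As a small illustration of the character-to-number mapping above, here is a minimal Python sketch. The variable names are made up for the example; `ord` and `chr` are Python built-ins.

```python
# Character -> number, and number -> character.
a_number = ord("A")      # 65
b_char = chr(66)         # "B"
cyrillic = chr(1234)     # "Ӓ", i.e. U+04D2

print(a_number, b_char, cyrillic, hex(1234))
```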

Examples of character sets are:
ASCII, ISO 8859-1, Windows 1252

1.2 Character Encoding

A character encoding is a means of representing a character set in a computer file.

For the ASCII and Windows 1252 (or ANSI) character sets it's easy: 1 byte = 1 character.

For large character sets, with more than 256 characters, it is more complex, as more than 1 byte per character is used.

"UTF-8" uses 1, 2, 3 or 4 bytes per character.

1.3 Character Entities in HTML and XML

An entity reference is of the form:
&quot; &euro; &copy;

A numeric reference (in HTML and XML) for character 255 is of the form:
&#255; (decimal) or &#xFF; (hexadecimal)
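
To see how entity and numeric references resolve to characters, the following sketch uses `html.unescape` from the Python standard library. The helper `numeric_reference` is a hypothetical name introduced only for this example.

```python
import html

# Named entity references resolve to single characters.
named = html.unescape("&quot; &euro; &copy;")

# Decimal and hexadecimal numeric references for character 255.
decimal_ref = html.unescape("&#255;")
hex_ref = html.unescape("&#xFF;")

def numeric_reference(ch: str) -> str:
    """Build a decimal numeric reference for one character (illustrative helper)."""
    return f"&#{ord(ch)};"

print(named, decimal_ref, hex_ref, numeric_reference("\u00ff"))
```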

2. Character Sets

2.1 ASCII

ASCII is the original Character Set, with 128 characters defined.

1 byte = 1 character.

2.2 ISO 8859-1

This is the ISO "Western European" character set.

It is the original "web" character set, and used as the default by older browsers.

The 256 characters of ISO 8859-1 are the first 256 characters (#0 to #255) of the larger UCS/Unicode character set.

It uses the same character numbers as Unicode for codes #0 to #255, but a different character encoding from UTF-8: always 1 byte per character.
It is now deprecated (obsolete) - use UTF-8 instead.

2.3 Unicode and UCS (Universal Character Set)

This is a very large character set. It is a combination of the ISO 8859-1 characters,
plus mathematical and other symbols,
plus the Chinese, Hebrew, Japanese, Greek, Thai, Persian and other alphabets.

Some special cases :
- there are spaces reserved for 'user defined' characters,
- some characters can be combined to make composite characters (e.g. e and an accent to make e acute is 2 characters in the file, but 1 on the screen)
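
The composite character case can be sketched in Python with the standard `unicodedata` module. This assumes the usual Unicode normalization form NFC, which combines a base letter and an accent into one character where a combined form exists.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: 2 characters in the file, 1 on screen.
decomposed = "e\u0301"

# NFC normalization merges them into the single character e acute (U+00E9).
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed), composed == "\u00e9")
```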

Unicode/UCS is a character set. It is most commonly encoded using UTF-8 (UTF-16 and UTF-32 are other encodings of the same character set).

2.4 Windows 1252 / ANSI Character Set

This is the Windows character set.

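It is encoded using 1 byte (0-255) per character.

From 0-127, it's the same as ASCII.

Between 0x80 and 0x9F there are differences. This is the problem area: Windows 1252 puts printable characters here (e.g. the euro sign and the 'curly' quotes), whereas ISO 8859-1 and Unicode have no printable characters at these positions (Unicode reserves them for control codes).

From 0xA0 and above, it's the same as ISO 8859-1.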
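
The difference between the two character sets can be checked with Python's built-in codecs, where `cp1252` is Python's name for Windows 1252. This is only a sketch; the byte values are chosen for illustration.

```python
# Two bytes: 0x80 (in the problem area) and 0xA3 (above 0xA0).
raw = bytes([0x80, 0xA3])

as_windows = raw.decode("cp1252")       # euro sign, pound sign
as_iso = raw.decode("iso-8859-1")       # control code U+0080, pound sign

print([hex(ord(ch)) for ch in as_windows])   # ['0x20ac', '0xa3']
print([hex(ord(ch)) for ch in as_iso])       # ['0x80', '0xa3']

# From 0xA0 upwards the two character sets agree byte for byte.
upper = bytes(range(0xA0, 0x100))
same_above_a0 = upper.decode("cp1252") == upper.decode("iso-8859-1")
print(same_above_a0)                         # True
```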

3. Character Encoding

3.1 UTF-8

UTF-8 is actually a character encoding, not a character set. Colloquially, it is now used to mean "Unicode/UCS with the UTF-8 encoding"

It's a means of using 1, 2, 3 or 4 bytes per character to store a very large character set.

ASCII characters (0-127) take up 1 byte, so it's backwards compatible with ASCII.

£, maths symbols, Chinese and Japanese characters take up 2, 3 or 4 bytes
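
The byte counts can be checked directly. The sample characters below are chosen for illustration and are not taken from any particular file.

```python
# One sample character for each UTF-8 length.
samples = ["A", "£", "€", "中", "😀"]

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s), hex {ch.encode('utf-8').hex()}")
```

With these samples, the ASCII letter takes 1 byte, the pound sign 2, the euro sign and the Chinese character 3 each, and the emoji 4.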

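Some Windows editors write a 'BOM' (byte order mark), a marker at the front of a file to indicate that the file contains UTF-8 encoded characters. It is the character U+FEFF, which is stored in UTF-8 as the 3 bytes 0xEF 0xBB 0xBF. The Unicode standard permits a BOM in UTF-8 but neither requires nor recommends it, and some software does not expect it.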
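
If a file may or may not start with a BOM, one way to cope in Python is the `utf-8-sig` codec, which drops a leading BOM when decoding. A minimal sketch, using made-up file content:

```python
import codecs

# File content as a BOM-writing editor might save it.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = data.decode("utf-8")         # BOM survives as the character U+FEFF
stripped = data.decode("utf-8-sig")  # BOM removed if present

print(codecs.BOM_UTF8)               # b'\xef\xbb\xbf'
print(repr(plain), repr(stripped))
```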

4. Character Set Conversion Problems

Whatever you've used in the past, UTF-8 (Unicode/UCS) is the thing to aim for. Google, Blogger etc. all use it.

4.1 From Windows 1252 / ANSI to the ISO character sets

If converting text from a Windows file to a web page in ISO format, you may have to map some 'high byte' characters, e.g. the euro symbol, as the character numbers will not be the same.

If copying and pasting, Windows will take care of the conversion for you.

4.2 From ISO 8859-1 to UTF-8/UCS/Unicode

Viewing an ISO 8859-1 file in a web browser page set to UTF-8 will display any characters greater than 127 as illegal characters.

In terms of character sets, the conversion is straightforward, as there are "no" differences you are likely to encounter.

The character encoding is the problem. Example: In ISO 8859-1, character(165) is stored as the single byte 165 (0xA5). In UTF-8, it should be 2 bytes (0xC2 0xA5). The single byte on its own is an illegal UTF-8 sequence.

The solution is programming language dependent or editor dependent.

For example, in the Notepad++ editor, there is a 'convert ANSI to UTF-8' option.

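In Perl (for a string of ISO 8859-1 bytes, turning each high byte into its 2 byte UTF-8 form): $string =~ s/ ( [\x80-\xff] ) / chr(0xC0 | (ord($1) >> 6)) . chr(0x80 | (ord($1) & 0x3F)) /gxe;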
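
The same conversion can be sketched in Python using the built-in codecs. The function name `latin1_to_utf8` is hypothetical, chosen for this example.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO 8859-1 bytes as UTF-8 bytes."""
    return data.decode("iso-8859-1").encode("utf-8")

yen = bytes([165])              # character 165 as stored in ISO 8859-1
print(latin1_to_utf8(yen))      # b'\xc2\xa5'

# The single byte on its own is not valid UTF-8.
try:
    yen.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)       # False
```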

4.3 UTF-8 to ISO 8859-1

Viewing a UTF-8 file in a web browser page set to ISO 8859-1 will display 2 (or more) characters for each UTF-8 'hi byte' character.
e.g. For 2 byte UTF-8 characters, it will display two accented or symbol characters in place of the one you want: £ appears as Â£ and é appears as Ã©.
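
This effect is easy to reproduce: encode text as UTF-8, then decode the bytes as if they were ISO 8859-1. A short sketch with made-up sample text:

```python
def misread_as_latin1(text: str) -> str:
    """Show what UTF-8 bytes look like when wrongly read as ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")

print(misread_as_latin1("£"))   # Â£
print(misread_as_latin1("é"))   # Ã©

# As long as the text has not been altered, the damage is reversible.
repaired = misread_as_latin1("café").encode("iso-8859-1").decode("utf-8")
print(repaired)                 # café
```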

The solution: First, identify all characters in your input stream that don't have ISO 8859-1 equivalents.

Maybe convert:
- all the exotic UTF-8 bullet points to &#nn;
- the exotic hyphens to - (minus sign)
- the various 6, 66, 9, 99 style quotes to ' and "

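For XML feeds with character codes greater than 255, consider using &#nn; escape sequences rather than &name; or the raw UTF-8 bytes, both of which will cause problems: XML predefines only a handful of named entities (such as &amp; and &lt;), and the raw bytes are not valid in an ISO 8859-1 file.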
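
One possible way to combine these ideas in Python is sketched below. The mapping table and the function name `to_latin1_xml` are assumptions made for this example, not a complete list. Characters that are not in the table and have no ISO 8859-1 equivalent are written as &#nn; references by the standard `xmlcharrefreplace` error handler.

```python
# Illustrative mapping only: dashes and curly quotes to plain equivalents.
PLAIN_EQUIVALENTS = {
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quote (6)
    "\u2019": "'",   # right single quote (9)
    "\u201c": '"',   # left double quote (66)
    "\u201d": '"',   # right double quote (99)
}

def to_latin1_xml(text: str) -> bytes:
    """Map known characters, then encode as ISO 8859-1 with &#nn; fallbacks."""
    mapped = text.translate(str.maketrans(PLAIN_EQUIVALENTS))
    return mapped.encode("iso-8859-1", errors="xmlcharrefreplace")

sample = "\u201ccaf\u00e9\u201d \u2013 \u20ac5 \u2022"
print(to_latin1_xml(sample))   # b'"caf\xe9" - &#8364;5 &#8226;'
```

Note that this sketch does not escape XML markup characters such as & and <; that is a separate step.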

5. Case Study

A well known international newspaper has a publishing system that uses UTF-8, and a series of XML feeds that use ISO 8859-1

Analyse an entire year's worth of newspaper articles.
- Make a list of every unique character used.
- Cater for &#nn; and &name; style characters.
Map all the UTF-8 characters found (with character code greater than 127) to ISO 8859-1 equivalents.
Flag up any UTF-8 characters encountered in the conversion process which are not covered by this mapping.
- Again, cater for &#nn; and &name; characters.
Escape all characters greater than 127 with the XML &#nn; escape sequence, so the output file is pure ASCII.
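
The steps above could be sketched in Python roughly as follows. All function names, the sample article and the mapping are hypothetical, and the real publishing system is not described in enough detail to say how it was actually done. `html.unescape` is used so that &#nn; and &name; style characters are counted as the characters they stand for.

```python
import html
from collections import Counter

def char_inventory(articles):
    """Count every non-ASCII character, including ones written as entities."""
    counts = Counter()
    for text in articles:
        counts.update(ch for ch in html.unescape(text) if ord(ch) > 127)
    return counts

def unmapped(counts, mapping):
    """Characters outside ISO 8859-1 that the mapping does not cover."""
    return sorted(ch for ch in counts if ord(ch) > 255 and ch not in mapping)

def to_ascii_xml(text, mapping):
    """Apply the mapping, then escape everything above 127 as &#nn;."""
    mapped = text.translate(str.maketrans(mapping))
    return mapped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

articles = ["caf\u00e9 \u2019s &eacute; &#8364;"]   # made-up sample article
mapping = {"\u2019": "'"}                           # illustrative mapping

counts = char_inventory(articles)
print(counts)
print(unmapped(counts, mapping))                    # the euro sign is flagged
print(to_ascii_xml("caf\u00e9 \u2019s", mapping))   # caf&#233; 's
```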

6. Appendix : Differences between Windows 1252 and the ISO Character Sets

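These character positions (0x80 to 0x9f) hold no printable characters in ISO 8859-1 or Unicode (Unicode reserves them for control codes). The table below lists each Windows 1252 byte with its Unicode character number. The five positions shown mapped to themselves (0x81, 0x8d, 0x8f, 0x90, 0x9d) are unused in Windows 1252.

In practice, you may wish to map characters like ‘ and ’ style quotes to ' etc.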

0x80 0x20ac ;Euro Sign
0x81 0x0081
0x82 0x201a ;Single Low-9 Quotation Mark
0x83 0x0192 ;Latin Small Letter F With Hook
0x84 0x201e ;Double Low-9 Quotation Mark
0x85 0x2026 ;Horizontal Ellipsis
0x86 0x2020 ;Dagger
0x87 0x2021 ;Double Dagger
0x88 0x02c6 ;Modifier Letter Circumflex Accent
0x89 0x2030 ;Per Mille Sign
0x8a 0x0160 ;Latin Capital Letter S With Caron
0x8b 0x2039 ;Single Left-Pointing Angle Quotation Mark
0x8c 0x0152 ;Latin Capital Ligature Oe
0x8d 0x008d
0x8e 0x017d ;Latin Capital Letter Z With Caron
0x8f 0x008f
0x90 0x0090
0x91 0x2018 ;Left Single Quotation Mark
0x92 0x2019 ;Right Single Quotation Mark
0x93 0x201c ;Left Double Quotation Mark
0x94 0x201d ;Right Double Quotation Mark
0x95 0x2022 ;Bullet
0x96 0x2013 ;En Dash
0x97 0x2014 ;Em Dash
0x98 0x02dc ;Small Tilde
0x99 0x2122 ;Trade Mark Sign
0x9a 0x0161 ;Latin Small Letter S With Caron
0x9b 0x203a ;Single Right-Pointing Angle Quotation Mark
0x9c 0x0153 ;Latin Small Ligature Oe
0x9d 0x009d
0x9e 0x017e ;Latin Small Letter Z With Caron
0x9f 0x0178 ;Latin Capital Letter Y With Diaeresis
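
A table like the one above can be regenerated from Python's own `cp1252` codec and the `unicodedata` module, as a cross-check. This is a sketch: the function name is made up, and Python's codec rejects the five unused positions instead of mapping them to themselves.

```python
import unicodedata

def cp1252_problem_area():
    """Return (byte, unicode_number_or_None, name) for bytes 0x80 to 0x9f."""
    rows = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python treats the unused positions as undecodable.
            rows.append((byte, None, "(unused in Windows 1252)"))
        else:
            rows.append((byte, ord(ch), unicodedata.name(ch).title()))
    return rows

for byte, code, name in cp1252_problem_area():
    code_text = "------" if code is None else f"0x{code:04x}"
    print(f"0x{byte:02x} {code_text} ;{name}")
```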

Cablechip Solutions

web development with Unix, Perl, Javascript, HTML and web services
