Unicode

Introduction
ASCII
Back in the old days of punched cards and paper tape, a 7-bit character coding system was used. 7 bits gave 2^7 = 128 possible combinations, enough for 26×2 letters, 10 digits, and 33 punctuation marks and symbols (counting the space). The remaining 33 codes were used as control characters, e.g. line feed, tab, bell etc. The DEL code was taken as the last character, number 127, that is 1111111 in binary. This meant all 7 hole positions representing the character on the tape were punched out, so any mistake could be deleted by punching over it.
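The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example, and the split into "control" and "printable" codes follows the usual convention (codes 0 to 31 plus 127 are control characters).

```python
# The 7-bit ASCII code space: 2**7 = 128 values, numbered 0..127.
code_space = 2 ** 7

# Control characters: codes 0..31 plus DEL (127).
control_codes = [c for c in range(code_space) if c < 32 or c == 127]

# Everything else is printable: letters, digits, punctuation, symbols, space.
printable_codes = [c for c in range(code_space) if c not in control_codes]

letters = [c for c in printable_codes if chr(c).isalpha()]
digits = [c for c in printable_codes if chr(c).isdigit()]
others = [c for c in printable_codes if c not in letters and c not in digits]

# DEL is the highest code: all seven bits are set.
del_bits = format(127, "07b")

print(code_space, len(control_codes), len(printable_codes))  # 128 33 95
print(len(letters), len(digits), len(others))                # 52 10 33
print(del_bits)                                              # 1111111
```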

Extended ASCII
European languages require letters with diacritical marks (accents), for example: in France à, é, â, ç (grave, acute, circumflex, cedilla); in Spain ñ (tilde); in Germany ä (diaeresis, or umlaut); in Scandinavia å (ring). Other characters are required as well, such as æ, ¥, ß, £, ©, etc., so the 7-bit system was extended to 8 bits, with these new characters taking values from 128 to 255. This is a clean result, as each character is now represented by exactly one byte. For many years this was the system in general use in the western world.
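As a small illustration of "one character, one byte", the sketch below encodes some of the characters mentioned above with Python's built-in "latin-1" codec (ISO 8859-1, one common 8-bit extension of ASCII). The choice of Latin-1 is an assumption made for the example; other 8-bit sets assign the upper values differently.

```python
# Characters from the text above, all present in Latin-1 (ISO 8859-1).
text = "àâçñäåæ¥ß£©"

encoded = text.encode("latin-1")

# One byte per character...
one_byte_each = len(encoded) == len(text)

# ...and every one of them sits in the extended range 128..255.
all_extended = all(128 <= b <= 255 for b in encoded)

# Plain ASCII characters keep their original 7-bit values.
ascii_unchanged = "A".encode("latin-1") == "A".encode("ascii")

for ch, b in zip(text, encoded):
    print(ch, b, hex(b))

print(one_byte_each, all_extended, ascii_unchanged)  # True True True
```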

However, 256 characters were really not quite enough. So the first 128 characters are usually the same, but the last 128 depend on which language you are using. The Latin-1 set is for Western Europe, Latin-2 for Central and Eastern Europe, Latin-3 covers some additional languages (e.g. Catalan, Turkish) and Latin-4 covers others (e.g. Estonian, Lappish). Other sets exist for Russian etc. An altogether different set, also in common use, is the symbol set, intended mainly for mathematics and containing Greek letters and mathematical operators.
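The "same lower half, different upper half" arrangement can be seen by decoding all 256 byte values under two of these sets. The sketch below uses Python's "latin-1" and "iso8859-2" (Latin-2) codecs; picking these two for comparison is just an assumption for the example.

```python
# Every possible byte value, decoded under two different 8-bit sets.
all_bytes = bytes(range(256))
latin1 = all_bytes.decode("latin-1")    # Latin-1, Western Europe
latin2 = all_bytes.decode("iso8859-2")  # Latin-2, Central and Eastern Europe

# The first 128 positions (plain ASCII) are identical in both sets.
lower_same = latin1[:128] == latin2[:128]

# In the upper 128 positions, many byte values mean different characters.
differing = [b for b in range(128, 256) if latin1[b] != latin2[b]]

print(lower_same)                            # True
print(hex(0xA3), latin1[0xA3], latin2[0xA3])  # 0xa3 £ Ł
print(len(differing), "upper-half values differ between the two sets")
```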

Unicode Standard
The above system becomes a problem if you wish to exchange documents with people who use different character sets. For example, if you are using a Latin-1 font and your friend has used a Latin-2 font, then an accented letter in your friend's document can appear as an entirely different character for you. A second problem is the thousands of characters used in China, Japan and Korea (CJK), for which other systems exist.
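The exchange problem can be reproduced in a few lines. In this sketch the friend's text (the Polish place name Łódź, chosen purely as an example) is stored as Latin-2 bytes and then read back as if it were Latin-1. The bytes are untouched; only the interpretation changes.

```python
# The friend types a word using the Latin-2 (ISO 8859-2) character set.
original = "Łódź"
stored = original.encode("iso8859-2")

# You open the same bytes assuming Latin-1 (ISO 8859-1).
misread = stored.decode("latin-1")

# Reading with the correct character set recovers the text.
recovered = stored.decode("iso8859-2")

print(stored.hex())  # a3f364bc
print(misread)       # £ód¼
print(recovered)     # Łódź
```

Nothing in the byte stream itself says which set was intended, which is why such documents need out-of-band agreement on the character set.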

Unicode provides a consistent way of encoding multilingual plain text and brings order to the chaotic state of affairs outlined above. The Unicode Standard provides the capacity to uniquely encode all of the characters used for the written languages of the world. As originally designed, it uses a 16-bit (2-byte) encoding allowing for over 65,000 characters; later versions of the standard extended the code space beyond 16 bits.
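A minimal sketch of the 2-byte idea, using Python's "utf-16-be" codec as a stand-in: for characters within the first 65,536 code points it produces exactly two bytes holding the character's number. The sample characters are arbitrary choices for the example.

```python
# 16 bits give 2**16 possible values.
code_units = 2 ** 16

# A Latin letter, an accented letter, and a CJK character.
samples = ["A", "à", "中"]

# Encode each one as a big-endian 16-bit value.
encoded = {ch: ch.encode("utf-16-be") for ch in samples}

print(code_units)  # 65536
for ch, data in encoded.items():
    # The two bytes are simply the character's code in hexadecimal.
    print(ch, f"U+{ord(ch):04X}", data.hex(), len(data))
```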

Each character is assigned a unique code and a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A". The standard also defines rules for composite characters (characters generated by combining others, e.g. à built from a and a combining grave accent). Many such characters also exist in their own right (à, for example, is U+00E0).
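Python's standard unicodedata module exposes both ideas, so they can be tried out directly. The sketch below looks up a character name and shows that the composite form of à (a followed by the combining grave accent U+0300) and the single character U+00E0 are converted into one another by normalization.

```python
import unicodedata

# Every character has a unique name.
name_a = unicodedata.name("\u0041")

# Two ways of writing the same letter:
base_plus_mark = "a\u0300"  # 'a' followed by COMBINING GRAVE ACCENT
precomposed = "\u00e0"      # the single character à

# NFC composes where possible; NFD decomposes.
composed = unicodedata.normalize("NFC", base_plus_mark)
decomposed = unicodedata.normalize("NFD", precomposed)

print(name_a)                               # LATIN CAPITAL LETTER A
print(unicodedata.name(precomposed))        # LATIN SMALL LETTER A WITH GRAVE
print(len(base_plus_mark), len(precomposed))  # 2 1
print(composed == precomposed)              # True
print(decomposed == base_plus_mark)         # True
```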

UCS - Universal (Multiple-Octet Coded) Character Set
More accurately called UCS-4, this is a massive character set taking 31 bits to specify a character. Note that octet is just another name for a byte. UCS-4 allows for 2^31 = 2,147,483,648 code points. The Unicode system, which can be referred to as UCS-2, corresponds exactly to the first 65,536 entries of UCS-4.
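The relationship between the 2-byte and 4-byte forms can be sketched with Python's "utf-16-be" and "utf-32-be" codecs. These are used here only as approximations of UCS-2 and UCS-4, which is an assumption of the example; for characters within the first 65,536 entries the 4-byte form is simply the 2-byte form padded with two leading zero bytes.

```python
# 31 bits give 2**31 possible values.
ucs4_points = 2 ** 31

samples = ["A", "à", "中"]

# For each character: (2-byte form, 4-byte form).
forms = {ch: (ch.encode("utf-16-be"), ch.encode("utf-32-be")) for ch in samples}

# Within the first 65,536 entries, the 4-byte form is the
# 2-byte form with two zero bytes in front.
padded_match = all(four == b"\x00\x00" + two for two, four in forms.values())

print(f"{ucs4_points:,}")  # 2,147,483,648
for ch, (two, four) in forms.items():
    print(ch, two.hex(), four.hex())
print(padded_match)        # True
```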

Useful Links
Unicode

Unicode Map
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
01 11 21 31 41 51 61 71 81 91 A1 B1 C1 D1 E1 F1
02 12 22 32 42 52 62 72 82 92 A2 B2 C2 D2 E2 F2
03 13 23 33 43 53 63 73 83 93 A3 B3 C3 D3 E3 F3
04 14 24 34 44 54 64 74 84 94 A4 B4 C4 D4 E4 F4
05 15 25 35 45 55 65 75 85 95 A5 B5 C5 D5 E5 F5
06 16 26 36 46 56 66 76 86 96 A6 B6 C6 D6 E6 F6
07 17 27 37 47 57 67 77 87 97 A7 B7 C7 D7 E7 F7
08 18 28 38 48 58 68 78 88 98 A8 B8 C8 D8 E8 F8
09 19 29 39 49 59 69 79 89 99 A9 B9 C9 D9 E9 F9
0A 1A 2A 3A 4A 5A 6A 7A 8A 9A AA BA CA DA EA FA
0B 1B 2B 3B 4B 5B 6B 7B 8B 9B AB BB CB DB EB FB
0C 1C 2C 3C 4C 5C 6C 7C 8C 9C AC BC CC DC EC FC
0D 1D 2D 3D 4D 5D 6D 7D 8D 9D AD BD CD DD ED FD
0E 1E 2E 3E 4E 5E 6E 7E 8E 9E AE BE CE DE EE FE
0F 1F 2F 3F 4F 5F 6F 7F 8F 9F AF BF CF DF EF FF

E&OE!