| .\" Copyright (C) Markus Kuhn, 1995, 2001 |
| .\" |
| .\" %%%LICENSE_START(GPLv2+_DOC_FULL) |
| .\" This is free documentation; you can redistribute it and/or |
| .\" modify it under the terms of the GNU General Public License as |
| .\" published by the Free Software Foundation; either version 2 of |
| .\" the License, or (at your option) any later version. |
| .\" |
| .\" The GNU General Public License's references to "object code" |
| .\" and "executables" are to be interpreted as the output of any |
| .\" document formatting or typesetting system, including |
| .\" intermediate and printed output. |
| .\" |
| .\" This manual is distributed in the hope that it will be useful, |
| .\" but WITHOUT ANY WARRANTY; without even the implied warranty of |
| .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| .\" GNU General Public License for more details. |
| .\" |
| .\" You should have received a copy of the GNU General Public |
| .\" License along with this manual; if not, see |
| .\" <http://www.gnu.org/licenses/>. |
| .\" %%%LICENSE_END |
| .\" |
| .\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> |
| .\" First version written |
| .\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> |
| .\" Update |
| .\" |
| .TH UNICODE 7 2019-03-06 "GNU" "Linux Programmer's Manual" |
| .SH NAME |
| unicode \- universal character set |
| .SH DESCRIPTION |
| The international standard ISO 10646 defines the |
| Universal Character Set (UCS). |
| UCS contains all characters of all other character set standards. |
| It also guarantees "round-trip compatibility"; |
| in other words, |
| conversion tables can be built such that no information is lost |
| when a string is converted from any other encoding to UCS and back. |
| .PP |
| UCS contains the characters required to represent practically all |
| known languages. |
| This includes not only the Latin, Greek, Cyrillic, |
| Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, |
| Japanese and Korean Han ideographs as well as scripts such as |
| Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, |
| Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, |
| Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, |
| Ogham, Myanmar, Sinhala, Thaana, Yi, and others. |
| For scripts not yet |
| covered, research on how to best encode them for computer usage is |
| still going on and they will be added eventually. |
| This might |
| eventually include not only Hieroglyphs and various historic |
| Indo-European languages, but even some selected artistic scripts such |
| as Tengwar, Cirth, and Klingon. |
| UCS also covers a large number of |
| graphical, typographical, mathematical, and scientific symbols, |
| including those provided by TeX, Postscript, APL, MS-DOS, MS-Windows, |
| Macintosh, OCR fonts, as well as many word processing and publishing |
| systems, and more are being added. |
| .PP |
| The UCS standard (ISO 10646) describes a |
| 31-bit character set architecture |
| consisting of 128 24-bit |
| .IR groups , |
| each divided into 256 16-bit |
| .I planes |
| made up of 256 8-bit |
| .I rows |
| with 256 |
| .I column |
| positions, one for each character. |
| Part 1 of the standard (ISO 10646-1) |
| defines the first 65534 code positions (0x0000 to 0xfffd), which form |
| the |
| .IR "Basic Multilingual Plane" |
| (BMP), that is plane 0 in group 0. |
| Part 2 of the standard (ISO 10646-2) |
| adds characters to group 0 outside the BMP in several |
| .I "supplementary planes" |
| in the range 0x10000 to 0x10ffff. |
| There are no plans to add characters |
| beyond 0x10ffff to the standard, therefore of the entire code space, |
| only a small fraction of group 0 will ever be actually used in the |
| foreseeable future. |
| The BMP contains all characters found in the |
| commonly used other character sets. |
| The supplementary planes added by |
| ISO 10646-2 cover only more exotic characters for special scientific, |
| dictionary printing, publishing industry, higher-level protocol and |
| enthusiast needs. |
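| .PP |
| As a minimal sketch of the architecture described above, a 31-bit code |
| value can be split into its group, plane, row, and column octets with |
| shifts and masks. |
| The names ucs_position, ucs_split, and ucs_is_bmp are hypothetical |
| helpers chosen for this illustration; they are not part of any library. |

```c
#include <stdint.h>

/* Hypothetical helper type: the four octets of a 31-bit UCS code value. */
struct ucs_position {
    unsigned group;  /* bits 24..30: one of 128 groups */
    unsigned plane;  /* bits 16..23: one of 256 planes in the group */
    unsigned row;    /* bits 8..15: one of 256 rows in the plane */
    unsigned cell;   /* bits 0..7: column position within the row */
};

/* Split a code value into group, plane, row, and column position. */
static struct ucs_position ucs_split(uint32_t code)
{
    struct ucs_position p;

    p.group = (code >> 24) & 0x7f;
    p.plane = (code >> 16) & 0xff;
    p.row = (code >> 8) & 0xff;
    p.cell = code & 0xff;
    return p;
}

/* Nonzero if the code value lies in plane 0 of group 0 (the BMP). */
static int ucs_is_bmp(uint32_t code)
{
    return code <= 0xffff;
}
```

| .PP |
| For example, ucs_split(0x10ffff) yields group 0, plane 0x10, row 0xff, |
| and column 0xff, which is the last position of the last supplementary plane. |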
| .PP |
| The representation of each UCS character as a 2-byte word is referred |
| to as the UCS-2 form (only for BMP characters), |
| whereas UCS-4 is the representation of each character by a 4-byte word. |
| In addition, there exist two encoding forms: UTF-8, |
| for backward compatibility with ASCII processing software, and UTF-16, |
| for the backward-compatible handling of non-BMP characters up to |
| 0x10ffff by UCS-2 software. |
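| .PP |
| The two encoding forms can be sketched as follows. |
| The bit layouts are taken from the general definitions of UTF-8 and |
| UTF-16 rather than from this page, and the function names utf8_encode |
| and utf16_encode are hypothetical. |
| Both functions reject surrogate code values (0xd800 to 0xdfff) and |
| values above 0x10ffff by returning 0. |

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code value as UTF-8.
   Returns the number of bytes written (1..4), or 0 if c is invalid. */
static size_t utf8_encode(uint32_t c, unsigned char out[4])
{
    if (c < 0x80) {                      /* ASCII is passed through */
        out[0] = (unsigned char) c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char) (0xc0 | (c >> 6));
        out[1] = (unsigned char) (0x80 | (c & 0x3f));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xd800 && c <= 0xdfff)  /* surrogates are not characters */
            return 0;
        out[0] = (unsigned char) (0xe0 | (c >> 12));
        out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (c & 0x3f));
        return 3;
    }
    if (c <= 0x10ffff) {
        out[0] = (unsigned char) (0xf0 | (c >> 18));
        out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char) (0x80 | (c & 0x3f));
        return 4;
    }
    return 0;
}

/* Encode one code value as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if invalid. */
static size_t utf16_encode(uint32_t c, uint16_t out[2])
{
    if (c < 0x10000) {
        if (c >= 0xd800 && c <= 0xdfff)
            return 0;
        out[0] = (uint16_t) c;           /* BMP: same value as UCS-2 */
        return 1;
    }
    if (c <= 0x10ffff) {
        c -= 0x10000;                    /* 20 bits remain */
        out[0] = (uint16_t) (0xd800 | (c >> 10));    /* high surrogate */
        out[1] = (uint16_t) (0xdc00 | (c & 0x3ff));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

| .PP |
| A BMP character therefore occupies one 16-bit unit in UTF-16, exactly as |
| in UCS-2, while a character from a supplementary plane occupies two. |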
| .PP |
| The UCS characters 0x0000 to 0x007f are identical to those of the |
| classic US-ASCII |
| character set, and the characters in the range 0x0000 to 0x00ff |
| are identical to those in |
| ISO 8859-1 (Latin-1). |
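| .PP |
| Because of this identity, converting ISO 8859-1 text to UCS needs no |
| lookup table: each byte is simply widened. |
| A minimal sketch, with the hypothetical function name latin1_to_ucs4: |

```c
#include <stddef.h>
#include <stdint.h>

/* Widen n ISO 8859-1 bytes to UCS-4 code values.
   dst must have room for n elements. */
static void latin1_to_ucs4(const unsigned char *src, size_t n, uint32_t *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];    /* the byte value is the UCS code value */
}
```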
| .SS Combining characters |
| Some code points in UCS |
| have been assigned to |
| .IR "combining characters" . |
| These are similar to the nonspacing accent keys on a typewriter. |
| A combining character just adds an accent to the previous character. |
| The most important accented characters have codes of their own in UCS; |
| however, the combining character mechanism allows us to add accents |
| and other diacritical marks to any character. |
| The combining characters |
| always follow the character which they modify. |
| For example, the German |
| character Umlaut-A ("Latin capital letter A with diaeresis") can |
| either be represented by the precomposed UCS code 0x00c4, or |
| alternatively as the combination of a normal "Latin capital letter A" |
| followed by a "combining diaeresis": 0x0041 0x0308. |
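| .PP |
| The two representations can be written as wide-character arrays, as in |
| the sketch below. |
| The helper names are hypothetical, and the range check covers only the |
| Combining Diacritical Marks block (0x0300 to 0x036f), which is an |
| illustrative subset of all combining characters, not a complete test. |

```c
#include <wchar.h>

/* "Latin capital letter A with diaeresis", precomposed: one code value. */
static const wchar_t a_umlaut_precomposed[] = { 0x00c4, 0 };

/* "Latin capital letter A" followed by "combining diaeresis". */
static const wchar_t a_umlaut_combining[] = { 0x0041, 0x0308, 0 };

/* Nonzero if code lies in the Combining Diacritical Marks block.
   Assumption: this subset is enough for the illustration; other
   combining characters exist elsewhere in UCS. */
static int is_combining_diacritic(wchar_t code)
{
    return code >= 0x0300 && code <= 0x036f;
}

/* Count the code values in s that are not combining diacritics. */
static size_t count_base_chars(const wchar_t *s)
{
    size_t n = 0;

    for (; *s != 0; s++)
        if (!is_combining_diacritic(*s))
            n++;
    return n;
}
```

| .PP |
| The two arrays differ in length and compare as unequal with wcscmp(), |
| although both denote the same character; software that compares such |
| strings usually has to normalize them first. |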
| .PP |
| Combining characters are essential, for instance, for encoding the Thai |
| script, for mathematical typesetting, and for users of the International |
| Phonetic Alphabet. |
| .SS Implementation levels |
| As not all systems are expected to support advanced mechanisms like |
| combining characters, ISO 10646-1 specifies the following three |
| .I implementation levels |
| of UCS: |
| .TP 0.9i |
| Level 1 |
| Combining characters and Hangul Jamo |
| (a variant encoding of the Korean script, where a Hangul syllable |
| glyph is coded as a triplet or pair of vowel/consonant codes) are not |
| supported. |
| .TP |
| Level 2 |
| In addition to level 1, combining characters are allowed for some |
| languages where they are essential (e.g., Thai, Lao, Hebrew, |
| Arabic, Devanagari, Malayalam). |
| .TP |
| Level 3 |
| All UCS characters are supported. |
| .PP |
| The Unicode 3.0 Standard |
| published by the Unicode Consortium |
| contains exactly the UCS Basic Multilingual Plane |
| at implementation level 3, as described in ISO 10646-1:2000. |
| Unicode 3.1 added the supplementary planes of ISO 10646-2. |
| The Unicode standard and |
| technical reports published by the Unicode Consortium provide much |
| additional information on the semantics and recommended usages of |
| various characters. |
| They provide guidelines and algorithms for |
| editing, sorting, comparing, normalizing, converting, and displaying |
| Unicode strings. |
| .SS Unicode under Linux |
| Under GNU/Linux, the C type |
| .I wchar_t |
| is a signed 32-bit integer type. |
| Its values are always interpreted |
| by the C library as UCS |
| code values (in all locales), a convention that is signaled by the GNU |
| C library to applications by defining the constant |
| .B __STDC_ISO_10646__ |
| as specified in the ISO C99 standard. |
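| .PP |
| A program that depends on this convention can test for the macro at |
| compile time. |
| The following is a minimal sketch; the function names wchar_is_ucs and |
| iso_10646_version are hypothetical. |

```c
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
   ISO 10646 code values in all locales, 0 otherwise. */
static int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Returns the value of __STDC_ISO_10646__ (a date of the form yyyymmL
   naming the supported revision of the standard), or 0 if undefined. */
static long iso_10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0;
#endif
}
```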
| .PP |
| UCS/Unicode can be used just like ASCII in input/output streams, |
| terminal communication, plaintext files, filenames, and environment |
| variables in the ASCII compatible UTF-8 multibyte encoding. |
| To signal the use of UTF-8 as the character |
| encoding to all applications, a suitable |
| .I locale |
| has to be selected via environment variables (e.g., |
| "LANG=en_GB.UTF-8"). |
| .PP |
| The |
| .B nl_langinfo(CODESET) |
| function returns the name of the selected encoding. |
| Library functions such as |
| .BR wctomb (3) |
| and |
| .BR mbsrtowcs (3) |
| can be used to transform the internal |
| .I wchar_t |
| characters and strings into the system character encoding and back, |
| and |
| .BR wcwidth (3) |
| tells how many positions (0\(en2) the cursor is advanced by the |
| output of a character. |
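| .PP |
| The calls above might be combined as in the following sketch. |
| The wrapper names current_codeset, wide_to_multibyte, and |
| multibyte_to_wide are hypothetical. |
| It is assumed that the program has already called setlocale(LC_ALL, "") |
| so that the locale from the environment is in effect; without that |
| call, the conversions use the default "C" locale. |

```c
#include <langinfo.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Name of the encoding selected by the current locale. */
static const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Convert one wide character to the locale's multibyte encoding.
   buf must hold at least MB_LEN_MAX bytes.
   Returns the number of bytes stored, or -1 if wc cannot be
   represented in the current encoding. */
static int wide_to_multibyte(wchar_t wc, char *buf)
{
    return wctomb(buf, wc);
}

/* Convert a multibyte string to wide characters.
   Stores at most dstlen wide characters in dst and returns the number
   stored, not counting the terminating null, or (size_t) -1 if src
   contains an invalid multibyte sequence. */
static size_t multibyte_to_wide(const char *src, wchar_t *dst, size_t dstlen)
{
    mbstate_t state;

    memset(&state, 0, sizeof state);    /* start in the initial shift state */
    return mbsrtowcs(dst, &src, dstlen, &state);
}
```

| .PP |
| In a UTF-8 locale, wide_to_multibyte() produces between one and four |
| bytes for each character. |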
| .SS Private Use Areas (PUA) |
| In the Basic Multilingual Plane, |
| the range 0xe000 to 0xf8ff will never be assigned to any characters by |
| the standard and is reserved for private usage. |
| For the Linux |
| community, this private area has been subdivided further into the |
| range 0xe000 to 0xefff which can be used individually by any end-user |
| and the Linux zone in the range 0xf000 to 0xf8ff where extensions are |
| coordinated among all Linux users. |
| The registry of the characters |
| assigned to the Linux zone is maintained by LANANA and the registry |
| itself is |
| .I Documentation/admin\-guide/unicode.rst |
| in the Linux kernel sources |
| .\" commit 9d85025b0418163fae079c9ba8f8445212de8568 |
| (or |
| .I Documentation/unicode.txt |
| before Linux 4.10). |
| .PP |
| Two other planes are reserved for private usage, plane 15 |
| (Supplementary Private Use Area-A, range 0xf0000 to 0xffffd) |
| and plane 16 (Supplementary Private Use Area-B, range |
| 0x100000 to 0x10fffd). |
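| .PP |
| The ranges listed in this section can be expressed as a small |
| classification function. |
| This is only a sketch; the enumeration pua_zone and the function |
| pua_classify are hypothetical names. |

```c
#include <stdint.h>

/* Hypothetical labels for the private-use ranges described above. */
enum pua_zone {
    PUA_NONE,       /* not a private-use code value */
    PUA_END_USER,   /* 0xe000..0xefff: free for any end-user */
    PUA_LINUX,      /* 0xf000..0xf8ff: Linux zone, coordinated registry */
    PUA_PLANE_15,   /* 0xf0000..0xffffd: Supplementary Private Use Area-A */
    PUA_PLANE_16    /* 0x100000..0x10fffd: Supplementary Private Use Area-B */
};

/* Report which private-use range, if any, contains the code value. */
static enum pua_zone pua_classify(uint32_t code)
{
    if (code >= 0xe000 && code <= 0xefff)
        return PUA_END_USER;
    if (code >= 0xf000 && code <= 0xf8ff)
        return PUA_LINUX;
    if (code >= 0xf0000 && code <= 0xffffd)
        return PUA_PLANE_15;
    if (code >= 0x100000 && code <= 0x10fffd)
        return PUA_PLANE_16;
    return PUA_NONE;
}
```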
| .SS Literature |
| .IP * 3 |
| Information technology \(em Universal Multiple-Octet Coded Character |
| Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane. |
| International Standard ISO/IEC 10646-1, International Organization |
| for Standardization, Geneva, 2000. |
| .IP |
| This is the official specification of UCS. |
| Available from |
| .UR http://www.iso.ch/ |
| .UE . |
| .IP * |
| The Unicode Standard, Version 3.0. |
| The Unicode Consortium, Addison-Wesley, |
| Reading, MA, 2000, ISBN 0-201-61633-5. |
| .IP * |
| S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition, |
| Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3. |
| .IP |
| A good reference book about the C programming language. |
| The fourth |
| edition covers the 1994 Amendment 1 to the ISO C90 standard, which |
| adds a large number of new C library functions for handling wide and |
| multibyte character encodings, but it does not yet cover ISO C99, |
| which improved wide and multibyte character support even further. |
| .IP * |
| Unicode Technical Reports. |
| .RS |
| .UR http://www.unicode.org\:/reports/ |
| .UE |
| .RE |
| .IP * |
| Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux. |
| .RS |
| .UR http://www.cl.cam.ac.uk\:/~mgk25\:/unicode.html |
| .UE |
| .RE |
| .IP * |
| Bruno Haible: Unicode HOWTO. |
| .RS |
| .UR http://www.tldp.org\:/HOWTO\:/Unicode\-HOWTO.html |
| .UE |
| .RE |
| .\" .SH AUTHOR |
| .\" Markus Kuhn <mgk25@cl.cam.ac.uk> |
| .SH SEE ALSO |
| .BR locale (1), |
| .BR setlocale (3), |
| .BR charsets (7), |
| .BR utf-8 (7) |