<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" | |
"http://www.w3.org/TR/REC-html40/loose.dtd"> | |
<html> | |
<head> | |
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> | |
<meta http-equiv="Content-Language" content="en-us"> | |
<meta name="GENERATOR" content="Microsoft FrontPage 6.0"> | |
<meta name="ProgId" content="FrontPage.Editor.Document"> | |
<meta name="keywords" content="unicode, normalization, composition, decomposition"> | |
<meta name="description" content="Specifies the Unicode Normalization Formats"> | |
<title>UCD: Unicode NamesList File Format</title> | |
<link rel="stylesheet" type="text/css" href="http://www.unicode.org/unicode/reports/reports.css"> | |
<style type="text/css"> | |
<!-- | |
.foo { } | |
--> | |
</style> | |
</head> | |
<body bgcolor="#ffffff"> | |
<table class="header"> | |
<tr> | |
<td class="icon"><a href="http://www.unicode.org"><img border="0" src="http://www.unicode.org/webscripts/logo60s2.gif" align="middle" alt="[Unicode]" width="34" height="33"></a> <a class="bar" href="http://www.unicode.org/ucd/">Unicode | |
Character Database</a></td> | |
</tr> | |
<tr> | |
<td class="gray"> </td> | |
</tr> | |
</table> | |
<div class="body"> | |
<h1>Unicode NamesList File Format</h1> | |
<table class="wide"> | |
<tbody> | |
<tr> | |
<td valign="top" width="144">Revision</td> | |
<td valign="top">4.1.0</td> | |
</tr> | |
<tr> | |
<td valign="top" width="144">Authors</td> | |
<td valign="top">Asmus Freytag</td> | |
</tr> | |
<tr> | |
<td valign="top" width="144">Date</td> | |
<td valign="top">2005-3-10</td> | |
</tr> | |
<tr> | |
<td valign="top" width="144">This Version</td> | |
<td valign="top"> | |
<a href="http://www.unicode.org/Public/4.1.0/ucd/NamesList.html">http://www.unicode.org/Public/4.1.0/ucd/NamesList.html</a></td> | |
</tr> | |
<tr> | |
<td valign="top" width="144">Previous Version</td> | |
<td valign="top"><a href="http://www.unicode.org/Public/4.0-Update/NamesList-4.0.0.html">http://www.unicode.org/Public/4.0-Update/NamesList-4.0.0.html</a></td> | |
</tr> | |
<tr> | |
<td valign="top" width="144">Latest Version</td> | |
<td valign="top"><a href="http://www.unicode.org/Public/UNIDATA/NamesList.html">http://www.unicode.org/Public/UNIDATA/NamesList.html</a></td> | |
</tr> | |
</tbody> | |
</table> | |
<h3><br> | |
<i>Summary</i></h3> | |
<blockquote> | |
<p>This file describes the format and contents of NamesList.txt</p> | |
</blockquote> | |
<h3><i>Status</i></h3> | |
<blockquote> | |
<p><i>The file and the files described herein are part of the <a href="http://www.unicode.org/ucd/">Unicode | |
Character Database</a> (UCD) and are governed by the <a href="#Terms of Use">UCD | |
Terms of Use</a> stated at the end.</i></p> | |
</blockquote> | |
<hr width="50%"> | |
<h2>1.0 Introduction</h2> | |
<p>The Unicode name list file NamesList.txt (also NamesList.lst) is a plain | |
text file used to drive the layout of the character code charts in the Unicode | |
Standard. The information in this file is a combination of several fields from | |
the UnicodeData.txt and Blocks.txt files, together with additional annotations | |
for many characters. </p> | |
<p> This document describes the syntax rules for the file | |
format, but also gives brief information on how each construct is rendered | |
when laid out for the book. Some of the syntax elements were used in | |
preparation of the drafts of the book and may not be present in the final, | |
released form of the NamesList.txt file.</p> | |
<p>The same input file can be used to do the draft preparation for ISO/IEC | |
10646 (referred below as ISO-style). This necessitates the presence of some | |
information in the name list file that is not needed (and in fact removed | |
during parsing) for the Unicode code charts.</p> | |
<p>With access to the layout program (<a href="http://www.unicode.org/unibook/">unibook.exe</a>) it is a simple matter of | |
creating name lists for the purpose of formatting working drafts containing | |
proposed characters.</p> | |
<p>The content of the NamesList.txt file is optimized for code chart creation. | |
Some information that can be inferred by the reader from context has been | |
suppressed to make the code charts more readable. </p> | |
<h3>1.1 NamesList File Overview</h3> | |
<p>The *.lst files are plain text files which in their most simple form look | |
like this</p> | |
<p>@@<tab>0020<tab>BASIC LATIN<tab>007F<br> | |
; this is a file comment (ignored)<br> | |
0020<tab>SPACE<br> | |
0021<tab>EXCLAMATION MARK<br> | |
0022<tab>QUOTATION MARK<br> | |
. . .<br> | |
007F<tab>DELETE</p> | |
<p>The semicolon (as first character), @ and <tab> characters are used | |
by the file syntax and must be provided as shown. Hexadecimal digits must be | |
in UPPER CASE. A double @@ introduces a block header, with the title, and | |
start and ending code of the block provided as shown.</p> | |
<p>For an ISO-style, minimal name list, only the NAME_LINE and BLOCKHEADER and | |
their constituent syntax elements are needed.</p> | |
<p>The full syntax with all the options is provided in the following sections.</p> | |
<h3>1.2 NamesList File Structure</h3> | |
<p>This section gives defines the overall file structure</p> | |
<pre><strong>NAMELIST: TITLE_PAGE* BLOCK* | |
</strong> | |
<strong>TITLE_PAGE: TITLE | |
| TITLE_PAGE SUBTITLE | |
| TITLE_PAGE SUBHEADER | |
| TITLE_PAGE IGNORED_LINE | |
| TITLE_PAGE EMPTY_LINE | |
| TITLE_PAGE COMMENT_LINE | |
| TITLE_PAGE NOTICE_LINE | |
| TITLE_PAGE PAGEBREAK | |
</strong> | |
<strong>BLOCK: BLOCKHEADER | |
| BLOCK CHAR_ENTRY | |
| BLOCK SUBHEADER | |
| BLOCK NOTICE_LINE | |
| BLOCK EMPTY_LINE | |
| BLOCK IGNORED_LINE | |
| BLOCK PAGEBREAK | |
CHAR_ENTRY: NAME_LINE | RESERVED_LINE | |
| CHAR_ENTRY ALIAS_LINE | |
| CHAR_ENTRY COMMENT_LINE | |
| CHAR_ENTRY CROSS_REF | |
| CHAR_ENTRY DECOMPOSITION | |
| CHAR_ENTRY COMPAT_MAPPING | |
| CHAR_ENTRY IGNORED_LINE | |
| CHAR_ENTRY EMPTY_LINE | |
| CHAR_ENTRY NOTICE | |
</strong></pre> | |
<p>In other words:<br> | |
<br> | |
Neither TITLE nor SUBTITLE may occur after the first BLOCKHEADER.</p> | |
<p>Only TITLE, SUBTITLE, SUBHEADER, PAGEBREAK, COMMENT_LINE, and | |
IGNORED_LINE may occur before the first BLOCKHEADER.</p> | |
<p>Directly following either a NAME_LINE or a RESERVED_LINE an uninterrupted | |
sequence of the following lines may occur (in any order and repeated as often | |
as needed): ALIAS_LINE, CROSS_REF, DECOMPOSITION, COMPAT_MAPPING, NOTICE_LINE, | |
EMPTY_LINE and IGNORED_LINE.</p> | |
<p>Except for EMPTY_LINE, NOTICE_LINE and IGNORED_LINE, none of these lines may | |
occur in any other place.</p> | |
<p><b>Note</b>: A NOTICE_LINE displays differently depending on whether it follows a | |
header or title or is part of a CHAR_ENTRY.</p> | |
<h3>1.3 NamesList File Elements</h3> | |
<p>This section provides the details of the syntax for the individual | |
elements.</p> | |
<pre style="font-size:8pt"><strong>ELEMENT SYNTAX</strong> // How rendered | |
<strong>NAME_LINE: CHAR <tab> NAME LF | |
</strong> // the CHAR and the corresponding image are echoed, | |
// followed by the name as given in LINE | |
<strong> CHAR TAB "<" LCNAME ">" LF | |
</strong> // control and non-characters use this form of | |
// lower case, bracketed pseudo character name</pre> | |
<pre style="font-size:8pt"><strong> CHAR TAB NAME COMMENT LF | |
</strong> // Names may have a comment, which is stripped off | |
// unless the file is parsed for an ISO style list | |
<strong>RESERVED_LINE: CHAR TAB <reserved> | |
</strong> // the CHAR is echoed followed by an icon for the | |
// reserved character and a fixed string e.g. <reserved> | |
<strong>COMMMENT_LINE: <tab> "*" SP EXPAND_LINE | |
</strong> // * is replaced by BULLET, output line as comment | |
<strong><tab> EXPAND_LINE</strong> | |
// output line as comment | |
<strong>ALIAS_LINE: <tab> "=" SP LINE | |
</strong> // replace = by itself, output line as alias | |
<strong>CROSS_REF: <tab> "X" SP EXPAND_LINE | |
</strong> // X is replaced by a right arrow | |
<strong><tab> "X" SP "(" LCNAME SP "-" SP CHAR ")" | |
</strong> // X is replaced by a right arrow | |
// the "(", "-", ")" are removed, the | |
// order of CHAR and STRING is reversed | |
// i.e. both inputs result in the same output | |
<b>FILE_COMMENT:</b> <b>";"</b> <b>LINE</b><strong> | |
IGNORED_LINE: <tab> ";" EXPAND_LINE | |
EMPTY_LINE: LF | |
</strong> // empty and ignored lines as well as | |
// file comments are ignored | |
<strong>DECOMPOSITION: <tab> ":" EXPAND_LINE | |
</strong> // replace ':' by EQUIV, expand line into | |
// decomposition | |
<strong>COMPAT_MAPPING: <tab> "#" SP EXPAND_LINE | |
</strong> // replace '#' by APPROX, output line as mapping | |
<strong>NOTICE_LINE: "@+" <tab> LINE | |
</strong> // skip '@+', output text as notice | |
<strong> "@+" <tab> * SP LINE | |
</strong> // skip '@+', output text as notice | |
// "*" expands to a bullet character | |
// Notices following a character code apply to the | |
// character and are indented. Notices not following | |
// a character code apply to the page/block/column | |
// and are italicized, but not indented | |
<strong>SUBTITLE: "@@@+" <tab> LINE | |
</strong> // skip "@@@+", output text as subtitle | |
<strong>SUBHEADER: "@" <tab> LINE | |
</strong> // skip '@', output line as text as column header | |
<strong>BLOCKHEADER: "@@" <tab> BLOCKSTART <tab> BLOCKNAME <tab> BLOCKEND | |
</strong> // skip "@@", cause a page break and optional | |
// blank page, then output one or more charts | |
// followed by the list of character names. | |
// use BLOCKSTART and BLOCKEND to define the | |
// characters belonging to a block | |
// use blockname in page and table headers</pre> | |
<pre style="font-size:8pt"><b>BLOCKNAME: LABEL | |
LABEL SP "(" LABEL ")"</b> | |
// if an alternate label is present it replaces | |
// the blockname when an ISO-style namelist is | |
// laid out; it is ignored in the Unicode charts | |
<strong>BLOCKSTART: CHAR</strong> // first character position in block | |
<strong>BLOCKEND: CHAR</strong> // last character position in block | |
<strong>PAGE_BREAK: "@@"</strong> // insert a column break | |
<strong>TITLE: "@@@" <tab> LINE</strong> | |
// skip "@@@", output line as text | |
// Title is used in page headers | |
<strong>EXPAND_LINE: {CHAR | STRING}+ LF </strong> | |
// all instances of CHAR *) are replaced by | |
// CHAR NBSP x NBSP where x is the single Unicode | |
// character corresponding to char | |
// If character is combining, it is replaced with | |
// CHAR NBSP <circ> x NBSP where <circ> is the | |
// dotted circle</pre> | |
<p><strong>Notes:</strong></p> | |
<ul> | |
<li>Blocks must be aligned on 16-code point boundary and contain an integer | |
multiple of 16 code point columns. The exception to that rule is for blocks of | |
ideographs etc. for which no names are listed in the file. Such blocks | |
must end on the actual last character.</li> | |
<li>Blocks must be non-overlapping and in ascending order. Namelines | |
must be in ascending order and follow the block header for the block to | |
which they belong.</li> | |
<li>Reserved entries are optional, and will normally be supplied automatically. They | |
are required whenever followed by ALIAS_LINE, COMMENT_LINE, NOTICE_LINE or CROSS_REF</li> | |
<li>The French version of the nameslist uses French rules, which allow | |
apostrophe and accented letters in character names.</li> | |
</ul> | |
<h3><strong>1.4 NamesList File Primitives</strong></h3> | |
<p>The following are the primitives and terminals for the NamesList syntax.</p> | |
<pre style="font-size:8pt"><strong>LINE: STRING LF | |
COMMENT: "(" LABEL ")" | |
"(" LABEl ")" SP "*" | |
</strong> "*"<strong> </strong> | |
<strong>BLOCKNAME:</strong> <sequence of Latin-1 characters, except "(" and ")"> | |
<strong>NAME</strong>: <sequence of uppercase ASCII letters, digits and hyphen> | |
<b>LCNAME: </b><sequence of lowercase ASCII letters, digits and hyphen> | |
<strong>STRING</strong>: <sequence of Latin-1 characters, except space and controls> | |
<strong>LABEL</strong>: <sequence of Latin-1 characters, except space, controls, "(" or ")"> | |
<strong>CHAR</strong>: <strong>X X X X</strong> | |
<strong>| X X X X X</strong> | |
<strong>| X X X X X X</strong> | |
<strong>X: "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"|"A"|"B"|"C"|"D"|"E"|"F" | |
<tab>:</strong> <sequence of one or more ASCII tab characters 0x09> | |
<strong>SP</strong>: <ASCII 0x20> | |
<strong>LF</strong>: <any sequence of ASCII 0x0A and 0x0D> | |
</pre> | |
<p><strong>Notes:</strong> | |
<ul> | |
<li>Special lookahead logic prevents a mention of a 4 digit number for a standard, such | |
as ISO 9999 from being misinterpreted as ISO CHAR. The hyphen in a character | |
range CHAR-CHAR is replaced by an EN DASH on output.</li> | |
<li>The final LF in the file must be present</li> | |
<li>A CHAR inside ' or " is expanded, but only its glyph image is | |
printed, the code value is not echoed.</li> | |
<li>Single and double straight quotes in an EXPAND_LINE are replaced by curly quotes using | |
English rules. Smart apostrophes are supported, but nested quotes are not.</li> | |
<li>The NamesList.txt file is encoded in Latin-1. While the code chart | |
formatter can accept files in either Latin-1 and little-endian UTF-16, | |
prefixed with a BOM, the character repertoire for running text (anything | |
other than CHAR) is effectively restricted to Latin-1 characters.</li> | |
</ul> | |
<h2>Modifications</h2> | |
<p><b>Version 4.0.0<br> | |
</b> Fix syntax to better reflect restrictions on characters | |
in character and block names.<br> | |
Better document treatment of comments in block names, plus | |
French name rules.</p> | |
<p><b>Version 3.2.0<br> | |
</b> Fixed several broken links, added a left margin, | |
changed version numbering.</p> | |
<p><b>Version 3.1.0 (2)<br> | |
</b> Use of 4-6 digit hex notation is now supported.</p> | |
<hr width="50%"> | |
<h2>UCD <a name="Terms of Use">Terms of Use</a></h2> | |
<h3><i>Disclaimer</i></h3> | |
<blockquote> | |
<p><i>The Unicode Character Database is provided as is by Unicode, Inc. No | |
claims are made as to fitness for any particular purpose. No warranties of | |
any kind are expressed or implied. The recipient agrees to determine | |
applicability of information provided. If this file has been purchased on | |
magnetic or optical media from Unicode, Inc., the sole remedy for any claim | |
will be exchange of defective media within 90 days of receipt.</i></p> | |
<p><i>This disclaimer is applicable for all other data files accompanying | |
the Unicode Character Database, some of which have been compiled by the | |
Unicode Consortium, and some of which have been supplied by other sources.</i></p> | |
</blockquote> | |
<h3><i>Limitations on Rights to Redistribute This Data</i></h3> | |
<blockquote> | |
<p><i>Recipient is granted the right to make copies in any form for internal | |
distribution and to freely use the information supplied in the creation of | |
products supporting the Unicode<sup>TM</sup> Standard. The files in the | |
Unicode Character Database can be redistributed to third parties or other | |
organizations (whether for profit or not) as long as this notice and the | |
disclaimer notice are retained. Information can be extracted from these | |
files and used in documentation or programs, as long as there is an | |
accompanying notice indicating the source.</i></p> | |
</blockquote> | |
<hr width="50%"> | |
<div align="center"> | |
<center> | |
<table cellspacing="0" cellpadding="0" border="0"> | |
<tr> | |
<td><a href="http://www.unicode.org/unicode/copyright.html"><img src="http://www.unicode.org/img/hb_notice.gif" border="0" alt="Access to Copyright and terms of use" width="216" height="50"></a></td> | |
</tr> | |
</table> | |
<script language="Javascript" type="text/javascript" | |
src="http://www.unicode.org/webscripts/lastModified.js"></script> | |
</center> | |
</div> | |
</div> | |
</body> | |
</html> |