Update Unicode docs

db4
Slava Pestov 2009-01-25 23:03:49 -06:00
parent 4d547653b5
commit d4122b5715
4 changed files with 44 additions and 21 deletions

@ -0,0 +1 @@
USE: unicode

@ -1,49 +1,59 @@
! Copyright (C) 2009 Your name. ! Copyright (C) 2009 Daniel Ehrenberg
! See http://factorcode.org/license.txt for BSD license. ! See http://factorcode.org/license.txt for BSD license.
USING: help.markup help.syntax kernel ; USING: help.markup help.syntax kernel ;
IN: unicode.categories IN: unicode.categories
HELP: LETTER HELP: LETTER
{ $class-description "The class of upper cased letters" } ; { $class-description "The class of upper cased letters." } ;
HELP: Letter HELP: Letter
{ $class-description "The class of letters" } ; { $class-description "The class of letters." } ;
HELP: alpha HELP: alpha
{ $class-description "The class of code points which are alphanumeric" } ; { $class-description "The class of alphanumeric characters." } ;
HELP: blank HELP: blank
{ $class-description "The class of code points which are whitespace" } ; { $class-description "The class of whitespace characters." } ;
HELP: character HELP: character
{ $class-description "The class of numbers which are pre-defined Unicode code points" } ; { $class-description "The class of pre-defined Unicode code points." } ;
HELP: control HELP: control
{ $class-description "The class of control characters" } ; { $class-description "The class of control characters." } ;
HELP: digit HELP: digit
{ $class-description "The class of code coints which are digits" } ; { $class-description "The class of digits." } ;
HELP: letter HELP: letter
{ $class-description "The class of code points which are lower-cased letters" } ; { $class-description "The class of lower-cased letters." } ;
HELP: printable HELP: printable
{ $class-description "The class of characters which are printable, as opposed to being control or formatting characters" } ; { $class-description "The class of characters which are printable, as opposed to being control or formatting characters." } ;
HELP: uncased HELP: uncased
{ $class-description "The class of letters which don't have a case" } ; { $class-description "The class of letters which don't have a case." } ;
ARTICLE: "unicode.categories" "Character classes" ARTICLE: "unicode.categories" "Character classes"
{ $vocab-link "unicode.categories" } " is a vocabulary which provides predicates for determining if a code point has a particular property, for example being a lower cased letter. These should be used in preference to the " { $vocab-link "ascii" } " equivalents in most cases. Below are links to classes of characters, but note that each of these also has a predicate defined, which is usually more useful." "The " { $vocab-link "unicode.categories" } " vocabulary implements predicates for determining if a code point has a particular property, for example being a lower cased letter. These should be used in preference to the " { $vocab-link "ascii" } " equivalents in most cases. Each character class has an associated predicate word."
{ $subsection blank } { $subsection blank }
{ $subsection blank? }
{ $subsection letter } { $subsection letter }
{ $subsection letter? }
{ $subsection LETTER } { $subsection LETTER }
{ $subsection LETTER? }
{ $subsection Letter } { $subsection Letter }
{ $subsection Letter? }
{ $subsection digit } { $subsection digit }
{ $subsection digit? }
{ $subsection printable } { $subsection printable }
{ $subsection printable? }
{ $subsection alpha } { $subsection alpha }
{ $subsection alpha? }
{ $subsection control } { $subsection control }
{ $subsection control? }
{ $subsection uncased } { $subsection uncased }
{ $subsection character } ; { $subsection uncased? }
{ $subsection character }
{ $subsection character? } ;
ABOUT: "unicode.categories" ABOUT: "unicode.categories"
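
The revised article above says that each character class has an associated predicate word. A minimal sketch of how those predicates might be used from a listener follows. It assumes a Factor image in which the vocabulary is still named unicode.categories, as in this commit; later releases may expose the same words from a different vocabulary. The sample inputs are illustrative only.

```factor
USING: prettyprint sequences unicode.categories ;

! Each class documented above has a predicate named after it
! with a trailing "?". The inputs below are illustrative.
CHAR: a letter? .    ! t (lower-cased letter)
CHAR: A LETTER? .    ! t (upper-cased letter)
CHAR: A letter? .    ! f (not lower-cased)
CHAR: 7 digit? .     ! t
32 blank? .          ! t (U+0020 SPACE is whitespace)

! Strings are sequences of code points, so the predicates
! can be passed to ordinary sequence combinators.
"abc123" [ alpha? ] all? .   ! t
```

Because the predicates take a single code point, they work with any sequence word that accepts a quotation, with no string-specific helper needed.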

@ -4,7 +4,13 @@ IN: unicode.normalize
ABOUT: "unicode.normalize" ABOUT: "unicode.normalize"
ARTICLE: "unicode.normalize" "Unicode normalization" ARTICLE: "unicode.normalize" "Unicode normalization"
"The " { $vocab-link "unicode.normalize" "unicode.normalize" } " vocabulary defines words for normalizing Unicode strings. In Unicode, it is often possible to have multiple sequences of characters which really represent exactly the same thing. For example, to represent e with an acute accent above, there are two possible strings: \"e\\u000301\" (the e character, followed by the combining acute accent character) and \"\\u0000e9\" (a single character, e with an acute accent). There are four normalization forms: NFD, NFC, NFKD, and NFKC. Basically, in NFD and NFKD, everything is expanded, whereas in NFC and NFKC, everything is contracted. In NFKD and NFKC, more things are expanded and contracted. This is a process which loses some information, so it should be done only with care. Most of the world uses NFC to communicate, but for many purposes, NFD/NFKD is easier to process. For more information, see Unicode Standard Annex #15 and section 3 of the Unicode standard." "The " { $vocab-link "unicode.normalize" "unicode.normalize" } " vocabulary defines words for normalizing Unicode strings."
$nl
"In Unicode, it is often possible to have multiple sequences of characters which really represent exactly the same thing. For example, to represent e with an acute accent above, there are two possible strings: " { $snippet "\"e\\u000301\"" } " (the e character, followed by the combining acute accent character) and " { $snippet "\"\\u0000e9\"" } " (a single character, e with an acute accent)."
$nl
"There are four normalization forms: NFD, NFC, NFKD, and NFKC. Basically, in NFD and NFKD, everything is expanded, whereas in NFC and NFKC, everything is contracted. In NFKD and NFKC, more things are expanded and contracted. This is a process which loses some information, so it should be done only with care."
$nl
"Most of the world uses NFC to communicate, but for many purposes, NFD/NFKD is easier to process. For more information, see Unicode Standard Annex #15 and section 3 of the Unicode standard."
{ $subsection nfc } { $subsection nfc }
{ $subsection nfd } { $subsection nfd }
{ $subsection nfkc } { $subsection nfkc }
@ -12,16 +18,16 @@ ARTICLE: "unicode.normalize" "Unicode normalization"
HELP: nfc HELP: nfc
{ $values { "string" string } { "nfc" "a string in NFC" } } { $values { "string" string } { "nfc" "a string in NFC" } }
{ $description "Converts a string to Normalization Form C" } ; { $description "Converts a string to Normalization Form C." } ;
HELP: nfd HELP: nfd
{ $values { "string" string } { "nfd" "a string in NFD" } } { $values { "string" string } { "nfd" "a string in NFD" } }
{ $description "Converts a string to Normalization Form D" } ; { $description "Converts a string to Normalization Form D." } ;
HELP: nfkc HELP: nfkc
{ $values { "string" string } { "nfkc" "a string in NFKC" } } { $values { "string" string } { "nfkc" "a string in NFKC" } }
{ $description "Converts a string to Normalization Form KC" } ; { $description "Converts a string to Normalization Form KC." } ;
HELP: nfkd HELP: nfkd
{ $values { "string" string } { "nfkd" "a string in NFKD" } } { $values { "string" string } { "nfkd" "a string in NFKD" } }
{ $description "Converts a string to Normalization Form KD" } ; { $description "Converts a string to Normalization Form KD." } ;
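
The normalization article above describes two equivalent spellings of e with an acute accent. A short sketch of how nfc and nfd could be checked against that example follows, assuming the six-digit \u escape syntax used in the article text and the stack effects given in the HELP entries.

```factor
USING: kernel prettyprint sequences unicode.normalize ;

! Decomposed form: "e" followed by the combining acute accent.
! NFC contracts it to a single code point.
"e\u000301" nfc length .               ! 1

! Precomposed form: NFD expands it to two code points.
"\u0000e9" nfd length .                ! 2

! After normalizing both to the same form, the strings compare equal.
"e\u000301" nfc "\u0000e9" nfc = .     ! t
```

Normalizing both operands before comparing is the usual way to make two such strings compare equal with plain `=`.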

@ -1,8 +1,14 @@
USING: help.markup help.syntax ; USING: help.markup help.syntax strings ;
IN: unicode IN: unicode
ARTICLE: "unicode" "Unicode" ARTICLE: "unicode" "Unicode"
"Unicode is a set of characters, or " { $emphasis "code points" } " covering what's used in most world writing systems. Any Factor string can hold any of these code points transparently; a factor string is a sequence of Unicode code points. Unicode is accompanied by several standard algorithms for common operations like encoding in files, capitalizing a string, finding the boundaries between words, etc. When a programmer is faced with a string manipulation problem, where the string represents human language, a Unicode algorithm is often much better than the naive one. This is not in terms of efficiency, but rather internationalization. Even English text that remains in ASCII is better served by the Unicode collation algorithm than a naive algorithm. The Unicode algorithms implemented here are:" "The " { $vocab-link "unicode" } " vocabulary and its sub-vocabularies implement support for the Unicode 5.1 character set."
$nl
"The Unicode character set contains most of the world's writing systems. Unicode is intended as a replacement for, and is a superset of, such legacy character sets as ASCII, Latin1, MacRoman, and so on. Unicode characters are called " { $emphasis "code points" } "; Factor's " { $link "strings" } " are sequences of code points."
$nl
"The Unicode character set is accompanied by several standard algorithms for common operations like encoding text in files, capitalizing a string, finding the boundaries between words, and so on."
$nl
"The Unicode algorithms implemented by the " { $vocab-link "unicode" } " vocabulary are:"
{ $vocab-subsection "Case mapping" "unicode.case" } { $vocab-subsection "Case mapping" "unicode.case" }
{ $vocab-subsection "Collation and weak comparison" "unicode.collation" } { $vocab-subsection "Collation and weak comparison" "unicode.collation" }
{ $vocab-subsection "Character classes" "unicode.categories" } { $vocab-subsection "Character classes" "unicode.categories" }
@ -11,6 +17,6 @@ ARTICLE: "unicode" "Unicode"
"The following are mostly for internal use:" "The following are mostly for internal use:"
{ $vocab-subsection "Unicode syntax" "unicode.syntax" } { $vocab-subsection "Unicode syntax" "unicode.syntax" }
{ $vocab-subsection "Unicode data tables" "unicode.data" } { $vocab-subsection "Unicode data tables" "unicode.data" }
{ $see-also "io.encodings" } ; { $see-also "ascii" "io.encodings" } ;
ABOUT: "unicode" ABOUT: "unicode"