Doc updates, minor cleanups, Porter stemmer algorithm ported from CL for upcoming full text search engine

darcs
slava 2006-06-11 20:16:45 +00:00
parent bf54019fee
commit e034305fce
19 changed files with 47418 additions and 40 deletions

View File

@ -5,7 +5,6 @@
+ httpd:
- outliners don't work
- browser responder doesn't work
- tests in a loop runs out of memory eventually
- fix this:
@ -48,7 +47,6 @@
+ help:
- document that can <void*> only be called with an alien
- help search
- automatically update help graph when adding/removing articles/words

View File

@ -16,14 +16,7 @@ $terpri
{ $subsection "dll-internals" } ;
ARTICLE: "loading-libs" "Loading native libraries"
"Factor must be aware of what native libraries are in use. This is done by associating a logical library name with an operating system path name, and then referring to the library by its logical name. There are two ways to define libraries in this manner; you can either use command line parameters or the " { $link add-library } " word."
$terpri
"The following two command line parameters can be specified for each library to load; the second parameter is optional:"
{ $list
{ { $snippet "-libraries:" { $emphasis "logical" } ":name=" { $emphasis "name" } } " associates a logical name with a system-specific native library name," }
{ { $snippet "-libraries:" { $emphasis "logical" } ":abi=" { $emphasis "type" } } " specifies the application binary interface (ABI) used by the library. On nearly all platforms, the default value of " { $snippet "cdecl" } " is correct. On Windows/x86, system DLLs use the " { $snippet "stdcall" } " ABI." }
}
"You can also define a logical library interactively:"
"Before calling a C library, you must associate its path name on disk with a logical name which Factor uses to identify the library:"
{ $subsection add-library }
"Once a library has been defined, you can try loading it to see if the path name is correct:"
{ $subsection load-library } ;
@ -123,9 +116,11 @@ ARTICLE: "c-types-strings" "C string types"
$terpri
"Passing a Factor string to a C function expecting a C string allocates a byte array in the Factor heap; the string is then converted to the requested format and a raw pointer is passed to the function. The function must not retain such pointers after it returns, since byte arrays in the Factor heap can be moved by the garbage collector. To allocate a string which will not move, use " { $link <malloc-string> } " and then " { $link free } "."
$terpri
"A couple of words can be used to read and write " { $snippet "char*" } " strings from arbitrary addresses:"
"A couple of words can be used to read and write " { $snippet "char*" } " and " { $snippet "ushort*" } " strings from arbitrary addresses:"
{ $subsection alien>char-string }
{ $subsection string>char-alien } ;
{ $subsection alien>u16-string }
{ $subsection string>char-alien }
{ $subsection string>u16-alien } ;
ARTICLE: "c-structs" "C structure types"
"A " { $snippet "struct" } " in C is essentially a block of memory with the value of each structure field stored at a fixed offset. The C library interface provides some utilities to define words which read and write structure fields given a base address."

View File

@ -13,7 +13,7 @@ $terpri
"The string representation of an object can be requested:"
{ $subsection unparse }
{ $subsection unparse-short }
"The prettyprinter is flexible and extensible."
"Advanced topics:"
{ $subsection "prettyprint-limitations" }
{ $subsection "prettyprint-variables" }
{ $subsection "prettyprint-extension" }

View File

@ -102,6 +102,7 @@ vectors words ;
"/library/help/markup.factor"
"/library/help/word-help.factor"
"/library/help/crossref.factor"
"/library/help/porter-stemmer.factor"
"/library/help/syntax.factor"
"/library/syntax/parse-stream.factor"

View File

@ -5,7 +5,7 @@ HELP: set-fill "( n seq -- )"
{ $values { "n" "a new fill pointer" } { "seq" "a growable sequence" } }
{ $contract "Sets the fill pointer (number of occupied elements in the underlying storage) of a growable sequence." }
{ $side-effects "Modifies " { $snippet "seq" } "." }
{ $warning "This word is in the " { $snippet "sequences-internals" } " vocabulary because it is not safe. Changing the fill pointer to a negative value, or a value higher than the underlying sequence length can lead to memory corruption. Client code should use " { $link set-length } " instead." } ;
{ $warning "This word is in the " { $snippet "sequences-internals" } " vocabulary because it is not safe. Changing the fill pointer to a negative value, or a value higher than the underlying sequence length can lead to memory corruption. User code should use " { $link set-length } " instead." } ;
HELP: underlying "( seq -- underlying )"
{ $values { "seq" "a growable sequence" } { "underlying" "the underlying sequence" } }

View File

@ -159,6 +159,9 @@ IN: hashtables
: set-hash ( value key hash -- )
[ (set-hash) ] keep ?grow-hash ;
: hash+ ( n key hash -- )
[ hash [ 0 ] unless* + ] 2keep set-hash ;
: associate ( value key -- hashtable )
2 <hashtable> [ set-hash ] keep ;

View File

@ -76,24 +76,24 @@ HELP: change-size "( hash old -- )"
HELP: (set-hash) "( value key hash -- )"
{ $values { "value" "a value" } { "key" "a key to add" } { "hash" "a hashtable" } }
{ $description "Stores the key/value pair into the hashtable. This word does not grow the hashtable if it exceeds capacity, therefore a hang can result. Client code should use " { $link set-hash } " instead, which grows the hashtable if necessary." }
{ $description "Stores the key/value pair into the hashtable. This word does not grow the hashtable if it exceeds capacity, therefore a hang can result. User code should use " { $link set-hash } " instead, which grows the hashtable if necessary." }
{ $side-effects "hash" } ;
HELP: grow-hash "( hash -- )"
{ $values { "hash" "a hashtable" } }
{ $description "Enlarges the capacity of a hashtable. Client code does not need to call this word directly." }
{ $description "Enlarges the capacity of a hashtable. User code does not need to call this word directly." }
{ $see-also (set-hash) ?grow-hash set-hash }
{ $side-effects "hash" } ;
HELP: ?grow-hash "( hash -- )"
{ $values { "hash" "a hashtable" } }
{ $description "Enlarges the capacity of a hashtable if it is almost full. Client code does not need to call this word directly." }
{ $description "Enlarges the capacity of a hashtable if it is almost full. User code does not need to call this word directly." }
{ $see-also (set-hash) grow-hash set-hash }
{ $side-effects "hash" } ;
HELP: hash>seq "( i hash -- seq )"
{ $values { "i" "0 or 1" } { "hash" "a hashtable" } { "seq" "a sequence of keys or values" } }
{ $description "Client code should not call this word. It is unsafe and only used in the implementation of " { $link hash-keys } " and " { $link hash-values } ", both of which are safe." }
{ $description "User code should not call this word. It is unsafe and only used in the implementation of " { $link hash-keys } " and " { $link hash-values } ", both of which are safe." }
{ $warning "This word is in the " { $snippet "hashtables-internals" } " vocabulary because passing an invalid value for " { $snippet "i" } " can lead to memory corruption." } ;
HELP: <hashtable> "( n -- hash )"
@ -160,6 +160,11 @@ HELP: remove-hash* "( key hash -- old )"
{ $side-effects "hash" }
{ $see-also hash remove-hash } ;
HELP: set-hash "( value key hash -- )"
{ $values { "value" "a value" } { "key" "a key to add" } { "hash" "a hashtable" } }
{ $description "Stores the key/value pair into the hashtable." }
{ $side-effects "hash" } ;
HELP: hash-keys "( hash -- keys )"
{ $values { "hash" "a hashtable" } { "keys" "an array of keys" } }
{ $description "Outputs an array of all keys in the hashtable." }

View File

@ -3,7 +3,7 @@ USING: arrays help math sequences-internals ;
HELP: collect "( n quot -- array )"
{ $values { "n" "a non-negative integer" } { "quot" "a quotation with stack effect " { $snippet "( n -- value )" } } { "array" "an array with " { $snippet "n" } " elements" } }
{ $description "A primitive mapping operation that applies a quotation to all integers from 0 up to but not including " { $snippet "n" } ", and collects the results in a new array. Client code should use " { $link map } " instead." } ;
{ $description "A primitive mapping operation that applies a quotation to all integers from 0 up to but not including " { $snippet "n" } ", and collects the results in a new array. User code should use " { $link map } " instead." } ;
HELP: each "( seq quot -- )"
{ $values { "seq" "a sequence" } { "quot" "a quotation with stack effect " { $snippet "( elt -- )" } } }

View File

@ -2,7 +2,7 @@ USING: generic help kernel kernel-internals ;
HELP: tuple= "( tuple1 tuple2 -- ? )"
{ $values { "tuple1" "a tuple" } { "tuple2" "a tuple" } }
{ $description "Low-level tuple equality test. Client code should use " { $link = } " instead." }
{ $description "Low-level tuple equality test. User code should use " { $link = } " instead." }
{ $warning "This word is in the " { $snippet "kernel-internals" } " vocabulary because it does not do any type checking. Passing values which are not tuples can result in memory corruption." } ;
HELP: tuple f
@ -12,13 +12,6 @@ $terpri
{ $list
{ { $snippet "\"slot-names\"" } " - a sequence of strings naming the tuple's slots" }
{ { $snippet "\"tuple-size\"" } " - the number of slots" }
} }
{ $notes "Low-level facilities need to be aware of tuple object layout. It is of no concern to client code. The layout of a tuple in memory is straightforward:"
{ $list
"slot 0 - object header with type number (as usual)"
"slot 1 - number of slots, include class and delegate"
"slot 2 - the tuple's class word"
{ "slot 3 - a delegate or " { $link f } }
} } ;
HELP: class "( object -- class )"

View File

@ -0,0 +1,262 @@
! The Porter Stemming Algorithm, hand translated to Factor from
! Common Lisp by Slava Pestov.
! The Common Lisp version was hand translated from ANSI C by
! Steven M. Haflich smh@franz.com.
! The original ANSI C was written by Martin Porter.
! References:
! http://www.tartarus.org/~martin/PorterStemmer
! Porter, 1980, An algorithm for suffix stripping, Program,
! Vol. 14, no. 3, pp 130-137.
IN: porter-stemmer
USING: kernel math parser sequences test ;
: consonant? ( i str -- ? )
2dup nth dup "aeiou" member? [
3drop f
] [
CHAR: y = [
over zero?
[ 2drop t ] [ >r 1- r> consonant? not ] if
] [
2drop t
] if
] if ;
: skip-vowels ( i str -- i str )
2dup bounds-check? [
2dup consonant? [ >r 1+ r> skip-vowels ] unless
] when ;
: skip-consonants ( i str -- i str )
2dup bounds-check? [
2dup consonant? [ >r 1+ r> skip-consonants ] when
] when ;
: (consonant-seq) ( n i str -- n )
skip-vowels
2dup bounds-check? [
>r 1+ >r 1+ r> r> skip-consonants >r 1+ r>
(consonant-seq)
] [
2drop
] if ;
: consonant-seq ( str -- n )
0 0 rot skip-consonants (consonant-seq) ;
: stem-vowel? ( str -- ? )
dup length [ swap consonant? ] all-with? not ;
: double-consonant? ( i str -- ? )
over 1 < [
2drop f
] [
2dup nth >r over 1- over nth r> = [
consonant?
] [
2drop f
] if
] if ;
: consonant-end? ( n seq -- ? )
[ length swap - ] keep consonant? ;
: last-is? ( str possibilities -- ? ) >r peek r> member? ;
: cvc? ( str -- ? )
{
{ [ dup length 3 < ] [ drop f ] }
{ [ 1 over consonant-end? not ] [ drop f ] }
{ [ 2 over consonant-end? ] [ drop f ] }
{ [ 3 over consonant-end? not ] [ drop f ] }
{ [ t ] [ "wxy" last-is? not ] }
} cond ;
: r ( str oldsuffix newsuffix -- str )
pick consonant-seq 0 > [ nip ] [ drop ] if append ;
: butlast ( seq -- seq ) 1 swap head-slice* ;
! step1a and step1b get rid of plurals and -ed or -ing. e.g.
!
! caresses -> caress
! ponies -> poni
! ties -> ti
! caress -> caress
! cats -> cat
!
! feed -> feed
! agreed -> agree
! disabled -> disable
!
! matting -> mat
! mating -> mate
! meeting -> meet
! milling -> mill
! messing -> mess
!
! meetings -> meet
: step1a ( str -- str )
dup peek CHAR: s = [
{
{ [ "sses" ?tail ] [ "ss" append ] }
{ [ "ies" ?tail ] [ "i" append ] }
{ [ dup "ss" tail? ] [ ] }
{ [ "s" ?tail ] [ ] }
{ [ t ] [ ] }
} cond
] when ;
: -eed ( str -- str )
dup consonant-seq 0 > "ee" "eed" ? append ;
: -ed ( str -- str ? )
dup stem-vowel? [ [ "ed" append ] unless ] keep ;
: -ing ( str -- str ? )
dup stem-vowel? [ [ "ing" append ] unless ] keep ;
: -ed/ing ( str -- str )
{
{ [ "at" ?tail ] [ "ate" append ] }
{ [ "bl" ?tail ] [ "ble" append ] }
{ [ "iz" ?tail ] [ "ize" append ] }
{
[ dup length 1- over double-consonant? ]
[ dup "lsz" last-is? [ butlast ] unless ]
}
{
[ t ]
[
dup consonant-seq 1 = over cvc? and
[ "e" append ] when
]
}
} cond ;
: step1b ( str -- str ? )
{
{ [ "eed" ?tail ] [ -eed ] }
{
[
{
{ [ "ed" ?tail ] [ -ed ] }
{ [ "ing" ?tail ] [ -ing ] }
{ [ t ] [ f ] }
} cond
] [ -ed/ing ]
}
{ [ t ] [ ] }
} cond ;
: step1c ( str -- str )
#! step1c turns terminal y to i when there is another vowel
#! in the stem.
dup butlast stem-vowel? [
"y" ?tail [ "i" append ] when
] when ;
: step2 ( str -- str )
#! step2 maps double suffices to single ones. so -ization
#! ( = -ize plus-ation) maps to -ize etc. note that the
#! string before the suffix must give consonant-seq > 0.
{
{ [ "ational" ?tail ] [ "ational" "ate" r ] }
{ [ "tional" ?tail ] [ "tional" "tion" r ] }
{ [ "enci" ?tail ] [ "enci" "ence" r ] }
{ [ "anci" ?tail ] [ "anci" "ance" r ] }
{ [ "izer" ?tail ] [ "izer" "ize" r ] }
{ [ "bli" ?tail ] [ "bli" "ble" r ] }
{ [ "alli" ?tail ] [ "alli" "al" r ] }
{ [ "entli" ?tail ] [ "entli" "ent" r ] }
{ [ "eli" ?tail ] [ "eli" "e" r ] }
{ [ "ousli" ?tail ] [ "ousli" "ous" r ] }
{ [ "ization" ?tail ] [ "ization" "ize" r ] }
{ [ "ation" ?tail ] [ "ation" "ate" r ] }
{ [ "ator" ?tail ] [ "ator" "ate" r ] }
{ [ "alism" ?tail ] [ "alism" "al" r ] }
{ [ "iveness" ?tail ] [ "iveness" "ive" r ] }
{ [ "fulness" ?tail ] [ "fulness" "ful" r ] }
{ [ "ousness" ?tail ] [ "ousness" "ous" r ] }
{ [ "aliti" ?tail ] [ "aliti" "al" r ] }
{ [ "iviti" ?tail ] [ "iviti" "ive" r ] }
{ [ "biliti" ?tail ] [ "biliti" "ble" r ] }
{ [ "logi" ?tail ] [ "logi" "log" r ] }
{ [ t ] [ ] }
} cond ;
: step3 ( str -- str )
#! step3 deals with -ic-, -full, -ness etc. similar
#! jstrategy to step2.
{
{ [ "icate" ?tail ] [ "icate" "ic" r ] }
{ [ "ative" ?tail ] [ "ative" "" r ] }
{ [ "alize" ?tail ] [ "alize" "al" r ] }
{ [ "iciti" ?tail ] [ "iciti" "ic" r ] }
{ [ "ical" ?tail ] [ "ical" "ic" r ] }
{ [ "ful" ?tail ] [ "ful" "" r ] }
{ [ "ness" ?tail ] [ "ness" "" r ] }
{ [ t ] [ ] }
} cond ;
: -ion ( str -- str )
dup empty? [
drop "ion"
] [
dup "st" last-is? [ "ion" append ] unless
] if ;
: step4 ( str -- str )
dup {
{ [ "al" ?tail ] [ ] }
{ [ "ance" ?tail ] [ ] }
{ [ "ence" ?tail ] [ ] }
{ [ "er" ?tail ] [ ] }
{ [ "ic" ?tail ] [ ] }
{ [ "able" ?tail ] [ ] }
{ [ "ible" ?tail ] [ ] }
{ [ "ant" ?tail ] [ ] }
{ [ "ement" ?tail ] [ ] }
{ [ "ment" ?tail ] [ ] }
{ [ "ent" ?tail ] [ ] }
{ [ "ion" ?tail ] [ -ion ] }
{ [ "ou" ?tail ] [ ] }
{ [ "ism" ?tail ] [ ] }
{ [ "ate" ?tail ] [ ] }
{ [ "iti" ?tail ] [ ] }
{ [ "ous" ?tail ] [ ] }
{ [ "ive" ?tail ] [ ] }
{ [ "ize" ?tail ] [ ] }
{ [ t ] [ ] }
} cond dup consonant-seq 1 > [ nip ] [ drop ] if ;
: remove-e? ( str -- ? )
dup consonant-seq dup 1 >
[ 2drop t ]
[ 1 = [ butlast cvc? not ] [ drop f ] if ] if ;
: remove-e ( str -- str )
#! removes a final -e if consonant-seq > 1
dup peek CHAR: e = [
dup remove-e? [ butlast ] when
] when ;
: ll->l ( str -- str )
#! changes -ll to -l if consonant-seq > 1.
{
{ [ dup peek CHAR: l = not ] [ ] }
{ [ dup length 1- over double-consonant? not ] [ ] }
{ [ dup consonant-seq 1 > ] [ butlast ] }
{ [ t ] [ ] }
} cond ;
: step5 ( str -- str ) remove-e ll->l ;
: stem ( str -- str )
dup length 2 <= [
step1a step1b step1c step2 step3 step4 step5 "" like
] unless ;

View File

@ -26,8 +26,7 @@ strings styles ;
: resource-path ( path -- path )
image parent-dir swap path+ ;
: <resource-stream> ( path -- stream )
#! Open a file path relative to the Factor source code root.
: <resource-reader> ( path -- stream )
resource-path <file-reader> ;
: (file.) ( name path -- )

View File

@ -46,7 +46,7 @@ HELP: resource-path "( resource -- path )"
{ $values { "resource" "a string" } { "path" "a string" } }
{ $description "Resolve a path relative to the Factor source code location." } ;
HELP: <resource-stream> "( resource -- stream )"
HELP: <resource-reader> "( resource -- stream )"
{ $values { "resource" "a string" } { "stream" "an input stream" } }
{ $description "Opens a file relative to the Factor source code location." } ;

View File

@ -37,7 +37,7 @@ words ;
: parse-resource ( path -- quot )
dup parsing-file
[ <resource-stream> "resource:" ] keep append parse-stream ;
[ <resource-reader> "resource:" ] keep append parse-stream ;
: run-resource ( file -- ) parse-resource call ;

23531
library/test/help/output.txt Normal file

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,63 @@
IN: temporary
USING: arrays io kernel porter-stemmer sequences test ;
[ 0 ] [ "xa" consonant-seq ] unit-test
[ 0 ] [ "xxaa" consonant-seq ] unit-test
[ 1 ] [ "xaxa" consonant-seq ] unit-test
[ 2 ] [ "xaxaxa" consonant-seq ] unit-test
[ 3 ] [ "xaxaxaxa" consonant-seq ] unit-test
[ 3 ] [ "zzzzxaxaxaxaeee" consonant-seq ] unit-test
[ t ] [ 0 "fish" consonant? ] unit-test
[ f ] [ 0 "and" consonant? ] unit-test
[ t ] [ 0 "yes" consonant? ] unit-test
[ f ] [ 1 "gym" consonant? ] unit-test
[ t ] [ 5 "splitting" double-consonant? ] unit-test
[ f ] [ 2 "feel" double-consonant? ] unit-test
[ f ] [ "xxxz" stem-vowel? ] unit-test
[ t ] [ "baobab" stem-vowel? ] unit-test
[ t ] [ "hop" cvc? ] unit-test
[ t ] [ "cav" cvc? ] unit-test
[ t ] [ "lov" cvc? ] unit-test
[ t ] [ "crim" cvc? ] unit-test
[ f ] [ "show" cvc? ] unit-test
[ f ] [ "box" cvc? ] unit-test
[ f ] [ "tray" cvc? ] unit-test
[ f ] [ "meet" cvc? ] unit-test
[ "caress" ] [ "caresses" step1a step1b "" like ] unit-test
[ "poni" ] [ "ponies" step1a step1b "" like ] unit-test
[ "ti" ] [ "ties" step1a step1b "" like ] unit-test
[ "caress" ] [ "caress" step1a step1b "" like ] unit-test
[ "cat" ] [ "cats" step1a step1b "" like ] unit-test
[ "feed" ] [ "feed" step1a step1b "" like ] unit-test
[ "agree" ] [ "agreed" step1a step1b "" like ] unit-test
[ "disable" ] [ "disabled" step1a step1b "" like ] unit-test
[ "mat" ] [ "matting" step1a step1b "" like ] unit-test
[ "mate" ] [ "mating" step1a step1b "" like ] unit-test
[ "meet" ] [ "meeting" step1a step1b "" like ] unit-test
[ "mill" ] [ "milling" step1a step1b "" like ] unit-test
[ "mess" ] [ "messing" step1a step1b "" like ] unit-test
[ "meet" ] [ "meetings" step1a step1b "" like ] unit-test
[ "fishi" ] [ "fishy" step1c ] unit-test
[ "by" ] [ "by" step1c ] unit-test
[ "realizat" ] [ "realization" step4 ] unit-test
[ "ion" ] [ "ion" step4 ] unit-test
[ "able" ] [ "able" step4 ] unit-test
[ "fear" ] [ "feare" step5 "" like ] unit-test
[ "mate" ] [ "mate" step5 "" like ] unit-test
[ "hell" ] [ "hell" step5 "" like ] unit-test
[ "mate" ] [ "mate" step5 "" like ] unit-test
[ { } ] [
"/library/test/help/voc.txt" <resource-reader> lines
[ stem ] map
"/library/test/help/output.txt" <resource-reader> lines
[ 2array ] 2map [ first2 = not ] subset
] unit-test

23531
library/test/help/voc.txt Normal file

File diff suppressed because it is too large Load Diff

View File

@ -10,34 +10,34 @@ USING: io kernel math parser strings test ;
"This is a line."
"This is another line."
] [
"/library/test/io/windows-eol.txt" <resource-stream> lines-test
"/library/test/io/windows-eol.txt" <resource-reader> lines-test
] unit-test
[
"This is a line."
"This is another line."
] [
"/library/test/io/mac-os-eol.txt" <resource-stream> lines-test
"/library/test/io/mac-os-eol.txt" <resource-reader> lines-test
] unit-test
[
"This is a line."
"This is another line."
] [
"/library/test/io/unix-eol.txt" <resource-stream> lines-test
"/library/test/io/unix-eol.txt" <resource-reader> lines-test
] unit-test
[
"This is a line.\rThis is another line.\r"
] [
"/library/test/io/mac-os-eol.txt" <resource-stream>
"/library/test/io/mac-os-eol.txt" <resource-reader>
[ 500 read ] with-stream
] unit-test
[
255
] [
"/library/test/io/binary.txt" <resource-stream>
"/library/test/io/binary.txt" <resource-reader>
[ read1 ] with-stream >fixnum
] unit-test

View File

@ -82,7 +82,7 @@ SYMBOL: failures
"inference" "interpreter" "alien"
"gadgets/line-editor" "gadgets/rectangles" "memory"
"redefine" "annotate" "binary" "inspector"
"kernel"
"kernel" "help/porter-stemmer"
} run-tests ;
: benchmarks

View File

@ -53,9 +53,6 @@ sequences strings vectors words ;
-rot [ (instances) ] 2keep
] each-object nip ; inline
: hash+ ( n key hash -- )
[ hash [ 0 ] unless* + ] 2keep set-hash ;
: heap-stat-step ( counts sizes obj -- )
[ dup size swap class rot hash+ ] keep
1 swap class rot hash+ ;