|  | <html> | ||
|  |   <head> | ||
|  |     <title>Parser Combinators</title> | ||
|  |     <link rel="stylesheet" type="text/css" href="style.css"> | ||
|  |       </head> | ||
|  |   <body> | ||
|  |     <h1>Parsers</h1> | ||
|  | <p class="note">The parser combinator library described here is based | ||
|  |   on a library written for the Clean pure functional programming language and | ||
|  |   described in chapter 5 of the 'Clean Book' (<a | ||
|  |   href="ftp://ftp.cs.kun.nl/pub/Clean/papers/cleanbook/II.05.ParserCombinators.pdf">PDF | ||
|  |   available here</a>). Based on the description | ||
|  |   in that chapter I developed a version for Factor, a concatenative | ||
|  |   language.</p>   | ||
|  | <p>A parser is a word or quotation that, when called, processes | ||
|  |    an input string on the stack, performs some parsing operation on | ||
|  |    it, and returns a result indicating the success of the parsing | ||
|  |    operation.</p>  | ||
|  | <p>The result returned by a parser is known as a 'list of | ||
|  | successes'. It is a lazy list of standard Factor cons cells. Each cons | ||
|  | cell is a result of a parse. The car of the cell is the remaining | ||
|  | input left to be parsed and the cdr of the cell is the result of the | ||
|  | parsing operation.</p> | ||
|  | <p>A lazy list is used for the result as a parse operation can potentially | ||
|  | return many successful results. For example, a parser that parses one | ||
|  | or more digits will return more than one result for the input "123". A | ||
|  | successful parse could be "1", "12" or "123".</p> | ||
|  | <p>Because the list is lazy, when only one parse result is required the | ||
|  | remaining results are never computed. This avoids unnecessary | ||
|  | work.</p> | ||
|  | <p>The cdr of the result pair can be any value that the parser wishes | ||
|  | to return. It could be the successful portion of the input string | ||
|  | parsed, an abstract syntax tree representing the parsed input, or even | ||
|  | a quotation that should get called for later processing.</p> | ||
|  | <p>A parser combinator is a word that takes one or more parsers and | ||
|  | returns a new parser that, when called, uses the original parsers in some | ||
|  | manner.</p> | ||
|  | <h1>Example Parsers</h1> | ||
|  | <p>The following are some very simple parsers that demonstrate how | ||
|  | parsers work in general and the 'list of successes' that is returned as a | ||
|  | result.</p> | ||
|  | <pre class="code"> | ||
|  |   (1) : char-a ( inp -- result ) | ||
|  |         0 over string-nth CHAR: a = [ | ||
|  |           1 swap string-tail CHAR: a cons unit delay lunit | ||
|  |         ] [ | ||
|  |           drop lnil | ||
|  |         ] ifte ; | ||
|  |   (2) "atest" char-a [ [ . ] leach ] when* | ||
|  |       => [[ "test" 97 ]] | ||
|  |   (3) "test"  char-a [ [ . ] leach ] when* | ||
|  |       => | ||
|  | </pre> | ||
|  | <p>'char-a' is a parser that only accepts the character 'a' at the start of the | ||
|  | input string. When passed an input string with a leading | ||
|  | 'a', the 'list of successes' has one result value. The cdr of that | ||
|  | result value is the character 'a' successfully parsed, and the car is | ||
|  | the remaining input string. If the parse fails, an empty list is | ||
|  | returned.</p>  | ||
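<p>As a rough illustration of the 'list of successes' idea, here is a minimal Python sketch of a 'char-a'-style parser. It is an analogue, not the Factor library: the name 'char_a' and the use of a Python generator in place of a lazy list are assumptions made for this example. Each result is a (remaining input, parsed value) pair, mirroring the car and cdr described above.</p>

```python
from typing import Iterator, Tuple

# (remaining input, parsed value), mirroring the car/cdr of a result cell.
Result = Tuple[str, int]


def char_a(inp: str) -> Iterator[Result]:
    """Hypothetical analogue of 'char-a': accept one leading 'a'.

    A generator stands in for the lazy list: yielding nothing
    corresponds to the empty list returned on failure.
    """
    if inp[:1] == "a":
        yield inp[1:], ord("a")


print(list(char_a("atest")))  # [('test', 97)]
print(list(char_a("test")))   # []
```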
|  | <p>The parser combinator library provides a combinator, <&>, that takes | ||
|  | two parsers off the stack and returns a parser that calls the original | ||
|  | two in sequence. For example, combining 'char-a' with itself | ||
|  | produces a parser that expects the input string to start with two 'a' | ||
|  | characters:</p> | ||
|  | <pre class="code"> | ||
|  |   (1) "aatest" [ char-a ] [ char-a ] <&> call | ||
|  |       => < list of successes > | ||
|  |   (2) [ . ] leach | ||
|  |       => [[ "test" [[ 97 97 ]] ]] | ||
|  | </pre> | ||
|  | <h2>Tokens</h2> | ||
|  | <p>Creating parsers for specific characters and tokens can be a chore, | ||
|  | so there is a word that, given a string token on the stack, returns | ||
|  | a parser that parses that particular token:</p> | ||
|  | <pre class="code"> | ||
|  |   (1) "begin" token  | ||
|  |       => < a parser that parses the token "begin" > | ||
|  |   (2) dup "this should fail" swap call lnil? . | ||
|  |       => t | ||
|  |   (3) "begin a successful parse" swap call  | ||
|  |       => < lazy list > | ||
|  |   (4) [ . ] leach | ||
|  |       => [[ " a successful parse" "begin" ]] | ||
|  | </pre> | ||
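<p>The same idea can be sketched in Python. This is only an analogue of the 'token' word, under the assumption that a parser is a function from an input string to a generator of (remaining input, value) pairs; the names 'token' and 'begin' below are chosen for this example.</p>

```python
from typing import Callable, Iterator, Tuple

Result = Tuple[str, str]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]


def token(tok: str) -> Parser:
    """Build a parser that accepts exactly the string `tok` (illustrative)."""
    def parse(inp: str) -> Iterator[Result]:
        if inp.startswith(tok):
            yield inp[len(tok):], tok
    return parse


begin = token("begin")
print(list(begin("this should fail")))  # []
print(list(begin("begin a parse")))     # [(' a parse', 'begin')]
```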
|  | <h2>Predicate matching</h2> | ||
|  | <p>The word 'satisfy' takes a quotation from the top of the stack and | ||
|  | returns a parser that, when called, calls the quotation with the | ||
|  | first character of the input string on the stack. If the quotation returns | ||
|  | true then the parse is successful, otherwise it fails:</p> | ||
|  | <pre class="code"> | ||
|  |   (1) : digit-parser ( -- parser ) | ||
|  |         [ digit? ] satisfy ; | ||
|  |   (2) "5" digit-parser call [ . ] leach | ||
|  |       => [[ "" 53 ]] | ||
|  |   (3) "a" digit-parser call lnil? . | ||
|  |       => t | ||
|  | </pre> | ||
|  | <p>Note that 'digit-parser' returns a parser; it is not the parser | ||
|  | itself. It is a parser-generating word, like 'token', whereas | ||
|  | the 'char-a' word defined earlier is itself a parser.</p> | ||
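<p>The distinction between a parser and a parser-generating word may be easier to see in a small Python sketch. It is an analogue only: 'satisfy' and 'digit_parser' are names assumed for this example, with a predicate function standing in for the quotation.</p>

```python
from typing import Callable, Iterator, Tuple

Result = Tuple[str, int]                    # (remaining input, character code)
Parser = Callable[[str], Iterator[Result]]


def satisfy(pred: Callable[[str], bool]) -> Parser:
    """Build a parser that accepts one character for which `pred` is true."""
    def parse(inp: str) -> Iterator[Result]:
        if inp and pred(inp[0]):
            # Yield the character code, as the Factor example prints 53 for "5".
            yield inp[1:], ord(inp[0])
    return parse


def digit_parser() -> Parser:
    """A parser-generating function: calling it returns the parser."""
    return satisfy(lambda c: "0" <= c <= "9")


print(list(digit_parser()("5")))  # [('', 53)]
print(list(digit_parser()("a")))  # []
```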
|  | <h2>Zero or more matches</h2> | ||
|  | <p>Now that we can parse single digits it would be nice to easily | ||
|  | parse a string of them. The '<*>' parser combinator word will do | ||
|  | this. It accepts a parser on the top of the stack and produces a | ||
|  | parser that parses zero or more of the constructs that the original | ||
|  | parser parsed. The result of the '<*>' generated parser will be a list | ||
|  | of the successful results returned by the original parser.</p> | ||
|  | <pre class="code"> | ||
|  |   (1) digit-parser <*> | ||
|  |       => < parser > | ||
|  |   (2) "123" swap call | ||
|  |       => < lazy list > | ||
|  |   (3) [ . ] leach | ||
|  |       => [ "" [ 49 50 51 ] ] | ||
|  |            [ "3" [ 49 50 ] ] | ||
|  |            [ "23" [ 49 ] ] | ||
|  |            [ "123" ] | ||
|  | </pre> | ||
|  | <p>In this case there are multiple successful parses. This is because | ||
|  | 'zero or more digits' matches every prefix of the digits in the input. There is | ||
|  | also the empty ('f') case where zero digits are parsed. If only the 'longest | ||
|  | match' is required then the lcar of the lazy list can be used and the | ||
|  | remaining parse results are never produced.</p> | ||
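<p>A hedged Python sketch of a zero-or-more combinator shows where the multiple results come from. The names 'satisfy' and 'many' are assumptions for this example, and generators stand in for lazy lists; the sketch also assumes the wrapped parser always consumes input when it succeeds, otherwise the recursion would not terminate.</p>

```python
from typing import Any, Callable, Iterator, Tuple

Result = Tuple[str, Any]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]


def satisfy(pred: Callable[[str], bool]) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        if inp and pred(inp[0]):
            yield inp[1:], ord(inp[0])
    return parse


def many(p: Parser) -> Parser:
    """Zero or more repetitions of `p`, longest match first."""
    def parse(inp: str) -> Iterator[Result]:
        for rest, value in p(inp):
            for rest2, values in parse(rest):
                yield rest2, [value] + values
        # The zero-repetition case comes last, so longer matches win.
        yield inp, []
    return parse


digits = many(satisfy(lambda c: "0" <= c <= "9"))

# Taking only the first result leaves the shorter matches uncomputed.
print(next(digits("123")))  # ('', [49, 50, 51])
print(list(digits("123")))
# [('', [49, 50, 51]), ('3', [49, 50]), ('23', [49]), ('123', [])]
```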
|  | <h2>Manipulating parse trees</h2> | ||
|  | <p>The result of the previous parse was the list of characters | ||
|  | parsed. Sometimes you want this to be something else, like an abstract | ||
|  | syntax tree, or some calculation. For the digit case we may want the | ||
|  | actual integer number.</p> | ||
|  | <p>For this we can use the '<@' parser | ||
|  | combinator. This combinator takes a parser and a quotation on the | ||
|  | stack and returns a new parser. When the new parser is called it will | ||
|  | call the original parser to produce the results, then it will call the | ||
|  | quotation on each successful result, and the result of that quotation | ||
|  | will be the result of the parse:</p> | ||
|  | <pre class="code"> | ||
|  |   (1) : digit-parser2 ( -- parser ) | ||
|  |         [ digit? ] satisfy [ digit> ] <@ ; | ||
|  |   (2) "5" digit-parser2 call [ . ] leach | ||
|  |       => [[ "" 5 ]] | ||
|  | </pre> | ||
|  | <p>Notice that now the result is the actual integer '5' rather than | ||
|  | character code '53'.</p> | ||
|  | <pre class="code"> | ||
|  |   (1) : digit-list>number ( list -- number ) | ||
|  |          #! Converts a list of digits to a number | ||
|  |          [ >digit ] map >string dup empty? [  | ||
|  |            drop 0  | ||
|  |          ] [ | ||
|  |            str>number  | ||
|  |          ]  ifte ; | ||
|  |   (2) : natural-parser ( -- parser ) | ||
|  |         digit-parser2 <*> [ car digit-list>number unit  ] <@  ; | ||
|  |   (3) "123" natural-parser call | ||
|  |       => < lazy list > | ||
|  |   (4) [ . ] leach | ||
|  |       => [ "" 123 ] | ||
|  |            [ "3" 12 ] | ||
|  |            [ "23" 1 ] | ||
|  |            [ "123" 0 ] | ||
|  | </pre> | ||
|  | <p>The number parsed is the actual integer number due to the operation | ||
|  | of the '<@' word. This allows parsers to not only parse the input | ||
|  | string but perform operations and transformations on the syntax tree | ||
|  | returned.</p> | ||
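<p>The effect of transforming each result can be sketched in Python as follows. This is an analogue of the idea rather than the library's implementation: 'transform', 'many', 'satisfy' and 'digits_to_number' are names assumed for this example.</p>

```python
from typing import Any, Callable, Iterator, List, Tuple

Result = Tuple[str, Any]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]


def satisfy(pred: Callable[[str], bool]) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        if inp and pred(inp[0]):
            yield inp[1:], ord(inp[0])
    return parse


def many(p: Parser) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        for rest, value in p(inp):
            for rest2, values in parse(rest):
                yield rest2, [value] + values
        yield inp, []
    return parse


def transform(p: Parser, f: Callable[[Any], Any]) -> Parser:
    """Apply `f` to the value of every successful result of `p`."""
    def parse(inp: str) -> Iterator[Result]:
        for rest, value in p(inp):
            yield rest, f(value)
    return parse


def digits_to_number(digits: List[int]) -> int:
    """Fold a list of digit values into a number; an empty list gives 0."""
    number = 0
    for d in digits:
        number = number * 10 + d
    return number


digit = satisfy(lambda c: "0" <= c <= "9")
digit2 = transform(digit, lambda code: code - ord("0"))  # 53 becomes 5
natural = transform(many(digit2), digits_to_number)

print(list(digit2("5")))     # [('', 5)]
print(list(natural("123")))  # [('', 123), ('3', 12), ('23', 1), ('123', 0)]
```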
|  | <p>A useful debugging method to work out what to use in the quotation | ||
|  | passed to <@ is to write an initial version of the parser that just | ||
|  | displays the topmost item on the stack:</p> | ||
|  | <pre class="code"> | ||
|  |   (1) : natural-parser-debug ( -- parser ) | ||
|  |         digit-parser2 <*> [ "debug: " write dup . ] <@  ; | ||
|  |   (2) "123" natural-parser-debug call lcar . | ||
|  |       => debug: [ [ 1 2 3 ] ] | ||
|  |            [ "" [ 1 2 3 ] ] | ||
|  | </pre> | ||
|  | <p>From the debug output we can see how to manipulate the result to | ||
|  | get what we want. In this case it's the quotation in the previous example.</p> | ||
|  |   | ||
|  | <h2>Sequential combinator</h2> | ||
|  | <p>To create a full grammar we need a parser combinator that does | ||
|  | sequential composition. That is, given two parsers, the sequential | ||
|  | combinator will first run the first parser, and then run the second on | ||
|  | the remaining text to be parsed. As the first parser returns a lazy | ||
|  | list, the second parser will be run on each item of the lazy list. Of | ||
|  | course this is done lazily so it only ends up being done when those | ||
|  | list items are requested. The sequential combinator word is <&>.</p> | ||
|  | <pre class="code"> | ||
|  |   ( 1 ) "number:" token  | ||
|  |        => < parser that parses the text 'number:' > | ||
|  |   ( 2 ) natural-parser | ||
|  |        => < parser that parses natural numbers > | ||
|  |   ( 3 ) <&> | ||
|  |        => < parser that parses 'number:' followed by a natural > | ||
|  |   ( 4 ) "number:100" swap call | ||
|  |        => < list of successes > | ||
|  |   ( 5 ) [ . ] leach | ||
|  |        => [ "" "number:" 100 ] | ||
|  |             [ "0" "number:" 10 ] | ||
|  |             [ "00" "number:" 1 ] | ||
|  |             [ "100" "number:" 0 ] | ||
|  | </pre> | ||
|  | <p>In this example we might prefer that the parse result not | ||
|  | contain the token, just the number. Two alternatives to <&> | ||
|  | let us select which result to keep from the two | ||
|  | parsers. These operators are <& and &>. The < or > points  | ||
|  | towards the parser whose result is retained. So our | ||
|  | example above could be:</p> | ||
|  | <pre class="code"> | ||
|  |   ( 1 ) "number:" token  | ||
|  |        => < parser that parses the text 'number:' > | ||
|  |   ( 2 ) natural-parser | ||
|  |        => < parser that parses natural numbers > | ||
|  |   ( 3 ) &> | ||
|  |        => < parser that parses 'number:' followed by a natural > | ||
|  |   ( 4 ) "number:100" swap call | ||
|  |        => < list of successes > | ||
|  |   ( 5 ) [ . ] leach | ||
|  |        => [ "" 100 ] | ||
|  |             [ "0" 10 ] | ||
|  |             [ "00" 1 ] | ||
|  |             [ "100" 0 ] | ||
|  | </pre> | ||
|  | <p>Notice how the parse result only contains the number due to &> | ||
|  | being used to retain the result of the second parser.</p> | ||
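<p>A Python sketch of sequential composition, with variants that keep only one side, might look like this. The names 'seq', 'keep_left' and 'keep_right' are assumptions for this example, and 'natural' is a simplified hand-written stand-in for natural-parser that yields every digit prefix, longest first.</p>

```python
from typing import Any, Callable, Iterator, Tuple

Result = Tuple[str, Any]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]


def token(tok: str) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        if inp.startswith(tok):
            yield inp[len(tok):], tok
    return parse


def natural(inp: str) -> Iterator[Result]:
    """Stand-in natural-number parser: all digit prefixes, longest first."""
    n = 0
    while n < len(inp) and "0" <= inp[n] <= "9":
        n += 1
    for end in range(n, -1, -1):
        yield inp[end:], int(inp[:end] or "0")


def seq(p1: Parser, p2: Parser) -> Parser:
    """Run `p2` on the remaining input of every result of `p1`."""
    def parse(inp: str) -> Iterator[Result]:
        for rest1, left in p1(inp):
            for rest2, right in p2(rest1):
                yield rest2, (left, right)
    return parse


def keep_left(p1: Parser, p2: Parser) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        for rest, (left, _right) in seq(p1, p2)(inp):
            yield rest, left
    return parse


def keep_right(p1: Parser, p2: Parser) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        for rest, (_left, right) in seq(p1, p2)(inp):
            yield rest, right
    return parse


number = keep_right(token("number:"), natural)
print(list(number("number:100")))
# [('', 100), ('0', 10), ('00', 1), ('100', 0)]
```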
|  | 
 | ||
|  | <h2>Choice combinator</h2> | ||
|  | <p>As well as a sequential combinator we need an alternative | ||
|  | combinator. The word for this is <|>. It takes two parsers from the | ||
|  | stack and returns a parser that tries the first parser. If it | ||
|  | succeeds then its result is returned. If it fails then the | ||
|  | second parser is tried and its result is returned.</p> | ||
|  | <pre class="code"> | ||
|  |   ( 1 ) "one" token | ||
|  |         => < parser that parses the text 'one' > | ||
|  |   ( 2 ) "two" token  | ||
|  |         => < parser that parses the text 'two' > | ||
|  |   ( 3 ) <|> | ||
|  |         => < parser that parses 'one' or 'two' > | ||
|  |   ( 4 ) "one" over call [ . ] leach | ||
|  |         => [[ "" "one" ]] | ||
|  |   ( 5 ) "two" swap call [ . ] leach | ||
|  |         => [[ "" "two" ]] | ||
|  | </pre> | ||
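<p>The behaviour described above can be sketched in Python as follows. This follows the wording of this section, treating 'fails' as 'yields no results'; that reading, and the names 'choice' and 'token', are assumptions for this example. Many list-of-successes libraries instead concatenate the results of both parsers.</p>

```python
from typing import Any, Callable, Iterator, Tuple

Result = Tuple[str, Any]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]


def token(tok: str) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        if inp.startswith(tok):
            yield inp[len(tok):], tok
    return parse


def choice(p1: Parser, p2: Parser) -> Parser:
    """Try `p1`; fall back to `p2` only if `p1` produced no results."""
    def parse(inp: str) -> Iterator[Result]:
        produced = False
        for result in p1(inp):
            produced = True
            yield result
        if not produced:
            yield from p2(inp)
    return parse


one_or_two = choice(token("one"), token("two"))
print(list(one_or_two("one")))    # [('', 'one')]
print(list(one_or_two("two")))    # [('', 'two')]
print(list(one_or_two("three")))  # []
```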
|  | 
 | ||
|  | <h2>Option combinator</h2> | ||
|  | <p>The option combinator, <?>, allows adding optional elements to | ||
|  | a parser. It takes one parser off the stack; if that parse succeeds | ||
|  | its result is added to the result tree, otherwise it is ignored and parsing | ||
|  | continues. The example below extends our natural-parser to parse | ||
|  | integers with an optional leading minus sign.</p> | ||
|  | <pre class="code"> | ||
|  |   ( 1 ) : integer-parser | ||
|  |           "-" token <?> natural-parser <&> ; | ||
|  |   ( 2 ) "200" integer-parser call [ . ] leach  | ||
|  |        => [ "" [ ] 200 ] | ||
|  |             [ "0" [ ] 20 ] | ||
|  |             [ "00" [ ] 2 ] | ||
|  |             [ "200" [ ] 0 ] | ||
|  |   ( 3 ) "-200" integer-parser call [ . ] leach | ||
|  |        => [ "" [ "-" ] 200 ] | ||
|  |             [ "0" [ "-" ] 20 ] | ||
|  |             [ "00" [ "-" ] 2 ] | ||
|  |             [ "200" [ "-" ] 0 ] | ||
|  |             [ "-200" [ ] 0 ] | ||
|  |   ( 4 ) : integer-parser2 | ||
|  |           integer-parser [ uncons swap [ car -1 * ] when ] <@ ; | ||
|  |   ( 5 ) "200" integer-parser2 call [ . ] leach  | ||
|  |        => [ "" 200 ] | ||
|  |             [ "0" 20 ] | ||
|  |             [ "00" 2 ] | ||
|  |             [ "200" 0 ] | ||
|  |   ( 6 ) "-200" integer-parser2 call [ . ] leach | ||
|  |        => [ "" -200 ] | ||
|  |             [ "0" -20 ] | ||
|  |             [ "00" -2 ] | ||
|  |             [ "200" 0 ] | ||
|  |             [ "-200" 0 ] | ||
|  | 
 | ||
|  | </pre> | ||
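<p>The results above suggest that the optional parser yields both the 'present' and the 'absent' case. A Python sketch under that assumption follows; 'optional', 'seq', 'token' and the simplified 'natural' stand-in are names chosen for this example and are not the library's words.</p>

```python
from typing import Any, Callable, Iterator, Tuple

Result = Tuple[str, Any]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]


def token(tok: str) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        if inp.startswith(tok):
            yield inp[len(tok):], tok
    return parse


def natural(inp: str) -> Iterator[Result]:
    """Stand-in natural-number parser: all digit prefixes, longest first."""
    n = 0
    while n < len(inp) and "0" <= inp[n] <= "9":
        n += 1
    for end in range(n, -1, -1):
        yield inp[end:], int(inp[:end] or "0")


def seq(p1: Parser, p2: Parser) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        for rest1, left in p1(inp):
            for rest2, right in p2(rest1):
                yield rest2, (left, right)
    return parse


def optional(p: Parser) -> Parser:
    """Yield `p`'s results wrapped in a list, then the 'absent' case."""
    def parse(inp: str) -> Iterator[Result]:
        for rest, value in p(inp):
            yield rest, [value]
        yield inp, []
    return parse


def integer(inp: str) -> Iterator[Result]:
    """Optional minus sign followed by a natural number."""
    for rest, (sign, number) in seq(optional(token("-")), natural)(inp):
        yield rest, -number if sign else number


print(list(integer("200")))
# [('', 200), ('0', 20), ('00', 2), ('200', 0)]
print(list(integer("-200")))
# [('', -200), ('0', -20), ('00', -2), ('200', 0), ('-200', 0)]
```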
|  | 
 | ||
|  | <h2>Skipping Whitespace</h2> | ||
|  | <p>A parser transformer exists, the word 'sp', that takes an existing | ||
|  | parser and returns a new one that first skips any leading whitespace | ||
|  | before calling the original parser. This makes it easy to write | ||
|  | grammars that ignore whitespace without having to explicitly code it | ||
|  | into the grammar.</p> | ||
|  | <pre class="code"> | ||
|  |   ( 1 ) "  123" natural-parser call [ . ] leach | ||
|  |         => [ "  123" 0 ] | ||
|  |   ( 2 ) "  123" natural-parser sp call [ . ] leach | ||
|  |         => [ "" 123 ] | ||
|  |              [ "3" 12 ] | ||
|  |              [ "23" 1 ] | ||
|  |              [ "123" 0 ] | ||
|  | </pre> | ||
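<p>In Python the same transformation is a small wrapper. The sketch below is an analogue only: 'sp' here is a Python function named after the Factor word, and 'natural' is a simplified stand-in for natural-parser.</p>

```python
from typing import Any, Callable, Iterator, Tuple

Result = Tuple[str, Any]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]


def natural(inp: str) -> Iterator[Result]:
    """Stand-in natural-number parser: all digit prefixes, longest first."""
    n = 0
    while n < len(inp) and "0" <= inp[n] <= "9":
        n += 1
    for end in range(n, -1, -1):
        yield inp[end:], int(inp[:end] or "0")


def sp(p: Parser) -> Parser:
    """Skip leading whitespace, then run the wrapped parser."""
    def parse(inp: str) -> Iterator[Result]:
        yield from p(inp.lstrip())
    return parse


print(list(natural("  123")))      # [('  123', 0)]
print(list(sp(natural)("  123")))  # [('', 123), ('3', 12), ('23', 1), ('123', 0)]
```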
|  | <h2>Eval grammar example</h2> | ||
|  | <p>This example presents a simple grammar that parses a number | ||
|  | followed by an operator and another number. A Factor expression that | ||
|  | computes the entered value is then executed.</p> | ||
|  | <pre class="code"> | ||
|  |   ( 1 ) natural-parser | ||
|  |         => < a parser for natural numbers > | ||
|  |   ( 2 ) "/" token "*" token "+" token "-" token <|> <|> <|> | ||
|  |         => < a parser for the operator > | ||
|  |   ( 3 ) sp [ "\\ " swap cat2 eval unit ] <@ | ||
|  |         => < operator parser that skips whitespace and converts to a  | ||
|  |              Factor expression > | ||
|  |   ( 4 ) natural-parser sp | ||
|  |         => < a whitespace skipping natural parser > | ||
|  |   ( 5 ) <&> <&> [ uncons uncons swap append append call ] <@ | ||
|  |         => < a parser that parses the expression, converts it to | ||
|  |              Factor, calls it and puts the result in the parse tree > | ||
|  |   ( 6 ) "123 + 456" over call lcar . | ||
|  |         => [[ "" 579 ]] | ||
|  |   ( 7 ) "300-100" over call lcar . | ||
|  |         => [[ "" 200 ]] | ||
|  |   ( 8 ) "200/2" over call lcar . | ||
|  |         => [[ "" 100 ]] | ||
|  | </pre> | ||
|  | <p>It looks complicated when expanded as above but the entire parser, | ||
|  | factored a little, looks quite readable:</p> | ||
|  | <pre class="code"> | ||
|  |   ( 1 ) : operator ( -- parser ) | ||
|  |           "/" token  | ||
|  |           "*" token <|> | ||
|  |           "+" token <|> | ||
|  |           "-" token <|> | ||
|  |           [ "\\ " swap cat2 eval unit ] <@ ; | ||
|  |   ( 2 ) : expression ( -- parser ) | ||
|  |           natural-parser  | ||
|  |           operator sp <&>   | ||
|  |           natural-parser sp <&>  | ||
|  |           [ uncons swap uncons -rot append append reverse call ] <@ ; | ||
|  |   ( 3 ) "40+2" expression call lcar . | ||
|  |         => [[ "" 42 ]] | ||
|  | </pre> | ||
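<p>For comparison, the whole expression grammar can be sketched in Python. This is an analogue under several assumptions: the operator token is mapped to a function through a lookup table rather than evaluated as code, division is integer division, and all names ('OPS', 'operator_parser', 'expression', and the 'natural' stand-in) are chosen for this example.</p>

```python
import operator
from typing import Any, Callable, Iterator, Tuple

Result = Tuple[str, Any]                    # (remaining input, parsed value)
Parser = Callable[[str], Iterator[Result]]

# Assumption: integer division stands in for the '/' word.
OPS = {"/": operator.floordiv, "*": operator.mul,
       "+": operator.add, "-": operator.sub}


def natural(inp: str) -> Iterator[Result]:
    """Stand-in natural-number parser: all digit prefixes, longest first."""
    n = 0
    while n < len(inp) and "0" <= inp[n] <= "9":
        n += 1
    for end in range(n, -1, -1):
        yield inp[end:], int(inp[:end] or "0")


def operator_parser(inp: str) -> Iterator[Result]:
    """Accept one operator character and yield the matching function."""
    if inp[:1] in OPS:
        yield inp[1:], OPS[inp[:1]]


def sp(p: Parser) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        yield from p(inp.lstrip())
    return parse


def seq(p1: Parser, p2: Parser) -> Parser:
    def parse(inp: str) -> Iterator[Result]:
        for rest1, left in p1(inp):
            for rest2, right in p2(rest1):
                yield rest2, (left, right)
    return parse


def expression(inp: str) -> Iterator[Result]:
    """number, operator, number; yields the computed value."""
    grammar = seq(seq(natural, sp(operator_parser)), sp(natural))
    for rest, ((left, op), right) in grammar(inp):
        try:
            yield rest, op(left, right)
        except ZeroDivisionError:
            continue  # skip alternatives whose right operand is the empty match 0


print(next(expression("123 + 456")))  # ('', 579)
print(next(expression("300-100")))    # ('', 200)
print(next(expression("200/2")))      # ('', 100)
print(next(expression("40+2")))       # ('', 42)
```

<p>Using 'next' takes only the first result, which plays the role of 'lcar' in the examples above.</p>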
|  | <p class="footer"> | ||
|  | News and updates to this software can be obtained from the author's | ||
|  | weblog: <a href="http://radio.weblogs.com/0102385">Chris Double</a>.</p> | ||
|  | <p id="copyright">Copyright (c) 2004, Chris Double. All Rights Reserved.</p> | ||
|  | </body> </html> |