! factor/core/strings/parser/parser.factor

! Copyright (C) 2008, 2009 Slava Pestov, Doug Coleman.
! See http://factorcode.org/license.txt for BSD license.
USING: accessors assocs combinators continuations kernel
kernel.private lexer math math.parser namespaces sbufs sequences
splitting strings ;
IN: strings.parser

ERROR: bad-escape char ;

: escape ( escape -- ch )
    H{
        { CHAR: a  CHAR: \a }
        { CHAR: b  CHAR: \b }
        { CHAR: e  CHAR: \e }
        { CHAR: f  CHAR: \f }
        { CHAR: n  CHAR: \n }
        { CHAR: r  CHAR: \r }
        { CHAR: t  CHAR: \t }
        { CHAR: s  CHAR: \s }
        { CHAR: v  CHAR: \v }
        { CHAR: \s CHAR: \s }
        { CHAR: 0  CHAR: \0 }
        { CHAR: \\ CHAR: \\ }
        { CHAR: \" CHAR: \" }
    } ?at [ bad-escape ] unless ;
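! Usage sketch (not part of the original source; assumes this vocabulary is
! loaded in the listener):
!     CHAR: n escape    ! leaves 10, the code point of a newline
!     CHAR: q escape    ! throws bad-escape, since q is not in the table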

SYMBOL: name>char-hook

name>char-hook [
    [ "Unicode support not available" throw ]
] initialize

! Reads the two hex digits that follow "\x" and leaves the code point
! plus the remaining input.
: hex-escape ( str -- ch str' )
    2 cut-slice [ hex> ] dip ;

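! Handles what follows "\u": either {name}, resolved through name>char-hook,
! or exactly six hex digits.
: unicode-escape ( str -- ch str' )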
    "{" ?head-slice [
        CHAR: } over index cut-slice
        [ >string name>char-hook get call( name -- char ) ] dip
        rest-slice
    ] [
        6 cut-slice [ hex> ] dip
    ] if ;

: next-escape ( str -- ch str' )
    unclip-slice {
        { CHAR: u [ unicode-escape ] }
        { CHAR: x [ hex-escape ] }
        [ escape swap ]
    } case ;
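! Usage sketch (not part of the original source; assumes this vocabulary is
! loaded in the listener). The input is the text just after a backslash:
!     "x41rest" next-escape    ! leaves 65 and a slice holding "rest"
!     "nrest" next-escape      ! leaves 10 and a slice holding "rest"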

<PRIVATE

: (unescape-string) ( accum str i/f -- accum )
    { sbuf object object } declare
    [
        cut-slice [ append! ] dip
        rest-slice next-escape [ suffix! ] dip
        CHAR: \\ over index (unescape-string)
    ] [
        append!
    ] if* ;

PRIVATE>

: unescape-string ( str -- str' )
    CHAR: \\ over index [
        [ [ length <sbuf> ] keep ] dip (unescape-string)
    ] when* "" like ;
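! Usage sketch (not part of the original source; assumes this vocabulary is
! loaded in the listener). The literal below holds a, backslash, t, b:
!     "a\\tb" unescape-string    ! leaves a three-character string: a, tab, b
! A string that contains no backslash comes back with the same content.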

<PRIVATE

: lexer-subseq ( i lexer -- before )
    { fixnum lexer } declare
    [ [ column>> ] [ line-text>> ] bi swapd subseq ]
    [ column<< ] 2bi ;

: rest-of-line ( lexer -- seq )
    { lexer } declare
    [ line-text>> ] [ column>> ] bi tail-slice ;

: current-char ( lexer -- ch/f )
    { lexer } declare
    [ column>> ] [ line-text>> ] bi ?nth ;

: advance-char ( lexer -- )
    { lexer } declare
    [ 1 + ] change-column drop ;

: next-char ( lexer -- ch/f )
    { lexer } declare
    dup still-parsing-line? [
        [ current-char ] [ advance-char ] bi
    ] [
        drop f
    ] if ;

: next-line% ( accum lexer -- )
    { sbuf lexer } declare
    [ rest-of-line swap push-all ] [ next-line ] bi ;

: find-next-token ( lexer -- i elt )
    { lexer } declare
    [ column>> ] [ line-text>> ] bi
    [ "\"\\" member? ] find-from ;

DEFER: (parse-string)

: parse-found-token ( accum lexer i elt -- )
    { sbuf lexer fixnum fixnum } declare
    [ over lexer-subseq pick push-all ] dip
    CHAR: \ = [
        dup dup [ next-char ] bi@
        [ [ pick push ] bi@ ]
        [ drop 2dup next-line% ] if*
        (parse-string)
    ] [
        advance-char drop
    ] if ;

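! Accumulates raw string text up to the closing quote, continuing onto
! following lines when needed. Escape sequences are copied through as-is
! (unescape-string decodes them afterwards); a backslash at the end of a
! line joins that line with the next one.
: (parse-string) ( accum lexer -- )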
    { sbuf lexer } declare
    dup still-parsing? [
        dup find-next-token [
            parse-found-token
        ] [
            drop 2dup next-line%
            CHAR: \n pick push
            (parse-string)
        ] if*
    ] [
        "Unterminated string" throw
    ] if ;

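! Runs quot; if it throws, restores the lexer's saved line, line text, and
! column before rethrowing the error.
: rewind-lexer-on-error ( quot -- )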
    lexer get [ line>> ] [ line-text>> ] [ column>> ] tri
    [
        lexer get [ column<< ] [ line-text<< ] [ line<< ] tri
        rethrow
    ] 3curry recover ; inline

PRIVATE>

: parse-string ( -- str )
    [
        SBUF" " clone [
            lexer get (parse-string)
        ] keep unescape-string
    ] rewind-lexer-on-error ;