| \documentclass[10pt]{article} |
| \usepackage{palatino} |
| \usepackage{html} |
| \usepackage{color} |
| |
| \setlength{\headsep}{0in} |
| \setlength{\headheight}{0in} |
| \setlength{\textheight}{8.5in} |
| \setlength{\textwidth}{5.9in} |
| \setlength{\oddsidemargin}{0.25in} |
| |
| \definecolor{darkblue}{rgb}{0,0,0.6} |
| \definecolor{darkerblue}{rgb}{0,0,0.3} |
| |
| %% \newcommand{\mysection}[1]{\section{\textcolor{darkblue}{#1}}} |
| %% \newcommand{\mysubsection}[1]{\subsection{\textcolor{darkerblue}{#1}}} |
| \newcommand{\mysection}[1]{\section{#1}} |
| \newcommand{\mysubsection}[1]{\subsection{#1}} |
| |
| \bodytext{bgcolor=white text=black link=#004080 vlink=#006020} |
| |
| \newcommand{\first}{\textsc{first}} |
| \newcommand{\follow}{\textsc{follow}} |
| |
| \begin{document} |
| |
| \begin{center} |
| \hfill \begin{tabular}{c} |
| {\Large The \emph{Yapps} Parser Generator System}\\ |
| \verb|http://theory.stanford.edu/~amitp/Yapps/|\\ |
| Version 2\\ |
| \\ |
| Amit J. Patel\\ |
| \htmladdnormallink{http://www-cs-students.stanford.edu/~amitp/} |
| {http://www-cs-students.stanford.edu/~amitp/} |
| |
| \end{tabular} \hfill \rule{0in}{0in} |
| \end{center} |
| |
| \mysection{Introduction} |
| |
| \emph{Yapps} (\underline{Y}et \underline{A}nother \underline{P}ython |
\underline{P}arser \underline{S}ystem) is an easy-to-use parser
| generator that is written in Python and generates Python code. There |
| are several parser generator systems already available for Python, |
| including \texttt{PyLR, kjParsing, PyBison,} and \texttt{mcf.pars,} |
| but I had different goals for my parser. Yapps is simple, is easy to |
| use, and produces human-readable parsers. It is not the fastest or |
| most powerful parser. Yapps is designed to be used when regular |
| expressions are not enough and other parser systems are too much: |
situations where you might otherwise write your own recursive descent parser.
| |
| Some unusual features of Yapps that may be of interest are: |
| |
| \begin{enumerate} |
| |
| \item Yapps produces recursive descent parsers that are readable by |
| humans, as opposed to table-driven parsers that are difficult to |
| read. A Yapps parser for a simple calculator looks similar to the |
| one that Mark Lutz wrote by hand for \emph{Programming Python.} |
| |
| \item Yapps also allows for rules that accept parameters and pass |
| arguments to be used while parsing subexpressions. Grammars that |
| allow for arguments to be passed to subrules and for values to be |
| passed back are often called \emph{attribute grammars.} In many |
| cases parameterized rules can be used to perform actions at ``parse |
| time'' that are usually delayed until later. For example, |
| information about variable declarations can be passed into the |
| rules that parse a procedure body, so that undefined variables can |
| be detected at parse time. The types of defined variables can be |
| used in parsing as well---for example, if the type of {\tt X} is |
| known, we can determine whether {\tt X(1)} is an array reference or |
| a function call. |
| |
| \item Yapps grammars are fairly easy to write, although there are |
| some inconveniences having to do with ELL(1) parsing that have to be |
| worked around. For example, rules have to be left factored and |
| rules may not be left recursive. However, neither limitation seems |
| to be a problem in practice. |
| |
| Yapps grammars look similar to the notation used in the Python |
| reference manual, with operators like \verb:*:, \verb:+:, \verb:|:, |
| \verb:[]:, and \verb:(): for patterns, names ({\tt tim}) for rules, |
| regular expressions (\verb:"[a-z]+":) for tokens, and \verb:#: for |
| comments. |
| |
| \item The Yapps parser generator is written as a single Python module |
| with no C extensions. Yapps produces parsers that are written |
| entirely in Python, and require only the Yapps run-time module (5k) |
| for support. |
| |
| \item Yapps's scanner is context-sensitive, picking tokens based on |
| the types of the tokens accepted by the parser. This can be |
| helpful when implementing certain kinds of parsers, such as for a |
| preprocessor. |
| |
| \end{enumerate} |
| |
| There are several disadvantages of using Yapps over another parser system: |
| |
| \begin{enumerate} |
| |
| \item Yapps parsers are \texttt{ELL(1)} (Extended LL(1)), which is |
| less powerful than \texttt{LALR} (used by \texttt{PyLR}) or |
| \texttt{SLR} (used by \texttt{kjParsing}), so Yapps would not be a |
| good choice for parsing complex languages. For example, allowing |
| both \texttt{x := 5;} and \texttt{x;} as statements is difficult |
| because we must distinguish based on only one token of lookahead. |
| Seeing only \texttt{x}, we cannot decide whether we have an |
| assignment statement or an expression statement. (Note however |
| that this kind of grammar can be matched with backtracking; see |
| section \ref{sec:future}.) |
| |
| \item The scanner that Yapps provides can only read from strings, not |
| files, so an entire file has to be read in before scanning can |
| begin. It is possible to build a custom scanner, though, so in |
cases where stream input is needed (for example, from the console, a
network, or a large file), the Yapps parser can be given a custom
| scanner that reads from a stream instead of a string. |
| |
| \item Yapps is not designed with efficiency in mind. |
| |
| \end{enumerate} |
| |
Yapps is an easy-to-use parser generator that produces parsers
| similar to what you might write by hand. It is not meant to be a |
| solution for all parsing problems, but instead an aid for those times |
| you would write a parser by hand rather than using one of the more |
| powerful parsing packages available. |
| |
| Yapps 2.0 is easier to use than Yapps 1.0. New features include a |
| less restrictive input syntax, which allows mixing of sequences, |
| choices, terminals, and nonterminals; optional matching; the ability |
| to insert single-line statements into the generated parser; and |
| looping constructs \verb|*| and \verb|+| similar to the repetitive |
| matching constructs in regular expressions. Unfortunately, the |
| addition of these constructs has made Yapps 2.0 incompatible with |
| Yapps 1.0, so grammars will have to be rewritten. See section |
| \ref{sec:Upgrading} for tips on changing Yapps 1.0 grammars for use |
| with Yapps 2.0. |
| |
| \mysection{Examples} |
| |
| In this section are several examples that show the use of Yapps. |
| First, an introduction shows how to construct grammars and write them |
| in Yapps form. This example can be skipped by someone familiar with |
| grammars and parsing. Next is a Lisp expression grammar that produces |
| a parse tree as output. This example demonstrates the use of tokens |
| and rules, as well as returning values from rules. The third example |
is an expression evaluation grammar that evaluates during parsing
| (instead of producing a parse tree). |
| |
| \mysubsection{Introduction to Grammars} |
| |
| A \emph{grammar} for a natural language specifies how words can be put |
| together to form large structures, such as phrases and sentences. A |
| grammar for a computer language is similar in that it specifies how |
| small components (called \emph{tokens}) can be put together to form |
| larger structures. In this section we will write a grammar for a tiny |
| subset of English. |
| |
| Simple English sentences can be described as being a noun phrase |
| followed by a verb followed by a noun phrase. For example, in the |
| sentence, ``Jack sank the blue ship,'' the word ``Jack'' is the first |
| noun phrase, ``sank'' is the verb, and ``the blue ship'' is the second |
| noun phrase. In addition we should say what a noun phrase is; for |
| this example we shall say that a noun phrase is an optional article |
| (a, an, the) followed by any number of adjectives followed by a noun. |
| The tokens in our language are the articles, nouns, verbs, and |
| adjectives. The \emph{rules} in our language will tell us how to |
| combine the tokens together to form lists of adjectives, noun phrases, |
| and sentences: |
| |
| \begin{itemize} |
| \item \texttt{sentence: noun\_phrase verb noun\_phrase} |
| \item \texttt{noun\_phrase: [article] adjective* noun} |
| \end{itemize} |
| |
| Notice that some things that we said easily in English, such as |
| ``optional article,'' are expressed using special syntax, such as |
| brackets. When we said, ``any number of adjectives,'' we wrote |
| \texttt{adjective*}, where the \texttt{*} means ``zero or more of the |
| preceding pattern''. |
| |
| The grammar given above is close to a Yapps grammar. We also have to |
| specify what the tokens are, and what to do when a pattern is matched. |
| For this example, we will do nothing when patterns are matched; the |
| next example will explain how to perform match actions. |
| |
| \begin{verbatim} |
| parser TinyEnglish: |
| ignore: "\\W+" |
| token noun: "(Jack|spam|ship)" |
| token verb: "(sank|threw)" |
| token article: "(a|an|the)" |
| token adjective: "(blue|red|green)" |
| |
| rule sentence: noun_phrase verb noun_phrase |
| rule noun_phrase: [article] adjective* noun |
| \end{verbatim} |
| |
| The tokens are specified as Python \emph{regular expressions}. Since |
| Yapps produces Python code, you can write any regular expression that |
| would be accepted by Python. (\emph{Note:} These are Python 1.5 |
| regular expressions from the \texttt{re} module, not Python 1.4 |
| regular expressions from the \texttt{regex} module.) In addition to |
| tokens that you want to see (which are given names), you can also |
| specify tokens to ignore, marked by the \texttt{ignore} keyword. In |
| this parser we want to ignore whitespace. |
| |
| The TinyEnglish grammar shows how you define tokens and rules, but it |
| does not specify what should happen once we've matched the rules. In |
| the next example, we will take a grammar and produce a \emph{parse |
| tree} from it. |
| |
| \mysubsection{Lisp Expressions} |
| |
| Lisp syntax, although hated by many, has a redeeming quality: it is |
| simple to parse. In this section we will construct a Yapps grammar to |
| parse Lisp expressions and produce a parse tree as output. |
| |
| \subsubsection*{Defining the Grammar} |
| |
| The syntax of Lisp is simple. It has expressions, which are |
| identifiers, strings, numbers, and lists. A list is a left |
| parenthesis followed by some number of expressions (separated by |
| spaces) followed by a right parenthesis. For example, \verb|5|, |
| \verb|"ni"|, and \verb|(print "1+2 = " (+ 1 2))| are Lisp expressions. |
| Written as a grammar, |
| |
| \begin{verbatim} |
| expr: ID | STR | NUM | list |
| list: ( expr* ) |
| \end{verbatim} |
| |
| In addition to having a grammar, we need to specify what to do every |
| time something is matched. For the tokens, which are strings, we just |
| want to get the ``value'' of the token, attach its type (identifier, |
| string, or number) in some way, and return it. For the lists, we want |
| to construct and return a Python list. |
| |
Once a pattern is matched, we add a return statement, enclosed
| in \verb|{{...}}|. The braces allow us to insert any one-line |
| statement into the parser. Within this statement, we can refer to the |
| values returned by matching each part of the rule. After matching a |
| token such as \texttt{ID}, ``ID'' will be bound to the text of the |
| matched token. Let's take a look at the rule: |
| |
| \begin{verbatim} |
| rule expr: ID {{ return ('id', ID) }} |
| ... |
| \end{verbatim} |
| |
| In a rule, tokens return the text that was matched. For identifiers, |
| we just return the identifier, along with a ``tag'' telling us that |
| this is an identifier and not a string or some other value. Sometimes |
| we may need to convert this text to a different form. For example, if |
| a string is matched, we want to remove quotes and handle special forms |
| like \verb|\n|. If a number is matched, we want to convert it into a |
| number. Let's look at the return values for the other tokens: |
| |
| \begin{verbatim} |
| ... |
| | STR {{ return ('str', eval(STR)) }} |
| | NUM {{ return ('num', atoi(NUM)) }} |
| ... |
| \end{verbatim} |
| |
| If we get a string, we want to remove the quotes and process any |
| special backslash codes, so we run \texttt{eval} on the quoted string. |
| If we get a number, we convert it to an integer with \texttt{atoi} and |
| then return the number along with its type tag. |
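
To see what these conversions do, here is a quick interactive check.
(In Python 1.5, \texttt{atoi} lives in the \texttt{string} module; the
grammar assumes it is somehow in scope, for example via pre-parser code
as described later.)

\begin{verbatim}
>>> eval('"1+2 = "')       # strips the quotes, handles backslash escapes
'1+2 = '
>>> import string
>>> string.atoi('487')     # converts the matched digits to an integer
487
\end{verbatim}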
| |
| For matching a list, we need to do something slightly more |
| complicated. If we match a Lisp list of expressions, we want to |
| create a Python list with those values. |
| |
| \begin{verbatim} |
| rule list: "\\(" # Match the opening parenthesis |
| {{ result = [] }} # Create a Python list |
| ( |
| expr # When we match an expression, |
| {{ result.append(expr) }} # add it to the list |
| )* # * means repeat this if needed |
| "\\)" # Match the closing parenthesis |
| {{ return result }} # Return the Python list |
| \end{verbatim} |
| |
| In this rule we first match the opening parenthesis, then go into a |
| loop. In this loop we match expressions and add them to the list. |
| When there are no more expressions to match, we match the closing |
parenthesis and return the resulting list.  Note that \verb:#: is used for
| comments, just as in Python. |
| |
| The complete grammar is specified as follows: |
| \begin{verbatim} |
| parser Lisp: |
| ignore: '\\s+' |
| token NUM: '[0-9]+' |
| token ID: '[-+*/!@%^&=.a-zA-Z0-9_]+' |
| token STR: '"([^\\"]+|\\\\.)*"' |
| |
| rule expr: ID {{ return ('id', ID) }} |
| | STR {{ return ('str', eval(STR)) }} |
| | NUM {{ return ('num', atoi(NUM)) }} |
| | list {{ return list }} |
| rule list: "\\(" {{ result = [] }} |
| ( expr {{ result.append(expr) }} |
| )* |
| "\\)" {{ return result }} |
| \end{verbatim} |
| |
| One thing you may have noticed is that \verb|"\\("| and \verb|"\\)"| |
| appear in the \texttt{list} rule. These are \emph{inline tokens}: |
| they appear in the rules without being given a name with the |
| \texttt{token} keyword. Inline tokens are more convenient to use, but |
| since they do not have a name, the text that is matched cannot be used |
| in the return value. They are best used for short simple patterns |
| (usually punctuation or keywords). |
| |
| Another thing to notice is that the number and identifier tokens |
| overlap. For example, ``487'' matches both NUM and ID. In Yapps, the |
| scanner only tries to match tokens that are acceptable to the parser. |
| This rule doesn't help here, since both NUM and ID can appear in the |
| same place in the grammar. There are two rules used to pick tokens if |
| more than one matches. One is that the \emph{longest} match is |
| preferred. For example, ``487x'' will match as an ID (487x) rather |
| than as a NUM (487) followed by an ID (x). The second rule is that if |
| the two matches are the same length, the \emph{first} one listed in |
the grammar is preferred. For example, ``487'' will match as a NUM
| rather than an ID because NUM is listed first in the grammar. Inline |
| tokens have preference over any tokens you have listed. |
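
Once the parser has been generated (as described in the next section),
we would expect these two rules to give the following results:

\begin{verbatim}
>>> import lisp
>>> lisp.parse('expr', '487')    # same length: NUM is listed first
('num', 487)
>>> lisp.parse('expr', '487x')   # longest match wins: all of it is an ID
('id', '487x')
\end{verbatim}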
| |
| Now that our grammar is defined, we can run Yapps to produce a parser, |
| and then run the parser to produce a parse tree. |
| |
| \subsubsection*{Running Yapps} |
| |
| In the Yapps module is a function \texttt{generate} that takes an |
| input filename and writes a parser to another file. We can use this |
| function to generate the Lisp parser, which is assumed to be in |
| \texttt{lisp.g}. |
| |
| \begin{verbatim} |
| % python |
| Python 1.5.1 (#1, Sep 3 1998, 22:51:17) [GCC 2.7.2.3] on linux-i386 |
| Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam |
| >>> import yapps |
| >>> yapps.generate('lisp.g') |
| \end{verbatim} |
| |
| At this point, Yapps has written a file \texttt{lisp.py} that contains |
| the parser. In that file are two classes (one scanner and one parser) |
| and a function (called \texttt{parse}) that puts things together for |
| you. |
| |
| Alternatively, we can run Yapps from the command line to generate the |
| parser file: |
| |
| \begin{verbatim} |
| % python yapps.py lisp.g |
| \end{verbatim} |
| |
| After running Yapps either from within Python or from the command |
| line, we can use the Lisp parser by calling the \texttt{parse} |
| function. The first parameter should be the rule we want to match, |
| and the second parameter should be the string to parse. |
| |
| \begin{verbatim} |
| >>> import lisp |
| >>> lisp.parse('expr', '(+ 3 4)') |
| [('id', '+'), ('num', 3), ('num', 4)] |
| >>> lisp.parse('expr', '(print "3 = " (+ 1 2))') |
| [('id', 'print'), ('str', '3 = '), [('id', '+'), ('num', 1), ('num', 2)]] |
| \end{verbatim} |
| |
| The \texttt{parse} function is not the only way to use the parser; |
| section \ref{sec:Parser-Objects} describes how to access parser objects |
| directly. |
| |
| We've now gone through the steps in creating a grammar, writing a |
| grammar file for Yapps, producing a parser, and using the parser. In |
| the next example we'll see how rules can take parameters and also how |
| to do computations instead of just returning a parse tree. |
| |
| \mysubsection{Calculator} |
| |
| A common example parser given in many textbooks is that for simple |
| expressions, with numbers, addition, subtraction, multiplication, |
| division, and parenthesization of subexpressions. We'll write this |
| example in Yapps, evaluating the expression as we parse. |
| |
| Unlike \texttt{yacc}, Yapps does not have any way to specify |
| precedence rules, so we have to do it ourselves. We say that an |
expression is the sum of factors, that a factor is the product of
terms, and that a term is a number or a parenthesized expression:
| |
| \begin{verbatim} |
| expr: factor ( ("+"|"-") factor )* |
| factor: term ( ("*"|"/") term )* |
| term: NUM | "(" expr ")" |
| \end{verbatim} |
| |
In order to evaluate the expression as we go, we keep an
accumulator while evaluating the lists of factors and terms.  Just as
| we kept a ``result'' variable to build a parse tree for Lisp |
| expressions, we will use a variable to evaluate numerical |
| expressions. The full grammar is given below: |
| |
| \begin{verbatim} |
| parser Calculator: |
| token END: "$" # $ means end of string |
| token NUM: "[0-9]+" |
| |
| rule goal: expr END {{ return expr }} |
| |
| # An expression is the sum and difference of factors |
| rule expr: factor {{ v = factor }} |
| ( "[+]" factor {{ v = v+factor }} |
| | "-" factor {{ v = v-factor }} |
| )* {{ return v }} |
| |
| # A factor is the product and division of terms |
| rule factor: term {{ v = term }} |
| ( "[*]" term {{ v = v*term }} |
| | "/" term {{ v = v/term }} |
| )* {{ return v }} |
| |
| # A term is either a number or an expression surrounded by parentheses |
| rule term: NUM {{ return atoi(NUM) }} |
| | "\\(" expr "\\)" {{ return expr }} |
| \end{verbatim} |
| |
| The top-level rule is \emph{goal}, which says that we are looking for |
| an expression followed by the end of the string. The \texttt{END} |
| token is needed because without it, it isn't clear when to stop |
| parsing. For example, the string ``1+3'' could be parsed either as |
| the expression ``1'' followed by the string ``+3'' or it could be |
| parsed as the expression ``1+3''. By requiring expressions to end |
| with \texttt{END}, the parser is forced to take ``1+3''. |
| |
| In the two rules with repetition, the accumulator is named \texttt{v}. |
After matching the first subexpression, we initialize the accumulator
with its value.  Each time through the loop, we update the accumulator
by adding, subtracting, multiplying, or dividing it by the
subexpression just parsed.  At the end of the rule, we
| return the accumulator. |
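
Assuming the grammar above is saved as \texttt{calc.g} and run through
Yapps to produce \texttt{calc.py}, a session might look like this.
(Note that this grammar declares no \texttt{ignore} token, so the input
contains no spaces.)

\begin{verbatim}
>>> import calc
>>> calc.parse('goal', '1+2*3')
7
>>> calc.parse('goal', '(1+2)*3')
9
\end{verbatim}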
| |
| The calculator example shows how to process lists of elements using |
| loops, as well as how to handle precedence of operators. |
| |
| \emph{Note:} It's often important to put the \texttt{END} token in, so |
| put it in unless you are sure that your grammar has some other |
unambiguous token marking the end of the program.
| |
| \mysubsection{Calculator with Memory} |
| |
| In the previous example we learned how to write a calculator that |
| evaluates simple numerical expressions. In this section we will |
| extend the example to support both local and global variables. |
| |
| To support global variables, we will add assignment statements to the |
| ``goal'' rule. |
| |
| \begin{verbatim} |
| rule goal: expr END {{ return expr }} |
| | 'set' ID expr END {{ global_vars[ID] = expr }} |
| {{ return expr }} |
| \end{verbatim} |
| |
To use these variables, we add a new alternative to the \texttt{term} rule:
| |
| \begin{verbatim} |
| rule term: ... | ID {{ return global_vars[ID] }} |
| \end{verbatim} |
| |
| So far, these changes are straightforward. We simply have a global |
| dictionary \texttt{global\_vars} that stores the variables and values, |
| we modify it when there is an assignment statement, and we look up |
| variables in it when we see a variable name. |
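
Assuming the extended grammar is generated into a module (called
\texttt{calc2} here purely for illustration) and that, unlike the bare
calculator above, it ignores whitespace, a session might look like:

\begin{verbatim}
>>> import calc2
>>> calc2.parse('goal', 'set x 10')
10
>>> calc2.parse('goal', 'x+x')
20
\end{verbatim}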
| |
| To support local variables, we will add variable declarations to the |
| set of allowed expressions. |
| |
| \begin{verbatim} |
| rule term: ... | 'let' VAR '=' expr 'in' expr ... |
| \end{verbatim} |
| |
| This is where it becomes tricky. Local variables should be stored in |
| a local dictionary, not in the global one. One trick would be to save |
| a copy of the global dictionary, modify it, and then restore it |
| later. In this example we will instead use \emph{attributes} to |
| create local information and pass it to subrules. |
| |
| A rule can optionally take parameters. When we invoke the rule, we |
| must pass in arguments. For local variables, let's use a single |
| parameter, \texttt{local\_vars}: |
| |
| \begin{verbatim} |
| rule expr<<local_vars>>: ... |
| rule factor<<local_vars>>: ... |
| rule term<<local_vars>>: ... |
| \end{verbatim} |
| |
| Each time we want to match \texttt{expr}, \texttt{factor}, or |
| \texttt{term}, we will pass the local variables in the current rule to |
| the subrule. One interesting case is when we pass as an argument |
| something \emph{other} than \texttt{local\_vars}: |
| |
| \begin{verbatim} |
| rule term<<local_vars>>: ... |
| | 'let' VAR '=' expr<<local_vars>> |
| {{ local_vars = [(VAR, expr)] + local_vars }} |
| 'in' expr<<local_vars>> |
| {{ return expr }} |
| \end{verbatim} |
| |
| Note that the assignment to the local variables list does not modify |
| the original list. This is important to keep local variables from |
| being seen outside the ``let''. |
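
This works because \texttt{+} on Python lists builds a brand-new list
rather than modifying either operand; for example:

\begin{verbatim}
>>> outer = [('y', 2)]          # the list passed in by the enclosing rule
>>> inner = [('x', 5)] + outer  # what the 'let' action builds
>>> inner
[('x', 5), ('y', 2)]
>>> outer                       # unchanged, so 'x' is invisible after the 'let'
[('y', 2)]
\end{verbatim}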
| |
| The other interesting case is when we find a variable: |
| |
| \begin{verbatim} |
| global_vars = {} |
| |
def lookup(map, name):
    for x, v in map:
        if x == name: return v
    return global_vars[name]
| %% |
| ... |
rule term<<local_vars>>: ...
| | VAR {{ return lookup(local_vars, VAR) }} |
| \end{verbatim} |
| |
| The lookup function will search through the local variable list, and |
| if it cannot find the name there, it will look it up in the global |
| variable dictionary. |
| |
| A complete grammar for this example, including a read-eval-print loop |
| for interacting with the calculator, can be found in the examples |
| subdirectory included with Yapps. |
| |
| In this section we saw how to insert code before the parser. We also |
| saw how to use attributes to transmit local information from one rule |
| to its subrules. |
| |
| \mysection{Grammars} |
| |
| Each Yapps grammar has a name, a list of tokens, and a set of |
| production rules. A grammar named \texttt{X} will be used to produce |
a parser named \texttt{X} and a scanner named \texttt{XScanner}.  As
| in Python, names are case sensitive, start with a letter, and contain |
| letters, numbers, and underscores (\_). |
| |
| There are three kinds of tokens in Yapps: named, inline, and ignored. |
| As their name implies, named tokens are given a name, using the token |
| construct: \texttt{token \emph{name} : \emph{regexp}}. In a rule, the |
| token can be matched by using the name. Inline tokens are regular |
| expressions that are used in rules without being declared. Ignored |
| tokens are declared using the ignore construct: \texttt{ignore: |
| \emph{regexp}}. These tokens are ignored by the scanner, and are |
| not seen by the parser. Often whitespace is an ignored token. The |
| regular expressions used to define tokens should use the syntax |
| defined in the \texttt{re} module, so some symbols may have to be |
| backslashed. |
| |
| Production rules in Yapps have a name and a pattern to match. If the |
| rule is parameterized, the name should be followed by a list of |
| parameter names in \verb|<<...>>|. A pattern can be a simple pattern |
| or a compound pattern. Simple patterns are the name of a named token, |
| a regular expression in quotes (inline token), the name of a |
| production rule (followed by arguments in \verb|<<...>>|, if the rule |
| has parameters), and single line Python statements (\verb|{{...}}|). |
| Compound patterns are sequences (\verb|A B C ...|), choices ( |
| \verb:A | B | C | ...:), options (\verb|[...]|), zero-or-more repetitions |
| (\verb|...*|), and one-or-more repetitions (\verb|...+|). Like |
| regular expressions, repetition operators have a higher precedence |
| than sequences, and sequences have a higher precedence than choices. |
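
For example, in this hypothetical rule (the token names are made up),
the pattern groups as shown in the comment:

\begin{verbatim}
rule demo: A B* | C D
           # read as ( A ( B )* ) | ( C D ): the repetition binds
           # tighter than the sequence, and the sequence binds
           # tighter than the choice
\end{verbatim}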
| |
| Whenever \verb|{{...}}| is used, a legal one-line Python statement |
| should be put inside the braces. The token \verb|}}| should not |
| appear within the \verb|{{...}}| section, even within a string, since |
| Yapps does not attempt to parse the Python statement. A workaround |
| for strings is to put two strings together (\verb|"}" "}"|), or to use |
| backslashes (\verb|"}\}"|). At the end of a rule you should use a |
| \verb|{{ return X }}| statement to return a value. However, you |
| should \emph{not} use any control statements (\texttt{return}, |
| \texttt{continue}, \texttt{break}) in the middle of a rule. Yapps |
| needs to make assumptions about the control flow to generate a parser, |
| and any changes to the control flow will confuse Yapps. |
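
For instance, an action that needs to print a literal pair of closing
braces can rely on Python's adjacent-string concatenation (\texttt{R}
and \texttt{A} are hypothetical names here):

\begin{verbatim}
rule R: A {{ print "}" "}" }}   # the statement prints two close braces
\end{verbatim}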
| |
| The \verb|<<...>>| form can occur in two places: to define parameters |
| to a rule and to give arguments when matching a rule. Parameters use |
| the syntax used for Python functions, so they can include default |
| arguments and the special forms (\verb|*args| and \verb|**kwargs|). |
| Arguments use the syntax for Python function call arguments, so they |
| can include normal arguments and keyword arguments. The token |
| \verb|>>| should not appear within the \verb|<<...>>| section. |
| |
| In both the statements and rule arguments, you can use names defined |
| by the parser to refer to matched patterns. You can refer to the text |
| matched by a named token by using the token name. You can use the |
| value returned by a production rule by using the name of that rule. |
| If a name \texttt{X} is matched more than once (such as in loops), you |
| will have to save the earlier value(s) in a temporary variable, and |
| then use that temporary variable in the return value. The next |
| section has an example of a name that occurs more than once. |
| |
| \mysubsection{Left Factoring} |
| \label{sec:Left-Factoring} |
| |
| Yapps produces ELL(1) parsers, which determine which clause to match |
| based on the first token available. Sometimes the leftmost tokens of |
| several clauses may be the same. The classic example is the |
| \emph{if/then/else} construct in Pascal: |
| |
| \begin{verbatim} |
| rule stmt: "if" expr "then" stmt {{ then_part = stmt }} |
| "else" stmt {{ return ('If',expr,then_part,stmt) }} |
| | "if" expr "then" stmt {{ return ('If',expr,stmt,[]) }} |
| \end{verbatim} |
| |
| (Note that we have to save the first \texttt{stmt} into a variable |
| because there is another \texttt{stmt} that will be matched.) The |
| left portions of the two clauses are the same, which presents a |
| problem for the parser. The solution is \emph{left-factoring}: the |
| common parts are put together, and \emph{then} a choice is made about |
| the remaining part: |
| |
| \begin{verbatim} |
| rule stmt: "if" expr |
| "then" stmt {{ then_part = stmt }} |
| {{ else_part = [] }} |
| [ "else" stmt {{ else_part = stmt }} ] |
| {{ return ('If', expr, then_part, else_part) }} |
| \end{verbatim} |
| |
| Unfortunately, the classic \emph{if/then/else} situation is |
| \emph{still} ambiguous when you left-factor. Yapps can deal with this |
| situation, but will report a warning; see section |
| \ref{sec:Ambiguous-Grammars} for details. |
| |
| In general, replace rules of the form: |
| |
| \begin{verbatim} |
| rule A: a b1 {{ return E1 }} |
| | a b2 {{ return E2 }} |
| | c3 {{ return E3 }} |
| | c4 {{ return E4 }} |
| \end{verbatim} |
| |
| with rules of the form: |
| |
| \begin{verbatim} |
| rule A: a ( b1 {{ return E1 }} |
| | b2 {{ return E2 }} |
| ) |
| | c3 {{ return E3 }} |
| | c4 {{ return E4 }} |
| \end{verbatim} |
| |
| \mysubsection{Left Recursion} |
| |
A common construct in grammars is matching a list of patterns,
sometimes separated by delimiters such as commas or semicolons.  In
| LR-based parser systems, we can parse a list with something like this: |
| |
| \begin{verbatim} |
| rule sum: NUM {{ return NUM }} |
| | sum "+" NUM {{ return (sum, NUM) }} |
| \end{verbatim} |
| |
| Parsing \texttt{1+2+3+4} would produce the output |
| \texttt{(((1,2),3),4)}, which is what we want from a left-associative |
| addition operator. Unfortunately, this grammar is \emph{left |
| recursive,} because the \texttt{sum} rule contains a clause that |
| begins with \texttt{sum}. (The recursion occurs at the left side of |
| the clause.) |
| |
| We must restructure this grammar to be \emph{right recursive} instead: |
| |
| \begin{verbatim} |
| rule sum: NUM {{ return NUM }} |
| | NUM "+" sum {{ return (NUM, sum) }} |
| \end{verbatim} |
| |
| Unfortunately, using this grammar, \texttt{1+2+3+4} would be parsed as |
| \texttt{(1,(2,(3,4)))}, which no longer follows left associativity. |
The rule also needs to be left-factored.  Instead, we write the
pattern as a loop:
| |
| \begin{verbatim} |
| rule sum: NUM {{ v = NUM }} |
| ( "[+]" NUM {{ v = (v,NUM) }} )* |
| {{ return v }} |
| \end{verbatim} |
| |
| In general, replace rules of the form: |
| |
| \begin{verbatim} |
rule A: A a1 {{ return E1 }}
      | A a2 {{ return E2 }}
      | b3   {{ return E3 }}
      | b4   {{ return E4 }}
| \end{verbatim} |
| |
| with rules of the form: |
| |
| \begin{verbatim} |
| rule A: ( b3 {{ A = E3 }} |
| | b4 {{ A = E4 }} ) |
| ( a1 {{ A = E1 }} |
| | a2 {{ A = E2 }} )* |
| {{ return A }} |
| \end{verbatim} |
| |
We have taken a rule that proved problematic with recursion and
| turned it into a rule that works well with looping constructs. |
| |
| \mysubsection{Ambiguous Grammars} |
| \label{sec:Ambiguous-Grammars} |
| |
| In section \ref{sec:Left-Factoring} we saw the classic if/then/else |
| ambiguity, which occurs because the ``else \ldots'' portion of an ``if |
| \ldots then \ldots else \ldots'' construct is optional. Programs with |
| nested if/then/else constructs can be ambiguous when one of the else |
| clauses is missing: |
| \begin{verbatim} |
| if 1 then if 1 then |
| if 5 then if 5 then |
| x := 1; x := 1; |
| else else |
| y := 9; y := 9; |
| \end{verbatim} |
| |
| The indentation shows that the program can be parsed in two different |
| ways. (Of course, if we all would adopt Python's indentation-based |
| structuring, this would never happen!) Usually we want the parsing on |
| the left: the ``else'' should be associated with the closest ``if'' |
| statement. In section \ref{sec:Left-Factoring} we ``solved'' the |
| problem by using the following grammar: |
| |
| \begin{verbatim} |
| rule stmt: "if" expr |
| "then" stmt {{ then_part = stmt }} |
| {{ else_part = [] }} |
| [ "else" stmt {{ else_part = stmt }} ] |
| {{ return ('If', expr, then_part, else_part) }} |
| \end{verbatim} |
| |
| Here, we have an optional match of ``else'' followed by a statement. |
| The ambiguity is that if an ``else'' is present, it is not clear |
| whether you want it parsed immediately or if you want it to be parsed |
| by the outer ``if''. |
| |
Yapps will deal with the situation by matching the ``else'' pattern
| when it can. The parser will work in this case because it prefers the |
| \emph{first} matching clause, which tells Yapps to parse the ``else''. |
| That is exactly what we want! |
| |
| For ambiguity cases with choices, Yapps will choose the \emph{first} |
| matching choice. However, remember that Yapps only looks at the first |
| token to determine its decision, so {\tt (a b | a c)} will result in |
| Yapps choosing {\tt a b} even when the input is {\tt a c}. It only |
| looks at the first token, {\tt a}, to make its decision. |
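
The usual fix is the left factoring of section
\ref{sec:Left-Factoring}: match the common {\tt a} once and delay the
choice until the distinguishing token is seen, as in this schematic
rule ({\tt a}, {\tt b}, {\tt c}, {\tt E1}, and {\tt E2} are
placeholders):

\begin{verbatim}
rule R: a ( b {{ return E1 }}
          | c {{ return E2 }} )
\end{verbatim}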
| |
| \mysection{Customization} |
| |
| Both the parsers and the scanners can be customized. The parser is |
| usually extended by subclassing, and the scanner can either be |
| subclassed or completely replaced. |
| |
| \mysubsection{Customizing Parsers} |
| |
| If additional fields and methods are needed in order for a parser to |
| work, Python subclassing can be used. (This is unlike parser classes |
| written in static languages, in which these fields and methods must be |
| defined in the generated parser class.) We simply subclass the |
| generated parser, and add any fields or methods required. Expressions |
| in the grammar can call methods of the subclass to perform any actions |
| that cannot be expressed as a simple expression. For example, |
| consider this simple grammar: |
| |
| \begin{verbatim} |
| parser X: |
| rule goal: "something" {{ self.printmsg() }} |
| \end{verbatim} |
| |
| The \texttt{printmsg} function need not be implemented in the parser |
| class \texttt{X}; it can be implemented in a subclass: |
| |
| \begin{verbatim} |
| import Xparser |
| |
| class MyX(Xparser.X): |
| def printmsg(self): |
| print "Hello!" |
| \end{verbatim} |
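
An instance of the subclass is then used like any other parser object
(see section \ref{sec:Parser-Objects}); for example, calling the rule
method directly:

\begin{verbatim}
P = MyX(Xparser.XScanner("something"))
P.goal()   # matches "something", then runs self.printmsg(), printing "Hello!"
\end{verbatim}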
| |
| \mysubsection{Customizing Scanners} |
| |
| The generated parser class is not dependent on the generated scanner |
| class. A scanner object is passed to the parser object's constructor |
| in the \texttt{parse} function. To use a different scanner, write |
| your own function to construct parser objects, with an instance of a |
| different scanner. Scanner objects must have a \texttt{token} method |
| that accepts an integer \texttt{N} as well as a list of allowed token |
| types, and returns the Nth token, as a tuple. The default scanner |
| raises \texttt{NoMoreTokens} if no tokens are available, and |
| \texttt{SyntaxError} if no token could be matched. However, the |
| parser does not rely on these exceptions; only the \texttt{parse} |
| convenience function (which calls \texttt{wrap\_error\_reporter}) and |
| the \texttt{print\_error} error display function use those exceptions. |
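
As a minimal sketch (not the scanner Yapps generates), a replacement
scanner that serves tokens from a list computed ahead of time could
look like this; the tuple format it returns is described next:

\begin{verbatim}
class ListScanner:
    # A do-it-yourself scanner: hands out precomputed token tuples and
    # ignores the list of allowed token types, since the tokens are
    # already decided.  Running off the end raises IndexError; the
    # generated parser itself does not depend on a particular exception.
    def __init__(self, tokens):
        self.tokens = tokens   # list of (begin, end, type, value) tuples
    def token(self, i, restrict=None):
        return self.tokens[i]
\end{verbatim}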
| |
| The tuples representing tokens have four elements. The first two are |
| the beginning and ending indices of the matched text in the input |
| string. The third element is the type tag, matching either the name |
| of a named token or the quoted regexp of an inline or ignored token. |
| The fourth element of the token tuple is the matched text. If the |
| input string is \texttt{s}, and the token tuple is |
| \texttt{(b,e,type,val)}, then \texttt{val} should be equal to |
| \texttt{s[b:e]}. |
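
For instance, here is a made-up but representative token tuple for a
named token type \texttt{ID}:

\begin{verbatim}
s = "shell=progman.exe"
b, e, type, val = (0, 5, 'ID', 'shell')   # a hypothetical token tuple
assert val == s[b:e]                      # the value is exactly the matched slice
\end{verbatim}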
| |
The generated parsers do not use the beginning or ending index.  They use
| only the token type and value. However, the default error reporter |
| uses the beginning and ending index to show the user where the error |
| is. |
| |
| \mysection{Parser Mechanics} |
| |
| The base parser class (Parser) defines two methods, \texttt{\_scan} |
| and \texttt{\_peek}, and two fields, \texttt{\_pos} and |
| \texttt{\_scanner}. The generated parser inherits from the base |
| parser, and contains one method for each rule in the grammar. To |
| avoid name clashes, do not use names that begin with an underscore |
| (\texttt{\_}). |
| |
| \mysubsection{Parser Objects} |
| \label{sec:Parser-Objects} |
| |
| Yapps produces as output two exception classes, a scanner class, a |
| parser class, and a function \texttt{parse} that puts everything |
| together. The \texttt{parse} function does not have to be used; |
| instead, one can create a parser and scanner object and use them |
| together for parsing. |
| |
| \begin{verbatim} |
| def parse(rule, text): |
| P = X(XScanner(text)) |
| return wrap_error_reporter(P, rule) |
| \end{verbatim} |
| |
| The \texttt{parse} function takes a name of a rule and an input string |
| as input. It creates a scanner and parser object, then calls |
| \texttt{wrap\_error\_reporter} to execute the method in the parser |
| object named \texttt{rule}. The wrapper function will call the |
| appropriate parser rule and report any parsing errors to standard |
| output. |
| |
| There are several situations in which the \texttt{parse} function |
| would not be useful. If a different parser or scanner is being used, |
| or exceptions are to be handled differently, a new \texttt{parse} |
| function would be required. The supplied \texttt{parse} function can |
| be used as a template for writing a function for your own needs. An |
| example of a custom parse function is the \texttt{generate} function |
| in \texttt{Yapps.py}. |
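
For example, here is a sketch of a variant that reads its input from a
file rather than a string (it reuses the generated names \texttt{X} and
\texttt{XScanner} from the \texttt{parse} function above):

\begin{verbatim}
def parse_file(rule, filename):
    # The default scanner works on strings, so read the whole file
    # first, then reuse the same machinery that parse() uses.
    text = open(filename).read()
    P = X(XScanner(text))
    return wrap_error_reporter(P, rule)
\end{verbatim}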
| |
| \mysubsection{Context Sensitive Scanner} |
| |
| Unlike most scanners, the scanner produced by Yapps can take into |
| account the context in which tokens are needed, and try to match only |
| good tokens. For example, in the grammar: |
| |
| \begin{verbatim} |
| parser IniFile: |
| token ID: "[a-zA-Z_0-9]+" |
| token VAL: ".*" |
| |
| rule pair: ID "[ \t]*=[ \t]*" VAL "\n" |
| \end{verbatim} |
| |
| we would like to scan lines of text and pick out a name/value pair. |
| In a conventional scanner, the input string \texttt{shell=progman.exe} |
| would be turned into a single token of type \texttt{VAL}. The Yapps |
| scanner, however, knows that at the beginning of the line, an |
| \texttt{ID} is expected, so it will return \texttt{"shell"} as a token |
| of type \texttt{ID}. Later, it will return \texttt{"progman.exe"} as |
| a token of type \texttt{VAL}. |
| |
| Context sensitivity decreases the separation between scanner and |
| parser, but it is useful in parsers like \texttt{IniFile}, where the |
tokens themselves are ambiguous in isolation, but \emph{are} unambiguous
| given a particular stage in the parsing process. |
| |
| Unfortunately, context sensitivity can make it more difficult to |
| detect errors in the input. For example, in parsing a Pascal-like |
| language with ``begin'' and ``end'' as keywords, a context sensitive |
| scanner would only match ``end'' as the END token if the parser is in |
| a place that will accept the END token. If not, then the scanner |
| would match ``end'' as an identifier. To disable the context |
| sensitive scanner in Yapps, add the |
| \texttt{context-insensitive-scanner} option to the grammar: |
| |
| \begin{verbatim} |
parser X:
| option: "context-insensitive-scanner" |
| \end{verbatim} |
| |
| Context-insensitive scanning makes the parser look cleaner as well. |
| |
| \mysubsection{Internal Variables} |
| |
| There are two internal fields that may be of use. The parser object |
| has two fields, \texttt{\_pos}, which is the index of the current |
| token being matched, and \texttt{\_scanner}, which is the scanner |
| object. The token itself can be retrieved by accessing the scanner |
| object and calling the \texttt{token} method with the token index. However, if you call \texttt{token} before the token has been requested by the parser, it may mess up a context-sensitive scanner.\footnote{When using a context-sensitive scanner, the parser tells the scanner what the valid token types are at each point. If you call \texttt{token} before the parser can tell the scanner the valid token types, the scanner will attempt to match without considering the context.} A |
| potentially useful combination of these fields is to extract the |
| portion of the input matched by the current rule. To do this, just save the scanner state (\texttt{\_scanner.pos}) before the text is matched and then again after the text is matched: |
| |
| \begin{verbatim} |
| rule R: |
| {{ start = self._scanner.pos }} |
| a b c |
| {{ end = self._scanner.pos }} |
| {{ print 'Text is', self._scanner.input[start:end] }} |
| \end{verbatim} |
| |
| \mysubsection{Pre- and Post-Parser Code} |
| |
| Sometimes the parser code needs to rely on helper variables, |
| functions, and classes. A Yapps grammar can optionally be surrounded |
| by double percent signs, to separate the grammar from Python code. |
| |
| \begin{verbatim} |
| ... Python code ... |
| %% |
| ... Yapps grammar ... |
| %% |
| ... Python code ... |
| \end{verbatim} |
| |
| The second \verb|%%| can be omitted if there is no Python code at the |
| end, and the first \verb|%%| can be omitted if there is no extra |
| Python code at all. (To have code only at the end, both separators |
| are required.) |
| |
| If the second \verb|%%| is omitted, Yapps will insert testing code |
| that allows you to use the generated parser to parse a file. |
| |
| The extended calculator example in the Yapps examples subdirectory |
| includes both pre-parser and post-parser code. |
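
Schematically, a grammar file with both sections might be laid out like
this (the tiny grammar and the helper import are made up for
illustration):

\begin{verbatim}
from string import atoi          # helper available to the {{...}} actions
%%
parser Numbers:
    token NUM: "[0-9]+"
    rule num: NUM {{ return atoi(NUM) }}
%%
if __name__ == '__main__':
    print parse('num', '42')     # parse() comes from the generated parser
\end{verbatim}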
| |
| \mysubsection{Representation of Grammars} |
| |
| For each kind of pattern there is a class derived from Pattern. Yapps |
| has classes for Terminal, NonTerminal, Sequence, Choice, Option, Plus, |
| Star, and Eval. Each of these classes has the following interface: |
| |
| \begin{itemize} |
| \item[setup(\emph{gen})] Set accepts-$\epsilon$, and call |
| \emph{gen.changed()} if it changed. This function can change the |
| flag from false to true but \emph{not} from true to false. |
\item[update(\emph{gen})] Set \first and \follow, and call
| \emph{gen.changed()} if either changed. This function can add to |
| the sets but \emph{not} remove from them. |
| \item[output(\emph{gen}, \emph{indent})] Generate code for matching |
| this rule, using \emph{indent} as the current indentation level. |
| Writes are performed using \emph{gen.write}. |
| \item[used(\emph{vars})] Given a list of variables \emph{vars}, |
| return two lists: one containing the variables that are used, and |
| one containing the variables that are assigned. This function is |
| used for optimizing the resulting code. |
| \end{itemize} |
| |
| Both \emph{setup} and \emph{update} monotonically increase the |
| variables they modify. Since the variables can only increase a finite |
number of times, we can repeatedly call the functions until the
variables stabilize.  The \emph{used} function is not currently
| implemented. |
| |
| With each pattern in the grammar Yapps associates three pieces of |
| information: the \first set, the \follow set, and the |
| accepts-$\epsilon$ flag. |
| |
| The \first set contains the tokens that can appear as we start |
| matching the pattern. The \follow set contains the tokens that can |
| appear immediately after we match the pattern. The accepts-$\epsilon$ |
| flag is true if the pattern can match no tokens. In this case, \first |
| will contain all the elements in \follow. The \follow set is not |
| needed when accepts-$\epsilon$ is false, and may not be accurate in |
| those cases. |
| |
| Yapps does not compute these sets precisely. Its approximation can |
| miss certain cases, such as this one: |
| |
| \begin{verbatim} |
| rule C: ( A* | B ) |
| rule B: C [A] |
| \end{verbatim} |
| |
| Yapps will calculate {\tt C}'s \follow set to include {\tt A}. |
| However, {\tt C} will always match all the {\tt A}'s, so {\tt A} will |
| never follow it. Yapps 2.0 does not properly handle this construct, |
| but if it seems important, I may add support for it in a future |
| version. |
| |
| Yapps also cannot handle constructs that depend on the calling |
| sequence. For example: |
| |
| \begin{verbatim} |
| rule R: U | 'b' |
| rule S: | 'c' |
| rule T: S 'b' |
| rule U: S 'a' |
| \end{verbatim} |
| |
| The \follow set for {\tt S} includes {\tt a} and {\tt b}. Since {\tt |
| S} can be empty, the \first set for {\tt S} should include {\tt a}, |
| {\tt b}, and {\tt c}. However, when parsing {\tt R}, if the lookahead |
| is {\tt b} we should \emph{not} parse {\tt U}. That's because in {\tt |
| U}, {\tt S} is followed by {\tt a} and not {\tt b}. Therefore in |
| {\tt R}, we should choose rule {\tt U} only if there is an {\tt a} or |
| {\tt c}, but not if there is a {\tt b}. Yapps and many other LL(1) |
| systems do not distinguish {\tt S b} and {\tt S a}, making {\tt |
| S}'s \follow set {\tt a, b}, and making {\tt R} always try to match |
| {\tt U}. In this case we can solve the problem by changing {\tt R} to |
| \verb:'b' | U: but it may not always be possible to solve all such |
| problems in this way. |
| |
| \appendix |
| |
| \mysection{Grammar for Parsers} |
| |
| This is the grammar for parsers, without any Python code mixed in. |
| The complete grammar can be found in \texttt{parsedesc.g} in the Yapps |
| distribution. |
| |
| \begin{verbatim} |
| parser ParserDescription: |
| ignore: "\\s+" |
| ignore: "#.*?\r?\n" |
| token END: "$" # $ means end of string |
| token ATTR: "<<.+?>>" |
| token STMT: "{{.+?}}" |
| token ID: '[a-zA-Z_][a-zA-Z_0-9]*' |
| token STR: '[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"' |
| |
| rule Parser: "parser" ID ":" |
| Options |
| Tokens |
| Rules |
| END |
| |
| rule Options: ( "option" ":" STR )* |
| rule Tokens: ( "token" ID ":" STR | "ignore" ":" STR )* |
| rule Rules: ( "rule" ID OptParam ":" ClauseA )* |
| |
| rule ClauseA: ClauseB ( '[|]' ClauseB )* |
| rule ClauseB: ClauseC* |
| rule ClauseC: ClauseD [ '[+]' | '[*]' ] |
| rule ClauseD: STR | ID [ATTR] | STMT |
| | '\\(' ClauseA '\\) | '\\[' ClauseA '\\]' |
| \end{verbatim} |
| |
\mysection{Upgrading}
\label{sec:Upgrading}
| |
| Yapps 2.0 is not backwards compatible with Yapps 1.0. In this section |
| are some tips for upgrading: |
| |
| \begin{enumerate} |
| \item Yapps 1.0 was distributed as a single file. Yapps 2.0 is |
| instead distributed as two Python files: a \emph{parser generator} |
| (26k) and a \emph{parser runtime} (5k). You need both files to |
| create parsers, but you need only the runtime (\texttt{yappsrt.py}) |
| to use the parsers. |
| |
| \item Yapps 1.0 supported Python 1.4 regular expressions from the |
| \texttt{regex} module. Yapps 2.0 uses Python 1.5 regular |
| expressions from the \texttt{re} module. \emph{The new syntax for |
| regular expressions is not compatible with the old syntax.} |
| Andrew Kuchling has a \htmladdnormallink{guide to converting |
| regular |
| expressions}{http://www.python.org/doc/howto/regex-to-re/} on his |
| web page. |
| |
| \item Yapps 1.0 wants a pattern and then a return value in \verb|->| |
| \verb|<<...>>|. Yapps 2.0 allows patterns and Python statements to |
| be mixed. To convert a rule like this: |
| |
| \begin{verbatim} |
| rule R: A B C -> << E1 >> |
| | X Y Z -> << E2 >> |
| \end{verbatim} |
| |
| to Yapps 2.0 form, replace the return value specifiers with return |
| statements: |
| |
| \begin{verbatim} |
| rule R: A B C {{ return E1 }} |
| | X Y Z {{ return E2 }} |
| \end{verbatim} |
| |
| \item Yapps 2.0 does not perform tail recursion elimination. This |
| means any recursive rules you write will be turned into recursive |
| methods in the parser. The parser will work, but may be slower. |
| It can be made faster by rewriting recursive rules, using instead |
| the looping operators \verb|*| and \verb|+| provided in Yapps 2.0. |
| |
| \end{enumerate} |
| |
| \mysection{Troubleshooting} |
| |
| \begin{itemize} |
| \item A common error is to write a grammar that doesn't have an END |
| token. End tokens are needed when it is not clear when to stop |
| parsing. For example, when parsing the expression {\tt 3+5}, it is |
| not clear after reading {\tt 3} whether to treat it as a complete |
| expression or whether the parser should continue reading. |
| Therefore the grammar for numeric expressions should include an end |
| token. Another example is the grammar for Lisp expressions. In |
| Lisp, it is always clear when you should stop parsing, so you do |
| \emph{not} need an end token. In fact, it may be more useful not |
| to have an end token, so that you can read in several Lisp expressions. |
| \item If there is a chance of ambiguity, make sure to put the choices |
| in the order you want them checked. Usually the most specific |
| choice should be first. Empty sequences should usually be last. |
| \item The context sensitive scanner is not appropriate for all |
grammars.  You might try using the context-insensitive scanner with the
| {\tt context-insensitive-scanner} option in the grammar. |
| \item If performance turns out to be a problem, try writing a custom |
| scanner. The Yapps scanner is rather slow (but flexible and easy |
| to understand). |
| \end{itemize} |
| |
| \mysection{History} |
| |
| Yapps 1 had several limitations that bothered me while writing |
| parsers: |
| |
| \begin{enumerate} |
| \item It was not possible to insert statements into the generated |
parser.  A common workaround was to write an auxiliary function
| that executed those statements, and to call that function as part |
| of the return value calculation. For example, several of my |
| parsers had an ``append(x,y)'' function that existed solely to call |
| ``x.append(y)''. |
| \item The way in which grammars were specified was rather |
| restrictive: a rule was a choice of clauses. Each clause was a |
| sequence of tokens and rule names, followed by a return value. |
| \item Optional matching had to be put into a separate rule because |
| choices were only made at the beginning of a rule. |
| \item Repetition had to be specified in terms of recursion. Not only |
| was this awkward (sometimes requiring additional rules), I had to |
| add a tail recursion optimization to Yapps to transform the |
| recursion back into a loop. |
| \end{enumerate} |
| |
| Yapps 2 addresses each of these limitations. |
| |
| \begin{enumerate} |
| \item Statements can occur anywhere within a rule. (However, only |
| one-line statements are allowed; multiline blocks marked by |
| indentation are not.) |
| \item Grammars can be specified using any mix of sequences, choices, |
| tokens, and rule names. To allow for complex structures, |
| parentheses can be used for grouping. |
| \item Given choices and parenthesization, optional matching can be |
| expressed as a choice between some pattern and nothing. In |
| addition, Yapps 2 has the convenience syntax \verb|[A B ...]| for |
| matching \verb|A B ...| optionally. |
| \item Repetition operators \verb|*| for zero or more and \verb|+| for |
| one or more make it easy to specify repeating patterns. |
| \end{enumerate} |
| |
| It is my hope that Yapps 2 will be flexible enough to meet my needs |
| for another year, yet simple enough that I do not hesitate to use it. |
| |
| \mysection{Future Extensions} |
| \label{sec:future} |
| |
| I am still investigating the possibility of LL(2) and higher |
| lookahead. However, it looks like the resulting parsers will be |
| somewhat ugly. |
| |
| It would be nice to control choices with user-defined predicates. |
| |
| The most likely future extension is backtracking. A grammar pattern |
| like \verb|(VAR ':=' expr)? {{ return Assign(VAR,expr) }} : expr {{ return expr }}| |
| would turn into code that attempted to match \verb|VAR ':=' expr|. If |
| it succeeded, it would run \verb|{{ return ... }}|. If it failed, it |
| would match \verb|expr {{ return expr }}|. Backtracking may make it |
| less necessary to write LL(2) grammars. |
| |
| \mysection{References} |
| |
| \begin{enumerate} |
| \item The \htmladdnormallink{Python-Parser |
| SIG}{http://www.python.org/sigs/parser-sig/} is the first place |
| to look for a list of parser systems for Python. |
| |
| \item ANTLR/PCCTS, by Terrence Parr, is available at |
| \htmladdnormallink{The ANTLR Home Page}{http://www.antlr.org/}. |
| |
| \item PyLR, by Scott Cotton, is at \htmladdnormallink{his Starship |
| page}{http://starship.skyport.net/crew/scott/PyLR.html}. |
| |
| \item John Aycock's \htmladdnormallink{Compiling Little Languages |
| Framework}{http://www.csr.UVic.CA/~aycock/python/}. |
| |
| \item PyBison, by Scott Hassan, can be found at |
| \htmladdnormallink{his Python Projects |
| page}{http://coho.stanford.edu/\~{}hassan/Python/}. |
| |
| \item mcf.pars, by Mike C. Fletcher, is available at |
| \htmladdnormallink{his web |
| page}{http://www.golden.net/\~{}mcfletch/programming/}. |
| |
| \item kwParsing, by Aaron Watters, is available at |
| \htmladdnormallink{his Starship |
| page}{http://starship.skyport.net/crew/aaron_watters/kwParsing/}. |
| \end{enumerate} |
| |
| \end{document} |