Friday, October 2, 2009

Top versus Bottom

Last time, I posted on multiple inheritance, and on "class"-ifying the Slam code in general. Sadly, that is a "rat-hole" that I've gone down, and other than "still working on it" there hasn't been much to post. :-(

So I thought I'd dig out this little gem of a topic that I've had sitting in reserve: the difference between top-down and bottom-up parsing, and what it means to you.



Top-down parsing
First, top-down parsing is what you're doing when you write code in PGE, in Perl 6 rules, in ANTLR, or when you just write the parser yourself by hand. In a top-down parser, you describe structures in terms of their sub-components:
  • A program is a sequence of zero or more statements.
  • A statement is an expression followed by a semicolon.
These rules turn into code like:
sub program() {
    my @program := Array::empty();
    repeat {
        @program.push(statement());
    } until eof();
    return @program;
}

sub statement() {
    my $statement := expression();
    match( ';' );
    return $statement;
}
This is a very intuitive way to write programs, and if you know what your language looks like in advance, it can be really, really fast to code. (If you're stuck wandering in a maze of features, like me, then maybe not so much.)
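Notice that the sample above calls expression() and match() without ever defining them. For the curious, here's a minimal, self-contained sketch in C of how those helpers usually look in a hand-written top-down parser. The canned token array, ident(), and is_op() are stand-ins of my own invention, not Slam code:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A canned token stream stands in for a real lexer; all of the
 * names here are mine, for illustration - this is not Slam code. */
static const char *toks[] = { "y", "=", "m", "*", "x", "+", "b", ";", NULL };
static int pos = 0;

static const char *lookahead(void) { return toks[pos]; }

/* match: the workhorse of top-down parsing. Demand one particular
 * token and consume it, or give up. */
static void match(const char *expected) {
    if (toks[pos] && strcmp(toks[pos], expected) == 0) {
        pos++;
    } else {
        fprintf(stderr, "expected '%s'\n", expected);
        exit(1);
    }
}

/* Consume one identifier token. */
static void ident(void) {
    const char *t = lookahead();
    if (t && isalpha((unsigned char)t[0])) {
        pos++;
    } else {
        fprintf(stderr, "expected identifier\n");
        exit(1);
    }
}

static int is_op(const char *t) {
    return t && (t[0] == '=' || t[0] == '+' || t[0] == '*') && t[1] == '\0';
}

/* expression : IDENT (op IDENT)*  -- precedence ignored here */
static void expression(void) {
    ident();
    while (is_op(lookahead())) {
        pos++;                  /* consume the operator */
        ident();
    }
}

/* statement : expression ';' */
static void statement(void) {
    expression();
    match(";");
}

int main(void) {
    statement();
    puts("parsed one statement");
    return 0;
}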

Bottom-up parsing
On the other hand there is something called bottom-up parsing. To get your head around this, first you need to understand that most parsing is broken into two separate phases, called "lexical analysis" (or "lexing") and "parsing." The distinction is that the lexing phase takes input characters and turns them into tokens. A token is an abstraction of the input characters, identifying what role it plays in the language. This can be a slippery concept to grasp: in written human languages, for example, we talk about "words." So is a word a token? Well, if you're trying to process the text, sure. But if you're trying to convert the words into a different tense, maybe you want to separate out prefixes and suffixes as tokens. Or if you're trying to understand the text, maybe you want the tokens to be parts of speech: noun, verb, adjective. Finally, if you're processing Thai, you're doomed - they don't put spaces between the words, so how will your lexer know where one starts and the next one stops?

At any rate, lexing isn't 100% necessary. It's a convenience that you will almost immediately re-invent for yourself if you don't have it - kind of like a "print" statement. But tokens are generally what parsers operate on, whether they have to create them or whether a lexer does it for them.
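To make the characters-to-tokens step concrete, here's a minimal lexer sketch in C for the little expression language used in the examples below. The Token struct and the token names are my own invention, not from any particular tool:

#include <ctype.h>
#include <stdio.h>

/* Token kinds for the toy expression language used below. */
typedef enum { T_IDENT, T_EQUALS, T_PLUS, T_STAR, T_SEMI, T_EOF } Kind;

typedef struct {
    Kind kind;
    char text[32];   /* the raw characters this token abstracts */
} Token;

/* Turn the next run of input characters into one token. */
static Token next_token(const char **input) {
    const char *p = *input;
    Token t = { T_EOF, "" };

    while (isspace((unsigned char)*p)) p++;        /* skip whitespace */

    if      (*p == '=') { t.kind = T_EQUALS; p++; }
    else if (*p == '+') { t.kind = T_PLUS;   p++; }
    else if (*p == '*') { t.kind = T_STAR;   p++; }
    else if (*p == ';') { t.kind = T_SEMI;   p++; }
    else if (isalpha((unsigned char)*p)) {         /* IDENT: a run of letters */
        int n = 0;
        t.kind = T_IDENT;
        while (isalpha((unsigned char)*p) && n < 31)
            t.text[n++] = *p++;
        t.text[n] = '\0';
    }
    /* anything unrecognized just ends the token stream in this sketch */

    *input = p;
    return t;
}

int main(void) {
    const char *src = "y = m * x + b;";
    static const char *names[] =
        { "IDENT", "EQUALS", "PLUS", "STAR", "SEMI", "EOF" };
    for (;;) {
        Token t = next_token(&src);
        if (t.kind == T_EOF) break;
        if (t.kind == T_IDENT) printf("IDENT(%s) ", t.text);
        else                   printf("%s ", names[t.kind]);
    }
    putchar('\n');
    return 0;
}

Run it on "y = m * x + b;" and you get IDENT(y) EQUALS IDENT(m) STAR IDENT(x) PLUS IDENT(b) SEMI -- exactly the token sequence used in the shift/reduce walkthrough below.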

So, with tokens in hand: bottom-up parsing defines a higher-level structure as the composition of a sequence of lower-level tokens. It's the reverse of top-down, and it can look almost exactly the same if you aren't careful:
  • A sequence of zero or more statements is a program.
  • An expression followed by a semicolon is a statement.
See the difference? Yeah, it's tough. And it's made tougher by the fact that the tool writers, who build automatic parser generators, generally all choose a similar syntax:
program: statement*
statement: expression SEMICOLON

Is that top-down or bottom-up? It's impossible to say, really. So you just have to know what kind of tool you're using, and adjust accordingly. But if you're using a bottom-up parser, you'll hear the terms "shift" and "reduce" a lot, so that usually helps. Here's why.

A bottom-up, or shift/reduce, parser uses a separate stack to keep track of the tokens it sees. So when you type in "y = m * x + b;" it receives a sequence of tokens like:
IDENT(y) EQUALS IDENT(m) STAR IDENT(x) PLUS IDENT(b) SEMI
By default, the parser will 'shift' these tokens onto its internal stack, in this order, so that the right-most token is at the "top" of the stack. As it shifts tokens onto the stack, it matches the top-most few tokens against its internal database of rules, to see if any rule has a pattern that the stack matches. If a "production" (a rule) matches what is on the stack, the matching tokens are taken off the stack and replaced with whatever the production makes.

In this example, maybe you have an expression rule that looks like:
term : IDENT ;
op : EQUALS | PLUS | STAR ;
expression : expression op term | term ;

(Note that I am ignoring things like operator precedence. If you want to know more, Google is your friend.)
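Here's that shift/reduce loop as a toy program in C. Be warned that this is a sketch under assumptions: a real LR parser decides between shifting and reducing using a generated state table plus the lookahead token, while this version just greedily applies whichever rule has the longest matching right-hand side after every shift -- which happens to reproduce the walkthrough below. The Rule table and all the names are mine:

#include <stdio.h>
#include <string.h>

/* One grammar rule: a right-hand side of symbols reduces to one
 * left-hand symbol. */
typedef struct {
    const char *lhs;
    const char *rhs[3];
    int len;
} Rule;

/* The toy grammar from above. */
static const Rule rules[] = {
    { "statement",  { "expression", "SEMI" },       2 },
    { "expression", { "expression", "op", "term" }, 3 },
    { "expression", { "term" },                     1 },
    { "term",       { "IDENT" },                    1 },
    { "op",         { "EQUALS" },                   1 },
    { "op",         { "PLUS" },                     1 },
    { "op",         { "STAR" },                     1 },
};
enum { NRULES = sizeof rules / sizeof rules[0] };

static const char *stack[64];
static int top = 0;                    /* symbols currently on the stack */

/* Reduce once if possible: find the rule with the longest right-hand
 * side matching the top of the stack, pop it, push the left-hand side. */
static int reduce_once(void) {
    int best = -1;
    for (int i = 0; i < NRULES; i++) {
        int n = rules[i].len, ok = (n <= top);
        for (int j = 0; ok && j < n; j++)
            ok = (strcmp(stack[top - n + j], rules[i].rhs[j]) == 0);
        if (ok && (best < 0 || n > rules[best].len))
            best = i;
    }
    if (best < 0) return 0;
    top -= rules[best].len;
    stack[top++] = rules[best].lhs;
    printf("  reduce -> %s\n", rules[best].lhs);
    return 1;
}

int main(void) {
    /* y = m * x + b; as tokens */
    const char *input[] = { "IDENT", "EQUALS", "IDENT", "STAR",
                            "IDENT", "PLUS", "IDENT", "SEMI" };
    for (int i = 0; i < 8; i++) {
        stack[top++] = input[i];       /* shift */
        printf("shift %s\n", input[i]);
        while (reduce_once())          /* then reduce while we can */
            ;
    }
    printf("final: %s\n", top == 1 ? stack[0] : "<parse error>");
    return 0;
}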

In our example, the parser will do the following:

Stack                                 Next token    Action
------------------------------------  ------------  ------
<empty>                               IDENT(y)      shift
IDENT(y)                              EQUALS        reduce
term(y)                               EQUALS        reduce
expression(y)                         EQUALS        shift
expression(y), EQUALS                 IDENT(m)      reduce
expression(y), op(=)                  IDENT(m)      shift
expression(y), op(=), IDENT(m)        STAR          reduce
expression(y), op(=), term(m)         STAR          reduce
expression(y=m)                       STAR          shift
expression(y=m), STAR                 IDENT(x)      reduce
expression(y=m), op(*)                IDENT(x)      shift
expression(y=m), op(*), IDENT(x)      PLUS          reduce
expression(y=m), op(*), term(x)       PLUS          reduce
expression(y=m*x)                     PLUS          shift
expression(y=m*x), PLUS               IDENT(b)      reduce
expression(y=m*x), op(+)              IDENT(b)      shift
expression(y=m*x), op(+), IDENT(b)    SEMI          reduce
expression(y=m*x), op(+), term(b)     SEMI          reduce
expression(y=m*x+b)                   SEMI          shift
expression(y=m*x+b), SEMI             ??            reduce
statement(y=m*x+b;)                   ??            ??

Beware: the expression produced will not be what you think it should be without all that operator precedence stuff. Instead, you'll get an expression tree like (((y=m)*x)+b), wrong in nearly every language.
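For the record, the standard fix is to layer the grammar so that tighter-binding operators live deeper in the rules. A sketch (note that term takes on a new meaning here, and I'm leaving assignment out entirely):

expression : expression PLUS term | term ;
term : term STAR factor | factor ;
factor : IDENT ;

With this layering, m * x has to be reduced to a single term before the PLUS rule can touch it, so the multiplication binds tighter, as intended.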

So what's the difference?
Well, one difference is that bottom-up parsers can recognize a broader set of languages than top-down parsers. Another difference is that bottom-up parser generators are generally easier to code than top-down generators. And the bottom-up parsers generally perform better than their top-down brethren, because they are using small, simple data structures (a big table of rules, and a small stack).

But more importantly, the difference is in the "legacy" of bottom-up versus top-down parsing. Technically, top-down parsers are called LL, and bottom-up parsers are called LR -- usually both terms are surrounded by other letters and numbers, but if you look hard you'll find an LL or LR in there somewhere. Both kinds consume their input from the left (that's the first L). The difference -- L versus R -- is in the second letter: a top-down (LL) parser builds a leftmost derivation, committing to a rule before it has consumed that rule's tokens, while a bottom-up (LR) parser builds a rightmost derivation in reverse, waiting until it has seen a rule's entire right-hand side before deciding that the rule applies.

One place you can see that difference is in C versus Perl 4 or 5 parsing. A variable declaration and function declaration in C look like:
int x = 1;
int foo() { return x; }
In Perl, you get:
our $x = 1;
sub foo() { return $x; }
Notice that in C, the "differences" between the two are on the right, while in Perl they are on the left. Larry Wall is generally a pretty people-oriented person, so he may have done that to make declarations easier for coders to read. But I'd also bet that the first Perl parser was a top-down parser, one that had to commit to its decisions early, from the left.

At last!
Because this is the point: top-down and bottom-up parsers have different operating characteristics. One of those differences is that a top-down parser wants to 'commit' early to pursuing a particular path. As a result, top-down parsers are going to excel at recognizing languages where the major decisions are encoded to the left. Declaring "sub foo" puts the fact that this is a function declaration right up front. A top-down parser doesn't have to ask a single question about what's going on: it knows from the very first token that this is a subroutine.

One of the more frequent questions on the ANTLR support list was about parsing C-like declarations, where the information that a function definition is taking place doesn't show up until the very end. The only difference between a declaration of a function (a prototype) and the function definition itself is that the final semicolon is replaced by a block:
int foo(void) ;
int foo(void) { return 1; }

How does that affect me?
If you are coding a top-down parser, you hate C syntax for this reason. Almost all of the grammar specifications for languages like C are built expecting a bottom-up parser. So blindly converting the grammar from one syntax to the other will produce a parser that occasionally does a lot of work to parse a declaration, then discards it all, then redoes the exact same work to parse a function definition. How much fun is that?

Getting around this requires changing the structure of the grammar definition. Instead of having two rules for declarations and function definitions, you want a single rule:

declaration: specifiers declarator ';'
definition: specifiers declarator block
becomes:
any_decl: specifiers declarator [ ';' | block ]

This avoids the parse-forget-reparse problem, at the expense of making your grammar actions heavier.
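In a hand-written top-down parser, that merged rule becomes a single function that parses the common prefix exactly once and lets the last token decide. A sketch in C -- the helper names below are stand-ins I've invented, not from any real grammar:

/* Stand-ins for the real sub-rules and token tests; these names
 * are assumptions for the sketch, not from any particular parser. */
extern void specifiers(void);          /* e.g. "int"       */
extern void declarator(void);          /* e.g. "foo(void)" */
extern void block(void);               /* "{ ... }"        */
extern int  lookahead_is(const char *tok);
extern void match(const char *tok);
extern void finish_declaration(void);
extern void finish_definition(void);

/* any_decl : specifiers declarator ( ';' | block )
 * Parse the common prefix exactly once, then let one token decide
 * whether we just saw a declaration or a definition. */
void any_decl(void) {
    specifiers();
    declarator();
    if (lookahead_is(";")) {
        match(";");
        finish_declaration();
    } else {
        block();
        finish_definition();
    }
}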

In turn, rewriting your rules like this means you tend to change your language. If your declarations and definitions all come from the same rule anyway, why not accept more than one at a time?
int foo(void) { return 1; }, bar(void) { return 2; }

Anyway, that's enough for today.
