Saturday, September 26, 2009

The Whitespace Hack

Every culture has certain rites of passage that they expect members to pass through before they are fully accepted. In some places, they're pretty extreme -- taking knives to your naughty bits, and the like. Fortunately, the Parrot community is a little more laid back than that. There are two rites of passage for new members to participate in. The second one, which I haven't done yet, is to code a new garbage collector. Apparently, when you've coded a new GC, you get to do
svn commit -m "Today, I am a Parrot."
and you're totally in. I haven't reached that far, but I'm here to tell you that the first rite of Parrot-hood is pretty easy: filling out a ticket on http://trac.parrot.org.



You have to register, obviously, and a lot of the tickets get closed out with a comment like "You're doing it wrong." or "That's how it's supposed to work. Read the docs." But if you keep trying, eventually you'll get a real ticket in. And then comes the best part -- they fix stuff! There are few things in life more satisfying than watching people work your tickets on #parrot, while chortling "Dance, puppets! Dance while I pull your strings with my ticket submissions!"

Actually, it turns out there are some things more satisfying than that. But submitting tickets is definitely a good way to get into the community. And the earlier you start, the more bugs there will be, so the easier they are to find. It's like a Multi-Level Marketing program -- the earlier you join, the easier it is, and the more you gain. In the case of Parrot, mostly you're gaining karma on #parrot. But everybody starts somewhere.

Anyway...
I submitted a ticket (TT#1065) today about a problem with PIR that is inserted in-line in PGE rules. The ticket isn't very interesting, but it lets me segue into talking about the Whitespace Hack.
token ws {
    | <?{{
        $P0 = get_global '$!ws'
        if null $P0 goto noshort

        $P1 = $P0.'to'()
        $P2 = match.'to'()

        if $P1 != $P2 goto noshort
        .return(1)

    noshort:
        set_global '$!ws', match
        .return(0)
        }}>
    | <!ws> <.WS_ALL>*
}
Give that a read-through, and see if it's obvious what is going on.

It's sure not obvious to me. Even when I went back and read it later, I totally missed some stuff. Pmichaud is a pretty smart guy, and he's solving an important problem, but even when you know what he's doing, it's easy to miss stuff.

The <.ws> rule

The first thing to be aware of is that the ws rule has a special place in Perl6 grammars. If you specify a rule (as opposed to a token or a regex), or if you add some modifiers (:sigspace, I think, but you'd do well to check the specs), then the rule automatically matches optional whitespace everywhere you have white space in your grammar.
rule my_rule {
    my dog has fleas
}

This rule matches a bunch of literal characters - because all "normal" characters match themselves unless you do something to them - and it also optionally matches whitespace in between some of them.

The way this works is that the regex above is modified by adding calls to a whitespace-matching rule -- you guessed it, . The effect is the same as if you wrote:
rule my_rule {
    <.ws>my<.ws>dog<.ws>has<.ws>fleas<.ws>
}

(Just so you know, putting that period in front means that the results of those matches don't get collected. It's "match ws and don't save the contents," which is backwards from old-style regexes, where you had to put parens around all the groups you wanted to keep.)

One important part of "parsing" -- as opposed to "regexing," I guess -- is that lots of stuff in the input text gets treated as whitespace. For example, a fairly common approach to computer language parsing is to treat the comments as whitespace, so they can be ignored no matter where they appear. (Some old C hacks relied on a quirky implementation of this in the preprocessor to do "token pasting." Since comments were whitespace, a token followed by a comment ended the token. But since comments were replaced by 0 characters, a C comment could "glue" two tokens together with no spaces between them.)

So we've got two different imperatives for the ws rule: it will get called a lot because of the automatic whitespace recognition feature; and it needs to potentially recognize a lot of different stuff, like "real" whitespace, comments, heredocs, and whatever else strikes the coder's fancy. Here's the WS_ALL rule I use for Close:
token WS_ALL {
   [ \h+                # WS
   | \n [ {*} #= start_heredoc
          [
             <?{{    $P0 = get_hll_global [ 'close' ; 'Grammar' ; 'Actions' ], '$Heredocs_open'
                  $I0 = $P0

                  .return($I0)
             }}>
             [ $=[ \h* \h* [ \n | $ ] ]  {*} #= check_for_end
             || $=[ \N* \n ]
             ]
         ]*
         #{*} #= finish_heredoc
      ]
   | '/*' .*? '*/'            # C_BLOCK_COMMENT
   | '//' \N* [ \n | $ ]        # C_LINE_COMMENT
   | <.POD>
   ]
}
Matching nothing
The key fact to remember about the ws rule is that whitespace is optional. The ws rule is being inserted because you, the developer, have whitespace in your rule. Maybe that's because you expect whitespace, or maybe it's because you want to separate two parts of a complicated rule to make things readable. So ultimately, ws absolutely must accept a zero-length pattern.

The token ws regex at the top is broken into three alternative paths. The first path is a check for a single-entry cache. I'll come back to the caching theory later -- it's one of those things I missed the first time through.

The second rule is where most of the "work" of whitespace-ignoring is going to be done. Let's look at that in some detail:
    | <!ww> <.WS_ALL>*

It seems pretty simple. The first part of the rule calls another rule, ww, that is also built-in to the PGE-generated grammar. The ww rule is a zero-width condition test, and it returns true if the current position is between two 'word' characters. That is, if the previous character matches \w and if the next character matches \w, then is true.
token ww {
    <?after \w>

    <?before \w>
}

Of course, ww uses some magic to do its testing, so it looks nothing like that! But it could look like that, if it were slow.

So what does do? Well, it returns true if is false. is the logical negation syntax in Perl6 grammars. If both characters are not word characters, then we're in the middle of some spaces or line noise. If both characters are word characters, then we're in the middle of a word. And if one character is while the other character isn't a word character, we're looking at either the start (\W\w) or the end (\w\W) of a word.

Using matches three of those four cases: it matches the case where we are in the middle of spaces or line noise, the start of a word, and the end of a word. Most users won't care about the start-of-word case -- not many modern languages use words to introduce comments. So is a speed optimization, and a filter that eliminates middle-of-the-word invocations of the ws rule.

Only when checking for "space" might be relevant -- when approves -- does the work of actually checking for whatever it is that Close calls "whitespace" begin.

Caching is a win
One way to speed up any repetitive task is to cache the results. This is a standard dogma for most computer people, but does it apply to whitespace? Especially, does it apply when you're scanning forward through a string doing pattern matching?

It turns out that it does. Take a look at this code from the Close grammar.
rule declaration {
    <specifier_list> <declarator_part> [ ','  <declarator_part> ]* {*}
}

rule declarator {
    <dclr_pointer>* <dclr_atom> <dclr_postfix>*     {*}
}

rule declarator_part {
    <declarator>
    <dclr_alias>?
    # ...

The entry point for these rules is declaration, which will match a specifier list, then check for whitespace, then call declarator_part.

When declarator_part is called, it will check for whitespace, then call declarator.

When declarator is called, it first will check for whitespace, then call dclr_pointer.

Can you see where this is going?  When an outer rule calls declaration, the first thing that happens is a whitespace check. The second, third, and so on, things that happen are whitespace checks. Assuming there is some whitespace to consume, the very first invocation of the ws rule is going to gobble it all up. After that, each subsequent call to the ws rule do a lot of extensive testing for various ways that it might not fail, before failing.

So we come to the first alternate branch of the ws rule:

    |<?{{
        $P0 = get_global '$!ws'
        if null $P0 goto noshort

        $P1 = $P0.'to'()
        $P2 = match.'to'()

        if $P1 != $P2 goto noshort
        .return(1)

    noshort:
        set_global '$!ws', match
        .return(0)
     }}>
When a subrule is invoked, even a subrule that is coded in PIR in-line, rather than being written using Perl6 regex syntax, there are two properties that already have a meaningful value. The match.to() property is the end of the current match, while the match.from() property is the beginning. The results are offsets from the start of the text -- numbers, in other words.

When a new match is being set up, the start and end are going to point to the same place. And for zero-length matches, like the ww rule, they are always pointing at the same place. So what this code does is not use the start of the match, but rather the end. Because the first time a whitespace rule runs, it might actually match some whitespace, and so .to() and .from() will be different.

But the next time the ws rule gets called in the same location, it will be called with .to() equal to .from(), and both of them will be pointing at the same place where that first ws match ended. So this cache ignores the start of the match, and focuses on comparing the end of the match with the endpoint of the cached result. It essentially checks if the very last whitespace match ended at the current position. If so, then the next call to the same function will also decide to stop here. So don't bother making the call.

When caching doesn't work
There's two problems with the whitespace hack that you probably can't see. They are related, and they spring from an earlier post, in which I mentioned that Close calls itself to handle #include processing, as well as some internal voodoo.

The first problem is that the PGE compiler outputs the inline PIR with no namespace. That's the ticket I just submitted, and while it isn't fatal, it did cause me to spend some time tail-chasing before I figured out where the $!ws variable was winding up.

The second problem is that when there is no whitespace at the beginning of a translation unit -- as happens when the first line is #include -- the whitespace hack caches a match entry with an ending offset of 0. And when the included file starts, it checks for a cache entry, finds one (after all, it's a global), and requires that the beginning of the file be something other than whitespace. Whoops!

The solution, for me, was to implement a "whitespace hack stack" to go along with my #include'd filename stack. So when I change files, I also save and restore the cached $!ws variables. Technically, I could just blow away the contents, but since I had to write all the code to reach into the variable and set it to something, I decided to save and restore it.

1 comment:

  1. Coding and debugging isn't the piece of cake.It takes a lot of time to find errors.I am glad that you are not wasting your time,instead you are doing something that is helpful for people.keep it up

    ReplyDelete