Close

I'm back!

2010-03-22T08:04:00.000-07:00

With the completion of Kakapo release-10 a while back, I've been focusing on compiler work again. I have to say, Kakapo is making it smooth.

The change to NQP-rx is causing all kinds of problems. Some of them are voluntary: I'm moving to attribute-based classes, and I'm taking advantage of a bunch of features of the new -rx engine. But some of them are involuntary, like climbing the learning curve for how optable parsing works.

I put up a wiki page about that over on the Parrot.org trac site, but even after writing out a bunch of stuff, it's still not totally obvious to me how some parts work. That's part of the learning process, I guess.

One thing I'm doing somewhat differently now is testing. Before, I was writing tests for the compiler. Now I'm writing tests of the compiler.

I've got this much of the TDD approach figured out: find out what needs to be done, write some code to do it, and then bury it in tests.

In theory, this is a bad idea because it leads to "test sclerosis," a condition where the system becomes unmodifiable because of the cost of updating the test cases. That's not an issue for me (yet) because I'm testing parsing rules. First, if things get a little sclerotic I don't think I'll mind, because the testing has been working - in the sense that it is finding places where my code doesn't do what I think it does. (Switching to a new language that uses the same syntax makes this a particularly acute problem.)

Second, although it's still early in the process, I'm testing stuff that I don't expect to change much. Recognizing this or that token should produce pretty much exactly the same PAST output. As I get higher in the grammar, this may not be true any more - maybe when I'm up high, I'll want to quickly change the grammar and the tests will be a drag.

But for now, it's smooth sailing. The Kakapo library has a little bit of syntactic sugar added just for this purpose: the Matcher::PctNode class and its friend Matcher::PAST::Node. Here's what a testcase looks like:

method test_binary() {
    my $matcher := val( :value(1) );
    my $code    := "0b01";

    my $slam    := Slam::Compiler.new;
    my $past    := $slam.compile: $code,
        :rule<EXPR>,
        :target<past>,
        ;

    assert_match($past, $matcher,
        'Failed to parse {' ~ $code ~ '} as expected');
}

In fact, there are so many test cases that the processing is more scripted than this. But this represents what one test would look like. The sugar I've built is in the "my $matcher := " line - that expression is generating a matcher.
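
For readers without a Parrot build handy, the idea of matching a parse result against a declarative matcher can be sketched in Python. This is only an illustration: Node, NodeMatcher, and parse_literal are hypothetical stand-ins, not the Kakapo Matcher classes or the Slam compiler, and only val and assert_match borrow their names from the test above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a PAST node: a kind plus named attributes."""
    kind: str
    attrs: dict = field(default_factory=dict)


class NodeMatcher:
    """Matches a node of a given kind whose attributes include the expected ones."""

    def __init__(self, kind, **expected):
        self.kind = kind
        self.expected = expected

    def matches(self, node):
        if node.kind != self.kind:
            return False
        for name, value in self.expected.items():
            if name not in node.attrs or node.attrs[name] != value:
                return False
        return True


def val(**expected):
    """Sugar in the spirit of the val(:value(1)) call above."""
    return NodeMatcher("Val", **expected)


def assert_match(node, matcher, message):
    if not matcher.matches(node):
        raise AssertionError(message)


def parse_literal(code):
    """Toy parser (an assumption, not Slam): handles only 0b... literals."""
    if code.startswith("0b"):
        return Node("Val", {"value": int(code[2:], 2)})
    raise ValueError("unsupported literal: " + code)


code = "0b01"
assert_match(parse_literal(code), val(value=1),
             "Failed to parse {" + code + "} as expected")
```

The matcher only checks the attributes it was given, so a test stays short even when the node carries more data.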

Anyway, this next stage is going to be mostly me migrating the grammar over to -rx, and building test scaffolding around it. I've started working on a private branch, instead of trunk, so there won't be so many bogus commits.

Kakapo release-2

2010-02-15T17:46:00.000-08:00

I released a limited version of Kakapo today. Then, I released again.

I created the release-1 tag, and felt good for a bit. Then I noticed that some common documentation files were missing, so I added them and created release-2.

Release-2 only provides the _base library, with PMC type extensions. But it's something. Documentation is over at GoogleCode: http://code.google.com/p/kakapo-parrot/

There's more than one way to do it...

2010-02-12T06:48:00.000-08:00

In the process of paying the taxes, I've come across one of the important classes in Kakapo that needs "updating."

In this case, "updating" means "complete rewrite," since I've got the new class system working. So I thought I'd show a before/after version of the code.

Here's the before version:

module DependencyQueue;
# A queue that orders its entries according to their prerequisites.

_pre_initload();

method added(*@value)        { self._ATTR_HASH('added', @value); }
method already_done(*@value)    { self._ATTR_HASH('already_done', @value); }
method cycle(*@value)        { self._ATTR_HASH('cycle', @value); }
method cycle_keys(*@value)    { self._ATTR_ARRAY('cycle_keys', @value); }
method open(*@value)        { self._ATTR_HASH('open', @value); }
method pending(*@value)        { self._ATTR_HASH('pending', @value); }
method queue(*@value)        { self._ATTR_ARRAY('queue', @value); }

method add_entry($name, $value, :@requires?) {
    unless @requires { @requires := Array::new(); }

    my @entry := Array::new($name, $value, @requires);
    self.pending{$name} := @entry;
}

method already_added($name) {
    return self.already_done{$name} || self.added{$name};
}

method init(@args, %opts) {
    for @args {
        self.mark_as_done(~ $_);
    }

    self.open(1);
}

method is_empty() {
    if self.open {
        return self.pending.elements == 0;
    }

    return self.queue.elements == 0;
}

method mark_as_done($label) {
    self.already_done{$label} := 1;
}

method marked_done($label) {
    return self.already_done{$label};
}

method next() {
    if self.open {
        self.tsort_queue();
    }

    if self.queue.elements {
        my $node := self.queue.shift;
        self.mark_as_done($node[0]);
        return $node[1];
    }

    return my $undef;
}

sub _pre_initload() {
    if our $_Pre_initload_done { return 0; }
    $_Pre_initload_done := 1;

    use('Dumper');

    Class::SUBCLASS('DependencyQueue',
        'Class::HashBased');
}

method reset() {
    self.open(1);
    self.pending(Hash::empty());
}

method tsort_queue() {
    self.open(0);
    self.cycle_keys(Array::empty());
    self.cycle(Hash::empty());
    self.added(Hash::empty());

    self.tsort_add_keys(self.pending.keys);
}

method tsort_add_keys(@list) {
# Visits a list of keys, adding the attached calls to the queue in topological order.

    for @list {
        my $key := $_;

        unless self.already_added($key) {
            ## First, check for cycles in the graph.
            my $cycle_elts := self.cycle_keys.elements;
            self.cycle_keys.push($key);

            if self.cycle.exists($key) {
                my @slice := self.cycle_keys.slice(:from(self.cycle{$key}));

                Opcode::die("Cycle detected in dependencies: ",
                    @slice.join(', '),
                );
            }

            self.cycle{$key} := $cycle_elts;

            ## Put everything $key depends on ahead of $key
            my $node := self.pending{$key};

            my @prerequisites := $node[2];
            self.tsort_add_keys(@prerequisites);

            ## Finally, it's my turn.
            self.added{$key} := 1;
            self.queue.push($node);
        }
    }
}

The first few lines are the 'module' declaration and the call to _pre_initload. The call is a waste: Kakapo.nqp already calls this sub in its "super-duper-early" section, because the dep-q is used to implement the ordering for all the other modules.

Looking at the _pre_initload sub, the class is declared as a direct subclass of HashBased, which was my old root behavior. I can rewrite this class header using the 'class' keyword and no call to _pre_initload. I should put my copyright block in:

# Copyright (C) 2009-2010, Austin Hastings. See accompanying LICENSE file, or
# http://www.opensource.org/licenses/artistic-license-2.0.php for license.

class DependencyQueue;
# A queue that orders its entries according to their prerequisites.

The next block is a bunch of attribute-methods. Those are automatically generated using the P6metaclass 'has' sub:

INIT {
    use(    'P6metaclass' );

    has(    '%!added',
        '%!already_done',
        '%!cycle',
        '@!cycle_keys',
        '%!open',
        '%!pending',
        '@!queue'
    );
}
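
As a rough illustration of what generating attribute methods from sigiled names involves, here is a Python sketch. It is not how P6metaclass or Kakapo implement 'has'; the function below is a hypothetical analogue that assumes every name looks like sigil, '!', then the attribute name, and that calling the method with an argument sets the value while calling it without one reads it.

```python
def has(cls, *names):
    """Attach one combined getter/setter method per sigiled attribute name."""
    # Assumed defaults per sigil: '%' is a hash, '@' is an array, '$' is a scalar.
    defaults = {"%": dict, "@": list, "$": lambda: None}
    for name in names:
        sigil, attr = name[0], name[2:]  # '%!added' -> '%', 'added'

        def accessor(self, *value, _attr=attr, _make=defaults[sigil]):
            store = self.__dict__.setdefault("_attrs", {})
            if value:                    # called with an argument: set
                store[_attr] = value[0]
                return self
            if _attr not in store:       # first read: create the default
                store[_attr] = _make()
            return store[_attr]

        accessor.__name__ = attr
        setattr(cls, attr, accessor)


class DependencyQueue:
    pass


has(DependencyQueue, "%!added", "%!pending", "@!queue")

q = DependencyQueue()
q.pending()["parser"] = ["parser", "value", []]   # read, then mutate in place
q.queue(["first"])                                 # set a whole new value
```

The point is only that the seven hand-written one-liners collapse into a single declaration.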

The next bunch of methods is generally not going to change. The automatically generated methods should have the same names, and the same behaviors, as the hand-coded ones they replaced. The 'init' method is different, though. That method has been replaced by a series of methods, one of which fills the bill here:

method _init_positional_(@pos) {
    for @pos {
        self.mark_as_done(~ $_);
    }

    self.open(1);
}

Plus, of course, the _pre_initload sub goes away entirely, since the class creation has moved to the class header. If this were not such an important class, I would probably add a call to the INIT header to tell the system that there is no further initload processing to be done. But dep-q runs before the Program:: module does, which is where that information is tracked. (Program:: creates a DependencyQueue to do the tracking.)

Anyway, here's the rewritten code:

# Copyright (C) 2009-2010, Austin Hastings. See accompanying LICENSE file, or
# http://www.opensource.org/licenses/artistic-license-2.0.php for license.

class DependencyQueue;
# A queue that orders its entries according to their prerequisites.

INIT {
    use(    'P6metaclass' );

    has(    '%!added',
        '%!already_done',
        '%!cycle',
        '@!cycle_keys',
        '%!open',
        '%!pending',
        '@!queue'
    );
}

method add_entry($name, $value, :@requires?) {
    unless @requires { @requires := Array::new(); }

    my @entry := Array::new($name, $value, @requires);
    self.pending{$name} := @entry;
}

method already_added($name) {
    return self.already_done{$name} || self.added{$name};
}

method _init_positional_(@pos) {
    for @pos {
        self.mark_as_done(~ $_);
    }

    self.open(1);
}

method is_empty() {
    if self.open {
        return self.pending.elements == 0;
    }

    return self.queue.elements == 0;
}

method mark_as_done($label) {
    self.already_done{$label} := 1;
}

method marked_done($label) {
    return self.already_done{$label};
}

method next() {
    if self.open {
        self.tsort_queue();
    }

    if self.queue.elements {
        my $node := self.queue.shift;
        self.mark_as_done($node[0]);
        return $node[1];
    }

    return my $undef;
}

method reset() {
    self.open(1);
    self.pending(Hash::empty());
}

method tsort_queue() {
    self.open(0);
    self.cycle_keys(Array::empty());
    self.cycle(Hash::empty());
    self.added(Hash::empty());

    self.tsort_add_keys(self.pending.keys);
}

method tsort_add_keys(@list) {
# Visits a list of keys, adding the attached calls to the queue in topological order.

    for @list {
        my $key := $_;

        unless self.already_added($key) {
            ## First, check for cycles in the graph.
            my $cycle_elts := self.cycle_keys.elements;
            self.cycle_keys.push($key);

            if self.cycle.exists($key) {
                my @slice := self.cycle_keys.slice(:from(self.cycle{$key}));

                Opcode::die("Cycle detected in dependencies: ",
                    @slice.join(', '),
                );
            }

            self.cycle{$key} := $cycle_elts;

            ## Put everything $key depends on ahead of $key
            my $node := self.pending{$key};

            my @prerequisites := $node[2];
            self.tsort_add_keys(@prerequisites);

            ## Finally, it's my turn.
            self.added{$key} := 1;
            self.queue.push($node);
        }
    }
}
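
The ordering logic in tsort_add_keys is a depth-first topological sort with cycle detection. Here is a compact Python sketch of the same idea. It is a paraphrase rather than a port: the names are my own, it keeps an explicit path that is popped on the way back out (the NQP version records positions in cycle_keys instead), and it assumes every prerequisite is either pending or already marked done.

```python
class CycleError(Exception):
    pass


class DependencyQueue:
    def __init__(self, *already_done):
        self.done = set(already_done)
        self.pending = {}            # name -> (value, list of prerequisites)

    def add_entry(self, name, value, requires=()):
        self.pending[name] = (value, list(requires))

    def ordered(self):
        """Return the values so that each one follows its prerequisites."""
        added, queue, path = set(), [], []

        def visit(key):
            if key in self.done or key in added:
                return
            # Seeing a key that is still on the current path means a cycle.
            if key in path:
                cycle = path[path.index(key):] + [key]
                raise CycleError(
                    "Cycle detected in dependencies: " + ", ".join(cycle))
            path.append(key)
            value, requires = self.pending[key]
            for prerequisite in requires:   # everything key needs goes first
                visit(prerequisite)
            path.pop()
            added.add(key)                  # finally, it's key's turn
            queue.append(value)

        for key in list(self.pending):
            visit(key)
        return queue


q = DependencyQueue("core")                 # 'core' is already done
q.add_entry("app", "APP", requires=["lib", "core"])
q.add_entry("lib", "LIB", requires=["core"])
order = q.ordered()                          # 'lib' is queued ahead of 'app'
```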

Progress!

2010-02-12T05:02:00.000-08:00

My taxes aren't paid yet.

That said, I've at least seen some forward progress under the new NQP. The change to not using :init on code at the top level means that I'm getting the chance to write more test cases for every ... single ... file in the kakapo library. And yeah, there's some "why the hell doesn't this work?" going on, as well.

I'm pleased to report that I have a working UnitTest system. It's modeled on xUnit, but the default also includes a TAP Listener so that test cases report as TAP tests.

Here's what it looks like:

class Test::Pmc::Undef
    is Test::Pmc::COMMON {

    INIT {
        use(    'P6metaclass' );
        use(    'UnitTest::Testcase' );

        Program::register_main();
    }

    sub main() {
        my $proto := Opcode::get_root_global(Opcode::get_namespace().get_name);
        $proto.suite.run;
    }

    method test_defined() {
        verify_that( "Defined returns false" );
        my $object := self.class.new;

        if $object.defined { fail( "Undef.defined reports yes" ); }
    }
}

Note that I've created a Test::Pmc::COMMON testcase class to handle some common test cases for all the Pmc types. So each Test::Pmc::<type> gets a bunch of test_... methods inherited.

The INIT block runs after the class is declared. I import P6metaclass, which Kakapo extends with some class definition subs. You don't see them here, but things like " has( '$!attribute' ); " are there.

The main sub is magic. Black magic. Secrets man was not meant to know. Kakapo's "register_main" only works with subs - not methods. (For now...PHP anyone?) So there needs to be a sub, and it needs to have some kind of link to the class or namespace of the test case. So each test file includes this boilerplate 'main' that does eldritch mummery to figure out what the current class's proto-object is, and uses that to create and run a test suite from the current test case.

The test_defined method is a pretty representative test method. It's short and simple, checks a single condition and either returns quietly or throws an exception. The verify_that function sets the $.verify attribute on the current object (it uses dynamic scope to find 'self') so that tests can provide a little more description. Here's the output:

/usr/local/bin/parrot -Llibrary t/Pmc/Undef.t.pbc
1..5
ok 1 - 'new' returns an object of the right type
ok 2 - Defined returns false
ok 3 - 'isa' returns correct results
ok 4 - Clone returns a different, valid object
ok 5 - A 'can' method exists, and returns known results

More snow .. err, code!

2010-02-10T17:09:00.000-08:00

We got some more snow here in the northeast. Which means I've had plenty of time to code. Sadly, though, while I've been very active on Parrot's trac submitting tickets, I haven't made much forward progress. This was my tree at about 4:30pm.

I have learned something about the C3 linearizer, and about how P6object classes work. And I was able to make my first Parrot commit this week, providing a test case. (Yay! Test cases.)

This was my tree at about 6:30pm. Notice that there's about another inch and a half of snow on the branches. While I was walking home from dinner, I heard a double "ka-krak!" I turned to look and saw some pine branches falling off their trees, broken under the weight of the snow they were carrying.

Codin' weather

2010-02-06T12:09:00.000-08:00

Well, the snowpocalypse is upon us, and I've got 15 inches of snow outside my front door. There's nothing to do (and not really any way to do it) until the Super Bowl tomorrow. Obviously, this is codin' weather. Hopefully I can get a Kakapo release in the can, and start back on Close development.

I think there's about a foot on the roof, with 18 inches or so in the lee of my ornamental shrubbery:

And there's a little more out by the street, where there are no "wind effects" to mess up the delivery of the snowy goodness. Let's call it two feet:

Paying taxes

2009-11-17T22:59:00.000-08:00

A long time ago - 3 or 4 months, when I started this project - I tried to update my version of Parrot every few days. I'd get some code working, then the next day I'd svn update in the Parrot dir, rebuild, and carry on. But a couple of things happened, and I realized that that was a stupid idea. Once you give out commit privs, it's next to impossible to enforce any kind of discipline without chasing volunteers off your project, so Parrot is stuck with the occasional commit-without-testing.

So I decided that updating without any knowledge about stability was a stupid thing to do, and I started updating only after the monthly releases. Updating from trunk right after a release is a bad idea, too, because the release process seems to involve a "code freeze" for a little bit just before the release, and then everybody slams in whatever branch they've been holding off on merging for the last week. So I'm basing my stuff on the tagged release, rather than trunk, which is as it should be.

A few times, there have been failures even then, because the code didn't work, or a bug was introduced, or whatever. I've evolved a procedure for that, too -- I move the old Parrot workspace out of the way until I'm sure the new one is good, and I don't waste any time waiting for fixes. This means sometimes I go two months with no updates, but that's not such a bad thing.

At any rate, every month or so I "pay the upgrade tax." That is, I invest an hour or more in getting and building and rebuilding, and oh, yeah, I forgot I need to run dos2unix, and re-rebuilding Parrot. My buddy Jesse is now convinced that I've generated a meme, since other folks on #parrot keep telling me about paying the tax.

Today is tax day, and there's this new version of NQP, called NQP-rx. (That's the last sentence in this post that I'll be able to write without having to go back and remove profanity.)

And the thing is, it's "better" than NQP + PGE, because it promises to finally let me put NQP in-line in the grammar targets, rather than just inline PIR. But there's a price, and the price is a bunch of changes. Like a stricter POD parser.

IMO, pod6 leaves a fair amount to be desired. Damian delivered something that looks a whole lot like xPmOlD, and for documenting source code it's *way* more verbose than it should be. I wasn't too unhappy with the NQP pod parser, since it would accept code like:

method foo() {
=method
Blah blahblah
=end

code_for_foo();
}

But now there's a new sheriff in town, and I'd have to say something like "=begin method" at the top. Which wouldn't suck too bad (it's only 14 characters or so, compared with the <ahem> 1 character required for docstrings in elisp, or the <cough> 3 characters required in Python). But then I'd also have to say "=end method" at the end, because matching end-tags have been such a great f.....reaking idea in Ada and XML. So I decided that "ctrl-q" - one "character" - was a much better solution, and I've converted my POD comments into block comments (# at beginning-of-line), which Notepad++ does with a keystroke. The suboptimal part is that I'm having to do that over and over again, wherever I have written any documentation. I'm doing this in Kakapo, which (I'm sorry, it's true) I'm still working on. And sometimes I wind up just deleting the documentation. Hell, *I* know what the stuff does, and isn't that enough?

Another change which irritates me (but probably only me) is the $-variable interpolation in NQP. Yesterday it wasn't there, today it is. And since I've been doing a lot of PIR-generation, I'm caught replacing " with ' in a thousand places.

I'm sure there will be some stuff that doesn't irritate me until run-time, too. But I'm going off to pay the upgrade tax now, and this month the taxes are high. Should I blame Obama for this, too?

No Close! Kakapo!

2009-11-13T07:41:00.000-08:00

I've been silent for a while, and that's pretty bad.

The reason is that I've been busy not working on Close at all. Instead, I stopped to turn a bunch of the code I had already written into a general-purpose library, called Kakapo.

Kakapo is a runtime library for NQP, that helps developers (like me!) do useful stuff without having to drop into PIR every couple of lines. Since this is code that has been extracted from Slam, obviously Slam will be using it. I hope that other coders find it useful, too.

rtype not set

2009-10-19T10:34:00.000-07:00

I just got this diagnostic from NQP, and boy, is that not a helpful error message. Of course, since it's in PAST->POST compilation, there's no line number or token output.

FWIW, the problem was that I was doing this:

self.declarator := '' ~ $name;

when, as anyone can plainly see, I should have been doing this:

self.declarator('' ~ $name);

because .declarator is a method, not an attribute. D'oh.

I've seen this error once before, and I didn't remember what caused it. I hope that next time I'll remember it, or that The All-Knowing Oz will be able to remind me.

Parrot Multi-Subs and Inheritance

2009-10-16T16:53:00.000-07:00

One of the things I've added to Slam (the Close compiler) is support for declaring multi-subs. In another language, these would be called "overloaded methods" or "overloaded functions," but in Close, it's all about the underlying hardware, and that means MultiSubs.

Before I write too much, let me make this clear: Parrot is not the same as Perl6, or Python, or whatever your favorite language is. And so, the way Parrot (and Close) handle things like object dispatch -- the "Object Meta-model," as it is called -- is different from the way other languages handle these things. This is one of those times where the difference matters. (And specifically, the stuff I write below about inheritance isn't true for Perl6. You have been warned.)

In Parrot's default object model (P6object), each class can have one or more parent classes, which can have one or more grandparent classes, etc. And in order to dispatch method calls "optimally," an algorithm called the C3 algorithm is used.

The subject in question is method resolution order, and it has to do with picking between a method, say "bark," that might be inherited from two different parents. If you have a class Tree, with a "bark" method, and a class Dog, with a "bark" method, and a child class DogWood that inherits from both Tree and Dog, which "bark" method runs when you invoke it? That's method resolution.

And when you "order" the inheritance hierarchy, so that methods from one class get chosen in preference to methods from another class, you have a method resolution order, or MRO. In most class hierarchies, MRO is simple. For instance, in the Parrot Compiler Toolkit (PCT) libraries, there are classes like this:

PCT::Node - the root of the class hierarchy
PCT::Block - a lexical scope
PCT::Stmts - a collection of statements, but not a lexical scope
PCT::Var - a declaration or reference to a variable
PCT::Val - a literal value
PCT::Op - an operation, like add or multiply
PCT::VarList - a collection of variable declarations

Conveniently, every one of those classes (except Node) is derived from PCT::Node. So the class hierarchy looks like a fan. But that makes the MRO trivial. For each class, the MRO is "search my methods, then search PCT::Node, then quit."
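
Python also linearizes classes with C3, so the DogWood question and the fan-shaped case can both be tried out there. This is an illustration in Python only; it shows what C3 produces for these shapes, not Parrot's implementation, and the class bodies are made up.

```python
class Tree:
    def bark(self):
        return "rough and woody"


class Dog:
    def bark(self):
        return "woof"


class DogWood(Tree, Dog):      # parents listed as Tree first, then Dog
    pass


# The method resolution order: DogWood, then Tree, then Dog, then object.
dogwood_mro = [cls.__name__ for cls in DogWood.__mro__]
which_bark = DogWood().bark()  # Tree.bark wins because Tree comes first


# The "fan" shape: every class derives directly from one root.
class Node:
    pass


class Block(Node):
    pass


class Val(Node):
    pass


# "Search my methods, then search Node, then quit" (plus Python's object).
block_mro = [cls.__name__ for cls in Block.__mro__]
```

Swapping the parent order to DogWood(Dog, Tree) flips the answer, which is the whole point of having a defined order.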

So what does this have to do with MultiSubs, you ask? Well, it turns out that MultiSubs are magical in two ways. First of all, a MultiSub is a single entity. Get that through your head, and let it sink in. It took me a while to understand it. A MultiSub is a PMC that is created by the compiler when you use the :multi modifier on a sub declaration (in PIR). Likewise in Close -- if you declare a function with :multi, it will be created as a MultiSub.

These MultiSub PMCs are treated somewhat like Sub PMCs, meaning that they get compiled into PBC files, loaded, and they respond to the .invoke() method call. But that's where things get ugly. Because a MultiSub secretly encodes a list of ordinary Subs. (I don't think it's possible to "nest" MultiSubs, but I haven't done any research on it.) Each of the ordinary Subs is associated with a Signature that indicates how many arguments, and of what types, are to be used in MultiMethod Dispatch (MMD). And yeah, it's called MMD even though it can be done on plain old non-method Subs. Go figure.

So when a MultiSub gets invoke()'ed, it takes its list of arguments and tries to match the args against a Sub using the signatures. There's an algorithm for that, too, that uses the "Manhattan distance" -- essentially (but not exactly) the sum of the counts of how many "steps away" each argument is from the corresponding parameter in its own MRO list.

For example, since DogWood inherits from Tree and Dog, it might be one step, or two steps, from Tree. So a function declared to take a single Tree parameter will have a Manhattan distance of 1 or 2 from the DogWood argument. If a function of the same name were declared to take a DogWood parameter, it would have a distance of 0, and so be a "closer" match.

That's where the simplicity ends, though, because when dispatch is done on two parameters, you have the case where a function with one very close match and another very far match might be a better or worse match than a function with two "medium close" classes. For example, say you have a class hierarchy A -> B -> C -> D, where A is the "root" of the hierarchy. In this example, we have two subs "foo" declared as multi, with parameter lists (B, B) and (D, A). If I call foo(d, d), which is closer? How about foo(d, c)?

First of all, this is the stuff that gives academics nightmares. But it's okay, really, because if you find yourself in that position, you probably are getting what you deserve. (If the two subs do radically different things, what the hell were you thinking by overloading them? And if they do similar things, you're probably okay.)
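
To make the foo(d, d) and foo(d, c) questions concrete, here is a Python sketch that scores candidates by a plain sum of MRO steps. That is a simplifying assumption: as noted above, the real dispatcher is "essentially (but not exactly)" this, so treat the numbers as an illustration of the idea rather than a statement about what Parrot picks.

```python
class A:
    pass


class B(A):
    pass


class C(B):
    pass


class D(C):
    pass


def steps(arg_type, param_type):
    """How far param_type sits along arg_type's MRO, or None if unrelated."""
    mro = arg_type.__mro__
    return mro.index(param_type) if param_type in mro else None


def distance(arg_types, signature):
    """Sum of per-argument steps; None when the signature cannot match."""
    if len(arg_types) != len(signature):
        return None
    total = 0
    for arg_type, param_type in zip(arg_types, signature):
        s = steps(arg_type, param_type)
        if s is None:
            return None
        total += s
    return total


def closest(arg_types, candidates):
    scored = [(distance(arg_types, sig), sig) for sig in candidates]
    scored = [(d, sig) for d, sig in scored if d is not None]
    if not scored:
        raise TypeError("no candidate matches")
    return min(scored, key=lambda pair: pair[0])[1]


candidates = [(B, B), (D, A)]
pick_dd = closest((D, D), candidates)   # foo(d, d)
pick_dc = closest((D, C), candidates)   # foo(d, c)
```

Under this plain sum, (D, A) wins both calls: 3 against 4 for foo(d, d), and 2 against 3 for foo(d, c). A different weighting could easily change that, which is why overloads that differ this subtly are best avoided.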

While all that is interesting, it's not particularly relevant. In fact, it all "just works." What's more interesting is what happens when you don't match. That is, when you call a MultiSub with a set of arguments that just don't match any candidate. Consider an obvious (and recently bothersome) example: the compiler builds a tree of nodes. The Slam compiler extends the PCT::Node class hierarchy as I mentioned in my post on Multiple Inheritance, so there are a bunch of different node types to support. While I was trying to implement the Visitor pattern on the syntax tree, I implemented a Class::MULTISUB function in NQP that dynamically generates and compiles trampolines to convert a set of sub declarations into a multisub. The resulting Visitor class had overrideable methods for each possible node type (every class in the hierarchy). So if I have a visitor, $v, and call $v.visit($node), multidispatch invokes the right method based on the type of the first argument.

But what happens when I add a class to the hierarchy? Well, I usually forget to re-run the script that generates all the visitor methods. And so there is no multisub with a signature that matches. Whoops.

An even better question is this: what happens when I overload some of the visit methods in a new Visitor class?

Here's the problem: Parrot does not dispatch "beyond" a MultiSub. That is, when you call a method like 'visit', and Parrot's dispatch mechanism finds a method entry 'visit' that points to a MultiSub, it is done. Parrot hands control to the MultiSub, says "Here ya go. Good luck, buddy!" and doesn't wait around to see what happens. What this means is that by default, calling a MultiSub with an argument list that doesn't match any of the available signatures is a fatal error. Yikes!

So what happens if, say, I want to specify a particular action for handling nodes of a single type -- symbol declarations, say -- but I want the "default" handling to be inherited from some ancestor?

Well, now it gets tricky. First, remember that I'm not using multisubs or multimethods directly. Instead, I'm using these automatically generated trampolines. So the behavior you want to specify lives in a method called something like _visit_Slam_Symbol_Declaration() and it gets called by a trampoline that is part of some parent class' visit MultiSub. So the simple answer is, if you replace the right method, it will just work. Because calling $child.visit($node) results in a call to the inherited Ancestor::visit($child, $node) which in turn invokes $child._visit_Slam_Symbol_Declaration($node) which has been overridden in the Child class. Score!

But there's a problem. There's always a problem, or it wouldn't be worth writing about. The problem is that the host of little per-node-type visit methods is coded waaaay down at the root of the Visitor class hierarchy. And there are a few generations of other classes between the root of the hierarchy and where most code lives. (And here, let me put in a plug for an excellent paper by van Deursen and Visser that serves as a solid introduction to Visitor/Combinators and the JJTraveler class library.)

The intervening layers of parent classes are not necessarily a problem, except when they override the .visit() method. Because a parent class that replaces the MultiSub with a simple Sub method suddenly eliminates the "it just works" mechanism -- the MultiSub dispatcher down at the root of the tree -- that makes per-type method overriding possible.

So what to do? Well, clearly I needed to replace the "root" MultiSub with one in the Child class. Okay, I've got code that does that automatically. But if the Child class only provides one MultiSub method -- because the Child only wants to process symbol declaration nodes, remember? -- there won't be any other methods available. And if there are no valid candidates in a MultiSub, it's a fatal error. Yikes.

One obvious solution was to implement a second multi-method that matched Slam::Node or PCT::Node or something, so that there was always a fall-back multimethod available. But that's boilerplate, and shouldn't that be generated automatically? Well, yes, IMO. So I implemented that, too.

Creating a multisub now does two things. First, it seeks out all the matching method names, and creates trampolines for them. So a call to Class::MULTISUB($class, 'visit', :starting_with('_visit_')) creates a MultiSub named 'visit', and automatically creates trampolines for each existing method that starts with _visit_ in the class' namespace. Each method is assumed to start with whatever prefix you specify, and then contain a class name, with '::' replaced by '_', so that a visit method that takes a Slam::Symbol::Declaration parameter would be named _visit_Slam_Symbol_Declaration.

The second thing is the automatic generation of a "default" multi. This is a multi-method with a signature of (_) -- that is, it is dispatched only on the 'self' parameter and considers any other parameters to be extras. This apparently works because the manhattan sort gives some kind of preference to "right number of parameters", so that while a signature of (_) does match, a signature of (_, Some::Parent::Class) is still "closer."

The "default" multi is generated in one of two ways. The multi compiler searches the class' MRO (available via the 'all_parents' argument to the inspect method on the class object) for the first ancestor with a matching method -- be it a MultiSub or plain old method -- and creates the default multi to call that method if it finds one. Otherwise, the default multimethod just dies with a diagnostic message indicating failure.
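
The two steps, trampolines for the prefixed methods plus a generated default that chains to an ancestor, can be mimicked in Python. This is a loose analogue and not Class::MULTISUB: make_multisub and the classes below are hypothetical, dispatch is done by walking the node's MRO instead of using Parrot's MMD, and plain class names stand in for the '::' to '_' mangling.

```python
def make_multisub(cls, name, starting_with):
    """Install cls.<name>, dispatching on the type of the first argument."""
    # Step 1: the prefixed methods that exist in this class' own namespace.
    handled = [n for n in vars(cls) if n.startswith(starting_with)]
    # Step 2: the first ancestor that already has a method called <name>.
    inherited = next(
        (vars(p)[name] for p in cls.__mro__[1:] if name in vars(p)), None)

    def multisub(self, node, *extra):
        for node_cls in type(node).__mro__:         # nearest type first
            target = starting_with + node_cls.__name__
            if target in handled:
                # Trampoline: looked up on self, so overrides are honored.
                return getattr(self, target)(node, *extra)
        if inherited is not None:                   # the "default" multi
            return inherited(self, node, *extra)
        raise TypeError(f"No {name} candidate for {type(node).__name__}")

    setattr(cls, name, multisub)


class Node: pass
class Declaration(Node): pass
class Expression(Node): pass


class Visitor:
    def _visit_Node(self, node):
        return "generic node"

make_multisub(Visitor, "visit", "_visit_")


class DeclVisitor(Visitor):
    def _visit_Declaration(self, node):
        return "declaration"

make_multisub(DeclVisitor, "visit", "_visit_")

v = DeclVisitor()
special = v.visit(Declaration())    # handled by the child's own method
fallback = v.visit(Expression())    # falls through to Visitor.visit
```

Without the default step, the second call would have no candidate in DeclVisitor's multisub and would fail, which is the fatal-error case described earlier.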

Okay, that's it for today. FMTYEWTKA MultiSubs. The short summary: multisubs are great, but they don't play well with inheritance. The longer summary: you have to do your inheritance chaining yourself. And if you need to know more, I recommend you do what I do - get on the #parrot IRC channel and pester PMichaud with questions. That dude knows just about everything.

Top versus Bottom

2009-10-02T06:11:00.000-07:00

Last time, I posted on multiple inheritance, and on "class"-ifying the Slam code in general. Sadly, that is a "rat-hole" that I've gone down, and other than "still working on it" there hasn't been much to post. :-(

So I thought I'd dig out this little gem of a topic that I've had sitting in reserve: the difference between top-down and bottom-up parsing, and what it means to you.

Top-down parsing

First, top-down parsing is what you're doing when you write code in PGE, or Perl6, or in ANTLR or if you just do it yourself. In a top-down parser, you describe structures in terms of their sub-components:

A program is a sequence of zero or more statements.
A statement is an expression followed by a semicolon.

These rules turn into code like:

sub program() {
    my @program := Array::empty();
    # "Zero or more" statements: test for end of input before each one.
    until eof() {
        @program.push(statement());
    }
    return @program;
}

sub statement() {
    my $statement := expression();
    match( ';' );
    return $statement;
}

This is a very intuitive way to write programs, and if you know what your language looks like in advance, it can be really, really fast to code. (If you're stuck wandering in a maze of features, like me, then maybe not so much.)
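For comparison, here is the same pair of rules as a small recursive-descent sketch in Python. The `Parser` class is hypothetical, and it assumes (purely to stay short) that an expression is a single token.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def eof(self):
        return self.pos >= len(self.tokens)

    def match(self, expected):
        if self.eof() or self.tokens[self.pos] != expected:
            raise SyntaxError(f"expected {expected!r} at token {self.pos}")
        self.pos += 1

    def expression(self):
        # Toy assumption: an expression is exactly one token.
        if self.eof():
            raise SyntaxError("expected an expression")
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def statement(self):
        # A statement is an expression followed by a semicolon.
        expr = self.expression()
        self.match(";")
        return expr

    def program(self):
        # A program is a sequence of zero or more statements.
        statements = []
        while not self.eof():
            statements.append(self.statement())
        return statements
```

Each rule becomes one method, and each method calls the methods for its sub-components, which is what makes the approach feel so direct.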

Bottom-up parsing
On the other hand there is something called bottom-up parsing. To get your head around this, first you need to understand that most parsing is broken into two separate phases, called "lexical analysis" (or "lexing") and "parsing." The distinction is that the lexing phase takes input characters and turns them into tokens. A token is an abstraction of the input characters, identifying what role it plays in the language. This can be a slippery concept to grasp: in written human languages, for example, we talk about "words." So is a word a token? Well, if you're trying to process the text, sure. But if you're trying to convert the words into a different tense, maybe you want to separate out prefixes and suffixes as tokens. Or if you're trying to understand the text, maybe you want the tokens to be parts of speech: noun, verb, adjective. Finally, if you're processing Thai, you're doomed - they don't put spaces between the words, so how will your lexer know where one starts and the next one stops?

At any rate, lexing isn't 100% necessary. It's a convenience that you will almost immediately re-invent for yourself if you don't have it - kind of like a "print" statement. But tokens are generally what parsers operate on, whether they have to create them or whether a lexer does it for them.

That said, bottom-up parsing defines a higher-level structure in terms of compositions of a sequence of low-level tokens. It's the reverse of top-down, and it can look almost exactly the same if you aren't careful:

A sequence of zero or more statements is a program.
An expression followed by a semicolon is a statement.

See the difference? Yeah, it's tough. And it's made tougher by the fact that the tool writers, who build automatic parser generators, generally all choose a similar syntax:

program: statement*
statement: expression SEMICOLON

Is that top-down or bottom-up? It's impossible to say, really. So you just have to know what kind of tool you're using, and adjust accordingly. But if you're using a bottom-up parser, you'll hear the terms "shift" and "reduce" a lot, so that usually helps. Here's why.

A bottom-up, or shift/reduce, parser uses a separate stack to keep track of the tokens it sees. So when you type in "y = m * x + b;" it receives a sequence of tokens like:

IDENT(y) EQUALS IDENT(m) STAR IDENT(x) PLUS IDENT(b) SEMI

By default, the parser will 'shift' these tokens onto its internal stack, in this order, so that the right-most token is at the "top" of the stack. As it shifts tokens onto the stack, it matches the top-most few tokens against its internal database of rules, to see if any rule has a pattern that the stack matches. If a "production" (a rule) matches what is on the stack, the matching tokens are taken off the stack and replaced with whatever the production makes.

In this example, maybe you have an expression rule that looks like:

term : IDENT ;

op : EQUALS | PLUS | STAR ;

expression : expression op term | term;

(Note that I am ignoring things like operator precedence. If you want to know more, Google is your friend.)

In our example, the parser will do the following:

Stack	Next token	Action?
<empty>	IDENT(y)	shift
IDENT(y)	EQUALS	reduce
term(y)	EQUALS	reduce
expression(y)	EQUALS	shift
expression(y), EQUALS	IDENT(m)	reduce
expression(y), op(=)	IDENT(m)	shift
expression(y), op(=), IDENT(m)	STAR	reduce
expression(y), op(=), term(m)	STAR	reduce
expression(y=m)	STAR	shift
expression(y=m), STAR	IDENT(x)	reduce
expression(y=m), op(*)	IDENT(x)	shift
expression(y=m), op(*), IDENT(x)	PLUS	reduce
expression(y=m), op(*), term(x)	PLUS	reduce
expression(y=m*x)	PLUS	shift
expression(y=m*x), PLUS	IDENT(b)	reduce
expression(y=m*x), op(+)	IDENT(b)	shift
expression(y=m*x), op(+), IDENT(b)	SEMI	reduce
expression(y=m*x), op(+), term(b)	SEMI	reduce
expression(y=m*x+b)	SEMI	shift
expression(y=m*x+b), SEMI	??	reduce
statement(y=m*x+b;)	??	??

Beware: the expression produced will not be what you think it should be without all that operator precedence stuff. Instead, you'll get an expression tree like (((y=m)*x)+b), wrong in nearly every language.
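The trace above can be reproduced with a very naive shift/reduce loop. This Python sketch is only an illustration under simplifying assumptions: it reduces greedily after every shift and never looks at the next token, which a real LR parser would do through its tables. The names `reduce_once`, `parse` and `TOKENS` are invented for the example.

```python
OPS = {"EQUALS": "=", "PLUS": "+", "STAR": "*"}

def reduce_once(stack):
    """Apply one production to the top of the stack; return True if one fired."""
    kinds = [kind for kind, _ in stack]
    if kinds[-1:] == ["IDENT"]:                       # term : IDENT
        stack[-1] = ("term", stack[-1][1])
    elif kinds[-1:] and kinds[-1] in OPS:             # op : EQUALS | PLUS | STAR
        stack[-1] = ("op", OPS[kinds[-1]])
    elif kinds[-3:] == ["expression", "op", "term"]:  # expression : expression op term
        (_, left), (_, op), (_, right) = stack[-3:]
        stack[-3:] = [("expression", f"({left}{op}{right})")]
    elif kinds[-1:] == ["term"]:                      # expression : term
        stack[-1] = ("expression", stack[-1][1])
    elif kinds[-2:] == ["expression", "SEMI"]:        # statement : expression SEMI
        stack[-2:] = [("statement", stack[-2][1] + ";")]
    else:
        return False
    return True

def parse(tokens):
    stack = []
    for token in tokens:
        stack.append(token)          # shift
        while reduce_once(stack):    # reduce for as long as some rule matches
            pass
    return stack

TOKENS = [("IDENT", "y"), ("EQUALS", "="), ("IDENT", "m"), ("STAR", "*"),
          ("IDENT", "x"), ("PLUS", "+"), ("IDENT", "b"), ("SEMI", ";")]
```

Running `parse(TOKENS)` leaves a single statement on the stack whose text is `(((y=m)*x)+b);`, the left-leaning tree warned about above.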

So what's the difference?

Well, one difference is that bottom-up parsers can recognize a broader set of languages than top-down parsers. Another difference is that bottom-up parser generators are generally easier to code than top-down generators. And the bottom-up parsers generally perform better than their top-down brethren, because they are using small, simple data structures (a big table of rules, and a small stack).

But more importantly, the difference is in the "legacy" of bottom-up versus top-down parsing. Technically, top-down parsers are called LL, and bottom-up parsers are called LR -- usually both terms are surrounded by other letters and numbers, but if you look hard you'll find an LL or LR in there somewhere. And the difference there -- L versus R -- is in 'leftmost' versus 'rightmost' derivation. Both kinds read their tokens from left to right. But a top-down parser has to decide which rule it is looking at from the left edge of a construct, before it has seen the rest, while a bottom-up parser decides only after the right edge has arrived on its stack.

One place you can see that difference is in C versus Perl 4 or 5 parsing. A variable declaration and function declaration in C look like:

int x = 1;
int foo() { return x; }

In perl, you get:

our $x = 1;
sub foo() { return $x; }

Notice that in C, the "differences" between the two are on the right, while in Perl they are on the left. Larry Wall is generally a pretty people-oriented person, so he may have done that to make reading declarations easier for coders. But I'd also bet that the first perl parser was a top-down parser, that consumed from the left.

At last!

Because this is the point: top-down parsers have different operating characteristics. And one of those differences is that they want to 'commit' early to pursuing a particular path. As a result, top-down parsers are going to excel at recognizing languages where the major decisions are encoded to the left. Declaring "sub foo" puts the fact that this is a function declaration right up front. A top-down parser doesn't have to ask a single question about what's going on: it knows from the very first token that this is a subroutine.

One of the more frequent questions on the ANTLR support list was about parsing C-like declarations, where the information that a function definition is taking place didn't show up until the very end. The only difference between an external declaration of a function and the function definition itself is that the final semicolon is replaced by a block:

int foo(void) ;
int foo(void) { return 1; }

How does that affect me?

If you are coding a top-down parser, you hate C syntax for this reason. Almost all of the grammar specifications for languages like C are built expecting a bottom-up parser. So blindly converting the grammar from one syntax to the other will produce a parser that occasionally does a lot of work to parse a declaration, then discards it all, then redoes the exact same work to parse a function definition. How much fun is that?

Getting around this requires changing the structure of the grammar definition. Instead of having two rules for declarations and function definitions, you want a single rule:

declaration: specifiers declarator ';'
definition: specifiers declarator block

becomes:

any_decl: specifiers declarator [ ';' | block ]

This avoids the parse-forget-reparse problem, at the expense of making your grammar actions heavier.
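A tiny Python sketch of the merged rule shows where the "heavier action" comes from: the shared prefix is consumed once, and the action has to branch on what follows. The function name is hypothetical, and the input is assumed to be pre-chewed into exactly three tokens, with a whole block arriving as one "{...}" token.

```python
def parse_any_decl(tokens):
    """any_decl: specifiers declarator [ ';' | block ]

    Toy assumptions: one specifier token, one declarator token, and a block
    that has already been lexed into a single "{...}" token.
    """
    specifiers, declarator, tail = tokens
    # The common prefix has been parsed exactly once; only now do we decide.
    if tail == ";":
        return ("declaration", specifiers, declarator)
    if tail.startswith("{"):
        return ("definition", specifiers, declarator, tail)
    raise SyntaxError(f"expected ';' or a block, got {tail!r}")
```

With two separate rules, a top-down parser would have walked `specifiers declarator` twice before discovering which case it was in.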

In turn, rewriting your rules like this means you tend to change your grammar. If your declarations and definitions are all the same rule, why not accept more than one?

int foo(void) { return 1;}, bar(void) { return 2; }

Anyway, that's enough for today.

Multiple Inheritance

2009-09-27T22:10:00.000-07:00

As an update to my previous post on subclassing PAST::Node classes, I ran into another problem. It turns out that the PAST -> POST compiler knows about, and depends on the differences between, the different Node subclasses.

For example, there are multi-methods that match on the types of their parameters, so that a PAST::Block gets processed by a different method than a PAST::Var. You get the idea.

So my original idea, of deriving a set of roughly parallel Block, Var, Val classes from a new root that was a child of PAST::Node won't work. Because while a Slam::Block may be a Slam::Node, which may be a PAST::Node, it turns out that it also has to be a PAST::Block for the other stuff to work.

So I'm stuck needing to "insert" my class behavior into several different places. I can't get between PAST::Block and PAST::Node, and I shudder to think about getting behind PAST::Node as a parent class.

Deriving from PAST::Block, Node, Var, Val, Stmts, Op, etc. seems like the only reasonable way to go, except that I would have to reimplement all the Slam::Node methods in each child, and would lose the convenience of common ancestry.

P6object to the Rescue!
But all is not lost, because among all the other cool things about Parrot is this: Parrot doesn't care about how many parents an object has. What's more, the P6object library supports adding parents willy-nilly.

So the solution, in this case, was pretty easy:

        my $base := $meta.new_class('Slam::Node');

        my $block := $meta.new_class('Slam::Block', :parent($base));
        $meta.add_parent($block, PAST::Block);

        my $test := Slam::Block.new();
        if $test.isa(PAST::Block) {
            say("I'm a blockhead, ma!");
        }
        if $test.isa(Slam::Node) {
            say("I'm a slammer, too.");
        }

Et voila! Multiple inheritance. Or, if you prefer, a "slatheron."
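For readers who don't have Parrot handy, the same shape looks like this in Python, which also allows more than one parent. The class names are stand-ins for PAST::Block, Slam::Node and Slam::Block, and the `merge` method is a made-up example of shared behavior.

```python
class PastNode:
    pass

class PastBlock(PastNode):
    pass

class SlamNode:
    def merge(self, other):
        # Shared behavior lives in one place, on the common Slam ancestor.
        return (self, other)

class SlamBlock(SlamNode, PastBlock):
    # Inherits Slam behavior and still passes every "is it a PastBlock?" check.
    pass

test = SlamBlock()
```

Both `isinstance(test, PastBlock)` and `isinstance(test, SlamNode)` hold, which mirrors the two `isa` checks above.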

Subclassing PAST::Node in NQP

2009-09-27T08:42:00.000-07:00

One of the problems with NQP is that it's not quite perl6. And that means if you do much development, you eventually run up against a corner of the language where the sidewalk just ends.

The support for objects is an example of this. There's no problem with defining methods, or defining a class name. There's no problem with creating a new instance of the class. But that's where it stops being easy. Because the syntax for extending another class is missing.

One thing I'd like to do is subclass the PAST::Node class(es) in my own code, so I can use method invocations to call functions in a different namespace. This would change code like:

close::Compiler::Type::merge_specifiers($node1, $node2)

into something like:

$node1.merge($node2);

which would make my fingers happy, if nothing else.

Automatically generated, but wrong
One problem is that when I declare a class in NQP, like:

class close::Compiler::Type;

NQP emits an initload block that creates a new class. The problem is that it creates the class with no parent. Whoops. So the first thing to do is to change from using a 'class' keyword to using 'module.'
Now I won't have any class definition at all, which is actually good. All I need now is to make my own initload that creates the class.

Make your own initload sub
The trick to rolling your own initload sub is not to do it -- you can't get there from here in NQP. What you can do, though, is take advantage of the fact that any code that is at package scope in your NQP source code is put into the initload sub for the class or module you are defining.

So, since package scope is an initload sub, my solution is to define a sub and call it from package scope:

_onload();

sub _onload() {
    say("Hello, from _onload");
}

Creating a class, the right way
Now that we know how to get a class created, it's time to dig into the PCT source code to see how it creates the classes. I looked in $parrot/compilers/pct/src/PAST/Node.pir, and found this:

    p6meta = new 'P6metaclass'
    base = p6meta.'new_class'('PAST::Node', 'parent'=>'PCT::Node')
    p6meta.'new_class'('PAST::Op', 'parent'=>base)

Well, that looks pretty simple! Let's convert that to NQP:

module PAST::Subclass {
   _onload();

   sub _onload() {
      my $meta := Q:PIR { %r = new 'P6metaclass' };
      $meta.new_class('PAST::Subclass', :parent('PAST::Node'));
   }
}

Note that I chose to inherit from PAST::Node, rather than PCT::Node. Once you know the trick, you can subclass just about anything.

The init problem
The next problem is that the P6metaclass system generates a really dumb 'new' method. So the easiest thing is to replace it. But if I replace it, how will I initialize the superclass data?

This is where you have to investigate on your own. In this case, I chose to inherit the PAST::Node version of 'new,' because it calls self.init(...), which I could override.

So I have an init method, but I need to call the PAST::Node::init method as well. That wouldn't be a problem, except that (1) there is no PAST::Node init method, it is inherited from PCT::Node; and (2) the PCT::Node init method wants its parameters flattened. It's more PIR code to the rescue, because NQP doesn't support flattening args:

sub init(*@children, *%attributes) {
    # do my own stuff ...
    # ... then call
     Q:PIR {
        .local pmc children, attributes
        children = find_lex '@children'
        attributes = find_lex '%attributes'
        $P0 = get_hll_global [ 'PCT' ; 'Node' ], 'init'
        self.$P0(children :flat, attributes :named :flat)
    };

    return self;
}

And now I've got a subclass. Just to bundle it into one big copy/pasteable bunch, here you go:

module PAST::Subclass {
    _onload();

   sub _onload() {
      my $meta := Q:PIR { %r = new 'P6metaclass' };
      $meta.new_class('PAST::Subclass', :parent('PAST::Node'));
   }

   sub init(*@children, *%attributes) {
      # do my own stuff ...
      # ... then call
      Q:PIR {
         .local pmc children, attributes
         children = find_lex '@children'
         attributes = find_lex '%attributes'
         $P0 = get_hll_global [ 'PCT' ; 'Node' ], 'init'
         self.$P0(children :flat, attributes :named :flat)
      };

      return self;
   }

}
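The init chaining is the part most likely to look odd, so here is the same idea in Python, where the "flattening" that needed PIR above is just `*` and `**`. `PctNode` and `PastSubclass` are illustrative stand-ins, not the real Parrot classes.

```python
class PctNode:
    def __init__(self, *children, **attributes):
        self.children = list(children)
        self.attributes = dict(attributes)

class PastSubclass(PctNode):
    def __init__(self, *children, **attributes):
        self.merged = False                          # "do my own stuff ..."
        # "... then call" the parent init, re-flattening the slurped
        # positional and named arguments.
        super().__init__(*children, **attributes)

node = PastSubclass(1, 2, name="x")
```

After construction, `node.children` is `[1, 2]` and `node.attributes` is `{"name": "x"}`, so the parent saw the arguments exactly as the caller passed them.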

The Whitespace Hack

2009-09-26T07:42:00.000-07:00

Every culture has certain rites of passage that they expect members to pass through before they are fully accepted. In some places, they're pretty extreme -- taking knives to your naughty bits, and the like. Fortunately, the Parrot community is a little more laid back than that. There are two rites of passage for new members to participate in. The second one, which I haven't done yet, is to code a new garbage collector. Apparently, when you've coded a new GC, you get to do

svn commit -m "Today, I am a Parrot."

and you're totally in. I haven't reached that far, but I'm here to tell you that the first rite of Parrot-hood is pretty easy: filling out a ticket on http://trac.parrot.org.

You have to register, obviously, and a lot of the tickets get closed out with a comment like "You're doing it wrong." or "That's how it's supposed to work. Read the docs." But if you keep trying, eventually you'll get a real ticket in. And then comes the best part -- they fix stuff! There are few things in life more satisfying than watching people work your tickets on #parrot, while chortling "Dance, puppets! Dance while I pull your strings with my ticket submissions!"

Actually, it turns out there are some things more satisfying than that. But submitting tickets is definitely a good way to get into the community. And the earlier you start, the more bugs there will be, so the easier they are to find. It's like a Multi-Level Marketing program -- the earlier you join, the easier it is, and the more you gain. In the case of Parrot, mostly you're gaining karma on #parrot. But everybody starts somewhere.

Anyway...
I submitted a ticket (TT#1065) today about a problem with PIR that is inserted in-line in PGE rules. The ticket isn't very interesting, but it lets me segue into talking about the Whitespace Hack.

token ws {
    | <?{{
        $P0 = get_global '$!ws'
        if null $P0 goto noshort

        $P1 = $P0.'to'()
        $P2 = match.'to'()

        if $P1 != $P2 goto noshort
        .return(1)

    noshort:
        set_global '$!ws', match
        .return(0)
        }}>
    | <!ww> <.WS_ALL>*
}

Give that a read-through, and see if it's obvious what is going on.

It's sure not obvious to me. Even when I went back and read it later, I totally missed some stuff. Pmichaud is a pretty smart guy, and he's solving an important problem, but even when you know what he's doing, it's easy to miss stuff.

The <.ws> rule

The first thing to be aware of is that the ws rule has a special place in Perl6 grammars. If you specify a rule (as opposed to a token or a regex), or if you add some modifiers (:sigspace, I think, but you'd do well to check the specs), then the rule automatically matches optional whitespace everywhere you have white space in your grammar.

rule my_rule {
my dog has fleas
}

This rule matches a bunch of literal characters - because all "normal" characters match themselves unless you do something to them - and it also optionally matches whitespace in between some of them.

The way this works is that the regex above is modified by adding calls to a whitespace-matching rule -- you guessed it, <.ws>. The effect is the same as if you wrote:

rule my_rule {
<.ws>my<.ws>dog<.ws>has<.ws>fleas<.ws>
}

(Just so you know, putting that period in front means that the results of those matches don't get collected. It's "match ws and don't save the contents," which is backwards from old-style regexes, where you had to put parens around all the groups you wanted to keep.)
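The rewriting step can be imitated with ordinary regexes. This Python sketch is a crude stand-in: it uses `\s*` in place of a real ws rule, so it has no word-boundary check and knows nothing about comments. `rule_to_regex` is a hypothetical helper.

```python
import re

def rule_to_regex(rule_text):
    """Turn 'my dog has fleas' into a pattern that allows optional
    whitespace wherever the rule text had whitespace."""
    ws = r"\s*"                                   # crude stand-in for <.ws>
    words = [re.escape(word) for word in rule_text.split()]
    return re.compile(ws + ws.join(words) + ws)

pattern = rule_to_regex("my dog has fleas")
```

`pattern.fullmatch("my dog   has\tfleas\n")` succeeds. So does `"mydoghasfleas"`, because nothing in this sketch refuses to match in the middle of a word.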

One important part of "parsing" -- as opposed to "regexing," I guess -- is that lots of stuff in the input text gets treated as whitespace. For example, a fairly common approach to computer language parsing is to treat the comments as whitespace, so they can be ignored no matter where they appear. (Some old C hacks relied on a quirky implementation of this in the preprocessor to do "token pasting." Since comments were whitespace, a token followed by a comment ended the token. But since comments were replaced by 0 characters, a C comment could "glue" two tokens together with no spaces between them.)

So we've got two different imperatives for the ws rule: it will get called a lot because of the automatic whitespace recognition feature; and it needs to potentially recognize a lot of different stuff, like "real" whitespace, comments, heredocs, and whatever else strikes the coder's fancy. Here's the WS_ALL rule I use for Close:

token WS_ALL {
   [ \h+                # WS
   | \n [ {*} #= start_heredoc
          [
           <?{{    $P0 = get_hll_global [ 'close' ; 'Grammar' ; 'Actions' ], '$Heredocs_open'
            $I0 = $P0
                  .return($I0)
             }}>
             [ $=[ \h* \h* [ \n | $ ] ] {*} #= check_for_end
             || $=[ \N* \n ]
             ]
         ]*
       #{*} #= finish_heredoc
    ]
   | '/*' .*? '*/'            # C_BLOCK_COMMENT
   | '//' \N* [ \n | $ ]        # C_LINE_COMMENT
   | <.POD>
   ]
}

Matching nothing
The key fact to remember about the ws rule is that whitespace is optional. The ws rule is being inserted because you, the developer, have whitespace in your rule. Maybe that's because you expect whitespace, or maybe it's because you want to separate two parts of a complicated rule to make things readable. So ultimately, ws absolutely must accept a zero-length pattern.

The token ws regex at the top is broken into two alternative paths. The first path is a check for a single-entry cache. I'll come back to the caching theory later -- it's one of those things I missed the first time through.

The second rule is where most of the "work" of whitespace-ignoring is going to be done. Let's look at that in some detail:

| <!ww> <.WS_ALL>*

It seems pretty simple. The first part of the rule calls another rule, ww, that is also built-in to the PGE-generated grammar. The ww rule is a zero-width condition test, and it returns true if the current position is between two 'word' characters. That is, if the previous character matches \w and if the next character matches \w, then <ww> is true.

token ww {
<?after \w>
<?before \w>
}

Of course, ww uses some magic to do its testing, so it looks nothing like that! But it could look like that, if it were slow.

So what does <!ww> do? Well, it returns true if <ww> is false. <!...> is the logical negation syntax in Perl6 grammars. If neither character is a word character, then we're in the middle of some spaces or line noise. If both characters are word characters, then we're in the middle of a word. And if one character is a word character while the other isn't, we're looking at either the start (\W\w) or the end (\w\W) of a word.

Using <!ww> matches three of those four cases: it matches the case where we are in the middle of spaces or line noise, the start of a word, and the end of a word. Most users won't care about the start-of-word case -- not many modern languages use words to introduce comments. So <!ww> is a speed optimization, and a filter that eliminates middle-of-the-word invocations of the ws rule.

Only when checking for "space" might be relevant -- when <!ww> approves -- does the work of actually checking for whatever it is that Close calls "whitespace" begin.
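The four cases are easy to check by hand with a slow version of the test, written here in Python. `ww` and `is_word_char` are illustrative helpers and make no claim about how PGE implements the built-in.

```python
import re

def is_word_char(ch):
    return re.fullmatch(r"\w", ch) is not None

def ww(text, pos):
    """True when `pos` sits between two word characters."""
    return (0 < pos < len(text)
            and is_word_char(text[pos - 1])
            and is_word_char(text[pos]))

sample = "ab  cd"
```

In `sample`, position 1 is mid-word, so `ww` is true and ws would be skipped there. Positions 2 (end of a word), 3 (between spaces) and 4 (start of a word) all make `ww` false, so the negated test lets the whitespace check run.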

Caching is a win
One way to speed up any repetitive task is to cache the results. This is a standard dogma for most computer people, but does it apply to whitespace? Especially, does it apply when you're scanning forward through a string doing pattern matching?

It turns out that it does. Take a look at this code from the Close grammar.

rule declaration {
    <specifier_list> <declarator_part> [ ',' <declarator_part> ]* {*}
}

rule declarator {
    <dclr_pointer>* <dclr_atom> <dclr_postfix>*     {*}
}

rule declarator_part {
    <declarator>
    <dclr_alias>?
    # ...

The entry point for these rules is declaration, which will match a specifier list, then check for whitespace, then call declarator_part.

When declarator_part is called, it will check for whitespace, then call declarator.

When declarator is called, it first will check for whitespace, then call dclr_pointer.

Can you see where this is going? When an outer rule calls declaration, the first thing that happens is a whitespace check. The second, third, and so on, things that happen are whitespace checks. Assuming there is some whitespace to consume, the very first invocation of the ws rule is going to gobble it all up. After that, each subsequent call to the ws rule is going to do a lot of extensive testing for the various things it might match, before matching nothing.

So we come to the first alternate branch of the ws rule:

    |<?{{
        $P0 = get_global '$!ws'
        if null $P0 goto noshort

        $P1 = $P0.'to'()
        $P2 = match.'to'()

        if $P1 != $P2 goto noshort
        .return(1)

    noshort:
        set_global '$!ws', match
        .return(0)
     }}>

When a subrule is invoked, even a subrule that is coded in PIR in-line, rather than being written using Perl6 regex syntax, there are two properties that already have a meaningful value. The match.to() property is the end of the current match, while the match.from() property is the beginning. The results are offsets from the start of the text -- numbers, in other words.

When a new match is being set up, the start and end are going to point to the same place. And for zero-length matches, like the ww rule, they are always pointing at the same place. So what this code does is not use the start of the match, but rather the end. Because the first time a whitespace rule runs, it might actually match some whitespace, and so .to() and .from() will be different.

But the next time the ws rule gets called in the same location, it will be called with .to() equal to .from(), and both of them will be pointing at the same place where that first ws match ended. So this cache ignores the start of the match, and focuses on comparing the end of the match with the endpoint of the cached result. It essentially checks if the very last whitespace match ended at the current position. If so, then the next call to the same function will also decide to stop here. So don't bother making the call.
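Stripped of the PIR, the cache is just "remember where the last whitespace match ended." Here is a minimal Python sketch of that idea. It stores the end offset directly rather than a match object, and it uses `\s*` as a stand-in for the real whitespace rule. The `WsMatcher` class and its `full_scans` counter are invented for the example.

```python
import re

WS = re.compile(r"\s*")

class WsMatcher:
    def __init__(self):
        self.last_end = None   # single-entry cache: where the last ws match ended
        self.full_scans = 0    # how many times the expensive path ran

    def ws(self, text, pos):
        """Return the position after optional whitespace starting at `pos`."""
        if self.last_end == pos:
            return pos                      # cache hit: nothing left to consume here
        self.full_scans += 1
        end = WS.match(text, pos).end()     # the "real" whitespace work
        self.last_end = end
        return end

matcher = WsMatcher()
first = matcher.ws("int   x", 3)     # consumes the three spaces
second = matcher.ws("int   x", first)
third = matcher.ws("int   x", first)
```

Only the first call does a real scan. The second and third calls start exactly where the cached match ended, so they return immediately.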

When caching doesn't work
There are two problems with the whitespace hack that you probably can't see. They are related, and they spring from an earlier post, in which I mentioned that Close calls itself to handle #include processing, as well as some internal voodoo.

The first problem is that the PGE compiler outputs the inline PIR with no namespace. That's the ticket I just submitted, and while it isn't fatal, it did cause me to spend some time tail-chasing before I figured out where the $!ws variable was winding up.

The second problem is that when there is no whitespace at the beginning of a translation unit -- as happens when the first line is #include -- the whitespace hack caches a match entry with an ending offset of 0. And when the included file starts, it checks for a cache entry, finds one (after all, it's a global), and requires that the beginning of the file be something other than whitespace. Whoops!

The solution, for me, was to implement a "whitespace hack stack" to go along with my #include'd filename stack. So when I change files, I also save and restore the cached $!ws variables. Technically, I could just blow away the contents, but since I had to write all the code to reach into the variable and set it to something, I decided to save and restore it.
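The save-and-restore idea amounts to a stack that parallels the include stack. A rough Python sketch follows; the `IncludeStack` class and its method names are hypothetical and are not the Close code.

```python
class IncludeStack:
    """Keep one cached whitespace position per open file."""

    def __init__(self):
        self.ws_cache = None      # plays the role of the global $!ws
        self._saved = []

    def push_file(self):
        # Entering an included file: park the outer cache, start clean.
        self._saved.append(self.ws_cache)
        self.ws_cache = None

    def pop_file(self):
        # Leaving the included file: the outer file's cache comes back.
        self.ws_cache = self._saved.pop()

stack = IncludeStack()
stack.ws_cache = 0            # outer file cached a match ending at offset 0
stack.push_file()
inner_start = stack.ws_cache  # the included file sees no stale entry
stack.ws_cache = 12
stack.pop_file()
```

Because the included file starts with an empty cache, a stale "ended at offset 0" entry from the outer file can no longer suppress whitespace matching at the top of the new file.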

Recursive Compiling

2009-09-25T10:42:00.000-07:00

Years ago, back when GCC was still on version 1.x, I went looking through the code. I don't remember why, but I was looking for a way to "just make a function." I had some notion that down in the bowels of the code there would be a function called "create_function" or some such, and I could pass it a return type and a name and who-knows-what-else, and it would return a function object. Or maybe it would automatically emit the function object. I don't remember what I was thinking at the time, I just remember being frustrated by the nest of snakes I uncovered, each function requiring a huge initialized data structure as input, making a small update to the tree, and returning.

The other day I was working on Close, and I needed to create a function for the namespace init code. And I thought, "what I need is a function that will just create this one thing for me." Deja vu!

So I wrote that function. And with the caveat that I knew exactly what the return type was going to be (void), and what the parameters were (none), and what the content would be (none), it still took more than 40 lines of NQP code to produce the result I wanted.

Why? Well, the compiler is built around the assumption that it is parsing input, so each function requires a node, operates on that node and maybe a few more, and so forth. So my "create a function" function had to create a declarator, then create a type specifier, then look up the type specifier in the symbol table, then link the declarator and specifier with a symbol, then convert the resulting function declaration into a function definition, et cetera, et cetera.

Eventually, the created function would be added to the namespace as the very first function in the namespace. And even later, every single one of those functions would be automatically discarded by the code marshaller because they had no code inside.

Wouldn't it be nice if ...
I spent some time on IRC, watching people who are way smarter than me make progress on Parrot, and I happened to mention to WhiteKnight that I wished I could just put the declaration of my function into a string and invoke the parser on the string. It was kind of a joke, and WhiteKnight cautioned me that misery and evil lay down that path, and that I should avoid it.

Well, as mentioned, I sat down and grunted out the fortysomething lines of NQP code that it took to create the namespace init subs. And boy, did that suck. (A whole bunch of functions, every single one of which is being used in a way that was not intended. What could go wrong?)

Today I changed my mind.

It turns out that the way Close processes #include directives is by reading in the contents of the file to a String, then passing that String to the compiler's 'compile' method with a note saying "don't do more than PAST generation."

Here are two functions from the Close include-file module that illustrate how it works. Ultimately, it's parse_text that does the work. And that function is a whopping 15 lines, most of which is scaffolding. Thanks, PCT, for making life easy.

sub parse_text($text) {
    my $result := Q:PIR {
        .local pmc parser
        parser = compreg 'close'

        .local string source
        $P0 = find_lex '$text'
        source = $P0

        %r = parser.'compile'(source, 'target' => 'past')
    };

    DUMP($result);
    return $result;
}

sub parse_include_file($node) {
    get_file_contents($node);
    my $contents := $node<contents>;

    if Scalar::defined($contents) {
        push($node.name());
        close::Compiler::Scopes::push($node);

        # Don't worry about the result of this. The grammar
        # updates the current scope, etc.
        parse_text($contents);

        close::Compiler::Scopes::pop('include_file');
        pop_include_file();
    }

    return $node;
}

It turns out that invoking the compiler recursively is the right thing to do. The code I had to write for include files, and the code I had to refactor to provide a callable interface for special-purpose internal naughtiness, is nowhere near as brittle as the code for creating a function directly. To say nothing of being way more maintainable. The include file code looks like "open a file, slurp the file, call the parser," which is nicely understandable. The create-a-function-from-scratch code looked like a horrible mess, especially in the light of a new day.

Working the kinks out of this code encouraged me to find a bunch of other places in the code where I was building data structures "by hand" to see if I could replace them with recursive parsing. So far, so good.
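The shape of the idea, minus everything that makes a real compiler hard, fits in a few lines of Python. This toy "compiler" just collects statement lines. `compile_text` and the `files` dictionary are made up for illustration, and the include syntax is assumed to be `#include <name>`.

```python
def compile_text(text, files):
    """Return the list of statements in `text`, expanding '#include <name>'
    lines by calling the same function on the included source."""
    tree = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#include"):
            name = line.split("<", 1)[1].rstrip(">")
            tree.extend(compile_text(files[name], files))   # recursive call
        else:
            tree.append(line)
    return tree

files = {"std/io": "void say();"}
result = compile_text("#include <std/io>\nint x = 10;", files)
```

The same entry point serves for "internal" source too: handing it a string such as `"void _nsp_init() { }"` takes the place of building the declaration node by node.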

Output ordering and Initializers

2009-09-22T21:37:00.000-07:00

Output ordering

Today I got Close to emit functions in the same sequence they occur in the input file. That was easy.

What wasn't easy was automatically creating a namespace init function to handle initialized declarators.

If you compile code that looks like this:

int x = 10;

There's no function involved. Clearly, this is a global variable 'x', and its initial value is 10. But Parrot does not use the same model that *nix uses, of loading a file that contains a partial memory image. Since there's no data segment, any and all initialization has to be done by code.

That's where the namespace init functions come in. A namespace may contain variables and functions that depend on the variables. For the case of simple initialization - like the x = 10 case above - I'm willing to take care of that for you. If you want a more complicated initialization, where some data depends on the initialization of other data, well, you can write that code yourself. (Seriously - that's what the ':init' adverb is for.)

Executive decision

One decision I had to make was the order of declaration of the _nsp_init functions. Should they be declared (and, by the rules of Parrot, be executed) after other functions, or before them? My decision was that _nsp_init functions will be "declared" the first time the namespace appears. Any initializers in the namespace will (eventually - this doesn't work yet) get appended to the _nsp_init function.

This means that initializers cannot depend on user-specified :init functions having run. If you need that, then you will have to perform your own initialization.

Results

The result of this is all but invisible to you. But it means that this code:

namespace A::B {
    int x = 10;

    void my_init() :init {
        say(y);
        y = 20;
    }

    int y = 100;

    void main() :main {
        say(y);
    }
}

Will produce these subs, in this order:

:: _nsp_init // empty
:: A :: _nsp_init // empty
:: A :: B :: _nsp_init
    x = 10
    y = 100
:: A :: B :: my_init
:: A :: B :: main

Of course, the compiler silently deletes the empty _nsp_init subs, so they never appear in the output. But if something later added content to them, they would be emitted in that order.

The output of running the code above would be:

100
20

That's because the _nsp_init function is emitted (and so runs) before the my_init function.
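To make the ordering rule concrete, here is a small Python model of it. This is a sketch under assumptions, not the compiler's implementation: `emit_order` and `run` are hypothetical names, variables live in one flat dictionary rather than in real namespaces, and the model assumes that _nsp_init and :init subs run in emitted order before the :main sub.

```python
def emit_order(declarations):
    """declarations: (namespace, kind, name, payload) tuples in source order.
    kind is 'var' (payload = initial value), 'init' or 'main' (payload = callable)."""
    subs = {}  # (namespace, sub name) -> (kind, body); dicts keep insertion order
    for namespace, kind, name, payload in declarations:
        # Declare _nsp_init for the namespace and each enclosing one,
        # outermost first, the first time it appears.
        for depth in range(len(namespace) + 1):
            subs.setdefault((namespace[:depth], "_nsp_init"), ("init", []))
        if kind == "var":
            # Initializers are appended to the namespace's _nsp_init.
            subs[(namespace, "_nsp_init")][1].append((name, payload))
        else:
            subs[(namespace, name)] = (kind, payload)
    # Empty _nsp_init subs are silently dropped.
    return [(key, sub) for key, sub in subs.items()
            if not (key[1] == "_nsp_init" and not sub[1])]

def run(emitted):
    """Run init-time subs in emitted order, then the main sub(s)."""
    env, out, mains = {}, [], []
    for (namespace, name), (kind, body) in emitted:
        if name == "_nsp_init":
            for var, value in body:
                env[var] = value
        elif kind == "init":
            body(env, out)
        elif kind == "main":
            mains.append(body)
    for body in mains:
        body(env, out)
    return out

def my_init(env, out):
    out.append(env["y"])  # say(y)
    env["y"] = 20

def main(env, out):
    out.append(env["y"])  # say(y)

ns = ("A", "B")
emitted = emit_order([
    (ns, "var", "x", 10),
    (ns, "init", "my_init", my_init),
    (ns, "var", "y", 100),
    (ns, "main", "main", main),
])
print([name for (_, name), _ in emitted])  # ['_nsp_init', 'my_init', 'main']
print(run(emitted))                        # [100, 20]
```

Because y = 100 is folded into the _nsp_init sub that was declared when the namespace first appeared, my_init sees 100 even though the declaration of y follows it in the source.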

Referencing symbols in another namespace

2009-09-22T04:18:00.000-07:00

Today, I got this code to DWIW:

#include <std/io>
#include <std/test>

namespace test {
    namespace hll: close :: test :: nested {
        namespace deeply {
            void goodbye() {
                say("Adios, amigo.");
            }
        }
    }

    void test() :main {
        using namespace ::test::nested::deeply;
        say("namespace-function-say");
        hello();
    }

    namespace nested::deeply {
        void hello() {
            say("Hello, world");
        }
    }

    namespace X {
        void xray() {
            say("Specs!");
        }

        namespace ::test::nested::deeply {
            void say(string what) {
                asm(what) {{ say %0 }};
            }
        }
    }
}

What I want, in this case, is for all of the various invocations of 'say', except the one in xray(), to refer to the say function declared down at the bottom.
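The lookup I'm after can be modeled in a few lines of Python. This is only a sketch: `resolve` is a hypothetical name, and the search order (the current namespace first, then any namespaces named in a using directive, with no search of enclosing namespaces) is an assumption chosen to match this one example, not a statement of Close's actual rules.

```python
def resolve(name, current, using, symbols):
    """Return the namespace that defines `name`, or None if it is unresolved."""
    for namespace in [current, *using]:
        if name in symbols.get(namespace, ()):
            return namespace
    return None

deeply = ("test", "nested", "deeply")

# Symbol table for the example: all three spellings of the nested
# namespace refer to the same place.
symbols = {
    ("test",): {"test"},
    deeply: {"goodbye", "hello", "say"},
    ("test", "X"): {"xray"},
}

print(resolve("say", deeply, [], symbols))          # found in its own namespace
print(resolve("say", ("test",), [deeply], symbols)) # found via 'using namespace'
print(resolve("say", ("test", "X"), [], symbols))   # None: xray() has no route to it
```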

There's a bunch of stuff I did wrong that I'd love to talk about, but it's late and I'm tired. So I'll say that compiler writers have things a lot easier than I thought - it's the guy writing the specs that does most of the heavy lifting. In this case, I'm writing a compiler with no spec, so every detail is a learning experience. (And let's face it: I'm old, and learning is painful.)

Anyway, here's the output:

.namespace []
.sub "anon" :subid("post15")
.end

.HLL "close"

.namespace ["test";"nested";"deeply"]
.sub "say" :subid("10_1253616492")
    .param pmc param_12
    .lex "what", param_12
    get_global $P13, "what"
say $P13
    .return ()
.end

.HLL "close"

.namespace ["test";"nested";"deeply"]
.sub "hello" :subid("11_1253616492")
    get_hll_global $P15, ["test";"nested";"deeply"], "say"
    $P16 = $P15("Hello, world")
    .return ($P16)
.end

.HLL "close"

.namespace ["test"]
.sub "test" :main :subid("12_1253616492")
    get_hll_global $P18, ["test";"nested";"deeply"], "say"
    $P18("namespace-function-say")
    get_hll_global $P19, ["test";"nested";"deeply"], "hello"
    $P20 = $P19()
    .return ($P20)
.end

.HLL "close"

.namespace ["test";"nested";"deeply"]
.sub "goodbye" :subid("13_1253616492")
    get_hll_global $P22, ["test";"nested";"deeply"], "say"
    $P23 = $P22("Adios, amigo.")
    .return ($P23)
.end

.HLL "close"

.namespace ["test";"X"]
.sub "xray" :subid("14_1253616492")
    get_global $P25, "say"
    $P26 = $P25("Specs!")
    .return ($P26)
.end

It's worth noting that I don't stop on an error - I keep trying to generate stuff. This will probably not be true in production - any error will cause an exit 1, etc. But for now, I need to see what's going on. Also, the order of generation is presently determined by the traversal order of some internal trees I've built. My next step will be to generate functions in the same order they appear in the source code. (With the caveat that variable initializers will get shuffled together.)

Howdy

2009-09-22T03:56:00.000-07:00

Everyone else is doing it, so why can't we?

In this case, I'll be blogging about Close - a programming language on and for the Parrot VM.