Writing simple parsers using Perl regular expressions

From AJS.COM

Jump to: navigation, search

Perl regular expressions are very powerful, and capable of performing most of the simple tasks that a traditional parser (such as yacc or bison) would be used for. This article will go over some of the modern techniques for using Perl in this way. We'll focus on an example from the real world, a log file written by a video game (World of Warcraft) that has a complex format (it's actually code in a scripting language called Lua), and must be parsed in order to extract data about the price of in-game items.

Contents

Extended regular expressions

Perl regular expressions like this one are the most familiar to the typical Perl programmer:

/a.*b/

However, as these expressions become increasingly complex, such regular expressions become unreadable and unmaintainable:

/\[([^\]]*)\]\[([^\]]*)\]/

The use of Perl's extended regular expressions can improve this dramatically:

m{
   \[            # match start of first bracketed value
       ([^\]]*)  # First value captured
   \]            # end of first bracketed value
   \[
       ([^\]]*)  # Second value
   \]
}x

These two expressions match the exact same data and save the exact same values to $1 and $2, but the second one can be read and maintained by another programmer with far more ease. When we start writing very long regular expressions that behave more like parsers, we really need this kind of cleanliness.

Building subexpressions

Because our data will have some repeated patterns, it's best to build them up as pre-parsed subexpressions. This allows us to insert them in ways that we can easily maintain later. For example, a Lua string or numeric value will be the same just about anywhere, so we can write two subexpressions to match those:

our $lua_str = qr{nil() | \\"([^"]*)\\"}x;
our $lua_num = qr{nil() | (-?\d+)}x;

The use of qr{...} creates a pre-compiled regular expression that we can store in a variable and use later.

Notice the use of the empty capturing parens. We'll get to that later, but for now just be aware of the fact that the string "nil" will match an empty group after it.

Sample data

Ok, so we have the basics, now we have to look at our data:

{\"|cff1eff00|Hitem:7461:0:0:0:0:0:189:-1095774430|h[
   Knight's Bracers of Strength]|h|r\",36,\"Armor\",\"Mail\",
   9,7772,4,1195049431,\"Knight's Bracers of Strength\",
   \"Interface\\\\Icons\\\\INV_Bracer_07\",1,2,nil,31,7772,
   0,9716,0,nil,\"Jania\",3,3092,7461,189,0,0,-1095774430,}

I've inserted some newlines here to make it format better, but the original is all one large string. I also happen to have the Lua code that wrote this, so I have names for all of these fields. That will make my job a bit easier.

Target format and numbered subexpression matches

I've decided that I'm going to store all of these values into a hash, and at the end of the match, I'll store the hash into a database as a single row. Now, I could do something like this:

 m{$lua_str,$lua_num,$lua_str};
 $item{name} = $2;
 $item{price} = $4;
 $item{time} = $6;

but this has several drawbacks. Even properly commented, it would be awful if the format changed slightly and I had to go back over all of those values, figuring out how the re-numbering worked after inserting a new value. Ultimately, this is going to result in many bugs that are hard to find. What's more, it's going to separate the setting of the hash element from the part of the regular expression that matched it, and that again leads to confusion and bugs in a large regular expression.

Fortunately, Perl 5.8 and later (5.8.8 is current as of this document's creation) provides a handy way to do both at once:

m{
 $lua_str,   (?{ $item{name} = $^N })
 $lua_num,   (?{ $item{price} = $^N })
 $lua_str    (?{ $item{time} = $^N })
}x

The funky (?{ ... }) section is just Perl code that gets executed whenever the regular expression hits that point. In order to use matched values here, you do have to add this to your program:

use re 'eval';

This tells Perl that you're aware that using matched data in evaluated code can be a security risk. Just be very careful to match exactly the data you wanted and never pass untrusted user data to other programs, languages or evals unless you've verified that it does not contain malicious code (e.g. don't run a command with a matched string unless you verify that it doesn't contain shell metacharacters). In this case, we're just storing the values, so we're good to go.

One last point about the above. Notice the use of $^N. This variable is created on-the-fly by Perl every time a subexpression in parens is matched. So, it will either match our parenthesized lua value or that empty paren group that came after "nil". Remember that? Now you can see why it's there. If a value is set to nil, then our Perl variable will contain the empty string, which we can later replace with undef. If we didn't have that empty group, then $^N would accidentally pick up the previous match, and that's not what we want.

The full code

m{
  \{
    \\\"(\|\w+\|Hitem:
     $lua_num:           (?{ $item{hid} = $^N })
     $lua_num:           # unused parts of the key
     $lua_num:           #
     $lua_num:           #
     $lua_num:           #
     $lua_num:           #
     $lua_num:           #
     $lua_num\|          #
     h\[([^\]]*)\]\|     (?{ $item{lname} = $^N })
    \w+\|\w+)\\\",      (?{ $item{itemhash} = $^N })
    $lua_num,           (?{ $item{ilevel} = $^N })
    $lua_str,           (?{ $item{itype} = $^N })
    $lua_str,           (?{ $item{isub} = $^N })
    $lua_num,           (?{ $item{isequip} = $^N })
    $lua_num,           (?{ $item{price} = $^N })
    $lua_num,           (?{ $item{tleft} = $^N })
    $lua_num,           (?{ $item{time} = $^N })
    $lua_str,           (?{ $item{name} = $^N })
    $lua_str,           (?{ $item{texture} = $^N })
    $lua_num,           (?{ $item{count} = $^N })
    $lua_num,           (?{ $item{quality} = $^N })
    $lua_num,           (?{ $item{canuse} = $^N })
    $lua_num,           (?{ $item{ulevel} = $^N })
    $lua_num,           (?{ $item{minbid} = $^N })
    $lua_num,           (?{ $item{mininc} = $^N })
    $lua_num,           (?{ $item{buyout} = $^N })
    $lua_num,           (?{ $item{curbid} = $^N })
    $lua_num,           (?{ $item{amhigh} = $^N })
    $lua_str,           (?{ $item{seller} = $^N })
    $lua_num,           (?{ $item{flag} = $^N })
    $lua_num,           (?{ $item{id} = $^N })
    $lua_num,           (?{ $item{itemid} = $^N })
    $lua_num,           (?{ $item{suffix} = $^N })
    $lua_num,           (?{ $item{factor} = $^N })
    $lua_num,           (?{ $item{enchant} = $^N })
    $lua_num,?          (?{ $item{seed} = $^N })
\}   }x;

Now I just set all empty strings to undef:

foreach my $key (%item) {
  $item{$key} = undef if $item{$key} eq "";
}

And I have a Perl datastructure to match the Lua one that was on disk. This regular expression runs very fast, and is quite easy to update if changes to the format are introduced later.

Performance

While the above runs fast relative to the need, there are some places where you end up having to parse an order of magnitude more records than a simple video game log file is likely to produce. I do this at work, in fact, and there, I don't have the luxury of being able to evaluate Perl code every time I match a value. I need to keep going. What do you do then? Well, one way to solve this is to separate out the names of the fields in one place, like an array that comes first. First you have to take out the empty subexpression in the interpolated regexes:


our $lua_str = qr{nil | \\"([^"]*)\\"}x;
our $lua_num = qr{nil | (\d+)}x;

Because we'll be referencing these positionally, we will get a truly undefined value when nil appears.

Now we list all of the field names, in order of where they start (that is, where their open-parens appear):

my @fields = qw(itemhash hid ignored ignored ignored ignored
                ignored ignored ignored
                lname ilevel itype isub isequip price
                tleft time name texture count quality
                canuse ulevel minbid mininc buyout
                curbid amhigh seller flag id itemid
                suffix factor enchant seed);

Replace all evaluated Perl code with comments like this:

    $lua_num,           # suffix
    $lua_num,           # factor
    $lua_num,           # enchant
    $lua_num,?          # seed

Now, after matching a record, you just populate your hash all at once:

my %item;
@item{@fields} = m{$parser_regex}x;

Where $parser_rexex is the text of the regular expression that you've removed the code from.

Warning: mixing list context assignment like the assignment above with the /g modifier to a regular expression almost certainly won't do what you expect, as /g has special behavior in a list context. Sadly, there's no reasonable way to mix those two constructs in Perl. If that didn't make sense to you, then just make sure you don't use the /g modifier on a regular expression that you're matching and then assigning the results to a list.

This executes much faster, but you have two places that you have to maintain the ordering of the matched fields. That's certainly better than having to directly reference the numbering of the fields, but it's still a bit of an annoying overhead in terms of maintainability. Which technique is better? It really depends on your application. If you're parsing millions of records, you probably want to go with this solution. If you just have thousands, the more maintainable solution is probably best.

Personal tools