Sand: Rules

From AJS.COM

Note: I'm thinking about moving sand toward something much closer to true Perl 6 rules... no decision yet. -AJS

Sand Rules are the grammar definition tool in the Sand programming language. They are based on Perl 6 rules. At its most basic, a rule is simply a collection of (potentially nested and/or recursive) regular expressions, à la Perl regular expressions. A rule is essentially an object that holds code for matching strings and can return not only a true/false value, but also a data structure that describes the match.

Sand syntax

Within a Sand program, rules are defined in one of two ways:

rule x { ... }

or

re{ ... }

In each case, the text that comes between the open-brace and close-brace is the rule body, and conforms to the definitions in this document.

Unicode

The following terms in this document map to the following Unicode General Category Properties:

alpha or alphabetic
Codepoints with Lu, Ll, Lt, Lm, or Lo
number or numeric
Codepoints with Nd
alpha-numeric
Any alpha or numeric codepoints
punctuation
Pc, Pd, Ps, Pe, Pi, Pf, or Po
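
The category terms above can be sketched in Python with the standard `unicodedata` module. This is only an illustration of the mapping, not part of Sand; the names `classify` and `is_alphanumeric` are hypothetical helpers introduced here.

```python
import unicodedata

# General Category values taken from the list above.
ALPHA = {"Lu", "Ll", "Lt", "Lm", "Lo"}
NUMERIC = {"Nd"}
PUNCTUATION = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"}


def classify(ch):
    """Return this document's term for a single codepoint, or "other"."""
    category = unicodedata.category(ch)
    if category in ALPHA:
        return "alpha"
    if category in NUMERIC:
        return "numeric"
    if category in PUNCTUATION:
        return "punctuation"
    return "other"


def is_alphanumeric(ch):
    """True for any alpha or numeric codepoint as defined above."""
    return classify(ch) in ("alpha", "numeric")
```

Note that under these definitions the underscore (category Pc) counts as punctuation, and symbols such as + or $ fall outside all of the listed terms.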

"Whitespace" indicates any codepoints which have the White_Space binary property.

A "character" is roughly a grapheme, but is used in this document when speaking less formally about sequences within a rule.

All codepoints are allowed in rules, but rules are evaluated at a certain "level" which determines how codepoints are broken down into atoms. Some implementations may issue an exception when encountering combining or non-printing, non-whitespace codepoints. This simplifies the implementation substantially. Using \x, \N, and \p, all valid Unicode sequences can be matched in any implementation.

See Sand: Syntax and Structure for more information on basic character-level issues.

By default, whitespace is simply ignored in rules unless escaped, though that behavior can be modified, as will be discussed later.

Comments

Within a rule, the number sign character (#, U+0023) begins a comment. Comments extend from the number sign to the end of the current line, and are ignored entirely. Even a close-brace is not matched within the comment:

re{
  abc # Match abc and ignore }
}
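
For readers who know Python, the closest analogue to this behavior is the `re.VERBOSE` flag, which also ignores unescaped whitespace and treats # as the start of a comment. The sketch below is an analogy only; Python's `re` module is not Sand, and the pattern is made up for illustration.

```python
import re

# In verbose mode, unescaped whitespace is ignored and "#" starts a comment
# that runs to the end of the line, much like a Sand rule body.
pattern = re.compile(
    r"""
    abc      # match abc and ignore }
    \ def    # an escaped space is matched literally
    """,
    re.VERBOSE,
)

matched = pattern.fullmatch("abc def") is not None
```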

Atoms

An atom is matched or not matched and added to the current match data in such a way that no partial behavior is possible. The simplest type of atom is a literal character:

a - the letter "a"
\. - the period (., U+002E)
\# - not a comment, but the character #

A literal is any alpha-numeric character or any non-alpha-numeric preceded by a backslash (\, U+005C). The following are rules which consist only of strings of literal atoms:

The\ quick\ brown\ fox
my\ \$x\ \=\ 1\ \+\ 1\;

Because using backslashes can be tedious in long strings, there is a shortcut:

<:qm my $x = 1 + 1>

The :qm stands for "quote meta" and is a shorthand way of quoting all characters except for >, which must be escaped with a backslash, even inside :qm.
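
The idea behind "quote meta" is the same as Python's `re.escape`, which backslash-escapes every character that would otherwise be special. The following is a rough analogy in Python, not Sand syntax:

```python
import re

literal = "my $x = 1 + 1;"

# re.escape plays the role of <:qm ...>: every special character is quoted,
# so the resulting pattern matches the text exactly as written.
quoted_pattern = re.compile(re.escape(literal))

matches_itself = quoted_pattern.fullmatch(literal) is not None

# Without quoting, "$" and "+" keep their special meaning and the match fails.
unquoted_matches = re.fullmatch(literal, literal) is not None
```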

Special atoms

^  start of text (string)
$  end of text
^^ start of any line
$$ end of any line
\b word boundary
\n logical newline
\t tab or tab-line character
\s whitespace (see <:ws>)
\S complement of \s
\d digit (see <:digit>)
\D complement of \d
\w word character (see <:word>)
\W complement of \w
\p{...} match any codepoints by property name
\N{...} match any codepoints by name
\x{...} match a specific codepoint by hex value
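
The difference between the text anchors and the line anchors can be illustrated with Python's `re` module, where `\A` and `\Z` behave roughly like Sand's ^ and $, and ^ and $ under `re.MULTILINE` behave roughly like ^^ and $$. This mapping is an assumption made for the sake of the example, not a statement about how Sand is implemented.

```python
import re

text = "first line\nsecond line"

# Roughly Sand's ^ : only the start of the whole text.
text_starts = re.findall(r"\A\w+", text)

# Roughly Sand's ^^ and $$ : the start and end of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
```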

Alternation

An alternation is a set of one or more rules which may match at the current position. Alternations are separated with vertical bars (|, U+007C).

abc | xyz | 123 # Match any of "abc", "xyz" or "123"

By default an alternation begins at the start of the rule and ends at the end of the rule. A dash, however, alternates between the atom on the left and the atom on the right:

\. - \! # match period or exclamation point

When dash is used between two alpha-numerics, all characters whose Unicode codepoints are between (inclusive) the two characters' codepoints are matched as an alternation:

a-z # match any lowercase letter

This is called a character range. It is an error to specify a character range which includes non-alpha-numerics such as:

A-z

Since the upper- and lower-case letters have characters such as underscore (_, U+005F) between them, this range is illegal. To specify such a range, explicit Unicode codepoints must be used:

\x{41} - \x{7a} # Match Unicode A through z, inclusive

This extra step is required in order to prevent careless or unclear inclusion of unwanted characters in ranges, the most obvious pitfall being the upper- and lower-case range above. As will be seen later, this is a clumsy thing to do in most cases, and there are simpler ways.
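
The range check described above can be sketched in Python. This models only the literal form (a-z), not the explicit \x{...} form, and the function name `expand_range` is hypothetical. Rejecting a reversed range is an assumption of this sketch; the text above does not say how Sand treats one.

```python
import unicodedata

# Alpha (Lu, Ll, Lt, Lm, Lo) plus numeric (Nd), as defined in the Unicode section.
ALPHANUMERIC = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}


def expand_range(first, last):
    """Expand a literal character range, rejecting any non-alpha-numeric member."""
    low, high = ord(first), ord(last)
    if low > high:
        # Assumption: a reversed range is treated as an error.
        raise ValueError("range endpoints are out of order")
    chars = [chr(cp) for cp in range(low, high + 1)]
    bad = [c for c in chars if unicodedata.category(c) not in ALPHANUMERIC]
    if bad:
        raise ValueError("range includes non-alpha-numerics: %r" % "".join(bad))
    return chars
```

With this sketch, `expand_range("a", "z")` yields the 26 lowercase ASCII letters, while `expand_range("A", "z")` raises an error because characters such as the underscore lie between the two endpoints.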

Repetition

An atom may be followed by a repetition count. A repetition specifies how many times the atom should be matched.

a* - match zero or more "a"
a? - match zero or one "a"
a+ - match one or more "a"
a<2> - match 2 "a"
a<1-5> - match 1 to 5 "a"

While whitespace is allowed between the atom and its repetition count, it is not encouraged.
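
As a point of comparison, Sand's angle-bracket counts correspond to the brace counts of Perl-style regular expressions. The small translator below is a sketch written for this comparison; `sand_count_to_python` is a hypothetical name and handles only the forms listed above.

```python
import re


def sand_count_to_python(count):
    """Translate "*", "?", "+", "<2>" or "<1-5>" into Python's regex syntax."""
    if count in ("*", "?", "+"):
        return count
    m = re.fullmatch(r"<(\d+)(?:-(\d+))?>", count)
    if m is None:
        raise ValueError("unrecognized repetition count: %r" % count)
    low, high = m.group(1), m.group(2)
    if high is None:
        return "{%s}" % low
    return "{%s,%s}" % (low, high)
```

For example, `"a" + sand_count_to_python("<1-5>")` gives the Python pattern `a{1,5}`.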

Subexpression grouping

The following is an atom:

[abc]

Notice that each of the letters is, on its own, an atom, but the brackets enclose them and act as a single atom. Any valid rule text may occur within brackets:

[ abc [ xyz \. ] ]

So far, these groups are not useful, but when combined with alternations and repetition, grouping can be extremely powerful.

go[ne]? # match "go" or "gone"
[19|20][0-9]<2> # match a year in the 1900s or 2000s
flower[s|ing|ed]? # Match flower alone or with an ending

Subexpressions may also be enclosed by parentheses.

(abc)

Parentheses and brackets behave exactly the same, except for the fact that parentheses save their matched contents as numbered sub-expressions, which can be accessed by indexing the match result as an array, starting from one. The zeroth entry in the match is always the entire match contents (though by default the match contents is not saved, so this entry will be undefined).
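
Python's `re` module draws the same distinction between grouping and capturing, which may help as an analogy: `(?:...)` behaves roughly like Sand's brackets and `(...)` like Sand's parentheses. One difference to keep in mind is that Python always stores the entire match as group 0, whereas the text above says Sand leaves that entry undefined by default.

```python
import re

# (?:...) groups without capturing, like Sand's brackets.
grouped = re.compile(r"flower(?:s|ing|ed)?")

# (...) groups and captures, like Sand's parentheses.
captured = re.compile(r"flower(s|ing|ed)?")

plain = grouped.fullmatch("flowers")
m = captured.fullmatch("flowering")

whole_match = m.group(0)   # entry zero: the entire match
first_group = m.group(1)   # numbered sub-expressions start from one
```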

Named atoms

Any atom may be stored to a named slot within the match. Doing so looks similar to a Sand variable assignment:

abc :$number = [ 123 ] xyz

However, there is no variable named $number. Instead, the match results will contain this value in a "named" structure. See the section on match results, below, for more information on how to use these named values.

When a named atom is also a parenthesized sub-expression group, there will be identical named and numbered values saved.
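
A loose Python analogy for named atoms is the `(?P<name>...)` group. In Python a named group is always numbered as well, so this corresponds to the parenthesized case described above rather than to the bracketed example.

```python
import re

# Roughly: abc :$number = ( 123 ) xyz
pattern = re.compile(r"abc(?P<number>123)xyz")

m = pattern.fullmatch("abc123xyz")

by_name = m.group("number")   # named lookup
by_index = m.group(1)         # identical value, looked up by number
named_values = m.groupdict()  # all named values as a dictionary
```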

Subrules

A subrule is any sequence delimited by angle-brackets (<, U+003C and >, U+003E). Subrules come in a number of forms.

Builtins

Builtin subrules always start with a colon, followed by the alpha-numeric name of the subrule and optionally argument text following one or more whitespace characters. For example:

<:ws> or <:whitespace> # Match any whitespace
<:digit> # Any digit
<:alpha> # Any alpha character
<:an> or <:alphanum> # Any alpha-numeric
<:punct> or <:punctuation> # Any punctuation
<:w> or <:word> # Any alpha-numeric or underscore
<:n> or <:number> # see below
<:qm arg> or <:quotemeta arg> # quote the argument
<:i> or <:nocase> # Begin case insensitive matching
<:sig> or <:sigspace> # see below

Variables

Variables may be interpolated into a match like so:

<:$variable>

The contents of the variable must be a rule or a value with a string representation. The contents will be matched as if they were a named sub-rule.

User-defined subrules

A user-defined subrule is any subrule which begins with an alpha character. These subrules refer to externally defined rules, which are evaluated as an atom within the current match. In a Sand program, this would look like:

rule a { a+ }
rule b { <a> b+ }

The sub-rule a will be matched first, and then the remaining part of b will be matched. Should evaluation fail, backtracking will occur, and may re-enter the subrule. This is how the execution happens logically, though implementations may approach this very differently.

By default, all user-defined rules capture their results into a named value whose name is the name of the subrule. If this is not desired, the subrule may be written as:

<?rulename>

in which case, the match is not captured.
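
The capture behavior of subrules can be approximated in Python by composing pattern strings. This is a sketch by analogy only: the helper `subrule` and the constants below are hypothetical, and simple string composition cannot express recursive rules.

```python
import re

RULE_A = r"a+"  # rule a { a+ }


def subrule(name, body, capture=True):
    """Embed one rule in another: <name> captures by name, <?name> does not."""
    if capture:
        return "(?P<%s>%s)" % (name, body)
    return "(?:%s)" % body


# rule b { <a> b+ }
RULE_B = re.compile(subrule("a", RULE_A) + r"b+")

# rule b { <?a> b+ }
RULE_B_NOCAPTURE = re.compile(subrule("a", RULE_A, capture=False) + r"b+")

captured_a = RULE_B.fullmatch("aaabb").group("a")
uncaptured = RULE_B_NOCAPTURE.fullmatch("aaabb").groupdict()
```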

Assertions

Assertions are embedded code which is evaluated (potentially many times, in the face of backtracking) during a match, and informs the engine of its success or failure. For example:

:$n = (\d+) <{ $n > 4 }>

This rule matches a sequence of one or more digits whose numeric value is greater than four.
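
Python's `re` module has no embedded assertions, but the logic, including the retry on backtracking, can be emulated by hand. The function below is a sketch of that idea, with a hypothetical name; it tries the longest run of digits first and then shorter prefixes, re-evaluating the condition each time, much as an engine would re-run the assertion after backtracking.

```python
import re


def match_digits_greater_than(text, limit=4):
    """Emulate :$n = (\\d+) <{ $n > limit }> at the start of text."""
    m = re.match(r"\d+", text)
    if m is None:
        return None
    digits = m.group(0)
    # Backtracking: give up one digit at a time and re-check the assertion.
    for end in range(len(digits), 0, -1):
        candidate = digits[:end]
        if int(candidate) > limit:
            return candidate
    return None
```

For instance, the input "12abc" yields "12", while "3" and "04" yield no match because no prefix has a value greater than four.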

Closures

Closures are like assertions, but have no impact on the matching. They are only useful for their side-effects. For example:

:$a = (a+) { print($a) }

Match results

Match results are accumulated in the match result object. Within rules, this object may be referred to in two ways:

With numbered indexes (entry 0 is the entire match; parenthesized sub-expressions are numbered from 1)

(a)(b)(c) $1 $2 $3

or with named lookups

<aa><bb><cc> <:$aa> <:$bb> <:$cc>

When a rule returns, the return value will contain the match result object. This object will have both numeric and string indexes, so it can be indexed in both ways:

$names = rule {
   :$first = (Adam | Bob | Charlie | Dave) <:ws>
   :$last = (Smith | Wang | de\ Gaulle)
};
my $result = $names("Adam Smith");
print "Hello {$result[last]}, {$result[first]}";
print "First name: $result[1]";

It looks like this data structure contains both an array and a hash, but that's not so. It is indexed as a hash, which means that subexpressions cannot be given numeric names. Individual implementations of the language may optimize numeric lookups on this data structure, but that is not guaranteed.
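
The "one hash with both numeric and string keys" idea can be sketched in Python with a single dictionary. This is an illustration of the data structure only, not of how any Sand implementation stores results; `match_names` is a hypothetical helper, and entry 0 is filled in here even though Sand leaves it undefined by default.

```python
import re

NAMES = re.compile(
    r"(?P<first>Adam|Bob|Charlie|Dave)\s(?P<last>Smith|Wang|de Gaulle)"
)


def match_names(text):
    """Return one dictionary holding both numbered and named match values."""
    m = NAMES.fullmatch(text)
    if m is None:
        return None
    result = {0: m.group(0)}  # entry zero: the entire match
    for index, value in enumerate(m.groups(), start=1):
        result[index] = value  # numbered sub-expressions, from one
    result.update(m.groupdict())  # named values share the same dictionary
    return result


result = match_names("Adam Smith")
greeting = "Hello {}, {}".format(result["last"], result["first"])
```

Because numbers and names live in the same key space, a sub-expression named "1" would collide with the first numbered entry, which is why numeric names are ruled out.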
