Sand: Syntax and Structure

From AJS.COM

This is a pre-alpha document, and will change radically, without warning. Also, while this is a wiki, the author requests that you keep changes to the discussion page so that they can be considered in the larger context of the language before becoming part of the specification.

This document will eventually outline the full specification for the syntax and grammatical structure of the Sand programming language. Sand is very similar to Perl, but this document will assume no knowledge of Perl.

Syntax

Encoding

A Sand program consists of UTF-8 encoded Unicode text.

Characters

Throughout this document the word "character" will be used loosely to refer to a non-combining, printable Unicode codepoint, roughly analogous to a grapheme.

Characters also include the "non-printable" characters that make up the traditional "whitespace" set:

tab      U+0009
linefeed U+000A
return   U+000D
space    U+0020

Codepoints defined by the Unicode standard as whitespace (other than those above), combining or non-rendering are valid only within POD and within special parser blocks which explicitly allow for them. The same is true for "private use" codepoints, which some implementations may not allow, even within parser blocks.

See Sand: Rules for a discussion of terminology such as "punctuation" and "alphanumeric" with respect to Unicode. The terminology used in that document is also used here. Note, however, that there are some differences, specifically in the definition of "whitespace", which is much more liberal inside rules than it is outside of them.

Lines

By default, lines of program text are terminated by a linefeed or the end of input. However, because carriage returns (U+000D) are considered whitespace, the carriage return/linefeed combination which is in use on some systems is almost never harmful.

Identifiers

Identifiers are sequences of one or more characters. They may not contain codepoints which are used only for combining or are otherwise non-printing. The first character must be an alphabetic character or the underscore (_, U+005F). Subsequent characters must be alphanumeric characters or the underscore. The alphabetic characters which make up an identifier must all come from the same Unicode block. Similarly, the numeric characters must all come from the same Unicode block. However, extension and supplemental blocks may be freely mixed with their basic block. So, for example, the following are all valid identifiers:

_
apple
_1234
æther
Я_ф_Б
απ314

Identifiers may also be composed of multiple individual identifiers, joined with two colons, e.g.:

fruit::apple::seeds

The validity of each component identifier is determined on its own, without respect to the others in the chain.
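The identifier rules above can be approximated in a few lines of Python. This is only a sketch under loud assumptions: Python's standard library has no Unicode block table, so the hypothetical helper char_family uses the first word of a character's Unicode name (LATIN, GREEK, CYRILLIC, DIGIT and so on) as a rough stand-in for "block plus its extension and supplemental blocks". The function names are invented for illustration and are not part of Sand.

```python
import unicodedata


def char_family(ch: str) -> str:
    """Rough stand-in for a Unicode block: the first word of the character name.

    ASSUMPTION: this only approximates the block rule in the text.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_valid_component(ident: str) -> bool:
    """Check one identifier component (no '::') against the rules above."""
    if not ident:
        return False
    first = ident[0]
    # The first character must be alphabetic or the underscore.
    if not (first == "_" or first.isalpha()):
        return False
    alpha_families = set()
    digit_families = set()
    for ch in ident:
        if ch == "_":
            continue
        if ch.isalpha():
            alpha_families.add(char_family(ch))
        elif ch.isdecimal():
            digit_families.add(char_family(ch))
        else:
            # Combining marks, punctuation, whitespace and so on are rejected.
            return False
    # Letters must share one family; digits must share one family.
    return len(alpha_families) <= 1 and len(digit_families) <= 1


def is_valid_identifier(name: str) -> bool:
    """Each '::'-separated component is validated on its own."""
    return all(is_valid_component(part) for part in name.split("::"))
```

Because each component is checked independently, a name such as fruit::απ314 would pass even though its two components use different alphabets.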

By convention, all identifiers starting with two underscore characters are used only by Sand and its libraries. Use by end-user programs should be viewed with considerable suspicion.

Also by convention (and possibly enforced by a stricture), alphabetic characters and numeric characters used within an identifier typically come either from the same Unicode block or from the Latin block. That is, it would be considered questionable to mix Chinese numerals with Greek alphabetic characters. In some parts of the world, however, this might seem relatively natural, so there is no default requirement to this effect.

There are four special, predefined identifiers which may be used as bareword terms:

undef
true
false
inf

Variables

All variables must be prefixed with a dollar sign ($, U+0024) and otherwise be composed of a single, valid identifier. There is no difference between a container variable (such as an array) and a scalar variable in terms of how the variable name is written or prefixed.

$x;
$Math::pi;
$cars = [ "Ford", "Toyota", "BMW" ];

The combined dollar sign and identifier are considered to be a single token, not an identifier with a unary operator. Other prefixes for identifiers are actually unary operators:

::ident - The type/class/namespace named ident
&ident  - A function reference to the sub, ident

Indexing

A variable, close-square bracket or close-parenthesis which is followed by an open-bracket ([, U+005B) indicates an indexing operation. The indexing operation is terminated by a matching close-bracket (], U+005D).

$x[10]
$y["flower"]
[1,2,3][1]
["a"=>1,"b"=>2]["a"]

Auto-quoting

In the cases of indexing and pair construction, literal strings may be written without the use of quotes as long as the string:

  • appears on the left-hand side of a pair construction (=>) or within brackets ([...])
  • is a valid identifier
  • does not contain ::

$y[flower]
[a=>1,b=>2][a]

Function invocation

An identifier, variable, close-square bracket or close-parenthesis which is followed by an open-parenthesis ((, U+0028) is the start of a function invocation.

function();
$functionref();
$function_vector["apple"]();
($functionref)();
function_returning_functionref()();

All function and method invocations must use parentheses around the parameter list, even when the parameter list is empty.

Method invocation

A variable or close-parenthesis which is followed by a period (., U+002E) indicates a method invocation. The period must be followed by either an identifier (the method name) or a variable (indirect method name).

Keywords

Keywords are those identifiers which have a predefined meaning when used without a prefix (such as $), suffix (such as (...)) or some other context which would map it to the name of some user-defined data. Typically keywords are used for constants which are constant with respect to the language, not the program or one of its libraries; or they are used for control structures and other non-function builtins. The full list is:

__END__
class
else
elsif
false
if
inf
method
module
my
never
our
sub
true
undef
use
when

Note that variables and functions may have names which collide with keywords. The designation of a word as a keyword is only of interest to the compiler for purposes of tokenization.

Inline data

Data is represented in four primary ways:

Numeric

Numeric data can take all of the forms specified by this rule:

[ \+ | \- ]? [ \d<[\d_]>* [ \. \d* ]? | \. \d+ ] [ <:i>e \d<[\d_]>* ]?

Underscores are notational only, and do not affect the numeric value.
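As a rough illustration, the rule above can be transliterated into a Python regular expression. This is a sketch, not the reference tokenizer: the names NUMERIC and parse_numeric are hypothetical, \d is restricted to ASCII digits as a simplifying assumption, and every literal is converted to a float for brevity. Note that the rule as written does not accept a sign in the exponent.

```python
import re

# Transliteration of the Sand rule:
#   optional sign, then either digits (with underscores) and an optional
#   fraction, or a bare fraction, then an optional unsigned exponent.
NUMERIC = re.compile(
    r"[+-]?"
    r"(?:\d[\d_]*(?:\.\d*)?|\.\d+)"
    r"(?:[eE]\d[\d_]*)?",
    re.ASCII,  # ASSUMPTION: ASCII digits only
)


def parse_numeric(text: str) -> float:
    """Return the value of a numeric literal, ignoring underscores."""
    if NUMERIC.fullmatch(text) is None:
        raise ValueError(f"not a numeric literal: {text!r}")
    # Underscores are notational only, so drop them before conversion.
    return float(text.replace("_", ""))
```

Under these assumptions, 1_000_000, -.5, 1. and 3.14e2 are all accepted, while _1 and 1e-5 are rejected.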

String

String literals take three forms:

'...' or q{...}

The only transformation performed on the string is the replacement of two backslashes (\, U+005C) with a single backslash and the replacement of a backslash followed by a single quote or close-brace with a literal single-quote or close-brace, respectively (which will not be counted when matching the initial token).
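A minimal Python sketch of this transformation follows, assuming the body of the literal has already been extracted from between its delimiters. The function name unescape_single is hypothetical, and the sketch treats \' and \} the same way in both forms, which is one possible reading of "respectively".

```python
import re


def unescape_single(body: str) -> str:
    """Apply the only transformations allowed in '...' and q{...} bodies.

    A backslash followed by a backslash, a single quote or a close-brace
    is replaced by that second character; everything else is left alone.
    """
    return re.sub(r"\\([\\'}])", r"\1", body)
```

Any other backslash sequence, such as \n, passes through unchanged as two characters.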

"..." or qq{...}

The enclosed string is scanned for backslashes, open-braces and dollar signs. Backslash sequences are:

\n - newline.
\r - return.
\f - form feed.
\a - bell.
\x - Followed by a hexadecimal codepoint to insert
\x{...} - Enclosed hex codepoint is inserted
\N{...} - The enclosed character name is inserted

Also, a newline or carriage-return/newline pair after a backslash consumes any subsequent spaces or tabs, and emits a single space. Any other non-alpha-numeric character following a backslash will simply result in that character as a literal, which will not be counted toward balancing the initial token (quote or open-brace). Thus \ and } may both appear within the string, preceded by a backslash to escape them.

A dollar sign allows the insertion of a simple variable. The only operation that may be performed on the variable is subscripting. Unlike the shell and Perl, there is no bracketing construct to isolate the name of the variable from surrounding text. Instead, use braces.

A brace-delimited substring is replaced by its evaluated results. So, the code:

$x = "Hello";
$y = [ "World" ];
$z = "{$x}, {$y[0]}\n";

yields the string "Hello, World\n" in $z. Any valid Sand program may be placed inside the braces. Its return value may be delivered by default (the value of the last statement) or using the return directive.

qw{...}

The contained text is stripped of leading and trailing whitespace, and then split on whitespace and returned as a list.
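The behaviour described for qw{...} matches what Python's str.split() does when called without arguments, so a stand-in is a one-liner. The function name qw is used here only for illustration.

```python
def qw(text: str) -> list[str]:
    """Strip leading and trailing whitespace, then split on whitespace runs."""
    # str.split() with no arguments does both steps at once.
    return text.split()
```

For example, the body of qw{ apple banana coconut } would yield a three-element list, and an empty body would yield an empty list.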

String literals may contain any valid Unicode sequences.

Pairs

Pairs are a non-delimited data structure which consists of a left-hand-side value, the => operator and a right-hand-side value. Pairs are typically used in the construction of named parameters, hash entries or elements of an ordered, associative list.

Lists

Lists are described using the square brackets ([, U+005B and ], U+005D), enclosing a comma-separated list of values:

$fruit = [ "apple", "banana", "coconut" ];

Associative lists are lists whose elements are all pairs:

$produce = [
  "apple"   => 1.95,
  "banana"  => 1.60,
  "coconut" => 2.50
];

Hash, array and alist variables may be populated from lists.

Blocks

Parser blocks are covered below. Braces that do not immediately follow an identifier (with no intervening whitespace) indicate a code block, or simply a block. Blocks begin with an open-brace ({, U+007B) and end with a balanced, matching close-brace (}, U+007D). Typically, the matching close-brace marks the end of a statement, but the following tokens, when they follow the close-brace, allow the current statement to continue past it:

  • ==>
  • (

Each of these operates on the value of the expression to which the block belongs.

map *$list -> ($item) { $item+1 } ==> sort *;
{ &func }();

Parser blocks

A parser block consists of a valid identifier followed by an open-brace ({, U+007B) and a matching, balanced, close-brace (}, U+007D). What comes between the two braces is determined by the grammar specified by the identifier. No space may appear between the identifier and the open-brace. Some examples include:

qq{...} - Interpolating quoted string.
q{...} - Non-interpolating quoted string.
re{...} - Regex/rule.

Single and double quotes are a special case. They are interpreted as a parser block as if q or qq had been used.

POD

Any line which begins with an equal sign (=, U+003D) followed by an alphabetic character begins a special sort of parser block which continues until the first line that contains =cut by itself. Within this block, all text is ignored for purposes of code generation and execution. These regions are intended only for documentation in the POD format.

print("Hello, world");

=head1 NAME

hello - The hello world program.

=cut

Note: Perl 6's perldoc format may be transitioned to in the future.
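The POD rule is simple enough to sketch as a line filter. The Python below is an illustration only: strip_pod is a hypothetical name, and it assumes that "by itself" tolerates surrounding whitespace (including a stray carriage return) around =cut.

```python
import re

# '=' at the start of a line, followed by an alphabetic character.
POD_START = re.compile(r"=[^\W\d_]")


def strip_pod(source: str) -> str:
    """Return the source text with all POD regions removed."""
    kept = []
    in_pod = False
    for line in source.splitlines():
        if in_pod:
            # ASSUMPTION: whitespace around '=cut' is tolerated.
            if line.strip() == "=cut":
                in_pod = False
            continue
        if POD_START.match(line):
            in_pod = True
            continue
        kept.append(line)
    return "\n".join(kept)
```

Applied to the example above, only the print line (and the surrounding blank lines) would survive. Lines that merely begin with an operator such as == are not affected, because the character after the equal sign is not alphabetic.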

Adverbs

Adverbs are identifiers preceded by a colon. They are used to modify the meaning of code or data. The colon must not have a space after it. The meaning of any given adverb (also called a "modifier") is determined by its definition.

Adverbs may take parameters, just as a subroutine invocation, but do not require empty parentheses when no parameters are passed.

Declarations

The keywords my and our introduce declarations and limit scope.

Structure

Grammar

The full grammar for Sand may be found at: Sand: Grammar

Statements

A statement is the primary unit of code in a Sand program. A statement can be:

  • A use statement, causing the interpreter to include a library
  • Any expression
  • A block, subroutine definition or control construct

Expressions

An expression is any literal value or variable (term) or subroutine or method invocation which may, in turn, be associated with one or more operators, potentially joining multiple expressions into a larger one. Examples of expressions include:

1
$x
print()
$x + 1
print($x + 1)
3 * 6 / 8 + 100 / 4 - 7 ** 2

Parentheses are used to group expressions like so:

3 * ((6 / (8 + 100) / 4) - (7 ** 2))
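To see that the parenthesized form really is a different computation from the unparenthesized example, both can be evaluated in Python, whose relative ordering for these five operators (** above * and /, which are above + and -) happens to agree with levels 5, 8 and 9 of the precedence list in this document. This is a Python stand-in, not Sand, and the variable names are illustrative.

```python
# Default grouping: ((3 * 6) / 8) + (100 / 4) - (7 ** 2)
default_grouping = 3 * 6 / 8 + 100 / 4 - 7 ** 2

# Explicit grouping from the text above.
explicit_grouping = 3 * ((6 / (8 + 100) / 4) - (7 ** 2))

# The two expressions use the same operands but give different values.
print(default_grouping)
print(explicit_grouping)
```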

Variable declarations are considered expressions, so the following is valid:

my $x = 7 + (my $y = 10);

Operators

The operator precedence levels are (all operators are infix, binary unless otherwise noted):

  1. terms, ... term
  2. (), [] post-circumfix
  3. .
  4. ++, -- prefix/suffix
  5. **
  6. !, ::, &, *, +, - prefix
  7. ~~, !~
  8. *, /, @
  9. +, -
  10. <<, >>
  11. <, >, <=, >=, lt, gt, le, ge
  12. ==, !=, <=>, eq, ne, cmp
  13. &
  14. | ^
  15. &&
  16. ||
  17. .., ^.., ..^, ^..^
  18. ??!! trinary
  19. =, +=, -=, etc.
  20. =>
  21. ,
  22. not prefix
  23. and
  24. or, xor

The associativity for these operators is:

left
  terms, ... term,
  ., ~~, !~, &, | ^, &&, ||, *, /, =>, ,, and, or, xor
right
  (), [] post-circumfix,
  !, ::, &, *, +, -, not prefix,
  ??!! trinary
  **, @, +, -, <<, >>, =, +=, -=, etc.
non
  ++, -- prefix/suffix,
  <, >, <=, >=, lt, gt, le, ge, ==, !=, <=>, eq, ne, cmp, .., ^.., ..^, ^..^

Regexes

Regexes are a special sort of parser block which generate closures with a special signature for use with rules for matching text. Each character of a regex is, by default, a literal which matches itself in the input. Exceptions such as alternations, quantifiers and special grouping constructs allow any grammar to be described.

For more information, see Sand: Rules.

Rule definition

When the rule keyword is used, special processing takes place. Its "block" is implicitly of parser block type "re". That is:

rule digits { \d+ }

is parsed as if the block were preceded by the re parser block identifier, and is not parsed as normal Sand code.

Rules may take parameters like subroutines, but may not redefine their return value (which is always passed via the auto-lexical, $/).

Blocks, closures and subroutines

Any text which is not a parser block, and is enclosed by braces is called a "block" or "simple block". A block is a grouping of statements which can form a closure. For example, the following code creates a closure and then calls it:

$hello = { print("Hello, world"); };
$hello();

Blocks are also used by most of the loop and control operators such as if and while:

if $a == $b { print($a); }

while and for both accept blocks that are optionally parameterized. Parameters are specified before the start of a block with -> followed by a parenthesized parameter list:

for 1..100 -> ($odd,$even) {
 print("Odd: $odd Even: $even");
}
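The two-parameter loop above appears to consume two values per iteration, as the names $odd and $even suggest; the text does not state this outright, so treat it as an assumption. Under that reading, and assuming 1..100 is inclusive at both ends, a Python equivalent could look like this (chunked_pairs is a hypothetical helper):

```python
def chunked_pairs(values):
    """Yield consecutive, non-overlapping pairs from an iterable."""
    iterator = iter(values)
    # zip pulls from the same iterator twice per step,
    # so each step consumes two values.
    return zip(iterator, iterator)


lines = [
    f"Odd: {odd} Even: {even}"
    for odd, even in chunked_pairs(range(1, 101))
]
```

With these assumptions the loop body runs 50 times, once for each (odd, even) pair.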

A semicolon may appear after a block, but if a close-brace occurs at the end of a line, then it automatically terminates the current statement.

Subroutines

Subroutines are blocks of code, no different from any other block except that they can be named and thus scoped. A subroutine definition matches this rule:

[ my | our ]? \s+ sub \s+ <identifier> <parameters> [ <return> ]? <block>

The parameters are enclosed by parentheses and may include a leading invocant which is separated from the other parameters by a semicolon.

The optional return value specifier is of the form:

\-\> \s* [ <type> | <return-typelist> | never ]

As a special case, the keyword never indicates that the routine does not return. This typically happens when a routine is intended to perform final debugging, notification or cleanup before it calls exit.

The return-typelist consists of a parenthesized sequence of zero or more type names, separated by commas and optional whitespace:

sub swapnums(num $a, num $b) -> (num, num) { return($b, $a) }

Invocants are only used in conjunction with methods (declared with the "method" keyword, rather than the "sub" keyword).

method x ($me; $when, $where, $how) { ... }

See the section on classes for more detail on methods and their invocants.

Control structures

Conditionals

The conditional control structures are if and given.

if $a == $b {
  print("$a == $b");
}
given $x {
  when 1 { print("x is one") }
  default { print("x is unexpected") }
}

Loops

There are two primary looping operators, while and for.

while =$_stdin {
  print();
}
for *$list -> ($e) {
  $total += $e;
}

The following are variations of for which collect the block's return value:

*$list2 = map *$list1 -> ($e) {
  $e + 1;
}
*$list2 = grep *$list1 -> ($e) {
  $e > 0;
}

Because chaining map and grep can be cumbersome, the ==> operator is provided:

*$list2 = map *$list1 -> ($e) {
  $e + 1;
} ==> grep * -> ($e) {
  $e > 0;
} ==> sort *;

The ==> operator creates a pseudo-value * which contains the list returned by the left-hand-side expression, and makes it available to the right-hand-side expression, returning the result of the right-hand-side expression. In the example above, the list produced by map is passed to grep as *, and the list produced by grep is passed to sort.

Because of this use, map, grep and any related control loops are considered expressions, not statements like for.
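For readers more familiar with other languages, the map / grep / sort chain can be mirrored in Python as three explicit stages, each one consuming the list produced by the stage before it. This is an analogy, not Sand semantics: the sample data is invented, and sort is assumed to order numbers ascending.

```python
list1 = [3, -2, 0, 5]  # hypothetical sample data

mapped = [e + 1 for e in list1]          # map *$list1 -> ($e) { $e + 1 }
filtered = [e for e in mapped if e > 0]  # ==> grep * -> ($e) { $e > 0 }
list2 = sorted(filtered)                 # ==> sort *
```

Each intermediate name (mapped, filtered) plays the part that the pseudo-value * plays in the Sand version.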

Types

Variables are declared with the my and our keywords. An optional type name may appear between the keyword and the variable. The builtin types are:

int
num
buf
str
array
alist
hash
bool

The following adverbs modify the meaning of these types:

For int and num:

:bits(width)
:unsigned

For str:

:encoding(name)
:charset(name)
:language(name)

For array, alist and hash:

:of(type)
:key(type)

Namespaces

A namespace is introduced with the module keyword. There are two forms:

module Foo; # until the end of current lexical scope
module Foo { ... } # only inside the given block

Lexicals from the enclosing scope are available, but must be referred to by their full namespace name. For example:

module Foo {
  my $x = 1;
  module Bar {
    print($x); # Error: no $x in scope
    print($Foo::x); # works
  }
}

Classes

A class is a namespace that is introduced with the class keyword.

class Dog :is(::Animal) {
  our int $legs = 4;
  my str $color;
  my int $age :rw;
  method bark() { .dosound() }
}

This example demonstrates a class Dog which derives from class Animal and has three attributes: an integer number of legs set to 4, common to all dogs; a string describing its color which has no default value but can be initialized at construction; and age which is an integer, and can be changed externally at any time. This class might be used like so:

my Dog $spot(:color("white"));
$spot.age(2);
$spot.bark();

Only one parent class may be defined using :is. If no :is modifier is provided, the parent is assumed to be the class named by the enclosing portion of the namespace path. That is:

class Animal::Dog { ... }

would define the same parentage as the previous example.

If there is only one element in the identifier for the class name, then its parent defaults to Object.
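The default-parent rule amounts to dropping the last component of the class name. A small Python sketch, with the hypothetical function name default_parent, makes the two cases explicit:

```python
def default_parent(class_name: str) -> str:
    """Return the implied parent of a class declared without :is."""
    # Split off the final '::'-separated component, if there is one.
    head, sep, _tail = class_name.rpartition("::")
    # A single-element name has no enclosing namespace, so it
    # falls back to Object.
    return head if sep else "Object"
```

So Animal::Dog would default to the parent Animal, while a plain Dog would default to Object.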

Roles

A role looks much like a class, and is very similar, but roles cannot be used to instantiate objects. Instead, they are used to control the composition of classes. A role is never the parent of a class. Instead, it is "composed" into the class's definition. For example:

role Animal::Flying {
  has $wings;
  method take_flight() { ... }
}
class Animal::Dog :does(::Animal::Flying) {
  # a dog that can fly!
}

Now Animal::Dog will be composed as if it contained the text of Animal::Flying. Because classes can only have a single parent, this can greatly increase the flexibility of class construction. To test for a role, use the does function:

if does($x, ::Animal::Flying) { ... }

This does not determine if the named role was used in composing the object in question's class (that information is lost during composition). It only tests the object's capabilities (called "properties") to determine their compatibility with the properties of the given role. If they do match, true is returned. If they do not, then false is returned.

Notice that take_flight is undefined. This is typical of roles. They define interfaces (in the Java sense) that describe what the class is responsible for providing (through composition with other roles, inheritance from a class, or definition within the class).
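The capability-based test performed by does resembles structural typing in other languages. As an analogy only, Python's typing.Protocol with runtime_checkable checks whether an object provides the required members, regardless of how its class was built. The class names below are invented for illustration, and this does not claim to reproduce Sand's property matching in detail.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Flying(Protocol):
    """Stand-in for the Animal::Flying role: an interface with no body."""

    def take_flight(self) -> None: ...


class FlyingDog:
    # Never mentions Flying, but provides the required method.
    def take_flight(self) -> None:
        self.airborne = True


class Cat:
    pass


def does(obj: object, role: type) -> bool:
    """True if obj provides everything the role requires."""
    return isinstance(obj, role)
```

Here does(FlyingDog(), Flying) is true even though FlyingDog was never declared with the role, which mirrors the point that the test looks at capabilities rather than at how the class was composed.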

Notes

  • While & is not used to prefix subroutine calls, Perl 4/5-like usage such as &function(...) will actually work, because & takes a code-ref, which is then invoked.
  • Perl 6 handles parens more elegantly than Sand, currently. More work needed there. Specifically, the ambiguity between list-context and expression grouping needs to be resolved.

TODO

  • Types
  • Classes/objects/roles
  • Object system
  • Dispatch
  • Operators
    • Comparisons
    • Precedence
    • Hyper-operations?
  • Exceptions
  • Evaluation
  • Regex doc
  • Generators / reduction / etc.
