Sand: Syntax and Structure
From AJS.COM
- This is a pre-alpha document, and will change radically, without warning. Also, while this is a wiki, the author requests that you keep changes to the discussion page so that they can be considered in the larger context of the language before becoming part of the specification.
This document will eventually outline the full specification for the syntax and grammatical structure of the Sand programming language. Sand is very similar to Perl, but this document will assume no knowledge of Perl.
Syntax
Encoding
A Sand program consists of UTF-8 encoded Unicode text.
Characters
Throughout this document the word "character" will be used loosely to refer to a non-combining, printable Unicode codepoint, roughly analogous to a grapheme.
Characters also include the "non-printable" characters that make up the traditional "whitespace" set:
tab U+0009
linefeed U+000A
return U+000D
space U+0020
Codepoints defined by the Unicode standard as whitespace (other than those above), combining or non-rendering are valid only within POD and within special parser blocks which explicitly allow for them. The same is true for "private use" codepoints, which some implementations may not allow, even within parser blocks.
See Sand: Rules for a discussion of terminology such as "punctuation" and "alphanumeric" with respect to Unicode. The terminology used in that document is also used here. Note, however, that there are some differences, specifically in the definition of "whitespace" which is much more liberal in rules than it is outside of them.
Lines
By default, lines of program text are terminated by a linefeed or the end of input. However, because carriage returns (U+000D) are considered whitespace, the carriage return/linefeed combination which is in use on some systems is almost never harmful.
Identifiers
Identifiers are sequences of one or more characters. They may not contain code-points which are used only for combining or are otherwise non-printing. The first character must be an alphabetic character or the underscore (_, U+005F). Subsequent characters must be alpha-numeric characters or the underscore. The alphabetic characters which make up an identifier must all come from the same Unicode block. Similarly, the numeric characters must all come from the same Unicode block. However, extension and supplemental blocks may be freely mixed with their basic block. So, for example, the following are all valid identifiers:
_
apple
_1234
æther
Я_ф_Б
απ314
Identifiers may also be composed of multiple individual identifiers, joined with two colons, e.g.:
fruit::apple::seeds
The validity of each component identifier is determined on its own, without respect to the others in the chain.
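The character rules above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: Python's str.isalpha and str.isalnum stand in for the "alphabetic" and "alpha-numeric" definitions of Sand: Rules, the same-Unicode-block requirement is not checked at all, and the function names are invented for illustration.

```python
def is_component(name):
    # First character: alphabetic or underscore.
    if not name or not (name[0] == "_" or name[0].isalpha()):
        return False
    # Remaining characters: alpha-numeric or underscore. Combining marks
    # fail isalnum(), which matches the ban on combining codepoints.
    return all(ch == "_" or ch.isalnum() for ch in name[1:])


def is_identifier(name):
    # Each ::-separated component is validated on its own.
    return all(is_component(part) for part in name.split("::"))
```

With these definitions every example identifier listed above is accepted, as is fruit::apple::seeds, while 1abc and a:: are rejected.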
By convention, all identifiers starting with two underscore characters are used only by Sand and its libraries. Use by end-user programs should be viewed with considerable suspicion.
Also by convention (and possibly enforced by a stricture), alphabetic characters and numeric characters used within an identifier typically come either from the same code block or from the Latin block. That is, it would be considered questionable to mix Chinese numbers with Greek alphabetic characters. In some parts of the world, however, this might seem relatively natural, so there is no default requirement to this effect.
There are four special, predefined identifiers which may be used as bareword terms:
undef true false inf
Variables
All variables must be prefixed with a dollar sign ($, U+0024) and otherwise be composed of a single, valid identifier. There is no difference between a container variable (such as an array) and a scalar variable in terms of how the variable name is written or prefixed.
$x;
$Math::pi;
$cars = [ "Ford", "Toyota", "BMW" ];
The combined dollar sign and identifier are considered to be a single token, not an identifier with a unary operator. Other prefixes for identifiers are actually unary operators:
::ident - The type/class/namespace named ident
&ident - A function reference to the sub, ident
Indexing
A variable, close-square bracket or close-parenthesis which is followed by an open-bracket ([, U+005B) indicates an indexing operation. The indexing operation is terminated by a matching close-bracket (], U+005D).
$x[10]
$y["flower"]
[1,2,3][1]
["a"=>1,"b"=>2]["a"]
Auto-quoting
In the cases of indexing and pair construction, literal strings may be written without the use of quotes as long as the string:
- appears on the left-hand side of a pair construction (=>) or within brackets ([...])
- is a valid identifier
- does not contain ::
$y[flower]
[a=>1,b=>2][a]
Function invocation
An identifier, variable, close-square bracket or close-parenthesis which is followed by an open-parenthesis ((, U+0028) is the start of a function invocation.
function();
$functionref();
$function_vector["apple"]();
($functionref)();
function_returning_functionref()();
All function and method invocations must use parentheses around the parameter list, even when the parameter list is empty.
Method invocation
A variable or close-parenthesis which is followed by a period (., U+002E) indicates a method invocation. The period must be followed by either an identifier (the method name) or a variable (indirect method name).
Keywords
Keywords are those identifiers which have a predefined meaning when used without a prefix (such as $), suffix (such as (...)) or some other context which would map it to the name of some user-defined data. Typically keywords are used for constants which are constant with respect to the language, not the program or one of its libraries; or they are used for control structures and other non-function builtins. The full list is:
__END__ class else elsif false if inf method module my never our sub true undef use when
Note that variables and functions may have names which collide with keywords. The designation of a word as a keyword is only of interest to the compiler for purposes of tokenization.
Inline data
Data is represented in four primary ways:
Numeric
Numeric data can take all of the forms specified by this rule:
[ \+ | \- ]? [ \d<[\d_]>* [ \. \d* ]? | \. \d+ ] [ <:i>e \d<[\d_]>* ]?
Underscores are notational only, and do not affect the numeric value.
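For readers more used to conventional regular expressions, the rule can be transcribed into Python roughly as follows. This is a sketch, not part of the specification: restricting \d to ASCII digits is an assumption, and parse_numeric is a hypothetical helper name.

```python
import re

# Transcription of the numeric rule above. ASCII digits are an assumption;
# the \d in the Sand rule may cover other Unicode digits.
NUMERIC = re.compile(
    r"""
    [+-]?                               # optional sign
    (?: [0-9][0-9_]* (?: \. [0-9]* )?   # digits, then an optional fraction
      | \. [0-9]+ )                     # or a bare fraction
    (?: [eE] [0-9][0-9_]* )?            # optional exponent, either case
    """,
    re.VERBOSE,
)


def parse_numeric(text):
    # Return the value of a numeric literal, or None if it does not match.
    if NUMERIC.fullmatch(text) is None:
        return None
    # Underscores are notational only, so drop them before conversion.
    return float(text.replace("_", ""))
```

Note that the rule as written has no place for a sign inside the exponent, so under this transcription 1e-3 is not a single numeric literal.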
String
String literals take three forms:
'...' or q{...}
The only transformation performed on the string is the replacement of two backslashes (\, U+005C) with a single backslash and the replacement of a backslash followed by a single quote or close-brace with a literal single-quote or close-brace, respectively (which will not be counted when matching the initial token).
"..." or qq{...}
The enclosed string is scanned for backslashes, open-braces and dollar signs. Backslash sequences are:
\n - newline.
\r - return.
\f - form feed.
\a - bell.
\x - Followed by a hexadecimal codepoint to insert
\x{...} - Enclosed hex codepoint is inserted
\N{...} - The enclosed character name is inserted
Also, a newline or carriage-return/newline pair after a backslash consumes any subsequent spaces or tabs, and emits a single space. Any other non-alpha-numeric character following a backslash will simply result in that character as a literal, which will not be counted toward balancing the initial token (quote or open-brace). Thus \ and } may both appear within the string, preceded by a backslash to escape them.
A dollar sign allows the insertion of a simple variable. The only operation that may be performed on the variable is subscripting. Unlike the shell and Perl, there is no bracketing construct to isolate the name of the variable from surrounding text. Instead, use braces.
A brace-delimited substring is replaced by its evaluated results. So, the code:
$x = "Hello";
$y = [ "World" ];
$z = "{$x}, {$y[0]}\n";
yields the string "Hello, World\n" in $z. Any valid Sand program may be placed inside the braces. Its return value may be delivered by default (the value of the last statement) or using the return directive.
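Readers who know Python may find the f-string a useful point of comparison, since a brace-delimited expression there is likewise evaluated and spliced into the string. The analogy is loose: Python permits only an expression inside the braces, whereas the text above allows a whole Sand program.

```python
x = "Hello"
y = ["World"]
# {x} and {y[0]} are evaluated and their results replace the braces,
# much as {$x} and {$y[0]} do in the Sand example.
z = f"{x}, {y[0]}\n"
```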
qw{...}
The contained text is stripped of leading and trailing whitespace, and then split on whitespace and returned as a list.
String literals may contain any valid Unicode sequences.
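The two simpler forms, '...' / q{...} and qw{...}, are small enough to model directly. The Python below is a sketch of the transformations as described above; the function names are hypothetical, and Python's notion of whitespace in str.split is assumed to be close enough to Sand's for illustration.

```python
import re


def unescape_single(body, closer="'"):
    # Only two sequences are special: a doubled backslash, and a backslash
    # before the closing delimiter (' for '...', } for q{...}).
    pattern = r"\\(\\|" + re.escape(closer) + r")"
    return re.sub(pattern, lambda match: match.group(1), body)


def split_qw(body):
    # str.split() with no argument drops leading and trailing whitespace
    # and splits on runs of whitespace, which matches the qw{...} rule.
    return body.split()
```

Any other backslash sequence, such as \n, is left untouched by unescape_single, which is the difference from the interpolating forms.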
Pairs
Pairs are a non-delimited data structure which consists of a left-hand-side value, the => operator and a right-hand-side value. Pairs are typically used in the construction of named parameters, hash entries or elements of an ordered, associative list.
Lists
Lists are described using the square brackets ([, U+005B and ], U+005D), enclosing a comma-separated list of values:
$fruit = [ "apple", "banana", "coconut" ];
Associative lists are lists whose elements are all pairs:
$produce = [ "apple" => 1.95, "banana" => 1.60, "coconut" => 2.50 ];
Hash, array and alist variables may be populated from lists.
Blocks
Parser blocks are covered below; an open-brace that does not follow an identifier (with no whitespace between them) indicates a code block, or simply a block. Blocks begin with an open-brace ({, U+007B) and end with a balanced, matching close-brace (}, U+007D). Typically, the matching close-brace flags the end of a statement, but the following cases (involving the characters that follow the close-brace) allow the current statement to continue past the close-brace:
- ==>
- (
Each of these operates on the value of the expression to which the block belongs.
map *$list -> ($item) { $item+1 } ==> sort *;
{ &func }();
Parser blocks
A parser block is comprised of a valid identifier followed by an open-brace ({, U+007B) and a matching, balanced, close-brace (}, U+007D). What comes between the two braces is determined by the grammar specified by the identifier. No space may appear between the identifier and the open-brace. Some examples include:
qq{...} - Interpolating quoted string.
q{...} - Non-interpolating quoted string.
re{...} - Regex/rule.
Single and double quotes are a special case. They are interpreted as a parser block as if q or qq had been used.
POD
Any line which begins with an equal sign (=, U+003D) followed by an alphabetic character begins a special sort of parser block which continues until the first line that contains =cut by itself. Within this block, all text is ignored for purposes of code generation and execution. These regions are intended only for documentation in the POD format.
print("Hello, world");
=head1 NAME
hello - The hello world program.
=cut
Note: Perl 6's perldoc format may be transitioned to in the future.
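The effect of a POD block on the code the compiler sees can be sketched as a simple line filter. The Python below is illustrative only: strip_pod is a hypothetical name, and treating a =cut line with surrounding whitespace as still being "by itself" is an assumption.

```python
def strip_pod(source):
    kept = []
    in_pod = False
    for line in source.split("\n"):
        if in_pod:
            # The block runs up to and including the =cut line.
            if line.strip() == "=cut":
                in_pod = False
            continue
        # An equal sign followed by an alphabetic character opens a block.
        if len(line) >= 2 and line[0] == "=" and line[1].isalpha():
            in_pod = True
            continue
        kept.append(line)
    return "\n".join(kept)
```

Applied to the hello world example above, only the print line survives.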
Adverbs
Adverbs are identifiers preceded by a colon. They are used to modify the meaning of code or data. The colon must not have a space after it. The meaning of any given adverb (also called "modifiers") is determined by its definition.
Adverbs may take parameters, just as a subroutine invocation does, but do not require empty parentheses when no parameters are passed.
Declarations
The keywords my and our introduce declarations and limit scope.
Structure
Grammar
The full grammar for Sand may be found at: Sand: Grammar
Statements
A statement is the primary unit of code in a Sand program. A statement can be:
- A use statement, causing the interpreter to include a library
- Any expression
- A block, subroutine definition or control construct
Expressions
An expression is any literal value or variable (term) or subroutine or method invocation which may, in turn, be associated with one or more operators, potentially joining multiple expressions into a larger one. Examples of expressions include:
1
$x
print()
$x + 1
print($x + 1)
3 * 6 / 8 + 100 / 4 - 7 ** 2
Parentheses are used to group expressions like so:
3 * ((6 / (8 + 100) / 4) - (7 ** 2))
Variable declarations are considered expressions, so the following is valid:
my $x = 7 + (my $y = 10);
Operators
The operator precedence levels are (all operators are infix, binary unless otherwise noted):
- terms, ... term
- (), [] post-circumfix
- .
- ++, -- prefix/suffix
- **
- !, ::, &, *, +, - prefix
- ~~, !~
- *, /, @
- +, -
- <<, >>
- <, >, <=, >=, lt, gt, le, ge
- ==, !=, <=>, eq, ne, cmp
- &
- |^
- &&
- ||
- .., ^.., ..^, ^..^
- ??!! trinary
- =, +=, -=, etc.
- =>
- ,
- not prefix
- and
- or, xor
The associativity for these operators is:
- left: terms, ... term; .; ~~, !~; &; |^; &&; ||; *, /; =>; the comma (,); and; or, xor
- right: (), [] post-circumfix; !, ::, &, *, +, -, not prefix; ??!! trinary; **; @; +, -; <<, >>; =, +=, -=, etc.
- non: ++, -- prefix/suffix; <, >, <=, >=, lt, gt, le, ge; ==, !=, <=>, eq, ne, cmp; .., ^.., ..^, ^..^
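To show how precedence and associativity interact, here is a small precedence-climbing evaluator in Python for five of the binary operators. It is a sketch, not the Sand parser: only numeric terms and parentheses are handled, the level numbers are arbitrary (only their order comes from the precedence list), and the associativity of each operator is copied from the lists as written, which place binary + and - among the right-associative operators.

```python
import operator
import re

# (level, associativity); a higher level binds tighter.
OPS = {
    "**": (3, "right"),
    "*": (2, "left"),
    "/": (2, "left"),
    "+": (1, "right"),
    "-": (1, "right"),
}
APPLY = {
    "**": operator.pow,
    "*": operator.mul,
    "/": operator.truediv,
    "+": operator.add,
    "-": operator.sub,
}
TOKEN = re.compile(r"\s*(\*\*|[-+*/()]|[0-9]+(?:\.[0-9]+)?)")


def tokenize(text):
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected input at offset %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def term():
        token = advance()
        if token == "(":
            value = expression(1)
            if advance() != ")":
                raise SyntaxError("expected )")
            return value
        if token in OPS or token == ")":
            raise SyntaxError("expected a term, found " + token)
        return float(token)

    def expression(min_level):
        left = term()
        while peek() in OPS and OPS[peek()][0] >= min_level:
            op = advance()
            level, assoc = OPS[op]
            # A left-associative operator must not reclaim operators of its
            # own level on the right; a right-associative one may.
            right = expression(level + 1 if assoc == "left" else level)
            left = APPLY[op](left, right)
        return left

    result = expression(1)
    if peek() is not None:
        raise SyntaxError("unexpected token " + peek())
    return result
```

Under these tables 2 ** 3 ** 2 groups as 2 ** (3 ** 2), 8 / 4 / 2 groups as (8 / 4) / 2, and 10 - 4 - 3 groups as 10 - (4 - 3), the last being a direct consequence of the associativity list above.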
Regexes
Regexes are a special sort of parser block which generate closures with a special signature for use with rules for matching text. Each character of a regex is, by default, a literal which matches itself in the input. Exceptions such as alternations, quantifiers and special grouping constructs allow any grammar to be described.
For more information, see Sand: Rules.
Rule definition
When the rule keyword is used, special processing takes place. Its "block" is implicitly of parser block type "re". That is:
rule digits { \d+ }
Is parsed as if the block were preceded by the re parser block identifier, and is not parsed as normal Sand code.
Rules may take parameters like subroutines, but may not redefine their return value (which is always passed via the auto-lexical, $/).
Blocks, closures and subroutines
Any text which is not a parser block, and is enclosed by braces is called a "block" or "simple block". A block is a grouping of statements which can form a closure. For example, the following code creates a closure and then calls it:
$hello = { print("Hello, world"); };
$hello();
Blocks are also used by most of the loop and control operators such as if and while:
if $a == $b { print($a); }
while and for both accept blocks that are optionally parameterized. Parameters are specified before the start of a block with -> followed by a parenthesized parameter list:
for 1..100 -> ($odd,$even) {
print("Odd: $odd Even: $even");
}
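The example suggests that a block with two parameters consumes the list two elements at a time. Under that reading, which the text does not state outright, a rough Python equivalent looks like this (take_in_groups is a hypothetical helper):

```python
def take_in_groups(values, size):
    # Pull `size` items per iteration from one shared iterator, so each
    # pass through the loop receives the next consecutive group.
    iterator = iter(values)
    return zip(*[iterator] * size)


lines = []
for odd, even in take_in_groups(range(1, 101), 2):
    lines.append(f"Odd: {odd} Even: {even}")
```

This sketch silently drops a trailing group that is too short to fill every parameter; how Sand treats that case is not specified here.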
A semicolon may appear after a block, but if a close-brace occurs at the end of a line, then it automatically terminates the current statement.
Subroutines
Subroutines are blocks of code, no different from any other block except that they can be named and thus scoped. A subroutine definition matches this rule:
[ my | our ]? \s+ sub \s+ <identifier> <parameters> [ <return> ]? <block>
The parameters are enclosed by parentheses and may include a leading invocant which is separated from the other parameters by a semicolon.
The optional return value specifier is of the form:
\-\> \s* [ <type> | <return-typelist> | never ]
As a special case, the keyword never indicates that the routine does not return. This typically happens when a routine is intended to perform final debugging, notification or cleanup before it calls exit.
The return-typelist consists of a parenthesized sequence of zero or more type names, separated by commas and optional whitespace:
sub swapnums(num $a, num $b) -> (num, num) { return($b, $a) }
Invocants are only used in conjunction with methods (declared with the "method" keyword, rather than the "sub" keyword).
method x ($me; $when, $where, $how) { ... }
See the section on classes for more detail on methods and their invocants.
Control structures
Conditionals
The conditional control structures are if and given.
if $a == $b {
print("$a == $b");
}
given $x {
when 1 { print("x is one") }
default { print("x is unexpected") }
}
Loops
There are two primary looping operators, while and for.
while =$_stdin {
print();
}
for *$list -> ($e) {
$total += $e;
}
The following are variations of for which collect the block's return value:
*$list2 = map *$list1 -> ($e) {
$e + 1;
}
*$list2 = grep *$list1 -> ($e) {
$e > 0;
}
Because chaining map and grep can be cumbersome, the ==> operator is provided:
*$list2 = map *$list1 -> ($e) {
$e + 1;
} ==> grep * -> ($e) {
$e > 0;
} ==> sort *;
The ==> operator creates a pseudo-value * which contains the list returned by the left-hand-side expression, and makes it available to the right-hand-side expression, returning the result of the right-hand-side expression. Because of this use, map, grep and any related control loops are considered expressions, not statements like for.
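Reading the chained example above as a pipeline, each stage's result is what * refers to in the stage that follows. A Python sketch of the same three stages, with sample data invented for illustration and Python's default ordering assumed for sort:

```python
list1 = [-2, -1, 0, 3, 1]

# map *$list1 -> ($e) { $e + 1 }
mapped = [e + 1 for e in list1]
# ==> grep * -> ($e) { $e > 0 }
kept = [e for e in mapped if e > 0]
# ==> sort *
list2 = sorted(kept)
```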
Types
Variables are declared with the my and our keywords. Between the keyword and the variable is an optional type name. The builtin types are:
int num buf str array alist hash bool
The following adverbs modify the meaning of these types:
For int and num:
:bits(width) :unsigned
For str:
:encoding(name) :charset(name) :language(name)
For array, alist and hash:
:of(type) :key(type)
Namespaces
A namespace is introduced with the module keyword. There are two forms:
module Foo; # until the end of current lexical scope
module Foo { ... } # only inside the given block
Lexicals from the enclosing scope are available, but must be referred to by their full namespace name. For example:
module Foo {
my $x = 1;
module Bar {
print($x); # Error: no $x in scope
print($Foo::x); # works
}
}
Classes
A class is a namespace that is introduced with the class keyword.
class Dog :is(::Animal) {
our int $legs = 4;
my str $color;
my int $age :rw;
method bark() { .dosound() }
}
This example demonstrates a class Dog which derives from class Animal and has three attributes: an integer number of legs set to 4, common to all dogs; a string describing its color which has no default value but can be initialized at construction; and age which is an integer, and can be changed externally at any time. This class might be used like so:
my Dog $spot(:color("white"));
$spot.age(2);
$spot.bark();
Only one parent class may be defined using :is. If no :is modifier is provided, the immediate ancestor of the class is assumed to be the parent of the namespace path. That is:
class Animal::Dog { ... }
Would define the same parentage as the previous example.
If there is only one element in the identifier for the class name, then its parent defaults to Object.
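The defaulting rule for a class with no :is modifier amounts to dropping the last element of the name. A minimal Python sketch, with default_parent as a hypothetical helper name:

```python
def default_parent(class_name):
    # "Animal::Dog" -> "Animal"; a single-element name falls back to Object.
    head, separator, _ = class_name.rpartition("::")
    return head if separator else "Object"
```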
Roles
A role looks much like a class, and is very similar, but roles cannot be used to instantiate objects. Instead, they are used to control the composition of classes. A role is never the parent of a class. Instead, it is "composed" into the class's definition. For example:
role Animal::Flying {
has $wings;
method take_flight() { ... }
}
class Animal::Dog :does(::Animal::Flying) {
# a dog that can fly!
}
Now Animal::Dog will be composed as if it contained the text of Animal::Flying. Because classes can only have a single parent, this can greatly increase the flexibility of class construction. To test for a role, use the does function:
if does($x, ::Animal::Flying) { ... }
This does not determine if the named role was used in composing the object in question's class (that information is lost during composition). It only tests the object's capabilities (called "properties") to determine their compatibility with the properties of the given role. If they do match, true is returned. If they do not, then false is returned.
Notice that take_flight is undefined. This is typical of roles. They define interfaces (in the Java sense) that describe what the class is responsible for providing (through composition with other roles, inheritance from a class, or definition within the class).
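Because the test looks at capabilities rather than at how a class was composed, it behaves like a structural ("duck typing") check. The Python below sketches that idea under a simplifying assumption: compatibility is reduced to the presence of each name the role declares, whereas the real comparison of properties may be stricter. The classes are stand-ins invented for the example.

```python
def does(obj, role):
    # Structural check: every public name the role declares must also be
    # present on the object. How the object's class was built is ignored.
    required = [name for name in vars(role) if not name.startswith("_")]
    return all(hasattr(obj, name) for name in required)


class Flying:  # stands in for role Animal::Flying
    wings = None

    def take_flight(self): ...


class Dog:  # never mentions Flying, yet provides everything it asks for
    wings = 2

    def take_flight(self):
        return "up"


class Cat:
    pass
```

Here does(Dog(), Flying) is true even though Dog was never declared in terms of Flying, while does(Cat(), Flying) is false.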
Notes
- While & is not used to prefix subroutine calls, Perl 4/5-like usage, &function(...), will actually work because it takes a code-ref and then invokes it.
- Perl 6 handles parens more elegantly than Sand, currently. More work needed there. Specifically, the ambiguity between list-context and expression grouping needs to be resolved.
TODO
- Types
- Classes/objects/roles
- Object system
- Dispatch
- Operators
- Comparisons
- Precedence
- Hyper-operations?
- Exceptions
- Evaluation
- Regex doc
- Generators / reduction / etc.
