Pegger User Guide

Overview

Parsing Expression Grammar (PEG) for when Regular Expressions just aren't enough!

Pegger is a Parsing Expression Grammar (PEG) implementation. It lets you create text parsers by building up a tree of simple matching rules.

Advanced parsing options let you look ahead with predicates, and throw errors to fail fast.

Pegger has been used (by Fantom Factory) to parse HTML, CSS, Markdown, and ABNF - to name but a few. The general strategy is usually:

  1. Create a structure of Fantom data classes
  2. Create a grammar to parse text documents into a node tree
  3. Trim the tree by removing labels and branches, so the tree resembles the structure of the Fantom classes
  4. Walk the tree, recursively creating corresponding data classes
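
As a minimal (and hypothetical) sketch of that strategy, the single-rule grammar from the Quick Start below could be used to populate a simple data class - the Greeting class and its field are invented for illustration:

using afPegger::Peg

// 1. a Fantom data class
class Greeting {
    Str? name
}

class StrategyExample {
    Void main() {
        // 2. parse the text into a node tree (here the grammar is a single rule)
        match := Peg("<<<Hello Mum>>>", "'<'+ name:[a-zA-Z ]+ '>'+").match

        // 3. & 4. pick out the labelled branch and build the data class from it
        greeting := Greeting { it.name = match["name"].toStr }
        echo(greeting.name)    // --> Hello Mum
    }
}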

Pegger was inspired by Mouse, Parboiled for Java, and pegs for Nim.

Quick Start

using afPegger::Peg

class Example {
    Void main() {
        input := "<<<Hello Mum>>>"
        rule  := "'<'+ name:[a-zA-Z ]+ '>'+"
        match := Peg(input, rule).match

        name  := match["name"].toStr    // --> "Hello Mum"
    }
}

Usage

The quick start example used a lot of crazy symbols... so whoa, what just happened?

Rules

Pegger attempts to match a rule against a given string. In the example the string was <<<Hello Mum>>> and the rule was that mixed bag of crazy characters. Rules can be written in PEG notation (see mixed bag of crazy characters) or they can be created programmatically via Fantom code using the Rules mixin:

// PEG Notation:
// '<'+ ([a-zA-Z] / " ")+ '>'+

// Fantom code:
rule := sequence {
    oneOrMore(char('<')),
    oneOrMore(firstOf { alphaChar, spaceChar }),
    oneOrMore(char('>')),
}

The Fantom code can be a lot simpler to read and understand, but is also a lot more verbose.

Once you run a match, the result is a tree. Use Match.dump() to see it:

rule := sequence {
    oneOrMore(char('<')),
    oneOrMore(firstOf { alphaChar, spaceChar }),
    oneOrMore(char('>')),
}

input := "<<<Hello Mum>>>"
match := Peg(input, rule).match
match.dump

// --> root : "<<<Hello Mum>>>"

Okay, so that single line starting with root is not much of a tree. To create a tree, we need to give parts of our rule a label:

rule := sequence {
    oneOrMore(char('<'))                       .withLabel("start"),
    oneOrMore(firstOf { alphaChar, spaceChar }).withLabel("name"),
    oneOrMore(char('>'))                       .withLabel("end"),
}

We can do the same in PEG notation by using a label: prefix:

// PEG Notation:
// start:'<'+ name:[a-zA-Z ]+ end:'>'+

match.dump() now gives:

root
 ├─ start : "<<<"
 ├─ name : "Hello Mum"
 └─ end : ">>>"

Each part of the match may be retrieved using the Match.get() operator:

match["start"].toStr  // -> "<<<"
match["name"].toStr   // -> "Hello Mum"
match["end"].toStr    // -> ">>>"

Grammar

The same could also be written as a PEG grammar. A grammar defines multiple named PEG rules. Grammars may be coded programmatically but are usually created from a string. There should always be one top-level, or root, rule that is responsible for matching everything:

root  = start name end
start = '<'+
name  = [a-zA-Z ]+
end   = '>'+

Definitions may span multiple lines, but continuation lines must contain leading whitespace to distinguish them from a new rule definition.

root  = start
        name
        end
start = '<'+
name  = [a-zA-Z ]+
end   = '>'+

When run, the same result is given:

grammarStr := "..."
grammar    := Peg.parseGrammar(grammarStr)

input      := "<<<Hello Mum>>>"
match      := grammar["root"].match(input).dump

// root
//  ├─ start : "<<<"
//  ├─ name : "Hello Mum"
//  └─ end : ">>>"

Once a grammar (or rule) has been parsed and validated, it may be cached for future re-use.
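
For example, a minimal sketch (re-using the example grammar above) that parses the grammar just once and then matches it against any number of inputs:

// parse and validate the grammar once...
grammar := Peg.parseGrammar("root  = start name end
                             start = '<'+
                             name  = [a-zA-Z ]+
                             end   = '>'+")

// ...then re-use it for as many matches as you like
echo(grammar["root"].match("<<<Hello Mum>>>")["name"])    // --> Hello Mum
echo(grammar["root"].match("<<Hello Dad>>")["name"])      // --> Hello Dad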

Rules may be omitted from the result tree by prefixing their definitions with a -:

root   = start name end
-start = '<'+
name   = [a-zA-Z ]+
-end   = '>'+

// root
//  └─ name : "Hello Mum"

PEG Notation

Grammar files can be a lot easier to write and understand than the verbose programmatic method. Here's your guide.

Pegger is primarily concerned with parsing displayable 7-bit ASCII characters, but where mentioned also provides support for 16-bit Unicode characters. Non-visible / non-printable characters are beyond the remit of Pegger; largely because Pegger uses strings as input! See Design notes for details.

Rule definitions

A rule is defined with a name followed by =. The more formal definition of <- may be used in place of =.

ruleDef1  = rule
ruleDef2 <- rule

Advanced Note: Prefixing a rule with - will remove it from the result tree.

-ruleDef1  = rule
-ruleDef2 <- rule

Advanced Note: Suffixing a rule with - will remove it from debug output.

ruleDef1-  = rule
ruleDef2- <- rule

Rules may have both a - prefix and suffix: -ruleDef- = rule

Sequence

Ordered sequences of rules are expressed by separating them with a space.

ruleDef = rule1 rule2 rule3

Choice / First of

When given a choice of rules, Pegger will match the first rule that passes. Beware: the order of the choices can be important.

ruleDef = rule1 / rule2 / rule3
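
For example, given the input text ab, the first rule below only ever matches a - the first alternative passes, so the second is never tried. Longer alternatives should generally be listed first:

ruleDef1 = 'a' / 'ab'    // given "ab", matches just "a"
ruleDef2 = 'ab' / 'a'    // given "ab", matches "ab"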

Grouping

Sequences bind more tightly than choices, so rule1 rule2 / rule3 is read as (rule1 rule2) / rule3 - but it is better to group rules with brackets to avoid ambiguity.

ruleDef = (rule1 rule2) / (rule3 rule4)

Repetition

Rules may be suffixed to specify how many times they should be matched.

rule1?      // optional
rule1+      // one or more
rule1*      // zero or more
rule1{4}    // exactly 4
rule1{2,6}  // between 2 and 6 (inclusive)
rule1{,5}   // at most 5 - same as {0,5}
rule1{3,}   // at least 3

Literal

Literal strings may be matched with either single or double quotes.

ruleDef1 = "literal 1"
ruleDef2 = 'literal 2'

Use backslash to escape the usual \b \f \n \r \t characters and to escape quotes. (Note that Fantom normalises all new line characters to just \n.)

Use an i suffix to indicate the match should be case-insensitive.

ruleDef3 = "new\nline\n"i

Character class

Individual characters, and ranges thereof, may be matched with a regex-esque character class.

ruleDef1 = [a]        // matches a
ruleDef2 = [abc]      // matches 'a', 'b', or 'c'
ruleDef3 = [a-z]      // matches any character in the range from 'a' to 'z' inclusive
ruleDef4 = [a-z]i     // as above, but case-insensitive
ruleDef5 = [0-9A-F]i  // matches any hex digit
ruleDef6 = [\n\] \t]  // backslash escapes

The hat ^ character will match any character BUT the chosen ones.

ruleDef7 = [^0-9]     // matches any char BUT a digit

Any character

Use . to match any character.

ruleDef = . . .       // matches any 3 characters

Predicates

Use the ampersand & to look ahead for a match without consuming any characters.

ruleDef = "something " &"good" .

Use the exclamation mark ! to look ahead for a negative match, again without consuming any characters.

ruleDef = "something " !"bad" .

As in the examples above, predicates should always be used in conjunction with a character-consuming rule.

Unicode

Use \uXXXX (hexadecimal) notation to match a 16-bit Unicode character. It may be used in string literals and character classes.

crlf = [\u000D] "\u000A"

Unicode escapes MUST contain exactly 4 hex digits: 4 to ease parsing, and to match Java string characters, which are 16 bits wide.

Macros

Pegger introduces macros for useful extensions. These may be used individually as rules.

name        function
\sos        Matches Start-Of-Stream (or Start-Of-String!?) - does not consume chars
\eos        Matches End-Of-Stream (or End-Of-String!?) - does not consume chars
\sol        Matches Start-Of-Line (also matches \sos) - does not consume chars
\eol        Matches End-Of-Line (also matches \eos) - does not consume chars
\lower      Matches a lowercase character in the current locale
\upper      Matches an uppercase character in the current locale
\alpha      Matches an alphabetic character in the current locale
\pass       Always passes the Match
\fail       Always fails the Match
\err(xxx)   Throws a PegParseErr with msg xxx when processed
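
For example, a hypothetical rule that matches a whole line of digits and fails fast with a PegParseErr should anything else appear before the end of the line:

digits = \sol [0-9]+ (\eol / \err(Expected digits only))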

Comments

Hash # comments are allowed in grammar, as are double slash // comments.

#  Hash comments are allowed in grammar
// as are double slash comments

a = .  // end-of-line comments also allowed

b = [cd]
    // comments may also appear in between rules
    [de]

PEG Grammar

Interestingly, PEG grammar may itself be expressed as PEG grammar. And indeed, Pegger does use itself to parse your PEG definitions!

PEG grammar:

grammar       <- (!\eos (commentLine / ruleDef / \err(FAIL)))+
ruleDef       <- exclude:"-"? ruleName debugOff:"-"? cwsp* ("=" / "<-") cwsp* rule commentLine?
ruleName      <- [a-zA-Z] (("-" [a-zA-Z0-9_]) / [a-zA-Z0-9_])*
rule          <- firstOf / \err(FAIL)
firstOf       <- sequence (cwsp* "/" cwsp* sequence)*
sequence      <- expression (cwsp* expression)*
expression    <- predicate? (label ":")? type multiplicity?
label         <- [a-zA-Z] [a-zA-Z0-9_\-]*
type          <- group / ruleName / literal / chars / macro / dot
-group        <- "(" cwsp* rule cwsp* ")"
predicate     <- "!" / "&"
multiplicity  <- "*" / "+" / "?" / ("{" min:[0-9]* (com:"," max:[0-9]*)? "}")
literal       <- singleQuote / doubleQuote
-singleQuote  <- "'" (unicode / ("\\" .) / [^'])+ "'" "i"?
-doubleQuote  <- "\"" (unicode / ("\\" .) / [^"])+ "\"" "i"?
chars         <- "[" (unicode / ("\\" .) / [^\]])+ "]" "i"?
macro         <- "\\" [a-zA-Z]+ ("(" [^)\n]* ")")?
unicode       <- "\\" "u" [0-9A-F]i [0-9A-F]i [0-9A-F]i [0-9A-F]i
dot           <- "."
-commentLine- <- sp* (nl / comment)
-comment-     <- ("#" / "//") (!\eos [^\n])* nl
-cwsp-        <- sp / (!\eos cnl (sp / &("#" / "//")))
-cnl-         <- nl / comment
-sp-          <- [ \t]
-nl-          <- \eos / "\n"

Recursive / HTML Parsing

A well-known limitation of regular expressions is that they cannot match nested patterns, such as HTML. (See StackOverflow for an explanation.) Pegger to the rescue!

Because PEGs contain rules that may reference themselves in a circular fashion, it is possible to create recursive parsers.

Below is an example that parses nested HTML tags. You can see the recursion in the element definition, which references itself:

grammar := Peg.parseGrammar("element  = startTag (element / text)* endTag
                             startTag = '<'  name:[a-z]i+ '>'
                             endTag   = '</' name:[a-z]i+ '>'
                             text     = [^<]+")

html    := "<html><head><title>Pegger Example</title></head><body><p>Parsing is Easy!</p></body></html>"

grammar["element"].match(html).dump
// or equivalently:
Peg(html, grammar["element"]).match.dump

Which outputs the following result tree:

element
 ├─ startTag
 │   └─ name : "html"
 ├─ element
 │   ├─ startTag
 │   │   └─ name : "head"
 │   ├─ element
 │   │   ├─ startTag
 │   │   │   └─ name : "title"
 │   │   ├─ text : "Pegger Example"
 │   │   └─ endTag
 │   │       └─ name : "title"
 │   └─ endTag
 │       └─ name : "head"
 ├─ element
 │   ├─ startTag
 │   │   └─ name : "body"
 │   ├─ element
 │   │   ├─ startTag
 │   │   │   └─ name : "p"
 │   │   ├─ text : "Parsing is Easy!"
 │   │   └─ endTag
 │   │       └─ name : "p"
 │   └─ endTag
 │       └─ name : "body"
 └─ endTag
     └─ name : "html"

Parsing has never been easier!

Note that only Chuck Norris can parse HTML with regular expressions.

Converting the match results into XML is left as an exercise for the user, but there are a couple of options open to you:

1. Looping

This is usually the easiest way to convert your match results, but not always the cleanest.

It involves looping over the match results, just as you would with any other list, and recursively calling your own method to convert the inner data.
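
A rough sketch of the looping approach for the HTML example above. (Note the name slot used to read the matched rule name is an assumption of this sketch - check the Match API for the exact slot.)

// print an indented outline of the tag names in the HTML match results;
// ASSUMPTION: 'match.name' returns the name of the rule / label that matched
Void printTags(Match match, Int indent := 0) {
    if (match.name == "startTag")
        echo(Str.spaces(indent) + match["name"].toStr)
    match.matches.each { printTags(it, indent + 2) }
}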

2. Walking

Create a walk() method that recursively calls a given function every time it steps into, or out of, a match:

Void walk(Match match, |Match, Str startOrEnd| fn) {
    fn(match, "start")                      // step into the match
    match.matches.each { walk(it, fn) }     // recurse into the sub-matches
    fn(match, "end")                        // step back out of the match
}
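
A usage sketch, printing each match of the HTML example above as it is stepped into:

match := grammar["element"].match(html)    // grammar + html from the HTML example above
walk(match) |m, startOrEnd| {
    if (startOrEnd == "start")
        echo(m.toStr)
}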

Custom Rules

Custom rules may be created by subclassing the Rule class.

There is (currently) no support for introducing custom rules in PEG grammar notation, so custom rules may only be used in programmatic Fantom-based grammars.

Debugging

When debug logging is enabled, Pegger will spew out a lot of debug / trace information. (Possibly more than you can handle!) Note that it will only emit debug information for rules with names or labels.

Enable debug logging with the line:

Peg.debugOn

// or

Log.get("afPegger").level = LogLevel.debug

Which, for the above HTML parsing example, will generate the following:

[afPegger] [  1] --> element - Processing: startTag (element / text)* endTag with: <html><title>Pegger Ex...
[afPegger] [  2]  --> startTag - Processing: "<" [a-zA-Z]+ ">" with: <html><title>Pegger Ex...
[afPegger] [  2]    > startTag - Passed!
[afPegger] [  2]    > startTag - Matched: "<html>"
[afPegger] [  4]    --> element - Processing: startTag (element / text)* endTag with: <title>Pegger Example<...
[afPegger] [  5]     --> startTag - Processing: "<" [a-zA-Z]+ ">" with: <title>Pegger Example<...
[afPegger] [  5]       > startTag - Passed!
[afPegger] [  5]       > startTag - Matched: "<title>"
[afPegger] [  7]       --> element - Processing: startTag (element / text)* endTag with: Pegger Example</title>...
[afPegger] [  8]        --> startTag - Processing: "<" [a-zA-Z]+ ">" with: Pegger Example</title>...
[afPegger] [  8]          > startTag - Did not match "<".
[afPegger] [  8]          > startTag - Failed. Rolling back.
[afPegger] [  7]         > element - Did not match startTag.
[afPegger] [  7]         > element - Failed. Rolling back.
[afPegger] [  7]       --> text - Processing: (!"<" .)+ with: Pegger Example</title>...
[afPegger] [  7]         > text - Rule was successfully processed 14 times
[afPegger] [  7]         > text - Passed!
[afPegger] [  7]         > text - Matched: "Pegger Example"
...
...
...

Design notes

Pegger was purposefully designed to match strings, or more specifically, Unicode character data (whose byte length depends on the encoding), not binary data. Hence it lacks notation for matching bytes and byte ranges, as seen in some RFC documents.

If required, hexadecimal Unicode escape sequences (\uXXXX) may be used to represent 8-bit binary data.