peg 2.0.2 copy "peg: ^2.0.2" to clipboard
peg: ^2.0.2 copied to clipboard

A command line tool for generating (streaming, chunk, file) top-down parsers from a parsing expression grammars (PEG).

peg #

A command line tool for generating (streaming, chunk, file) top-down parsers from a parsing expression grammars (PEG).

Version: 2.0.2

Pub Package GitHub Issues GitHub Forks GitHub Stars GitHub License

About this software #

This software does not contain a public API because it is a console application (command line tool).
This tool is intended to generate the source code of PEG parsers.
Before using this tool, it is necessary to activate it using the package manager pub.

Activation and usage #

To activate this command line tool, run the following command:

dart pub global activate peg

After activation, you can use the following command to use the command line tool:

dart pub global run peg

Source code of the generated parser #

All generated parsers support the following features:

  • Parse input data using events (event-based parsing) to save memory consumption
  • Parse directly from files without having to load input data into memory
  • Very efficient input data parsing in terms of performance and memory consumption

The source code of the generated parser consists of the following parts:

  • Source code of the parser class (generated from grammar rules)
  • Source code of the parser class members (defined by the grammar)
  • Source code of library members (defined by grammar)
  • Runtime source code

The source code of the parser class members and library members are specified directly in the grammar and are used for two purposes:

  • Ensuring the operability of the parser (for example, import directives)
  • Extending the capabilities and optimization of the parser by extending the capabilities of the instance of the parser class

Grammar #

To declare expressions, a syntax is used that is basically similar to the generally accepted syntax of PEG expressions and, at the same time, additional syntax is used to expand the capabilities of PEG expressions.
A modified syntax is used to declare production rules.
The grammar declaration uses its own syntax.

This generator generates itself from a grammar written using its own syntax.
For a more detailed acquaintance with the syntax, it is recommended to familiarize yourself with the syntax of the grammar used to generate this PEG parser.
https://github.com/mezoni/peg/blob/main/bin/peg_parser.peg

Grammar declaration is made using sections, like sections for a preprocessor, but at the same time, it should be noted that preprocessing is not performed and grammar processing (parsing) occurs in one stage.

3 sections are used to declare the grammar:

  • Section for declaring directives and global members
  • Section for declaring members of instances of the parser class
  • Section for declaring grammar rules

Example of a grammar declaration:

%{

import 'foo.dart';

}%

%%

const SimpleParser();

%%

A = [A-Za-z] * ;

The grammar must contain at least one production rule, which means that using a section to declare grammar rules is mandatory. The use of other sections is optional and is determined by the actual needs based on the chosen method of declaring the grammar.

Production rules #

The declaration of a production rule consists in specifying (in a certain sequence) the attributes of the rule and its body and consists of the following elements:

  • Metadata
  • Type of the returned result
  • Name of the production rule
  • Symbol =
  • Expression
  • Symbol ;

The type of the returned result and metadata are optional.
To specify the type of the returned result of the production rule, you must use the syntax of the Dart language.
The concept of metadata differs from that used in the Dart language and is used exclusively as instructions for the parser generator.

Example of a production rule declaration with result type and metadata:

- Metadata
bool
False = 'false' Spaces { $$ = false; } ;

@event
MapEntry<String, Object?>
KeyValue = k:Key Colon v:Value { $$ = MapEntry(k, v); };

Expressions #

Below is a list of available expressions.

Orderer choice #

Name: Orderer choice
Operator: /
Number of operands: 2 or more

Executes operands in the specified order, and if the execution of the next operand succeeds, this result is returned. If the execution of the next operand fails, then the next operand is executed and this happens until the execution of one of the operands succeeds. If all operands fail, then this expression fails.

Example:

'abc' / 'def'

Sequence #

Name: Sequence
Operator: not used
Number of operands: 2 or more

Executes, in the specified order, all operands, and if the execution of any operand fails, the expression immediately fails. If the execution of all operands succeeds, the result is returned, the value and type of which depend on whether semantic variables and/or semantic action were used .

Example:

'abc' 'def'

Optional #

Name: Optional
Operator: ?
Number of operands: 1

Executes the operand and if the execution of the operand succeeds, this result is returned. If the operand execution fails, then this expression succeeds and returns `null'.

Example:

'abc'?

Zero or more #

Name: Zero or more
Operator: *
Number of operands: 1

Cyclically executes the operand until the execution of the operand fails. During execution, adds the results of the operand execution in a list. After the loop is completed, the expression succeeds and a list of results is returned.

Example:

'abc'*

One or more #

Name: One or more
Operator: +
Number of operands: 1

Cyclically executes the operand until the execution of the operand fails. During execution, puts the results of the operand execution in a list. If the number of items in the list of results is at least 1, then the list of results is returned.

Example:

'abc'+

Repetition #

Name: Repetition
Operators: {min,max} or {min,} or {,max} or {n}
Number of operands: 1

⚠️ Important information:
The use of spaces in the operator body (i.e., between { and }) is not allowed and will lead to syntax error.

Example of misuse:

[0-9A-Zza-Z]{1, 4}

Operator {min,max}:

Cyclically executes the operand at least min and no more than max times. During execution, puts the results of the operand execution in a list. If (after the end of the loop) the number of items in the list of results is at least min, then the list of results is returned.

Example:

[0-9A-Zza-Z]{1,4}

Operator {min,}:

Cyclically executes the operand at least min times. During execution, puts the results of the operand execution in a list. If (after the end of the loop) the number of items in the list of results is at least min, then the list of results is returned.

Example:

[0-9A-Zza-Z]{2,}

Operator {,max}:

Cyclically executes the operand no more than max times. During execution, puts the results of the operand execution in a list. After completing the execution of the loop, a list of results is returned.

Example:

[0-9A-Zza-Z]{,4}

Operator {n}:

Cyclically executes the operand n times. During execution, puts the results of the operand execution in a list. If (after the loop is completed) the number of items in the result list is n, then the result list is returned.

Example:

[0-9A-Zza-Z]{4}

Literal #

Name: Literal
Operator: not used
Operands: string value enclosed in single quotes

Note: It is allowed to use an empty string as an operand. In this case, the expression always succeeds, without changing the current position.

Matches the input data at the current position with a string value. If the matching is successful, then the current position is incremented by the length of the string value and this string value is returned.

Example:

'abc'

Character class #

Name: Character class
Operator: not used
Operands is a character ranges enclosed between [ and ] or [^ and ]

Note: It is not allowed to use an empty range as an operand.

Operand: character ranges enclosed between [ and ]

Checks the character from the input data at the current position for occurrence in one of the specified ranges. If the check is successful, the current position is increased by the length of the current character and the current character is returned.

Example:

[0-9A-Za-z]

Operand is a character ranges enclosed between [^ and ]

Checks the character from the input data at the current position for non-occurrence in one of the specified ranges. If the check is successful, the current position is increased by the length of the current character and the current character is returned.

Example:

[^0-9A-Za-z]

Slice #

Name: Slice
Operator: $
Number of operands: 1

Executes the operand and if the execution of the operand succeeds, the text corresponding to the start and end positions of the operand is returned.

Example:

$[a-z]*

Symbol #

Name: Symbol
Operator: not used
Operand: Name of the rule

Executes an operand (a rule with the specified name) and returns the result of executing the operand.

Example:

number

AndPredicate #

Name: AndPredicate
Operator: &
Number of operands: 1

Executes the operand and, if the execution of the operand succeeds, returns the result of the execution of the operand. Upon completion of the operand execution, this expression restores the current position of the input data (that is, the input data is not consumed).

Example:

&[a-z]+

NotPredicate #

Name: NotPredicate
Operator: !
Number of operands: 1

Executes the operand and, if the execution of the operand fails, returns `null'. Upon completion of the operand execution, this expression restores the current position of the input data (that is, the input data is not consumed).

Example:

![a-z]+

Meta expressions #

Grammar parsing expressions has a fairly extensive set of expressions, which is quite enough to create complex grammars. But, given the fact that grammar is the basis for generating a top-down parser, the existing set of expressions is not enough to describe a complex parser. Meta expressions, in this case, are a means to describe the behavior of the generated parser. Meta expressions extend the grammar with capabilities that are not present in it.
From the point of view of grammar, meta expressions should be considered as built-in production rules with certain behavior that cannot be implemented by existing expressions.

The following meta expression exist in the current version.

  • @errorHandler
  • @matchString
  • @sepBy
  • @stringChars
  • @verify

@errorHandler #

Name: @errorHandler
Parameters:

  • Processed expression
  • Source code of the handler

⚠️ Important information:
To avoid situations where error registration in the @errorHandlermeta expression handler may be performed incorrectly, the local variable ParseError? error is intended for this purpose. Also, to implement the ability to rollback last errors, the local variable rollbackErrors will be available.

The meta expression @errorHandler is intended to generate errors defined by the developer.
The capabilities of the error handler allow you to add new error and additionally (if necessary) roll back the latest errors (that is, replace the errors generated by the processed expression with another, more informative error).

Example:

HexNumber = @errorHandler(HexNumberRaw, {
    rollbackErrors = true;
    error = ErrorMessage(state.pos - state.failPos, 'Expected 4 digit hex number');
  }) ;

@matchString #

Name: @matchString
Parameters:

  • Source code for expression to get string value

The meta expression @matchString is intended to match a string value that can be obtained directly when the expression is executed (for example, a value obtained from the parser parameters).

Example:

Sep = @matchString({ separator }) ;

@sepBy #

Name: @sepBy
Parameters:

  • An expression representing an element
  • An expression representing an separator

The meta expression @sepBy parses zero or more occurrences of element, separated by separator and returns a list of elements as a result.

Example:

Elements = @sepBy(Element, Separator) ;

This was an example of an optimized version of the following expression.

v:(h:Element t:(Separator v:Element)* { $$ = [h, ...t]; })? { $$ = v ?? const []; }

Grammar optimization is achieved by reducing grammar expressions. Optimization of the generated code is achieved by reducing array creation operations.
As a result, the grammar becomes more readable, and the generated parser code works more efficiently.

@stringChars #

Name: stringChars
Parameters:

  • An expression representing an normalCharacters
  • An expression representing an escapeCharacter
  • An expression representing an escape

⚠️ Important information:
The expressions normalCharacters and escape must have a result type of String.

The @stringChars meta expression is intended to add the ability to efficiently parse string characters.
The term string characters in this case refers to the characters enclosed in quotation marks in string literals.

Example:

StringChars = @stringChars(
    $[\u{20}-\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}]+,
    [\\],
    (EscapeChar / EscapeHex)
  ) ;

String = '"' v:StringChars Quote ;

This was an example of an optimized version of the following implementation.

String
StringChars =
    $[\u{20}-\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}]+
  / [\\] v:(EscapeChar / EscapeHex) ;

String
String = '"' v:StringChars* Quote { $$ = v.join(); } ;

@verify #

Name: @verify
Parameters:

  • Processed expression
  • Source code of the handler

⚠️ Important information:
To avoid situations where error registration in the @verify meta expression handler may be performed incorrectly, the local variable ParseError? error is intended for this purpose. If this variable is set to a value in the handler, this will mean that the verification was completed unsuccessfully and this error must be registered.

The meta expression @verify is intended to support the implementation of certain functions of context-sensitive grammars.
Despite the fact that this meta expression looks at first glance as dependent on the processed expression, nevertheless it can also be used as an independent expression, in the case of using the processed expression, which always succeeds.
What is the meaning of the above statement?
As follows from the name of the specified meta expression, it verifies the result of the processed expression.
If the processed expression fails, the verifier handler is not executed.
If the processed expression succeeds, then the verifier handler performs two functions:

  • Verifies the value of the result of the processed expression
  • Either does nothing, or generates an error based on the verification results

This meta expression allows you to simply solve the problems that arise when creating context-dependent grammars.
It is recommended to use an empty Literal as an expression that always succeeds.
At the same time, any available data can be used as verification data (for example, user settings of the parser implemented by the developer).

Example of result verification:

Verify41 = @verify(Integer, { if ($$ != 41) { error = ErrorMessage(state.pos - pos, 'error'); } }) ;

Example of parser configuration verification:

Verify41 = @verify('', { if (!flag) { error = ErrorMessage(0, 'error'); } }) ;

Semantic variables and actions #

In this implementation of the PEG parser generator, the most complex expression (for generation) is considered to be the expression Sequence.

This expression is (conditionally) divided into two types:

  • Sequences of expressions with one expression
  • Sequences of expressions with more than one expression

A quite natural question arises: "Why is a sequence of expressions with one expression a sequence?".
The short answer to this question may sound something like this: "Because in the current implementation, only a sequence can have a semantic action."

And, since the semantic action, in this case, is not considered as an independent expression, this allows us to make semantic actions a full-fledged analogue of the map function, indicating the type of the returned result if required.
At the same time, this makes it easier to implement nested expressions, due to simple mapping.

Before explaining the principles of semantic variables and actions, it is necessary to explain to explain the principles of forming the results of executing Sequence expressions.

If the sequence consists of one element, then the result of executing this element is returned as a result.

Example:

A = 'abc';

The type of the returned value is String, the value is 'abc'.

If the sequence consists of more than one element and does not use semantic variables then a value with the List<Object?> type is returned as a result, in which the results of the elements are used as list elements.

Example:

A = 'abc' 'def';

The return value type is List<Object?>, the value is ['abc', 'def'].
Although the generator might infer the type of this expression as List<String>, it will return a value with the value List<Object?>. This is the current design and, in this case, this was done intentionally.

Semantic variables allow you to designate the results used and assign names to them.

Example:

A = a:'abc';

The type of the returned value is String, the value is 'abc'.
The use of a semantic variable, in this case, does not affect the type of result, because the sequence for forming the result contains one element.

Example:

A = a:'abc' b:'def';

The return value type is ({String a, String b}), the value is (a: 'abc', b: 'def').

A = OpenBrace v:Value CloseBrace';

The type of the returned value is determined by the type of the returned value by the expression Value.
The use of one semantic variable, in this case, indicates that the sequence, for the formation of the result, contains one element.

Semantic actions, as mentioned above, play the role of map functions and are used to form a result from a sequence of expressions. The use of semantic variables, in this case, is determined by the implemented logic of the sequence.

Example of using a semantic action without variables:

bool
False = 'false' Spaces { $$ = false; } ;

Since there is no need to use variables in the above example, they are not used.

Example of using a semantic action with variables:

MapEntry<String, Object?>
KeyValue = k:Key Colon v:Value { $$ = MapEntry(k, v); };

In the above example, the semantic action uses variables to form the final result, which, in essence, works similarly to the function of the function map(KeyType k, ValueType v).

Also, when using semantic shares, you can use the result type of the semantic shares value.

Example of using a semantic action with an indication of the type of result:

KeyValue = k:Key Colon v:Value <MapEntry<String, Object?>>{ $$ = MapEntry(k, v); };

Metadata #

Metadata is instructions for the generator. Metadata is optional.
Metadata can be specified when declaring rules by specifying them before the name of the rule.

Examples of specifying metadata.

@inline
Foo = 'foo' ;

The following instructions are supported in the current version:

  • @event
  • @inline
  • @memoize

The @event instruction indicates to the generator that for this rule, when parsing data, it is necessary to call the beginEvent and endEvent methods. Calling these methods actually allows for Event-based parsing.

The @inline instruction indicates to the generator that the source code of this production rule should not create a separate method in the parser class and should be generated as code embedded in the method that calls this rule.

The @memoize instruction is not implemented in the current version.

Streaming parsing #

A few words as an introduction #

The term streaming parsing means the possibility of asynchronous data parsing.
The term asynchronous parsing means the possibility of parsing data by chunks.
The term chunked parsing means analyzing data in parts as those parts become available.

The implementation of data parsing in parts differs from regular parsing in at least two ways:

  1. The entire input data for parsing is not available
  2. Backtracking is not possible without buffering of input data

The first specificity is not a difficult task to solve, since it can be solved using special parsing algorithms.
That is, it can be solved without any action on the part of the grammar developer.
The second specificity is not too easy to solve and requires special attention. In this case, buffering certain data. Both automatic and forced buffering.

What is buffering of input data #

Buffering is a way to ensure that sufficient input data is available.
At the same time, buffering solves two tasks:

  1. Accumulation of sufficient input data to perform parsing by expression
  2. Caching input data for backtracking if expression parsing fails

Accumulation of sufficient input data is a fairly simple task.
All expressions in this sense are self-sufficient and cope with this work themselves.

Caching input data for backtracking is not a trivial task.
The main problem is that not all expressions buffer input data and, in this case, there will simply be no (buffered) input data to perform backtracking.
But this does not mean that this is a difficult or insoluble task.

How and what do expressions buffer? #

Expressions buffer only the data necessary to ensure a backtracking if parsing fails.
But what would happen if all expressions were able to provide this?
If this happens then this would defeat the purpose of implementing streaming parsing and would result in the buffer gradually filling up with entire input data.

To minimize buffer overload, not all expressions buffer input data.

Below is a list of expressions and meta expressions and what they buffer.

  • The AndPredicate expression buffers the child expression
  • The AnyCharacter expression buffers itself
  • The Buffer expression buffers the child expression
  • The CharacterClass expression buffers itself
  • The ErrorHandlerClass meta expression does not buffer anything
  • The Literal expression buffers itself
  • The NotPredicate expression buffers the child expression
  • The OneOrMore expression does not buffer itself, buffers only the current child expression in the iteration
  • The Optional expression buffers the child expression
  • The OrderedChoice expression does not buffer anything
  • The Repetition{m,} at the beginning the expression buffers itself up to m iterations, and after that it buffers only the current child expression in the iteration
  • The Repetition{m,n} at the beginning the expression buffers itself up to m iterations, and after that it buffers only the current child expression in the iteration
  • The Repetition{,n} expression does not buffer itself, buffers only the current child expression in the iteration
  • The Repetition{n} expression buffers itself
  • The SepBy meta expression does not buffer itself, buffers only the current child expressions in the iteration
  • The Sequence expression does not buffer anything
  • The Slice expression buffers the child expression
  • The StringChars meta expression does not buffer itself, buffers only the current child expression (combination) in the iteration
  • The Symbol expression does not buffer anything
  • The Verify meta expression buffers the child expression
  • The ZeroOrMore expression does not buffer itself, buffers only the current child expression in the iteration

How then to implement the possibility of backtracking?
The answer to this question lies in the question itself (to a large extent).
If you need it, then force it.
It's not difficult to force it. The meta expression Buffer serves this purpose.
It would seem that everything is simple, but the question remains unclear in what cases to use forced buffering.

The reason to use forced buffering #

If one of the previous (alternative) expressions in the expression OrderedChoice can fail not at the current parse position, but after moving forward, then in this case such an expression must be buffered forcibly.
This must be done only directly in the expression OrderedChoice in accordance with the logic of the work of only this expression OrderedChoice.

The simplest example.

Choice =
    A B C # 1
  / A B   # 2
  / X     # 3
  ;
A = 'a' ;
B = 'b' ;
C = 'c' ;
X = 'x' ;

If parsing the expression #1 fails to parse the expression C, then the expression #2 may be left without input data if the buffer does not contain the required data (ab). Because they have already been successfully parsed during the partially unsuccessful parsing of the expression #1 and quite possibly also flushed from the buffer.
At the same time, this will not have any effect on the parsing of the X expression.

In this case, this is solved simply, which in principle does not mean that taking this into account is just as simple.

Choice =
   @buffer(A B C)
  / A B
  / X
  ;
A = 'a' ;
B = 'b' ;
C = 'c' ;
X = 'x' ;

That is, always in such cases.

Choice =
    @buffer(A B C)
  / @buffer(A B)
  / A
  / M N
  / @buffer(XY)
  / X
  ;

XY = X Y ;

In principle, this can be done in a way that is more convenient.
For example like this.

Choice =
    XY
  / X
  ;

XY = @buffer(X Y) ;

That is, if the subsequent expression begins with the same character as the previous one, then it is necessary to buffer the previous expression.
Perhaps in subsequent versions this will be implemented at the generator level, but nevertheless, the most reliable solution is to do this explicitly in the description of the grammar.

Code snippets #

CheckCondition

CheckCondition = @verify('', { if (!condition) { error = ErrorMessage(0, 'error'); } }) ;

Eof

Eof = !. ;

Peak

Peak = &foo ;

Position

int
Position = '' { $$ = state.pos; } ;

SeparatedList

SeparatedList = @sepBy(Elem, Sep) ;

SeparatedList1

<List<int>>
SeparatedList1 = h:[a] t:([,] v:@sepBy([a], [,]))? { $$ = [h, if (t != null) ...t]; }

SeparatedPair

SeparatedPair = k:Key Sep v:Value ;

TakeWhile

TakeWhile = $[0-9]* ;

TakeWhile1

TakeWhile1 = $[0-9]+ ;

Value

int
Value = '' { $$ = 41; };

Verify

Verify = @verify(Integer, { if ($$ > 0xff) { error = ErrorMessage(state.pos - pos, 'Some error message'); } } );

Examples of parsers #

List of parser examples:
CSV parser
Calc parser
JSON parser

Parsing #

The generated parser classes do not contain anything superfluous except the rules and class members defined by the developer.
For the convenience of working with parser classes, it is proposed to use functions for data parsing.
These are top-level functions and they are also in the parser library file.
These are not generated functions, but they are universal functions.

Below is a list of these functions.

Name: parseString
Purpose: Calls the specified parsing function for the specified string and throws a FormatException if parsing fails. If parsing is successful, returns the result.

Usage example:

   const source = '1 + 2 * 3';
   final parser = CalcParser();
   final result = parseString(parser.parseStart, source);
   print(result);

Name: fastParseString
Purpose: Calls the specified fast parsing function for the specified string and throws a FormatException if parsing fails. If successful, returns no result.

Usage example:

   const source = '1 + 2 * 3';
   final parser = CalcParser();
   fastParseString(parser.fastParseSpaces, source);

Name: parseInput
Purpose: Calls the specified parsing function on the specified input source and throws a FormatException if parsing fails. If parsing is successful, returns the result.

Usage example:

   const source = '1 + 2 * 3';
   final input = StringReader(source);
   final parser = CalcParser();
   parseInput(parser.parseStart, input);

Name: tryParse
Purpose: Calls the specified parse function on the specified input source and returns a ParseResult value.

Usage example:

   const source = '1 + 2 * 3';
   final input = StringReader(source);
   final parser = CalcParser();
   final result = tryParse(parser.parseSpaces, input);

Name: tryFastParse
Purpose: Calls the specified parse function on the specified input source and returns a ParseResult value.

Usage example:

   const source = '1 + 2 * 3';
   final input = StringReader(source);
   final parser = CalcParser();
   final result = tryFastParse(parser.fastParseSpaces, input);
2
likes
130
pub points
0%
popularity

Publisher

unverified uploader

A command line tool for generating (streaming, chunk, file) top-down parsers from a parsing expression grammars (PEG).

Repository (GitHub)
View/report issues

Topics

#parser #parser-generator #peg #top-down-parser

Documentation

API reference

License

BSD-3-Clause (LICENSE)

Dependencies

args, path, strings

More

Packages that depend on peg