## Grammar Specification Options

The `Grammar.Lex` and `Grammar.Syntax` parts are similar in notation and functionality to Parsing Expression Grammars (PEGs); (extended) BNF-like notation can also be used (see below).
### Contents
- Extra Settings
- Style Model
- Fold Model (not available)
- Lexical Model
- Syntax Model (optional)
- Parser
- Modularity and Future Directions
### Extra Settings

`Grammar.RegExpID` defines the prefix ID for any regular expressions (represented as strings) used in the grammar. The first character after the `RegExpID` prefix is taken as the regular expression delimiter (similar to PHP regex usage); this way regular expression flags can be appended (mostly the case-insensitive flag `i`).
Example:

```javascript
"RegExpID" : "RegExp::",

// .. various stuff here

// define a regular expression /[abc]+/i, i.e [abc]+ with the case-insensitive flag
// note: the delimiters ( /, / ) are NOT part of the regular expression
// regular expression syntax and escaping are the same as for regexes defined
// with the RegExp() object in javascript
"aToken" : "RegExp::/[abc]+/i",

// define a regular expression /[def]+/i, i.e [def]+ with the case-insensitive flag
// note: different delimiters are used ( #, # )
"anotherToken" : "RegExp::#[def]+#i",

// .. other stuff here
```
`Grammar.Extra` defines any (editor-specific) extra settings (e.g `electricChars`, `fold`, etc..) to be added to the generated mode, and is a map of the form:

`editor_specific_setting_id -> editor_specific_setting_value`
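For instance, a minimal sketch of an `Extra` map (the `electricChars` value here is hypothetical; its format depends on the target editor):

```javascript
"Extra" : {
    // hypothetical editor-specific setting; value format depends on the target editor
    "electricChars" : "{}",
    // generic code-folding type (see the Code Folding section below)
    "fold" : "brace"
}
```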
#### Code Folding (new, optional)

Generic, editor-independent code folding functionality is supported (by generic folder implementations). Add a fold type in the `Grammar.Extra."fold"` option.

Generic folding types supported:

- `"brace"` / `"cstyle"`: folds on braces (i.e `{}`) and brackets (i.e `[]`)
- `"indent"` / `"indentation"`: folds on blocks delimited by the same indentation (e.g as in `python`)
- `"markup"` / `"xml"` / `"html"`: folds based on `xml`/`markup` tags and `CDATA` blocks

NOTE: Folding `"block"`-type comments, if they exist and are defined as such, is done automatically; there is no need to add a separate folder option.

NOTE: One may combine multiple different code folders with the `+` operator.
For example:
"Extra" : {
// combine multiple code folders in order by "+",
// here both indentation-based and brace-based (each one will apply where applicable, in the order specified)
"fold" : "indentation+brace"
}
### Style Model

The `Grammar.Style` model defines the mapping of tokens to editor styles and is a map of the form:

`token_id -> editor_specific_style_tag`
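For instance, a minimal sketch (the style tag values here are hypothetical and depend on the target editor):

```javascript
"Style" : {
    // token_id -> editor-specific style tag (hypothetical values)
    "comment"    : "comment",
    "identifier" : "variable",
    "number"     : "constant.numeric"
}
```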
### Fold Model (not available)

In the future, a parametrisable `Grammar.Fold` model may be supported, which can parametrise code folders for user-defined custom code folding (see above).
### Lexical Model

The `Grammar.Lex` model defines the mapping of token patterns and token configuration to an associated `token_id` and is a map of the form:

`token_id -> token_configuration_object`

A token configuration can be:
- a `pattern` or `array of patterns`
- an `object` having at least (some or all of) the following properties:
    - `"type"`: token type (default `"simple"`)
    - `"msg"`: token error message (default: the token's default error message)
    - `"tokens"`: `pattern` or `array of patterns` for this token
    - further properties depending on the token type (see below)
- optionally, the token `"type"` can be annotated inside the `token_id` (see below)
- a token type can be `"simple"` (default), `"block"`, `"line-block"`, `"escaped-block"`, `"escaped-line-block"`, `"comment"`, `"action"`
- a token can extend / reference another token using the `extend` property; this way two tokens that share common configuration but use different styles (depending on context) need to be defined only once. Examples are tokens for `identifiers` and `properties`: while both have similar tokenizers, the styles (and possibly other properties) can differ (see the syntax notations for a more convenient and more flexible alternative)
For example:

```javascript
// Style
//
// .. stuff here..
"identifier" : "style1",
"property" : "style2",
// ..other stuff here..

// Lex
//
// .. stuff here..
"identifier": "RegExp::/[a-z]+/",
"property": {
    // allow the property token to use different styles than identifier,
    // without duplicating everything
    "extend" : "identifier"
},
// ..other stuff here..
```
#### Simple Tokens

- a literal `null` valued token matches up to end-of-line; this can be useful when the remaining code up to end-of-line needs to be matched (e.g see single-line comments below)
- a literal `false` or `0` valued token matches the `empty production`; this can be useful in defining syntax sequences that can repeat (it can be used as an alternative to the `"zeroOrMore"`, `"oneOrMore"` group types, plus it gives a familiar feel to defining rule productions)
- a literal `empty string` token (`''`) matches `non-space`; this can be useful when multiple tokens should be consecutive with no space between them
- a literal `string` becomes a token (e.g inside a `Syntax` model sequence) with a `tokenID` equal to its literal value
- a token can be defined using just the `token_id` and the token pattern(s); the token type is assumed to be `"simple"`
- multiple `"simple"` tokens (which are NOT regular expressions) are grouped into one regular expression by default, using the `"\\b"` (word-boundary) delimiter; this is useful for fine-tuning the parser's speed; adding the `"combine"` property in the token configuration can alter this option or use a different delimiter
- `"simple"` tokens can also be used to enable keyword autocomplete functionality (`"autocomplete": true` option); see the sketch after this list
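A minimal sketch of a few `"simple"` tokens (the token names and patterns here are hypothetical):

```javascript
// Lex (hypothetical token names and patterns)
"keyword": {
    // enable keyword autocomplete for these words
    "autocomplete": true,
    // non-regex simple tokens are combined into one regex, using "\\b" by default
    "tokens": ["if", "else", "for", "while"]
},
// defined with just the token_id and a pattern; type "simple" is assumed
"identifier": "RegExp::/[_A-Za-z][_A-Za-z0-9]*/"
```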
#### Block Tokens

- `"block"`, `"line-block"`, `"escaped-block"`, `"escaped-line-block"`, `"comment"` token types take pairs of patterns `[start-pattern, end-pattern]` (a sketch is given after this list)
- if `"end-pattern"` is missing, `"end-pattern"` is the same as `"start-pattern"`
- if `"end-pattern"` has the (literal) value `null`, `"end-pattern"` matches whatever remains up to `end-of-line` (e.g in single-line comment blocks)
- if `"end-pattern"` is a `string`, it can contain special string replacement patterns; the `"end-pattern"` will be generated dynamically from the respective `"start-pattern"` by replacement (provided `start-pattern` is a regex); make sure to properly escape literal `"$"` values in order to match them correctly (see previous link)
- if `"end-pattern"` is a `number`, the `"end-pattern"` is generated dynamically from the respective `"start-pattern"` match group (provided `start-pattern` is a regex)
- the `"escaped-block"`, `"escaped-line-block"` types are `"block"` types which can contain their `"end-pattern"` if it is escaped; `"strings"` are classic examples
- `"block"`, `"comment"`, `"escaped-block"` tokens can span multiple lines by default; setting the `"multiline": false` option can alter this
- the `"escaped-line-block"`, `"line-block"` types are `"block"` types which can span only a single line (`"multiline": false`)
- `"escaped-block"`, `"escaped-line-block"` tokens use the `"\\"` escape character by default; setting the `"escape": escChar` option can alter this
- the `"comment"` type is a `"block"` type with the additional semantic information that the token represents comments, so comment-toggle and comment-interleave functionality can be enabled
- `"comment"` tokens can be interleaved inside other syntax sequences automatically by the parser (`"interleave": true` token option); the comment token should still be added in the `grammar.Parser` part; otherwise, comment interleaving can be handled manually in the `grammar.Syntax` part
- all block-type tokens can have different styles for the `block delimiters` and the `block interior` (see examples); for example, for a block token with ID `"heredoc"`, the different interior style can be represented in the `Style` part of the grammar as `"heredoc.inside"` (make sure your `tokenIDs` do not accidentally match this option)
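A minimal sketch of a few block-type tokens (the token names here are hypothetical):

```javascript
// Lex (hypothetical token names)
"comment": {
    "type": "comment",
    // let the parser interleave comments inside other syntax sequences
    "interleave": true,
    "tokens": [
        ["/*", "*/"],   // multiline block comment: [start-pattern, end-pattern]
        ["//", null]    // a null end-pattern matches up to end-of-line
    ]
},
"string": {
    "type": "escaped-block",
    "escape": "\\",        // the default escape character
    "multiline": false,
    "tokens": [["\"", "\""], ["'"]]  // a missing end-pattern equals the start-pattern
},
"heredoc": {
    "type": "block",
    // a numeric end-pattern is generated from the start-pattern match group
    "tokens": [["RegExp::/<<(\\w+)$/", 1]]
}
```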
#### Action Tokens (new, experimental)

`Action` tokens enable the grammar parser to perform some extra, context-specific parsing functionality on tokens.

An `action` token in a grammar applies only and directly to the token preceding it; it performs an action on that token only.

- `"action"` tokens can `push` or `pop` string `IDs` onto/from the data stack, generated from the (preceding) matched token; for example, matching of associated tags can be done this way (see `test/grammars/xml.js` for an example)
- `"action"` tokens can `check` that the (preceding) matched token is unique; for example, checking for unique identifiers can be done this way (see `test/grammars/xml.js` for an example)
- `"action"` tokens can `start` (`"context": true`) and `end` (`"context": false`) a new dynamic `context`, so that successive actions take place in that context; for example, unique object literal properties and unique xml tag attributes can be done this way (see `test/grammars/xml.js` for an example)
- `"action"` tokens can generate a hard `error` in context (`"error": "error message"`); for example, special syntax errors can be modeled in the grammar as needed (see `test/grammars/xml.js` for an example)
- .. more actions are to be added, like `indent` / `outdent` etc..
- multiple `"action"` tokens in sequence are applied to the same preceding token
Example:

```javascript
// stuff here..
"match" : {
    // this will push a token ID generated from the matched token;
    // it pushes the matched tag name (as regex group 1)
    // string replacement codes are similar to javascript's replace function
    "type" : "action",
    "push" : "<$1>"
},
"matched" : {
    // this will pop a token ID from the stack
    // and try to match it to the ID generated from this matched token;
    // it pops and matches the tag name (as regex group 1)
    // string replacement codes are similar to javascript's replace function
    // note 1: the ID has a similar format to the previously pushed ID;
    // any format can be used as long as it is consistent (so matching will work correctly)
    // note 2: a "pop" : null, or with no value, pops the data unconditionally (can be useful sometimes)
    "type" : "action",
    "pop" : "<$1>",
    "msg" : "Tag \"$1\" does not match!"
},
// other stuff here..

// syntax part
"start_tag": "open_tag match tag_attribute* close_open_tag",
"end_tag": "close_tag matched"
// other stuff here..
```
#### Lex shorthand type annotations (new, optional)

Lexical tokens can annotate their `type` in their `token_id`, as `"token_id:token_type"`, for convenience.

Example:

```javascript
"a_token:comment": [["<--", "-->"]]
// is equivalent to =>
"a_token": {
    "type": "comment",
    "tokens": [["<--", "-->"]]
}

"a_token:action": {"push":"$1"}
// is equivalent to =>
"a_token": {
    "type": "action",
    "push": "$1"
}
// and so on..
```
### Syntax Model (optional)

The `Grammar.Syntax` model defines the mapping of context-specific token sequences to an associated `composite_token_id` and is a map of the form:

`composite_token_id -> composite_token_configuration_object`
- Inside the `Syntax` model, (single) tokens can also be defined (similar to the `Lex` part); however, it is recommended that (single) tokens be defined in the `Lex` part
- `Syntax` includes special token types (like generalised regular expressions for composite token sequences, or parsing expressions in the style of `PEG`s)
- a `Syntax` token `"type"` can be: `"alternation"`, `"sequence"`, `"zeroOrOne"`, `"zeroOrMore"`, `"oneOrMore"`, `"repeat"`; these tokens contain sequences of subtokens (`"tokens"`) to be matched according to some scheme
- types:
    - `"type":"repeat"`, with `"repeat":[min, max]`: match any of the tokens a minimum of `min` times and a maximum of `max` times, else error (analogous to the regex `(t1 | t2 | t3..){min, max}`, where `t1`, `t2`, etc are also composite tokens)
    - `"type":"zeroOrOne"`: match any of the tokens `zero or one` times (analogous to the regex `(t1 | t2 | t3..)?`, where `t1`, `t2`, etc are also composite tokens)
    - `"type":"zeroOrMore"`: match any of the tokens `zero or more` times (analogous to the regex `(t1 | t2 | t3..)*`, where `t1`, `t2`, etc are also composite tokens)
    - `"type":"oneOrMore"`: match any of the tokens `one or more` times (analogous to the regex `(t1 | t2 | t3..)+`, where `t1`, `t2`, etc are also composite tokens)
    - `"type":"alternation"`: match any one of the tokens (analogous to the regex `(t1 | t2 | t3..)`, where `t1`, `t2`, etc are also composite tokens)
    - `"type":"sequence"`: match all the tokens in sequence (analogous to the regex `(t1 t2 t3 ..)`, where `t1`, `t2`, etc are also composite tokens), else error
- a syntax token can contain (direct or indirect) `recursive references` to itself (note: some rule factoring may be needed to avoid grammar `left-recursion` or `ambiguity`)
- an `"ngram"` or `"n-gram"` syntax token type is similar to a `"type":"sequence"` type, with the difference that it is only matched optionally (suitable for use in the `grammar.Parser` part)
- a syntax token in `Parser` that is wrapped in `[..]` (i.e as an `array`) is shorthand notation to interpret it as an `n-gram`-type token (for convenience of notation; it can be seen in various example grammars)
- it is recommended to have only `Lex.tokens` or `Syntax.ngrams` in the `grammar.Parser` part, and not `Syntax.group` tokens, which are mostly auxiliary
- the `grammar.Syntax` part is quite general and flexible and can define a complete language grammar; however, since this is for syntax highlighting and not for code generation, defining only the necessary syntax chunks can be lighter (a sketch of explicit composite tokens follows this list)
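For instance, a minimal sketch of composite tokens in explicit object form (the token names here are hypothetical):

```javascript
// Syntax (hypothetical token names)
"assign_expr": {
    "type": "sequence",
    "tokens": ["identifier", "equals", "expression"]
},
"expression": {
    "type": "alternation",
    "tokens": ["number", "string", "identifier"]
},
"expression_list": {
    "type": "zeroOrMore",
    "tokens": ["expression"]
}
```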
#### Syntax PEG/BNF-like notations (new)

The `Syntax` part supports shorthand definitions (similar to `PEG` / `BNF`-style definitions) for syntax sequences and groups of syntax sequences (see below).

Note: In order for a (grammar) specification for (programming) language highlighting to be detailed and self-contained (i.e re-usable and flexible under the most general conditions), it should model all available types of tokens which play a part in the code and the way it is highlighted (usually not modeled by other approaches or other grammar specifications), for example:

- `<start-of-file>` token (modeled as `^^`, see below)
- `<first-non-blank-line>` token (modeled as `^^1`, see below)
- `<start-of-line>` token (modeled as `^`, see below)
- `<end-of-line>` token (modeled as `$`, see below)
- `<up-to-line-end>` token (modeled as `null`, see `block` tokens above)
- `<non-space>` token (modeled as `''`, see below)
- `<indentation>` token (not available yet)
- `<de-indentation>` token (not available yet)
- and so on..
Specifically:

```javascript
"t": "t1 | t2 | t3"
// is equivalent to =>
"t": {
    "type": "alternation",
    "tokens": ["t1", "t2", "t3"]
}

"t": "t1*"
// is equivalent to =>
"t": {
    "type": "zeroOrMore",
    "tokens": ["t1"]
}

"t": "t1+"
// is equivalent to =>
"t": {
    "type": "oneOrMore",
    "tokens": ["t1"]
}

"t": "t1?"
// is equivalent to =>
"t": {
    "type": "zeroOrOne",
    "tokens": ["t1"]
}

"t": "t1{1,3}"
// is equivalent to =>
"t": {
    "type": "repeat",
    "repeat": [1,3], // match a minimum of 1 and a maximum of 3 times
    "tokens": ["t1"]
}

"t": "t1* | t2 | t3"
// is equivalent to =>
"t1*": {
    "type": "zeroOrMore",
    "tokens": ["t1"]
},
"t": {
    "type": "alternation",
    "tokens": ["t1*", "t2", "t3"]
}

"t": "t1* t2 | t3"
// is equivalent to =>
"t1*": {
    "type": "zeroOrMore",
    "tokens": ["t1"]
},
"t1* t2": {
    "type": "sequence",
    "tokens": ["t1*", "t2"]
},
"t": {
    "type": "alternation",
    "tokens": ["t1* t2", "t3"]
}

// tokens can be grouped using parentheses
"t": "(t1 t2)* | t3"
// is equivalent to =>
"t1 t2": {
    "type": "sequence",
    "tokens": ["t1", "t2"]
},
"(t1 t2)*": {
    "type": "zeroOrMore",
    "tokens": ["t1 t2"]
},
"t": {
    "type": "alternation",
    "tokens": ["(t1 t2)*", "t3"]
}

// literal tokens wrapped in quotes (' or ") (e.g 'abc') are equivalent to their literal value (i.e abc)
// literal tokens wrapped in brackets (e.g [abc]) are equivalent to matching any of the enclosed characters (like in regular expressions)
// note: LIKE REGULAR EXPRESSIONS, having ^ first in a character selection, e.g [^abc], provides a NEGATIVE match
// note 2: the PEG features
// 1. negative lookahead (i.e the not-predicate !t)
// 2. positive lookahead (i.e the and-predicate &t)
// are currently NOT supported

// regular expressions can also be defined inside a syntax rule, using the /../[i] format, e.g /abc/, /abc/i
// this will create a new simple token (with token_id=/abc/i) which is a regular expression
// using a dotted-style modifier on a regex defined this way can style it dynamically (see below), e.g /abc/i.style1

// an empty literal token w/ quotes (i.e '') matches the NON-SPACE production (i.e fails if space is encountered)
// a zero literal token w/o quotes (i.e 0) matches the EMPTY production (i.e always succeeds)
// the ^^ literal token w/o quotes matches SOF (i.e start-of-file, first line of code)
// the ^^1 literal token w/o quotes matches FNBL (i.e first non-blank line of code)
// the ^ literal token w/o quotes matches SOL (i.e start-of-line, any line)
// the $ literal token w/o quotes matches EOL (i.e end-of-line, any line, along with any extra space)

// e.g
"t": "t1 '=' t2"
// is equivalent to =>
"t_equal": {
    "type": "simple",
    "tokens": "="
},
"t": {
    "type": "sequence",
    "tokens": ["t1", "t_equal", "t2"]
}

// any (simple or composite) token followed by a dotted-style modifier uses this style in context
// for example, the token "t3", below, will be styled differently depending on context,
// i.e whether it comes after "t1" or after "t2"
// ..
// Style..
"Style": {
    "style1": "an-editor-style",
    "style2": "another-editor-style"
}
// ..
// Syntax..
"t": "t1 t3.style1 | t2 t3.style2"

// the style modifier also works for composite tokens,
// i.e below, both "t2" and "t3" tokens will be styled with "style1", if after "t1"
// note: ".style1" overrides the internal "t2.style2" modifier as well
"t": "t1 (t2.style2 t3).style1"
// and so on..
// ..
```
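As a small illustration of the positional tokens above, hypothetical shorthand rules might look like the following (the rule names and patterns are assumptions, not from the source):

```javascript
// hypothetical rules using positional tokens
"shebang": "^^ '#!' /.*/",   // matches only on the first line of the file
"heading": "^ '#' /.*/ $"    // matches a whole line starting with '#'
```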
### Parser

`Grammar.Parser` defines what to parse and in what order (only patterns defined in this part of the grammar will actually be parsed; everything else is auxiliary).

- a syntax token used in the `parser` which is enclosed in brackets `[..]`, i.e as an `array`, is interpreted as an `n-gram`-type token (see the syntax part, above); this is used for convenience and succinctness of notation, instead of creating separate `n-gram` token(s) for use in the `parser` part (a sketch follows)
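A minimal sketch of a `Parser` part (the token names here are hypothetical and assumed to be defined in the `Lex` / `Syntax` parts):

```javascript
// Parser (hypothetical token names)
"Parser": [
    "comment",                  // Lex token, parsed as-is
    "string",
    ["start_tag", "end_tag"],   // wrapped in [..] => interpreted as n-gram-type token(s)
    "keyword",
    "identifier"
]
```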
### Modularity and Future Directions

The model envisioned for modular highlighting is shown below:

- The user specifies a `specification` (a `grammar` model) for a language
- The specification, along with the source code (to be highlighted), is passed to the `analyser` (`parser`), which uses the specification to analyse the code and extract data (see below)
- The analysed code data and the source code are passed to the `renderer` (`editor`, e.g `codemirror`, `ace`, ..), which renders the code and interacts with the user

More or less, this model is used now by highlighting editors (e.g `codemirror`, `ace`, ..); however, it is not modular.

The new `specification-analyser-renderer` model has the following features / advantages:

- The specification is at the same time both concise and detailed, and self-contained (this is achieved by modeling all necessary features of a language in a symbolic way, see below).
- Specifications can be extended / merged / combined to construct new specifications for variations, dialects, and composite languages, with minimum effort (this is achieved by the appropriate format for a specification and how/what it models, see below).
- The model is based on interfaces for the analyser and the renderer, while (underlying) implementations can vary as needed. This decouples highlighting specifications from underlying implementations and differences between platforms, editors, languages and so on..
- Optionally, a specification can be directly transformed into hard code (highlight-mode source code for a language) and be used directly (in place of the analyser) without re-doing the first two stages (`specification - analyser`).

The format for a specification should be widely supported, be textual (or have an equivalent strictly-textual representation), be concise, and enable composition operations. `JSON` is such a format (and is ubiquitous in javascript-based projects), and it is the format used by the `grammar` (`specification`).
In order for a (grammar) specification for (programming) language highlighting to be detailed and self-contained (i.e re-usable and flexible under the most general conditions), and also concise, it should model all available types of tokens/actions which play a part in the code and the way it is highlighted (usually not modeled by other approaches or other grammar specifications).

For example (see above):

- `<start-of-file>`, `<start-of-line>`, `<end-of-line>` tokens
- `<first-non-blank-line>`, `<up-to-line-end>` tokens
- `<non-space>`, `<indentation>`, `<dedentation>` tokens
- `<block>`, `<delimited>` tokens (e.g in the sense that `comments`, `strings`, `heredocs`, .. are delimited, block tokens)
- and so on..
- handle arbitrary, user-defined, comment-toggle and keyword-autocompletion functionality (achieved by semantically annotating language `comments` and `keywords`)
- handle arbitrary, user-defined, lint-like syntax-annotation functionality (achieved by modeling language `syntax` analysis, beyond strictly `lexical` analysis, in context)
- handle arbitrary, user-defined, code `folding` (e.g via a `fold action` token or via a `Fold` model, see above)
- handle arbitrary, user-defined, code `(out-)indentation` (e.g via an `indent action` token or via an `Indentation` model)
- handle arbitrary, user-defined, code `matching` (e.g `brackets`, `tags`, etc..) via a `match action` token
- handle arbitrary, user-defined, dynamic contexts via a `context action` token
- handle arbitrary, user-defined, unique identifiers via a `unique action` token
- handle arbitrary, user-defined, `(operator) precedence` relations via a `precedence action` token
- handle arbitrary, user-defined, `local/global/scoped` relations via a `scope action` token
- and so on..
The difference between the `<indentation>` token and the `<indent>` action token (although related) is that the `indentation` token recognizes indentations and indentation changes (that may signal a code block, for example like braces do), while an `indent` action token creates dynamic indentation (e.g by inserting `indentation` tokens, or symbols to be recognised by `indentation` tokens).