JavaScript Parser Generators

May 14

Posted by Michael Ernst in Programming Stuff

When validating the content of an HTML text input field, it is often inefficient to trigger remote validation on every character change. Sometimes you can buffer the input so that validation only runs after a fixed time interval. In other scenarios a good solution is a validator that works entirely on the client side, using JavaScript. This blog post provides a quick overview and a simple tutorial for creating a JavaScript parser.

There are two ways to create a JavaScript parser:

Create your own parser: Choose this way if the grammar is simple.
Use tools to generate a parser based on a grammar definition: Choose this way if the grammar is complex, like an expression grammar for math operations.

Let us focus on the criteria for selecting an appropriate parser tool. The following tools offer the most interesting options for creating a parser:

JS/CC: A parser and lexical analyzer generator, written in JavaScript. I haven’t tried it yet.
ANTLR: A parser and lexical analyzer generator, written in Java. It can also generate parsers for other languages (called “targets”) like Java or C#. There is an interesting blog post about creating a parser with it. I tried it, and it failed with the current version (released four months ago) because the JavaScript target is broken. I exchanged emails with the author of the target, and he admitted that he has hardly worked on the project for a year. So I don’t know whether there will be further development of that JavaScript target. The ANTLR team itself is currently working on the next major version, so let’s see whether a JavaScript target will still be supported.
ANTLRWorks (the ANTLR IDE) provides really good help when grammar definitions are ambiguous. It also offers features like live interpretation, debugging and grammar visualization. According to the author of the mentioned blog post, a working setup should be ANTLRWorks 1.4.2 and ANTLR JavaScript runtime 3.1.
PEG.js: This tool is just awesome. It is easy to use and works as expected, but you have to know what you are doing. There is no grammar tool as nice as ANTLRWorks; instead, there is an online tool that helps you while writing a grammar. The generated parser is also very fast: I could not notice any input lag while typing.

PEG.js uses a permissive license and can be used for free, even in commercial products. As with many other common libraries we use daily, the license text must be included. The same holds for the ANTLR license, whereas the Artistic License 1.0 used by the JS/CC project is a little more restrictive.

The main difference between the ANTLR and PEG.js solutions is that the ANTLR API provides hooks into the error handling, enabling you to collect all parse errors. The ANTLR parser is a stateful object that must be created for each parse request, whereas the PEG.js parser is stateless and aborts the parse request after the first error.

To summarize this introduction: ANTLR would be your choice if you want to parse really complex grammars where you cannot expect the input string to be valid, as in an editor for a Java grammar. On the other hand, if you only want to parse, for example, an expression or a fragment of a more complex grammar, or if you know that the content is almost always valid, PEG.js would be your choice. Since I expect the PEG.js scenarios to represent the most common use cases, I will focus on this tool.

The PEG.js Tutorial

In the following PEG.js tutorial we build a small example grammar. It creates an Abstract Syntax Tree (AST) for simple math operations: additive operations, multiplicative operations, and parentheses. The usual precedence rules are fully respected: parentheses are interpreted before multiplicative operations, and multiplicative operations before additive ones.

Such precedence is achieved by defining the grammar rules in reverse order of precedence: the rule with the lowest precedence comes first and refers to the rule with the next higher precedence. Thus, an additive operation is either a multiplicative operation followed by an additive operator and another additive operation, or just a multiplicative operation. So let’s have a look at our simple grammar for an additive operation.

additive
  = multiplicative OPADD additive
  / multiplicative
 
OPADD
  = '+'
  / '-'

As you can see, we define a grammar rule ‘additive’ and a second rule ‘OPADD’ for the operators. The slash (‘/’) expresses an ordered choice (‘OR’). By default, PEG.js returns an array containing all elements of a matched sequence. So when parsing “2+3”, PEG.js creates the array ‘[2, "+", 3]’, provided the numbers are converted to integers.

How can we improve the layout of the result?

PEG.js allows embedding JavaScript in the grammar using curly brackets ‘{ }’. This lets us rewrite the result of our ‘additive’ rule to create a JavaScript object. To access the matched elements inside this JavaScript code, we have to label each element we want to reference by prefixing it with a name and a colon as separator:

additive
  = left:multiplicative op:OPADD right:additive { return {left:left, op:op, right:right}; }
  / multiplicative

When parsing the “2+3” expression again, this will cause the creation of a JavaScript object like the following:

{
   "left": 2,
   "op": "+",
   "right": 3
}
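
Once the parser returns objects of this shape, they can be processed with ordinary JavaScript. The following is only a minimal sketch of a recursive evaluator: the function name `evaluate` is made up for illustration and is not part of PEG.js, and it assumes that leaves are plain numbers and inner nodes carry the `left`, `op` and `right` properties shown above.

```javascript
// Hypothetical helper (not part of PEG.js): evaluates an AST whose leaves are
// numbers and whose inner nodes look like {left, op, right}.
function evaluate(node) {
  if (typeof node === 'number') {
    return node;
  }
  var left = evaluate(node.left);
  var right = evaluate(node.right);
  switch (node.op) {
    case '+': return left + right;
    case '-': return left - right;
    case '*': return left * right;
    case '/': return left / right;
    default: throw new Error('Unknown operator: ' + node.op);
  }
}

var ast = { left: 2, op: '+', right: 3 };
console.log(evaluate(ast)); // 5
```

One caveat: because the ‘additive’ rule recurses on its right-hand side, an input such as “8-3-2” is grouped as 8-(3-2), so this naive evaluation yields 7 rather than 3. If left-to-right evaluation matters, the grammar or the evaluator has to account for that.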

Since there may also be whitespace between a number and an operator, this example skips it using a whitespace rule called “_”:

_
  = [ \r\n\t]*

This rule is integrated into the operator rule, which now looks as follows:

OPADD
  = _ c:"+" _ {return c;}
  / _ c:"-" _ {return c;}

But what if you want to use the parser for syntax highlighting, for example? Then you also need the position of the elements. If you enable the feature “track line and column”, PEG.js provides this information without any further effort on your side. The line and column numbers are then available inside the rule’s JavaScript code through the variables “line” and “column”.

PEG.js also allows adding your own JavaScript code to the parser by placing it in curly brackets before the first grammar rule. The following rewrite of the ‘OPADD’ rule includes some metadata, together with the function that creates the metadata object.

{
   function createMetadata(line, column) {
      return {line : line, column : column };
   }
}
 
OPADD
  = _ c:"+" _ {return {op:c, metadata: createMetadata(line, column)};}
  / _ c:"-" _ {return {op:c, metadata: createMetadata(line, column)};}

Parsing the expression “2 + 3” now creates the following object:

{
   "left": 2,
   "op": {
      "op": "+",
      "metadata": {
         "line": 1,
         "column": 3
      }
   },
   "right": 3
}
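
For something like syntax highlighting, the positions have to be pulled out of the tree again. The sketch below shows one possible way to do that; `collectOperators` is a hypothetical helper, and it assumes that every operator node has the `{op, metadata: {line, column}}` shape shown above and that leaves are plain numbers.

```javascript
// Hypothetical helper: walks the AST from left to right and returns every
// operator together with the position stored in its metadata object.
function collectOperators(node, result) {
  result = result || [];
  if (node === null || typeof node !== 'object') {
    return result; // leaf (a number): nothing to collect
  }
  collectOperators(node.left, result);
  result.push({
    symbol: node.op.op,
    line: node.op.metadata.line,
    column: node.op.metadata.column
  });
  collectOperators(node.right, result);
  return result;
}

var ast = {
  left: 2,
  op: { op: '+', metadata: { line: 1, column: 3 } },
  right: 3
};
console.log(collectOperators(ast));
```

A highlighter could then map each returned entry to a styled span at the given line and column.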

Complete Example Grammar

One thing worth mentioning is the parsing of the integer. The rule specifies the accepted characters, and since PEG.js creates an array entry for each matching character, the entries must be joined and converted to an integer. The “10” passed as the last argument to “parseInt” specifies the radix, i.e. the decimal numeral system.

Here is the complete simple math grammar, which generates an AST without any metadata.

additive
  = left:multiplicative op:OPADD right:additive { return {left:left, op:op, right:right}; }
  / multiplicative
 
multiplicative
  = left:primary op:OPMULTI right:multiplicative { return {left:left, op:op, right:right}; }
  / primary
 
primary
  = integer
  / OPENPAREN additive:additive CLOSEPAREN { return additive; }
 
integer "integer"
  = _ digits:[0-9]+ _ { return parseInt(digits.join(''), 10); }
 
/**
* Define tokens
*/
 
OPENPAREN = _ '(' _
CLOSEPAREN = _ ')' _
 
OPADD
  = _ c:"+" _ {return c;}
  / _ c:"-" _ {return c;}
 
OPMULTI
  = _ c:"*" _ {return c;}
  / _ c:"/" _ {return c;}
 
_
  = [ \r\n\t]*
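
The conversion used in the ‘integer’ rule above can be tried out in isolation. This is just a small sketch; `digitsToInt` is a made-up name, and the input array only imitates what the rule receives for a match of `[0-9]+`, namely one string per digit.

```javascript
// Mirrors the JavaScript code of the 'integer' rule: the match of [0-9]+ is
// assumed to arrive as an array with one string per digit.
function digitsToInt(digits) {
  return parseInt(digits.join(''), 10);
}

console.log(digitsToInt(['4', '2']));      // 42
console.log(digitsToInt(['0', '0', '7'])); // 7
```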

Parsing the expression “2 * (3 + 4)” would result in the following object.

{
   "left": 2,
   "op": "*",
   "right": {
      "left": 3,
      "op": "+",
      "right": 4
   }
}
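
For comparison with the first option from the introduction (writing the parser by hand), here is a minimal sketch of a recursive-descent parser for the same grammar. It is not the code PEG.js generates: all names (`parseExpression` and its inner helpers) are invented for illustration, and the error messages are much simpler than those of PEG.js.

```javascript
// Hand-written sketch of the grammar above; NOT the output of PEG.js.
function parseExpression(input) {
  var pos = 0;

  // Rule "_": skip optional whitespace.
  function skipWhitespace() {
    while (pos < input.length && ' \r\n\t'.indexOf(input.charAt(pos)) !== -1) {
      pos++;
    }
  }

  // Matches one of the given characters, surrounded by optional whitespace.
  // Returns the character, or null (restoring the position) if none matches.
  function token(chars) {
    var start = pos;
    skipWhitespace();
    var c = input.charAt(pos);
    if (c !== '' && chars.indexOf(c) !== -1) {
      pos++;
      skipWhitespace();
      return c;
    }
    pos = start;
    return null;
  }

  function fail(expected) {
    throw new Error('Expected ' + expected + ' at offset ' + pos);
  }

  // integer = _ [0-9]+ _
  function integer() {
    skipWhitespace();
    var start = pos;
    while (pos < input.length && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
      pos++;
    }
    if (pos === start) {
      fail('integer or "("');
    }
    var value = parseInt(input.substring(start, pos), 10);
    skipWhitespace();
    return value;
  }

  // primary = integer / OPENPAREN additive CLOSEPAREN
  function primary() {
    if (token('(') !== null) {
      var inner = additive();
      if (token(')') === null) {
        fail('")"');
      }
      return inner;
    }
    return integer();
  }

  // multiplicative = primary OPMULTI multiplicative / primary
  function multiplicative() {
    var left = primary();
    var op = token('*/');
    return op === null ? left : { left: left, op: op, right: multiplicative() };
  }

  // additive = multiplicative OPADD additive / multiplicative
  function additive() {
    var left = multiplicative();
    var op = token('+-');
    return op === null ? left : { left: left, op: op, right: additive() };
  }

  var result = additive();
  if (pos < input.length) {
    fail('end of input');
  }
  return result;
}

console.log(JSON.stringify(parseExpression('2 * (3 + 4)')));
```

Even for this tiny grammar the hand-written version is considerably longer than the grammar definition, which illustrates why a generator pays off as soon as the grammar grows.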

Download the Parser Code

You can easily try out and play with this grammar using the PEG.js online tool. To enable the metadata, check the checkbox next to the download button in the online tool.

If you enter ‘Parser’ into the ‘Parser variable’ text field of the online tool, you can simply copy and paste the displayed code into your web project and use it as in the following integration example. It requires no further libraries, not even the PEG.js file itself, just the generated code.

Integrate the Parser

This is an implementation of an ExtJS validator for an ExtJS text field.

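validator : function(input) {
   // Ignore leading and trailing whitespace.
   var trimmedInput = Ext.util.Format.trim(input);
   // An empty field is treated as valid.
   if (trimmedInput === '') {
      return true;
   }
   try {
      Parser.parse(trimmedInput);
   } catch (e) {
      // A returned string is shown as the validation error.
      return e.message;
   }
   return true;
}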

This function first trims the input by removing whitespace at the beginning and end of the given string. If the trimmed text isn’t an empty string, it is parsed by calling the “parse” function on the “Parser” object.
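
The same pattern also works outside ExtJS. The sketch below wraps any object that exposes a `parse` method which throws on invalid input. Both `createValidator` and the stub parser are hypothetical; the stub only stands in for the generated “Parser” object so that the example runs on its own.

```javascript
// Hypothetical wrapper: returns true for valid (or empty) input, otherwise the
// message of the exception thrown by parser.parse().
function createValidator(parser) {
  return function (input) {
    var trimmed = String(input).trim();
    if (trimmed === '') {
      return true;
    }
    try {
      parser.parse(trimmed);
    } catch (e) {
      return e.message;
    }
    return true;
  };
}

// Stub standing in for the generated parser; it accepts digits only.
var stubParser = {
  parse: function (text) {
    if (!/^[0-9]+$/.test(text)) {
      throw new Error('Expected integer but "' + text + '" found.');
    }
    return parseInt(text, 10);
  }
};

var validate = createValidator(stubParser);
console.log(validate('  42 ')); // true
console.log(validate('abc'));   // Expected integer but "abc" found.
```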

If you want to validate only a subset of the grammar, you can also invoke the method with an additional string parameter that names the parse rule to start with. For example, calling “Parser.parse(input, 'multiplicative')” would only be able to parse multiplicative expressions like “1 * 2”. If no parse rule is explicitly given, the first rule specified in the grammar is used.

In case the given string could not be parsed, an exception is thrown. The exception object provides the property “message” which would look like the following when parsing the expression “1 2”:

“Line 1, column 3: Expected "*", "+", "-", "/" or [ \r\n\t] but "2" found.”

The thrown exception object contains all data necessary for your own error handling strategy (e.g. to internationalize the exception message or use it for an auto completion feature).
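
As a rough sketch of such a strategy, the helper below builds a custom (for example translated) message from the position data. Note the assumptions: the property names `line`, `column` and `found` are not confirmed by this post and may differ between PEG.js versions, so inspect the object your generated parser actually throws. `describeParseError` and the placeholder syntax are invented for this example.

```javascript
// Hypothetical helper. ASSUMPTION: the thrown object carries 'line', 'column'
// and 'found'; if it does not, the parser's own message is returned instead.
function describeParseError(error, template) {
  if (typeof error.line !== 'number' || typeof error.column !== 'number') {
    return error.message;
  }
  var found = (error.found === null || error.found === undefined)
    ? 'end of input'
    : String(error.found);
  // split/join replaces every occurrence and treats the values literally.
  return template
    .split('{line}').join(String(error.line))
    .split('{column}').join(String(error.column))
    .split('{found}').join(found);
}

// Plain object imitating a parse error for the input "1 2".
var fakeError = { message: 'Parse error', line: 1, column: 3, found: '2' };
console.log(describeParseError(fakeError, 'Unexpected "{found}" at {line}:{column}'));
// Unexpected "2" at 1:3
```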

One Last Tip

Even if you decide to create the parser with PEG.js, you can take advantage of ANTLRWorks to define the grammar. It is easy to use and the syntax is easy to learn. After this tutorial it should be easy to transform your grammar into a grammar definition that PEG.js accepts.

Tags: ANTLR, JavaScript Parser, JS/CC, PEG.js


Michael's Blog