Child nodes can now be named in JavaCC21 ASTs

We've just added a new feature to JavaCC21 (soon to be renamed CongoCC) – the ability to access child nodes of an AST node by name. In this post, we'll discuss why that's useful and how to use the feature.

JavaCC21 tree generation

JavaCC21 creates a concrete syntax tree (CST) – a tree which contains every single token and non-terminal, with position information. There are applications (e.g. refactoring, code formatting, code style checking and so on) where this level of information capture is essential. However, for code-generating applications, the level of detail in the information captured can be a distraction.

We saw this in practice when working on multi-language generation support for JavaCC21. At the time of writing, we have first-cut support for generating self-contained recognizers (lexers and parsers) in Python and C#, using the example grammars bundled with JavaCC21 (these grammars are pretty up to date - the latest language versions are supported).

Handling semantic actions

The example grammars all contain semantic actions, which are code snippets written in Java (the implementation language of JavaCC21) and which get nicely folded into recognizers generated in the Java language. The snippets include Java code declared in the grammar using the INJECT directive of JavaCC21, which allows code to be added to generated classes without the need for any post-processing of the generated code.

The generated recognizers need the semantic action code to be present in order to work. But how do we get these actions into the recognizers generated in, say, Python or C#?

There are some broad approaches to achieving this:

  1. Duplicate the semantic actions in the target languages. The example grammars would then have multiple copies of actions – in each target language, selected using preprocessor definitions such as

    #if (lang == 'java')
    <Java code snippet>
    #elif (lang == 'csharp')
    <C# code snippet>
    #elif (lang == 'python')
    <Python code snippet>
    #endif
    

    JavaCC21 would then choose the relevant snippet based on the target language.

    This would be tedious and error-prone in the case where multiple languages needed to be supported in a grammar (though not problematic if only one language needed to be supported for that grammar).

    As an alternative to preprocessor definitions, the delimiter for semantic actions in the grammar could be modified to indicate the language of the action, and one could attach multiple semantic actions in place of one, selecting only the relevant one when generating code – something like:

    {<csharp>
    Snippet in C#
    <csharp>}
    {<python>
    Snippet in Python
    <python>}
    {
    Snippet in Java
    }
    

    This would require changes in JavaCC21 to handle snippets in other languages and to handle multiple snippets at a given point in the grammar. It looks a little ugly/unwieldy, but it seems workable.

    Whether this approach or preprocessor directives are used, support for multiple languages in a grammar using this approach violates the DRY (Don't Repeat Yourself) principle.

  2. Transpile the Java code in snippets into the target language. This has the advantage that the DRY principle isn't violated, and that any existing grammar could perhaps be handled without changes, but also has drawbacks:

    • Transpilation is not feasible if arbitrary code is used in action snippets. For example, code using the Java streams API would be hard to transpile to Python, say. So transpiling only works if you constrain the Java used in snippets to constructs common to all languages (e.g. using common types such as arrays/lists, mappings, and primitive/basic types).

    • Transpilation can be hard to do correctly because of the limited context available in snippets. For example, type information in a given snippet can be incomplete, because types are hidden behind common Java APIs, so it can be hard to know exactly what code to generate from the transpilation.

    Despite these drawbacks, JavaCC21 currently implements the transpilation approach for a couple of reasons:

    • The absence of support for the duplicated-code approach in current JavaCC21.

    • As an experiment to test the feasibility of the transpilation approach – it's feasible, but how far can we take it?

  3. Use a pseudo-language for snippets which then gets transpiled into the language used for generation. This would allow users to specify grammar actions just once and have them transpiled into the target language. The advantage is in upholding the DRY principle, and through the pseudo-language design ensuring that full support for transpilation is available. The drawback is that the pseudo-language needs to be designed, a parser for it implemented and transpilers to the various target languages devised.

The second of these approaches, transpilation from Java, has allowed us to develop working parsers for JSON, Java and Python generated in Python and C# as well as the "native" Java versions. A parser for the example C# grammar generated in Java exists and is quite robust; transpiler support for generating Python and C# generation for the C# grammar snippets is in progress. And it is in the context of this work that the need for named child nodes has become apparent.

Why are named child nodes useful?

With the transpilation-from-Java approach, we need to convert Java code to the target generation language (Python or C#). Consider a (somewhat simplified) portion of the Java grammar built into JavaCC21:

MethodDeclaration :
  Modifiers [ TypeParameters ]
  ReturnType <IDENTIFIER>
  FormalParameters ( (Annotation)* "[" "]" )* [ ThrowsList ]
  ( Block | ";" )
;

ConstructorDeclaration :
  Modifiers [ TypeParameters ]
  TypeIdentifier FormalParameters [ ThrowsList ]
  "{"
  [ ExplicitConstructorInvocation ]
  ( BlockStatement )*
  "}"
;

A transpiler would need to process declarations of these types and generate equivalent code in a target language. Note that there are some common elements in these: for example, Modifiers, TypeParameters, FormalParameters and ThrowsList. However, they occur in completely different positions in the rules for MethodDeclaration and ConstructorDeclaration. Also note that even in two given instances of say MethodDeclaration, the FormalParameters item might be in different positions in the child node list of the MethodDeclaration (e.g. if one declaration has type parameters and the other doesn't). Thus, to locate nodes of interest in the CST for e.g. a MethodDeclaration, we have a number of options:

  • Walk through the child nodes one by one, looking for nodes of interest, by type and/or position, and skipping the others.

  • Use a Node interface API which all CST nodes implement, such as for example firstChildOfType<FormalParameters>() or childrenOfType<IDENTIFIER>(). This might work for some grammars and parent nodes (as in this example, where there are no duplicate nodes of a given type in the grammar rules). But suppose you are working on a grammar containing a rule such as

    ImportDeclaration :
        (<FROM> <IDENTIFIER> ("." <IDENTIFIER> )*
         <IMPORT> <IDENTIFIER> [ <AS> <IDENTIFIER> ]
         |
         <IMPORT> <IDENTIFIER> ("." <IDENTIFIER> )* [ <AS> <IDENTIFIER> ]
        )
        ";"
    ;
    

    clearly just looking for child <IDENTIFIER> nodes doesn't help identify (excuse the pun) what the different identifiers mean.

If the grammar author can name the relevant grammar elements, where the name ascribes some semantic meaning, then those elements could be extracted by name from a parent node. If the above rule is changed to

ImportDeclaration :
    (<FROM> <IDENTIFIER> /module/ ("." <IDENTIFIER> /[submodules]/ )*
     <IMPORT> <IDENTIFIER> /imported/ [ <AS> <IDENTIFIER> /alias/ ]
     |
     <IMPORT> <IDENTIFIER> /module/ ("." <IDENTIFIER> /[submodules]/ )* [ <AS> <IDENTIFIER> /alias/ ]
    )
    ";"
;

then it's pretty clear what each identifier means semantically. Given a CST node that represents an ImportDeclaration. you could access the relevant children using

importDecl.getNamedChild("module")          // => a Token with the module identifier
importDecl.getNamedChildList("submodules")  // => a List<Token> with the submodules
importDecl.getNamedChild("imported")        // => a Token with the imported identifier
importDecl.getNamedChild("alias")           // => a Token with the alias identifier

So names are just identifiers delimited by / characters and optionally surrounded by [ and ] to indicate that there are multiple occurrences, to be returned as a list. For any names not specified in the grammar or where the corresponding grammar element is missing, the above APIs would return null.

Note that the existing Node APIs are unaffected, and function exactly as before. Grammar authors are free to use this new feature without any concerns about backwards compatibility.

Comments