Recursive Descent Parsing in Python

Red Dove Consultants

2021-08-12 10:52

There's a useful list of Python parsing tools maintained by Michael Bernstein. However, it doesn't contain a close equivalent of the tool we've used in the past when developing lexers and parsers in Java: JavaCC. In this post, we'll look at what's nice about JavaCC, and how recent developments in it make it feasible to create parsers with it in Python, with no external dependencies!

JavaCC Background

JavaCC is used to write parsers in Java and has a number of features which make it attractive:

It uses LL(k) [1] parsing techniques to create top-down recursive-descent parsers. These are a lot closer to parsers one might write by hand – the other common parsing technique used, LR(k) [1], creates bottom-up parsers which are harder to reason about and debug. Recursive-descent parsers also allow you to easily parse different categories of input – e.g. statements, expressions or entire modules.
It creates parsers which have no external dependencies other than the Java runtime. This is in contrast to other tools like ANTLR, which is a well-respected tool but has a separate runtime which will be referenced by the parsing code that's generated.
It allows you to insert semantic actions (bits of code) anywhere in the grammar. This can be very useful, for example, to assist with tree building and error recovery logic in the parser.
Although it uses a default k (lookahead value) of 1, that lookahead can be made larger in parts of the grammar using LOOKAHEAD(k) notation with k > 1, to help disambiguate the grammar. You can also use predicates in code to assist with disambiguation. This approach is very pragmatic, because LL(1) is good enough most of the time and you can override the default lookahead logic only where it's needed.

JavaCC Drawbacks

In addition to the advantages, JavaCC has historically suffered from some drawbacks, too:

The grammar file syntax requires one to use lots of parentheses and curly braces even when they aren't need to understand the grammar. This results in a lot of visual clutter. Here is an example of a rule for parsing a FooBar, which consists of a Foo followed by a Bar:
```
void FooBar() :
{}
{
    Foo() Bar()
}
```
This rule could be represented more succinctly as
```
FooBar : Foo Bar;
```
Sadly, JavaCC doesn't use the simpler syntax.
Building a syntax tree requires an additional step and a separate tool, JJTree.
The pace of development of the project is positively glacial. Very few new features have been added, and some long-standing problems related to nested lookahead remain unresolved. There has been some activity of late relating to improved code generation, though no new releases have resulted as a result of that activity.
The codebase is – ahem – gnarly, though of course that's just an opinion, and it doesn't affect users of the generated code. But back in 2008, when we had looked at the possibility of modifying JavaCC to generate Python code rather than Java code, the codebase appeared too hard to work with, containing lots of print(...) statements to produce the generated code. (That's perhaps because better approaches weren't readily available prior to 2003, when JavaCC was released under an Open Source licence.)

Enter JavaCC21

Of late, there's a new version of JavaCC on the block – JavaCC21. In addition to supporting a more streamlined syntax, built-in syntax tree construction, an INCLUDE statement (allowing large grammars to be divided into e.g. lexer and parser specifications), a preprocessor and a number of other enhancements, JavaCC21 has for us two particular features of interest:

It's under active development, compared to the original JavaCC.
It uses templates to generate parser code, which means that it can generate parsers in other languages provided that a set of templates for those languages can be developed. The template-based approach is streets ahead of using print() statements and offers the real possibility of generating parsers in Python, and other languages too, of course!

Using JavaCC21 to generate parsers in Python

To understand the feasibility of generating recursive-descent parsers in Python using JavaCC21, we dived right in. The JavaCC21 codebase is easier to follow than the old codebase, and readily allowed us to make the following changes:

Add a -lang command line argument which accepts values of java or python and defaults to java if not specified.
Use the -lang parameter value to determine the root directory of templates to be used for code generation.
Create a set of templates in the template root directory for Python, using the existing templates for Java as a guide.
Render the selected templates with the data obtained from parsing a source grammar file to produce the lexer, parser and utility code needed.
Create simple tests which run the lexer and parser generated from a grammar using the Java templates on a set of test files and produce text result files (lexer results are a list of tokens with their type, text and position data; parser results are the parse tree dumped as an indented set of nodes). The same test files are processed by the lexer and parser generated from the same grammar and the Python templates, and the test results stored in the same format and compared with the results from the Java parsers.

Initial results are encouraging. Using test grammars for JSON, Java and Python 3.9, identical results were obtained by running the Java and Python parsers on a smallish set of test files. Here's an example test run (not intended as any kind of benchmark – just intended to find any differences between Java and Python parsers):

$ PY_DEBUG=1 pypy3 python_tests.py --langs json,java,python
----------------------------------------------------------------------
Testing with JSON grammar
----------------------------------------------------------------------
Working directory: /tmp/javacc-python-test-73tya3xj
Test files copied to working directory.
Java version of lexer and parser created.
Java lexer and parser compiled.
Java lexer run completed (2.48 secs).
Java parser run completed (2.66 secs).
Python version of lexer and parser created.
Python lexer run completed (0.38 secs).
Python parser run completed (0.62 secs).
Results for Python & Java lexers & parsers are identical - yay!
----------------------------------------------------------------------
Testing with Java grammar
----------------------------------------------------------------------
Working directory: /tmp/javacc-python-test-54k1jjdr
Test files copied to working directory.
Java version of lexer and parser created.
Java lexer and parser compiled.
Java lexer run completed (4.90 secs).
Java parser run completed (3.70 secs).
Python version of lexer and parser created.
Python lexer run completed (6.21 secs).
Python parser run completed (15.90 secs).
Results for Python & Java lexers & parsers are identical - yay!
----------------------------------------------------------------------
Testing with Python grammar
----------------------------------------------------------------------
Working directory: /tmp/javacc-python-test-ufbbfnok
Test files copied to working directory.
Java version of lexer and parser created.
Java lexer and parser compiled.
Java lexer run completed (3.28 secs).
Java parser run completed (4.05 secs).
Python version of lexer and parser created.
Python lexer run completed (6.32 secs).
Python parser run completed (27.58 secs).
Results for Python & Java lexers & parsers are identical - yay!
Working directories deleted.

These test results were obtained using PyPy, but CPython also produces equivalent results – albeit slightly more slowly. The Java times appear a little longer than the Python ones in some cases, but that's due to a small amount of Jython overhead because a small common Python/Jython test harness is used to call the parsers, and that causes a slight distortion in the Java times reported when the total time is small.

In another test, the Python parsers in Java and Python were let loose on all the Python files in recent checkouts of the Django and CPython projects (almost 4700 Python source files in total), with each run writing the parse trees to disk in text format. When compared, the two sets of result trees were equivalent – in other words, the Python parser returns the same parse tree as the Java parser over all those files!

In subsequent posts, we'll outline in more detail the technical work that was involved in creating this Python-generating version of JavaCC21, and what's left to do to get it production-ready. Stay tuned!

JavaCC Background

JavaCC Drawbacks

Enter JavaCC21

Using JavaCC21 to generate parsers in Python

Comments