Generalized LL (GLL) Parser

TLDR; This tutorial is a complete implementation of a GLL parser in Python, including SPPF parse tree extraction ¹. The Python interpreter is embedded so that you can work through the implementation steps.

A GLL parser is a generalization of LL parsers. The first generalized LL parser was reported by Grune and Jacob ² (11.2) from a master's thesis report in 1993 (another, possibly earlier, paper looking at generalized LL parsing is from Lang in 1974 ³, and another is from Bouckaert et al. ⁴). However, a better-known generalization of LL parsing was described by Scott and Johnstone ⁵. This post follows the latter technique. In this post, I provide a complete implementation and a tutorial on how to implement a GLL parser in Python.

We previously discussed the Earley parser, which is a general context-free parser. The GLL parser is another general context-free parser that is capable of parsing strings that conform to any given context-free grammar. The algorithm is a generalization of the traditional recursive descent parsing style. In traditional recursive descent parsing, the programmer uses the call stack for keeping track of the parse context. This approach, however, fails when there is left recursion. The problem is that a recursive descent parser cannot advance the parsed index, as it is not immediately clear how many recursions are required to parse a given string. Bounding the recursion, as we discussed before, is a reasonable solution. However, it is very inefficient.

GLL parsing offers a solution. The basic idea behind GLL parsing is to maintain the call stack programmatically, which allows us to iteratively deepen the parse for any nonterminal at any given point. This, combined with sharing of the stack (GSS) and generation of a parse forest (SPPF), makes GLL parsing very efficient. Furthermore, unlike Earley, CYK, and GLR parsers, a GLL parser operates by producing a custom parser for a given grammar. This means that one can actually debug the recursive descent parsing program directly. Hence, using GLL can be much more friendly to the practitioner.

Similar to Earley, GLR, CYK, and other general context-free parsers, the worst case for parsing is $O(n^3)$ . However, for LL(1) grammars, the parse time is $O(n)$ .

Synopsis

import gllparser as P
my_grammar = {'<start>': [['1', '<A>'],
                          ['2']
                         ],
              '<A>'    : [['a']]}
my_parser = P.compile_grammar(my_grammar)
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    print(P.format_parsetree(tree))

Definitions

For this post, we use the following terms:

The alphabet is the set of all symbols in the input language. For example, in this post, we use all ASCII characters as the alphabet.
A terminal is a single alphabet symbol. Note that this is slightly different from usual definitions (done here for ease of parsing). (Usually a terminal is a contiguous sequence of symbols from the alphabet. However, both kinds of grammars have a one to one correspondence, and can be converted easily.)

For example, x is a terminal symbol.
A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion.

For example, <term> is a nonterminal in the below grammar.
A rule is a finite sequence of terms (two types of terms: terminals and nonterminals) that describes an expansion of a given nonterminal. A rule is also called an alternative expansion.

For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.
A definition is a set of rules that describe the expansion of a given nonterminal.

For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>
A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal.

The grammar given below is an example context-free grammar.
A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives the string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).
A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.
A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.
The yield of a tree is the string resulting from collapsing that tree.
An epsilon rule matches an empty string.
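
As a small illustration of the last few definitions, here is a minimal sketch of collapsing a derivation tree into its yield. It assumes the fuzzingbook convention of representing a tree as a `(symbol, children)` pair and of writing nonterminals in angle brackets. The helper names `is_nonterminal` and `tree_to_string` are made up for this example.

```python
def is_nonterminal(symbol):
    # Assumption: nonterminals are written as '<name>'.
    return len(symbol) > 1 and symbol[0] == '<' and symbol[-1] == '>'

def tree_to_string(tree):
    """Collapse a (symbol, children) derivation tree into its yield."""
    symbol, children = tree
    if not is_nonterminal(symbol):
        return symbol  # a terminal contributes itself
    # A nonterminal contributes the yields of its children, in order.
    # A nonterminal expanded by an epsilon rule has no children and yields ''.
    return ''.join(tree_to_string(child) for child in children)

# A derivation tree for '1a' under the grammar from the synopsis.
tree = ('<start>', [('1', []), ('<A>', [('a', [])])])
print(tree_to_string(tree))  # prints 1a
```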

Prerequisites

As before, we start with the prerequisite imports. If you are running this on the command line, please uncomment the following line.

def __canvas__(g):
   print(g)

Available Packages

These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and they are available at the locations below. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

We need the fuzzer to generate inputs to parse and also to provide some utilities

We use the display_tree() method in earley parser for displaying trees.

We use the random choice to extract derivation trees from the parse forest.

Pydot is needed for drawing

As before, we use the fuzzingbook grammar style. Here is an example grammar for arithmetic expressions, starting at <start>. A terminal symbol has exactly one character (Note that we disallow empty string ('') as a terminal symbol). Secondly, as per traditional implementations, there can only be one expansion rule for the <start> symbol. We work around this restriction by simply constructing as many charts as there are expansion rules, and returning all parse trees.
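
For readers following along without the embedded interpreter, here is a sketch of what such a grammar can look like. It is an assumption pieced together from the rule fragments quoted in the definitions (`<term>+<expr>`, `<digit><digits>`), not necessarily the exact grammar used in the rest of the post. The names `EXPR_GRAMMAR`, `is_nonterminal`, and `check_grammar` are hypothetical. The check verifies the single-character terminal restriction described above.

```python
# Assumed arithmetic expression grammar in the fuzzingbook style.
EXPR_GRAMMAR = {
    '<start>': [['<expr>']],
    '<expr>': [['<term>', '+', '<expr>'],
               ['<term>', '-', '<expr>'],
               ['<term>']],
    '<term>': [['<fact>', '*', '<term>'],
               ['<fact>', '/', '<term>'],
               ['<fact>']],
    '<fact>': [['<digits>'],
               ['(', '<expr>', ')']],
    '<digits>': [['<digit>', '<digits>'],
                 ['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}
EXPR_START = '<start>'

def is_nonterminal(symbol):
    # Assumption: nonterminals are written as '<name>'.
    return len(symbol) > 1 and symbol[0] == '<' and symbol[-1] == '>'

def check_grammar(grammar):
    """Return a list of problems: undefined nonterminals, or terminals
    that are not exactly one character long."""
    problems = []
    for key, rules in grammar.items():
        for rule in rules:
            for token in rule:
                if is_nonterminal(token):
                    if token not in grammar:
                        problems.append('undefined nonterminal %s' % token)
                elif len(token) != 1:
                    problems.append('bad terminal %r in %s' % (token, key))
    return problems

print(check_grammar(EXPR_GRAMMAR))  # prints []
```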

Traditional Recursive Descent

Consider how you will parse a string that conforms to the following grammar
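
Reading the grammar back from the recognizer below, it can be written in the fuzzingbook style as follows. The name `G1` and the helper `expand` are assumptions made for this sketch. `expand` simply enumerates every string of a grammar without recursion, to show which inputs the recognizer should accept.

```python
# Assumed form of the grammar handled by the recognizer below.
G1 = {
    '<S>': [['<A>', '<B>'],
            ['<C>']],
    '<A>': [['a']],
    '<B>': [['b']],
    '<C>': [['c']],
}

def expand(grammar, symbol):
    """List all strings derived by symbol. Only terminates for
    grammars without recursion, such as G1."""
    if symbol not in grammar:
        return [symbol]  # a terminal derives itself
    results = []
    for rule in grammar[symbol]:
        partial = ['']
        for token in rule:
            partial = [p + s for p in partial for s in expand(grammar, token)]
        results.extend(partial)
    return results

print(expand(G1, '<S>'))  # prints ['ab', 'c']
```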

In traditional recursive descent, we write a parser in the following fashion

class G1TraditionalRD(ep.Parser):
    def recognize_on(self, text):
        res = self.S(text, 0)
        if res == len(text): return True
        return False

    # S ::= S_0 | S_1
    def S(self, text, cur_idx):
        if (i := self.S_0(text, cur_idx)) is not None: return i
        if (i := self.S_1(text, cur_idx)) is not None: return i
        return None

    # S_0 ::= <A> <B>
    def S_0(self, text, cur_idx):
        if (i := self.A(text, cur_idx)) is None: return None
        if (i := self.B(text, i)) is None: return None
        return i

    # S_1 ::= <C>
    def S_1(self, text, cur_idx):
        if (i := self.C(text, cur_idx)) is None: return None
        return i

    # A ::= A_0
    def A(self, text, cur_idx):
        if (i := self.A_0(text, cur_idx)) is not None: return i
        return None

    # A_0 ::= a
    def A_0(self, text, cur_idx):
        i = cur_idx+1
        if text[cur_idx:i] != 'a': return None
        return i

    # B ::= B_0
    def B(self, text, cur_idx):
        if (i := self.B_0(text, cur_idx)) is not None: return i
        return None

    # B_0 ::= b
    def B_0(self, text, cur_idx):
        i = cur_idx+1
        if text[cur_idx:i] != 'b': return None
        return i

    # C ::= C_0
    def C(self, text, cur_idx):
        if (i := self.C_0(text, cur_idx)) is not None: return i
        return None

    # C_0 ::= c
    def C_0(self, text, cur_idx):
        i = cur_idx+1
        if text[cur_idx:i] != 'c': return None
        return i

Using it

What if there is recursion? Here is another grammar with recursion

In traditional recursive descent, we write a parser in the following fashion

Using it

The problem happens when there is a left recursion. For example, the following grammar contains a left recursion even though it recognizes the same language as before.
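
To see the failure concretely, here is a minimal sketch that assumes the left-recursive grammar `<A> ::= <A> a | ε` (the same shape as the grammar used by the recognizers later in this post). The functions `A` and `recognize` are written for this illustration only. Because the first alternative calls `A` again at the same input index, the procedure never consumes any input and exhausts the Python call stack.

```python
def A(text, cur_idx):
    # <A> ::= <A> a
    # The left-recursive alternative is tried first. It calls A again
    # at the same index, so no progress is ever made.
    i = A(text, cur_idx)
    if i is not None and text[i:i+1] == 'a':
        return i + 1
    # <A> ::= (empty); never reached because of the recursion above
    return cur_idx

def recognize(text):
    try:
        return A(text, 0) == len(text)
    except RecursionError:
        return 'recursion limit exceeded'

print(recognize('aa'))  # prints recursion limit exceeded
```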

Naive Threaded Recognizer

The problem with left recursion is that in the traditional recursive descent style, we are forced to follow a depth-first exploration, completing the parse of one entire rule before attempting the next rule. We can work around this by managing the call stack ourselves. The idea is to convert each procedure into a case label, and to save the previous label in the stack (managed by us) before entering a sub procedure. When the exploration is finished, we pop the previous label off the stack, and continue where we left off.

class NaiveThreadedRecognizer(ep.Parser):
    def recognize_on(self, text, start_symbol, max_count=1000):
        parser = self.parser
        parser.initialize(text)
        parser.set_grammar(
        {
         '<S>': [['<A>']],
         '<A>': [['<A>', 'a'],
                 []]
        })
        L, stack_top, cur_idx = start_symbol, parser.stack_bottom, 0
        self.count = 0
        while self.count < max_count:
            self.count += 1
            if L == 'L0':
                if parser.threads:
                    (L, stack_top, cur_idx) = parser.next_thread()
                    if ((L[0], stack_top, cur_idx)
                        == (start_symbol, parser.stack_bottom, (parser.m-1))):
                        return parser
                    continue
                else:
                    return []
            elif L == 'L_':
                stack_top = parser.fn_return(stack_top, cur_idx) # pop
                L = 'L0' # goto L_0
                continue

            elif L == '<S>':
                # <S>::=['<A>']
                parser.add_thread( ('<S>',0,0), stack_top, cur_idx)
                L = 'L0'
                continue

            elif L == ('<S>',0,0): # <S>::= | <A>
                stack_top = parser.register_return(('<S>',0,1), stack_top, cur_idx)
                L = '<A>'
                continue

            elif L == ('<S>',0,1): # <S>::= <A> |
                L = 'L_'
                continue

            elif L == '<A>':
                # <A>::=['<A>', 'a']
                parser.add_thread( ('<A>',0,0), stack_top, cur_idx)
                # <A>::=[]
                parser.add_thread( ('<A>',1,0), stack_top, cur_idx)
                L = 'L0'
                continue

            elif L == ('<A>',0,0): # <A>::= | <A> a
                stack_top = parser.register_return(('<A>',0,1), stack_top, cur_idx)
                L = "<A>"
                continue

            elif L == ('<A>',0,1): # <A>::= <A> | a
                if parser.I[cur_idx] == 'a':
                    cur_idx = cur_idx+1
                    L = ('<A>',0,2)
                else:
                    L = 'L0'
                continue

            elif L == ('<A>',0,2): # <A>::= <A> a |
                L = 'L_'
                continue

            elif L == ('<A>',1,0): # <A>::= |
                L = 'L_'
                continue

            else:
                assert False

We also need a way to hold the call stack. The call stack is actually stored as a linked list with the current stack_top on the top. With multiple alternatives being explored together, we actually have a tree, but the leaf nodes only know about their parent (not the reverse). For convenience, we use a wrapper for the call stack, where we define a few bookkeeping functions. First, the initialization of the call stack.

Adding a thread simply means appending the label, current stack top, and current parse index to the threads. We can also retrieve threads.

Next, we define how returns are handled. That is, before exploring a new sub procedure, we have to save the return label in the stack, which is handled by register_return(). The current stack top is added as a child of the return label.

When we have finished exploring a given procedure, we return to the original position in the stack by popping off the previous label.
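
The three pieces described above (initialization, threads, and returns) can be sketched as a single wrapper. This is a minimal sketch, not necessarily the implementation used in this post. The method and attribute names are taken from how the recognizer calls them, while the class name `SketchCallStack`, the dictionary representation of stack entries, and the `'$'` end marker appended to the input are assumptions.

```python
class SketchCallStack:
    def initialize(self, text):
        self.threads = []
        self.I = text + '$'   # assumption: '$' marks the end of input
        self.m = len(self.I)
        # Each stack entry knows only its parent(s), not its children.
        self.stack_bottom = {'label': ('L0', 0), 'previous': []}

    def set_grammar(self, grammar):
        self.grammar = grammar

    def add_thread(self, L, stack_top, cur_idx):
        # A thread is a label to resume at, a stack, and an input index.
        self.threads.append((L, stack_top, cur_idx))

    def next_thread(self):
        return self.threads.pop(0)

    def register_return(self, L, stack_top, cur_idx):
        # Push: remember where to continue once the sub procedure is done.
        return {'label': (L, cur_idx), 'previous': [stack_top]}

    def fn_return(self, stack_top, cur_idx):
        # Pop: schedule the saved return label on the parent stack.
        if stack_top is not self.stack_bottom:
            L, _ = stack_top['label']
            for previous in stack_top['previous']:
                self.add_thread(L, previous, cur_idx)
        return stack_top

stack = SketchCallStack()
stack.initialize('a')
top = stack.register_return(('<S>', 0, 1), stack.stack_bottom, 0)
stack.fn_return(top, 1)
print(stack.next_thread()[0])  # prints ('<S>', 0, 1)
```

Nothing in this sketch checks whether a thread was already scheduled, which is what lets a left-recursive rule keep spawning threads.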

Using it.

This unfortunately has a problem. The issue is that, when a string does not parse, the recursion along with the epsilon rule means that there is always a thread that keeps spawning new threads.

The GSS Graph

The way to solve it is to use something called a graph-structured stack ⁶. A naive conversion of recursive descent parsing to generalized recursive descent parsing can be done by maintaining independent stacks for each thread. However, as we saw previously, this approach has problems when it comes to left recursion. The GSS converts the tree-structured stack to a graph.

The GSS Node

A GSS node is simply a node that can contain any number of children. Each child is actually an edge in the graph.

(Each GSS Node is of the form $L_i^j$ where $j$ is the index of the character consumed. However, we do not need to know the internals of the label here).

The GSS container

Next, we define the graph container. We keep two structures. self.graph which is the shared stack, and self.P which is the set of labels that went through a fn_return, i.e. pop operation.
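
A minimal sketch of the node and the container, under the assumption that nodes are looked up by label so that the same label always maps to the same shared node. The attribute names `graph` and `P` come from the description above. The class names `GSSNode` and `GSS`, and the methods `get`, `add_parsed_index`, and `parsed_indexes`, are hypothetical names chosen for this example.

```python
class GSSNode:
    def __init__(self, label):
        self.label = label
        self.children = []   # each child is an edge in the graph

    def __eq__(self, other):
        return isinstance(other, GSSNode) and self.label == other.label

    def __hash__(self):
        return hash(self.label)

    def __repr__(self):
        return 'GSSNode(%r)' % (self.label,)

class GSS:
    def __init__(self):
        self.graph = {}  # label -> shared GSSNode
        self.P = {}      # label -> input indexes at which the node was popped

    def get(self, label):
        # Create the node on first use, and share it afterwards.
        if label not in self.graph:
            self.graph[label] = GSSNode(label)
            self.P[label] = []
        return self.graph[label]

    def add_parsed_index(self, label, j):
        self.P[label].append(j)

    def parsed_indexes(self, label):
        return self.P[label]

gss = GSS()
bottom = gss.get(('L0', 0))
node = gss.get((('<A>', 0, 1), 0))
node.children.append(bottom)
print(gss.get((('<A>', 0, 1), 0)) is node)  # prints True
```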

A wrapper for bookkeeping functions.

GLL+GSS add_thread (add)

Our add_thread increases a bit in complexity. We now check if a thread already exists before starting a new thread.

next_thread is the same as before

GLL+GSS register_return (create)

This method changes substantially. We now look for pre-existing edges before appending edges (child nodes).

GLL+GSS fn_return (pop)

A small change in fn_return. We now save all parsed indexes at every label when the parse is complete.
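
The three GSS-aware operations can be sketched together as follows. To stay self-contained, this sketch identifies each GSS node with its label (a hashable tuple) instead of a node object, which is a simplification and an assumption. The class name `SketchGSSStack` and the helper `_node` are hypothetical. The attribute `U` (the threads already seen at each input index) follows the way the recognizer consults it.

```python
class SketchGSSStack:
    def initialize(self, text):
        self.I = text + '$'   # assumption: '$' marks the end of input
        self.m = len(self.I)
        self.graph = {}       # node label -> list of child node labels
        self.P = {}           # node label -> indexes at which it was popped
        self.stack_bottom = self._node(('L0', 0))
        self.threads = []
        self.U = [set() for _ in range(self.m + 1)]

    def _node(self, label):
        self.graph.setdefault(label, [])
        self.P.setdefault(label, [])
        return label

    def add_thread(self, L, stack_top, cur_idx):        # add
        # Start a thread only if it was not seen at this index before.
        if (L, stack_top) not in self.U[cur_idx]:
            self.U[cur_idx].add((L, stack_top))
            self.threads.append((L, stack_top, cur_idx))

    def next_thread(self):
        return self.threads.pop(0)

    def register_return(self, L, stack_top, cur_idx):   # create
        v = self._node((L, cur_idx))
        if stack_top not in self.graph[v]:
            self.graph[v].append(stack_top)
            # The node may have been popped already: replay those pops
            # for the newly added edge.
            for k in self.P[v]:
                self.add_thread(L, stack_top, k)
        return v

    def fn_return(self, stack_top, cur_idx):            # pop
        if stack_top != self.stack_bottom:
            L, _ = stack_top
            self.P[stack_top].append(cur_idx)
            for child in self.graph[stack_top]:
                self.add_thread(L, child, cur_idx)
        return stack_top

stack = SketchGSSStack()
stack.initialize('a')
stack.add_thread('<A>', stack.stack_bottom, 0)
stack.add_thread('<A>', stack.stack_bottom, 0)  # duplicate, ignored
print(len(stack.threads))  # prints 1
```

The duplicate check in `add_thread` is what stops the left-recursive rule from spawning threads forever: a thread with the same label, stack node, and input index is never scheduled twice.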

With GSS, we finally have a true GLL recognizer. Here is the same recognizer unmodified, except for checking the parse ending. Here, we check whether the start symbol is completely parsed only when the threads are complete.

class GLLG1Recognizer(ep.Parser):
    def recognize_on(self, text, start_symbol):
        parser = self.parser
        parser.initialize(text)
        parser.set_grammar(
        {
         '<S>': [['<A>']],
         '<A>': [['<A>', 'a'],
                 []]
        })
        L, stack_top, cur_idx = start_symbol, parser.stack_bottom, 0
        while True:
            if L == 'L0':
                if parser.threads:
                    (L, stack_top, cur_idx) = parser.next_thread()
                    continue
                else: # changed
                    for n_alt, rule in enumerate(self.parser.grammar[start_symbol]):
                        if ( ((start_symbol, n_alt, len(rule)), parser.stack_bottom)
                            in parser.U[parser.m-1]):
                            parser.root = (start_symbol, 0, parser.m)
                            return parser
                    return []
            elif L == 'L_':
                stack_top = parser.fn_return(stack_top, cur_idx) # pop
                L = 'L0' # goto L_0
                continue

            elif L == '<S>':
                # <S>::=['<A>']
                parser.add_thread( ('<S>',0,0), stack_top, cur_idx)
                L = 'L0'
                continue

            elif L == ('<S>',0,0): # <S>::= | <A>
                stack_top = parser.register_return(('<S>',0,1), stack_top, cur_idx)
                L = '<A>'
                continue

            elif L == ('<S>',0,1): # <S>::= <A> |
                L = 'L_'
                continue

            elif L == '<A>':
                # <A>::=['<A>', 'a']
                parser.add_thread( ('<A>',0,0), stack_top, cur_idx)
                # <A>::=[]
                parser.add_thread( ('<A>',1,0), stack_top, cur_idx)
                L = 'L0'
                continue

            elif L == ('<A>',0,0): # <A>::= | <A> a
                stack_top = parser.register_return(('<A>',0,1), stack_top, cur_idx)
                L = "<A>"
                continue

            elif L == ('<A>',0,1): # <A>::= <A> | a
                if parser.I[cur_idx] == 'a':
                    cur_idx = cur_idx+1
                    L = ('<A>',0,2)
                else:
                    L = 'L0'
                continue

            elif L == ('<A>',0,2): # <A>::= <A> a |
                L = 'L_'
                continue

            elif L == ('<A>',1,0): # <A>::= |
                L = 'L_'
                continue

            else:
                assert False

Using it.

GLL Parser

A recognizer is of limited utility. We need the parse tree if we are to use it in practice. Hence, we will now see how to convert this recognizer to a parser.

Utilities.

We start with a few utilities.

Symbols in the grammar

Here, we extract all terminal and nonterminal symbols in the grammar.
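
A minimal sketch of such an extraction, assuming nonterminals are written in angle brackets. The return order (terminals first, then nonterminals) follows how `symbols()` is unpacked in `get_first_and_follow()` below. Counting the grammar keys themselves as nonterminals is a choice made for this sketch, so that the start symbol is included even if no rule refers to it.

```python
def is_nonterminal(symbol):
    # Assumption: nonterminals are written as '<name>'.
    return len(symbol) > 1 and symbol[0] == '<' and symbol[-1] == '>'

def symbols(grammar):
    """Return (terminals, nonterminals) used in the grammar, sorted."""
    terminals, nonterminals = set(), set()
    for key, rules in grammar.items():
        nonterminals.add(key)
        for rule in rules:
            for token in rule:
                if is_nonterminal(token):
                    nonterminals.add(token)
                else:
                    terminals.add(token)
    return sorted(terminals), sorted(nonterminals)

grammar = {
    '<S>': [['<A>']],
    '<A>': [['<A>', 'a'],
            []],
}
print(symbols(grammar))  # prints (['a'], ['<A>', '<S>'])
```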

Using it

First, Follow, Nullable sets

To optimize GLL parsing, we need the First, Follow, and Nullable sets. (Note we do not use this at present)

Here is a nullable grammar.

The definition is as follows.

def union(a, b):
    n = len(a)
    a |= b
    return len(a) != n

def get_first_and_follow(grammar):
    terminals, nonterminals = symbols(grammar)
    first = {i: set() for i in nonterminals}
    first.update((i, {i}) for i in terminals)
    follow = {i: set() for i in nonterminals}
    nullable = set()
    while True:
        added = 0
        productions = [(k,rule) for k in nonterminals for rule in grammar[k]]
        for k, rule in productions:
            can_be_empty = True
            for t in rule:
                added += union(first[k], first[t])
                if t not in nullable:
                    can_be_empty = False
                    break
            if can_be_empty:
                added += union(nullable, {k})

            follow_ = follow[k]
            for t in reversed(rule):
                if t in follow:
                    added += union(follow[t], follow_)
                if t in nullable:
                    follow_ = follow_.union(first[t])
                else:
                    follow_ = first[t]
        if not added:
            return first, follow, nullable

Using

First of a rule fragment.

(Note we do not use this at present) We need to compute the expected first character of a rule suffix.
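
One way to sketch this computation: walk the suffix left to right, collect the first set of each term, and stop at the first term that is not nullable. The function name `rule_suffix_first` and the sample grammar are assumptions. The `first` and `nullable` values below are computed by hand for that sample grammar, in the shape that `get_first_and_follow()` returns (every symbol, including each terminal, maps to a set).

```python
def rule_suffix_first(rule, dot, first, nullable):
    """First characters expected for the part of rule after position dot."""
    result = set()
    for token in rule[dot:]:
        result |= first[token]
        if token not in nullable:
            break   # this term must consume input, so later terms cannot start
    return sorted(result)

# Sample grammar (an assumption for this example):
#   <S> ::= <A> <B> c
#   <A> ::= a | (empty)
#   <B> ::= b | (empty)
first = {
    '<S>': {'a', 'b', 'c'},
    '<A>': {'a'},
    '<B>': {'b'},
    'a': {'a'}, 'b': {'b'}, 'c': {'c'},
}
nullable = {'<A>', '<B>'}
rule = ['<A>', '<B>', 'c']

print(rule_suffix_first(rule, 0, first, nullable))  # prints ['a', 'b', 'c']
print(rule_suffix_first(rule, 2, first, nullable))  # prints ['c']
```

If every term in the suffix is nullable, the result says nothing about what may follow the rule; the follow set of the defining nonterminal would be needed for that case.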

To verify, we define an expression grammar.

Using it

SPPF Graph

We use a data-structure called Shared Packed Parse Forest to represent the parse forest. We cannot simply use a parse tree because there may be multiple possible derivations of the same input string (possibly even an infinite number of them). The basic idea here is that multiple derivations (even an infinite number of derivations) can be represented as links in the graph.

The SPPF graph contains four kinds of nodes. The dummy node represents an empty node, and is the simplest. The symbol node represents the parse of a nonterminal symbol within a given extent (i, j). Since there can be multiple derivations for a nonterminal symbol, each derivation is represented by a packed node, which is the third kind of node. Another kind of node is the intermediate node. An intermediate node represents a partially parsed rule, containing a prefix rule and a suffix rule. As in the case of symbol nodes, there can be many derivations for a rule fragment. Hence, an intermediate node can also contain multiple packed nodes. A packed node in turn can contain symbol, intermediate, or dummy nodes.
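
The relationship between the four kinds of nodes can be sketched with plain data classes. This is only an illustration of the structure described above, not the classes defined in this post. All class and field names here are assumptions, including `rule_pos` (a nonterminal, alternative index, and dot position) and `pivot` (the input index at which a packed node splits its extent).

```python
from dataclasses import dataclass, field

@dataclass
class SPPFDummyNode:
    label: str = '$'          # stands for "nothing here"

@dataclass
class SPPFPackedNode:
    rule_pos: tuple           # (nonterminal, alternative index, dot position)
    pivot: int                # where the left child ends and the right begins
    children: list = field(default_factory=list)  # symbol, intermediate, or dummy

@dataclass
class SPPFSymbolNode:
    symbol: str
    i: int                    # start of the extent
    j: int                    # end of the extent
    packed: list = field(default_factory=list)    # one packed node per derivation

@dataclass
class SPPFIntermediateNode:
    rule_pos: tuple           # the partially parsed rule
    i: int
    j: int
    packed: list = field(default_factory=list)

# An ambiguous parse of 'a' with <S> ::= <A> | <B>, <A> ::= a, <B> ::= a.
# The single symbol node for <S> holds both derivations as packed nodes.
forest_root = SPPFSymbolNode('<S>', 0, 1)
forest_root.packed.append(
    SPPFPackedNode(('<S>', 0, 1), 0, [SPPFSymbolNode('<A>', 0, 1)]))
forest_root.packed.append(
    SPPFPackedNode(('<S>', 1, 1), 0, [SPPFSymbolNode('<B>', 0, 1)]))
print(len(forest_root.packed))  # prints 2
```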

SPPF Node

SPPF Dummy Node

The dummy SPPF node is used to indicate the empty node at the end of rules.

SPPF Symbol Node

i and j are the extents. Each symbol can contain multiple packed nodes, each representing a different derivation. See getNodeP. Note: in the presence of ambiguous parsing, we choose a derivation at random. So, run to_tree() multiple times to sample different parse trees. If you want a better solution, see the forest generation in the Earley parser, which can be adapted here too.
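
The random choice can be sketched as follows. This is a simplified stand-in, assuming each alternative derivation is stored directly as a list of child nodes and that trees use the `(symbol, children)` format. The class name `SketchSymbolNode` is hypothetical, and the sketch does not guard against cyclic forests, which arise with infinitely ambiguous grammars.

```python
import random

class SketchSymbolNode:
    def __init__(self, symbol, i, j):
        self.symbol, self.i, self.j = symbol, i, j
        self.packed = []   # each entry: the child nodes of one derivation

    def to_tree(self, rng=random):
        if not self.packed:
            return (self.symbol, [])   # a terminal, or an empty expansion
        children = rng.choice(self.packed)  # pick one derivation at random
        return (self.symbol, [child.to_tree(rng) for child in children])

# <S> ::= <A> | <B>, <A> ::= a, <B> ::= a, parsing 'a'.
leaf = SketchSymbolNode('a', 0, 1)
node_a = SketchSymbolNode('<A>', 0, 1)
node_a.packed.append([leaf])
node_b = SketchSymbolNode('<B>', 0, 1)
node_b.packed.append([leaf])
root = SketchSymbolNode('<S>', 0, 1)
root.packed = [[node_a], [node_b]]

# Each call returns one of the two possible derivation trees.
print(root.to_tree())
```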

SPPF Intermediate Node

An intermediate node has at most two children (and possibly only one).

SPPF Packed Node

The GLL parser

We can now build our GLL parser. All procedures change to include SPPF nodes. We first define our initialization