Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
For fuzzing, we often need to generate inputs that have a particular pattern,
and this pattern could be easier to specify using a regular expression than
using a context free grammar. For example, the URL can be described as:
Can we use such regular expressions as producers? As before, we start with
our prerequisites.
We import the following modules
System Imports
These are available from Pyodide, but you may wish to make sure that they are
installed if you are attempting to run the program directly on the machine.
sympy
Available Packages
These are packages that refer either to my previous posts or to pure python
packages that I have compiled, and is available in the below locations. As
before, install them if you need to run the program directly on the machine.
To install, simply download the wheel file (`pkg.whl`) and install using
`pip install pkg.whl`.
Since we want to convert a regular expression to a generator, it is necessary
to parse it first. The following describes the grammar of regular expressions.
A regular expression is basically a set of alternatives separated by |
<regex> ::= <cex>
| <cex> `|` <regex>
Each alternative is an expression that is a sequence of more basic expressions
<cex> ::= <exp>
| <exp> <cex>
The basic regular expression unit is a single character, standing for itself.
One may also have a bracket expression [...] which matches the list of
characters inside the brackets, or a single . which matches any character.
However, one can also have more complex units such as a parenthesized regex
(...), a basic expression followed by Kleene star * which stands for any
number of matches including none, and a basic expression followed by + which
stands for at least one match of the preceding basic expression.
Let us see if we can parse a small regular expression.
While a regular expression parse tree is technically sufficient to produce
a generator, there is a better solution. We know how to build very good
fuzzers with grammars. Hence, it is better to convert the regular expression
to a grammar first, and use one of our fuzzers.
Regular expression to context-free grammar
First the terminal symbols
Next, the class.
unitexp
We first define our basic unit. The exp converter, which delegates to other
converters
The most basic regular expression is the character itself. We convert
it to a nonterminal that defines the single character. That is,
a gets translated to
<X> ::= `a`
Using it
The next basic regular expression is the brackets, which matches any
character inside the brackets [...]. We convert
it to a nonterminal that defines the single character. The following
is the regex grammar.
Given the regex [abc], this regex is converted to the following grammar
<X> ::= `a` | `b` | `c`
Using it
Next, we define the <dot>. A dot matches any character. That is, terminal
symbol.
<dot> ::= a | b | ...
Using it
exp
Next, we define the <exp>
<exp> ::= <unitexp>
| <regexstar>
| <regexplus>
For <regexstar> the regular expression grammar is
<regexstar> ::= <unitexp> `*`
Given a regular expression a*, this gets translated to
<X> ::= a <X>
|
For <regexplus> the regular expression grammar is
<regexplus> ::= <unitexp> `+`
Given a regular expression a+, this gets translated to
<X> ::= a <X>
| a
Using it.
cex
One basic operation of regular expressions is concatenation. It matches
two patterns in sequence. We convert
concatenation to a rule containing two corresponding nonterminals.
The regular expression grammar is
<cex> ::= <exp>
| <exp> <cex>
Given a regular expression ab, this gets converted into
<X> := a
<Y> := b
<Z> := <X> <Y>
Using it
Next, we define our top level converter.
<regex> ::= <cex>
| <cex> `|` <regex>
Given a regular expression a|b, this gets converted into
<X> := a
<Y> := b
<Z> := <X>
| <Y>
Using it
Next, we define our
A parenthesis is simply a grouping construct that groups all
elements inside it as one.
<parenexp> ::= `(` <regex> `)`
Given a parenthesis expression (a), this gets translated to
<X> = a
Using it
At this point, we have our full grammar, and we can use it to generate inputs
as below