Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
I had previously discussed how one can implement
a programming language using big step semantics. In this post, I want to so
something similar. Here, we implement a meta-circular interpreter over Python.
The code is available here.
A meta-circular interpreter is an interpreter for a language that is written
in that language itself. The MCI can implement a subset or superset of the host
language.
Uses of a Meta Circular Interpreter
Why should one want to write a meta-circular interpreter? Writing such an
interpreter gives you a large amount of control over how the code is executed.
Further, it also gives you a way to track the execution as it proceeds.
Using the same machinery as the meta-circular interpreter, one can:
Write a concolic interpreter that tracks the concrete execution
Extract coverage
Extract the control flow graph, and execution path through the graph
Extract the call graph of execution
Write a symbolic execution engine
Write a taint tracker that will not be confused by indirect control flow
Extend the host language with new features
Reduce the capabilities exposed by the host language (e.g. no system calls).
I will be showing how to do these things in the upcoming posts.
First, we import everything we need.
The basic idea is to make use of the Python infrastructure as much as possible. That is,
we do not want to implement things that are not related to the actual interpretation.
Hence, we use the Python parsing infrastructure exposed by the ast module that parses
Python source files, and returns back the AST. AST (Abstract Syntax Tree) as the name
indicates is a data structure in the format of a tree.
Once we have the AST, we simply walk the tree, and interpret the statements as we find them.
The meta-circular-interpreter base
The walk() method is at the heart of our interpreter. Given the AST,
It iterates through the statements, and evaluates each by invoking the corresponding method.
If the method is not implemented, it raises a SynErr which is derived from SyntaxError.
We provide eval() which converts a given string to its AST, and calls
walk(). It is possible to write a parser of our own, as I have shown before
which can get us the AST. However, as I mentioned earlier, we use the Python infrastructure where possible.
The semantics class
The methods that correspond to particular AST elements may be interpreted
differently based on what semantics we are interested in. So, we define
a new class for semantics..
Modules
We first define modules because every other statement is enclosed in a module
when we use parse.
Module(stmt* body)
The complete AST is wrapped in a Module statement.
Example
An empty module with a comment.
The expressions
An expression is implemented as follows
Expr(expr value)
The Pythonic data structures.
We need to define data. For the primitive data types, we only implement string and number for now.
These are trivial as it is a direct translation of the AST values.
Constant(constant value, string? kind)
Example
Number(object n)
Example
Containers
Essentially, we want to be able to make use of all pythonic
container data structures such as lists, tuples, sets and
dictionaries. For demonstration, however, we have implemented
only list and tuple.
Unlike the primitives, the containers may be defined such that
the values inside them are result of evaluating some expression.
Hence, we first walk() the elements, and add the results.
List(elts)
Example
Tuple(elts)
Example
Containers provide the ability to access their contained items via Subscript.
The tuple and list provide a means to access its elements via
subscript. The subscript requires a special Index value as input, which is also defined below.
Similar to subscript for arrays, objects provide attribute access.
Attributes require symbol tables. Hence, we do not provide an example here.
Simple control flow statements
The return, break and continue are implemented as exceptions.
Implementation
Example
The difference between break and continue is in how they are handled in the
loop statemens as in While below. The return is handled in the Call part.
Major control flow statements
Only basic loops and conditionals – while() and if() are implemented.
While(expr test, stmt* body, stmt* orelse)
Implementing the While loop is fairly straight forward. The while.body is
a list of statements that need to be interpreted if the while.test is True.
The break and continue statements provide a way to either stop the execution
or to restart it.
As can be seen, these are statements rather than expressions, which means that
their return value is not important. Hence, we do not return anything.
Example
If(expr test, stmt* body, stmt* orelse)
The If statement is similar to While. We check if.test and if True,
execute the if.body. If False, we execute the if.orelse.
Example
The scope and symbol table
Now we come to a slightly more complex part. We want to define a symbol table. The reason
this is complicated is that the symbol table interacts with the scope, which is a
nested data structrue, and we need to provide a way to look up symbols in enclosing
scopes. We have a choice to make here. Essentially, what variables do the calling
program have access to? Historically, the most common conventions are lexical and
dynamic scoping.
The most intuitive is the lexical scoping convention. Hence, we implement lexical scoping,
but with a restriction: If we modify a variable in parent scopes, then the new variable is
created in current scope.
Hooking up the symbol table
We allow the user to load a pre-defined symbol table. We
have a choice to make here. Should we allow access to the
Python default symbol table? and if we do, what should form
the root? The Python symbol table or what the user supplied?
Here, we assume that the default Python symbol table is the
root.
The following statements use symbol table.
Name(identifier id, expr_context ctx)
Retrieving a referenced symbol is simple enough.
Example
Assign(expr* targets, expr value)
Python allows multi-target assignments. The problem is that, the type of the value received may be different based on
whether the statement is multi-target or single-target. Hence, we split both kinds.
Example
Call(expr func, expr* args, keyword* keywords)
During function calls, we need to make sure that the functions that
are implemented in C are proxied directly.
For others, we want to correctly bind the arguments and create a
new scope. The big question is how should the scopes be nested.
We use lexical scopes. So, we recover the symbol table used at the
time of definition, use it for the call, and reset it back to the
current one after the call.
The function definition itself is quite simple. We simply update the symbol table with the given values.
Note that because we implement lexical scoping, we have to maintain the scoping references during creation.
Example
Import(alias* names)
Import is similar to a definition except that we want to update the symbol table
with predefined values.
Arithmetic Expressions
The arithmetic expressions are proxied directly to corresponding Python operators.
Expr(expr value)
Example
A Complete Example
The source code of this notebook is available here