Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
In the previous post, I described how one can write
an interpreter for the Python language in Python. However, Python itself is not
implemented as a direct interpreter for the AST. Rather, Python AST is compiled
first, and turned to its own byte code. The byte code is interpreted by the
Python virtual machine. The Python virtual machine is implemented in C in the
case of CPython, in Java in the case of Jython, in WASM for Pyodide, and in
(reduced) Python in the case of PyPy.
The reason to use a virtual machine rather than directly interpreting the AST is
that, a large number of the higher level constructs map to a much smaller number
of lower level constructs. The lower level language (the bytecode) is also easier
to optimize, and is relatively more stable than the higher level language.
For our purposes, the lower level language also allows us to get away with
implementing our analysis techniques (such as taint analysis — to be discussed
in later posts) on a much smaller number of primitives.
This post shows how to implement a very tiny Python virtual machine.
For more complete implementations, refer to the
AOSA Book
or more complete and current Byterun or
my fork of byterun.
We start as usual by importing the prerequisite packages. In the case of our
virtual machine, we have only the dis module to import. This module allows us
to disassemble Python bytecode from the compiled bytecode. Note that the package
xdis may be a better module here (it is a drop
in replacement).
As in the MCI, we try to use as much of the
Python infrastructure as possible. Hence, we use the Python compiler to compile
source files into bytecode, and we only interpret the bytecode. One can
compile source
strings to byte code as follows.
Compilation
The mode can be one of eval – which evaluates expressions, exec –
which executes a sequence of statements, and single – which is a limited form
of exec. Their difference can be seen below.
compile(mode=eval)
That is, the return value is the result of addition.
compile(mode=exec)
The top of the stack is popped off on execution. That is,
it is treated as a statement. This mode is used for evaluating a series of
statements none of which will return a value when eval() is called.
using exec
compile(mode=single)
The single mode is a restricted version of exec. It is applicable for a
single line statement, which can even be constructed by stitching multiple
statements together with semicolons.
The main difference is in the return value. In the case of exec, the stack is
cleared before return, which means only the side effects remain.
In the case of single, the last value in the stack is printed before return.
As before, nothing is returned.
Arithmetic operations
Next, we define all the simple arithmetic and boolean operators as below.
Boolean operations
Similar to arithmetic operations, we define all logical operators.
The Code
The compiled bytecode is of type code.
It contains the following attributes
Since it is a read-only data structure, and we want to modify it, we will use a
proxy class Code to wrap it. We copy over the minimum attributes needed.
The virtual machine
As can be inferred from the dis output, the Python Virtual Machine is a stack
based machine. So we define a bare bones virtual machine that can use a stack
as below.
The return instruction is simply a pop of the stack.
Now, we define how a function is called. We need to extract the
function and args, and pass the execution to the called function.