The 3 different code emitters, part 2
Hello again, backers.
This is another technical update, with the aim of giving you more insight into how the 3 different code emitters -- byte code, native and viper -- work.
Let's consider the Python function:
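(The original listing isn't reproduced here. A minimal function matching the description -- it loops 8 million times, adds 2 to x each time, and uses locals x and i with a less-than comparison, as the generated code discussed below suggests -- might look like this sketch:)

```python
# Hypothetical reconstruction of the benchmarked function: loop
# 8 million times, adding 2 to x on each iteration. The explicit
# counter i and the less-than comparison match the generated code
# described below; the exact original may differ.
def f():
    x = 0
    i = 0
    while i < 8000000:
        x = x + 2
        i = i + 1
    return x
```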
It just loops 8 million times, each time adding 2 to the variable x. For the 3 different emitters we have the following code size and speed on the board:
- byte code: 41 bytes, 26.3 seconds
- native code: 76 bytes, 12.3 seconds
- viper code: 60 bytes, 1.1 seconds
Because this function does not call any other functions, the viper emitter has plenty of scope for optimisation. You can see that native code runs at more than twice the speed of byte code, viper is more than 10 times faster than native code, and viper is almost 24 times faster than byte code!
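The quoted speed-ups follow directly from the timings in the list above:

```python
# Timings in seconds, taken from the table above.
byte_code = 26.3
native = 12.3
viper = 1.1

# native is just over twice as fast as byte code; viper is more than
# 10 times faster than native and almost 24 times faster than byte code.
print(round(byte_code / native, 1))  # -> 2.1
print(round(native / viper, 1))      # -> 11.2
print(round(byte_code / viper, 1))   # -> 23.9
```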
To understand what is happening here, let's look at the actual code that is generated by the 3 emitters. First the byte code:
The left column is the byte count, then the name of the byte code operation, then optionally its argument. The native code looks like this:
The columns from left to right are: code offset in bytes, encoded machine instruction, human-readable machine instruction and argument. In blue I have indicated which byte codes this machine code is executing. The main thing to notice is that it calls the underlying C runtime for binary operations and comparisons (binary op 18 is add, compare op 0 is less-than). blx is the call instruction in Thumb assembly, and I use a lookup table (the ldr) to get the address of the appropriate C runtime function (this turned out to be the most efficient and compact way to do it). It uses registers r4 and r5 for the local variables x and i, and 8 words of stack space. Note that integers are stored in the upper 31 bits of a register, with the lower bit set to 1; hence 0 is encoded as 1, 2 is encoded as 5, and so on.
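That tagging scheme is easy to sketch: the integer value lives in the upper 31 bits of a machine word and the lowest bit is set to 1, so tagged integers can be told apart from word-aligned pointers (whose low bit is always 0). The function names below are illustrative, not Micro Python's actual internals:

```python
# Sketch of the 31-bit tagged small-integer encoding described above:
# shift the value left by one and set the low bit. Word-aligned
# pointers always have a low bit of 0, so the two can be distinguished
# at run time. (Function names here are illustrative.)

def encode_small_int(n):
    return (n << 1) | 1

def decode_small_int(word):
    return word >> 1

print(encode_small_int(0))  # -> 1
print(encode_small_int(2))  # -> 5
```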
Looking at this code, you can understand why native code is faster than byte code, but still not as fast as proper C code: we are essentially using the CPU to dispatch the byte codes, but still need to call the runtime functions to do the hard work, like add numbers.
Now for viper:
All the blx's are gone! The viper emitter knows that the local variables x and i are integers and so uses Thumb machine instructions to add and compare. It also stores integers using the full 32 bits of the register. It is over 10 times faster than the native code because it does not call any C runtime functions. The keen reader will notice that there are still some optimisations that can be done (eliminating mov's, and combining the compare with the jump), but I haven't gotten around to that.
All of this compilation is done on the microcontroller itself.
You may want to know how the above relates to JIT. Just In Time compilation (JIT) is a very sophisticated way of doing the above automatically (i.e., without the user needing to specify decorators) and while the code is running. JIT will analyse a function as it runs to determine which variables are, for example, integers, and then replace parts of the code which originally called the runtime with their equivalent machine instructions (among other things; take a look at PyPy if you're interested). This approach produces amazingly fast code, but at the cost of complexity. JIT code can also get quite large, and the execution time of the same function can change (due to on-the-fly optimisation). These things -- complexity of the compiler, large code size, non-deterministic function run times -- are not desirable on a microcontroller. This is why I chose to implement Micro Python with user-selectable emitters. The compiler and emitters are simple and compact (the same emitter is used for native and viper, but native ignores the type information), and the resulting compiled code is small and always runs at the same speed.
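For reference, selecting an emitter is done per function with a decorator. The sketch below shows the idea; the stub class is only there so the snippet also runs under standard Python, where the micropython module doesn't exist (there, all three functions just run as ordinary byte code):

```python
# Emitter selection is per function, via decorators. On standard
# Python there is no micropython module, so we fall back to no-op
# stub decorators and everything runs as ordinary byte code.
try:
    import micropython
except ImportError:
    class micropython:                    # stub for standard Python
        @staticmethod
        def native(f):
            return f
        @staticmethod
        def viper(f):
            return f

@micropython.native   # machine code, full Python semantics
def add_native(x, y):
    return x + y

@micropython.viper    # machine code with native integer types
def add_viper(x: int, y: int) -> int:
    return x + y

print(add_native(2, 3))  # -> 5
print(add_viper(2, 3))   # -> 5
```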
If you understood the above discussion, great! If not, don't worry, you don't need to know any of these nitty-gritty details to use Micro Python.