tags: tinyself, selflang, diy_programming_language, debugging, rpython, python, pypy
description: Description of how I gained performance in my programming language tinySelf. Rpython profiling and optimizations.


tinySelf is my pet programming language inspired by Self, which I am writing in my spare time. It was slow, so I've decided to speed it up a little. Last time (‣) was a massive success in term of speed gains, so I've had high hopes.

First of all, I've decided to run another profiling with callgrind. Here is the result from the profiler:


Unsurprisingly, it can be seen from the first glance, that 73.45% of the time is spent in the _do_send() method:


42% takes slot lookup, which is not much surprising. 34% of that takes parent lookup. Still, no surprise there.

A lot takes _create_intermediate_params_obj(). That kind of surprised me, because it was already optimized once. I will have to devise some method of caching of the parameters, maybe in the code context or something.

<aside> ⚠️ I am using several computers for development, from my notebook to my desktop, so don't look at the time in perf outputs, but at the number of crunched instructions.


Intermediate parent lookup cache

This was the first optimization which I've begun to implement. It has to do with mechanism how are parameters mapped to namespace of method calls.

Each time the _do_send() method is invoked and the message send results in code object, _push_code_obj_for_interpretation() method is called, which in turn calls Tail Call Optimization check and method _create_intermediate_params_obj(), used to map all parameters into one parent, which is then mounted as scope_parent for code object pushed at the top of the stack.

There is already some caching going on, to not to create empty parents with parameters, as it massively slows the interpreter when tail call optimization is working, but in case that this is not a recursive call, it still creates new objects for parameters. This is usually not an issue, but in while loops, new blocks are initialized and new objects for parameters are created in each loop.

Obvious solution is to cache the parameter object map for each block / code object call and create only clone.

So I've implemented it, but measuring the speed by time did not result into conclusive results. It looked like nothing improved.

I've realized, that I need some better tool than just measuring with time, because there is really a lot of stuff running in the background. I would like to see some more "scientific" results, like for example number of instructions that was executed.

Quick search on the google showed me that I can use perf stat.

Results for code without the new optimization:

$ sudo perf stat ./tSelf tests/scripts/simple_while_benchmark.self 

 Performance counter stats for './tSelf tests/scripts/simple_while_benchmark.self':

       4639,594198      task-clock (msec)         #    0,987 CPUs utilized          
                55      context-switches          #    0,012 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             9 863      page-faults               #    0,002 M/sec                  
    12 882 386 678      cycles                    #    2,777 GHz                    
    28 716 919 724      instructions              #    2,23  insn per cycle         
     5 525 961 270      branches                  # 1191,044 M/sec                  
        17 527 128      branch-misses             #    0,32% of all branches        

       4,700498870 seconds time elapsed