#lang:metadata tags: tinyself, selflang, diy_programming_language, debugging, rpython, python, pypy description: Description of how I gained performance in my programming language tinySelf. Rpython profiling and optimizations.
tinySelf is my pet programming language inspired by Self, which I am writing in my spare time. It was slow, so I've decided to speed it up a little. Last time (‣) was a massive success in terms of speed gains, so I had high hopes.
First of all, I decided to run another profiling session with callgrind. Here is the result from the profiler:
Unsurprisingly, it can be seen at first glance that 73.45% of the time is spent in the
Slot lookup takes 42%, which is not much of a surprise. Parent lookup accounts for 34% of that. Still, no surprise there.
A lot of time is taken by `_create_intermediate_params_obj()`. That kind of surprised me, because it was already optimized once. I will have to devise some method of caching the parameters, maybe in the code context or something.
⚠️ I am using several computers for development, from my notebook to my desktop, so don't look at the time in the `perf` outputs, but at the number of crunched instructions.
This was the first optimization I began to implement. It has to do with the mechanism by which parameters are mapped into the namespace of method calls.
Each time the `_do_send()` method is invoked and the message send results in a code object, the `_push_code_obj_for_interpretation()` method is called. That in turn runs the tail call optimization check and the `_create_intermediate_params_obj()` method, which maps all parameters into one parent object; this parent is then mounted as the `scope_parent` of the code object pushed on top of the stack.
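The mechanism can be sketched roughly as follows. This is a hypothetical illustration, not tinySelf's actual code; the class and function names here are simplified stand-ins for the real implementation:

```python
# Hypothetical sketch: parameters of a message send are collected into one
# intermediate parent object, which is mounted as the scope_parent of the
# called code object, so slot lookup finds them by walking the parent chain.

class Object(object):
    def __init__(self, parent=None):
        self.slots = {}
        self.scope_parent = parent

    def set_slot(self, name, value):
        self.slots[name] = value

    def lookup(self, name):
        # Walk the scope_parent chain until the slot is found.
        obj = self
        while obj is not None:
            if name in obj.slots:
                return obj.slots[name]
            obj = obj.scope_parent
        return None


def create_intermediate_params_obj(scope_parent, parameters, arguments):
    # Map all parameters into one freshly allocated parent object.
    params_obj = Object(parent=scope_parent)
    for name, value in zip(parameters, arguments):
        params_obj.set_slot(name, value)
    return params_obj
```

With this shape, a method body sees both its own parameters and outer slots through the same `lookup()` walk, which is why a new parameter object per call shows up so prominently in the profile.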
There is already some caching going on, so that empty parents with parameters are not created, because that massively slows down the interpreter when tail call optimization kicks in. But when the call is not recursive, new parameter objects are still created. This is usually not an issue, but in while loops, new blocks are initialized and new parameter objects are created on each iteration.
The obvious solution is to cache the parameter object map for each block / code object call and create only a clone.
So I implemented it, but measuring the speed with `time` did not yield conclusive results. It looked like nothing had improved.
I realized that I needed a better tool than just measuring with `time`, because there is really a lot of stuff running in the background. I wanted more "scientific" results, such as the number of instructions executed.
A quick Google search showed me that I can use `perf`.
Results for code without the new optimization:
```
$ sudo perf stat ./tSelf tests/scripts/simple_while_benchmark.self
4.676559

 Performance counter stats for './tSelf tests/scripts/simple_while_benchmark.self':

       4639,594198      task-clock (msec)         #    0,987 CPUs utilized
                55      context-switches          #    0,012 K/sec
                 0      cpu-migrations            #    0,000 K/sec
             9 863      page-faults               #    0,002 M/sec
    12 882 386 678      cycles                    #    2,777 GHz
    28 716 919 724      instructions              #    2,23  insn per cycle
     5 525 961 270      branches                  # 1191,044 M/sec
        17 527 128      branch-misses             #    0,32% of all branches

       4,700498870 seconds time elapsed
```