draft for new post about register allocation

2020-03-22 22:50:24 +02:00
parent 760690718a
commit ab9365f423
3 changed files with 160 additions and 0 deletions
--- a/14
+++ b/14
@@ -4,3 +4,17 @@
 require_relative 'config/application'

 Rails.application.load_tasks
+
+task :create_post  do
+  args = ARGV.dup
+  args.shift
+  args.each { |a| task a.to_sym do ; end }
+
+  today = Time.now
+  args.unshift today.day.to_s.rjust(2 , "0")
+  args.unshift today.month.to_s.rjust(2 , "0")
+
+  file = "app/views/posts/#{today.year}/_" + args.join("-") + ".haml"
+  # block form closes the file handle once the stub is written
+  File.open(file , "w") { |f| f << "%p start" }
+end
--- a/app/assets/images/register_alloc.png
+++ b/app/assets/images/register_alloc.png
--- a/app/views/posts/2020/_03-22-register-allocation-in-a-ruby-compiler.haml
+++ b/app/views/posts/2020/_03-22-register-allocation-in-a-ruby-compiler.haml
@@ -0,0 +1,146 @@
+%p
+  I had put off proper register allocation until now, and somehow the 100 commits
+  it took verify that i wasn't completely wrong to do so. I think there are quite a
+  few differences to the normal problem and solution, enough to warrant an explanation,
+  so here it goes: How RubyX does Static Single Assignment and Register Allocation
+  in a ruby compiler.
+
+%h2 Simple register allocation, the old way
+
+%p
+  To understand the problem, i'll explain the previous, much simpler, approach and its
+  problems. Then i'll explain the new way, with reference to what i read the "normal"
+  way is and the differences i think we have.
+
+%p
+  Simple register allocation is quite simply hardcoding at least some register names
+  into the generation of assembler-like code, and having very simple ways to deal with
+  the rest. In RubyX the assembler-like layer of the code
+  is called Risc, and that is basically a simplified ARM. ARM has registers from r0
+  through to r15.
+%p
+  Early on i made two decisions. The first was that the current Message object would
+  live in r0. This was one constant hardcoded throughout. The other decision was to design
+  Slot level instructions such that each instruction had use of all registers. To use registers
+  i implemented a simple stack, but that was reset after every Slot level instruction.
+
+%h2 Motivation for change
+%p
+  There are two major problems with the simple solution outlined above. The first is
+  sub-optimal now, the second restrictive in the future.
+%p
+  The current problem is actually many-fold. There is the obvious difficulty of keeping
+  track of registers as they are used and returned. While i thought that the by-hand
+  approach was not too bad, after finishing i checked, and the automatic way uses only half the
+  amount of registers. This is nevertheless peanuts compared to the second problem,
+  which is that the code is forever locked into a subset of the SlotMachine, ie
+  no optimisations can be done across SlotMachine Instruction borders. That is quite
+  serious, and sort of ties in with the third problem. Namely, there were some implicit
+  register assumptions being made, but they were implicit, ie could not be checked
+  and would only show up if broken.
+%p
+  But the real motivation for this work came from the future, or the lack of
+  expandability with this approach. Basically i found from benchmarking that inlining
+  would have to happen quite soon. Even if it would *only* be with more Macros at first.
+  Inlining with this super simple allocation would not only be super hard, but also
+  much less efficient than it could be. After all there are 10+ registers that one can
+  keep things in, thus avoiding reloading, and i already noticed constant loading
+  was a thing.
+
+%h2 SSA and Register Allocation
+.container
+  %p.full_width
+    =image_tag "register_alloc.png"
+%p
+  So then we come to the way that it is coded now. The basic idea (followed by most
+  compilers) is to assign new names for every new usage of a register. I'll talk
+  about that first, then the second step is something called liveliness analysis
+  (usually called liveness analysis in the literature), basically determining when
+  registers are not used anymore, and the third is the allocation.
+
+%h3 Static Single Assignment
+%p
+  A
+  =ext_link "Static Single Assignment form" , "https://en.wikipedia.org/wiki/Static_single_assignment_form"
+  of the Instructions is one where every register
+  name is assigned only once. Hence the Single Assignment. The static part means that
+  this single assignment is only true for a static analysis, so at run time the code
+  may assign many times. But it would be the **same** code doing several assignments.
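To make the "assigned only once" rule concrete, here is a minimal sketch of a static check. The `Instr` struct and the `ssa_violations` helper are made up for this illustration and are not RubyX classes; only the dotted register names follow the convention described in this post.

```ruby
# Hypothetical stand-in for an instruction (not a RubyX class):
# it writes at most one register name and reads any number of them.
Instr = Struct.new(:writes, :reads)

# Returns the register names that are assigned more than once in the
# static instruction list, ie the names that break single assignment.
def ssa_violations(instructions)
  counts = Hash.new(0)
  instructions.each { |instr| counts[instr.writes] += 1 if instr.writes }
  counts.select { |_name, count| count > 1 }.keys
end

EXAMPLE = [
  Instr.new("message.return_value", ["message"]),
  Instr.new("message.caller", ["message"]),
  Instr.new("message.return_value", ["message.caller"]), # second assignment
]

ssa_violations(EXAMPLE) # => ["message.return_value"]
```

A loop that executes the first instruction many times would still pass this check, since the check only looks at the static list.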
+%p
+  SSA is often achieved by first naming registers according to the variables they hold,
+  and deriving subsequent names by increasing a subscript index. This did not sound
+  very fitting. For one, RubyX does not really have "floating" variables (that later
+  may get popped on some stack), rather every variable is an instance variable.
+  Assigning to a variable does not create a new register value, but needs to be stored
+  in memory. In a concurrent environment it is not safe to bypass that.
+%p
+  New variables may be created by "traversing" into an instance of the object (if the type is
+  known of course). This led to a dot syntax naming convention, where almost every
+  variable starts off from *message* or as a constant, eg "message.return_value". This
+  is as "single" as we need it to be (i think), and implementing this naming scheme
+  was about half the work.
+%p
+  The great benefit from this renaming of registers is that even the risc code is still
+  quite readable, which is great for both debugging and tests. This is because
+  the registers now have meaningful names (instance variable names), and it is always
+  clear where a value came from.
+
+%h3 Liveliness
+%p
+  The next big step was to determine liveliness of registers. This is something i have not
+  found good literature on, and in documents about Register Allocation it is often
+  taken as the starting point.
+%p
+  Basic reasoning led me to believe that a simple backward scan is at least a safe
+  estimate. Imagine going through the list of instructions and marking the
+  first occurrence of any register (name) use. By going backwards through the list
+  you thus get the last usage, and that is the point where we can recycle that
+  register.
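The backward scan can be sketched in a few lines. This is only an illustration under assumptions: `Instr` and `last_uses` are invented names, not the RubyX implementation, and the scan deliberately ignores branches, just like the simple estimate described above.

```ruby
# Hypothetical stand-in for an instruction (not a RubyX class):
# it writes at most one register name and reads any number of them.
Instr = Struct.new(:writes, :reads)

# Walks the list backwards and records, for every register name, the index
# of the first instruction that mentions it. Seen from the front, that is
# the last use of the name, ie the point where its register can be recycled.
def last_uses(instructions)
  last = {}
  (instructions.length - 1).downto(0) do |index|
    instr = instructions[index]
    ([instr.writes] + instr.reads).compact.each do |name|
      last[name] = index unless last.key?(name)
    end
  end
  last
end

EXAMPLE = [
  Instr.new("a", []),         # 0
  Instr.new("b", ["a"]),      # 1
  Instr.new("c", ["a", "b"]), # 2
  Instr.new("d", ["c"]),      # 3
]

last_uses(EXAMPLE) # => {"d"=>3, "c"=>3, "a"=>2, "b"=>2}
```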
+%p
+  I spent some time trying to figure out if backward branches change the fact that
+  you can release the register, but could not come to a conclusion (brain melt
+  every time i tried). Intuitively i think that you can, because on the first
+  run through such a loop you could not use results from a register that, because of
+  the ssa, would have had to be created later, but there you go. Even rereading that
+  hurts. My final argument was that a backward jump is a while loop, and a ruby
+  while loop would have to store its data in ruby variables and not new registers
+  (or so i hope).
+%p
+  I did read about phi nodes in ssa's and i did not implement that. Phi nodes are a way to
+  ensure that different branches of an if produce the same registers, or that the same
+  registers are meaningfully filled after the merge of an if. My hope is
+  that the ruby variable argument gets us out of that, and for some risc functions
+  i added some transfers to ensure things work as they should.
+
+%h3 Register Allocation
+%p
+  The actual
+  =ext_link "Register Allocation", "https://en.wikipedia.org/wiki/Register_allocation"
+  is not substantially more complicated now than before. But there is a good
+  base now to do more analysis and optimisations.
+%p
+  So basically we go through the instruction sequence and assign registers in
+  order. But because of the liveliness analysis, we can release registers after their
+  last use, and reuse them immediately of course. I noticed that this results in
+  surprisingly many registers being used only for a single instruction.
+  And the total number of registers used went **down by half**.
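As a rough sketch of that in-order assignment with release after last use: the code below assumes SSA-style names (each name written once), a made-up `Instr` struct and `allocate` helper, and an arbitrary pool of register names. It is not the RubyX code, and it does not handle running out of registers beyond raising.

```ruby
# Hypothetical stand-in for an instruction (not a RubyX class):
# it writes at most one register name and reads any number of them.
Instr = Struct.new(:writes, :reads)

# Assigns a physical register to every written name, in instruction order.
# A register goes back to the free list right after the last use of its name.
def allocate(instructions, registers)
  # last use per name; overwriting on a forward pass keeps the final index,
  # which gives the same result as the backward scan
  last = {}
  instructions.each_with_index do |instr, index|
    ([instr.writes] + instr.reads).compact.each { |name| last[name] = index }
  end

  free = registers.dup
  assigned = {}
  instructions.each_with_index do |instr, index|
    # release operands that are read here for the last time,
    # so the result of this instruction may reuse one of them
    instr.reads.uniq.each do |name|
      register = assigned[name]
      free.unshift(register) if register && last[name] == index
    end
    next unless instr.writes
    raise "out of registers" if free.empty?
    assigned[instr.writes] = free.shift
    # a value that is never read again dies immediately
    free.unshift(assigned[instr.writes]) if last[instr.writes] == index
  end
  assigned
end

EXAMPLE = [
  Instr.new("a", []),
  Instr.new("b", ["a"]),
  Instr.new("c", ["a", "b"]),
  Instr.new("d", ["c"]),
]

allocate(EXAMPLE, %w[r1 r2 r3 r4])
# => {"a"=>"r1", "b"=>"r2", "c"=>"r2", "d"=>"r2"}
```

In this toy input four names fit into two registers, because "c" can take over a register from its own operands and "d" does the same with "c".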
+
+%h2 The future
+%p
+  As i said, this was mostly for the future, so what is the future going to hold?
+%p
+  Well, **inlining** is number one, that's for sure.
+%p
+  But also there is something called escape analysis. This basically means reclaiming
+  objects that get created in a method, but never passed out. A sort of immediate GC,
+  thus not only saving on GC time, but also on allocation. Mostly though it would
+  allow running larger benchmarks, because there is no GC and integers get created
+  a lot before a meaningful number of milliseconds has elapsed.
+%p
+  On the register front, some low-hanging fruit are redundant transfer elimination and
+  double load elimination. Since methods have not grown to exhaust registers,
+  unloading registers is not done yet, and thus is looming. Which brings with it code cost
+  analysis.
+%p
+  So much more fun to be had!!