unification strikes

2017-01-10 12:44:06 +02:00
parent 1fc8169a59
commit 06220e3735
3 changed files with 135 additions and 35 deletions
--- a/_posts/2017-01-10-integer-unification.md
+++ b/_posts/2017-01-10-integer-unification.md
@ -0,0 +1,122 @@
 ---
 layout: news
 author: Torsten
 ---
 I just read that MRI 2.4 "unifies" Fixnum and Bignum into Integer. This, it turns out, is something quite
 different from what i thought: it is mostly about which class names are returned,
 and that it is ok to have two implementations for the same class, Integer.
 But even if it wasn't what i thought, it did spark an idea, and i hope a solution to a problem
 that i have seen lurking ahead. Strangely, the solution may be even more radical than the
 cross function jumps it replaces.
 ## A problem lurking ahead
 As i have been thinking more about what happens when a type changes, i noticed something:
 An object may change its type in one method (A), but may be used in a method (B), far up the call
 stack. How does B know to treat the object differently? Specifically, the calls B makes
 on the object are determined by the type before the change. So they will be wrong after the change,
 and so B needs to know about the type change.
 Such a type change was supposed to be handled by a cross method jump, thus fixing the problem
 in A. But the propagation to B is cumbersome, as there can be any number of such callers.
 Anything that i thought of is quite a bit too involved. And this is before even thinking about closures.
 ## A step back
 Looking at this from a slightly higher vantage point, there is maybe one thing too many that i have
 been trying to avoid.
 The first one was the bit-tagging: the ruby (and smalltalk) way of tagging an integer
 with a marker bit, thus losing a bit and gaining a gazillion type checks. In MRI C land
 an object is a VALUE, and a VALUE is either a tagged integer or a pointer to an object struct.
 So on **every** operation the bit has to be checked. Both of these i've been trying to avoid.
 So that led to a system with no explicit information in the lowest level representation and
 thus a large dance to have that information in an external type system and to keep that type
 information up to date.
 Of course the elephant in the room here is that i have also been trying to avoid making integers and
 floats objects, ie keeping their c, or machine, representation, just like everyone else before me.
 Too wasteful to even think otherwise.
 ## And a step forward
 The inspiration that came from reading about the unification of integers was exactly that:
 **to unify integers**. Unifying with objects, ie **making integers objects**.
 I have been struggling with the dichotomy between integers and objects for a long time. There always
 seemed to be something fundamentally wrong there. Ok, maybe if the actual hardware did the tagging
 and that continuous checking, then maybe. But otherwise: one is a direct, the other an indirect
 value. It just seemed wrong.
 Making Integers (and floats etc) first class citizens, objects with a type, resolves the chasm
 very nicely. Of course it does so at a price, but i think it will be worth it.
 ## The price of Unification
 Initially i wanted to make all objects the size of a cache line or multiples thereof. This is
 something i'll have to let go of: Integer objects should naturally be 2 words, namely the type
 and the actual value.
 So this is doubling the amount of ram used to represent integers. But maybe worse, it makes them
 subject to garbage collection. Both can probably be alleviated by having the first 256 integers
 pinned, ie kept in a fixed array, but still.
 Also using a dedicated memory manager for them and keeping a pool of unused ones as a linked list
 should make it quick. And of course the main hope lies in the fact that your average program
 nowadays (especially oo) does not really use integers all that much.
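
A minimal sketch of the pinned array and the pool of unused integer objects described above, modelled in plain ruby. All names (`PooledInt`, `IntPool`, `obtain`, `release`) are hypothetical, and reusing the value word as the link of the free list is an assumption of this sketch, not something the post specifies.

```ruby
# Hypothetical sketch: PooledInt and IntPool are illustrative names,
# not classes from the actual code base.
class PooledInt
  attr_reader :type
  # The second word: the raw value while the object is live,
  # the next unused object while it sits in the pool (an assumption).
  attr_accessor :word

  def initialize(word)
    @type = :Integer
    @word = word
  end
end

class IntPool
  PINNED = 256
  attr_reader :allocations # counts fresh allocations outside the pinned range

  def initialize
    # The first 256 integers live in a fixed array and are never released.
    @pinned = Array.new(PINNED) { |i| PooledInt.new(i) }
    @free_head = nil
    @allocations = 0
  end

  # Hand out an integer object, preferring pinned and recycled ones.
  def obtain(value)
    return @pinned[value] if value.between?(0, PINNED - 1)
    obj = @free_head
    if obj
      @free_head = obj.word # unlink from the free list
      obj.word = value
    else
      @allocations += 1
      obj = PooledInt.new(value)
    end
    obj
  end

  # Put an object back at the head of the free list.
  def release(obj)
    return if pinned?(obj)
    obj.word = @free_head
    @free_head = obj
  end

  private

  def pinned?(obj)
    w = obj.word
    w.is_a?(Integer) && w.between?(0, PINNED - 1) && @pinned[w].equal?(obj)
  end
end
```

With this shape, asking twice for a small value returns the very same object, and a released object is handed out again before anything new is allocated.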
 ## OO to the rescue
 Of course this is not the first time my thoughts have strayed that way. There are two reasons why
 they quickly scuttled back home to known territory before. The first was the automatic optimization
 reflex: why use 2 words for something that can be done in one, and all that gc on top.
 But the second was probably even more important: If we have the value inside the object
 (as a sort of instance variable or array element), then when we return it we have the "naked"
 integer wreaking havoc in our system, as the code expects objects everywhere.
 And if we don't return it, then how do operations happen, since machines only operate on values?
 The thing that i had not considered is that this line of thinking mixes up the levels
 of abstraction. It assumes a lower level than one needs: What is needed is that the system
 knows about integer objects (in a similar way that the other approach assumes knowledge of integer
 values).
 Concretely the "machine", or compiler, needs to be able to perform the basic Integer operations,
 on the Integer objects. This is really not so different from it knowing how to perform the
 operations on two values. It just involves getting the actual values from the object and
 putting them back.
 OO helps in another way that never occurred to me. **Data hiding:** we never actually pass out
 the value. The value is private to the object and not accessible from the outside. In fact it is not
 even accessible from the inside, to the object itself. Admittedly this means more functionality in
 the compiler, but since that is a solved problem (see builtin), it's ok.
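
To make the data hiding idea concrete, here is a small sketch in plain ruby. `IntegerObject`, `Builtin.int_add` and `Builtin.int_equal?` are hypothetical names chosen for illustration. `instance_variable_get` stands in for the compiler reading the second word of the object, which real builtin code would do at the machine level.

```ruby
# Hypothetical sketch: these names are illustrative, not the project's API.
class IntegerObject
  def initialize(value)
    @type = :Integer # first word: the type
    @value = value   # second word: the raw value, deliberately without a reader
  end

  # The ruby level operation only ever sees and returns objects.
  def +(other)
    Builtin.int_add(self, other)
  end
end

# Stands in for the compiler's builtin layer, the only place that
# touches raw values.
module Builtin
  def self.int_add(left, right)
    raw = left.instance_variable_get(:@value) + right.instance_variable_get(:@value)
    IntegerObject.new(raw) # put the result back into an object
  end

  def self.int_equal?(left, right)
    left.instance_variable_get(:@value) == right.instance_variable_get(:@value)
  end
end
```

Code that uses `IntegerObject.new(2) + IntegerObject.new(3)` gets an `IntegerObject` back and never holds a "naked" integer, which is the point made above.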
 ## Unified method caching
 So having gained this unification, we can now determine the type of an object very very easily.
 The type will *always* be the first word of the memory that the object occupies. We don't have
 immediate values anymore, so always is always.
 This is *very* handy, since we have given up being god and thus knowing everything at any time.
 In concrete terms this means that in a method, we can *not* know what type an object is.
 In fact it's worse: even if we have checked the type, we can't say what it is any more once we
 have passed the object as an argument to another method.
 Luckily programs are not random, and it is quite rare for an object to change type, and so a given
 object will usually have one of a very small set of types. This can be used to do method caching.
 Instead of looking up the method statically and calling it unconditionally at run-time, we will
 need some kind of lookup at run-time.
 The lookup tables can be objects that the method carries. A small table (3 entries) with pairs of
 type vs jump address. A little assembler to go through the list and jump, or in case of a miss
 jump to some handler that does a real lookup in the type.
 In a distant future a smaller version may be created. For the case where the type has been
 checked already during the method, a further check may be inlined completely into the code and
 only revert to the table in case of a miss. But that's down the road a bit.
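
The lookup table can be sketched in plain ruby as follows. This is only an illustration under stated assumptions: `CallSiteCache` is a hypothetical name, `receiver.class` stands in for reading the type from the first word of the object, an `UnboundMethod` stands in for the jump address, and filling the table on a miss with first-in-first-out eviction is a policy chosen for this sketch, not one the post prescribes.

```ruby
# Hypothetical sketch of a per call site cache with 3 entries.
class CallSiteCache
  SIZE = 3
  attr_reader :misses

  def initialize(method_name)
    @method_name = method_name
    @entries = [] # pairs of [type, target]
    @misses = 0
  end

  def call(receiver, *args)
    type = receiver.class # stands in for "first word of the object"
    entry = @entries.find { |cached_type, _| cached_type.equal?(type) }
    target = entry ? entry[1] : miss(type)
    target.bind(receiver).call(*args)
  end

  private

  # The miss handler: do the real lookup in the type and remember it.
  def miss(type)
    @misses += 1
    target = type.instance_method(@method_name)
    @entries.shift if @entries.size >= SIZE # assumption: evict the oldest pair
    @entries << [type, target]
    target
  end
end
```

A call site that keeps seeing the same type pays for the real lookup once. For example, `CallSiteCache.new(:to_s)` called with two integers and then a symbol records two misses in total.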
 Next question: How does this work with Parfait. Or the interpreter??
--- a/index.html
+++ b/index.html
@ -7,7 +7,7 @@ layout: site
 		<div>
 	    <p class="center">
 				<span>
-					Interpreting code is like checking a map at every step: It can really slow you down.
+					Putting wings on ruby to let you fly (may take X years).
 	   		</span>
 			</p>
 		</div>
@ -18,12 +18,12 @@ layout: site
  <div class="span4">
    <h2 class="center">Goal</h2>
 		<p>
-			The goal is to execute (not interpret) object oriented code without external dependencies, on modern hardware.
+			The goal is to execute (not interpret) object oriented code without external dependencies,
 			on modern hardware.
    </p>
 		<p>
-			This means compiling dynamic code into binary. Using several intermediate representations it
+			This means compiling dynamic code into binary. Using type knowledge at run-time we
-			is possible to keep track of type changes and switch between differently typed, but
+			optimise and cache method dispatch for known types.
 			logically equivalent, versions of methods.
 			As the system is 100% in ruby, the ultimate goal is to carry on the compilation at run-time,
 			ie after the program has started.
@ -44,7 +44,7 @@ layout: site
 			<a href="https://github.com/whitequark/parser"> ruby parser</a> to create:
 			<ul>
 				<li> An Object model of  <a href="/typed/parfait.html">classes, types</a>, methods and basic types </li>
-				<li> Several strongly typed method versions for every ruby instance method </li>
+				<li> Methods for every type (may be several per class) </li>
 			</ul>
 		</p>
 		<p>
@ -52,8 +52,8 @@ layout: site
 			While it has well known typed language data semantics, it introduces several new concept:
 			<ul>
 				<li> Object based memory (no global memory) </li>
 				<li> Multiple implementations per function based on type  </li>
 				<li> Object oriented calling semantics (not stack based) </li>
 				<li> Inline method caching.  </li>
 				<li> <a href="https://github.com/ruby-x/ruby/tree/master/lib/register" target="_blank">Register machine abstraction</a></li>
 				<li> Extensible instruction set, with arm implementations
 			</ul>
--- a/rubyx/layers.md
+++ b/rubyx/layers.md
@ -25,7 +25,7 @@ Top down the layers are:
 - **Melon** , compiling ruby code into typed layer and includes bootstrapping code
 - **Typed intermediate layer:** Statically typed object oriented with object oriented
-call semantics.
+  call semantics.
 - **Risc register machine abstraction** provides a level of machine abstraction, but
              as the name says, quite a simple one.
@ -40,21 +40,17 @@ a difficult task, it has already been implemented in pure ruby
 [here](https://github.com/whitequark/parser). The output of the parser is again
 an ast, which needs to be compiled to the typed layer.
-The dynamic aspects of ruby are actually reltively easy to handle, once the whole system is
+The dynamic aspects of ruby are actually relatively easy to handle, once the whole system is
 in place, because the whole system is written in ruby without external dependencies.
 Since (when finished) it can compile ruby, it can do so to produce a binary. This binary can
 then contain the whole of the system, and so the resulting binary will be able to produce
 binary code when it runs. With small changes to the linking process (easy in ruby!) it can
 then extend itself.
-The type aspect is more tricky: Ruby is not typed and but the typed layer is after all. And
+The type aspect is more tricky: Ruby is not typed but the typed layer is after all.
-if everything were objects (as we like to pretend in ruby) we could just do a lot of
+But since everything is an object (yes, integers and floats are first class citizens too)
-dynamic checking, possibly later introduce some caching. But everything is not an object,
+we know the type of any object at any time and can check it easily.
-minimally integers are not, but maybe also floats and other values.
+Easy checks also make inline method jump tables relatively easy.
 The distinction between what is an integer and what an object has sprouted an elaborate
 type system, which is (by necessity) present in the typed layer.
 ### Typed intermediate layer
@ -68,26 +64,8 @@ In broad strokes it consists off:
                  create a binary with the required information to be dynamic
 - **Builtin:**  A very small set of primitives that are impossible to express in ruby
 The idea is to have different methods for different types, but implementing the same ruby
 logic. In contrast to the usual 1-1 relationship between a ruby method and it's binary
 definition, there is a 1-n.
 The typed layer defines the Type class and BasicTypes and also lets us return to different
 places from a function. By using this, we can
 compile a single ruby method into several typed functions. Each such function is typed, ie all
 arguments and variables are of known type. According to these types we can call functions according
 to their signatures. Also we can autognerate error methods for unhandled types, and predict
 that only a fraction of the possible combinations will actually be needed.
 Just to summarize a few of typed layer features that are maybe unusual:
 - **Message based calling:** Calling is completely object oriented (not stack based)
                              and uses Message and Frame objects.
 - **Return addresses:**  A method call may return to several addresses, according
                          to type, and in case of exception
 - **Cross method jumps** When a type switch is detected, a method may jump into the middle
                            of another method.
 ### Register Machine