%p
  Of course the
  =link_to "architecture" , "/rubyx/layers.html"
  gives a good overview of the system as it is. But it does not explain how we got
  there. And sometimes knowing the journey makes it easier to understand where
  one is. So I shall try to highlight the four or five main ideas and decisions
  that shaped it.
%h2 MacBook + Ruby == Raspberry Pi
%p.full_width
  =image_tag "mac_plus.png"
  When I bought my first 30 Euro Pi, I noticed that Ruby is unusable on it.
  Looking at how slow Ruby actually is, it occurred to me that Ruby just about turns
  the Pi into my first 286 laptop (running at 6MHz), which is the same as turning my
  MacBook Pro into a Pi.
%p
  Of course, while working on web apps, which can be parallelized so easily, and with
  a company paying for both developers and hardware, the standard Ruby argument holds.
  But since I wanted to use my Pi for demanding projects, something had to be done.
%h2 Judy, or the importance of the CPU cache
%p
  =ext_link "Judy" , "http://judy.sourceforge.net/"
  is a really, really fast digital tree, a kind of hash. I actually built an in-memory
  database with it that was also really, really fast. When connecting it to Rails I
  ran into the above problem: the niceties of ActiveRecord (Ruby) brought the performance
  of my extension (C) down by a factor of 40.
%p
  But anyway, the point is that Judy's speed is based on a radical optimisation for
  cache lines (and key compression). This means all data structures are exactly one CPU
  cache line big. As I learned, CPUs do not access memory in word sizes, but always
  a cache line at a time. This basically led to RubyX's memory model:
  fixed-size objects, in multiples of a cache line.
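%p
  To make that concrete, here is a minimal sketch in plain Ruby (not RubyX code)
  of rounding an object size up to whole cache lines. The 64-byte line size is an
  assumption for the example, not a RubyX constant.
%pre
  :preserve
    CACHE_LINE = 64   # assumed line size for this sketch

    # round a byte count up to the next multiple of a cache line,
    # so no object ever straddles a line boundary
    def padded_size(bytes)
      ((bytes + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE
    end

    padded_size(40)    # => 64  (one cache line)
    padded_size(100)   # => 128 (two cache lines)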
%h3 Microkernel

%p
  As a young engineer, I thought, like my peers, that Linux (then 0.93) was the greatest
  thing. Only much later did I learn that it is really just a copy, and that the reason
  it got popular was not technical but licensing (the same reason it is in Android, I
  believe). The reason it stayed popular is inertia; in other words, writing device
  drivers is hard.
%p
  =ext_link "Synthesis," , "https://en.wikipedia.org/wiki/Self-modifying_code#Massalin's_Synthesis_kernel"
  =ext_link "L4,", "https://en.wikipedia.org/wiki/L4_microkernel_family"
  and
  =ext_link "Minix," , "http://www.minix3.org/"
  are good proof that the superior architecture is the microkernel. For example, L4 can run
  another OS as an application with about 4% performance degradation, and Minix can
  recover from a device driver failure.
%p
  This, plus the fact that we have Bundler, brought me to the approach:
  if you can leave it out, do. Much of the functionality that is in Ruby (MRI)
  will never be in RubyX, but rather be supplied by gems.
%h3 System interrupts

%p
  In the beginning I was of course contemplating how much of the C-based ecosystem I would use.
  Like LLVM, which is of course a great tool, though made for C-ish applications.
  Or libc, which again is really there for C programs to access the kernel.
%p
  The sheer size of the functionality one inherits almost swayed me, even though I had long
  since determined that one of Ruby's biggest flaws, its standard library, came from modelling
  and using libc.
%p
  Then I learned assembler, looked at libc implementations, and learned what I
  believe made the decision: kernel calls are not really calls at all. They are
  software interrupts, which basically means you fill some registers, flick the
  switch, and with the next instruction you collect the result from a specified
  register. This may look like a call, and of course, by using libc it is presented
  as a call, but it is not. It is a very small set of assembler instructions.
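%p
  To illustrate (a sketch of the idea, not RubyX output): the complete "call" into
  the kernel for write on 32-bit ARM Linux is the handful of instructions below,
  written out here as plain Ruby data. The convention is that r7 holds the call
  number, the arguments go in r0 to r2, and the result comes back in r0.
%pre
  :preserve
    # the whole kernel "call": fill registers, trigger the interrupt, read r0
    WRITE_CALL = [
      "mov r7, #4        @ syscall number: write",
      "mov r0, #1        @ first argument: file descriptor (stdout)",
      "ldr r1, =buffer   @ second argument: address of the bytes",
      "mov r2, #6        @ third argument: how many bytes",
      "svc #0            @ flick the switch: the kernel takes over",
      "                  @ next instruction: the result is waiting in r0",
    ]
    puts WRITE_CALL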
%p
  For me this meant there is very, very little benefit in using C, either in its
  libc form, or as assembler/linker (I had found a Ruby gem to do that easily),
  or, maybe most importantly, the C calling convention. All of these things
  are great for C programs, but they are just not made for dynamic languages,
  and adopting them would have brought a whole slew of problems.
%h3 Return address is a parameter

%p
  In C calling (and probably in other languages too), the return address is determined
  by the caller, usually by pushing the pc onto the stack. But ARM has a different
  way, an instruction called Branch with Link, which actually stores the pc in a
  separate register called the link register.
%p
  And this made me realise that, really, the return address is always a parameter
  to a function. Like other parameters, it uses a register. It is the C way to
  hide this implicit parameter, much in the same way it is the OO way to hide
  the self parameter.
%p
  By this time I was already coding some rudimentary calling convention, and it
  did not take long to verify this in code. It is in fact quite easy to determine
  the return address at compile time and pass it explicitly (easy if one does not
  use a C linker, that is).
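%p
  A hypothetical plain-Ruby illustration of the same idea (not RubyX code): the
  place to continue at is passed in like any other argument, here as a callable.
%pre
  :preserve
    # "returning" is just jumping to the explicitly passed return address
    def add(a, b, return_to)
      return_to.call(a + b)
    end

    add(1, 2, ->(result) { puts result })   # prints 3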
%h3 OO calling convention

%p
  Another thing that deterred me from C is the way it uses the stack. It is so
  completely not OO, and cryptic. It is, in other words, very difficult to unwind,
  and makes it almost impossible to implement closures.
%p
  Since the assembly work had progressed easily, I made performance tests with an OO
  calling convention and determined that the price would be
  =link_to "about 50%." , "/misc/soml_benchmarks.html"
  Since currently the gap is more than an order of magnitude, this seemed OK,
  given that it would make the compilation process so much easier.
%p
  The resulting calling convention uses normal Message objects that form a linked
  list, rather than a stack. Since they are completely standard objects, manipulating
  them at both run time and compile time is totally integrated.
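%p
  A hypothetical sketch of the shape of this (plain Ruby, not the actual RubyX
  classes): call frames as ordinary objects that link back to their caller.
%pre
  :preserve
    # each call builds a Message linked to the current one;
    # "unwinding" is just following the caller links
    class Message
      attr_accessor :caller, :receiver, :name, :arguments, :return_value
      def initialize(caller, receiver, name, arguments)
        @caller, @receiver, @name, @arguments = caller, receiver, name, arguments
      end
    end

    main = Message.new(nil, :main_object, :main, [])
    call = Message.new(main, :printer, :print, ["hello"])
    call.return_value = "hello"
    puts call.caller == main   # => true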
%p
  Function calling has been working for years, and recently I cracked dynamic method
  dispatch too, which was not that hard really. Currently the work is progressing towards
  blocks, and the clear structure does help a lot.
  And while exceptions (or bindings) are not started, I think they will come with
  relative ease (compared to the C way), since the structures are very simple.
%h2 Decisions that affect the future
%h3 Metasm

%p
  I gave
  =ext_link "Metasm" , "https://github.com/jjyg/metasm/"
  several long looks. After all, it has an assembler and disassembler for at least
  ten CPUs, and support for several binary formats, including ELF. The
  reason not to use it was not that it is big (including much we don't need),
  but rather that it is unmaintained and unresponsive.
%p
  It would be great to split all that code into several gems: a core, and one
  per CPU / binary format / assembly / disassembly. Only the core would need
  to be integrated into RubyX, and one could just use the platform-specific
  gems. But the decision was that I am not the one to do this work.
%h3 Lock-free concurrency

%p
  Concurrency will have to be part of the core, even if it is just to get a GC
  working. The work that
  =ext_link "Massalin did" , "http://valerieaurora.org/synthesis/SynthesisOS/abs.html"
  already showed how effective lock-free
  concurrency is, but Dr. Cliff Click took it into the modern (Java) world by
  publishing a
  =ext_link "lock free hash" , "https://www.youtube.com/watch?v=HJ-719EGIts"
  that he later ran on some crazy machine with 800 CPUs.
%p
  I am not sure whether it will be better to port the Java code or to try a
  =ext_link "diy" , "https://preshing.com/20130605/the-worlds-simplest-lock-free-hash-table/"
  version. And of course, to even get started on this, RubyX will need the
  compare-and-swap primitives that underlie the lock-free approach.
  But all in due time.
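%p
  As a sketch of the pattern those primitives enable (plain Ruby with the
  concurrent-ruby gem, not RubyX primitives): read the old value, compute the new
  one, and only write it if nobody changed it in between; otherwise retry.
%pre
  :preserve
    require "concurrent"

    counter = Concurrent::AtomicFixnum.new(0)

    # the classic compare-and-swap retry loop
    def lock_free_increment(counter)
      loop do
        old = counter.value
        return if counter.compare_and_set(old, old + 1)
        # lost the race: another thread changed the value, try again
      end
    end

    threads = 10.times.map { Thread.new { 1000.times { lock_free_increment(counter) } } }
    threads.each(&:join)
    puts counter.value   # => 10000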
%p
  The actual concurrency I am envisioning is two OS threads per core: one for kernel
  interaction and one for normal operation. Kernel calls
  would never be executed on the latter, but always queued on the dedicated kernel
  threads. The non-kernel threads would be used to run fibers.
  If we insert some little check into the calling code, switching could happen very often,
  and because of the linked-list approach it would be very, very fast. And because of
  the offloading of kernel calls, it would never stall (completely). This way one can
  achieve the sort of millions of fibers Erlang is known for.
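%p
  A hypothetical sketch of that split, using plain Ruby threads rather than the
  envisioned RubyX machinery: blocking kernel work goes onto a queue served by a
  dedicated thread, so the worker itself never blocks.
%pre
  :preserve
    require "thread"

    kernel_queue = Queue.new

    # the dedicated "kernel thread": the only place a blocking call happens
    kernel_thread = Thread.new do
      while (job = kernel_queue.pop)
        job[:done].push(job[:call].call)
      end
    end

    # a worker queues the blocking write instead of doing it itself
    done = Queue.new
    kernel_queue.push(call: -> { $stdout.write("hello from the kernel thread\n") },
                      done: done)
    puts "worker keeps running, bytes written: " + done.pop.to_s

    kernel_queue.push(nil)   # shut the kernel thread down
    kernel_thread.join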
%h3 Housekeeping and garbage collection

%p
  Often, in systems that are designed to be garbage collected, the base object has some
  field to support this. This was deliberately left out. RubyX only has objects,
  so the field would have to be an Object, which is too much overhead.
  Or there would have to be dedicated instructions to deal with a raw data word,
  which is too much overhead in another way.
%p
  GC will be a completely external gem, so experimenting will be easy and
  encouraged. GC implementers will just have to use their own structures to keep
  track of the state they need. Judy-style digital trees can do this while actually
  using less memory than a field would, but handcrafted bitfields will also do fine.
%p
  The actual marking phase should be relatively easy, as the world is known completely.
  There are no grey stack areas where one has to guess, as all objects are typed
  and the type determines which slots hold objects. Not even the registers are a grey
  area, as we switch cooperatively; only the Message register is ever valid.
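%p
  A hypothetical plain-Ruby sketch of why that makes marking simple (not RubyX
  code): when every object carries a type that names its reference slots, marking
  is just a walk over a fully known graph, with no conservative guessing.
%pre
  :preserve
    # an object is its type plus its slots; the type says which slots are references
    TypedObject = Struct.new(:type, :slots) do
      def references
        type.fetch(:refs).map { |i| slots[i] }.compact
      end
    end

    def mark(object, marked = {})
      return marked if marked.key?(object.object_id)
      marked[object.object_id] = object
      object.references.each { |ref| mark(ref, marked) }
      marked
    end

    leaf = TypedObject.new({ refs: [] }, [42])
    root = TypedObject.new({ refs: [0, 1] }, [leaf, nil])
    puts mark(root).size   # => 2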
%p
  In fact, all this makes even moving objects relatively easy. Though there is of
  course the effort of going through the world to find all backlinks. But if that is
  done during a mark, it comes at relatively low cost.
%p
  All in all a very interesting topic, and surely someone will come up with some
  great idea. And of course there will have to be a most rudimentary version from
  the start, just enough to work and to give someone motivation to improve it.