move md to haml

This commit is contained in:
Torsten Ruger 2018-04-10 19:50:07 +03:00
parent 4b927c4f29
commit b61bc7c7ad
121 changed files with 3301 additions and 8572 deletions

View File

@ -0,0 +1,80 @@
%hr/
%p
layout: site
author: Torsten
%p
Part of what got me started on this project was the intuition that our programming model is in some way broken, and so,
by good old programmers' logic (you haven't understood it till you've programmed it), I started to walk into the fog.
%h3#fpgas FPGAs
%p
Don't ask me why they should be called Field Programmable Gate Arrays, but they have fascinated me for years,
because of course they offer the “ultimate” in programming: do away with fixed cpu instruction sets and get the program into silicon. Yeah!
%p
But several attempts at learning this black magic have left me only a little wiser.
Verilog and VHDL are the languages that make up 80-90% of what is used, and they are not object oriented,
or in any way user friendly. So it stayed on the long
list, until I bumped into
%a{:href => "http://pshdl.org/"} pshdl
by way of Karsten's
= succeed "." do
%a{:href => "https://www.youtube.com/watch?v=Er9luiBa32k"} excellent video on it
Pshdl aims to be simple and indeed looks it. Also the simulation is exact
and fast. Definitely the way to go, Karsten!
%p
But what struck me is something he said: that in hardware programming it's all about getting your design/program to fit into
the space you have, and making the timing of the gates work.
%p
And I realized that this is what is missing from our programming model: time and space. There is no time, as calls happen
sequentially, always immediately. And there is no space, as we have global memory with random access, made effectively unlimited by virtual
memory. But the world we live in is governed by time and space, and that governs the way our brain works.
%h3#active-objects-vs-threads Active Objects vs threads
%p
That is of course not so new, and the actor model was created to fix it. And while I haven't used it much,
I believe it does, especially for non-techie problems. And
%a{:href => "http://celluloid.io/"} Celluloid
seems to be a great
implementation of that idea.
%p
Of course Celluloid needs native threads, so you'll need to run Rubinius or JRuby. Understandably. And so we have
a fix for the problem, if we use Celluloid.
%p
But it is a fix; it is not part of the system. The system has sequential calls per thread, and threads. Threads are evil, as
I explain (rant about?)
= succeed "," do
%a{:href => "/rubyx/threads.html"} here
mainly because of the shared global memory.
%h3#messaging-with-inboxes Messaging with inboxes
%p
If you read the rant (it is a little older) you'll see that it establishes the problem (shared global memory) but does not propose a solution as such. The solution came from a combination of the rant,
the
%a{:href => "/2014/07/17/framing.html"} previous post
and the FPGA physical perspective.
%p
A physical view would be that we have a fixed number of object places on the chip (like a cache) and
as the previous post explains, sending is creating a message (yet another object) and transferring
control. Now in a physical view control is not in one place like in a cpu. Any gate can switch at
any cycle, so any object could be “active” at every cycle (without going into any detail about what that means).
%p
But it got me thinking about how that would be coordinated, because one object doing two things at once may lead
to trouble. But one of the Synthesis ideas was
%a{:href => "http://valerieaurora.org/synthesis/SynthesisOS/ch5.html"} lock free synchronisation
by use of a test-and-swap primitive.
%p
So if every object had an inbox, in a similar way that each object has a class now, we could create
the message and put it there. By default we would expect the inbox to be empty; we test that, and if
so, put our message there. Otherwise we queue it.
%p
From a sender's perspective the process is: create a new Message, fill it with data, put it into the
receiver's inbox. From a receiver's perspective it is: check your inbox; if empty, do nothing,
otherwise do what it says. “Do what it says” could easily include the ruby rules for finding methods,
ie check whether you yourself have a method by that name, send to super if not, etc.
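%p
To make that concrete, here is a minimal ruby sketch of both sides, assuming a hypothetical
Inbox object with an atomic compare_and_swap (the test-and-swap primitive). The names are
illustrative, not an existing api.
%pre
%code
:preserve
# sender side: try to place the message into an empty inbox atomically,
# otherwise queue it behind whatever is already there
def send_message(receiver, message)
  unless receiver.inbox.compare_and_swap(nil, message)
    receiver.inbox.queue(message)
  end
end

# receiver side: check the inbox, do nothing if empty,
# otherwise dispatch using the normal ruby method lookup rules
def poll_inbox(object)
  message = object.inbox.take
  object.dispatch(message) if message
end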
%p
In an FPGA setting this would be even nicer, as all lookups could be implemented in associative memory
and thus happen in one cycle. Though there would be some manager needed to manage which objects are
on the chip and which could be hoisted off. Nothing more complicated than a virtual memory manager though.
%p
The inbox idea represents a solution to the thread problem and has the added benefit of being easy to understand and
possibly even to implement. It should also make it safe to run several kernel threads, though I prefer the idea of
having only one or two kernel threads that do exclusively system calls, and doing the rest with green threads that use
home-grown scheduling.
%p
This approach also makes one-way messaging very natural, though one would have to invent a syntax for
it. And futures should come easily too.

View File

@ -1,72 +0,0 @@
---
layout: site
author: Torsten
---
Part of what got me started on this project was the intuition that our programming model is in some way broken and so by
good old programmers logic: you haven't understood it till you programmed it, I started to walk into the fog.
### FPGA's
Don't ask me why they should be called Field Programmable Gate Arrays, but they have fascinated me for years,
because off course they offer the "ultimate" in programming. Do away with fixed cpu instruction sets and get the program in silicon. Yeah!
But several attempts at learning the black magic have left me only little the wiser.
Verlilog or VHDL are the languages that make up 80-90% of what is used and they so not object oriented,
or in any way user friendly. So that has been on the long
list, until i bumped into [pshdl](http://pshdl.org/) by way of Karstens [excellent video on it](https://www.youtube.com/watch?v=Er9luiBa32k). Pshdl aim to be simple and indeed looks it. Also simulation is exact
and fast. Definitely the way to go Karsten!
But what struck me is something he said. That in hardware programming it's all about getting your design/programm to fit into
the space you have, and make the timing of the gates work.
And i realized that is what is missing from our programming model: time and space. There is no time, as calls happen
sequentially / always immediately. And there is no space as we have global memory with random access, unlimited by virtual
memory. But the world we live in is governed by time and space, and that governs the way our brain works.
### Active Objects vs threads
That is off course not soo new, and the actor model has been created to fix that. And while i haven't used it much,
i believe it does, especially in non techie problems. And [Celluloid](http://celluloid.io/) seems to be a great
implementation of that idea.
Off course Celluloid needs native threads, so you'll need to run rubinius or jruby. Understandibly. And so we have
a fix for the problem, if we use celluloid.
But it is a fix, it is not part of the system. The system has sequetial calls per thread and threads. Threads are evil as
i explain (rant about?) [here](/rubyx/threads.html), mainly because of the shared global memory.
### Messaging with inboxes
If you read the rant (it is a little older) you'll se that it established the problem (shared global memory) but does not propose a solution as such. The solution came from a combination of the rant,
the [previous post](/2014/07/17/framing.html) and the fpga physical perspective.
A physical view would be that we have a fixed number of object places on the chip (like a cache) and
as the previous post explains, sending is creating a message (yet another object) and transferring
control. Now in a physical view control is not in one place like in a cpu. Any gate can switch at
any cycle, so any object could be "active" at every cycle (without going into any detail about what that means).
But it got me thinking how that would be coordinated, because one object doing two things may lead
to trouble. But one of the Sythesis ideas was [lock free synchronisation](http://valerieaurora.org/synthesis/SynthesisOS/ch5.html)
by use of a test-and-swap primitive.
So if every object had an inbox, in a similar way that each object has a class now, we could create
the message and put it there. And by default we would expect it to be empty, and test that and if
so put our message there. Otherwise we queue it.
From a sender perspective the process is: create a new Message, fill it with data, put it to
receivers inbox. From a receivers perspective it's check you inbox, if empty do nothing,
otherwise do what it says. Do what it says could easily include the ruby rules for finding methods.
Ie check if your yourself have a method by that name, send to super if not etc.
In a fpga setting this would be even nicer, as all lookups could be implemented by associative memory
and thus happen in one cycle. Though there would be some manager needed to manage which objects are
on the chip and which could be hoisted off. Nothing more complicated than a virtual memory manager though.
The inbox idea represents a solution to the thread problem and has the added benefit of being easy to understand and
possibly even to implement. It should also make it safe to run several kernel threads, though i prefer the idea of
only having one or two kernel threads that do exclusively system calls and the rest with green threads that use
home grown scheduling.
This approach also makes one way messaging very natural though one would have to invent a syntax for
that. And futures should come easy too.

View File

@ -0,0 +1,93 @@
%hr/
%p
layout: site
author: Torsten
%p
As noted in previous posts, differentiating between compile- and run-time is one of the more
difficult things in doing the vm. That is because the computing that needs to happen is so similar;
in other words, almost all of the vm level is available at run-time too.
%p But of course we try to do as much as possible at compile-time.
%p
One hears or reads that exactly this is a topic causing other vms problems too,
specifically how one assures that what is compiled at compile-time and what is compiled at run-time are
identical, or at least compatible.
%h2#inlining Inlining
%p
The obvious answer seems to me to
= succeed "." do
%strong use the same code
In a way that “just” moves the question around a bit, because then one would have to know how
to do that. I'll go into that below, but find that the concept is worth exploring first.
%p
Let's take a simple example of accessing an instance variable. This is of course available at
run-time through the function
%em instance_variable_get
, which could go something like:
%pre
%code
:preserve
def instance_variable_get name
  index = @layout.index name   # the layout maps variable names to slot positions
  return nil unless index      # unknown variable name
  at_index(index)              # builtin: fetch the n'th slot of the object
end
%p
Let's assume the
%em builtin
at_index function, and take the layout to be an array-like structure.
As noted in previous posts, when this is compiled we get a Method with Blocks, and exactly one
Block will initiate the return. The previous post detailed how at that time the return value will
be in the ReturnSlot.
%p
So then we get to the idea of how: We “just” need to take the blocks from the method and paste
them where the instance variable is accessed. The following code will then pick up the value from the ReturnSlot
as it would any other value, and continue.
%p
The only glitch in this plan is that the code will assume a new message and frame. But if we just
paste it, it will use message/frame/self from the enclosing method. So that is where the work is:
translating slots from the inner, inlined function to the outer one. Possibly creating new frame
entries.
%h2#inlining-what Inlining what
%p
But let's take a step back from the mechanics and look at what it is we need to inline. The above
example seems to suggest we inline code. Code, as in text, is of course impossible to inline.
That's because we have no information about it, and so the argument passing and returning can't
possibly work. Quite apart from the tricky possibility of shadow variables, ie the inlined code
assigning to variables of the outside function.
%p
Ok, so then we just take our parsed code, the abstract syntax tree. There we have all the
information we need to do the magic, or at least it looks that way.
But we may not have the AST!
%p
The idea is to be able to make the step to a language-independent system. Hence the sof (salama
object file), even if it has no reader yet. The idea being that we store object files of any
language in sof, and the vm would read those.
%p
To do that we need to inline at the vm instruction level. Which in turn means that we will need
to retain enough information at that level to be able to do that. What that entails in detail
is unclear at the moment, but it gives a good direction.
%h2#a-rough-plan A rough plan
%p
To recap function calling at the instruction level: by the way, it should be clear that we cannot
inline method sends, as we don't know which function is being called. But of course the
actual send method may be inlined, and that is in fact part of the aim.
%p
To call a function, a NewMessage is created, loaded with args and stuff, then the FunctionCall is
issued. Upon entry, a new frame may be created for local and temporary variables, and at the
end the function returns. When it returns, the return value will be in the Return slot and the
calling method will grab it if interested and swap the Message back to what it was before the call.
%p
From that (and at that level) it becomes clearer what needs to be done, and it starts with the
caller, of course. In the caller there needs to be a way to make the decision whether to
inline or not. For the run-time stuff we need a list for “always inline”, later a complexity
analysis, later a run-time analysis. When the decision goes to inline, the message setup will
be skipped. Instead, a mapping needs to be created from the called function's argument names to
newly created (unique) local variables.
Then, going through the instructions, references to arguments must be exchanged with references
to the new variables. A similar process needs to replace references to local variables in the
called method with local variables in the calling method. Similarly, the return and self slots need
to be mapped.
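%p
Roughly, and only as a hedged sketch (the class and method names below are illustrative, not
salama's actual api), the renaming step could look like this:
%pre
%code
:preserve
# copy the callee's instructions into the caller, replacing argument and
# local slots with fresh locals, so no new message/frame is needed
def inline_into(caller_method, callee_method, argument_slots)
  mapping = {}
  callee_method.arguments.each_with_index do |arg, i|
    mapping[arg] = argument_slots[i]            # args map to the caller's slots
  end
  callee_method.locals.each do |local|
    mapping[local] = caller_method.new_local    # fresh, unique local in the caller
  end
  callee_method.instructions.each do |instruction|
    caller_method.add_instruction(instruction.remap(mapping))
  end
end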
%p
After the final instruction of the called method, the reassigned return must be moved to the real
return, and the calling function may resume. And while this may sound like a lot, one must remember
that the instruction set of the machine is quite small, and further refinement
(abstracting base classes for example) can be done to make the work easier.

View File

@ -1,90 +0,0 @@
---
layout: site
author: Torsten
---
As noted in previous posts, differentiating between compile- and run-time is one of the more
difficult things in doing the vm. That is because the computing that needs to happen is so similar,
in other words almost all of the vm - level is available at run-time too.
But off course we try to do as much as possible at compile-time.
One hears or reads that exactly this is a topic causing (also) other vms problems.
Specifically how one assures that what is compiled at compile-time and and run-time are
identical or at least compatible.
## Inlining
The obvious answer seems to me to **use the same code**.In a way that "just" moves the question
around a bit, becuase then one would have to know how to do that. I'll go into that below,
but find that the concept is worth exploring first.
Let's take a simple example of accessing an instance variable. This is off course available at
run-time through the function *instance_variable_get* , which could go something like:
def instance_variable_get name
index = @layout.index name
return nil unless index
at_index(index)
end
Let's assume the *builtin* at_index function and take the layout to be an array like structure.
As noted in previous posts, when this is compiled we get a Method with Blocks, and exactly one
Block will initiate the return. The previous post detailed how at that time the return value will
be in the ReturnSlot.
So then we get to the idea of how: We "just" need to take the blocks from the method and paste
them where the instance variable is accessed. Following code will pick the value from the ReturnSlot
as it would any other value and continue.
The only glitch in this plan is that the code will assume a new message and frame. But if we just
paste it it will use message/frame/self from the enclosing method. So that is where the work is:
translating slots from the inner, inlined function to the outer one. Possibly creating new frame
entries.
## Inlining what
But lets take a step back from the mechanics and look at what it is we need to inline. Above
example seems to suggest we inline code. Code, as in text, is off course impossible to inline.
That's because we have no information about it and so the argument passing and returning can't
possibly work. Quite apart from the tricky possibility of shadow variables, ie the inlined code
assigning to variables of the outside function.
Ok, so then we just take our parsed code, the abstract syntax tree. There we have all the
information we need to do the magic, at least it looks like that.
But, we may not have the ast!
The idea is to be able to make the step to a language independent system. Hence the sof (salama
object file), even it has no reader yet. The idea being that we store object files of any
language in sof and the vm would read those.
To do that we need to inline at the vm instruction level. Which in turn means that we will need
to retain enough information at that level to be able to do that. What that entails in detail
is unclear at the moment, but it gives a good direction.
## A rough plan
To recap the function calling at the instruction level. Btw it should be clear that we can
not inline method sends, as we don't know which function is being called. But off course the
actual send method may be inlined and that is in fact part of the aim.
To call a function, a NewMessage is created, loaded with args and stuff, then the FunctionCall is
issued. Upon entering a new frame may be created for local and temporary variables and at the
end the function returns. When it returns the return value will be in the Return slot and the
calling method will grab it if interested and swap the Message back to what it was before the call.
From that (and at that level) it becomes clearer what needs to be done, and it starts with the
the caller, off course. In the caller there needs to be a way to make the decision whether to
inline or not. For the run-time stuff we need a list for "always inline", later a complexity
analysis, later a run-time analysis. When the decision goes to inline, the message setup will
be skipped. Instead a mapping needs to be created from the called functions argument names to
the newly created (unique) local variables.
Then, going through the instructions, references to arguments must be exchanged with references
to the new variables. A similar process needs to replace reference to local variables in the
called method to local variables in the calling method. Similarly the return and self slots need
to be mapped.
After the final instruction of the called method, the reassigned return must be moved to the real
return and the calling function may commence. And while this may sound a lot, one must remember
that the instruction set of the machine is quite small, and further refinement
(abstracting base classes for example) can be done to make the work easier.

View File

@ -11,7 +11,8 @@ gem "haml-rails"
gem "susy" , "2.2.12"
gem 'high_voltage'
gem "kramdown"
gem "maruku"
group :development, :test do
# Call 'byebug' anywhere in the code to stop execution and get a debugger console
gem 'byebug', platforms: [:mri, :mingw, :x64_mingw]

View File

@ -84,6 +84,7 @@ GEM
ruby_parser (~> 3.5)
i18n (1.0.0)
concurrent-ruby (~> 1.0)
kramdown (1.16.2)
launchy (2.4.3)
addressable (~> 2.3)
listen (3.1.5)
@ -97,6 +98,7 @@ GEM
mini_mime (>= 0.1.1)
marcel (0.3.2)
mimemagic (~> 0.3.2)
maruku (0.7.3)
method_source (0.9.0)
mimemagic (0.3.2)
mini_mime (1.0.0)
@ -209,7 +211,9 @@ DEPENDENCIES
capybara-screenshot
haml-rails
high_voltage
kramdown
listen (>= 3.0.5, < 3.2)
maruku
puma (~> 3.11)
rails
rspec-rails

View File

@ -1,23 +0,0 @@
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>{% if page.title %}{{ page.title | escape }}{% else %}{{ site.title | escape }}{% endif %}</title>
<meta name="description" content="{% if page.excerpt %}{{ page.excerpt | strip_html | strip_newlines | truncate: 160 }}{% else %}{{ site.description }}{% endif %}">
{% assign user_url = site.url | append: site.baseurl %}
{% assign full_base_url = user_url | default: site.github.url %}
<link rel="stylesheet" href="{{ "/assets/css/style.css" | prepend: full_base_url }}">
<link rel="stylesheet" href="/assets/css/site.css">
<link rel="canonical" href="{{ page.url | replace:'index.html','' | prepend: site.baseurl | prepend: site.url }}">
<link rel="alternate" type="application/rss+xml" title="{{ site.title }}" href="{{ "/feed.xml" | prepend: site.baseurl | prepend: site.url }}">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<!-- Font Include -->
<link href='http://fonts.googleapis.com/css?family=Roboto:400,300,100,100italic,300italic,500,700' rel='stylesheet' type='text/css'>
{% seo %}
</head>

View File

@ -1,21 +0,0 @@
---
layout: site
---
<div class="row">
<div>
<h1 class="center">{{page.title}}</h2>
<p class="center"><span> {{page.sub-title}} </span></p>
<ul class="nav">
<li><a href="/arm/overview.html">Overview</a> </li>
<li><a href="/arm/qemu.html">Virtual Pi</a> </li>
<li><a href="/arm/remote_pi.html">Remote pi</a> </li>
<li><a href="/arm/target.html" target="sspec">Small Spec(html)</a> </li>
<li><a href="/arm/arm_inst.pdf" target="pspec">Small Spec(pdf)</a> </li>
<li><a href="/arm/big_spec.pdf" target="bspec">Huge spec</a> </li>
</ul>
</div>
<div>
{{content}}
</div>
</div>

View File

@ -1,24 +0,0 @@
---
layout: site
---
<div class="row">
<div>
<h1 class="center">{{page.title}}</h2>
<p class="center"><span> Written by {{page.author}} on {{page.date | date_to_string}}. </span></p>
</div>
<div>
{{content}}
</div>
</div>
<div class="row">
<h2 class="center">Older</h2>
<div>
<ul class="nav">
{% for post in site.posts %}
<li><a href="{{ post.url }}">{{ post.title }} <small>{{ post.date | date: "%d.%m.%y" }} </small></a>
</li>
{% endfor %}
</ul>
</div>
</div>

View File

@ -1,19 +0,0 @@
---
layout: site
---
<div class="row">
<div>
<h1 class="center">{{page.title}}</h2>
<p class="center"><span> {{page.sub-title}} </span></p>
<ul class="nav">
<li><a href="/project/motivation.html">Motivation</a> </li>
<li><a href="/project/ideas.html">Ideas</a> </li>
<li><a href="/project/history.html">History</a> </li>
<li><a href="/project/contribute.html">Contribute</a> </li>
</ul>
</div>
<div>
{{content}}
</div>
</div>

View File

@ -1,19 +0,0 @@
---
layout: site
---
<div class="row">
<div>
<h1 class="center">{{page.title}}</h2>
<p class="center"><span> {{page.sub-title}} </span></p>
<ul class="nav">
<li><a href="/rubyx/layers.html">Layers of RubyX</a> </li>
<li><a href="/rubyx/memory.html">Memory</a> </li>
<li><a href="/rubyx/threads.html">Threads</a> </li>
<li><a href="/rubyx/optimisations.html">Optimisation ideas</a> </li>
</ul>
</div>
<div>
{{content}}
</div>
</div>

View File

@ -1,50 +0,0 @@
<!DOCTYPE html>
<!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
{% include head.html %}
<body>
<header>
<div class="container">
<a href="https://github.com/ruby-x">
<img style="position: absolute; top: 0; left: 0; border: 0;" src="https://camo.githubusercontent.com/8b6b8ccc6da3aa5722903da7b58eb5ab1081adee/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f6769746875622f726962626f6e732f666f726b6d655f6c6566745f6f72616e67655f6666373630302e706e67" alt="Fork me on GitHub" data-canonical-src="https://s3.amazonaws.com/github/ribbons/forkme_left_orange_ff7600.png">
</a>
<ul class="nav">
<li> <a href="/">Home</a> </li>
<li> <a href="/rubyx/layers.html">Architecture</a> </li>
<li> <a href="/typed/typed.html">Typed layer</a></li>
<li> <a href="/arm/overview.html">Arm Resources</a> </li>
<li><a href="/project/motivation.html">About</a> </li>
<li> <a href="{{site.posts.first.url}}">News</a> </li>
</ul>
<a href="https://github.com/ruby-x">
<img style="position: absolute; top: 15px; right: 15px; border: 0; width: 70px"
src="/assets/images/x-small.png" alt="Logo" >
</a>
</div>
</header>
<div class="container">
{{ content }}
</div>
<footer>
<div class="container">
<div class="row center">
<p>&copy; Copyright Torsten Ruger 2013-7</p>
</div>
</div>
</footer>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
var the_id = 'UA-61481839-1';
ga('create', the_id.replace("-1" , "-2") , 'auto');
ga('send', 'pageview');
</script>
</body>
</html>

View File

@ -1,20 +0,0 @@
---
layout: site
---
<div class="row">
<div>
<h1 class="center">{{page.title}}</h2>
<p class="center"><span> {{page.sub-title}} </span></p>
<ul class="nav">
<li><a href="/typed/typed.html">Typed</a> </li>
<li><a href="/typed/parfait.html">Parfait</a> </li>
<li><a href="/typed/benchmarks.html">Performance</a> </li>
<li><a href="/typed/debugger.html">Debugger</a> </li>
<li><a href="/typed/syntax.html">Syntax (obsolete)</a> </li>
</ul>
</div>
<div>
{{content}}
</div>
</div>

View File

@ -0,0 +1,30 @@
%p
Well, it has been a good holiday: two months in Indonesia, Bali, and diving Komodo. It brought
clarity, and so I have to start a daunting task.
%p
When I learned programming at university, they were still teaching Pascal. So when I got to choose
C++ for my first bigger project, that was a real step up. But even as I wrestled with templates, it was
Smalltalk that took my heart immediately when I read about it. And I read quite a bit, including the Blue Book about its implementation.
%p
The next distinct step up was Java, in 1996, and then ruby in 2001. Until I mostly stopped coding
in 2004, when I moved to the countryside and started our
= succeed "." do
%a{:href => "http://villataika.fi/en/index.html"} B&amp;B
But then we needed web pages, and before long a pos for our shop, so I was back on the keyboard.
And since it was a thing I had been wanting to do, I wrote a database.
%p
Purple was my idea at the time of an ideal data-store: save by reachability, automatic loading by
traversal, and schema-free saving of any ruby object. In memory, based on Judy, it did about 2000
transactions per second. Alas, it didn't have any searching.
%p
So I bit the bullet and implemented an SQL interface to it. After a failed attempt with rails 2,
and after 2 major rewrites, I managed to integrate what by then was called warp into Arel (rails 3).
But while raw throughput was still about the same, when it had to go through Arel it crawled to 50
transactions per second, about the same as sqlite.
%p
This was maybe 2011, and there was no doubt anymore. Not the database, but ruby itself was the
speed hog. I aborted.
%p
In 2013 I bought a Raspberry Pi, and of course I wanted to use it with ruby. Alas… slow pi + slow ruby = nicht gut.
I gave up.
%p So then the clarity came with the solution: build your own ruby. I started designing a bit on the beach already.
%p Still, daunting. But maybe just possible….

View File

@ -1,30 +0,0 @@
Well, it has been a good holiday, two months in Indonesia, Bali and diving Komodo. It brought
clarity, and so i have to start a daunting task.
When i learned programming at University, they were still teaching Pascal. So when I got to choose
c++ in my first bigger project that was a real step up. But even i wrestled templates, it was
Smalltalk that took my heart immediately when i read about it. And I read quite a bit, including the Blue Book about the implementation of it.
The next distinct step up was Java, in 1996, and then ruby in 2001. Until i mostly stopped coding
in 2004 when i moved to the country side and started our [B&amp;B](http://villataika.fi/en/index.html)
But then we needed web-pages, and before long a pos for our shop, so i was back on the keyboard.
And since it was a thing i had been wanting to do, I wrote a database.
Purple was my current idea of an ideal data-store. Save by reachability, automatic loading by
traversal and schema-free any ruby object saving. In memory, based on Judy, it did about 2000
transaction per second. Alas, it didn't have any searching.
So i bit the bullet and implemented an sql interface to it. After a failed attempt with rails 2
and after 2 major rewrites i managed to integrate what by then was called warp into Arel (rails3).
But while raw throughput was still about the same, when it had to go through Arel it crawled to 50
transactions per second, about the same as sqlite.
This was maybe 2011, and there was no doubt anymore. Not the database, but ruby itself was the
speed hog. I aborted.
In 2013 I bought a Raspberry Pi and off course I wanted to use it with ruby. Alas... Slow pi + slow ruby = nischt gut.
I gave up.
So then the clarity came with the solution, build your own ruby. I started designing a bit on the beach already.
Still, daunting. But maybe just possible....

View File

@ -0,0 +1,29 @@
%h2#the-c-machine The c machine
%p Software engineers have clean brains, scrubbed into full c alignment over decades. A few rebels (Klingons?) remain on embedded systems, but of those most strive towards posix compliance too.
%p In other words, since all programming ultimately boils down to c, libc makes the bridge to the kernel/machine. All… all but a small village in the northern (cold) parts of Europe (Antskog), where…
%p So I had a look at what we are talking about.
%h2#the-issue The issue
%p
Many, especially embedded developers, have noticed that the standard c library has become quite heavy
(2 megabytes). Since it provides a defined api (posix) and a lot of functionality on a plethora of systems (oss) and cpus, and even for different ABIs (application binary interfaces) and compilers/linkers, it is no wonder.
%p uClibc and dietlibc get the size down, dietlibc especially (130k). So that's ok then. Or is it?
%p Then I noticed that the real issue is not the size. Even my pi has 512 MB, and of course even libc gets paged.
%p The real issue is the step into the C world: extern functions, call marshalling, and the question is what for.
%p After all, the c library was created to make it easier for c programs to use the kernel. And I have no intention of coding any more c.
%h2#ruby-corestd-lib ruby core/std-lib
%p Of course the ruby core and std libs were designed to do for ruby what libc does for c. Unfortunately they are badly designed and suffer from the brainwash above (designed around c calls).
%p
Since salama is pure ruby, there is a fair amount of functionality that would be nicer to provide straight in ruby. As gems of course, for everybody to see and fix.
For example, even if there were to be a printf (which I dislike), it would be easy to code in ruby.
%p What is needed is the underlying write to stdout.
%h2#solution Solution
%p To get salama up and running, ie to have a “ruby” executable, there are really very few kernel calls needed: file open, read, write to stdout, and brk.
%p So the way this will go is to write syscalls where needed.
%p Having tried to reverse engineer uc, diet and musl, it seems best to go straight to the source.
%p
Most of that is of course for intel, but eax goes to r7 and after that the args are from r0 up, so not too bad. The definitive guide for arm is here:
%a{:href => "http://sourceforge.net/p/strace/code/ci/master/tree/linux/arm/syscallent.h"} http://sourceforge.net/p/strace/code/ci/master/tree/linux/arm/syscallent.h
But it doesn't include arguments (only the number of them), so
%a{:href => "http://syscalls.kernelgrok.com/"} http://syscalls.kernelgrok.com/
can be used.
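%p
As a sketch of what that means in practice: a write to stdout on arm linux only needs the syscall
number in r7 and the arguments in r0 upwards. The assembler-style ruby below is purely illustrative
(the mov/swi helper names are made up, not salama's actual dsl).
%pre
%code
:preserve
# write(fd=1, buffer, length) as a raw arm linux syscall
def write_stdout(string_address, length)
  mov r0, 1               # first argument: fd 1 = stdout
  mov r1, string_address  # second argument: pointer to the bytes
  mov r2, length          # third argument: how many bytes
  mov r7, 4               # syscall number 4 = write on arm linux (eabi)
  swi 0                   # software interrupt: trap into the kernel
end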
%p So there, getting more metal by the minute. But the time from writing this to a hello world was 4 hours.

View File

@ -1,46 +0,0 @@
The c machine
-------------
Software engineers have clean brains, scrubbed into full c alignment through decades. A few rebels (klingons?) remain on embedded systems, but of those most strive towards posix compliancy too.
In other words, since all programming ultimately boils down to c, libc makes the bridge to the kernel/machine. All .... all but a small village in the northern (cold) parts of europe (Antskog) where ...
So i had a look what we are talking about.
The issue
----------
Many, especially embedded guys, have noticed that your standard c library has become quite heavy
(2 Megs). Since it provides a defined api (posix) and large functionality on a plethora of systems (os's) and cpu's. Even for different ABI's (application binary interfaces) and compilers/linkers it is no wonder.
ucLibc or dietLibc get the size down, especially diet quite a bit (130k). So that's ok then. Or is it?
Then i noticed that the real issue is not the size. Even my pi has 512 Mb, and of course even libc gets paged.
The real issue is the step into the C world. So, extern functions, call marshelling, and the question is for what.
Afer all the c library was created to make it easier for c programs to use the kernel. And i have no intention of coding any more c.
ruby core/std-lib
------------
Off course the ruby-core and std libs were designed to do for ruby what libc does for c. Unfortunately they are badly designed and suffer from above brainwash (designed around c calls)
Since salama is pure ruby there is a fair amount of functionality that would be nicer to provide straight in ruby. As gems off course, for everybody to see and fix.
For example, even if there were to be a printf (which i dislike) , it would be easy to code in ruby.
What is needed is the underlying write to stdout.
Solution
--------
To get salama up and running, ie to have a "ruby" executable, there are really very few kernel calls needed. File open, read and stdout write, brk.
So the way this will go is to write syscalls where needed.
Having tried to reverse engineer uc, diet and musl, it seems best to go straight to the source.
Most of that is off course for intel, but eax goes to r7 and after that the args are from r0 up, so not too bad. The definite guide for arm is here [http://sourceforge.net/p/strace/code/ci/master/tree/linux/arm/syscallent.h](http://sourceforge.net/p/strace/code/ci/master/tree/linux/arm/syscallent.h)
But doesn't include arguments (only number of them), so [http://syscalls.kernelgrok.com/](http://syscalls.kernelgrok.com/) can be used.
So there, getting more metal by the minute. But the time from writing this to a hello world was 4 hours.

View File

@ -0,0 +1,12 @@
%p Part of the reason why I even thought this was possible was that I had bumped into Metasm.
%p
Metasm creates native code in 100% ruby, either from assembler or even C (partially), and for many cpus too.
It also creates many binary formats, ELF among them.
%p
Still, I wanted something small that I could understand easily, as it was clear it would have to be changed to fit.
As there was no external assembler file format planned, the whole approach of starting from parsing was inappropriate.
%p
I luckily found a small library, as, that did arm only and was just a few files. After removing unneeded parts
like parsing, and some reformatting, I added an assembler-like dsl.
%p This layer (arm subdirectory) said hello after about 2 weeks of work.
%p I also got qemu to work and can thus develop without the actual pi.

View File

@ -1,14 +0,0 @@
Part of the reason why i even thought this was possible was because i had bumped into Metasm.
Metasm creates native code in 100% ruby. Either from Assembler or even C (partially). And for many cpu's too.
It also creates many binary formats, elf among them.
Still, i wanted something small that i could understand easily as it was clear it would have to be changed to fit.
As there was no external assembler file format planned, the whole approach from parsing was inappropriate.
I luckily found a small library, as, that did arm only and was just a few files. After removing not needed parts
like parsing and some reformatting i added an assembler like dsl.
This layer (arm subdirectory) said hello after about 2 weeks of work.
I also got qemu to work and can thus develop without the actual pi.

View File

@ -0,0 +1,24 @@
%p Both “ends”, parsing and machine code, were relatively clear cut. Now it is into unknown territory.
%p I had ported the Kaleidoscope llvm tutorial language to ruby-llvm last year, so there were some ideas floating around.
%p
The idea of basic blocks, as the smallest units of code without branches, was pretty clear. Using those as jump
targets was also straightforward. But how to get from the AST to arm Instructions was not, and took some trying out.
%p
In the end, or rather now, it is the AST layer that “compiles” itself into the Vm layer. The Vm layer then assembles
itself into Instructions.
%p
General instructions are part of the Vm layer, but the code picks up derived classes and thus makes machine-dependent
code possible. So far, so good.
%p
Register allocation was (and is) another story. Argument passing and local variables do work now, but there is definitely
room for improvement there.
%p
To get anything out of a running program I had to implement putstring (easy) and putint (difficult). Surprisingly,
division is not easy, and when pinned to 10 (divide by 10) quite strange. Still, it works. While I was at it writing
assembler, I found a fibonacci in 10 or so instructions.
%p
To summarise: function definition and calling (including recursion) work.
If and while structures work, as do some operators, and now it's easy to add more.
%p
So we have a Fibonacci in ruby, using a while implementation, that can be executed by salama and outputs the
correct result. After a total of 7 weeks this is much more than expected!
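%p
For reference, a while-based Fibonacci of roughly the shape salama could handle at this point
(an illustrative reconstruction, not the actual test program):
%pre
%code
:preserve
def fibonacci(n)
  a = 0
  b = 1
  i = 0
  while i < n
    t = a + b    # the next number is the sum of the previous two
    a = b
    b = t
    i = i + 1
  end
  a              # fibonacci(7) => 13
end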

View File

@ -1,25 +0,0 @@
Both "ends", parsing and machine code, were relatively clear cut. Now it is into unknown territory.
I had ported the Kaleidoscope llvm tutorial language to ruby-llvm last year, so thee were some ideas floating.
The idea of basic blocks, as the smallest unit of code without branches was pretty clear. Using those as jump
targets was also straight forward. But how to get from the AST to arm Intructions was not, and took some trying out.
In the end, or rather now, it is the AST layer that "compiles" itself into the Vm layer. The Vm layer then assembles
itself into Instructions.
General instructions are part of the Vm layer, but the code picks up derived classes and thus makes machine
dependent code possible. So far so ok.
Register allocation was (and is) another story. Argument passing and local variables do work now, but there is definitely
room for improvement there.
To get anything out of a running program i had to implement putstring (easy) and putint (difficult). Surprisingly
division is not easy and when pinned to 10 (divide by 10) quite strange. Still it works. While i was at writing
assembler i found a fibonachi in 10 or so instructions.
To summarise, function definition and calling (including recursion) works.
If and and while structures work and also some operators and now it's easy to add more.
So we have a Fibonacchi in ruby using a while implementation that can be executed by salama and outputs the
correct result. After a total of 7 weeks this is much more than expected!

View File

@ -0,0 +1,29 @@
%p Parsing is difficult, the theory incomprehensible, and the older tools cryptic. At least for me.
%p
And then I heard that recursive descent is easy and used even by llvm. Formalised as peg parsing, libraries exist, and in ruby
they have dsls and are suddenly quite understandable.
%p
Of the candidates, I first had very positive experiences with Treetop. Upon continuing, I found the code
generation aspect not just clumsy (after all, you can define methods in ruby), but also to interfere unnecessarily
with code control. On top of that, conversion into an AST was not easy.
%p After looking around I found Parslet, which pretty much removes all those issues (see the small sketch after this list). Namely:
%ul
%li It does not generate code, it generates methods. And has a nice dsl.
%li
It transforms to basic ruby types and has the notion of a transformation,
so it gives an easy and clean way to create an AST.
%li One can use ruby modules to partition a larger parser
%li Minimal dependencies (one file).
%li Active use and development.
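%p
The sketch below is written from memory to show the flavour of Parslet's dsl and transform;
exact details may differ between Parslet versions.
%pre
%code
:preserve
require 'parslet'

# a tiny parser: one integer, optionally surrounded by whitespace
class IntegerParser < Parslet::Parser
  rule(:space)   { match('\s').repeat(1) }
  rule(:space?)  { space.maybe }
  rule(:integer) { space? >> match('[0-9]').repeat(1).as(:int) >> space? }
  root(:integer)
end

# the transform turns the matched slice into a plain ruby integer
class IntegerTransform < Parslet::Transform
  rule(int: simple(:i)) { i.to_i }
end

tree = IntegerParser.new.parse(" 42 ")   # roughly {:int=>"42"}
IntegerTransform.new.apply(tree)         # => 42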
%p
So I was sold, and I got up to speed quite quickly. But I also found out how fiddly such a parser is with regard
to ordering and whitespace.
%p
I spent some time making quite a solid test framework, testing the different rules separately and also the
stages separately, so things would not break accidentally while growing.
%p
After about another 2 weeks I was able to parse functions, both calls and definitions, ifs and whiles, and of course the basic
types of integers and strings.
%p
With the great operator support it was a breeze to create all 15-ish binary operators. Even Array and Hash constant
definitions were very quick. All in all surprisingly painless, thanks to Kasper!

View File

@ -1,29 +0,0 @@
Parsing is a difficult, the theory incomprehensible and older tools cryptic. At least for me.
And then i heard recursive is easy and used by even llvm. Formalised as peg parsing libraries exists, and in ruby
they have dsl's and are suddenly quite understandable.
Off the candidates i had first very positive experiences with treetop. Upon continuing i found the code
generation aspect not just clumsy (after all you can define methods in ruby), but also to interfere unneccessarily
with code control. On top of that conversion into an AST was not easy.
After looking around i found Parslet, which pretty much removes all those issues. Namely
- It does not generate code, it generates methods. And has a nice dsl.
- It transforms to ruby basic types and has the notion on a transformation.
So an easy and clean way to create an AST
- One can use ruby modules to partition a larger parser
- Minimal dependencies (one file).
- Active use and development.
So i was sold, and i got up to speed quite quickly. But i also found out how fiddly such a parser is in regards
to ordering and whitespace.
I spent some time to make quite a solid test framework, testing the different rules separately and also the
stages separately, so things would not break accidentally when growing.
After about another 2 weeks i was able to parse functions, both calls and definitions, ifs and whiles and off course basic
types of integers and strings.
With the great operator support it was a breeze to create all 15 ish binary operators. Even Array and Hash constant
definition was very quick. All in all surprisingly painless, thanks to Kasper!

View File

@ -0,0 +1,44 @@
%p It's such a nice name, crystal. My first association is clarity, and that is exactly what I am trying to achieve.
%p But I've been struggling a bit to achieve any clarity on the topic of the system boundary: where does OO stop? I mean, I can't very well define method lookup in ruby syntax, as that itself involves method lookups. But tail recursion is so boring, it just never stops!
%h4#kernel Kernel
%p In the design phase (yes, there was one!) I had planned to use lambdas. A little naive maybe, as they are of course objects, and thus calling them means a method resolution.
%p So I'm settling for module methods. I say settling because that of course always makes the module object available, though I don't see any use for it. A waste of space (one register) and time (loading it), but no better ideas are forthcoming.
%p The place for these methods, and I'll go into which ones a little in a second, is the Kernel. And finally the name makes sense too: that is its original (pre 1.9) place, as a module that Object includes, ie “below” even Object.
%p So Kernel is the place for methods that are needed to build the system, and may not be called on objects. Simple.
%p In other words, anything that can be coded on normal objects, should. But when that stops being possible, Kernel is the place.
%p And what are these functions? get_instance_variable, and set too. Same for functions. Strangely, these may in turn rely on functions that can be coded in ruby, but at the heart of the matter is an indexed operation, ie object[2].
%p This functionality, ie getting the nth data slot of an object, is essential, but c makes such a good point of it having no place in a public api. So it needs to be implemented in a “private” part and used in a safe manner. More on the layers emerging below.
%p The Kernel is a module in salama that defines functions which return function objects. So the code is generated, instead of parsed. An essential distinction.
%h4#system System
%p
It's an important side note on that Kernel definition above that it is
%em not
the same as the system access functions. Those are in their own module and may (or must) use the Kernel to implement their functionality. But they are not the same.
%p Kernel is the VM's “core”, if you want.
%p System is the access to the operating system functionality.
%h4#layers Layers
%p So from that Kernel idea, 3 layers have now emerged: 3 ways in which code is created.
%h5#machine Machine
%p The lowest layer is the Machine layer. This layer generates Instructions, or sequences thereof. So of course there is an Instruction class with derived classes, but also Block, the smallest linear sequence of Instructions.
%p Also there is an abstract RegisterMachine that is mostly a mediator to the current implementation (ArmMachine). The machine has functions that create Instructions.
%p A few machine functions return Blocks, or append their instructions to blocks. This is really more of a macro layer. Usually they are small, but div10, for example, is a real 10-instruction beauty.
%h5#kernel-1 Kernel
%p The Kernel functions return function objects. Kernel functions have the same name as the function they implement, so Kernel::putstring defines a function called putstring. Function objects (Vm::Function) carry entry/exit/body code, receiver/return/argument types and a little more.
%p The important thing is that these functions are callable from ruby code. Thus they form the glue from the next layer up, which is coded in ruby, to the machine layer. In a way the Kernel “exports” the machine functionality to salama.
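%p
As a hedged sketch of that glue (only Kernel, putstring and Vm::Function are names from this
post; every other detail is invented for illustration), a kernel function might look like:
%pre
%code
:preserve
module Kernel
  # returns a Vm::Function object rather than executing anything:
  # the code is generated, not parsed
  def self.putstring
    function = Vm::Function.new(:putstring)
    # the machine layer supplies the actual instructions, eg a write to stdout;
    # "machine" here stands in for whatever object exposes that layer
    function.body = machine.write_stdout
    function
  end
end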
%h5#parfait Parfait
%p Parfait is a thin layer implementing a mini-minimal OO system. Sure, all the usual suspects like strings and integers are there, but they only implement what is really, really necessary. For example, strings mainly have new, equals and put.
%p Parfait is heavy on Object/Class/Metaclass functionality, object instance and method lookup. All things needed to make an OO system OO. Not so much “real” functionality here, more creating the ability for that.
%p Stdlib would be the next layer up, implementing the whole of ruby functionality in terms of what Parfait provides.
%p The important thing here is that Parfait is written completely in ruby. Meaning it gets parsed by salama like any other code, and then transformed into executable form and written.
%p Any executable that salama generates will have Parfait in it. But only the final version of salama as a ruby vm, will have the whole stdlib and parser along.
%h4#salama Salama
%p
Salama uses the Kernel and Machine layers directly when creating code. Of course.
The closest equivalent to salama would be a compiler, and so it is its job to create code (machine layer objects).
%p But it is my intention to keep that as small as possible. And the good news is it's all ruby :-)
%h5#extensions Extensions
%p I just want to mention the idea of extensions, which is a logical step for a minimal system. Of course they would be gems, but the interesting thing is that they (like salama) could:
%ul
%li use salama's existing kernel/machine abstraction to define new functionality that is not possible in ruby
%li define new machine functionality, adding kernel type apis, to create wholly new, possibly hardware specific functionality
%p I am thinking graphics acceleration, GPU usage, vector apis, that kind of thing. In fact I aim to implement the whole floating point functionality as an extension (as it is clearly not essential for OO).

View File

@ -1,75 +0,0 @@
It's such a nice name, crystal. My first association is clarity, and that is exactly what i am trying to achieve.
But i've been struggling a bit to achieve any clarity on the topic of system boundary: where does OO stop. I mean i can't very well define method lookup in ruby syntax, as that involves method lookups. But tail recursion is so booring, it just never stops!
#### Kernel
In the design phase (yes there was one!), i had planned to use lambdas. A little naive maybe, as they are off course objects. Thus calling them means a method resolution.
So i'm settling for Module methods. I say settling because that off course always makes the module object available, though i don't see any use for it. A waste in space (one register) and time (loading it), but no better ideas are forthcoming.
The place for these methods, and i'll go into it a little which in a second, is the Kernel. And finally the name makes sense too. That is it's original (pre 1.9) place, as a module that Object includes, ie "below" even Object.
So Kernel is the place for methods that are needed to build the system, and may not be called on objects. Simple.
In other words, anything that can be coded on normal objects, should. But when that stops being possible, Kernel is the place.
And what are these functions? get_instance_variable or set too. Same for functions. Strangley these may in turn rely on functions that can be coded in ruby, but at the heart of the matter is an indexed operation ie object[2].
This functionality, ie getting the n'th data in an object, is essential, but c makes such a good point of of it having no place in a public api. So it needs to be implemented in a "private" part and used in a save manner. More on the layers emerging below.
The Kernel is a module in salama that defines functions which return function objects. So the code is generated, instead of parsed. An essential distinction.
#### System
It's an important side note on that Kernel definition above, that it is _not_ the same as system access function. These are in their own Module and may (or must) use the kernel to implement their functionality. But not the same.
Kernel is the VM's "core" if you want.
System is the access to the operating system functionality.
#### Layers
So from that Kernel idea have now emerged 3 Layers, 3 ways in which code is created.
##### Machine
The lowest layer is the Machine layer. This Layer generates Instructions or sequences thereof. So off course there is an Instruction class with derived classes, but also Block, the smallest, linear, sequences of Instructions.
Also there is an abstract RegisterMachine that is mostly a mediator to the current implementation (ArmMachine). The machine has functions that create Instructions
Some few machine functions return Blocks, or append their instructions to blocks. This is really more a macro layer. Usually they are small, but div10 for example is a real 10 instruction beauty.
##### Kernel
The Kernel functions return function objects. Kernel functions have the same name as the function they implement, so Kernel::putstring defines a function called putstring. Function objects (Vm::Function) carry entry/exit/body code, receiver/return/argument types and a little more.
The important thing is that these functions are callable from ruby code. Thus they form the glue from the next layer up, which is coded in ruby, to the machine layer. In a way the Kernel "exports" the machine functionality to salama.
##### Parfait
Parfait is a thin layer implementing a mini-minimal OO system. Sure, all your usual suspects of string and integers are there, but they only implement what is really really necessary. For example strings mainly have new equals and put.
Parfait is heavy on Object/Class/Metaclass functionality, object instance and method lookup. All things needed to make an OO system OO. Not so much "real" functionality here, more creating the ability for that.
Stdlib would be the next layer up, implementing the whole of ruby functionality in terms of what Parfait provides.
The important thing here is that Parfait is written completely in ruby. Meaning it get's parsed by salama like any other code, and then transformed into executable form and written.
Any executable that salama generates will have Parfait in it. But only the final version of salama as a ruby vm, will have the whole stdlib and parser along.
#### Salama
Salama uses the Kernel and Machine layers straight when creating code. Off course.
The closest equivalent to salama would be a compiler and so it is it's job to create code (machine layer objects).
But it is my intention to keep that as small as possible. And the good news is it's all ruby :-)
##### Extensions
I just want to mention the idea of extensions that is a logical step for a minimal system. Off course they would be gems, but the interesting thing is they (like salama) could:
- use salamas existing kernel/machine abstraction to define new functionality that is not possible in ruby
- define new machine functionality, adding kernel type api's, to create wholly new, possibly hardware specific functionality
I am thinking graphic acceleration, GPU usage, vector api's, that kind of thing. In fact i aim to implement the whole floating point functionality as an extensions (as it clearly not essential for OO).

View File

@ -0,0 +1,56 @@
%p
I was just reading my ruby book, wondering about functions and blocks and the like, as one does when implementing
a vm. Actually the topic I was struggling with was receivers, the pesky self, when I got the exception.
%p And while they say two steps forward, one step back, this goes the other way around.
%h3#one-step-back One step back
%p
As I have just learnt assembler, this is the first time I am really considering how functions are implemented, and how the stack is
used in that. Sure, I had heard about it, but the details were vague.
%p
Of course a function must know where to return to. I mean the memory address, as this can't very
well be fixed at compile time. In effect it must be passed to the function. But as programmers we
don't want to have to do that all the time, and so it is passed implicitly.
%h5#the-missing-link The missing link
%p
The arm architecture makes this nicely explicit. There, a call is actually called branch with link.
This rubbed me the wrong way for a while, as it struck me as an exceedingly bad name. Until I “got it”,
that is. The link is the link back; well, that was simple. But the thing is that the “link” is
put into the link register.
%p
This never struck me as meaningful until now. Of course it means that “leaf” functions do not
need to touch it. Leaf functions are functions that do not call other functions, though they may
do syscalls, as the kernel restores all registers. On other cpus the return address is pushed onto
the stack, but on arm you have to do that yourself. Or not, and save the instruction, if you're so inclined.
%h5#the-hidden-argument The hidden argument
%p
But the point here is that this makes it very explicit. The return address is in effect just
another argument. It usually gets passed automatically by compiler-generated code, but nevertheless,
it is an argument.
%p
The “step back” is to make this argument explicit in the vm code. Thus making its handling,
ie passing or saving, explicit too. And thus having less magic going on, because you can't
understand magic (you gotta believe it).
%h3#two-steps-forward Two steps forward
%p And so the thrust becomes clear, I hope. We are talking about exceptions, after all.
%p
Because to those who have not read the windows calling convention on exception handling, or even
heard of the dwarf specification thereof, I say: don't. It melts the brain.
You have to be so good at playing computer in your head, it's not healthy.
%p
Instead, we make things simple and explicit. An exception is after all just a different way for
a function to return. So we need an address for it to return to.
%p
And as we have just made the normal return address an explicit argument, we just make the
exception return address an argument too. And presto.
%p
Even just the briefest of considerations of how we generate those exception return addresses
(landing pads? what a strange name), leads to the conclusion that if a function does not do
any exception handling, it just passes on the same address that it got itself. Thus a
generated exception would jump clear over such a function.
%p
Since we have now got exceptions to be normal code (albeit with an exceptional name :-)), control
flow to and from it becomes quite normal too.
%p
To summarize, each function now has a minimum of three arguments: the self, the return address and
the exception address.
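%p
A minimal ruby sketch of the idea (purely illustrative, the names are made up and not salama's actual classes):
the caller passes both addresses, and raising is just returning to the other one.
%pre
  :preserve
    # hypothetical sketch: both return targets are explicit arguments
    Frame = Struct.new(:receiver, :return_address, :exception_address, :args)

    def call_method(frame)
      result = frame.args.sum            # the method body proper
      [frame.return_address, result]     # normal return: continue at the return address
    rescue StandardError => error
      [frame.exception_address, error]   # exceptional return: continue at the other address
    end

    # a method that does no exception handling simply passes its own
    # exception_address on, so a raise jumps clear over it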
%p We have indeed taken a step forward.

View File

@ -1,62 +0,0 @@
I was just reading my ruby book, wondering about functions and blocks and the like, as one does when implementing
a vm. Actually the topic i was struggling with was receivers, the pesty self, when i got the exception.
And while they say two steps forward, one step back, this goes the other way around.
### One step back
As I just learnt assembler, it is the first time i am really considering how functions are implemented, and how the stack is
used in that. Sure i heard about it, but the details were vague.
Off course a function must know where to return to. I mean the memory-address, as this can't very
well be fixed at compile time. In effect this must be passed to the function. But as programmers we
don't want to have to do that all the time and so it is passed implicitly.
##### The missing link
The arm architecture makes this nicely explicit. There, a call is actually called branch with link.
This almost rubbed me for a while as it struck me as an exceedingly bad name. Until i "got it",
that is. The link is the link back, well that was simple. But the thing is that the "link" is
put into the link register.
This never struck me as meaningful, until now. Off course it means that "leaf" functions do not
need to touch it. Leaf functions are functions that do not call other functions, though they may
do syscalls as the kernel restores all registers. In other cpu's the return address is pushed on
the stack, but in arm you have to do that yourself. Or not and save the instruction if you're so inclined.
##### The hidden argument
But the point here is, that this makes it very explicit. The return address is in effect just
another argument. It usually gets passed automatically by compiler generated code, but never
the less. It is an argument.
The "step back" is to make this argument explicit in the vm code. Thus making it's handling,
ie passing or saving explicit too. And thus having less magic going on, because you can't
understand magic (you gotta believe it).
### Two steps forward
And so the thrust becomes clear i hope. We are talking about exceptions after all.
Because to those who have not read the windows calling convention on exception handling or even
heard of the dwarf specification thereof, i say don't. It melts the brain.
You have to be so good at playing computer in your head, it's not healthy.
Instead, we make things simple and explicit. An exception is after all just a different way for
a function to return. So we need an address for it to return too.
And as we have just made the normal return address an explicit argument, we just make the
exception return address and argument too. And presto.
Even just the briefest of considerations of how we generate those exception return addresses
(landing pads? what a strange name), leads to the conclusion that if a function does not do
any exception handling, it just passes the same address on, that it got itself. Thus a
generated exception would jump clear over such a function.
Since we have now got the exceptions to be normal code (alas with an exceptional name :-)) control
flow to and from it becomes quite normal too.
To summarize each function has now a minimum of three arguments: the self, the return address and
the exception address.
We have indeed taken a step forward.

View File

@ -0,0 +1,44 @@
%p I am not stuck. I know I'm not. Just because there is little visible progress doesn't mean I'm stuck. It may just feel like it though.
%p But like little cogwheels in the clock, I can hear the background process ticking away and sometimes there is a gong.
%p What I wasn't stuck with is where to draw the layer for the vm.
%h3#layers Layers
%p
Software engineers like layers. Like the onion boy. You can draw boxes, make presentations and convince your boss.
They help us to reason about the software.
%p
In this case the model was to go from the ast layer to a vm layer. Via a compile method that could just as well have been a
visitor.
%p
That didn't work, too big a step, and so it was from ast, to vm, to neumann. But I couldn't decide
on the abstraction of the virtual machine layer. Specifically, when you have a send (and you have
so many sends in ruby), do you:
%ul
%li model it as a vm instruction (a bit like java)
%li implement it in a couple instructions like resolve, a loop and call
%li go to a version that is clearly translatable to neumann, say without the value type implementation
%p
Obviously the third is where we need to get to, as the next step is the neumann layer and somehow
we need to get there. In effect one could take those three and present them as layers, not
as alternatives like I have.
%h3#passes Passes
%p
And then the little cog went click, and the idea of passes resurfaced. LLVM has these passes on
the code tree, which is probably where it surfaced from.
%p
So we can have as high a degree of abstraction as possible when going from ast to code.
And then have as many passes over that as we want / need.
%p
Passes can be order-dependent and create more and more detail. To solve the above layer
conundrum, we just do a pass for each of those options.
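%p
As a rough sketch (names are illustrative, not the actual salama api), such a pipeline is little more than a list of
callables that each take the current representation and return a more detailed one:
%pre
  :preserve
    # hypothetical pass pipeline: each pass adds detail to the code representation
    class Passes
      def initialize
        @passes = []
      end

      def add(pass)              # a pass can come from a gem or any other source
        @passes << pass
        self
      end

      def run(code)
        @passes.reduce(code) { |current, pass| pass.call(current) }
      end
    end

    # replacing one pass is how you would say
    # "I want a non-standard send implementation":
    # Passes.new.add(ExplicitSendPass.new).add(NeumannPass.new).run(ast)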
%p The two main benefits that come from this are:
%p
1 - At each point, ie during and after each pass, we can analyse the data. Imagine for example
that we had picked the second layer option: there would never have been a
representation where the sends were explicit. Thus any analysis of them would be impossible or need reverse engineering (eg call graph analysis, or class caching).
%p
2 - Passes can be gems or come from other sources. The mechanism can be relatively oblivious to
specific passes. And they make the transformation explicit, ie easier to understand.
In the example of having picked the second layer option, one would have to patch the
implementation of that transformation to achieve a different result. With passes it would be
a matter of replacing a pass, thus explicitly stating “I want a non-standard send implementation”.
%p Actually, a third benefit is that it makes testing simpler and more modular: just test the initial ast-&gt;code transformation and then mostly the results of passes.

View File

@ -1,50 +0,0 @@
I am not stuck. I know i'm not. Just because there is little visible progress doesn't mean i'm stuck. It may just feel like it though.
But like little cogwheels in the clock, i can hear the background process ticking away and sometimes there is a gong.
What i wasn't stuck with, is where to draw the layer for the vm.
### Layers
Software engineers like layers. Like the onion boy. You can draw boxes, make presentation and convince your boss.
They help us to reason about the software.
In this case the model was to go from ast layer to a vm layer. Via a compile method, that could just as well have been a
visitor.
That didn't work, too big astep and so it was from ast, to vm, to neumann. But i couldn't decide
on the abstraction of the virtual machine layer. Specifically, when you have a send (and you have
soo many sends in ruby), do you:
- model it as a vm instruction (a bit like java)
- implement it in a couple instructions like resolve, a loop and call
- go to a version that is clearly translatable to neumann, say without the value type implementation
Obviously the third is where we need to get to, as the next step is the neumann layer and somewhow
we need to get there. In effect one could take those three and present them as layers, not
as alternatives like i have.
### Passes
And then the little cob went click, and the idea of passes resurfaced. LLvm has these passes on
the code tree, is probably where it surfaced from.
So we can have as high of a degree of abstraction as possible when going from ast to code.
And then have as many passes over that as we want / need.
Passes can be order dependent, and create more and more detail. To solve the above layer
conundrum, we just do a pass for each of those options.
The two main benefits that come from this are:
1 - At each point, ie after and during each pass we can analyse the data. Imagine for example
that we would have picked the second layer option, that means there would never have been a
representation where the sends would have been explicit. Thus any analysis of them would be impossible or need reverse engineering (eg call graph analysis, or class caching)
2 - Passes can be gems or come from other sources. The mechanism can be relatively oblivious to
specific passes. And they make the transformation explicit, ie easier to understand.
In the example of having picked the second layer level, one would have to patch the
implementation of that transformation to achieve a different result. With passes it would be
a matter of replacing a pass, thus explicitly stating "i want a non-standard send implementation"
Actually a third benefit is that it makes testing simpler. More modular. Just test the initial ast->code and then mostly the results of passes.

View File

@ -0,0 +1,77 @@
%p In a picture, or when taking a picture, the frame is very important. It sets whatever is in the picture into context.
%p
So it is a bit strange that having a
%strong frame
had the same sort of effect for me in programming.
I made the frame explicit, as an object, with functions and data, and immediately the whole
message sending became a whole lot clearer.
%p
You read about frames in calling conventions, or otherwise when talking about the machine stack.
It is the area a function uses for storing data, be it arguments, locals or temporary data.
Often a frame pointer will be used to establish a frame's dynamic size and things like that.
But since it's all so implicit and handled by code very few programmers ever see, it was
all a bit muddled for me.
%p My frame has: return and exceptional return address, self, arguments, locals, temps.
%p And methods to: create a frame, get a value to or from a slot (args/locals/tmps), and return or raise.
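%p
As a rough ruby sketch (illustrative only, not the actual implementation), the frame described above could look like this:
%pre
  :preserve
    # sketch of the explicit frame: two return targets, self, and three kinds of slots
    class Frame
      attr_reader :return_address, :exception_address, :self_object

      def initialize(return_address, exception_address, self_object)
        @return_address    = return_address
        @exception_address = exception_address
        @self_object       = self_object
        @slots = { arguments: [], locals: [], temps: [] }
      end

      def get(kind, index)              # kind is :arguments, :locals or :temps
        @slots.fetch(kind)[index]
      end

      def set(kind, index, value)
        @slots.fetch(kind)[index] = value
      end
    end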
%h3#the-divide-compile-and-runtime The divide, compile and runtime
%p
I saw
%a{:href => "http://codon.com/compilers-for-free"} Toms video on free compilers
and read the underlying
book on
%a{:href => "http://www.itu.dk/people/sestoft/pebook/jonesgomardsestoft-a4.pdf"} Partial Evaluation
a bit, and it helped to make the distinctions clearer. As did the Layers and Passes post.
And the explicit Frame.
%p
The explicit frame established the vm explicitly too, or at least much better. All actions of the vm happen
in terms of the frame. Sending is creating a new one, loading it, finding the method and branching
there. Getting and setting variables is just indexing into the frame at the right slot and so on.
Instance variables are a send to self, and on it goes.
%p
The great distinction is in the end quite simple: it is compile-time or run-time. And the passes
idea helps in that I start with the most simple implementation against my vm. Then I have a data structure and can keep expanding it to “implement” more detail. Or I can analyse it to remove
redundancies, ie optimize. But the point is that in both cases I can just think about data structures
and what to do with them.
%p
And what I can do with my data (which is of course partially instruction sequences, but that's beside the point) really always depends on the great question: compile-time vs run-time.
What is constant, I can do immediately. Otherwise leave it for later. Simple.
%p
An example, attribute accessor: a simple send. I build a frame, set the self. Now a fully dynamic
implementation would leave it at that. But I can check whether I know the type: if it's not a
reference (ie an integer) we can raise immediately. Also, a reference tags the class for when
that is known at compile time. If so I can determine the layout at compile time and inline the
getter's implementation. If not I could cache, but that's for later.
%p
As a further example on this, when one function has two calls on the same object, the layout
need only be retrieved once. Ie in the sequence getType, determine method, call, the first
step can be omitted for the second call as the layout is constant.
%p
And as a final bonus of all this clarity, I immediately spotted the inconsistency in my own design: the frame I designed holds local variables, but the caller needs to create it. The caller
cannot possibly know the number of local variables, as that is decided by the invoked method,
which is only known at run-time. So we clearly need a two-level thing here, one
that the caller creates, and one that the receiver creates.
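%p
In rough ruby terms (again just a sketch, with made-up names), the split could look like this: the caller knows the
receiver and the arguments, only the callee knows how many locals it needs.
%pre
  :preserve
    # caller side: everything the caller can know
    Message = Struct.new(:receiver, :name, :arguments, :return_address)
    # callee side: sized by the invoked method itself
    Frame = Struct.new(:locals, :temps)

    message = Message.new("some receiver", :width, [], :return_label)  # built by the caller
    frame   = Frame.new(Array.new(2), Array.new(1))                    # built by the callee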
%h3#messaging-and-slots Messaging and slots
%p It is interesting to relate what emerges to concepts learned over the years:
%p
There is this idea of message passing, as opposed to function calling. Everyone I know has learned
an imperative language as their first language, and so message passing is a bit like vegetarian
food, all right for some. But of course there is a distinct difference in dynamic languages, as
one does not know the actual method invoked beforehand. Also, exceptions make the return trickier,
and default values even complicate the argument passing, which then has to be augmented by the receiver.
%p
One main difficulty I had with the message passing idea has always been what the message actually is.
But now that I have the frame, I know exactly what it is: it is the frame, nothing more, nothing less.
(Postscript: later I introduced the Message object, which gets created by the caller, while the Frame
is what is created by the callee.)
%p
Another interesting observation is the (hopefully) golden path this design goes between smalltalk
and self. In smalltalk (like ruby and…) all objects have a class. But some of the smalltalk researchers went on to do
= succeed "," do
%a{:href => "http://en.wikipedia.org/wiki/Self_(programming_language)"} Self
which has no classes, only objects. This was supposed to make things easier and faster.
Slots were a bit like instance variables, but there were no classes to rule them.
%p
Now in ruby, any object can have any variables anyway, but they incur a dynamic lookup. Types on
the other hand are like slots, and keeping each Type constant (while an object can change layouts)
makes it possible to have completely dynamic behaviour (smalltalk/ruby)
%strong and
use a slot-like (self) system with constant lookup speed. Admittedly the constancy only affects cache hits, but
as most systems are not dynamic most of the time, that is almost always.

View File

@ -1,73 +0,0 @@
In a picture, or when taking a picture, the frame is very important. It sets whatever is in the picture into context.
So it is a bit strange that having a **frame** had the same sort of effect for me in programming.
I made the frame explicit, as an object, with functions and data, and immediately the whole
message sending became a whole lot clearer.
You read about frames in calling conventions, or otherwise when talking about the machine stack.
It is the area a function uses for storing data, be it arguments, locals or temporary data.
Often a frame pointer will be used to establish a frames dynamic size and things like that.
But since it's all so implicit and handled by code very few programmers ever see it was
all a bit muddled for me.
My frame has: return and exceptional return address, self, arguments, locals, temps
and methods to: create a frame, get a value to or from a slot or args/locals/tmps , return or raise
### The divide, compile and runtime
I saw [Tom's video on free compilers](http://codon.com/compilers-for-free) and read the underlying
book on [Partial Evaluation](http://www.itu.dk/people/sestoft/pebook/jonesgomardsestoft-a4.pdf) a bit, and it helped to make the distinctions clearer. As did the Layers and Passes post.
And the explicit Frame.
The explicit frame established the vm explicitly too, or much better. All actions of the vm happen
in terms of the frame. Sending is creating a new one, loading it, finding the method and branching
there. Getting and setting variables is just indexing into the frame at the right index and so on.
Instance variables are a send to self, and on it goes.
The great distinction is at the end quite simple, it is compile-time or run-time. And the passes
idea helps in that i start with most simple implementation against my vm. Then i have a data structure and can keep expanding it to "implement" more detail. Or i can analyse it to save
redundancies, ie optimize. But the point is in both cases i can just think about data structures
and what to do with them.
And what i can do with my data (which is off course partially instruction sequences, but that's beside the point) really always depends on the great question: compile time vs run-time.
What is constant, can i do immediately. Otherwise leave for later. Simple.
An example, attribute accessor: a simple send. I build a frame, set the self. Now a fully dynamic
implementation would leave it at that. But i can check if i know the type, if it's not
reference (ie integer) we can raise immediately. Also the a reference tags the class for when
that is known at compile time. If so i can determine the layout at compile time and inline the
get's implementation. If not i could cache, but that's for later.
As a further example on this, when one function has two calls on the same object, the layout
must only be retrieved once. ie in the sequences getType, determine method, call, the first
step can be omitted for the second call as a layout is constant.
And as a final bonus of all this clarity, i immediately spotted the inconsistency in my own design: The frame i designed holds local variables, but the caller needs to create it. The caller can
not possibly know the number of local variables as that is decided by the invoked method,
which is only known at run-time. So we clearly need a two level thing here, one
that the caller creates, and one that the receiver creates.
### Messaging and slots
It is interesting to relate what emerges to concepts learned over the years:
There is this idea of message passing, as opposed to function calling. Everyone i know has learned
an imperative language as the first language and so message passing is a bit like vegetarian
food, all right for some. But off course there is a distinct difference in dynamic languages as
one does not know the actual method invoked beforehand. Also exceptions make the return trickier
and default values even the argument passing which then have to be augmented by the receiver.
One main difficulty i had in with the message passing idea has always been what the message is.
But now i have the frame, i know exactly what it is: it is the frame, nothing more nothing less.
(Postscript: Later introduced the Message object which gets created by the caller, and the Frame
is what is created by the callee)
Another interesting observation is the (hopefully) golden path this design goes between smalltalk
and self. In smalltalk (like ruby and...) all objects have a class. But some of the smalltalk researchers went on to do [Self](http://en.wikipedia.org/wiki/Self_(programming_language)), which
has no classes only objects. This was supposed to make things easier and faster. Slots were a bit like instance variables, but there were no classes to rule them.
Now in ruby, any object can have any variables anyway, but they incur a dynamic lookup. Types on
the other hand are like slots, and keeping each Type constant (while an object can change layouts)
makes it possible to have completely dynamic behaviour (smalltalk/ruby) **and** use a slot-like (self) system with constant lookup speed. Admittedly the constancy only affects cache hits, but
as most systems are not dynamic most of the time, that is almost always.

View File

@ -0,0 +1,49 @@
%p It has been a bit of a journey, but now we have arrived: Salama is officially named.
%h3#salama Salama
%p
Salama is a
= succeed "," do
%strong real word
%p
It is a word of my
%strong home-country
Finland, a Finnish word (double plus).
%p
Salama means
%strong lightning
(or flash), and that is fast (double double plus) and bright.
%p
As some may have noticed in most places my nick is
= succeed "." do
%strong dancinglightning
%p
Also
%strong my wife
suggested it, so it always reminds me of her.
%h4#journey Journey
%p I started with crystal, which I liked. It speaks of clarity. It is related to ruby. All was good.
%p
But I was not the first to have this thought: the name is taken, as I found out by
chance. Ary Borenszweig started the
%a{:href => "http://crystal-lang.org/"} project
already two
years ago and they not only have a working system, but even compile themselves.
%p
Alas, Ary started out with the idea of ruby on rockets (ie fast), but when the
dynamic aspects came (as they have for me a month ago), he went for speed, to be
precise for a static system, not for ruby.
So his crystal is now its own language with ruby-ish style, but not semantics.
%p
That is why I had not found it. But when I did, we talked, all was friendly, and we
agreed I would look for a new name.
%p
And so I did, and many were taken. Kide (crystal in Finnish) was a step on the way,
as was ruby in ruby. And many candidates were explored and discarded, like broom
(basic ruby object oriented machine), or som (simple object machine), even ahimsa.
%h4#official Official
%p But then I found it, or rather we did, as it was a suggestion from my wife: Salama.
%p
After I found the name I made sure to claim it: I published first versions of gems
for salama and sub-modules. They don't work of course, but at least the name is
taken in rubygems too. Of course the github name is too.
%p So now I can get on with things at lightning speed :-)

View File

@ -1,44 +0,0 @@
It has been a bit of a journey, but now we have arrived: Salama is officially named.
### Salama
Salama is a **real word**, not made up or an acronym (plus).
It is a word of my **home-country** Finland, a finnish word (double plus)
Salama means **lightning** (or flash), and that is fast (double double plus) and bright.
As some may have noticed in most places my nick is **dancinglightning**. Nice :-)
Also **my wife** suggested it, so it always reminds me of her.
#### Journey
I started with crystal, which i liked. It speaks of clarity. It is related to ruby. All was good.
But I was not the first to have this thought: The name is taken, as i found out by
chance. Ary Borenszweig started the [project](http://crystal-lang.org/) already two
years ago and they not only have a working system, but even compile themselves.
Alas, Ary started out with the idea of ruby on rockets (ie fast), but when the
dynamic aspects came (as they have for me a month ago), he went for speed, to be
precise for a static system, not for ruby.
So his crystal is now it's own language with ruby-ish style, but not semantics.
That is why i had not found it. But when i did we talked, all was friendly, and we
agreed i would look for a new name.
And so i did and many were taken. Kide (crystal in finish) was a step on the way,
as was ruby in ruby. And many candidates were explored and discarded, like broom
(basic ruby object oriented machine), or som (simple object machine), even ahimsa.
#### Official
But then i found it, or rather we did, as it was a suggestion from my wife: Salama.
After i found the name i made sure to claim it: I published first versions of gems
for salama and sub-modules. They don't work off course, but at least the name is
taken in rubygems too. Off course the github name is too.
So now i can get on with things at lightning speed :-)

View File

@ -0,0 +1,81 @@
%p
While trying to figure out what I am coding, I had to attack this storage format before I wanted to. The
immediate need is for code dumps that are concise but readable. I started with yaml but that just takes
too many lines, so it's too difficult to see what is going on.
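%p
To illustrate (plain ruby, nothing sof-specific): even a tiny object graph costs a type tag plus a line per
attribute when dumped with the stdlib, which adds up fast for a whole instruction graph.
%pre
  :preserve
    require 'yaml'

    # a tiny two-node stand-in for an instruction graph
    Node = Struct.new(:name, :next_node)
    puts YAML.dump(Node.new("first", Node.new("second", nil)))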
%p
I just finished it, it's a sort of condensed yaml I call sof (salama object file), but I want to take the
moment to reflect on why I did this, what the bigger picture is, and where sof may go.
%h3#program-lifecycle Program lifecycle
%p
Let's take a step back to mother smalltalk: there was the image. The image was/is the state of all the
objects in the system. Even threads, everything. Absolute object thinking taken to the ultimate.
A great idea of course, but doomed to ultimately fail, because no man is an island (so no vm is either).
%h4#development Development
%p
Software development is a team sport, a social activity at its core. This is not always realised
when the focus is too much on the outcome, but when you look at it, everything is done in teams.
%p
The other thing not really taken into account in the standard development model is that it is a process in
time that really only gets juicy with a first customer-released version. Then you get into branches for bugs
and features, versions with major and minor, and before long you're in a jungle of code.
%h4#code-centered Code centered
%p
But all that effort is concentrated on code. Ok, nowadays schema evolution is part of the game, so the
existence of data is acknowledged, but only as an external thing. Nowhere near that smalltalk model.
%p
But of course a truly object-oriented program is not just code. It's data too. Maybe currently “just”
configuration and enums/constants and locales, but that is exactly my point.
%p
The lack of defined data/object storage is holding us back, making all our programs fruit-flies.
I mean they live a short time and die. A program has no way of “learning”, of accumulating data/knowledge
to use in a next invocation.
%h4#optimisation-example Optimisation example
%p
Let's take optimisation as an example. So a developer runs tests (rubyprof/valgrind or something)
with some output and makes program changes accordingly. But there are two obvious problems.
Firstly, the data is collected in development, not production. Secondly, and more importantly, a person is
needed.
%p
Of course a program could quite easily monitor itself, possibly over a long time, possibly only when
not at peak load. And surely some optimisations could be automated: a bit like the O1..On compiler
switches, more and more effort could be exerted on critical regions. Possibly all the way to
super-optimisation.
%p
But even if we did this, and a program would improve/jit itself, the fruits of this work are only usable
during that run of that program. Future invocations, just like future versions of that program, do not
benefit. And thus start again, just like in Groundhog Day.
%h3#storage Storage
%p
So to make that optimisation example work, we would need storage: theoretically we could make the program
change its own executable/object files, in ruby even its source. Theoretically, as we have no
representation of the code to work on.
%p
In salama we do have an internal representation, both at the code level (ast) and at the compiled code level
(CompiledMethod, Instructions and friends).
%h4#storage-format Storage Format
%p
Going back to the Image, we can ask why it was doomed to fail: because of the binary,
proprietary implementation. Not because of the idea as such.
%p
Binary data needs either a rigorous specification and/or software to work on it. Work, what work?
We need to merge the data between installations, maintain versions and branches. That sounds a lot like
version control, because it basically is. Of course this "could" have been solved by the smalltalk
people, but wasn't. I think it's fair to say that git was the first system to solve that problem.
%p
And git of course works with diff, and so for a 3-way merge to be successful we need a text format.
Which is why I started with yaml, and which is why sof is also text-based.
%p The other benefit is of course human readability.
%p
So now we have an object file format* in text, and we have git. What we do with it is up to us.
(* well, I only finished the writer. Reading/parsing is “left as an exercise for the reader” :-)
%h4#sof-as-object-file-format Sof as object file format
%p
Ok, I'll sketch it a little: Salama would use sof as its object file format, and only sof would ever be
stored in git. For developers to work, tools would create source, and when that is edited, compile it back to sof.
%p
A program would be a repository of sof and resource files. Some convention for load order would be helpful
and some “area” where programs may collect data or changes to the program. Some may off course alter the
sofs directly.
%p
How, when and how automatically changes are merged (via git) is up to developer policy. But it is
easily imaginable that data in program-designated areas gets merged back into the “mainstream” automatically.

View File

@ -1,88 +0,0 @@
While trying to figure out what i am coding i had to attack this storage format before i wanted to. The
immediate need is for code dumps, that are concise but readable. I started with yaml but that just takes
too many lines, so it's too difficult to see what is going on.
I just finished it, it's a sort of condensed yaml i call sof (salama object file), but i want to take the
moment to reflect why i did this, what the bigger picture is, where sof may go.
### Program lifecycle
Let's take a step back to mother smalltalk: there was the image. The image was/is the state of all the
objects in the system. Even threads, everything. Absolute object thinking taken to the ultimate.
A great idea off course, but doomed to ultimately fail because no man is an island (so no vm is either).
#### Development
Software development is a team sport, a social activity at it's core. This is not always realised,
when the focus is too much on the outcome, but when you look at it, everything is done in teams.
The other thing not really taken into account in the standard developemnt model is that it is a process in
time that really only gets jucy with a first customer released version. Then you get into branches for bugs
and features, versions with major and minor and before long you'r in a jungle of code.
#### Code centered
But all that effort is concentrated on code. Ok nowadays schema evlolution is part of the game, so the
existance of data is acknowledged, but only as an external thing. Nowhere near that smalltalk model.
But off course a truely object oriented program is not just code. It's data too. Maybe currently "just"
configuration and enums/constants and locales, but that is exactly my point.
The lack of defined data/object storage is holding us back, making all our programs fruit-flies.
I mean it lives a short time and dies. A program has no way of "learning", of accumulating data/knowledge
to use in a next invocation.
#### Optimisation example
Let's take optimisation as an example. So a developer runs tests (rubyprof/valgrind or something)
with some output and makes program changes accordingly. But there are two obvious problems.
Firstly the data is collected in development not production. Secondly, and more importantly, a person is
needed.
Of course a program could quite easily monitor itself, possibly over a long time, possibly only when
not at epak load. And surely some optimisations could be automated, a bit like the O1 .. On compiler
switches, more and more effort could be exerted on critical regions. Possibly all the way to
super-optimisation.
But even if we did this, and a program would improve/jit itself, the fruits of this work are only usable
during that run of that program. Future invocations, just like future versions of that program do not
benefit. And thus start again, just like in Groundhog day.
### Storage
So to make that optimisation example work, we would need a storage: Theoretically we could make the program
change it's own executable/object files, in ruby even it's source. Theoretically, as we have no
representation of the code to work on.
In salama we do have an internal representation, both at the code level (ast) and the compiled code
(CompiledMethod, Intructions and friends).
#### Storage Format
Going back to the Image we can ask why was it doomed to fail: because of the binary,
proprietary implementation. Not because of the idea as such.
Binary data needs either a rigourous specification and/or software to work on it. Work, what work?
We need to merge the data between installations, maintain versions and branches. That sounds a lot like
version control, because it basically is. Off course this "could" have been solved by the smalltalk
people, but wasn't. I think it's fair to say that git was the first system to solve that problem.
And git off course works with diff, and so for a 3-way merge to be successful we need a text format.
Which is why i started with yaml, and which is why also sof is text-based.
The other benefit is off course human readability.
So now we have an object file * format in text, and we have git. What we do with it is up to us.
(* well, i only finished the writer. reading/parsing is "left as an excercise for the reader":-)
#### Sof as object file format
Ok, i'll sketch it a little: Salama would use sof as it's object file format, and only sof would ever be
stored in git. For developers to work, tools would create source and when that is edited compile it to sof.
A program would be a repository of sof and resource files. Some convention for load order would be helpful
and some "area" where programs may collect data or changes to the program. Some may off course alter the
sof's directly.
How, when and how automatically changes are merged (via git) is up to developer policy . But it is
easily imaginable that data in program designated areas get merged back into the "mainstream" automatically.

View File

@ -0,0 +1,71 @@
%p The time of introspection is coming to an end and I am finally producing executables again. (hurrah)
%h3#block-and-exception Block and exception
%p
Even though neither ruby blocks nor exceptions are implemented, I have figured out how to do them, which is sort of good news.
I'll see of course when the day comes, but a plan is made and it is this:
%p No information lives on the machine stack.
%p
Maybe it's easier to understand this way: all objects live in memory primarily. Whatever gets moved onto the machine
stack is just a copy and, for purposes of the gc, does not need to be considered.
%h3#objects-4-registers 4 Objects, 4 registers
%p As far as I have determined, the vm needs internal access to exactly four objects. These are:
%ul
%li Message: the currently received one, ie the one that led to the current method being called
%li Self: this is an instance variable of the message
%li Frame: local and temporary variables of the method. Also part of the message.
%li NewMessage: where the next call is prepared
%p And, as stated above, all these objects live in memory.
%h3#single-set-instruction Single Set Instruction
%p
Self and frame are duplicated information, because that makes them easier to transfer. After initial trying, I settled on a
single Instruction to move data around in the vm, Set. It can move instance variables from any of the objects to any
other of the 4 objects.
%p
The implementation of Set ensures that any move to the self slot in Message gets duplicated into the Self register. Same
for the frame, but both are once-per-method occurrences, and both are read-only afterwards, so they don't need updating later.
%p Set, like other instructions, may use any other registers at any time. Those registers (r4 and up) are scratch.
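%p
A hypothetical sketch of the idea in ruby (the real instruction class will differ): Set names a source slot in one of
the four objects and a target slot in another.
%pre
  :preserve
    # illustrative only: the single data-moving instruction of the vm
    class SetInstruction
      OBJECTS = [:message, :self, :frame, :new_message]

      def initialize(from, from_slot, to, to_slot)
        raise ArgumentError unless OBJECTS.include?(from) && OBJECTS.include?(to)
        @from, @from_slot, @to, @to_slot = from, from_slot, to, to_slot
      end
    end

    # move a local into the self slot of the message being prepared
    SetInstruction.new(:frame, :local_0, :new_message, :self)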
%h3#simple-call Simple call
%p
This makes calling relatively simple and thus easy to understand. To make a call we must be in a method, ie Message,
Self and Frame have been set up.
%p
The method then produces values for the call. This involves operations, and the result of those is stored in a variable
(tmp/local/arg). When all values have been calculated, a NewMessage is created and all data moved there (see Set).
%p
A Call is then quite simple: because of the duplication of Self and Frame, we only need to push the Message to the
machine stack. Then we move the NewMessage to Message, unroll (copy) the Self into its register and assign a new
Frame.
%p
Returning is also not overly complicated: remember that the return value is an instance variable in the
Message object. So when the method is done, the value is there, not for example in a dedicated register.
So we need to undo the above: move the current Message to NewMessage, pop the previously pushed message from the
machine stack and unroll the Self and Frame copies.
%p
The caller then continues and can pick up the return value from its NewMessage if it is used for further calculation.
It's as if it did everything to build the (New)Message and immediately the return value was filled in.
%p
As I said, often we need to calculate the values for the call, so we need to make calls. This happens in exactly the same
way, and the result is shuffled to a Frame slot (local or temporary variable).
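%p
The whole dance, written out as ruby-ish pseudocode (Machine and Msg are made-up stand-ins, not salama's real classes):
%pre
  :preserve
    Msg     = Struct.new(:self_object, :frame, :return_value)
    Machine = Struct.new(:message, :self_object, :frame, :new_message, :stack)

    def simple_call(m)
      m.stack.push(m.message)                 # only the Message goes on the machine stack
      m.message     = m.new_message           # the prepared NewMessage becomes current
      m.self_object = m.message.self_object   # unroll the duplicated Self
      m.frame       = m.message.frame         # and assign the new Frame
    end

    def simple_return(m)
      m.new_message = m.message               # the return value travels inside it
      m.message     = m.stack.pop             # restore the caller's Message
      m.self_object = m.message.self_object   # and unroll the copies again
      m.frame       = m.message.frame
    end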
%h3#message-creation Message creation
%p
Well, I hear, that sounds good and almost too easy. But… (there always is one, isn't there) what about the Message and Frame
objects, where do you get those from?
%p
And this is true: in c the Message does not exist, it's just data in registers, and the Frame is created on the stack if
needed.
%p And unfortunately we can't really make a call to get/create these objects, as that would create an endless loop. Hmm.
%p We need a very fast way to create and reuse these objects: a bit like a stack. So let's just use a Stack :-)
%p
Of course not the machine stack, but a Stack object. An array to which we append and from which we take.
It must be global of course, or rather accessible from compiling code. And for speed maybe we use assembler, or,
if things work out well, we can use the same code as what makes builtin arrays tick.
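%p
A sketch of what such a pool might look like in ruby (illustrative, the real thing would be pre-allocated objects in
the object space):
%pre
  :preserve
    # stand-in Message class, just so the sketch runs
    Message = Struct.new(:self_object, :frame, :arguments)

    class MessagePool
      def initialize(size = 1024)
        @free = Array.new(size) { Message.new }   # allocate everything up front
      end

      def get
        @free.pop || raise("message pool exhausted")
      end

      def release(message)
        @free.push(message)
      end
    end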
%p
Still, this is a different problem and the full solution will need a bit of time. But clearly it is solvable and does
not impact the above register usage convention.
%h3#the-fineprint The fineprint
%p
Just for the sake of completeness: the assumption I made at the beginning of the Simple Call section can of course not
always be true.
%p
To boot the vm, we must create the first message by “magic” and place it and the Self (Kernel module reference).
As it can be an empty Message for now, this is not difficult, just one of those little gotchas.

View File

@ -1,83 +0,0 @@
The time of introspection is coming to an end and i am finally producing executables again. (hurrah)
### Block and exception
Even neither ruby blocks or exceptions are implemented i have figured out how to do it, which is sort of good news.
I'll see off course when the day comes, but a plan is made and it is this:
No information lives on the machine stack.
Maybe it's easier to understand this way: All objects live in memory primarily. Whatever get's moved onto the machine
stack is just a copy and, for purposes of the gc, does not need to be considered.
### 4 Objects, 4 registers
As far as i have determined the vm needs internal access to exactly four objects. These are:
- Message: the currently received one, ie the one that in a method led to the method being called
- Self: this is an instance variable of the message
- Frame: local and temporary variables of the method. Also part of the message.
- NewMessage: where the next call is prepared
And, as stated above, all these objects live in memory.
### Single Set Instruction
Self and frame are duplicated information, because then it is easier to transfer. After inital trying, i settle on a
single Instruction to move data around in the vm, Set. It can move instance variables from any of the objects to any
other of the 4 objects.
The implementation of Set ensures that any move to the self slot in Message gets duplicated into the Self register. Same
for the frame, but both are once per method occurances, and both are read only afterwards, so don't need updating later.
Set, like other instructions may use any other variables at any time. Those registers (r4 and up) are scratch.
### Simple call
This makes calling relatively simple and thus easy to understand. To make a call we must be in a method, ie Message,
Self and Frame have been set up.
The method then produces values for the call. This involves operations and the result of that is stored in a variable
(tmp/local/arg). When all values have been calculated a NewMessage is created and all data moved there (see Set)
A Call is then quite simple: because of the duplication of Self and Frame, we only need to push the Message to the
machine stack. Then we move the NewMessage to Message, unroll (copy) the Self into it's register and assign a new
Frame.
Returning is also not overly complicated: Remembering that the return value is an instance variable in the
Message object. So when the method is done, the value is there, not for example in a dedicated register.
So we need to undo the above: move the current Message to NewMessage, pop the previously pushed message from the
machine stack and unroll the Self and Frame copies.
The caller then continues and can pick up the return from it's NewMessage if it is used for further calculation.
It's like it did everything to built the (New)Message and immediately the return value was filled in.
As I said, often we need to calculate the values for the call, so we need to make calls. This happens in exacly the same
way, and the result is shuffled to a Frame slot (local or temporary variable).
### Message creation
Well, i hear, that sounds good and almost too easy. But .... (always one isn't there) what about the Message and Frame
objects, where do you get those from ?
And this is true: in c the Message does not exist, it's just data in registers and the Frame is created on the stack if
needed.
And unfortunately we can't really make a call to get/create these objects as that would create an endless loop. Hmm
We need a very fast way to create and reuse these objects: a bit like a stack. So let's just use a Stack :-)
Off course not the machine stack, but a Stack object. An array to which we append and take from.
It must be global off course, or rather accessible from compiling code. And fast may be that we use assembler, or
if things work out well, we can use the same code as what makes builtin arrays tick.
Still, this is a different problem and the full solution will need a bit time. But clearly it is solvable and does
not impact above register usage convention.
### The fineprint
Just for the sake of completeness: The assumtion i made a the beginning of the Simple Call section, can off course not
possibly be always true.
To boot the vm, we must create the first message by "magic" and place it and the Self (Kernel module reference).
As it can be an empty Message for now, this is not difficult, just one of those little gotachs.

View File

@ -0,0 +1,100 @@
%p The register machine abstraction has been somewhat thin, and it is time to change that
%h3#current-affairs Current affairs
%p
When I started, I started from the assembler side, getting arm binaries working and of course learning the arm cpu
instruction set in assembler mnemonics.
%p
Not having
%strong any
experience at this level, I felt that arm was pretty sensible. Much better than I expected. And
so I abstracted the basic instruction classes a little and had the arm instructions implement them pretty much one
to one.
%p
Then I tried to implement any ruby logic in that abstraction and failed. Thus was born the virtual machine
abstraction of having Message, Frame and Self objects. This in turn mapped nicely to registers with indexed
addressing.
%h3#addressing Addressing
%p
I just have to sidestep here a little about addressing: the basic problem is of course that we have no idea at
compile-time at what address the executable will end up.
%p
The problem first emerged with calling functions. Mostly because those were the only objects I had, and so I was
very happy to find out about pc relative addressing, in which you jump or call relative to your current position
(
%strong> p
rogram
= succeed "ounter)." do
%strong c
Since the relation is not changed by relocation, all is well.
%p
Then came the first strings, and the approach can be extended: instead of grabbing some memory location, ie loading
an address and dereferencing, we calculate the address in relation to pc and then dereference. This is great and
works fine.
%p
But the smug smile is wiped off the face when one tries to store references. This came with the whole object
approach, the bootspace holding references to
%strong all
objects in the system. I even devised a plan to always store
relative addresses. Not relative to pc, but relative to the self that is storing. This I'm sure would have
worked fine too, but it does mean that the running program also has to store those relative addresses (or have
different address types, shudder). That was a runtime burden I was not willing to accept.
%p
So there are two choices as far as I see: use elf relocation, or relocate in init code. And yet again I find myself
biased to the home-grown approach. Of course I see that this is partly because I don't want to learn the innards of
elf as something very complicated that does a simple thing. But also because it is so simple, I am hoping it isn't
such a big deal. Most of the code for it, object iteration, type testing, layout decoding, will be useful and
necessary later anyway.
%h3#concise-instruction-set Concise instruction set
%p
So that addressing aside was meant to further the point that we need a good register instruction set (to write the
relocation in). And the code that I have been writing to implement the vm instructions clearly shows the need for
a better model at the register level.
%p
On the other hand, the idea of Passes will make it very easy to have a completely separate register machine layer.
We just transform the vm to that, and then later from that to arm (or later intel). So there are three things that I
am looking for with the new register machine instruction set:
%ul
%li easy to understand the model (ie register machine, pc, ..), free of real machine quirks
%li small set of instructions that is needed for our vm
%li better names for instructions
%p
Especially the last one: all the mvn and ldr is getting to me. It's so 50s, as if we didn't have the space to spell
out move or load. And even those are not good names, at least I am always wondering what is a move and what a load.
And as I explained above in the addressing, if I wanted to load an address of an object into a register with relative
addressing, I would actually have to do an add. But when reading an add instruction it is not an intuitive
conclusion that a load is meant. And since this is a fresh effort I would rather change these things now and make
it easier for others to learn sensible stuff than for me to get used to cryptic names, only to have everyone after me do the same.
%p
So I will have instructions like RegisterMove, ConstantLoad, Branch, which will translate to mov, ldr and b in arm. I still like to keep the arm level with the traditional names, so people who actually know arm feel right at home.
But the extra register layer will make it easier for everyone who has not programmed assembler (and me!),
which I am guessing is quite a lot in the
%em ruby
community.
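%p
A rough sketch of what that could look like (not the actual salama classes): plain instruction objects, and a
per-instruction translation to the arm mnemonics.
%pre
  :preserve
    # illustrative register instructions with readable names
    RegisterMove = Struct.new(:to, :from)         # becomes mov
    ConstantLoad = Struct.new(:register, :value)  # becomes ldr
    Branch       = Struct.new(:label)             # becomes b

    ARM = {
      RegisterMove => ->(i) { format("mov %s, %s", i.to, i.from) },
      ConstantLoad => ->(i) { format("ldr %s, =%s", i.register, i.value) },
      Branch       => ->(i) { format("b %s", i.label) }
    }

    def to_arm(instruction)
      ARM.fetch(instruction.class).call(instruction)
    end

    to_arm(RegisterMove.new(:r1, :r2))   # => "mov r1, r2"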
%p
In implementation terms it is a relatively small step from the vm layer to the register layer. And an even smaller
one to the arm layer. But small steps are good, easy to take, easy to understand, no stumbling.
%h3#extra-benefits Extra Benefits
%p
As I am doing this for my own sanity, any additional benefits are really extra, for free as it were. And those extra
benefits clearly exist.
%h5#clean-interface-for-cpu-specific-implementation Clean interface for cpu specific implementation
%p
That really says it all. That interface was a bit messy, as the RegisterMachine was used in Vm code, but was actually
an Arm implementation. So no separation. Also, as mentioned, the instruction set was arm-heavy, with the quirks
even arm has.
%p
So in the future any specific cpu implementation can be quite self-sufficient. The classes it uses don't need to
derive from anything specific and need only implement the very small code interface (position/length/assemble).
And to hook in, all that is needed is to provide a translation from RegisterMachine instructions, which can be
done very nicely by providing a Pass for every instruction. That layer of code is quite separate from the actual
assembler, so it should be easy to reuse existing code (like wilson or metasm).
%h5#reusable-optimisations Reusable optimisations
%p
Clearly the better separation allows for better optimisations. Concretely, Passes can be written to optimize the
RegisterMachine's workings. For example register use, constant extraction from loops, or folding of double
moves (when a value is moved from reg1 to reg2, and then from reg2 to reg3, with reg2 never being used again).
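%p
A naive sketch of such a pass (it assumes the middle register really is dead, a real pass would check liveness):
%pre
  :preserve
    RegisterMove = Struct.new(:to, :from)

    # fold "move r1->r2, move r2->r3" into "move r1->r3"
    def fold_double_moves(instructions)
      folded = []
      i = 0
      while i < instructions.length
        current, following = instructions[i], instructions[i + 1]
        if current.is_a?(RegisterMove) && following.is_a?(RegisterMove) &&
           following.from == current.to
          folded << RegisterMove.new(following.to, current.from)
          i += 2
        else
          folded << current
          i += 1
        end
      end
      folded
    end

    moves = [RegisterMove.new(:r2, :r1), RegisterMove.new(:r3, :r2)]
    fold_double_moves(moves)   # => one move, r1 straight to r3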
%p
Such optimisations are very general and should then be reusable for specific cpu implementations. They are still
useful at RegisterMachine level, mind, as the code is "cleaner" there and it is easier to detect fluff. But the same
code may be run after a cpu translation, removing any "fluff" the translation introduced. Thus the translation
process may be kept simpler too, as it doesn't need to check for possible optimisations at the same time
as translating. Everyone wins :-)

View File

@ -1,96 +0,0 @@
The register machine abstraction has been somewhat thin, and it is time to change that
### Current affairs
When i started, i started from the assembler side, getting arm binaries working and off course learning the arm cpu
instruction set in assembler memnonics.
Not having **any** experience at this level i felt that arm was pretty sensible. Much better than i expected. And
so i abtracted the basic instruction classes a little and had the arm instructions implement them pretty much one
to one.
Then i tried to implement any ruby logic in that abstraction and failed. Thus was born the virtual machine
abstraction of having Message, Frame and Self objects. This in turn mapped nicely to registers with indexed
addressing.
### Addressing
I just have to sidestep here a little about addressing: the basic problem is off course that we have no idea at
compile-time at what address the executable will end up.
The problem first emerged with calling functions. Mostly because that was the only objects i had, and so i was
very happy to find out about pc relative addressing, in which you jump or call relative to your current position
(**p**rogram **c**ounter). Since the relation is not changed by relocation all is well.
Then came the first strings and the aproach can be extended: instead of grabbing some memory location, ie loading
and address and dereferencing, we calculate the address in relation to pc and then dereference. This is great and
works fine.
But the smug smile is wiped off the face when one tries to store references. This came with the whole object
aproach, the bootspace holding references to **all** objects in the system. I even devised a plan to always store
relative addresses. Not relative to pc, but relative to the self that is storing. This i'm sure would have
worked fine too, but it does mean that the running program also has to store those relative addresses (or have
different address types, shudder). That was a runtime burden i was not willing to accept.
So there are two choices as far as i see: use elf relocation, or relocate in init code. And yet again i find myself
biased to the home-growm aproach. Off course i see that this is partly because i don't want to learn the innards of
elf as something very complicated that does a simple thing. But also because it is so simple i am hoping it isn't
such a big deal. Most of the code for it, object iteration, type testing, layout decoding, will be useful and
neccessary later anyway.
### Concise instruction set
So that addressing aside was meant to further the point of a need for a good register instruction set (to write the
relocation in). And the code that i have been writing to implement the vm instructions clearly shows a need for
a better model at the register model.
On the other hand, the idea of Passes will make it very easy to have a completely sepeate register machine layer.
We just transfor the vm to that, and then later from that to arm (or later intel). So there are three things that i
am looking for with the new register machine instruction set:
- easy to understand the model (ie register machine, pc, ..), free of real machine quirks
- small set of instructions that is needed for our vm
- better names for instructions
Especially the last one: all the mvn and ldr is getting to me. It's so 50's, as if we didn't have the space to spell
out move or load. And even those are not good names, at least i am always wondering what is a move and what a load.
And as i explained above in the addressing, if i wanted to load an address of an object into a register with relative
addressing, i would actually have to do an add. But when reading an add instruction it is not an intuative
conclusion that a load is meant. And since this is a fresh effort i would rather change these things now and make
it easier for others to learn sensible stuff than me get used to cryptics only to have everyone after me do the same.
So i will have instructions like RegisterMove, ConstantLoad, Branch, which will translate to mov, ldr and b in arm. I still like to keep the arm level with the traditional names, so people who actually know arm feel right at home.
But the extra register layer will make it easier for everyone who has not programmed assembler (and me!),
which i am guessing is quite a lot in the *ruby* community.
In implementation terms it is a relatively small step from the vm layer to the register layer. And an even smaller
one to the arm layer. But small steps are good, easy to take, easy to understand, no stumbling.
### Extra Benefits
As i am doing this for my own sanity, any additional benefits are really extra, for free as it were. And those extra
benefits clearly exist.
##### Clean interface for cpu specific implementation
That really says it all. That interface was a bit messy, as the RegisterMachine was used in Vm code, but was actually
an Arm implementation. So no seperation. Also as mentioned the instruction set was arm heavy, with the quirks
even arm has.
So in the future any specific cpu implementation can be quite self sufficient. The classes it uses don't need to
derive from anything specific and need only implement the very small code interface (position/length/assemble).
And to hook in, all that is needed is to provide a translation from RegisterMachine instructions, which can be
done very nicely by providing a Pass for every instruction. So that layer of code is quite seperate from the actual
assembler, so it should be easy to reuse existing code (like wilson or metasm).
##### Reusable optimisations
Clearly the better seperation allows for better optimisations. Concretely Passes can be written to optimize the
RegiterMachine's workings. For example register use, constant extraction from loops, or folding of double
moves (when a value is moved from reg1 to reg2, and then from reg2 to reg3, and reg2 never being used).
Such optimisations are very general and should then be reusable for specific cpu implementations. They are still
usefull at RegiterMachine level mind, as the code is "cleaner" there and it is easier to detect fluff. But the same
code may be run after a cpu translation, removing any "fluff" the translation introduced. Thus the translation
process may be kept simpler too, as that doesn't need to check for possible optimisations at the same time
as translating. Everyone wins :-)

View File

@ -0,0 +1,28 @@
%p As before the original start of the project, I was 6 weeks on holiday. The distance and lack of computer really help.
%h3#review Review
%p So I printed most of the code and the book and went over it. And apart from abysmal spelling, I found one mistake in particular.
%p I had been going at the thing from the angle of producing binaries. Wrong approach.
%h4#ruby-is-dynamic Ruby is Dynamic
%p In fact ruby is so dynamic it is hard to think of anything that you need to do at compile time that you can't do at runtime.
%p
In other words,
%em all
functionality is available at run-time. Ie it needs to be available in ruby, and since it then is available in ruby, one should reuse it. I had just sort of tried to avoid this, as it seemed so big.
%p In fact it is quite easy to express what needs to happen for eg a method call, in ruby. The hard thing is to use that code at compile time.
%h4#inlining Inlining
%p When I say hard, I mean hard to code. Actually it is quite easy to understand. One “just” needs to inline the code, easy actually. Of course I had known that inlining would be necessary in the end, I had just thought later would be fine. Well, it isn't. Of course, is it ever!
%p Inlining is making the functionality happen without initiating a method call and return. Of course this is only possible for known function calls, but that's enough. The objects/classes we use during method dispatch are well known, so everything can be resolved at compile time. Hunky dory. Just how?
%p As a first step we change the self, while saving the old self to a tmp. Then we have to deal with how the called function accesses variables (arguments or locals). We know it does this through the Message and Frame objects. But since those are different for an inlined function, we have to make them explicit arguments. So instead of the normal eg. Message, we can create an InlineMessage for inlined function. When resolving a variable name, this InlinedMessage will look up in the parents variables and arrange access to that.
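%p
To make that a little more concrete, here is a minimal ruby sketch of the idea. The names (InlineMessage, resolve, argument_map) are made up for illustration, not the actual classes:
%pre
%code
:preserve
# Hypothetical sketch: an inlined call gets an InlineMessage instead of a real
# Message object. Variable resolution is delegated to the caller's message,
# so the inlined body reads and writes the caller's slots directly.
class InlineMessage
  def initialize(parent, argument_map)
    @parent       = parent         # the surrounding (inlining) method's message
    @argument_map = argument_map   # callee argument name => caller slot name
  end

  def resolve(name)
    @parent.resolve(@argument_map.fetch(name, name))
  end
end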
%h4#changes Changes
%p So some of the concrete changes that will come once i've done all cosmetic fixes:
%ul
%li much more parfait classes / functionality
%li remove all duplication in vm (that is now parfait)
%li change of compile, using explicit message/frames
%li explicit logic type (alongside integer + reference)
%p I also decided it would be cleaner to use the visitor pattern for compiling the ast to vm. In fact the directory should be named compile.
%p And i noticed that what i have called Builtin up to now is actually part of the Register machine layer (not vm), so it needs to move there.
%h3#some-publicity Some publicity
%p
I have now given lightning talks at Frozen Rails 2014 and Bath Ruby 2015.
As 5 minutes is clearly not enough, i will work on a longer presentation.

View File

@ -1,41 +0,0 @@
As before the original start of the project, i was 6 weeks on holiday. The distance and lack of computer really helps.
### Review
So i printed most of the code and the book and went over it. And apart from abismal spelling i found especially one mistake.
I had been going at the thing from the angle of producing binaries. Wrong aproach.
#### Ruby is Dynamic
In fact ruby is so dynamic it is hard to think of anything that you need to do at compile time that you can't do at runtime.
In other words, *all* functionality is available at run-time. Ie it needs to be available in ruby, and since it then is available in ruby, one should reuse it. I had just sort of tried to avoid this, as it seemed so big.
In fact it is quite easy to express what needs to happed for eg. a method call, in ruby. The hard thing is to use that code at compile time.
#### Inlining
When i say hard, i mean hard to code. Actually it is quite easy to understand. One "just" needs to inline the code, easy actually. Off course i had known that inlining would be neccessary in the end, i had just thought later would be fine. Well, it isn't. Off course, is it ever!
Inlining is making the functionality happen, without initializing a method call and return. Off course this is only possible for known function calls, but that's enough. The objects/classes we use during method dispatch are well known, so everything can be resolved at compile time. Hunky dory. Just how?
As a first step we change the self, while saving the old self to a tmp. Then we have to deal with how the called function accesses variables (arguments or locals). We know it does this through the Message and Frame objects. But since those are different for an inlined function, we have to make them explicit arguments. So instead of the normal eg. Message, we can create an InlineMessage for inlined function. When resolving a variable name, this InlinedMessage will look up in the parents variables and arrange access to that.
#### Changes
So some of the concrete changes that will come once i've done all cosmetic fixes:
- much more parfait classes / functionality
- remove all duplication in vm (that is now parfait)
- change of compile, using explicit message/frames
- explicit logic type (alongside integer + reference)
I also decided it would be cleaner to use the visitor pattern for compiling the ast to vm. In fact the directory should be named compile.
And i noticed that what i have called Builtin up to now is actually part of the Register machine layer (not vm), so it needs to move there.
### Some publicity
I have now given lightning talk on Frozen Rails 2014 and Ruby Bath 2015.
As 5 Minutes is clearly now enough i will work on a longer presentation.

View File

@ -0,0 +1,71 @@
%p
Since i got the ideas of Slots and the associated instruction Set, i have been wondering how that
fits in with the code generation.
%p
I moved the patched AST compiler methods to a Compiler, ok. But still what do all those compile
methods return.
%h2#expression Expression
%p
In ruby, everything is an expression. To recap “Expressions have a value, while statements do not”,
or statements represent actions while expressions represent values.
%p
So in ruby everything represents a value, even statements and functions. There is no such thing
as returning void in C. Even loops and ifs result in a value: for a loop the last computed value,
and for an if the value of the branch taken.
%p
Having had only a vague grasp of this concept, i tried to sort of haphazardly return the kind of value
that i thought appropriate. Sometimes literals, sometimes slots. Sometimes “Return”, a slot
representing the return value of a function.
%h2#return-slot Return slot
%p Today i realized that the Slot representing the return value is special.
%p It does not hold the value that is returned, but rather the other way around.
%p A function returns what is in the Return slot, at the time of return.
%p
From there it is easy to see that it must be the Return that holds the last computed value.
A function can return at any time after all.
%p
The last computed value is the Expression that is currently evaluated. So the compile method, which
initiates the evaluation, returns the Return slot. Always. Easy, simple, nice!
%h2#example Example
%p Constants: say the expression
%pre
%code
:preserve
true
%p would compile to a
%pre
%code
:preserve
ConstantLoad(ReturnSlot , TrueConstant)
%p While
%pre
%code
:preserve
2 + 4
%p would compile to
%pre
%code
:preserve
ConstantLoad(ReturnSlot , IntegerConstant(2))
Set(ReturnSlot , OtherSlot)
ConstantLoad(ReturnSlot , IntegerConstant(4))
Set(ReturnSlot , EvenOtherSlot)
MethodCall() # unspecified details here
%h2#optimisations Optimisations
%p
But but but i hear that is so totally inefficient. All the time we move data around, to and from
that one Return slot, just so that the return is simple. Yes but no.
%p
It is very easy to optimize the trivial extra moves away. Many times the expression moves a value to Return
just to move it away in the next Instruction. A sequence like in the above example
%pre
%code
:preserve
ConstantLoad(ReturnSlot , IntegerConstant(2))
Set(ReturnSlot , OtherSlot)
%p can easily be optimized into
%pre
%code
:preserve
ConstantLoad(OtherSlot , IntegerConstant(2))
%p tbc

View File

@ -1,72 +0,0 @@
Since i got the ideas of Slots and the associated instruction Set, i have been wondering how that
fits in with the code generation.
I moved the patched AST compiler methods to a Compiler, ok. But still what do all those compile
methods return.
## Expression
In ruby, everything is an expression. To recap "Expressions have a value, while statements do not",
or statements represent actions while expressions represent values.
So in ruby everything represents a value, also statements, or functions. There is no such thing
as the return void in C. Even loops and ifs result in a value, for a loop the last computed value
and for an if the value of the branch taken.
Having had a vague grasp of this concept i tried to sort of haphazardly return the kind of value
that i though appropriate. Sometimes literals, sometimes slots. Sometimes "Return" , a slot
representing the return value of a function.
## Return slot
Today i realized that the Slot representing the return value is special.
It does not hold the value that is returned, but rather the other way around.
A function returns what is in the Return slot, at the time of return.
From there it is easy to see that it must be the Return that holds the last computed value.
A function can return at any time after all.
The last computed value is the Expression that is currently evaluated. So the compile, which
initiates the evaluation, returns the Return slot. Always. Easy, simple, nice!
## Example
Constants: say the expression
true
would compile to a
ConstantLoad(ReturnSlot , TrueConstant)
While
2 + 4
would compile to
ConstantLoad(ReturnSlot , IntegerConstant(2))
Set(ReturnSlot , OtherSlot)
ConstantLoad(ReturnSlot , IntegerConstant(4))
Set(ReturnSlot , EvenOtherSlot)
MethodCall() # unspecified details here
## Optimisations
But but but i hear that is so totally inefficient. All the time we move data around, to and from
that one Return slot, just so that the return is simple. Yes but no.
It is very easy to optimize the trivial extra away. Many times the expression moves a value to Return
just to move it away in the next Instruction. A sequence like in above example
ConstantLoad(ReturnSlot , IntegerConstant(2))
Set(ReturnSlot , OtherSlot)
can easily be optimized into
ConstantLoad(OtherSlot , IntegerConstant(2))
tbc

View File

@ -0,0 +1,57 @@
%p
Quite long ago i
%a{:href => "/2014/06/27/an-exceptional-thought.html"} had already determined
that return
addresses and exceptional return addresses should be explicitly stored in the message.
%p
It was also clear that Message would have to be a linked list. Just managing that list at run-time
in Register Instructions (ie almost assembly) proved hard. Not that i was creating Message objects
but i did shuffle their links about. I linked and unlinked messages by setting their next/prev fields
at runtime.
%h2#the-list-is-static The List is static
%p
Now i realized that touching the list structure in any way at runtime is not necessary.
The list is completely static, ie created at compile time and never changed.
%p
To be more precise: I created the Messages at compile time and set them up as a forward linked list.
Each Item had a
%em caller
field (a backlink) which i then filled at run-time. I was keeping the next
message to be used as a variable in the Space, and because that is basically global it was
relatively easy to update when making a call.
But i noticed when debugging that when i updated the message's next field, it was already set to
the value i was setting it to. And that made me stumble and think. Off course!
%p
It is the data
%strong in
the Messages that changes. But not the Message, nor the call chain.
%p
As a programmer one has the call graph in mind, and as that is a graph, i was thinking that the
Message list changes. But no. When working on one message, it is always the same message one sends
next. Just as one always returns to the same one that called.
%p It is the addresses and Method arguments that change, not the message.
%p
The best analogy i can think of is calling a friend. Whatever you say, it is always the same
number you call.
%p
Or in C terms, when using the stack (push/pop), it is not the stack memory that changes, only the
pointer to the top. A stack is an array, right, so the array stays the same,
even its size stays the same. Only the used part of it changes.
%h2#simplifies-call-model Simplifies call model
%p
Obviously this simplifies the way one thinks about calls. Just stick the data into the pre-existing
Message objects and go.
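%p
As a small ruby sketch (field names made up, the real Message holds more), a call could look like this:
%pre
%code
:preserve
# Hypothetical sketch: messages are pre-allocated and pre-linked at compile time.
# A call never creates a Message, it only fills the next one in the static chain.
class Message
  attr_accessor :caller, :next_message
  attr_accessor :receiver, :name, :arguments, :return_address
end

def send_message(current, receiver, name, arguments, return_address)
  msg = current.next_message       # static chain, created at compile time
  msg.caller         = current
  msg.receiver       = receiver
  msg.name           = name
  msg.arguments      = arguments
  msg.return_address = return_address
  msg                              # the callee now works on this message
end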
%p
When i first had the
%a{:href => "/2014/06/27/an-exceptional-thought.html"} return address as argument
idea,
i was thinking that in case of exception one would have to garbage collect Messages.
In the same way that i was thinking that they need to be dynamically managed.
%p
Wrong again. The message chain (double linked list to be precise) stays. One just needs to clear
the data out of them, so that garbage does get collected. Anyway, it's all quite simple and that's
nice.
%p
As an upshot from this new simplicity we get
= succeed "." do
%strong speed
As the method enter and exit codes are 3-4 (arm) instructions, we are on par with c.
Oh and i forgot to mention Frames. Don't need to generate those at run-time either.
Every message gets a static Frame. Done. Up to the method what to do with it.
Ie don't use it, or use it as an array, or create an array to store more than
fits into the static frame.

View File

@ -1,53 +0,0 @@
Quite long ago i [had already determined](/2014/06/27/an-exceptional-thought.html) that return
addresses and exceptional return addresses should be explicitly stored in the message.
It was also clear that Message would have to be a linked list. Just managing that list at run-time
in Register Instructions (ie almost assembly) proved hard. Not that i was creating Message objects
but i did shuffle their links about. I linked and unlinked messages by setting their next/prev fields
at runtime.
## The List is static
Now i realized that touching the list structure in any way at runtime is not necessary.
The list is completely static, ie created at compile time and never changed.
To be more precise: I created the Messages at compile time and set them up as a forward linked list.
Each Item had *caller* field (a backlink) which i then filled at run-time. I was keeping the next
message to be used as a variable in the Space, and because that is basically global it was
relatively easy to update when making a call.
But i noticed when debugging that when i updated the message's next field, it was already set to
the value i was setting it to. And that made me stumble and think. Off course!
It is the data **in** the Messages that changes. But not the Message, nor the call chain.
As programmer one has the call graph in mind and as that is a graph, i was thinking that the
Message list changes. But no. When working on one message, it is always the same message one sends
next. Just as one always returns to the same one that called.
It is the addresses and Method arguments that change, not the message.
The best analogy i can think of is when calling a friend. Whatever you say, it is alwas the same
number you call.
Or in C terms, when using the stack (push/pop), it is not the stack memory that changes, only the
pointer to the top. A stack is an array, right, so the array stays the same,
even it's size stays the same. Only the used part of it changes.
## Simplifies call model
Obviously this simplifies the way one thinks about calls. Just stick the data into the pre-existing
Message objects and go.
When i first had the [return address as argument](/2014/06/27/an-exceptional-thought.html) idea,
i was thinking that in case of exception one would have to garbage collect Messages.
In the same way that i was thinking that they need to be dynamically managed.
Wrong again. The message chain (double linked list to be precise) stays. One just needs to clear
the data out from them, so that garbage does get collected. Anyway, it's all quite simple and that's
nice.
As an upshot from this new simplicity we get **speed**. As the method enter and exit codes are
3-4 (arm) instructions, we are on par with c. Oh and i forgot to mention Frames. Don't need to
generate those at run-time either. Every message gets a static Frame. Done. Up to the method
what to do with it. Ie don't use it or use it as array, or create an array to store more than
fits into the static frame.

View File

@ -0,0 +1,60 @@
%hr/
%p
After almost a year of rewrite:
%strong Hello World
is back.
%p
%strong Working executables again
%p
So much has changed in the last year it is almost impossible to recap.
Still a little summary:
%h3#register-machine Register Machine
%p
The whole layer of the
%a{:href => "/2014/09/30/a-better-register-machine.html"} Register Machine
as an
abstraction was not there. Impossible it was to see what was happening.
%h3#passes Passes
%p
In the beginning i was trying to
= succeed "." do
%em just do it
Just compile the vm down to arm instructions.
But the human brain (or possibly just mine) is not made to think in terms of process.
I think much better in terms of Structure. So i made vm and register instructions and
%a{:href => "/2014/07/05/layers-vs-passes.html"} implemented Passes
to go between them.
%h3#the-virtual-machine-design The virtual machine design
%p
Thinking about what objects make up a virtual machine has brought me to a clear understanding
of the
= succeed "." do
%a{:href => "/2014/09/12/register-allocation-reviewed.html"} objects needed
In fact things got even simpler as stated in that post, as i have
%a{:href => "/2014/06/27/an-exceptional-thought.html"} stopped using the machine stack
altogether and am using a linked list instead.
Recently it has occurred to me that that linked list
%a{:href => "/06/20/the-static-call-chain.html"}> doesn't even change
, so it is very simple indeed.
%h3#smaller-though-not-small-changes Smaller, though not small, changes
%ul
%li
The
%a{:href => "/2014/08/19/object-storage.html"} Salma Object File
format was created.
%li
The
%a{:href => "http://dancinglightning.gitbooks.io/the-object-machine/content/"} Book
was started
%li I gave lightning talks at Frozen Rails 2014, Helsinki and Bath Ruby 2015
%li I presented at Munich and Zurich user groups, lots to take home from all that
%h3#future Future
%p
The mountain is still oh so high, but at last there is hope again. The second dip into arm
(gdb) debugging has made it very clear that a debugger is needed. Preferably visual, possibly 3d,
definitely browser based. So either Opal or even Volt.
%p Already more clarity in upcoming fields has arrived:
%ul
%li inlining is high on the list, to code in a higher language
%li
the difference between
%a{:href => "/2015/05/20/expression-is-slot.html"} statement and expression
helped
to structure code.
%li hopefully the debugger / interpreter will help to write better tests too.

View File

@ -1,49 +0,0 @@
---
After almost a year of rewrite: **Hello World** is back.
**Working executables again**
So much has changed in the last year it is almost impossible to recap.
Still a little summary:
### Register Machine
The whole layer of the [Register Machine](/2014/09/30/a-better-register-machine.html) as an
abstraction was not there. Impossible is was to see what was happening.
### Passes
In the beginning i was trying to *just do it*. Just compile the vm down to arm instructions.
But the human brain (or possibly just mine) is not made to think in terms of process.
I think much better in terms of Structure. So i made vm and register instructions and
[implemented Passes](/2014/07/05/layers-vs-passes.html) to go between them.
### The virtual machine design
Thinking about what objects makes up a virtual machine has brought me to a clear understanding
of the [objects needed](/2014/09/12/register-allocation-reviewed.html).
In fact things got even simpler as stated in that post, as i have
[stopped using the machine stack](/2014/06/27/an-exceptional-thought.html)
altogether and am using a linked list instead.
Recently is has occurred to me that that linked list
[doesn't even change](/06/20/the-static-call-chain.html), so it is very simple indeed.
### Smaller, though not small, changes
- The [Salma Object File](/2014/08/19/object-storage.html) format was created.
- The [Book](http://dancinglightning.gitbooks.io/the-object-machine/content/) was started
- I gave lightning talks at Frozen Rails 2014, Helsinki and Bath Ruby 2015
- I presented at Munich and Zurich user groups, lots to take home from all that
### Future
The mountain is still oh so high, but at last there is hope again. The second dip into arm
(gdb) debugging has made it very clear that a debugger is needed. Preferably visual, possibly 3d,
definitely browser based. So either Opal or even Volt.
Already more clarity in upcoming fields has arrived:
- inlining is high on the list, to code in higher language
- the difference between [statement and expression](/2015/05/20/expression-is-slot.html) helped
to structure code.
- hopefully the debugger / interpreter will help to write better tests too.

View File

@ -0,0 +1,72 @@
%p
It really is like
%a{:href => "http://worrydream.com/#!/InventingOnPrinciple"} Bret Victor
says in his video:
good programmers are the ones who play computer in their head well.
%p Why? Because you have to, to program. And off course that's what i'm doing.
%p
But when it got to debugging, it got a bit much. Using gdb for non C code, i mean it's bad enough
for c code.
%h2#the-debugger The debugger
%p
The process of getting my “hello world” to work was quite hairy, what with debugging with gdb
and checking registers and stuff. Brr.
%p
The idea for a “solution”, my own debugger, possibly graphical, came quite quickly. But the effort seemed a
little big. It took a little, but then i started.
%p
I fiddled a little with fancy 2d or even 3d representations but couldn't get things to work.
Also getting used to running ruby in the browser, with opal, took a while.
%p
But now there is a
%a{:href => "https://github.com/ruby-x/salama-debugger"} basic frame
up,
and i can see registers swishing around, and ideas of what needs
to be visualized, and partly even how, are gushing. Off course it's happening in html,
but that's ok for now.
%p
And the best thing: I found my first serious
%strong bug
visually. Very satisfying.
%p
I do so hope someone will pick this up and run with it. I'll put it on the site as soon as the first
program runs through.
%h2#interpreter Interpreter
%p
Off course to have a debugger i needed to start on an interpreter.
Now it wasn't just the technical challenge, but some resistance against interpreting, since the whole
idea of salama was to compile. But in the end it is a very different level that the interpreter
works at. I chose to put it at the register level (not the arm), so it would be useful for future
cpus, and because the register to arm mapping is mainly about naming, not functionality. Ie it is
pretty much one to one.
%p
But off course (he says after the fact), the interpreter solves a large part of the testing
issue. Because i wasn't really happy with tests, and that was because i didn't have a good
idea how to test. Sure, unit tests, fine. But to write all the little unit tests and hope the
total will result in what you want never struck me as a good plan.
%p
Instead i tend to write system tests, and drop down to unit tests to find the bugs in system tests.
But i had no good system tests, other than running the executable. But
= succeed "." do
%strong now i do
I can just run the Interpreter on a program and
see if it produced the right output. And by right output i really just mean stdout.
%p
So two flies with one (oh i don't know how this goes, i'm not english): better tests, and visual
feedback, both driving the process at double speed.
%p
Now i “just” need a good way to visualize a static and running program. (implement breakpoints,
build a class and object inspector, recompile on edit . . .)
%h2#debugger-rewritten-thrice Debugger rewritten, thrice
%p
Update: after trying around with a
%a{:href => "https://github.com/orbitalimpact/opal-pixi"} 2d graphics
implementation i have rewritten the ui in
%a{:href => "https://github.com/catprintlabs/react.rb"} react
,
%a{:href => "https://github.com/voltrb/volt"} Volt
and
= succeed "." do
%a{:href => "https://github.com/opal/opal-browser"} OpalBrowser
%p
The last is what gave the easiest to understand code. It also has the least dependencies, namely
only opal and opal-browser. Opal-browser is a small opal wrapper around the browser's
javascript functionality.

View File

@ -1,63 +0,0 @@
It really is like [Bret Victor](http://worrydream.com/#!/InventingOnPrinciple) says in his video:
good programmers are the ones who play computer in their head well.
Why? Because you have to, to program. And off course that's what i'm doing.
But when it got to debugging, it got a bit much. Using gdb for non C code, i mean it's bad enough
for c code.
## The debugger
The process of getting my "hello world" to work was quite hairy, what with debugging with gdb
and checking registers and stuff. Brr.
The idea for a "solution", my own debugger, possibly graphical, came quite quickly. But the effort seemed a
little big. It took a little, but then i started.
I fiddled a little with fancy 2 or even 3d representations but couldn't get things to work.
Also getting used to running ruby in the browser, with opal, took a while.
But now there is a [basic frame](https://github.com/ruby-x/salama-debugger) up,
and i can see registers swishing around and ideas of what needs
to be visualized and partly even how, are gushing. Off course it's happening in html,
but that ok for now.
And the best thing: I found my first serious **bug** visually. Very satisfying.
I do so hope someone will pick this up and run with it. I'll put it on the site as soon as the first
program runs through.
## Interpreter
Off course to have a debugger i needed to start on an interpreter.
Now it wasn't just the technical challenge, but some resistance against interpreting, since the whole
idea of salama was to compile. But in the end it is a very different level that the interpreter
works at. I chose to put it at the register level (not the arm), so it would be useful for future
cpu's, and because the register to arm mapping is mainly about naming, not functionality. Ie it is
pretty much one to one.
But off course (he says after the fact), the interpreter solves a large part of the testing
issue. Because i wasn't really happy with tests, and that was because i didn't have a good
idea how to test. Sure unit tests, fine. But to write all the little unit tests and hope the
total will result in what you want never struck me as a good plan.
Instead i tend to write system tests, and drop down to unit tests to find the bugs in system tests.
But i had no good system tests, other than running the executable. But **now i do**.
I can just run the Interpreter on a program and
see if it produced the right output. And by right output i really just mean stdout.
So two flies with one (oh i don't know how this goes, i'm not english), better test, and visual
feedback, both driving the process at double speed.
Now i "just" need a good way to visualize a static and running program. (implement breakpoints,
build a class and object inpector, recompile on edit . . .)
## Debugger rewritten, thrice
Update: after trying around with a [2d graphics](https://github.com/orbitalimpact/opal-pixi)
implementation i have rewritten the ui in [react](https://github.com/catprintlabs/react.rb) ,
[Volt](https://github.com/voltrb/volt) and [OpalBrowser](https://github.com/opal/opal-browser).
The last is what got the easiest to understand code. Also has the least dependencies, namely
only opal and opal-browser. Opal-browser is a small opal wrapper around the browsers
javascript functionality.

View File

@ -0,0 +1,143 @@
%p
It is the
%strong one
thing i said i wasn't going to do: Write a language.
There are too many languages out there already, and just because i want to write a vm,
doesn't mean i want to add to the language jungle.
%strong But
%h2#the-gap The gap
%p
As it happens in life, which is why they say never to say never, it happened just the way
i didn't want. It turns out the semantic gap of what i have is too large.
%p
There is the
%strong register level
, which is approximately assembler, and there is the
%strong vm level
which is more or less the ruby level. So my head hurts from trying to implement ruby in assembler,
no wonder.
%p
Having run into this wall, which btw is the same wall that crystal ran into, one can see the sense
in what others have done more clearly: Why rubinius uses c++ underneath. Why crystal does not
implement ruby, but a statically typed language. And ultimately why there is no ruby compiler.
The gap is just too large to bridge.
%h2#the-need-for-a-language The need for a language
%p
As I have the architecture of passes, i was hoping to get by with just another layer in the
architecture. A tried and tested approach after all. And while i won't say that that isn't a
possibility, i just don't see it. I think it may be one of those where hindsight will be perfect.
%p
I can see as far as this: If i implement a language, that will mean a parser, ast and compiler.
The target will be my register layer. So a reasonable step up is a sort of object c, that has
basic integer maths and object access. I'll detail that more below, but the point is, if i have
that, i can start writing a vm implementation in that language.
%p
Off course the vm implementation involves a parser, an ast and a compiler, unless we go to the free
compilers (see below). And so implementing the vm in a new language is in essence swapping nodes of
the higher level tree with nodes of the lower level (c-ish) one. Ie parsing should not strictly
speaking be necessary. This node swapping is after all what the pass architecture was designed
to do. But, as i said, i just can't see that happening (yet?).
%h3#trees-vs-blocks Trees vs. Blocks
%p
Speaking of the Pass architecture: I flopped. Well, maybe not so much with the actual Passes, but
with the Method representation. Blocks holding Instructions, and being in essence a list.
Misinformed copying from llvm, misinformed by the final outcome. Off course the final binary
has a linear address space, but that is where the linearity ends. The natural structure of code
is a tree, not a list, as demonstrated by the parse
= succeed "." do
%em tree
Flattening it just creates navigational problems. Also as a mental model a tree is easier,
as it is easy to imagine swapping out subtrees, expanding or collapsing nodes etc.
%h2#soml---salama-object-machine-language Soml - Salama Object Machine Language
%h3#typed Typed
%p
Quite a while before crystallizing into the idea of a new language, i already saw the need for a type
system. Off course, and this dates back to the first memory layouts. But i mean the need for a
%em strong typing
system, or maybe it's even clearer to call it compile time typing. The type that c
and c++ have. It is essential (mentally, this is off course all for the programmer, not the computer)
to be able to think in a static type system, and then extend that and make it dynamic.
Or possibly use it in a dynamic way.
%p
This is a good example of this too big gap, where one just steps on quicksand if everything is
all the time dynamic.
%p
The way i had the implementation figured was to have different versions of the same function. In
each function we would have compile time types, everything known. I'll probably still do that,
just written in Soml.
%h3#machine-language Machine language
%p
Soml is a machine language for the Salama machine. As i tried to implement without this layer, i was
essentially implementing in assembler. Too much.
%p
There are two main features we need from the machine language: one is a typed oo memory model,
the other an oo call model.
%h3#object-c Object c
%p
The language needs to be object based, off course. Just because it's typed and not dynamic
and closer to assembler, doesn't mean we need to give up objects. In fact we mustn't. Soml
should be a little bit like c++, ie compile time known variable arrangement and types,
objects. But no classes (or inheritance), more like structs, with full access to everything.
So a struct.variable syntax would mean grab that variable at that address, no functions, no possible
override, just get it. This is actually already implemented as i needed it for the slot access.
%p So objects without encapsulation or classes. A lower level object orientation.
%h3#whitequark Whitequark
%p
This new approach (and more experience) shed a new light on ruby parsing. The previous idea was to
start small, write the necessary stuff in the parsable subset and with time expand that set.
%p
Alas . . ruby is a beast to parse, and because of the
%strong semantic gap
writing the system,
even in a subset, is not viable. And it turns out the brave warriors of the ruby community have
already produced a pure, production ready,
= succeed "." do
%a{:href => "https://github.com/whitequark/parser"} ruby parser
That can obviously read itself and anything else, so the start small approach is doubly out.
%h3#interoperability Interoperability
%p
The system code needs to be callable from the higher level, and possibly the other way around.
This probably means the same or compatible calling mechanism and data model. The data model is
quite simple, as at the system level all is just machine words, but in object sized
packets. As for the calling, it will probably mean that the same message object needs to be used
and what is now called calling at the machine level is supported. Sending off course won't be.
%h3#still-missing-a-piece Still missing a piece
%p
How the level below calling can be represented is still open. It is clear though that it does need
to be present, as otherwise any kind of concurrency is impossible to achieve. The question ties
in with the still open question of
= succeed "." do
%a{:href => "http://valerieaurora.org/synthesis/SynthesisOS/ch4.html"} Quajects
Meaning, what is the yin in the yin and yang of object oriented programming. The normal yang way sees
the code as active and the data as passive. By normal i mean oo implementations in which blocks and
closures just fall from the sky and have no internal structure. There is obviously a piece of
the puzzle missing that Alexia was onto.
%h3#start-small Start small
%p The first next step is to wrap the functionality i have in the Passes as a language.
%p Then to expand that language, by writing increasingly more complex programs in it.
%p
And then to re-attack ruby using the whitequark parser, that probably means jumping on the
mspec train.
%p All in all, no biggie :-)
%h2#compilers-are-not-free Compilers are not free
%p
Oh and i re-read and re-watched Tom's
%a{:href => "http://codon.com/compilers-for-free"} compilers for free
talk,
which did make quite an impression on me the first time. But when i really thought about actually
going down that road (who doesn't enjoy a free beer), i got into the small print.
%p
The second biggest of which is that writing a partial evaluator is just about as complicated
as writing a compiler.
%p
But the biggest problem is that the (free) compiler you could get, has the implementation language
of the evaluator as its
= succeed "." do
%strong output
You need a compiler to start with, in other words.
Also the interpreter would have to be written in the same compilable language.
So writing a ruby compiler by writing a ruby interpreter would mean
writing the interpreter in c, and (worse) writing the partial evaluator
%em for
c, not for ruby.
%p
Ok, maybe it is not quite as bad as that makes it sound. As i do have the register layer ready
and will be writing a c-ish language, it may even be possible to write an interpreter
= succeed "," do
%strong in soml
and then it would be ok to write an evaluator
%strong for soml
too.
%p
I will nevertheless go the straighter route for now, ie write a compiler, and maybe return to the
promised freebie later. It does feel like a lot of what the partial evaluator does would be called
compiler optimization in another lingo. So maybe the road will lead there naturally.

View File

@ -1,144 +0,0 @@
It is the **one** thing i said i wasn't going to do: Write a language.
There are too many languages out there already, and just because i want to write a vm,
doesn't mean i want to add to the language jungle.
**But** ...
## The gap
As it happens in life, which is why they say never to say never, it happens just like it
i didn't want. It turns out the semantic gap of what i have is too large.
There is the **register level** , which is approximately assembler, and there is the **vm level**
which is more or less the ruby level. So my head hurts from trying to implement ruby in assembler,
no wonder.
Having run into this wall, which btw is the same wall that crystal ran into, one can see the sense
in what others have done more clearly: Why rubinus uses c++ underneath. Why crystal does not
implement ruby, but a statically typed language. And ultimately why there is no ruby compiler.
The gap is just too large to bridge.
## The need for a language
As I have the architecture of passes, i was hoping to get by with just another layer in the
architecture. A tried an tested approach after all. And while i won't say that that isn't a
possibility, i just don't see it. I think it may be one of those where hindsight will be perfect.
I can see as far as this: If i implement a language, that will mean a parser, ast and compiler.
The target will be my register layer. So a reasonable step up is a sort of object c, that has
basic integer maths and object access. I'll detail that more below, but the point is, if i have
that, i can start writing a vm implementation in that language.
Off course the vm implementation involves a parser, an ast and a compiler, unless we go to the free
compilers (see below). And so implementing the vm in a new language is in essence swapping nodes of
the higher level tree with nodes of the lower level (c-ish) one. Ie parsing should not strictly
speaking be necessary. This node swapping is after all what the pass architecture was designed
to do. But, as i said, i just can't see that happening (yet?).
### Trees vs. Blocks
Speaking of the Pass architecture: I flopped. Well, maybe not so much with the actual Passes, but
with the Method representation. Blocks holding Instructions, and being in essence a list.
Misinformed copying from llvm, misinformed by the final outcome. Off course the final binary
has a linear address space, but that is where the linearity ends. The natural structure of code
is a tree, not a list, as demonstrated by the parse *tree*. Flattening it just creates navigational
problems. Also as a metal model it is easier, as it is easy to imagine swapping out subtrees,
expanding or collapsing nodes etc.
## Soml - Salama Object Machine Language
### Typed
Quite a while before crystallizing into the idea of a new language, i already saw the need for a type
system. Off course, and this dates back to the first memory layouts. But i mean the need for a
*strong typing* system, or maybe it's even clearer to call it compile time typing. The type that c
and c++ have. It is essential (mentally, this is off course all for the programmer, not the computer)
to be able to think in a static type system, and then extend that and make it dynamic.
Or possibly use it in a dynamic way.
This is a good example of this too big gap, where one just steps on quicksand if everything is
all the time dynamic.
The way i had the implementation figured was to have different versions of the same function. In
each function we would have compile time types, everything known. I'll probably still do that,
just written in Soml.
### Machine language
Soml is a machine language for the Salama machine. As i tried to implement without this layer, i was
essentially implementing in assembler. Too much.
There are two main feature we need from the machine language, one is typed a typed oo memory model,
the other an oo call model.
### Object c
The language needs to be object based, off course. Just because it's typed and not dynamic
and closer to assembler, doesn't mean we need to give up objects. In fact we mustn't. Soml
should be a little bit like c++, ie compile time known variable arrangement and types,
objects. But no classes (or inheritance), more like structs, with full access to everything.
So a struct.variable syntax would mean grab that variable at that address, no functions, no possible
override, just get it. This is actually already implemented as i needed it for the slot access.
So objects without encapsulation or classes. A lower level object orientation.
### Whitequark
This new approach (and more experience) shed a new light on ruby parsing. The previous idea was to
start small, write the necessary stuff in the parsable subset and with time expand that set.
Alas . . ruby is a beast to parse, and because of the **semantic gap** writing the system,
even in a subset, is not viable. And it turns out the brave warriors of the ruby community have
already produced a pure, production ready, [ruby parser](https://github.com/whitequark/parser).
That can obviously read itself and anything else, so the start small approach is doubly out.
### Interoperability
The system code needs to be callable from the higher level, and possibly the other way around.
This probably means the same or compatible calling mechanism and data model. The data model is
quite simple as the at the system level all is just machine words, but in object sized
packets. As for the calling it will probably mean that the same message object needs to be used
and what is now called calling at the machine level is supported. Sending off course won't be.
### Still missing a piece
How the level below calling can be represented is still open. It is clear though that it does need
to be present, as otherwise any kind of concurrency is impossible to achieve. The question ties
in with the still open question of [Quajects](http://valerieaurora.org/synthesis/SynthesisOS/ch4.html).
Meaning, what is the yin in the yin and yang of object oriented programming. The normal yang way sees
the code as active and the data as passive. By normal i mean oo implementations in which blocks and
closures just fall from the sky and have no internal structure. There is obviously a piece of
the puzzle missing that Alexia was onto.
### Start small
The first next step is to wrap the functionality i have in the Passes as a language.
Then to expand that language, by writing increasingly more complex programs in it.
And then to re-attack ruby using the whitequark parser, that probably means jumping on the
mspec train.
All in all, no biggie :-)
## Compilers are not free
Oh and i re-read and re-watched Toms [compilers for free](http://codon.com/compilers-for-free) talk,
which did make quite an impression on me the first time. But when i really thought about actually
going down that road (who does't enjoy a free beer), i got into the small print.
The second biggest of which is that writing a partial evaluator is just about as complicated
as writing a compiler.
But the biggest problem is that the (free) compiler you could get, has the implementation language
of the evaluator, as it's **output**. You need a compiler to start with, in other words.
Also the interpreter would have to be written in the same compilable language.
So writing a ruby compiler by writing a ruby interpreter would mean
writing the interpreter in c, and (worse) writing the partial evaluator *for* c, not for ruby.
Ok, maybe it is not quite as bad as that makes it sound. As i do have the register layer ready
and will be writing a c-ish language, it may even be possible to write an interpreter **in soml**,
and then it would be ok to write an evaluator **for soml** too.
I will nevertheless go the straighter route for now, ie write a compiler, and maybe return to the
promised freebie later. It does feel like a lot of what the partial evaluator is, would be called
compiler optimization in another lingo. So may be road will lead there naturally.

View File

@ -0,0 +1,72 @@
%p
Ok, that was surprising: I just wrote a language in two months. Parser, compiler, working binaries
and all.
%p
Then i
%a{:href => "/typed/typed.html"} documented it
, detailed the
%a{:href => "/typed/syntax.html"} syntax
and even did
some
= succeed "." do
%a{:href => "/typed/benchmarks.html"} benchmarking
Speed is luckily roughly where i wanted it. Mostly
(only mostly?) slower than c, but only by about 50, very understandable, percent. It is doing
things in a more roundabout, and easier to understand way, and lacking any optimisation. It means
you can do about a million fibonacci(20) in a second on a pi, and beat ruby at it by about
a factor of 20.
%p
So, the good news:
%strong it works
%p
Working means: calling works, if, while, assignment, class and method definition. The benchmarks
were hello world and fibonacci, both recursive and by looping.
%p
I even updated the
%a{:href => "/book.html"}
%strong whole book
to be up to date. Added a Soml section, updated
parfait, rewrote the register level . . .
%h3#it-all-clicked-into-place It all clicked into place
%p
To be fair, i don't think anyone writes a language that isn't a toy in 2 months, and it was only
possible because a lot of the stuff was there already.
%ul
%li
%a{:href => "/typed/parfait.html"} Parfait
was pretty much there. Just consolidated it as it is all just adapter.
%li
The
%a{:href => "/typed/debugger.html"} Register abstraction
(bottom) was there.
%li Using the ast library made things easier.
%li
A lot of the
%a{:href => "https://github.com/ruby-x/salama-reader"} parser
could be reused.
%p And off course the second time around everything is easier (aka hindsight is perfect).
%p
One of the better movie lines comes to mind,
(
%a{:href => "http://www.imdb.com/title/tt1341188/quotes"}> paraphrased
) “We are all just one small
adjustment away from making our code work”. It was a step sideways in the head which brought a leap
forward in terms of direction. Not where i was going but where i wanted to go.
%h3#open-issues Open issues
%p
Clearly i had wobbled on the parfait front. Now it's clear it will have to be recoded in soml,
and then re-translated into ruby. But it was good to have it there in ruby all the time for the
concepts to solidify.
%p
Typing is not completely done, and negative tests for types are non-existent. Also still missing are
exceptions and the machinery for returns.
%p
I did a nice framework for testing the binaries on a remote machine, would be nice to have it
on travis. But my image is over 2Gb.
%h3#and-onto-the-next-compiler And onto the next compiler
%p
The ideas about how to compile ruby into soml have been percolating and are waiting to be put to
action.
%a{:href => "http://book.salama-vm.org/object/dynamic_types.html"} The theory
looks good, but one has
to see it to believe it.
%p
The first steps are quite clear though. Get the
%a{:href => "https://github.com/whitequark/parser"} ruby parser
integrated, get the compiler up, start with small tests. Work the types at the same time.
%p And let the adventure continue.

View File

@ -1,57 +0,0 @@
Ok, that was surprising: I just wrote a language in two months. Parser, compiler, working binaries
and all.
Then i [documented it](/typed/typed.html) , detailed the [syntax](/typed/syntax.html) and even did
some [benchmarking](/typed/benchmarks.html). Speed is luckily roughly where i wanted it. Mostly
(only mostly?) slower than c, but only by about 50, very understandable percent. It is doing
things in a more roundabout, and easier to understand way, and lacking any optimisation. It means
you can do about a million fibonacci(20) in a second on a pi, and beat ruby at it by a about
a factor of 20.
So, the good news: it **it works**
Working means: calling works, if, while, assignment, class and method definition. The benchmarks
were hello world and fibonacci, both recursive and by looping.
I even updated the [**whole book**](/book.html) to be up to date. Added a Soml section, updated
parfait, rewrote the register level . . .
### It all clicked into place
To be fair, i don't think anyone writes a language that isn't a toy in 2 months, and it was only
possible because a lot of the stuff was there already.
- [Parfait](/typed/parfait.html) was pretty much there. Just consolidated it as it is all just adapter.
- The [Register abstraction](/typed/debugger.html) (bottom) was there.
- Using the ast library made things easier.
- A lot of the [parser](https://github.com/ruby-x/salama-reader) could be reused.
And off course the second time around everything is easier (aka hindsight is perfect).
One of the better movie lines comes to mind,
([paraphrased](http://www.imdb.com/title/tt1341188/quotes)) "We are all just one small
adjustment away from making our code work". It was a step sideways in the head which brought a leap
forward in terms of direction. Not where i was going but where i wanted to go.
### Open issues
Clearly i had wobbled on the parfait front. Now it's clear it will have to be recoded in soml,
and then re-translated into ruby. But it was good to have it there in ruby all the time for the
concepts to solidify.
Typing is not completely done, and negative tests for types are non existant. Also exceptions and
the machinery for the returns.
I did a nice framework for testing the binaries on a remote machine, would be nice to have it
on travis. But my image is over 2Gb.
### And onto the next compiler
The ideas about how to compile ruby into soml have been percolating and are waiting to be put to
action. [The theory](http://book.salama-vm.org/object/dynamic_types.html) looks good,but one has
to see it to believe it.
The first steps are quite clear though. Get the [ruby parser](https://github.com/whitequark/parser)
integrated, get the compiler up, start with small tests. Work the types at the same time.
And let the adventure continue.

View File

@ -0,0 +1,63 @@
%p
Writing Soml helped a lot to separate the levels, or phases of the ruby compilation process. Helped
me that is, to plan the ruby compiler.
%p
But off course i had not written the ruby compiler, i have only
%a{:href => "https://dancinglightning.gitbooks.io/the-object-machine/content/object/dynamic_types.html"} planned
how the dynamic nature could be implemented, using soml. In very short summary, the plan was to
extend soml's features with esoteric multi-return features and use those to jump around different
implementations when types change.
%h2#the-benefit-of-communication The benefit of communication
%p
But first a thanks. When i was in the US, i talked to quite a few people about my plans. Everything
helped, but special thanks goes to Caleb for pointing out two issues.
%p
The simpler one is that what i had named Layout, is usually called Type. I have changed the code
and docs now and must admit it is a better name.
%p
The other thing Caleb was right about is that Soml is what is called an intermediate representation.
This rubbed a little, especially since i had just moved away from a purely intermediate
representation to an intermediate language. But still, we'll see below that the language is not
enough to solve the dynamic issues. I have already created an equivalent intermediate
representation (to the soml ast) and will probably let go of the language completely, in time.
%p
So thanks to Caleb, and a thumbs up for anyone else reading, to
%strong make contact
%h2#the-hierarchy-of-languages The hierarchy of languages
%p
It seemed like such a good idea. Just like third level languages are compiled down to second (ie c
to assembler), and second is compiled to first (ie assembler to binary), so fourth level would
get compiled down to third. Such a nice clean world, very appealing.
%p
Until i started working on the details. Specifically how the type (of almost anything) would change
in a statically typed language. And in short, I ran into yet another wall.
%p
So back it is to using an intermediate representation. Alas, at least it is a working one, so down
from there to executable, it is known to work.
%h2#cross-function-jumps Cross function jumps
%p
Let's call a method something akin to what ruby has. It's bound to a type, has a name and arguments.
But neither return types nor argument types are specified. A function would then be a specific
implementation of that method, specific to a certain set of types for the arguments. The return type
is still not fixed.
%p
A compiler can generate all possible functions for a method as the set of basic types is small. Or
it could be a little cleverer and generate stubs and generate the actual functions on demand, as
probably only a fraction of the theoretical possibilities will be needed.
%p
Now, if we have an assignment, say to an argument, from a method call, the type of the variable
may have to change according to the return type.
So the return will be to different addresses (think of it as an if) and so in each branch,
code can be inserted to change the type. But that makes the rest of the function behave wrongly as
it assumes the type before the change.
%p
And this is where the cross function jumps come. Which is also the reason this can not be expressed
in a language. The code then needs to jump to the same place, in a different function.
%p
The function can be pre-compiled, or compiled on demand at that point. All that matters is that the
logic of the function being jumped to is the same as where the jump comes from. And this is
guaranteed by the fact that both functions are generated from the same (untyped ruby) source code.
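%p
As a rough sketch (all names made up, not the actual code), the jump could be resolved through labels that all functions of a method share, because they come from the same source ast:
%pre
%code
:preserve
# Hypothetical sketch: every typed function generated from one ruby method keeps
# labels keyed by the source ast node, so after a type change the code can jump
# to the same logical place in the function compiled for the new type combination.
TypedFunction = Struct.new(:arg_types, :labels)   # labels: ast node => address

def cross_function_jump(functions_of_method, new_arg_types, current_node)
  target = functions_of_method.fetch(new_arg_types)   # pre-compiled or compiled on demand
  target.labels.fetch(current_node)                   # the same place, in the other function
end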
%h2#next-steps Next steps
%p So what's left to do here: There is the little matter of implementing this plan.
%p Maybe it leads to another wall, maybe this is it. Fingers crossed.

View File

@ -1,66 +0,0 @@
Writing Soml helped a lot to separate the levels, or phases of the ruby compilation process. Helped
me that is, to plan the ruby compiler.
But off course i had not written the ruby compiler, i have only
[planned](https://dancinglightning.gitbooks.io/the-object-machine/content/object/dynamic_types.html)
how the dynamic nature could be implemented, using soml. In very short summary, the plan was to
extend somls feature with esoteric multi-return features and use that to jump around different
implementation when types change.
## The benefit of communication
But first a thanks. When i was in the US, i talked to quite a few people about my plans. Everything
helped, but special thanks goes to Caleb for pointing out two issues.
The simpler one is that what i had named Layout, is usually called Type. I have changed the code
and docs now and must admit it is a better name.
The other thing Caleb was right about is that Soml is what is called an intermediate representation.
This rubbed a little, especially since i had just moved away from a purely intermediate
representation to an intermediate language. But still, we'll see below that the language is not
enough to solve the dynamic issues. I have already created an equivalent intermediate
representation (to the soml ast) and will probably let go of the language completely, in time.
So thanks to Caleb, and a thumbs up for anyone else reading, to **make contact**
## The hierarchy of languages
It seemed like such a good idea. Just like third level languages are compiled down to second (ie c
to assembler), and second is compiled to first (ie assembler to binary), so fourth level would
get compiled down to third. Such a nice clean world, very appealing.
Until i started working on the details. Specifically how the type (of almost anything) would change
in a statically typed language. And in short, I ran into yet another wall.
So back it is to using an intermediate representation. Alas, at least it is a working one, so down
from there to executable, it is know to work.
## Cross function jumps
Let's call a method something akin to what ruby has. It's bound to a type, has a name and arguments.
But both return types and argument types are not specified. Then function could be a specific
implementation of that method, specific to a certain set of types for the arguments. The return type
is still not fixed.
A compiler can generate all possible functions for a method as the set of basic types is small. Or
it could be a little cleverer and generate stubs and generate the actual functions on demand, as
probably only a fraction of the theoretical possibilities will be needed.
Now, if we have an assignment, say to an argument, from a method call, the type of the variable
may have to change according to the return type.
So the return will be to different addresses (think of it as an if) and so in each branch,
code can be inserted to change the type. But that makes the rest of the function behave wrongly as
it assumes the type before the change.
And this is where the cross function jumps come. Which is also the reason this can not be expressed
in a language. The code then needs to jump to the same place, in a different function.
The function can be pre-compiled or compiled on demand at that point. All that matters is that the
logic of the function being jumped to is the same as where the jump comes from. And this is
guaranteed by the fact that both function are generated from the same (untyped ruby) source code.
## Next steps
So what's left to do here: There is the little matter of implementing this plan.
Maybe it leads to another wall, maybe this is it. Fingers crossed.

View File

@ -0,0 +1,108 @@
%p So, the plan, in short:
%ol
%li I need to work a little more on docs. Reading them i notice they are still not up to date
%li The Type system needs work
%li The Method-Function relationship needs to be created
%li Ruby compiler needs to be written
%li Parfait moves back completely into ruby land
%li Soml parser should be scrapped (or will become redundant by 2-4)
%li The memory model needs reworking (global not object based memory)
%h3#type-system 2. Type system
%p
A Type is an ordered list of associations from name to BasicType (Object/Integer). The class exists
off course and has been implemented as an array with the names and BasicTypes laid out in sequence.
This is basically fine, but support for navigation is missing.
%p
The whole type system is basically a graph. A type
%em A
is connected to a type
%em B
if it has exactly
one different BasicType. So
%em A
needs to have
%strong exactly
the same names, and
%strong exactly
one
different BasicType. Another way of saying this is that the two types are related if in the class
that Type represents, exactly one variable changes type. This is off course exactly what happens
when an assignment assigns a different type.
%p
%em A
and
%em B
are also related when
%em A
has exactly one more name entry than
%em B
, but is otherwise
identical. This is what happens when a new variable is added to a class, or one is removed.
%p
The implementation needs to establish this graph (possibly lazily), so that the traversal is fast.
The most likely implementation seems a hash, so a hashing function has to be designed and the equals
implemented.
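%p
To make the idea a little more concrete, a possible shape in ruby could be the following (purely illustrative, not the parfait implementation):
%pre
%code
:preserve
# Hypothetical sketch: a Type as parallel lists of names and BasicTypes,
# with hashing/equality and the "exactly one difference" relation.
Type = Struct.new(:names, :basic_types) do
  def hash
    [names, basic_types].hash
  end

  def eql?(other)
    names == other.names && basic_types == other.basic_types
  end

  # related means reachable in one step in the type graph
  def related_to?(other)
    if names == other.names
      # same variables, exactly one changed BasicType
      basic_types.zip(other.basic_types).count { |mine, theirs| mine != theirs } == 1
    else
      # exactly one name added or removed, all shared names keep their BasicType
      extra = (names - other.names) + (other.names - names)
      return false unless extra.length == 1
      (names & other.names).all? do |name|
        basic_types[names.index(name)] == other.basic_types[other.names.index(name)]
      end
    end
  end
end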
%h3#method-function-relationship 3. Method-Function relationship
%p
Just to get the naming clear: A method is at the ruby level, untyped. A Class has references to
Methods.
%p
Whereas a Function is at the level below, fully typed.
A Function's arguments and local variables have a BasicType.
A Type has references to Functions.
%p
A Function's type is fully described by the combination of the arguments' Type and the Frame Type.
The Frame object is basically a wrapper for all local variables.
%p
A (ruby) Method has N Function “implementations”: one Function for each different combination of
BasicTypes for arguments and local variables. Functions know which Method they belong to, because
their parent Type holds a reference to the Class that the Type describes.
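%p
A possible shape of that relationship in ruby (names made up, purely illustrative):
%pre
%code
:preserve
# Hypothetical sketch: one untyped method, many typed functions,
# keyed by the combination of argument Type and frame Type.
Function = Struct.new(:method_name, :arg_type, :frame_type)

class RubyMethod
  attr_reader :name, :functions

  def initialize(name)
    @name      = name
    @functions = {}          # [arg_type, frame_type] => Function
  end

  # fetch the implementation for one type combination, creating a stub on demand
  def function_for(arg_type, frame_type)
    @functions[[arg_type, frame_type]] ||= Function.new(@name, arg_type, frame_type)
  end
end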
%h3#ruby-compiler 4. Ruby Compiler
%p
Currently there is only the Function level and the soml compiler. The ruby level / compiler is
missing.
%p
The Compiler generates Class objects, and Type objects as far as it can determine the names and
BasicTypes of the instance variables.
%p
Then it creates Method objects for every Method parsed. Finally it must create all functions that
are needed. In a first brute-force approach this may mean creating functions for all possible
type combinations.
%p
Call-sites must then be “linked”. Linking here refers to the fact that the compiler can not
determine how to call a function before it is created. So the functions get created in a first pass
and calls and returns “linked” in a second. The return addresses used at the “soml” level are
dependent on the BasicType that is being returned. This involves associating the code labels (at
the function level) with the ast nodes they come from (at the method level). With this, the compiler
ensures that the type of the variable receiving the return value is correct.
%h3#parfait-in-ruby 5. Parfait in ruby
%p
After SOML was originally written, parts of the run-time (parfait) were ported to soml. This was done with the
idea that the run-time is low level and thus needs to be fully typed. As it turns out this is only
partly correct, in the sense that there needs to exist Function definitions (in the sense above)
that implement basic functionality. But as the sub-chapter on the ruby compiler should explain,
this does not mean the code has to be written in a typed language.
%p
After the ruby-compiler is implemented, the run-time can be implemented in ruby. While this may seem
strange at first, one must remember that the ruby-compiler creates N Functions of each method for
all possible type combinations. This means if the ruby method is correctly implemented, error
handling, for type errors, will be correctly generated by the compiler.
%h3#soml-goodbye 6. SOML goodbye
%p
By this time the soml language can be removed. Meaning the parser for the language and all
documentation is not needed. The ruby-compiler compiles straight into the soml internal
representation (as the soml parser did) and because parfait is back in ruby land, soml should be
removed. Which is a relief, because there are definitely enough languages in the world.
%h3#memory-model-rework 7. Memory model rework
%p
Slightly unrelated to the above (read: can be done at the same time), the memory model needs to be
expanded. The current per object
%em fake
memory works fine, but leaves memory management in
the compiler.
%p
Since ultimately memory management should be part of the run-time, the model needs to be changed
to a global one. This means class Page and Space should be implemented, and the
%em fake
memory
mapped to a global array.

View File

@ -1,94 +0,0 @@
So, the plan, in short:
1. I need to work a little more on docs. Reading them i notice they are still not up to date
2. The Type system needs work
3. The Method-Function relationship needs to be created
4. Ruby compiler needs to be written
5. Parfait moves back completely into ruby land
6. Soml parser should be scrapped (or will become redundant by 2-4)
7. The memory model needs reworking (global not object based memory)
### 2. Type system
A Type is an ordered list of associations from name to BasicType (Object/Integer). The class exists
off course and has been implemented as an array with the names and BasicTypes laid out in sequence.
This is basically fine, but support for navigation is missing.
The whole type system is basically graph. A type *A* is connected to a type *B* if it has exactly
one different BasicType. So *A* needs to have **exactly** the same names, and **exactly** one
different BasicType. Another way of saying this is that the two types are related if in the class
that Type represents, exactly one variable changes type. This is off course exactly what happens
when an assignment assigns a different type.
*A* and *B* are also related when *A* has exactly one more name entry than *B* , but os otherwise
identical. This is what happens when a new variable is added too a class, or one is removed.
The implementation needs to establish this graph (possibly lazily), so that the traversal is fast.
The most likely implementation seems a hash, so a hashing function has to be designed and the equals
implemented.
### 3. Method-Function relationship
Just to get the naming clear: A method is at the ruby level, untyped. A Class has references to
Methods.
Whereas a Function is at the level below, fully typed.
Function's arguments and local variables have a BasicType.
Type has references to Functions.
A Function's type is fully described by the combination of the arguments Type and the Frame Type.
The Frame object is basically a wrapper for all local variables.
A (ruby) Method has N Function "implementations". One function for each different combination of
BasicTypes for arguments and local variables. Functions know which Method they belong to, because
their parent Type class holds a reference to the Class that the Type describes.
### 4. Ruby Compiler
Currently there is only the Function level and the soml compiler. The ruby level / compiler is
missing.
The Compiler generates Class objects, and Type objects as far as it can determine name and
BasicTypes of the instance variables.
Then it creates Method objects for every Method parsed. Finally it must create all functions that
needed. In a first brute-force approach this may mean creating functions for all possible
type combinations.
Call-sites must then be "linked". Linking here refers to the fact that the compiler can not
determine how to call a function before it is created. So the functions get created in a first pass
and calls and returns "linked" in a second. The return addresses used at the "soml" level are
dependent on the BasicType that is being returned. This involves associating the code labels (at
the function level) with the ast nodes they come from (at the method level). With this, the compiler
ensures that the type of the variable receiving the return value is correct.
### 5. Parfait in ruby
After SOML was originally written, parts of the run-time (parfait) was ported to soml. This was done with the
idea that the run-time is low level and thus needs to be fully typed. As it turns out this is only
partly correct, in the sense that there needs to exist Function definitions (in the sense above)
that implement basic functionality. But as the sub-chapter on the ruby compiler should explain,
this does not mean the code has to written in a typed language.
After the ruby-compiler is implemented, the run-time can be implemented in ruby. While this may seem
strange at first, one must remember that the ruby-compiler creates N Functions of each method for
all possible type combinations. This means if the ruby method is correctly implemented, error
handling, for type errors, will be correctly generated by the compiler.
### 6. SOML goodbye
By this time the soml language can be removed. Meaning the parser for the language and all
documentation is not needed. The ruby-complier compilers straight into the soml internal
representation (as the soml parser) and because parfait is back in ruby land, soml should be
removed. Which is a relief, because there are definitely enough languages in the world.
### 7. Memory model rework
Slightly unrelated to the above (read: can be done at the same time), the memory model needs to be
expanded. The current per object *fake* memory works fine, but leaves memory management in
the compiler.
Since ultimately memory management should be part of the run-time, the model needs to be changed
to a global one. This means class Page and Space should be implemented, and the *fake* memory
mapped to a global array.

View File

@ -0,0 +1,61 @@
%h2#rubyx-compiles-ruby-to-binary RubyX compiles ruby to binary
%p
The previous name was from a time in ancient history, three years ago, in internet time over
a decade (X years!). From when i thought i was going to build
a virtual machine. It has been clear for a while that what i am really doing is building a
compiler. A new thing needs a new name and finally inspiration struck in the form of RubyX.
%p
It's a bit of a shame that both domain and github were taken, but the - versions work well too.
Renaming of the organization, repositories and changing of domain is now complete. I did not
rewrite history, so all old posts still refer to salama.
%p
What i like about the new name most, is the closeness to ruby, this is after all an implementation
of ruby. Also the unclarity of what the X is is nice: is it as in X-files, the unknown of the
maths variable, or a la mac, the 10 of a version number? Or the hope of achieving 10 times
performance, as a play on the 3 times performance of ruby 3. It's a mystery, but it is a ruby
mystery and that is the main thing.
%h3#type-system 2. Type system
%p About the work that has been done, the type system rewrite is probably the biggest.
%p
Types are now immutable throughout the system, and the space keeps a list of all unique types.
Adding, removing, changing type all goes through a hashing process and leads to a unique
instance, that may have to be created.
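%p
  In other words the space interns types, much like ruby interns symbols. A sketch of the idea
  (hypothetical, not the actual Space code):
%pre
  :preserve
    class Space
      def initialize
        @types = {}    # Type => Type, relies on Type#hash and Type#eql?
      end
      # adding/removing/changing produces a new description, which is swapped for the unique instance
      def unique_type(names, basic_types)
        candidate = Type.new(names, basic_types)
        @types[candidate] ||= candidate
      end
    end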
%h3#typedmethod-arguments-and-locals 3. TypedMethod arguments and locals
%p
Close on the heel of the type immutability was the change to types as argument and local variable
descriptors. A type instance is now used to describe the arguments (names and types) uniquely,
clearing up previous imprecision.
%p
Argument and locals type, along with the name of the method describe a method uniquely. Obviously
the types may not be changed. Methods with different argument types are thus different methods, a
fact that still has to be coded into the ruby compiler.
%h3#arguments-and-calling-convention 4. Arguments and calling convention
%p
The Message used to carry the arguments, while locals were a separate frame object. An imbalance
if one thinks about closures, as both have to be decoupled from their activation.
%p
Now both arguments and locals are represented as NamedLists, which are basically just objects.
The type is transferred from the method to the NamedList instance at call time, so it is available
at run-time. This makes the whole calling convention easier to understand.
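%p
  In pseudo-ruby (invented names, just to illustrate the convention), a call site would do
  something like:
%pre
  :preserve
    # compile time: the method carries the Type that describes its arguments
    arguments = NamedList.new(typed_method.arguments_type)  # the type travels with the instance
    arguments.set_named(:index, 1)                          # slot resolved through that type
    message.arguments = arguments                           # and is available at run-time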
%h3#parfait-in-ruby 5. Parfait in ruby
%p
Parfait is more normal ruby now, specifically we are using instance variables in Parfait again,
just like in any ruby. When compiling we have to deal with the mapping to indexes, but that's what
we have types for, so no problem. The new version simplifies the boot process a little too.
%p Positioning has been removed from Parfait completely and pushed into the Assembler where it belongs.
%h3#soml-goodbye 6. SOML goodbye
%p
All traces of the soml language have been eradicated. All that is left is an intermediate typed
tree representation. But the MethodCompiler still generates binary, so that's good.
Class and method generation capabilities have been removed from that compiler and now live
one floor up, at the ruby level.
%h3#ruby-compiler 7. Ruby Compiler
%p
Finally work on the ruby compiler has started and after all that ground work is actually quite easy.
Class statements create classes already. Method definitions extract their argument and local
variable names, and create their representation as RubyMethod. More to come.
%p
All in all almost all of the previous post's todos are done. Next up is the fanning of RubyMethods
into TypedMethods by instantiating type variations. When compilation of those works, i just need
to implement the cross function jumps and voila.
%p Certainly an interesting year ahead.

View File

@ -1,71 +0,0 @@
## RubyX compiles ruby to binary
The previous name was from a time in ancient history, three years ago, in internet time over
a decade (X years!). From when i thought i was going to build
a virtual machine. It has been clear for a while that what i am really doing is building a
compiler. A new thing needs a new name and finally inspiration struck in the form of RubyX.
It's a bit of a shame that both domain and github were taken, but the - versions work well too.
Renaming of the organization, repositories and changing of domain is now complete. I did not
rewrite history, so all old posts still refer to salama.
What i like about the new name most, is the closeness to ruby, this is after all an implementation
of ruby. Also the unclarity of what the X is is nice, is it as in X-files, the unknown of the
maths variable or ala mac, the 10 for a version number? Or the hope of achieving 10 times
performance as a play on the 3 times performance of ruby 3. It's a mystery, but it is a ruby
mystery and that is the main thing.
### 2. Type system
About the work that has been done, the type system rewrite is probably the biggest.
Types are now immutable throughout the system, and the space keeps a list of all unique types.
Adding, removing, changing type all goes through a hashing process and leads to a unique
instance, that may have to be created.
### 3. TypedMethod arguments and locals
Close on the heal of the type immutability was the change to types as argument and local variable
descriptors. A type instance is now used to describe the arguments (names and types) uniquely,
clearing up previous imprecision.
Argument and locals type, along with the name of the method describe a method uniquely. Obviously
the types may not be changed. Methods with different argument types are thus different methods, a
fact that still has to be coded into the ruby compiler.
### 4. Arguments and calling convention
The Message used to carry the arguments, while locals were a separate frame object. An imbalance
if one thinks about closures, as both have to be decoupled from their activation.
Now both arguments and locals are represented as NamedList's, which are basically just objects.
The type is transferred from the method to the NamedList instance at call time, so it is available
at run-time. This makes the whole calling convention easier to understand.
### 5. Parfait in ruby
Parfait is more normal ruby now, specifically we are using instance variables in Parfait again,
just like in any ruby. When compiling we have to deal with the mapping to indexes, but that's what
we have types for, so no problem. The new version simplifies the boot process a little too.
Positioning has been removed from Parfait completely and pushed into the Assembler where it belongs.
### 6. SOML goodbye
All trances of the soml language have been eradicated. All that is left is an intermediate typed
tree representation. But the MethodCompiler still generates binary so that's good.
Class and method generation capabilities have been removed from that compiler and now live
one floor up, at the ruby level.
### 7. Ruby Compiler
Finally work on the ruby compiler has started and after all that ground work is actually quite easy.
Class statements create classes already. Method definitions extract their argument and local
variable names, and create their representation as RubyMethod. More to come.
All in all almost all of the previous posts todos are done. Next up is the fanning of RubyMethods
into TypedMethods by instantiating type variations. When compilation of those works, i just need
to implement the cross function jumps and voila.
Certainly an interesting year ahead.

View File

@ -0,0 +1,122 @@
%p
I just read mri 2.4 “unifies” Fixnum and Integer. This, it turns out, is something quite
different from what i thought, mostly about which class names are returned.
And that it is ok to have two implementations for the same class, Integer.
%p
But even if it wasn't what i thought, it did spark an idea, and i hope a solution to a problem
that i have seen lurking ahead. Strangely the solution may be even more radical than the
cross function jumps it replaces.
%h2#a-problem-lurking-ahead A problem lurking ahead
%p As i have been thinking more about what happens when a type changes, i noticed something:
%p
An object may change its type in one method (A), but may be used in a method (B), far up the call
stack. How does B know to treat the object differently? Specifically, the calls B makes
on the object are determined by the type before the change. So they will be wrong after the change,
and so B needs to know about the type change.
%p
Such a type change was supposed to be handled by a cross method jump, thus fixing the problem
in A. But the propagation to B is cumbersome, there can be just so many of them.
Anything that i thought of is quite a bit too involved. And this is before even thinking about closures.
%h2#a-step-back A step back
%p
Looking at this from a little higher vantage there are maybe one too many things i have been trying
to avoid.
%p
The first one was the bit-tagging. The ruby (and smalltalk) way of tagging an integer
with a marker bit. Thus losing a bit and gaining a gazillion type checks. In mri c land
an object is a VALUE, and a VALUE is either a tagged integer or a pointer to an object struct.
So on
%strong every
operation the bit has to be checked. Both of these i've been trying to avoid.
%p
So that led to a system with no explicit information in the lowest level representation and
thus a large dance to have that information in an external type system and keeping that type
information up to date.
%p
Of course the elephant in the room here is that i have also been trying to avoid making integers and
floats objects. Ie keeping their c, or machine, representation, just like anyone else before me.
Too wasteful to even think otherwise.
%h2#and-a-step-forward And a step forward
%p
The inspiration that came by reading about the unification of integers was exactly that:
%strong to unify integers
\. Unifying with objects, ie
%strong making integers objects
%p
I have been struggling with the dichotomy between integer and objects for a long time. There always
seemed something so fundamentally wrong there. Ok, maybe if the actual hardware would do the tagging
and that continuous checking, then maybe. But otherwise: one is a direct, the other an indirect
value. It just seemed wrong.
%p
Making Integers (and floats etc) first class citizens, objects with a type, resolves the chasm
very nicely. Of course it does so at a price, but i think it will be worth it.
%h2#the-price-of-unification The price of Unification
%p
Initially i wanted to make all objects the size of a cache line or multiples thereof. This is
something i'll have to let go of: Integer objects should naturally be 2 words, namely the type
and the actual value.
%p
So this is doubling the amount of ram used to represent integers. But maybe worse, it makes them
subject to garbage collection. Both can probably be alleviated by having the first 256 pinned, ie
a fixed array, but still.
%p
Also using a dedicated memory manager for them and keeping a pool of unused as a linked list
should make it quick. And of course the main hope lies in the fact that your average program
nowadays (especially oo) does not really use integers all that much.
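%p
  Roughly (a sketch with invented names and numbers, not working code), the integer memory
  could be managed like this:
%pre
  :preserve
    class IntegerPool
      PINNED = 256
      def initialize
        @pinned = Array.new(PINNED) { |value| IntegerObject.new(value) }  # fixed, never collected
        @free   = []                                                      # recycled integer objects
      end
      def integer_for(value)
        return @pinned[value] if value.between?(0, PINNED - 1)
        object = @free.pop || IntegerObject.new(0)
        object.set_value(value)   # only builtin/compiler code may touch the hidden value word
        object
      end
      def release(object)
        @free.push(object)
      end
    end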
%h2#oo-to-the-rescue OO to the rescue
%p
Of course this is not the first time my thoughts have strayed that way. There are two reasons why
they quickly scuttled back home to known territory before. The first was the automatic optimization
reflex: why use 2 words for something that can be done in one, and all that gc on top.
%p
But the second was probably even more important: If we then have the value inside the object
(as a sort of instance variable or array element), then when we return it we have the “naked”
integer wreaking havoc in our system, as the code expects objects everywhere.
And if we don't return it, then how do operations happen, since machines only operate on values.
%p
The thing that i had not considered is that that line of thinking is mixing up the levels
of abstraction. It assumes a lower level than one needs: What is needed is that the system
knows about integer objects (in a similar way that the other way assumes knowledge of integer
values).
%p
Concretely the “machine”, or compiler, needs to be able to perform the basic Integer operations,
on the Integer objects. This is really not so different from it knowing how to perform the
operations on two values. It just involves getting the actual values from the object and
putting them back.
%p
OO helps in another way that never occurred to me.
%strong Data hiding:
we never actually pass out
the value. The value is private to the object and not accessible from the outside. In fact it is not
even accessible from the inside, to the object itself. Admittedly this means more functionality in
the compiler, but since that is a solved problem (see builtin), it's ok.
%h2#unified-method-caching Unified method caching
%p
So having gained this unification, we can now determine the type of an object very very easily.
The type will
%em always
be the first word of the memory that the object occupies. We don't have
immediate values anymore, so always is always.
%p
This is
%em very
handy, since we have given up being god and thus knowing everything at any time.
In concrete terms this means that in a method, we can
%em not
know what type an object is.
In fact it's worse, we can't even say what type it is, even if we have checked it, once we
have passed it as an argument to another method.
%p
Luckily programs are not random, and it is quite rare for an object to change type, and so a given
object will usually have one of a very small set of types. This can be used to do method caching.
Instead of looking up the method statically and calling it unconditionally at run-time, we will
need some kind of lookup at run-time.
%p
The lookup tables can be objects that the method carries. A small table (3 entries) with pairs of
type vs jump address. A little assembler to go through the list and jump, or in case of a miss
jump to some handler that does a real lookup in the type.
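%p
  In (pseudo) ruby, what that little bit of assembler does amounts to something like this
  (all helper names invented):
%pre
  :preserve
    # the cache the method carries: [type_1, address_1, type_2, address_2, type_3, address_3]
    def cached_call(receiver, cache)
      type = receiver.get_type                      # always the first word of the object now
      cache.each_slice(2) do |cached_type, address|
        return jump_to(address) if cached_type == type
      end
      jump_to(lookup_and_update(type, cache))       # miss: do the real lookup in the type
    end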
%p
In a distant future a smaller version may be created. For the case where the type has been
checked already during the method, a further check may be inlined completely into the code and
only revert to the table in case of a miss. But that's down the road a bit.
%p Next question: How does this work with Parfait. Or the interpreter??

View File

@ -1,117 +0,0 @@
I just read mri 2.4 "unifies" Fixnum and Integer. This, it turns out, is something quite
different from what i though, mostly about which class names are returned.
And that it is ok to have two implementations for the same class, Integer.
But even it wasn't what i thought, it did spark an idea, and i hope a solution to a problem
that i have seen lurking ahead. Strangely the solution maybe even more radical than the
cross function jumps it replaces.
## A problem lurking ahead
As i have been thinking more about what happens when a type changes, i noticed something:
An object may change it's type in one method (A), but may be used in a method (B), far up the call
stack. How does B know to treat the object different. Specifically, the calls B makes
on the object are determined by the type before the change. So they will be wrong after the change,
and so B needs to know about the type change.
Such a type change was supposed to be handled by a cross method jump, thus fixing the problem
in A. But the propagation to B is cumbersome, there can be just so many of them.
Anything that i though of is quite a bit too involved. And this is before even thinking about closures.
## A step back
Looking at this from a little higher vantage there are maybe one too many things i have been trying
to avoid.
The first one was the bit-tagging. The ruby (and smalltalk) way of tagging an integer
with a marker bit. Thus loosing a bit and gaining a gazillion type checks. In mri c land
an object is a VALUE, and a VALUE is either a tagged integer or a pointer to an object struct.
So on **every** operation the bit has to be checked. Both of these i've been trying to avoid.
So that lead to a system with no explicit information in the lowest level representation and
thus a large dance to have that information in an external type system and keeping that type
information up to date.
Off course the elephant in the room here is that i have also be trying to avoid making integers and
floats objects. Ie keeping their c, or machine representation, just like anyone else before me.
Too wasteful to even think otherwise.
## And a step forward
The inspiration that came by reading about the unification of integers was exactly that:
**to unify integers** . Unifying with objects, ie **making integers objects**
I have been struggling with the dichotomy between integer and objects for a long time. There always
seemed something so fundamentally wrong there. Ok, maybe if the actual hardware would do the tagging
and that continuous checking, then maybe. But otherwise: one is a direct, the other an indirect
value. It just seemed wrong.
Making Integers (and floats etc) first class citizens, objects with a type, resolves the chasm
very nicely. Off course it does so at a price, but i think it will be worth it.
## The price of Unification
Initially i wanted to make all objects the size of a cache line or multiples thereof. This is
something i'll have to let go of: Integer objects should naturally be 2 words, namely the type
and the actual value.
So this is doubling the amount of ram used to represent integers. But maybe worse, it makes them
subject to garbage collection. Both can probably be alleviated by having the first 256 pinned, ie
a fixed array, but still.
Also using a dedicated memory manager for them and keeping a pool of unused as a linked list
should make it quick. And off course the main hope lies in the fact that your average program
nowadays (especially oo) does not really use integers all that much.
## OO to the rescue
Off course this is not the first time my thought have strayed that way. There are two reasons why
they quickly scuttled back home to known territory before. The first was the automatic optimization
reflex: why use 2 words for something that can be done in one, and all that gc on top.
But the second was probably even more important: If we then have the value inside the object
(as a sort of instance variable or array element), then when return it then we have the "naked"
integer wreaking havoc in our system, as the code expects objects everywhere.
And if we don't return it, then how do operations happen, since machines only operate on values.
The thing that i had not considered is that that line of thinking is mixing up the levels
of abstraction. It assumes a lower level than one needs: What is needed is that the system
knows about integer objects (in a similar way that the other ways assumes knowledge of integer
values.)
Concretely the "machine", or compiler, needs to be able to perform the basic Integer operations,
on the Integer objects. This is really not so different from it knowing how to perform the
operations on two values. It just involves getting the actual values from the object and
putting them back.
OO helps in another way that never occurred to me. **Data hiding:** we never actually pass out
the value. The value is private to the object and not accessible from the outside. In fact it not
even accessible from the inside to the object itself. Admittedly this means more functionality in
the compiler, but since that is a solved problem (see builtin), it's ok.
## Unified method caching
So having gained this unification, we can now determine the type of an object very very easily.
The type will *always* be the first word of the memory that the object occupies. We don't have
immediate values anymore, so always is always.
This is *very* handy, since we have given up being god and thus knowing everything at any time.
In concrete terms this means that in a method, we can *not* know what type an object is.
In fact it's worse, we can't even say what type it is, even if we have checked it, but after we
have passed it as an argument to another method.
Luckily programs are not random, and it quite rare for an object to change type, and so a given
object will usually have one of a very small set of types. This can be used to do method caching.
Instead of looking up the method statically and calling it unconditionally at run-time, we will
need some kind of lookup at run-time.
The lookup tables can be objects that the method carries. A small table (3 entries) with pairs of
type vs jump address. A little assembler to go through the list and jump, or in case of a miss
jump to some handler that does a real lookup in the type.
In a distant future a smaller version may be created. For the case where the type has been
checked already during the method, a further check may be inlined completely into the code and
only revert to the table in case of a miss. But that's down the road a bit.
Next question: How does this work with Parfait. Or the interpreter??

View File

@ -0,0 +1,88 @@
%p
As i said in the last post, a step back and forward, possibly two, was taken and understanding
grows again. Especially when i think that some way is the way, it always changes and i turn out
to be at least partially wrong. The way of life, of imperfect intelligence, to strive for that
perfection that is forever out of reach. Here's the next installment.
%h2#slopes-and-ramps Slopes and Ramps
%p
When thinking about method caching and how to implement it i came across this thing that i will
call a Slope for now. The Slope of a function that is. At least that's where the thought started.
%p The Slope of a function is a piece of code that has two main properties:
%ul
%li
it is straight, up to the end. i mean it has no branches from the outside.
It may have branches internally but that does not affect anything.
%li it ends in a branch that returns (a call), but this is not part of the Slope
%p
Those
%em two
properties would better be called a Ramp. The Ramp the function goes along before it
jumps off to the next function.
%p
The
%strong Slope
is the part before the jump. So a Ramp is a Slope and a Jump.
%p
Code in the Slope, it struck me, has the unique possibility of doing a jump, without worrying about
returning. After all, it knows there is a call coming. After contemplating this a little i
found the flaw, which one understands when thinking about where the function returns to. So Slope
can jump away without caring if (and only if) the return address is set to after that jump (and the
address is actually set by the code before the jump).
%p
Remembering that we set the return address in the caller (not as in c the callee) we can arrange
for that. And so we can write Slope code that just keeps going. Because once the return address
is set up, the code can just keep jumping forward. The only thing is that the call must come.
%p
In more concrete terms: Method caching can be a series of checks and jumps. If the check is ok
we call, otherwise jump on. And even the last fail (the switch's default case) can be a jump
to what we would otherwise call a method. A method that determines the real jump target from
the type (of self, in the message) and calls it. Except it's not a method because it never
returns, which is symmetric to us not calling it.
%p
So this kind of “method” which is not really a method, but still a fair bit of logic, i'll call
a Slope.
%h2#links-and-chains Links and Chains
%p
A Slope, the story continues, is really just a specific case of something else. If we take away
the expectation that a call is coming, we are left with a sequence of code with jumps to more
code. This could be called a Chain, and each part of the Chain would be a Link.
%p
To define that: a
%strong Link
is sequence of code that ends in a jump. It has no other jumps, just
the one at the end. And the jump at the end jumps to another Link.
%p The Code i am talking about here is risc level code, one could say assembler instructions.
%p
The concept though is very familiar: at a higher level the Link would be a Statement and a
Chain a sequence of Statements. We're still missing the branch abstraction, but otherwise this is
a lower level description of code in a similar way as the typed level Code and Statements are
a description of higher level code.
%h2#typed-level-is-wrong Typed level is wrong
%p
The level that is nowadays called Typed, and used to be soml, is basically made up of language
constructs. It does not allow for manipulation of the risc level. As the ruby level is translated
to the typed level, which in turn is translated to the risc level, the ruby compiler has no
way of manipulating the risc level. This is as it should be.
%p
The problem is just, that the constructs that are currently at the typed level, do not allow
to express the results needed at the risc level.
%p
Through the history of the development the levels have become mixed up. It is relatively clear at
the ruby level what kind of construct is needed at the risc level. This is what has to drive the
constructs at the typed level. We need access to these kinds of Slope or Link ideas at the ruby
level.
%p
Another way of looking at the typed level inadequacies is the size of the code generated. Some of
the expressions (or statements) resolve to 2 or 3 risc instructions. Others, like the call, are
15. This is an indication that part of the level is wrong. A good way to architect the layers
would result in an
%em even
expansion of the amount of code at every level.
%h2#too-little-testing Too little testing
%p
The ruby compiler should really drive the development more. The syntax and behavior of ruby are
quite clear, and i feel the risc layer is quite a solid target. So before removing too much or
rewriting too much i shall just add more (and more) functionality to the typed layer.
%p
At the same time some of the concepts (like a method call) will probably not find any use, but
as long as they don't harm, i shall leave them lying around.

View File

@ -1,84 +0,0 @@
As i said in the last post, a step back and forward, possibly two, was taken and understanding
grows again. Especially when i think that some way is the way, it always changes and i turn out
to be at least partially wrong. The way of life, of imperfect intelligence, to strive for that
perfection that is forever out of reach. Here's the next installment.
## Slopes and Ramps
When thinking about method caching and how to implement it i came across this thing that i will
call a Slope for now. The Slope of a function that is. At least that's where the thought started.
The Slope of a function is a piece of code that has two main properties:
- it is straight, up to the end. i mean it has no branches from the outside.
It may have internally but that does not affect anything.
- it ends in a branch that returns (a call), but this is not part of the Slope
Those *two* properties would better be called a Ramp. The Ramp the function goes along before it
jumps of to the next function.
The **Slope** is the part before the jump. So a Ramp is a Slope and a Jump.
Code in the Slope, it struck me, has the unique possibility of doing a jump, with out worrying about
returning. After all, it knows there is a call coming. After contemplating this a little i
found the flaw, which one understands when thinking about where the function returns to. So Slope
can jump away without caring if (and only if) the return address is set to after that jump (and the
address is actually set by the code before the jump).
Remembering that we set the return address in the caller (not as in c the callee) we can arrange
for that. And so we can write Slope code that just keeps going. Because once the return address
is set up, the code can just keep jumping forward. The only thing is that the call must come.
In more concrete terms: Method caching can be a series of checks and jumps. If the check is ok
we call, otherwise jump on. And even the last fail (the switches default case) can be a jump
to what we would otherwise call a method. A method that determines the real jump target from
the type (of self, in the message) and calls it. Except it's not a method because it never
returns, which is symmetrically to us not calling it.
So this kind of "method" which is not really a method, but still a fair bit of logic, i'll call
a Slope.
## Links and Chains
A Slope, the story continues, is really just a specific case of something else. If we take away
the expectation that a call is coming, we are left with a sequence of code with jumps to more
code. This could be called a Chain, and each part of the Chain would be a Link.
To define that: a **Link** is sequence of code that ends in a jump. It has no other jumps, just
the one at the end. And the jump at the end jumps to another Link.
The Code i am talking about here is risc level code, one could say assembler instructions.
The concept though is very familiar: at a higher level the Link would be a Statement and a
Chain a sequence of Statements. We're missing the branch abstraction yet, but otherwise this is
a lower level description of code in a similar way as the typed level Code and Statements are
a description of higher level code.
## Typed level is wrong
The level that is nowadays called Typed, and used to be soml, is basically made up of language
constructs. It does not allow for manipulation of the risc level. As the ruby level is translated
to the typed level, which in turn is translated to the risc level, the ruby compiler has no
way of manipulating the risc level. This is as it should be.
The problem is just, that the constructs that are currently at the typed level, do not allow
to express the results needed at the risc level.
Through the history of the development the levels have become mixed up. It is relatively clear at
the ruby level what kind of construct is needed at the risc level. This is what has to drive the
constructs at the typed level. We need access to these kinds of Slope or Link ideas at the ruby
level.
Another way of looking at the typed level inadequacies is the size of the codes generated. Some of
the expressions (or statements) resolve to 2 or 3 risc instructions. Others, like the call, are
15. This is an indication that part of the level is wrong. A good way to architect the layers
would result in an *even* expansion of the amount of code at every level.
## Too little testing
The ruby compiler should really drive the development more. The syntax and behavior of ruby are
quite clear, and i feel the risc layer is quite a solid target. So before removing too much or
rewriting too much i shall just add more (and more) functionality to the typed layer.
At the same time some of the concepts (like a method call) will probably not find any use, but
as long as they don't harm, i shall leave them lying around.

View File

@ -0,0 +1,90 @@
%p Going on holiday without a computer was great. Forcing me to recap and write things down on paper.
%h2#layers Layers
%p
One of the main results was that the current layers are a bit mixed up and that will have to be
fixed. But first, some of the properties in which i think of the different layers.
%h3#layer-properties Layer properties
%p
%strong Structure of the representation
is one of the main distinctions of the layers. We know the parser gives us a
%strong tree
and that the produced binary is a
= succeed "," do
%strong blob
but what is in between? As options we would still have graphs and lists.
%p
A closely related property of the representation is whether it is
= succeed "." do
%strong abstract or concrete
An abstract representation is represented as a single class in ruby and its properties are
accessible through an abstract interface, like a hash. A concrete representation would use
a class per type, have properties available as ruby attributes and thus allow functions on the
class.
%p
If we think of the layer as a language, what
%strong Language level
would it be, assembler, c, oo.
Does it have
= succeed "," do
%strong control structures
= succeed "." do
%strong jumps
%h3#ruby-layer Ruby Layer
%p
The top ruby layer is a given, since it is provided by the external gem
= succeed "." do
%em parser
Parser outputs an abstract syntax tree (AST), so it is a
= succeed "." do
%em tree
Also it is abstract, thus represented by a single ruby class, which carries a type as an attribute.
%p
What might sound self-evident, that this layer is very close to ruby, means that it inherits
all of ruby's quirks, and all the redundancy that makes ruby a nice language. By quirks i mean
things like the integer 0 being true in an if statement. A good example of redundancy is the
existence of if and until, or the ability to add if after the statement.
%h3#virtual-language Virtual Language
%p
The next layer down, and the first to be defined in ruby-x, is the virtual language layer.
By language i mean object oriented language, and by virtual a non-existent minimal version of an
object oriented language. This is like ruby, but without the quirks or redundancy. This is
meant to be compatible with other oo languages, meaning that it should be possible to transform
a python or smalltalk program into this layer.
%p
The layer is represented as a concrete tree and derived from the ast by removing:
\- unless, the ternary operator and post conditionals
\- splats and multi-assignment
\- implicit block passing
\- case statement
\- global variables
%p It should be relatively obvious how these can be replaced by existing constructs (details in code)
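%p
  For example (only to illustrate the kind of rewrite, not actual compiler output), an unless
  simply becomes an if with swapped branches:
%pre
  :preserve
    # ruby / ast level
    unless list.empty?
      process(list)
    end
    # vool level, after normalisation: only if remains
    if list.empty?
      # nothing in the true branch
    else
      process(list)
    end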
%h3#virtual-object-machine Virtual object Machine
%p
The next layer down represents what we think of as a machine, more than a language, and an object
oriented at that.
%p
A differentiating factor is that a machine has no control structures like a language. Only jumps.
The logical structure is more a stream or array. Something closer to the memory that
i will map to in lower layers. We still use a tree representation for this level, but with the
interpretation that neighboring children get implicitly jumped to.
%p
The machine deals in objects, not in memory as a von Neumann machine would. The machine has
instructions to move data from one object to another. There are no registers, just objects.
Also basic arithmetic and testing is covered by the instruction set.
%h3#risc-layer Risc layer
%p
This layer is a minimal abstraction of an arm processor. Ie there are eight registers, instructions
to and from memory and between registers. Basic integer operations work on registers. So does
testing, and of course there are jumps. While the layer deals in random access memory, it is
aware of and uses the object machine's objects.
%p
The layer is minimal in the sense that it defines only instructions needed to implement ruby.
Instructions are defined in a concrete manner, ie one class per Instruction, which makes the
set of Instructions extensible by other gems.
%p
The structure is a linked list which is mainly interested in three types of Instructions. Namely
Jumps, jump targets (Labels), and all others. All the other Instructions are linear in the von Neumann
sense, that the next instruction will be executed implicitly.
%h3#arm-and-elf-layer Arm and elf Layer
%p
The mapping of the risc layer to the arm layer is very straightforward, basically one to one with
the exception of constant loading (which is quirky on the arm 32 bit due to historical reasons).
Arm instructions (being instructions of a real cpu), have the ability to assemble themselves into
binary, which apart from the loading are 4 bytes.
%p The structure of the Arm instruction is the same as the risc layer, a linked list.
%p
There is also code to assemble the objects, and with the instruction stream make a binary elf
executable. While elf support is minimal, the executable does execute on raspberry pi or qemu.

View File

@ -1,88 +0,0 @@
Going on holiday without a computer was great. Forcing me to recap and write things down on paper.
## Layers
One of the main results was that the current layers are a bit mixed up and that will have to be
fixed. But first, some of the properties in which i think of the different layers.
### Layer properties
**Structure of the representation** is one of the main distinction of the layers. We know the parser gives us a **tree** and that the produced binary is a **blob**, but what in between. As options we would still have graphs and lists.
A closely related property of the representation is whether it is **abstract or concrete**.
An abstract representation is represented as a single class in ruby and it's properties are
accessible through an abstract interface, like a hash. A concrete representation would use
a class per type, have properties available as ruby attributes and thus allow functions on the
class.
If we think of the layer as a language, what **Language level** would it be, assembler, c, oo.
Does it have **control structures**, or **jumps**.
### Ruby Layer
The top ruby layer is a given, since it is provided by the external gem *parser*.
Parser outputs an abstract syntax tree (AST), so it is a *tree*. Also it is abstract, thus
represented by a single ruby class, which carries a type as an attribute.
What might sound self-evident that this layer is very close to ruby, this means that inherits
all of ruby's quirks, and all the redundancy that makes ruby a nice language. By quirks i mean
things like the integer 0 being true in an if statement. A good example of redundancy is the
existence of if and until, or the ability to add if after the statement.
### Virtual Language
The next layer down, and the first to be defined in ruby-x, is the virtual language layer.
By language i mean object oriented language, and by virtual an non existent minimal version of an
object oriented language. This is like ruby, but without the quirks or redundancy. This is
meant to be compatible with other oo languages, meaning that it should be possible to transform
a python or smalltalk program into this layer.
The layer is represented as a concrete tree and derived from the ast by removing:
- unless, the ternary operator and post conditionals
- splats and multi-assignment
- implicit block passing
- case statement
- global variables
It should be relatively obvious how these can be replaced by existing constructs (details in code)
### Virtual object Machine
The next down represents what we think of as a machine, more than a language, and an object
oriented at that.
A differentiating factor is that a machine has no control structures like a language. Only jumps.
The logical structure is more a stream or array. Something closer to the memory that
i will map to in lower layers. We still use a tree representation for this level, but with the
interpretation that neighboring children get implicitly jumped to.
The machine deals in objects, not in memory as a von Neumann machine would. The machine has
instructions to move data from one object to another. There are no registers, just objects.
Also basic arithmetic and testing is covered by the instruction set.
### Risc layer
This layer is a minimal abstraction of an arm processor. Ie there are eight registers, instructions
to and from memory and between registers. Basic integer operations work on registers. So does
testing, and off course there are jumps. While the layer deals in random access memory, it is
aware and uses the object machines objects.
The layer is minimal in the sense that it defines only instructions needed to implement ruby.
Instructions are defined in a concrete manner, ie one class per Instruction, which make the
set of Instructions extensible by other gems.
The structure is a linked list which is manly interested in three types of Instructions. Namely
Jumps, jump targets (Labels), and all other. All the other Instructions a linear in the von Neumann
sense, that the next instruction will be executed implicitly.
### Arm and elf Layer
The mapping of the risc layer to the arm layer is very straightforward, basically one to one with
the exception of constant loading (which is quirky on the arm 32 bit due to historical reasons).
Arm instructions (being instructions of a real cpu), have the ability to assemble themselves into
binary, which apart from the loading are 4 bytes.
The structure of the Arm instruction is the same as the risc layer, a linked list.
There is also code to assemble the objects, and with the instruction stream make a binary elf
executable. While elf support is minimal, the executable does execute on rasperry pi or qemu.

View File

@ -0,0 +1,79 @@
%p Method caching can be done at language level. Wow. But first some boring news:
%h2#vool-is-ready-mom-is-coming Vool is ready, Mom is coming
%p
The
= succeed "irtual" do
%strong V
= succeed "bject" do
%strong O
= succeed "riented" do
%strong O
= succeed "anguage" do
%strong L
level, as envisioned in the previous post, is done. Vool is meant to be a language agnostic
layer, and is typed, unlike the ast that the ruby parser outputs. This will allow to write more
oo code, by putting code into the statement classes, rather than using the visitor pattern.
I tend to agree with CodeClimate on the fact that the visitor pattern produces bad code.
%p
Vool will not reflect some of ruby's more advanced features, like splats or implicit blocks,
and hopes to make the conditional logic more consistent.
%p
The
= succeed "inimal" do
%strong M
= succeed "bject" do
%strong O
= succeed "achine" do
%strong M
will be the next layer. It will sit between Vool and Risc as an object version of the Risc
machine. This is mainly to make it more understandable, as i noticed that part of the Risc,
especially calling, is getting quite complex. But more on that next..
%h2#inline-method-caching Inline Method caching
%p
In ruby almost all work is actually done by method calling and an interpreter spends much of its
time looking up methods to call. The obvious thing to do is to cache the result, and this has
been the plan for a while.
%p
Of course for caching to work, one needs a cache key and invalidation strategy, both of which
are handled by the static types, which i'll review below.
%h3#small-cache Small cache
%p
Aaron Patterson has done
%a{:href => "https://www.youtube.com/watch?v=b77V0rkr5rk"} research into method caching
in mri and found that most call sites (&gt;99%) only need one cache entry.
%p
This means a single small object can carry the information needed, probably type, function address
and counter, times two.
%p
In rubyx this can literally be an object that we attach to the CallSite, either prefilled if possible
or leave to be used at runtime.
%h3#method-lookup-is-a-static-function Method lookup is a static function
%p
The other important idea here is that the actual lookup of a method is a known function. Known at
compile time that is.
%p
Thus dynamic dispatch can be substituted by a cache lookup, and a static call. The result of the call
can/should update the cache and then we can start with the lookup again.
%p
This makes it possible to remove dynamic dispatch from the code, actually at code level.
I had previously thought of implementing the send at a lower level, but see now that it would
be quite possible to do it at the language level with an if and a call, possibly another call
for the miss. That would drop the language down from dynamic (4th level) to static (3rd level).
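%p
  Spelled out at the language level, the send could be rewritten to something like this
  (a sketch, all helper names invented):
%pre
  :preserve
    # instead of a dynamic  object.send(selector, arg)
    if object.get_type == call_site.cached_type
      call_site.cached_method.call(object, arg)     # static call, target known at compile time
    else
      method = lookup(object.get_type, selector)    # the miss, also a static call
      call_site.cached_type   = object.get_type
      call_site.cached_method = method
      method.call(object, arg)
    end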
%p I am still somewhat at odds whether to actually do this or leave it for the machine level (mom).
%h2#static-type-review Static Type review
%p
To make the caching possible, the cache key - value association has to be constant.
Of course in oo systems the class of an object is constant and so we could just use that.
But in ruby you can change the class, add instance variables or add/remove/change methods,
and so the class as a key and the method as value is not correct over time.
%p
In rubyx, an object has a type, and its type can change. But a type can never change. A type refers
to the class that it represented at the time of creation. Conversely a class carries an instance
type, which is the type of new instances that get created. But when variables or methods are added
or removed from the class, a new type is created. Type instances never change. Method implementations
are attached to types, and once compiled, never changed either.
%p
Thus using the objects type as cache key and the method as its value will stay correct over time.
And the double bonus of this is that it takes care of both objects of different classes (as those will have different types for sure), and also objects of the same class, at different times, when
eg a method with the same name has been added. Those objects will have different type too, and
thus experience a cache miss and have their correct method found.
%h2#up-next Up next
%p
More grunt-work. Now that Vool replaces the ast the code from rubyx/passes has to be “ported” to use it. That means:
\- class extraction and class object creation
\- method extraction and creation
\- type creation by ivar analysis
\- frame creation by local variable analysis

View File

@ -1,77 +0,0 @@
Method caching can be done at language level. Wow. But first some boring news:
## Vool is ready, Mom is coming
The **V**irtual **O**bject **O**riented **L**anguage level, as envisioned in the previous post,
is done. Vool is meant to be a language agnostic layer, and is typed, unlike the ast that
the ruby parser outputs. This will allow to write more oo code, by putting code into the
statement classes, rather than using the visitor pattern. I tend to agree with CodeClimate on
the fact that the visitor pattern produces bad code.
Vool will not reflect some of ruby's more advanced features, like splats or implicit blocks,
and hopes to make the conditional logic more consistent.
The **M**inimal **O**bject **M**achine will be the next layer. It will sit between Vool and Risc
as an object version of the Risc machine. This is mainly to make it more understandable, as i
noticed that part of the Risc, especially calling, is getting quite complex. But more on that next..
## Inline Method caching
In ruby almost all work is actually done by method calling and an interpreter spends much of it's
time looking up methods to call. The obvious thing to do is to cache the result, and this has
been the plan for a while.
Off course for caching to work, one needs a cache key and invalidation strategy, both of which
are handled by the static types, which i'll review below.
### Small cache
Aaron Patterson has done [research into method caching](https://www.youtube.com/watch?v=b77V0rkr5rk)
in mri and found that most call sites (>99%) only need one cache entry.
This means a single small object can carry the information needed, probably type, function address
and counter, times two.
In rubyx this can literally be an object that we attach to the CallSite, either prefill if possible
or leave to be used at runtime.
### Method lookup is a static function
The other important idea here is that the actual lookup of a method is a know function. Known at
compile time that is.
Thus dynamic dispatch can be substituted by a cache lookup, and a static call. The result of the call
can/should update the cache and then we can start with the lookup again.
This makes it possible to remove dynamic dispatch from the code, actually at code level.
I had previously though of implementing the send at a lower level, but see now that it would
be quite possible to do it at the language level with an if and a call, possible another call
for the miss. That would drop the language down from dynamic (4th level) to static (3rd level).
I am still somewhat at odds whether to actually do this or leave it for the machine level (mom).
## Static Type review
To make the caching possible, the cache key - value association has to be constant.
Off course in oo systems the class of an object is constant and so we could just use that.
But in ruby you can change the class, add instance variables or add/remove/change methods,
and so the class as a key and the method as value is not correct over time.
In rubyx, an object has a type, and it's type can change. But a type can never change. A type refers
to the class that it represented at the time of creation. Conversely a class carries an instance
type, which is the type of new instances that get created. But when variables or methods are added
or removed from the class, a new type is created. Type instances never change. Method implementations
are attached to types, and once compiled, never changed either.
Thus using the object's type as cache key and the method as it's value will stay correct over time.
And the double bonus of this is that it takes care of both objects of different classes (as those will have different type for sure), but also objects of the same class, at different times, when
eg a method with the same name has been added. Those objects will have different type too, and
thus experience a cache miss and have their correct method found.
## Up next
More grunt-work. Now that Vool replaces the ast the code from rubyx/passes has to be "ported" to use it. That means:
- class extraction and class object creation
- method extraction and creation
- type creation by ivar analysis
- frame creation by local variable analysis

View File

@ -0,0 +1,91 @@
%p
While work on Mom (Minimal object machine) continues, i can see the futures a little clearer.
Alas, for now the shortest route is best, so the future will have to wait. But here is what i'm
thinking.
%h2#types-today Types today
%p
The
%a{:href => "/rubyx/layers.html"} architecture
document outlines this in more detail, but in short:
\- types are immutable
\- every object has a type (which may change)
\- a type implements the interface of a class at a given time
\- a type is defined by a list of attribute names
%p
%img{:alt => "Types diagram", :src => "/assets/types.jpg"}/
%h3#how-classes-work How classes work
%p
So the interesting thing here is how the classes work. Seeing as they are open, attributes can
be added and removed, but the types are immutable.
%p The solution is easy: when a new attribute is added to a class, a new type is created.
%p
The
%em instance type
is then updated to point to the current type. This means that new objects will
be created with the new type, and old ones will keep their old type. Until the attribute is
added to them too, in which case their
%em type
is updated too.
%p
%strong Methods
btw are stored at the Type, as they encode the knowledge of the memory layout
that comes with the type, into the code of the method. Remember: full data hiding, only objects
methods can access the variables, hence the type needs to be known only for
= succeed "." do
%em self
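%p
Roughly, and only as an illustration (not the actual parfait classes), the mechanics can be
sketched like this: a type is a frozen list of attribute names, and changing the class only swaps
the instance type, it never touches an existing type.
%pre
%code
:preserve
  class Type
    attr_reader :names
    def initialize(names)
      @names = names.freeze        # a type never changes
    end
    def add_attribute(name)
      Type.new(@names + [name])    # changing the class creates a new type
    end
  end

  old_type = Type.new([:type])             # the instance type so far
  new_type = old_type.add_attribute(:name)
  old_type.names   # => [:type]            existing objects keep this type
  new_type.names   # => [:type, :name]     new objects get created with this one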
%h2#the-future-of-types The future of types
%p
But what i wanted to talk about is how this picture is going to change in the future.
To understand why we might want to, let's look at method dispatch on an instance variable.
%p
When you write something like @me.length , the compiler can check that @me is indeed an instance variable by checking the type of self. But since no information is stored about the type of
%em me
, a dynamic dispatch is needed to call
= succeed "." do
%em length
%p
The simple idea is to get rid of this dynamic dispatch by storing the type of instance variables
too. This makes a lot of calls faster, but it does come at significant cost:
\- every assignment to the variable has to be checked for type.
\- many more types must be created to differentiate the variables by name
%strong and
type.
%p
Both of those may not sound so bad at first, but it's the cumulative effects that make a
difference. Instance assignment is one of only two ways to move data around in an oo machine.
That's a lot of checking. And Types hold the methods, so for every new type
%em all
methods have
to be
%em a
stored, and
%em b
created/compiled .
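%p
To make the first cost concrete: an assignment to a typed instance variable would have to expand
into something like the following check. This is a rough sketch with made up names (get_type, the
stored @me_type), not code rubyx generates today.
%pre
%code
:preserve
  # what a checked write to @me could amount to
  def me=(value)
    unless value.get_type == @me_type    # the declared type of @me
      raise TypeError , "wrong type assigned to @me"
    end
    @me = value
  end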
%p But of course the biggest thing is all the coding this entails. So that's why it's in the future :-)
%h2#multilayered-mom Multilayered Mom
%p
Just a note on Mom: this was meant to be a bridge between the language layer (vool) and the machine
layer (risc). This step, from tree and statements to list and low level instructions, was deemed
too big, so the abstract Minimal Object Machine is supposed to be a layer in between those.
And it is, of course.
%p
What i didn't fully appreciate before starting was that the two things are related. I mean
statements lend themselves to a tree, while having instructions in a tree is kind of silly.
Similarly, statements in a list don't really make sense either. So it ended up being a two step
process inside Mom.
%p
The
%em first
pass, which transforms vool, keeps the tree structure. But it does introduce Mom's own
instructions. It turns out that this is sensible for exactly the linear parts of code.
%p
The
%em second
pass flattens the remaining control structures into jumps and labels. The result
maps to the risc layer 1 to n, meaning every Mom instruction simply expands into one or usually
more risc instructions.
%p
In the future i envision that this intermediate representation at the Mom level will be a
good place for further optimisations, but we shall see. At least the code is still recognisable,
meaning relatively easy to reason about. This is a property that the risc layer really does
not have anymore.

View File

@ -1,72 +0,0 @@
While work on Mom (Minimal object machine) continues, i can see the futures a little clearer.
Alas, for now the shortest route is best, so the future will have to wait. But here is what i'm
thinking.
## Types today
The [architecture](/rubyx/layers.html) document outlines this in more detail, but in short:
- types are immutable
- every object has a type (which may change)
- a type implements the interface of a class at a given time
- a type is defined by a list of attribute names
![Types diagram](/assets/types.jpg)
### How classes work
So the interesting thing here is how the classes work. Seeing as they are open, attributes can
be added and removed, but the types are immutable.
The solution is easy: when a new attribute is added to a class, a new type is created.
The *instance type* is then updated to point to the current type. This means that new objects will
be created with the new type, and old ones will keep their old type. Until the attribute is
added to them too, in which case their *type* is updated too.
**Methods** btw are stored at the Type, as they encode the knowledge of the memory layout
that comes with the type, into the code of the method. Remember: full data hiding, only objects
methods can access the variables, hence the type needs to be know only for *self*.
## The future of types
But what i wanted to talk about is how this picture is going to change in the future.
To understand why we might want to, let's look at method dispatch on an instance variable.
When you write something like @me.length , the compiler can check that @me is indeed an instance variable by checking the type of self. But since not information is stored about the type of
*me* , a dynamic dispatch is needed to call *length*.
The simple idea is to get rid of this dynamic dispatch by storing the type of instance variables
too. This makes a lot calls faster, but it does come at significant cost:
- every assignment to the variable has to be checked for type.
- many more types must be created to differentiate the variables by name **and** type.
Both of those don't maybe sound soo bad at first, but it's the cumulative effects that make a
difference. Instance assignment is one of the only two ways to move data around in a oo machine.
That's a lot of checking. And Types hold the methods, so for every new type *all* methods have
to be *a* stored, and *b* created/compiled .
But off course the biggest thing is all the coding this entails. So that's why it's in the future :-)
## Multilayered Mom
Just a note on Mom: this was meant to be a bridge between the language layer (vool) and the machine
layer (risc). This step, from tree and statements, to list and low level instructions was deemed
to big, so the abstract Minimal Object Machine is supposed to be a layer in between those.
And it is off course.
What i didn't fully appreciate before starting was that the two things are related. I mean
statements lend themselves to a tree, while having instruction in a tree is kind of silly.
Similarly statements in a list doesn't really make sense either. So it ended up being a two step
process inside Mom.
The *first* pass that transforms vool, keeps the tree structure. But it does introduce Mom's own
instructions. It turns out that this is sensible for exactly the linear parts of code.
The *second* pass flattens the remaining control structures into jumps and labels. The result
maps to the risc layer 1 to n, meaning every Mom instruction simple expands into one or usually
more risc instructions.
In the future i envision that this intermediate representation at the Mom level will be a
good place for further optimisations, but we shall see. At least the code is still recognisable,
meaning relatively easy to reason about. This is a property that the risc layer really does
not have anymore.

View File

@ -0,0 +1,99 @@
%p Since i currently have no time to do actual work, i've been doing some research.
%p
Reading about other implementations, especially transpiling ones. Opal, ruby to
javascript, and jruby, ruby to java, or jvm instructions.
%h2#reconsidering-the-madness Reconsidering the madness
%p
One needs to keep an open mind of course. “Reinventing” the wheel is not good, they
say. Of course we don't invent any wheels in IT, we just like the way that sounds,
but even building a wheel, when you can buy one, is bad enough.
And of course i have looked at using other people's code from the beginning.
%p
A special eye went towards the go language this time. Go has a built in assembler, i
didn't know that. Sure compilers use assembler stages, but the thing about go's
spin on it is that it is quite close to what i call the risc layer. Ie it is machine
independent and abstracts many of
%em real
assemblers' quirks away. Also, go does
not expose the full assembler spectrum, and there are ways to write assembler within
go. All very promising.
%p
Go has closures, also very nice, and what they call escape analysis. Meaning that while
normally go will use the stack for locals, it has checks for closures and moves
variables to the heap if need be.
%p
So many goodies. And then there is the runtime and all that code that exists already,
so the std lib would be a straight pass through, much like mri. On top, one of the best
gc's i've heard about, tooling, lots of code, interoperability and a community.
%p
The price is of course that one (me) would have to become an expert in go. Not too
bad, but still. As a preference i naturally tend towards ruby, but maybe one can devise
a way to automate the bridge somewhat. Already found a gem to make extensions in go.
%p
And, while looking, there seems to be one or two ruby in go projects already out there.
Unfortunately interpreters :-(
%h2#sort-of-dealbreaker Sort of dealbreaker
%p
Looking deeper into transpiling and using the go runtime, i read about the type system.
It's a good type system i think, and go even provides reflection. So it would be
nice to use it. This would provide good interoperability with go and use the existing
facilities.
%p
Just to scrap the alternative: One could use arrays as the basic structure to build
objects. Much in the same way MRI does. This would mean
%em not
using the type system,
but instead building one. Thinking of the wheels … no, no go.
%p
So a go type for each of what we currently have as Type. Since the current system
is built around immutable types, this seems a good match. The only glitch is that,
eg when adding an instance variable or method to an existing object, the type of that object
would have to change. A glitch, nothing more, just breaking the one constant static
languages are built on. But digging deep into the go code, i am relatively
certain one could deal with that.
%p
Digging deeper i read more about the go interfaces. I really can't see a way to have
%em only
specific (typed) methods or instances. I mean the current type model is about
type names and the number of slots, not typing every slot, as go does. Or for methods,
the idea is to have a name and a certain number of arguments, and specific implementations for each type of self. Not a separate implementation for each possible combination of types. This means using go's interfaces for variables and methods.
%p
And here it comes: When using the reflect package to ensure the type safety at runtime,
go is really slow.
10+
%a{:href => "http://blog.burntsushi.net/type-parametric-functions-golang/"} times slower
maybe. I'm guessing it is not really their priority.
%p
Also, from an architecture kind of viewpoint, having all those interfaces doesn't seem
good. Many small objects, basically one interface object for every object
in the system, just adds lots of load. Unnecessary, ugly.
%h2#the-conclusion The conclusion
%p I just read about a go proposal to have int overflow panic. Too good.
%p
But in the end, i've decided to let go go. In some ways it would seem transpiling
to C would be much easier. Use the array, bake our types, bend those pointers.
While go is definitely the much better language for working in, for transpiling into
it seems to put up more hurdles than provide help.
%p
Having considered this, i can understand rubinius's choice of c++ much better.
The object model fits well. Given just a single slot for dynamic expansion one
could make that work. One would just have to use the c++ classes as types, not as ruby
classes. Classes are not types, not when you can modify them!
%p But at the end it is not even about which code you're writing, or how good the fit is.
%p
It is about design, about change. To make this work (this meaning compiling a dynamic language to binary), flexibility is the key. It's not done, much is unclear, and one
must be able to change and change quickly.
%p
Self change, just like in life, is the only real form of control. To maximise that
i didn't use metasm or llvm, and it is also the reason go will not feature in this
project. At the risk of never actually getting there, or having no users. Something
Sinatra sang comes to mind, about doing it a specific way :-)
%p
There is still a lot to be learnt from go though, as much from the language as the
project. I find it inspiring that they moved from a c to a go compiler in a minor
version. And that what must be a major language in google has fewer commits than
rails. It does give hope.
%p
PPS: Also revisited llvm (too complicated) and crystal (too complicated, bad fit in
type system) after this. Could still do rust of course, but the more i write, the
more i hear the call of simplicity (something that a normal person can still understand).

View File

@ -1,98 +0,0 @@
Since i currently have no time to do actual work, i've been doing some research.
Reading about other implementations, especially transpiling ones. Opal, ruby to
javascript, and jruby, ruby to java, or jvm instructions.
## Reconsidering the madness
One needs to keep an open mind off course. "Reinventing" the wheel is not good, they
say. Off course we don't invent any wheels in IT, we just like the way that sounds,
but even building a wheel, when you can buy one, is bad enough.
And off course i have looked at using other peoples code from the beginning.
A special eye went towards the go language this time. Go has a built in assembler, i
didn't know that. Sure compilers use assembler stages, but the thing about go's
spin on it is that it is quite close to what i call the risc layer. Ie it is machine
independent and abstracts many of *real* assemblers quirks away. And also go does
not expose the full assembler spectrum , so there are ways to write assembler within
go. All very promising.
Go has closures, also very nice, and what they call escape analysis. Meaning that while
normally go will use the stack for locals, it has checks for closures and moves
variables to the heap if need be.
So many goodies. And then there is the runtime and all that code that exists already,
so the std lib would be a straight pass through, much like mri. On top one of the best
gc's i've heard about, tooling, lot's of code, interoperability and a community.
The price is off course that one (me) would have to become an expert in go. Not too
bad, but still. As a preference i naturally tend to ruby, but maybe one can devise
a way to automate the bridge somewhat. Already found a gem to make extensions in go.
And, while looking, there seems to be one or two ruby in go projects already out there.
Unfortunately interpreters :-(
## Sort of dealbreaker
Looking deeper into transpiling and using the go runtime i read about the type system.
It's a good type system i think, and go even provides reflection. So it would be
nice to use it. This would provide good interoperability with go and use the existing
facilities.
Just to scrape the alternative: One could use arrays as the basic structure to build
objects. Much in the same way MRI does. This would mean *not* using the type system,
but instead building one. Thinking of the wheels ... no, no go.
So a go type for each of what we currently have as Type. Since the current system
is built around immutable types, this seems a good match. The only glitch is that,
eg when adding an instance or method to an existing object, the type of that object
would have to change. A glitch, nothing more, just breaking the one constant static
languages are built on. But digging deep into the go code, i am relatively
certain one could deal with that.
Digging deeper i read more about the go interfaces. I really can't see a way to have
*only* specific (typed) methods or instances. I mean the current type model is about
types names and the number of slots, not typing every slot, as go. Or for methods,
the idea is to have a name and a certain amount of arguments, and specific implementations for each type of self. Not a separate implementation for each possible combination of types. This means using go's interfaces for variables and methods.
And here it comes: When using the reflect package to ensure the type safety at runtime,
go is really slow.
10+ [times slower](http://blog.burntsushi.net/type-parametric-functions-golang/)
maybe. I'm guessing it is not really their priority.
Also, from an architecture kind of viewpoint, having all those interfaces doesn't seem
good. Many small objects, basically one interface object for every object
in the system, just adds lots of load. Unnecessary, ugly.
## The conclusion
I just read about a go proposal to have int overflow panic. Too good.
But in the end, i've decided to let go go. In some ways it would seem transpiling
to C would be much easier. Use the array, bake our types, bend those pointers.
While go is definitely the much better language for working in, for transpiling into
it seems to put up more hurdles than provide help.
Having considered this, i can understand rubinius's choice of c++ much better.
The object model fits well. Given just a single slot for dynamic expansion one
could make that work. One would just have to use the c++ classes as types, not as ruby
classes. Classes are not types, not when you can modify them!
But at the end it is not even about which code you're writing, how good the fit.
It is about design, about change. To make this work (this meaning compiling a dynamic language to binary), flexibility is the key. It's not done, much is unclear, and one
must be able to change and change quickly.
Self change, just like in life, is the only real form of control. To maximise that
i didn't use metasm or llvm, and it is also the reason go will not feature in this
project. At the risk of never actually getting there, or having no users. Something
Sinatra sang comes to mind, about doing it a specific way :-)
There is still a lot to be learnt from go though, as much from the language as the
project. I find it inspiring that they moved from a c to a go compiler in a minor
version. And that what must be a major language in google has less commits than
rails. It does give hope.
PPS: Also revisited llvm (too complicated) and crystal (too complicated, bad fit in
type system) after this. Could still do rust off course, but the more i write, the
more i hear the call of simplicity (something that a normal person can still understand)

View File

@ -0,0 +1,116 @@
%p
Now that i
%em have
had time to write some more code (250 commits last month), here is
the good news:
%h2#sending-is-done Sending is done
%p
A dynamic language like ruby really has at its heart the dynamic method resolution. Without
that we'd be writing C++. Not much can be done in ruby without looking up methods.
%p
Yet all this time i have been running circles around this mother of a problem, because
(after all) it is a BIG one. It must be the one single most important reason why dynamic
languages are interpreted and not compiled.
%h2#a-brief-recap A brief recap
%p
Last year already i started on a rewrite. After hitting this exact same wall for the fourth
time. I put in some more Layers, the way a good programmer fixes any daunting problem.
%p
The
%a{:href => "https://github.com/ruby-x/rubyx"} Readme
has quite a good summary on the new layers,
and of course i'll update the architecture soon. But in case you didn't click, here is the
very very short summary:
%ul
%li
%p
Vool is a Virtual Object Oriented Language. Virtual in that it has no syntax of its own. But
it has semantics, and those are substantially simpler than ruby. Vool is Ruby without
the fluff.
%li
%p
Mom, the Minimal Object Machine layer is the first machine layer. Mom has no concept of memory
yet, only objects. Data is transferred directly from object
to object with one of Moms main instructions, the SlotLoad.
%li
%p
Risc layer here abstracts the Arm in a minimal and independent way. It does not model
any real RISC cpu instruction set, but rather implements what is needed for rubyx.
%li
%p
There is a minimal
%em Arm
translator that transforms Risc instructions to Arm instructions.
Arm instructions assemble themselves into binary code. A minimal
%em Elf
implementation is
able to create executable binaries from the assembled code and Parfait objects.
%li
%p
Parfait: Generating code (by descending above layers) is only half the story in an oo system.
The other half is classes, types, constant objects and a minimal run-time. This is
what Parfait is.
%h2#compiling-and-building Compiling and building
%p
After having finished all this layering work, i was back to square
= succeed ":" do
%em resolve
%p
But of course when i got there i started thinking that the resolve method (in ruby)
would itself need resolving. And after briefly considering cheating (hardcoding type
information into this
%em one
method), i opted to write the code in Risc. Basically assembler.
%p
And it was horrible. It worked, but it was completely unreadable. So then i wrote a dsl for
generating risc instructions, using a combination of method_missing, instance_eval and
operator overloading. The result is quite readable code, a mixture between assembler and
a mathematical notation, where one can just freely name registers and move data around
with
%em []
and
= succeed "." do
%em <<
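%p
To give a flavour of what such a dsl can look like, here is a small self contained sketch. It is
not the rubyx implementation, just the same three ingredients: the block is instance_eval'ed in a
builder, bare names become registers via method_missing, and [] and << are overloaded to read
slots and move data.
%pre
%code
:preserve
  # illustrative only: records instructions instead of emitting real risc
  class RegisterValue
    def initialize(name, builder)
      @name = name
      @builder = builder
    end
    def [](slot)                       # eg object[:type]
      [@name, slot]
    end
    def <<(source)                     # eg type << object[:type]
      @builder.add([:move, source, @name])
    end
  end

  class Builder
    attr_reader :instructions
    def initialize
      @instructions = []
    end
    def add(instruction)
      @instructions << instruction
      self
    end
    def method_missing(name, *args)    # bare names become registers,
      return RegisterValue.new(name, self) if args.empty?
      add([name, *args])               # names with arguments become instructions
    end
    def build(&block)
      instance_eval(&block)
      instructions
    end
  end

  Builder.new.build do
    type << object[:type]                          # load the receiver's type
    cached << call_site[:cached_type]              # load the cached type
    branch_if_not_equal type, cached, :cache_miss  # bail out on a miss
  end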
%p
By then resolving worked, but it was still a method. Since it was already in risc, i basically
inlined the code by creating a new Mom instruction and moving the code to its
= succeed "." do
%em to_risc
%p
A small bug in calling the resulting method was fixed, and
= succeed "," do
%em voila
%h2#the-proof The proof
%p
Previous, static, Hello Worlds looked like this:
\&gt; "Hello world".putstring
%p
Of course we can know the type that putstring applies to and so this does not
involve any method resolution at runtime, only at compile time.
%p
Today's step is thus:
\&gt; a = "Hello World"
%blockquote
%p a.putstring
%p
This does involve a run-time lookup of the
%em putstring
method. It being a method on String,
it is indeed found and called.(1) Hurray.
%p
And maths works too:
\&gt; a = 150
%blockquote
%p a.div10
%p
Does indeed result in 15. Even with the
%em new
integers. Part of the rewrite was to upgrade
integers to first class objects.
%p
PS(1): I know with more analysis the compiler
%em could
know that
%em a
is a String (or Integer),
but just now it doesnt. Take my word for it or even better, read the code.

View File

@ -1,90 +0,0 @@
Now that i *have* had time to write some more code (250 commits last month), here is
the good news:
## Sending is done
A dynamic language like ruby really has at it's heart the dynamic method resolution. Without
that we'd be writing C++. Not much can be done in ruby without looking up methods.
Yet all this time i have been running circles around this mother of a problem, because
(after all) it is a BIG one. It must be the one single most important reason why dynamic
languages are interpreted and not compiled.
## A brief recap
Last year already i started on a rewrite. After hitting this exact same wall for the fourth
time. I put in some more Layers, the way a good programmer fixes any daunting problem.
The [Readme](https://github.com/ruby-x/rubyx) has quite a good summary on the new layers,
and off course i'll update the architecture soon. But in case you didn't click, here is the
very very short summary:
- Vool is a Virtual Object Oriented Language. Virtual in that is has no own syntax. But
it has semantics, and those are substantially simpler than ruby. Vool is Ruby without
the fluff.
- Mom, the Minimal Object Machine layer is the first machine layer. Mom has no concept of memory
yet, only objects. Data is transferred directly from object
to object with one of Mom's main instructions, the SlotLoad.
- Risc layer here abstracts the Arm in a minimal and independent way. It does not model
any real RISC cpu instruction set, but rather implements what is needed for rubyx.
- There is a minimal *Arm* translator that transforms Risc instructions to Arm instructions.
Arm instructions assemble themselves into binary code. A minimal *Elf* implementation is
able to create executable binaries from the assembled code and Parfait objects.
- Parfait: Generating code (by descending above layers) is only half the story in an oo system.
The other half is classes, types, constant objects and a minimal run-time. This is
what is Parfait is.
## Compiling and building
After having finished all this layering work, i was back to square *resolve*: how to
dynamically, at run-time, resolve a method to binary. The strategy was going to be to have
some short risc based check and bail out to a method.
But off course when i got there i started thinking that the resolve method (in ruby)
would need resolve itself. And after briefly considering cheating (hardcoding type
information into this *one* method), i opted to write the code in Risc. Basically assembler.
And it was horrible. It worked, but it was completely unreadable. So then i wrote a dsl for
generating risc instructions, using a combination of method_missing, instance_eval and
operator overloading. The result is quite readable code, a mixture between assembler and
a mathematical notation, where one can just freely name registers and move data around
with *[]* and *<<*.
By then resolving worked, but it was still a method. Since it was already in risc, i basically
inlined the code by creating a new Mom instruction and moving the code to it's *to_risc*.
Now resolving still worked, and also looked good.
A small bug in calling the resulting method was fixed, and *voila*, ruby-x can dynamically call
any method.
## The proof
Previous, static, Hello Worlds looked like this:
> "Hello world".putstring
Off course we can know the type that putstring applies to and so this does not
involve any method resolution at runtime, only at compile time.
Todays step is thus:
> a = "Hello World"
> a.putstring
This does involve a run-time lookup of the *putstring* method. It being a method on String,
it is indeed found and called.(1) Hurray.
And maths works too:
> a = 150
> a.div10
Does indeed result in 15. Even with the *new* integers. Part of the rewrite was to upgrade
integers to first class objects.
PS(1): I know with more analysis the compiler *could* now that *a* is a String (or Integer),
but just now it doesn't. Take my word for it or even better, read the code.

50
arm/overview.html.haml Normal file
View File

@ -0,0 +1,50 @@
%hr/
%p
layout: arm
title: Arm resources
%h2#arm-is-the-target Arm is the target
%p
So, since the first target is arm, some of us may need to learn a bit (yep, that's me). So this is
a collection of helpful resources (links and specs) with sometimes very very brief summaries.
%p So why learn assembler, after all, it's likely you spent your programmer's life avoiding it:
%ul
%li Some things can not be expressed in ruby
%li To speed things up.
%li To add cpu specific capabilities
%h2#links Links
%p
A very good
%a{:href => "/arm/arm_inst.pdf"} summary pdf
was created by the arm university, which i converted
to
%a{:href => "/arm/target.html"} html for online reading
%p
%a{:href => "http://www.davespace.co.uk/arm/introduction-to-arm/why-learn.html"} Daves
site explains just about
everything about the arm in nice and easy to understand terms.
%p
A nice series on thinkingeek, here is the integer
%a{:href => "http://thinkingeek.com/2013/08/11/arm-assembler-raspberry-pi-chapter-15/"} division section
that has a
%a{:href => "https://github.com/rofirrim/raspberry-pi-assembler/blob/master/chapter15/magic.py"} code respository
with code to generate code for constants.
%p
And of course there is the overwhelming arm infocenter,
%a{:href => "http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/CEGECDGD.html"} here with its bizarre division
%p
The full 750 page specification for the pi , the
%a{:href => "/arm/big_spec.pdf"} ARM1176JZF-S pdf is here
or
%a{:href => "http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/BABFADHJ.html"} online
%p
A nice list of
%a{:href => "http://docs.cs.up.ac.za/programming/asm/derick_tut/syscalls.html"} Kernel calls
%h2#virtual-pi Virtual pi
%p
And since not everyone has access to an arm, here is a description how to set up an
%a{:href => "/arm/qemu.html"} emulated pi
%p
And how to
%a{:href => "/arm/remote_pi.html"} access that
or any remote machine with ssl

View File

@ -1,39 +0,0 @@
---
layout: arm
title: Arm resources
---
## Arm is the target
So, since the first target is arm, some of us may need to learn a bit (yep, that's me). So this is
a collection of helpful resources (links and specs) with sometimes very very brief summaries.
So why learn assembler, after all, it's likely you spent your programmers life avoiding it:
- Some things can not be expressed in ruby
- To speed things up.
- To add cpu specific capabilities
## Links
A very good [summary pdf](/arm/arm_inst.pdf) was created by the arm university, which i converted
to [html for online reading](/arm/target.html)
[Dave's](http://www.davespace.co.uk/arm/introduction-to-arm/why-learn.html) site explains just about
everything about the arm in nice and easy to understand terms.
A nice series on thinkgeek, here is the integer [division section](http://thinkingeek.com/2013/08/11/arm-assembler-raspberry-pi-chapter-15/) that has a
[code respository](https://github.com/rofirrim/raspberry-pi-assembler/blob/master/chapter15/magic.py)
with code to generate code for constants.
And off course there is the overwhelming arm infocenter, [here with it's bizarre division](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/CEGECDGD.html)
The full 750 page specification for the pi , the [ARM1176JZF-S pdf is here](/arm/big_spec.pdf) or
[online](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/BABFADHJ.html)
A nice list of [Kernel calls](http://docs.cs.up.ac.za/programming/asm/derick_tut/syscalls.html)
## Virtual pi
And since not everyone has access to an arm, here is a description how to set up an [emulated pi](/arm/qemu.html)
And how to [access that](/arm/remote_pi.html) or any remote machine with ssl

103
arm/qemu.html.haml Normal file
View File

@ -0,0 +1,103 @@
%hr/
%p
layout: arm
title: How to configure Qemu
%h2#target-pi-on-mac Target Pi on Mac
%p So even though the idea is to run software on the Pi, not everyone has a Pi (yet :-)
%p Others, like me, prefer to develop on a laptop and not carry the Pi around.
%p For all those, this here explains how to emulate the Pi on a Mac.
%p
Even if you have a Pi,
%a{:href => "/remote_pi.html"} this explains
a nice way to develop with it.
%h3#replace-the-buggy-llvm Replace the buggy llvm
%p Written April 2014: as of writing the latest and greatest llvm based gcc (5.1) on Maverick (10.9) has a bug that makes qemu hang.
%p So type gcc -v and if the output contains “LLVM version 5.1”, you must install gcc4.2. Easily done with homebrew:
%pre
%code
:preserve
brew install https://raw.github.com/Homebrew/homebrew-dupes/master/apple-gcc42.rb
%p This will not interfere with the system's compiler as the gcc4.2 has postfixed executables (ie gcc-4.2)
%h3#qemu Qemu
%p Then it's time to get Qemu. There may be other emulators out there, and i have read of armulator, but this is what i found described and it works and is “easy enough”.
%pre
%code
:preserve
brew install qemu --env=std --cc=gcc-4.2
%p For people not on Maverick it may work without the -cc option.
%h3#pi-images Pi images
%p Create a directory for the stuff on your mac, ie pi.
%p Get the latest Raspbian image.
%p There seems to be some chicken and egg problem, so qemu needs the kernel separately. There is one in the links.
%h3#configure Configure
%p
In the blog post there is some fun configuration, I did it and it works. Not sure what happens if you don't.
The booting is described below (you may or may not need an extra init=/bin/bash in the root… quotes), so boot your Pi and then configure:
%p nano /etc/ld.so.preload
%p Put a # in front of the first line to comment it out. Should just be one line there.
%p Press ctrl-x then y then enter to save and exit.
%p (Optional) Create a file /etc/udev/rules.d/90-qemu.rules with the following content:
%pre
%code
:preserve
KERNEL=="sda", SYMLINK+="mmcblk0"
KERNEL=="sda?", SYMLINK+="mmcblk0p%n"
KERNEL=="sda2", SYMLINK+="root"
%p
The kernel sees the disk as /dev/sda, while a real pi sees /dev/mmcblk0.
This will create symlinks to be more consistent with the real pi.
%h3#boot Boot
%p There is quite a bit to the command line to boot the pi (i have an alias), here it is:
%pre
%code
:preserve
qemu-system-arm -kernel kernel-qemu -cpu arm1176 -m 256 -M versatilepb -no-reboot -serial stdio -append 'root=/dev/sda2 panic=1 rootfstype=ext4 rw' -hda raspbian.img -redir tcp:2222::22
%ul
%li the cpu is what broadcom specifies, ok
%li memory is unfortunately hardcoded in the versatilepb “machine”
%li the kernel is the file name of the kernel you downloaded (or extracted)
%li raspbian.img is the image you downloaded. Renamed as it probably had the datestamp on it
%li the redir redirects the port 2222 to let you log into the pi
%p So
%pre
%code
:preserve
ssh -p 2222 -l pi localhost
%p will get you “in”. Ie username pi (password raspberry is the default) and port 2222
%p Qemu bridges the network (that it emulates), and so your pi is now as connected as your mac.
%h3#more-disk More Disk
%p The image that you download has only 200Mb free. Since the gcc is included and we're developing (tiny little files of) ruby, this may be ok. If not there is a 3 step procedure to up the space.
%pre
%code
:preserve
dd if=/dev/zero bs=1m count=2048 >> raspbian.img
%p The 2048 gets you 2Gb as we specified 1m (meg).
%p On the pi launch
%pre
%code
:preserve
sudo fdisk /dev/sda
%p This will probably only work if you do the (Optional) config above.
%p Say p, and write down the start of the second partition (122880 for me).
%ul
%li d 2 will delete the second partition
%li n p 2 will create a new primary second partition
%li write the number you noted as the start and just press return for the end
%li p to check
%li w to write and quit
%p Reboot, and run
%pre
%code
:preserve
resize2fs
%h2#links Links
%p
Blog post:
%a{:href => "http://xecdesign.com/qemu-emulating-raspberry-pi-the-easy-way/"} http://xecdesign.com/qemu-emulating-raspberry-pi-the-easy-way/
%p
Kernel:
%a{:href => "http://xecdesign.com/downloads/linux-qemu/kernel-qemu"} http://xecdesign.com/downloads/linux-qemu/kernel-qemu
%p
Rasbian file system(preferably be torrent):
%a{:href => "http://www.raspberrypi.org/downloads/"} http://www.raspberrypi.org/downloads/

View File

@ -1,116 +0,0 @@
---
layout: arm
title: How to configure Qemu
---
## Target Pi on Mac
So even the idea is to run software on the Pi, not everyone has a Pi (yet :-)
Others, like me, prefer to develop on a laptop and not carry the Pi around.
For all those, this here explains how to emulate the Pi on a Mac.
Even if you have a Pi, [this explains](/remote_pi.html) a nice way to develop with it.
### Replace the buggy llvm
Written April 2014: as of writing the latest and greatest llvm based gcc (5.1) on Maverick (10.9) has a bug that makes qemu hang.
So type gcc -v and if the output contains "LLVM version 5.1", you must install gcc4.2. Easily done with homebrew:
brew install https://raw.github.com/Homebrew/homebrew-dupes/master/apple-gcc42.rb
This will not interfere with the systems compiler as the gcc4.2 has postfixed executables (ie gcc-4.2)
### Qemu
Then its time to get the Qemu. There may be other emulators out there, and i have read of armulator, but this is what i found discribed and it works and is "easy enough".
brew install qemu --env=std --cc=gcc-4.2
For people not on Maverick it may work without the -cc option.
### Pi images
Create a directory for the stuff on your mac, ie pi.
Get the latest Raspian image.
There seems to be some chicken and egg problem, so quemu needs the kernel seperately. There is one in the links.
### Configure
In the blog post there is some fun configuration, I did it and it works. Not sure what happens if you don't.
The booting is described below (you may or may not need an extra init=/bin/bash in the root... quotes), so boot your Pi and then configure:
nano /etc/ld.so.preload
Put a # in front of the first to comment it out. Should just be one line there.
Press ctrl-x then y then enter to save and exit.
(Optional) Create a file /etc/udev/rules.d/90-qemu.rules with the following content:
KERNEL=="sda", SYMLINK+="mmcblk0"
KERNEL=="sda?", SYMLINK+="mmcblk0p%n"
KERNEL=="sda2", SYMLINK+="root"
The kernel sees the disk as /dev/sda, while a real pi sees /dev/mmcblk0.
This will create symlinks to be more consistent with the real pi.
### Boot
There is quite a bit to the command line to boot the pi (i have an alias), here it is:
qemu-system-arm -kernel kernel-qemu -cpu arm1176 -m 256 -M versatilepb -no-reboot -serial stdio -append 'root=/dev/sda2 panic=1 rootfstype=ext4 rw' -hda raspbian.img -redir tcp:2222::22
- the cpu is what braodcom precifies, ok
- memory is unfortuantely hardcoded in the versatilepb "machine"
- the kernel is the file name of the kernel you downloaded (or extracted)
- raspbian.img is the image you downloaded. Renamed as it probably had the datestamp on it
- the redir redircts the port 2222 to let you log into the pi
So
ssh -p 2222 -l pi localhost
will get you "in". Ie username pi (password raspberry is the default) and port 2222
Qemu bridges the network (that it emulates), and so your pi is now as connected as your mac.
### More Disk
The image that you download has only 200Mb free. Since the gcc is included and we're developing (tiny little files of) ruby, this may be ok. If not there is a 3 step procedure to up the space.
dd if=/dev/zero bs=1m count=2048 >> raspbian.img
The 2048 gets you 2Gb as we specified 1m (meg).
On the pi launch
sudo fdisk /dev/sda
This will probably only work if your do the (Optional) config above.
Say p, and write down the start of the second partition (122880 for me).
d 2 will delete the second partition
n p 2 will create a new primary second partition
write the number as start and just return to the end.
p to check
w to write and quit.
Reboot, and run
resize2fs
Links
-----
Blog post: [http://xecdesign.com/qemu-emulating-raspberry-pi-the-easy-way/](http://xecdesign.com/qemu-emulating-raspberry-pi-the-easy-way/)
Kernel: [http://xecdesign.com/downloads/linux-qemu/kernel-qemu](http://xecdesign.com/downloads/linux-qemu/kernel-qemu)
Rasbian file system(preferably be torrent): [http://www.raspberrypi.org/downloads/](http://www.raspberrypi.org/downloads/)

58
arm/remote_pi.html.haml Normal file
View File

@ -0,0 +1,58 @@
%hr/
%p
layout: arm
title: How to use a remote pi
%h3#headless Headless
%p The pi is a strange mix, development board and full pc in one. Some people use it as a pc, but not me.
%p I use the pi because it is the same price as an Arduino, but much more powerful.
%p As such i don't use the keyboard or display and that is called headless mode, logging in with ssh.
%pre
%code
:preserve
ssh -p 2222 -l pi localhost
%p the -p 2222 is only needed for the qemu version, not the real pi.
%h3#authorized Authorized
%p
Over ssh one can use many other tools, but the password soon gets to be a pain.
So the first thing i do is copy my public key over to the pi. This will allow login without password.
%pre
%code
:preserve
scp -P 2222 .ssh/id_rsa.pub pi@localhost:.ssh/authorized_keys
%p
This assumes a fresh pi, otherwise you have to append your key to the authorized ones. Also if it complains about no
id_rsa.pub then you have to generate a key pair (public/private) using ssh-keygen (no password, otherwise you'll be typing that)
%h3#syncing Syncing
%p
Of course I do all that to be able to actually work on my machine. On the Pi my keyboard doesn't even work and
i'd have to use emacs or nano instead of TextMate. So i need to get the files across.
For this there are a million ways, but since i just go one way (mac to pi) i use rsync (over ssh).
%p I set up a directory (home) in my pi directory (on the mac), that i copy to the home directory on the pi using:
%pre
%code
:preserve
rsync -r -a -v -e "ssh -l pi -p 2222" ~/pi/home/ localhost:/home/pi
%p The pi/home is on my laptop and the command transfers all files to /home/pi , the default directory of the pi user.
%h3#automatic-sync Automatic sync
%p Transferring files is of course nice, but having to do it by hand after saving quickly becomes tedious.
%p Fswatch to the rescue. It will watch the filesystem (fs) for changes. Install with brew install fswatch
%p
Then you can store the above rsync command in a shell script, say sync.sh.
Add afplay “/System/Library/Sounds/Morse.aiff” if you like to know it worked.
%p Then just run
%pre
%code
:preserve
fswatch ~/pi/home/ sync.sh
%p And hear the ping each time you save.
%h2#conclusion Conclusion
%p So the total setup involves the qemu set up as described. To work i
%ul
%li start the terminal (iterm)
%li start the pi, with my alias “pi” *
%li log in to the pi in its window
%li open textmate with the directory i work (within the home)
%li
%p edit, save, wait for ping, alt-tab to pi window, run my whatever and repeat until it's time for tea
%li (i don't log into the prompt it gives in iterm so as not to accidentally quit the qemu session with ctrl-c )

View File

@ -1,66 +0,0 @@
---
layout: arm
title: How to use a remote pi
---
### Headless
The pi is a strange mix, development board and full pc in one. Some people use it as a pc, but not me.
I use the pi because it is the same price as an Arduino, but much more powerful.
As such i don't use the keyboard or display and that is called headless mode, logging in with ssh.
ssh -p 2222 -l pi localhost
the -p 2222 is only needed for the qemu version, not the real pi.
### Authorized
Over ssh one can use many other tools, but the password soon gets to be a pain.
So the first thing i do is copy my public key over to the pi. This will allow login without password.
scp -P 2222 .ssh/id_rsa.pub pi@localhost:.ssh/authorized_keys
This assumes a fresh pi, otherwise you have to append your key to the authorized ones. Also if it complains about no
id_rsa.pub then you have to generate a key pair (public/private) using ssh-keygen (no password, otherwise you'll be typing that)
### Syncing
Off course I do all that to be able to actually work on my machine. On the Pi my keyboard doesn't even work and
i'd have to use emacs or nano instead of TextMate. So i need to get the files accross.
For this there is a million ways, but since i just go one way (mac to pi) i use rsync (over ssh).
I set up a directory (home) in my pi directory (on the mac), that i copy to the home directory on the pi using:
rsync -r -a -v -e "ssh -l pi -p 2222" ~/pi/home/ localhost:/home/pi
The pi/home is on my laptop and the command transfers all files to /home/pi , the default directory of the pi user.
### Automatic sync
Transferring files is off course nice, but having to do it by hand after saving quickly becomes tedious.
Fswatch to the rescue. It will watch the filesystem (fs) for changes. Install with brew install fswatch
Then you can store the above rsync command in a shell script, say sync.sh.
Add afplay "/System/Library/Sounds/Morse.aiff" if you like to know it worked.
Then just run
fswatch ~/pi/home/ sync.sh
And hear the ping each time you save.
Conclusion
----------
So the total setup involves the qemu set up as described. To work i
- start the terminal (iterm)
- start the pi, with my alias "pi" *
- log in to the pi in it's window
- open textmate with the directory i work (within the home)
- edit, save, wait for ping, alt-tab to pi window, run my whatever and repeat until it's time for tea
* (i don't log into the prompt it gives in item so as not to accidentally quit the qemu session with ctr-c )

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,48 @@
%hr/
%p
layout: project
title: Join the fun
%p
I am very open for people to join. Say hello at the
= succeed "." do
%a{:href => "https://groups.google.com/forum/#!forum/ruby-x"} list
%p
I just want to mention that this is my hobby, something i do in my spare time, for fun.
I don't get any money and in fact, running 2 companies, have to carve the time to do this.
%p As such i want it to stay fun. So i am looking for friendly, constructive, positive contact.
%p
Please read the pages and the code and find something that interests you, possibly from the todo list.
Then talk to me what you are planning. Issues can be good to capture topic conversations.
The list is good for more general discussion.
%p Then fork and work on a branch before sending pull request.
%p
If you don't have an arm, here are instructions to run an
%a{:href => "/qemu.html"} emulator
(on mac)
%p I wrote some ideas in the about page, but here some more code related guidelines
%ul
%li
%p
Walk the straight line
Or “No futureproof” means not to design before you code. Not to anticipate, only to do the job that
needs doing. Better design should be extracted from working code.
%li
%p
tdd extreme
Having suffered from broken software (small feature add breaks whole software) so many times, the new tdd
wind is not just nice, it is essential. Software size is measured in tests passed, not lines written. Any
new feature is only accepted with enough tests, bugs fixed after a failed test is written.
%li
%p
Use names rightly
or the principle of least surprise. Programming is so much about naming, so if done right it will lead to a
natural understanding, even of code not read.
Good names are Formatter or compile, but unfortunately not everything we have learnt is named well, like
Array (should be ordered list), Hash (names implementation not function) or string (should be word, or bytebuffer).
%li
%p
No sahara
There has been much misunderstood talk about drying things up. Dry is good, but was never meant for code, but
for information (configuration). Trying to dry code leads to overly small functions, calling chains that
are difficult to understand and serve only a misunderstood slogan.

View File

@ -1,41 +0,0 @@
---
layout: project
title: Join the fun
---
I am very open for people to join. Say hello at the [list](https://groups.google.com/forum/#!forum/ruby-x).
I just want to mention that this is my hobby, something i do in my spare time, for fun.
I don't get any money and in fact, running 2 companies, have to carve the time to do this.
As such i want it to stay fun. So i am looking for friendly, constructive, positive contact.
Please read the pages and the code and find something that interests you, possibly from the todo list.
Then talk to me what you are planning. Issues can be good to capture topic conversations.
The list is good for more general discussion.
Then fork and work on a branch before sending pull request.
If you don't have an arm, here are instructions to run an [emulator](/qemu.html) (on mac)
I wrote some ideas in the about page, but here some more code related guidelines
- Walk the straight line
Or "No futureproof" means not to design before you code. Not to anticipate, only to do the job that
needs doing. Better design should be extracted from working code.
- tdd extreme
Having suffered from broken software (small feature add breaks whole software) so many times, the new tdd
wind is not just nice, it is essential. Software size is measured in tests passed, not lines written. Any
new feature is only accepted with enough tests, bugs fixed after a failed test is written.
- Use names rightly
or the principle of least surprise. Programming is so much naming, so if done right will lead to a
natural understanding, even of code not read.
Good names are Formatter or compile, but unfortunately not everything we have learnt is named well, like
Array (should be ordered list), Hash (names implementation not function) or string (should be word, or bytebuffer).
- No sahara
There has been much misunderstood talk about drying things up. Dry is good, but was never meant for code, but
for information (configuration). Trying to dry code leads to overly small functions, calling chains that
are difficult to understand and serve only a misundertood slogan.

View File

@ -1,143 +0,0 @@
---
layout: project
title: RubyX, where it started
---
<div class="row vspace10">
<div class="span12 center">
<h1><span></span></h1>
<p></p>
</div>
</div>
<div class="row ">
<div class="span1"> &nbsp; </div>
<div class="span10">
<p>
Torsten Ruger started this on 10.04.2014 after having read the Blue Book 20 years earlier.
The main ideas were:
</p>
<p>
<b>Mikrokernel</b>: The microkernel idea: anything that can be left out, should, puts a nice upper limit
on things and at the same time provides a great cooking pot for everyone else to try out their ideas.<br/>
Given gems and bundler this also seems an obvious choice. I really hope to see things i hadn't even thought of.
<br/>
<b>Layers represent an interface, not an implementation</b>:
It is said that every problem in computing can be solved by adding another layer of indirection. And so
we have many layers, which, when done right, help us to understand the system. (Read, layers are for us,
not the computer)
But implementing each layer comes with added cost, often unnecessary. Layers can and should be collapsed
in the implementation. Inlining, is a good example of this.
<br/>
<b>Empowerment</b>: I like the openness of ruby. Everyone can do what and how they want. And change other
peoples code in an easy and sensible way. The best ideas survive and even better ones are coming.
Friendly competition as it were, cooperation, independent improvement all make ruby gems better all the time.<br/>
But ruby itself has not benefited from this in the same way (ie by ruby developers), because it is not in ruby.
<br/>
<b>To get it done</b>: I don't know why this has not been done before, it seems so obvious.
The Blue Book influence has left me interested in virtual machines and that hasn't gone away for
so long. So when i bought my raspberry pi and had a real need for speed, the previous ecommerce project
left me feeling that anything could be done. And so i started.
<br/>
</p>
</div>
</div>
<div class="row">
<div class="span12 center">
<h1><span>Thanks</span></h1>
<p>This would not have happened without:</p>
</div>
</div>
<!-- About Us -->
<div class="row">
<div class="tripple">
<h2 class="center">Smalltalk</h2>
<p>
Smalltalk is the mother of OO for me. Adele Goldberg has written down the details of early implementations in the
Blue Book, which made a great impression on me. Having read it, mri code is quite easy to understand. <br/>
Unfortunately Smalltalk was too far ahead of its time and used the image, the implications of which are still
not understood imho.<br/>
Additional bad luck struck when, in Steve Jobs' great heist of the PARC UI, he did not recognise the value of its
implementation language and so pure OO did not get the same boost as the gui. Instead we got difficult c dialects.
</p>
</div>
<div class="tripple">
<h2 class="center">Ruby and Rails</h2>
<p>
After years of coding Java, Ruby was a very fresh wind. Smalltalk reborn without the funny syntax or image.
Instead of the image we now have gems, git and bundler, so code exchange has never been easier.
</p>
<p>
Rails has sort of given Ruby its purpose and made it grow from a perl like scripting language to a server programming
environment with all the whistles and bells. Rails maturity and code quality make it not only a joy to use,
but an excellent source for good ruby practises.
</p>
<p>
</p>
</div>
<div class="tripple">
<h2 class="center">Synthesis</h2>
<p>Synthesis is a microkernel OS written
in the 80's by Alexia Massalin which not only proves the validity of the microkernel idea, but also
introduces self modifying code into, of all places, the OS.
</p>
<p>
Alexia has raised questions about the nature of code and ways of programming which are still unresolved.
I regularly reread the thesis and especially the chapter on
<a href="http://valerieaurora.org/synthesis/SynthesisOS/ch4.html"> Quajects</a> in the endeavour to understand what
they are in any higher language terms.
</p>
</div>
</div>
<div class="row">
<div class="span12 center">
<p>Many other steps on the way that have left their mark:</p>
</div>
</div>
<div class="row ">
<div class="span1"> &nbsp; </div>
<div class="span10">
<p>
<b><a href="http://judy.sourceforge.net/">Judy</a></b> has been a major source of inspiration and opened new
ways of thinking about data structures and indeed coding. It has been the basis of two databases i wrote and together
with Synthesis redefined the meaning of speed for me.
</p>
<p>
<b><a href="http://metasm.cr0.org/">Metasm</a></b> finally confirmed what i had suspected for a while.
Namely that you don't need C to generate machine code. Metasm has be been assembling, deassembling and
compiling for several cpu's since 2007, in 100% ruby.
A great feat, and the only reason i don't use it is because it is too big (for me to understand).
</p>
<p>
<b><a href="https://github.com/cyndis/as">As</a></b> ended up being the starting point for the assembly layer.
It was nice and small and produced working ARM code, which is what i wanted, as raspberry is arm.
<b><a href="https://github.com/seattlerb/wilson"> Wilson</a> </b>got assimilated for similar reasons, ie small and
no dependencies.
</p>
<p>
<b><a href="http://kschiess.github.io/parslet/">Parslet</a></b> is great, thanks Kasper!
Parslet makes parsing possible for everyone.
</p>
<p>
<b><a href="http://bundler.io/">Bundler</a></b> just makes you wonder how we managed before.
Thanks to Yehuda, for starting it and Andre for making it fantastic.
</p>
</div>
</div>
<div class="row">
<div class="span12 center">
<p>Lastly, but most importantly there is a spiritual side to this too. Actually to anything i have done for at
least 15 years, and i just mention it <a href="spiritual.html">here</a>, thinking that it won't concern
most people which is fine. I don't really want to talk about it, but i can't leave it unsaid either.</p>
</div>
</div>

View File

@ -1,83 +0,0 @@
---
layout: project
title: Effectiveness, not efficiency
sub-title: By way of a new look at programming.
---
<div class="row">
<div class="tripple">
<h2 class="center"> Where to go</h2>
<p>
When making the distinction between effectiveness and efficiency i like to think of transport.
</p>
<p>
Efficiency is going fast, like an airplane is much more efficient than a car and that is more so than walking.
</p>
<p>
Effectiveness on the other hand is how straight your route is. Say you're in Hamburg and want to go to Berlin, then
it is not effective to go to Rome first.
</p>
<p>
Ruby, like python and mother smalltalk, let us be more effective at programming. We accept that they are not efficient,
but i think that can be changed.
</p>
<p>
But even while ruby has blossomed we have seen noticeable increase in effectiveness with so called dsl's and
what is generally called meta-programming.
</p>
<p>
But meta-programming is just a way to say that we manipulate the program just as we manipulate data. Off course! But
to do that effectively we need a better model of what an object oriented program actually is.
</p>
</div>
<div class="tripple">
<h2 class="center">Understandability</h2>
<p>
The way i see it is that it is the understandability that makes ruby or python more effective. As we read much more
code than we write (even if it's our own), focusing on descriptive programs helps.
</p>
<p>
But you only have to look at even rubies basic blocks, to see how misleadingly language is used.
We use Strings to represent words and text, while we store data in Arrays or Hashes.
If you look these up in a dictionary you may find: a thread used for tying,
a military force, or a dish of diced meat and vegetables. So we have a way to go there.
</p>
<p>
But even more disconcerting is that we have no model of how an object oriented system actually works. We know what it
does off course, as we programm using it all the time. But how it does it is not clear.
</p>
<p>
At least not clear in the sense that i could go and read it's code. Ruby like python are written in c and that just
is not easily understandable code.
</p>
</div>
<div class="tripple">
<h2 class="center">Playing computer</h2>
<p>
When programming, we fly blind. We have no visual idea of what the system that we write will do and the only way
to get feedback is to have the final version run. Bret Victor has put this
<a href="http://vimeo.com/36579366"> into words well</a>.
</p>
<p>
So when we program, it's actually mostly in our head. By playing computer, ie simulating in the head what the computer
will do when it runs the program.
</p>
<p>
And so what we consider good programmers, are people who are good at playing computer in their head.
</p>
<p>
But of course we have the computer right there before us. Really the computer should do it rather than
us having to simulate it.
</p>
<p>
What will come out of that line when we actually manage to put it into practice is unclear, though it is certain it
will be easier to do and result in hugely more powerful programs
</p>
<p>
Yet to get there we need better tools. Better tools that let us understand what we are doing better. Better models of
what we call programming, and by better i mean easier to understand by normal people (not the computer simulators).
</p>
</div>
</div>

View File

@ -1,104 +0,0 @@
---
layout: project
title: Ruby in Ruby
sub-title: RubyX hopes to make the mysterious more accessible, shed light in the farthest (ruby) corners, and above all, <b>empower you</b>
---
<div class="row">
<div class="tripple">
<h2 class="center"> A better tool, a better job</h2>
<p>
Ruby is the better tool to do the job. Any software job that is.
We, who use ruby daily, do so because it is more productive,
better in almost every way.
The only downside is speed, and we argue that away with cheap resources.
</p>
<p>
That it has taken this long to even seriously attempt a ruby implementation in ruby is due to the overwhelming
influence of C (folks), especially at the time.
</p>
<p>
Just a short and subjective list of why ruby is the better tool:
<ul>
<li>More fun. Ask anyone :-) </li>
<li>Lets you focus on the task</li>
<li>Elegant, both in syntax and solution</li>
<li>Understandable</li>
<li>Much faster to code</li>
</ul>
</p>
</div>
<div class="tripple">
<h2 class="center">Boys and toys</h2>
<p>
Rails has evolved tremendously from what was already a good start. All the development <em>around</em> it has nurtured
ruby development in all areas. Rails and all those parts make up a most mature and advanced software system.
</p>
<p> The "rails effect" is due to the accessibility of the system, imho. Ie it is written in ruby.</p>
<p> Ruby itself has not enjoyed this rails effect, and that is because it is written in C.
Crystal, Rust, Go, Julia etc. have, for the exact same reason.</p>
<p> It is my firm belief that given a vm in ruby, ruby development will "take off" too. In other words, given an
easy way to improve their tools, a developer will do so. Easy means understandable, and that means ruby for a
ruby developer.
</p>
</div>
<div class="tripple">
<h2 class="center">Step to Indepencance</h2>
<p>
The first thing any decent compiler does is compile itself. It is the maturity test of a language to implement
itself in itself, and the time has come for ruby. The mark of growing up is being independent, in ruby's case of C.
</p>
<p>
Having just learned Assembler, i can attest to what a great improvement C is over Assembler.
But that was then, and it is not just chance that development has been slow in the last 50 years.
</p>
<p>
There is this attitude C believers exude, and since they are the gatekeepers of the os,
everyone is fooled into believing only c is fast. Whereas what is true is that
<em>compiled (binary) code</em> is fast.
</p>
<p>
On a very similar note we are led to believe that os features must be used from c. Whereas system calls
are software interrupts, not really <em>calls</em> at all.
Only the c std library makes them look like c functions, but they are not.
</p>
</div>
</div>
<div class="span12">
<p class="center"><span> <b> So what does empowerment mean. </b></span></p>
<p>
For me it means owning your tools.
For everyone to really be able to unfold their ideas and potential.
Not to be stuck, rather to be able to change anything one wishes.
We usually own the code we write, and we have seen amazing progress in opening up new ideas.
</p>
<p>
So it is hard to even think of ruby holding us back, and it isn't of course; only current implementations of it are.
</p>
<p>
Concretely, what does this mean? Well, i don't know, do i! That's the whole point, that anyone can improve it beyond
the original creator's horizon.
</p>
<p>
But to mention a few things that have crossed my mind (and that i will most certainly not implement)
<ul>
<li> Efficient vector extensions that use cpu/gpu instructions not supported in the core</li>
<li> Efficient graphics extensions</li>
<li> New language features, ie real dsl's that extend the parser on the fly </li>
<li> Of course there are always new cpu's and os's</li>
<li> Better implementation of core datastructures. Did i hear digital trees being mentioned?</li>
<li> Better gc's, better memory management.</li>
<li> Superoptimization! (heard of that one?)</li>
</ul>
</p>
<p>
And the fun thing is of course that all of the above can be created as gems. No recompiling, no rvm/rbenv.
Anyone can choose how they want to pimp
their vm in the same way as you can decide what stack/tools you use in a rails project. And we have the essential
tool to do this: the bundler.
</p>
<p> And of course democracy decides what is good and what will stay. Natural extinction and all.</p>
</div>

View File

@ -0,0 +1,70 @@
%hr/
%p
layout: project
title: Yes, there is a spiritual side
sub-title: It is the question that drives us
%p
It's taken me a while to come out with it, but here it goes. The nice quote (got it?) has truth in it. Though we often don't
know what the question is, and that is fine. It is the search that drives us and almost defines us as humans.
The search for higher meaning, the meaning of life, truth, love or any of them in a mix, is what makes us human.
%p
Alas, the search for wealth, comfort or, as in the case of science, facts, is not fulfilling, and thus i need to make the
distinction here.
%h3#it-started-with-truth-and-ended-with-love It started with truth and ended with love
%p
In hindsight it was even the search for truth that got me to study physics 25 odd years ago.
Not consciously at the time, and it was only much later that i could express the anger at the deception.
There is of course no truth in Physics, or science, though it was a hard nut to swallow.
%p
Science is about facts, usually irrelevant facts. Irrelevant to the life of the person that is learning and teaching.
It's about talking in detail about the irrelevant, while never mentioning that it is irrelevant.
Science is about pretence of knowledge, not real knowledge, which is about life, love, or something meaningful.
%p
The sign of intelligence is surely learning and reflection.
But while it is not a big step to realise that what everyone needs to learn most is how to be fulfilled
(colloquially called happy), and the reflection that matters most is that of one's own life,
it is also a step rarely taken.
Sadly all the talk about non-meaningful things is keeping most people quite busy and away from any meaningful self-searching.
%p
I have done my spiritual search; it started long ago, led me to my master and to finding truth and love.
Spiritually i have searched and i have found. I am left without question. That is, for me (not generally, or for you) i have answers to any meaningful question. And i am completely fulfilled in my life, my work, and most importantly my love. My whole life is a whole, not, as it used to be, many distinct parts, and i have no problems, either in daily life or in general.
%h3#an-echo-karma-unfulfilled An echo, karma unfulfilled
%p
So why am i still doing this and not out there teaching? At least it seems that anyone who has realized anything feels the
need to go forth and tell others. But it is not my way and that is ok, in the same way that not everyone who learns english
becomes an english teacher.
%p
The best i can come up with is that there is still karma that needs to be cleaned. Just to clarify, karma is unresolved
problems left from our forebears or one's own actions before one was conscious. I was surely not conscious when i first
started with virtual machines, and so i get to clean that up now.
%p
But don't misunderstand, i don't resent my karma (another great matrix line). I accept it and am willing to clean my bit
up, even if i can see that much of it has been handed down from my parents. I don't blame them (or anyone), as they got handed
their bit from their parents and did their best. It is the way of things, and i have long ago resolved to do my bit
to further human consciousness as i can (starting in me).
%p
So i just wanted to say that this project in itself is not important in any sense of the word.
And the main meaning i get from it is the cleaning of my karma.
%h3#the-way-back The way back
%p
I noticed that quite quickly after i started the project, i was diverging radically from old ideas. And actually that
is not just from my old ideas, which is nice in itself. A certain freshness and the fact that i am not just going over
old ground. No, it's from any old ideas that i am aware of.
%p
I just noticed another crystal project with similar goals, but sort of more traditional choices (salama was called
crystal in the beginning). Ie llvm to generate binaries
and a more static approach. And that would have been me as a younger version. Now i go the long way because i know i have
all the time i need, and what matters is direction, not speed.
%p
The way it is happening is that i am reexamining just about everything i touch. A part of that is the kind of no stone
unturned mentality. Thoroughness in a way.
%p
But mostly it is a reexamination of everything i learned. It is going back over old ground and really looking at things,
seeing them in a fresh way and coming to mostly new conclusions. Of course the main reason we get so much done so quickly
in software engineering these days is that we build on previous and other people's work. But so much of that is just
layers on layers of stuff that is not needed. And they are not just baggage, they really stop us doing things differently.
%p
Going over this old ground and finding new ways does give me a certain satisfaction and has already led to a much better
understanding of what programming actually is. Also i find it meaningful that this sense of rediscovery is so similar to
what the spiritual path was about for me. And the idea does make me smile, that i am now a spiritual programmer.

View File

@ -1,72 +0,0 @@
---
layout: project
title: Yes, there is a spiritual side
sub-title: It is the question that drives us
---

View File

@ -1,67 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<title>The page you were looking for doesn't exist (404)</title>
<meta name="viewport" content="width=device-width,initial-scale=1">
<style>
.rails-default-error-page {
background-color: #EFEFEF;
color: #2E2F30;
text-align: center;
font-family: arial, sans-serif;
margin: 0;
}
.rails-default-error-page div.dialog {
width: 95%;
max-width: 33em;
margin: 4em auto 0;
}
.rails-default-error-page div.dialog > div {
border: 1px solid #CCC;
border-right-color: #999;
border-left-color: #999;
border-bottom-color: #BBB;
border-top: #B00100 solid 4px;
border-top-left-radius: 9px;
border-top-right-radius: 9px;
background-color: white;
padding: 7px 12% 0;
box-shadow: 0 3px 8px rgba(50, 50, 50, 0.17);
}
.rails-default-error-page h1 {
font-size: 100%;
color: #730E15;
line-height: 1.5em;
}
.rails-default-error-page div.dialog > p {
margin: 0 0 1em;
padding: 1em;
background-color: #F7F7F7;
border: 1px solid #CCC;
border-right-color: #999;
border-left-color: #999;
border-bottom-color: #999;
border-bottom-left-radius: 4px;
border-bottom-right-radius: 4px;
border-top-color: #DADADA;
color: #666;
box-shadow: 0 3px 8px rgba(50, 50, 50, 0.17);
}
</style>
</head>
<body class="rails-default-error-page">
<!-- This file lives in public/404.html -->
<div class="dialog">
<div>
<h1>The page you were looking for doesn't exist.</h1>
<p>You may have mistyped the address or the page may have moved.</p>
</div>
<p>If you are the application owner check the logs for more information.</p>
</div>
</body>
</html>

View File

@ -1,67 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<title>The change you wanted was rejected (422)</title>
<meta name="viewport" content="width=device-width,initial-scale=1">
<style>
.rails-default-error-page {
background-color: #EFEFEF;
color: #2E2F30;
text-align: center;
font-family: arial, sans-serif;
margin: 0;
}
.rails-default-error-page div.dialog {
width: 95%;
max-width: 33em;
margin: 4em auto 0;
}
.rails-default-error-page div.dialog > div {
border: 1px solid #CCC;
border-right-color: #999;
border-left-color: #999;
border-bottom-color: #BBB;
border-top: #B00100 solid 4px;
border-top-left-radius: 9px;
border-top-right-radius: 9px;
background-color: white;
padding: 7px 12% 0;
box-shadow: 0 3px 8px rgba(50, 50, 50, 0.17);
}
.rails-default-error-page h1 {
font-size: 100%;
color: #730E15;
line-height: 1.5em;
}
.rails-default-error-page div.dialog > p {
margin: 0 0 1em;
padding: 1em;
background-color: #F7F7F7;
border: 1px solid #CCC;
border-right-color: #999;
border-left-color: #999;
border-bottom-color: #999;
border-bottom-left-radius: 4px;
border-bottom-right-radius: 4px;
border-top-color: #DADADA;
color: #666;
box-shadow: 0 3px 8px rgba(50, 50, 50, 0.17);
}
</style>
</head>
<body class="rails-default-error-page">
<!-- This file lives in public/422.html -->
<div class="dialog">
<div>
<h1>The change you wanted was rejected.</h1>
<p>Maybe you tried to change something you didn't have access to.</p>
</div>
<p>If you are the application owner check the logs for more information.</p>
</div>
</body>
</html>

View File

@ -1,66 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<title>We're sorry, but something went wrong (500)</title>
<meta name="viewport" content="width=device-width,initial-scale=1">
<style>
.rails-default-error-page {
background-color: #EFEFEF;
color: #2E2F30;
text-align: center;
font-family: arial, sans-serif;
margin: 0;
}
.rails-default-error-page div.dialog {
width: 95%;
max-width: 33em;
margin: 4em auto 0;
}
.rails-default-error-page div.dialog > div {
border: 1px solid #CCC;
border-right-color: #999;
border-left-color: #999;
border-bottom-color: #BBB;
border-top: #B00100 solid 4px;
border-top-left-radius: 9px;
border-top-right-radius: 9px;
background-color: white;
padding: 7px 12% 0;
box-shadow: 0 3px 8px rgba(50, 50, 50, 0.17);
}
.rails-default-error-page h1 {
font-size: 100%;
color: #730E15;
line-height: 1.5em;
}
.rails-default-error-page div.dialog > p {
margin: 0 0 1em;
padding: 1em;
background-color: #F7F7F7;
border: 1px solid #CCC;
border-right-color: #999;
border-left-color: #999;
border-bottom-color: #999;
border-bottom-left-radius: 4px;
border-bottom-right-radius: 4px;
border-top-color: #DADADA;
color: #666;
box-shadow: 0 3px 8px rgba(50, 50, 50, 0.17);
}
</style>
</head>
<body class="rails-default-error-page">
<!-- This file lives in public/500.html -->
<div class="dialog">
<div>
<h1>We're sorry, but something went wrong.</h1>
</div>
<p>If you are the application owner check the logs for more information.</p>
</div>
</body>
</html>

111
rubyx/layers.html.haml Normal file
View File

@ -0,0 +1,111 @@
%hr/
%p
layout: rubyx
title: RubyX architectural layers
%h2#main-layers Main Layers
%p
To implement an object system to execute object oriented languages takes a large system.
The parts or abstraction layers are detailed below.
%p
It is important to understand the approach first though, as it differs from the normal
interpretation. The idea is to
%strong compile
ruby. The argument is often made that
typed languages are faster, but i don't believe that. I think dynamic languages
just push more functionality into the “virtual machine” and it is in fact only the
compiling to binaries that gives static languages their speed. This is the reason
to compile ruby.
%p
%img{:alt => "Architectural layers", :src => "/assets/layers.jpg"}/
%h3#ruby Ruby
%p
To compile and run ruby, we first need to parse ruby. While parsing ruby is quite
a difficult task, it has already been implemented in pure ruby
%a{:href => "https://github.com/whitequark/parser"}> here
\. The output of the parser is
an ast, which holds information about the code in instances of a single
%em Node
class.
Nodes have a type (which you sometimes see in s-expressions) and a list of children.
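%p
  A minimal sketch of what that looks like in practice (assuming the parser gem is installed):
%pre
  :preserve
    require "parser/current"

    node = Parser::CurrentRuby.parse("1 + 2")
    node.type       # => :send   (the :+ message sent to the integer 1)
    node.children   # => [s(:int, 1), :+, s(:int, 2)]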
%p There are two basic problems when working with ruby ast: one is the a in ast, the other is ruby.
%p
Since an abstract syntax tree only has one base class, one needs to employ the visitor
pattern to write a compiler. This ends up being one great class with lots of unrelated
functions, removing much of the benefit of OO.
%p
The second, possibly bigger problem, is ruby itself: Ruby is full of programmer happiness,
three ways to do this, five to do that. To simplify that, remove the duplication and
make analysis easier, Vool was created.
%h3#virtual-object-oriented-language Virtual Object Oriented Language
%p
Virtual, in this context, means that there is no syntax for this language; it is an
intermediate representation which
%em could
be targeted by several languages.
%p
The main purpose is to simplify existing oo languages down to their core components: mostly
calling, assignment, continuations and exceptions. Typed classes for each language construct
exist and make it easier to transform a statement into a lower level representations.
%p
Examples for things that exist in ruby but are broken down in Vool are
%em unless
, ternary operator,
do while or for loops and other similar syntactic sugar.
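%p
  Roughly speaking (the rewritten forms below are illustrative, not the exact Vool classes or output):
%pre
  :preserve
    # ruby source                  # simplified, vool-like form
    # a = 5 unless b               # if( not(b) ) { a = 5 }
    # x = cond ? 1 : 2             # if( cond ) { x = 1 } else { x = 2 }
    # begin ... end while cond     # a plain while loop, with the body run once before the test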
%h3#minimal-object-machine Minimal Object machine
%p
We compile Vool statements into Mom instructions. Mom is a machine, which means it has
instructions. But unlike a cpu (or the risc layer below) it does not have memory, only objects.
It also has no registers, and together these two things mean that all information is stored in
objects. Also the calling convention is object based and uses Frame and Message instances to
save state.
%p
Objects are typed, and are in fact the same objects the language operates on. Just the
functionality is expressed through instructions. Methods are in fact defined (as vool) on classes
and then compiled to Mom/Risc/Arm and the results stored in the method object.
%p
Compilation to Mom happens in two stages:
1. The linear statements/code is translated to Mom instructions.
2. Control statements are translated to jumps and labels.
%p
The second step leaves a linked list of machine instructions as the input for the next stage.
In the future a more elaborate system of optimisations is envisioned between these stages.
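%p
  Schematically (instruction names are illustrative, not necessarily the exact Mom classes), stage two
  turns a conditional into a label/jump chain:
%pre
  :preserve
    # vool:  if(cond) { true_branch } else { false_branch }
    #
    # mom, as a linked list of instructions:
    #   TruthCheck(cond)        -> jump to false_label when cond is false
    #   ... true_branch instructions ...
    #   Jump(merge_label)
    #   Label(false_label)
    #   ... false_branch instructions ...
    #   Label(merge_label)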
%h3#risc Risc
%p
The Register machine layer is a relatively close abstraction of risc hardware, but without the
quirks.
%p
The Risc machine has registers, indexed addressing, operators, branches and everything
needed for the next layer. It does not try to abstract every possible machine feature
(like llvm), but rather “objectifies” the general risc view to provide what is needed for
the Mom layer, the next layer up.
%p
The machine has its own (abstract) instruction set, and the mapping to arm is quite
straightforward. Since the instruction set is implemented as derived classes, additional
instructions may be defined and used later, as long as translation is provided for them too.
In other words the instruction set is extensible (unlike cpu instruction sets).
%p
Basic object oriented concepts are needed already at this level, to be able to generate a whole
self contained system. Ie what an object is, a class, a method etc. This minimal runtime is called
parfait, and the same objects will be used at runtime and compile time.
%p
Since working at this low machine level (essentially assembler) is not easy to follow for
everyone (me :-), an interpreter was created (by me :-). Later a graphical interface, a kind of
%a{:href => "https://github.com/ruby-x/rubyx-debugger"} visual debugger
was added.
Visualizing the control flow and being able to see values updated immediately helped
tremendously in creating this layer. And the interpreter helps in testing, ie keeping it
working in the face of developer change.
%h3#binary--arm-and-elf Binary, Arm and Elf
%p
A physical machine will run binaries containing instructions that the cpu understands, in a
format the operating system understands (elf). Arm and elf subdirectories hold the code for
these layers.
%p
Arm is a risc architecture, but, as anyone who knows it will attest, one with its own quirks.
For example any instruction may be executed conditionally in arm. Or there is no 32bit
register load instruction. It is possible to create very dense code using all the arm
special features, but this is not implemented yet.
%p
All Arm instructions are (ie derive from) RegisterInstructions, and there is an ArmTranslator
that translates RegisterInstructions to ArmInstructions.
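%p
  The extension pattern is roughly the following (a sketch only; the exact class and method names are
  assumptions, not the actual api):
%pre
  :preserve
    class MyNewInstruction < Risc::Instruction
      # whatever data the new instruction carries
    end

    class ArmTranslator
      def translate_MyNewInstruction(instruction)
        # return the ArmInstruction(s) that implement it
      end
    end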

View File

@ -1,109 +0,0 @@
---
layout: rubyx
title: RubyX architectural layers
---

45
rubyx/memory.html.haml Normal file
View File

@ -0,0 +1,45 @@
%hr/
%p
layout: rubyx
title: Types, memory layout and management
%p Memory management must be one of the main horrors of computing. That's why garbage collected languages like ruby are so great. Even simple malloc implementations tend to be quite complicated. Unnecessarily so, if one used object oriented principles of data hiding.
%h3#object-and-values Object and values
%p As has been mentioned, in a true OO system, object tagging is not really an option. Tagging being the technique of adding the lowest bit as a marker to pointers, and thus having to shift ints and losing a bit. Mri does this for Integers but not other value types. We accept this and work with it and just say “of course”, but it's not modeled well.
%p Integers are not Objects like “normal” objects. They are Values, on par with ObjectReferences, and have the following distinctive differences:
%ul
%li equality implies identity
%li constant for whole lifetime
%li pass by value semantics
%p If integers were normal objects, the first would mean they would be singletons. The second means you can't change them, you can only change a variable to hold a different value. It also means you can't add instance variables to an integer, nor singleton_methods. And the third means that if you do change the variable, a passed value will not be changed. Also they are not garbage collected. If you noticed how weird that idea is (the gc), you can see how natural that Value idea is.
%p Instead of trying to make this difference go away (like MRI) I think it should be explicit and indeed be expanded to all Objects that have these properties. Words for example (ruby calls them Symbols) are the same. A Table is a Table, and Toble is not. Floats (all numbers) and Times are the same.
%h3#object-type Object Type
%p So if we're not tagging, we must pass and keep the type information around separately. For passing it has been mentioned that a separate register is used.
%p For keeping track of the type data we need to make a decision of how many we support. The register for passing gives the upper limit of 4 bits, and this fits well with the idea of cache lines. So if we use cache lines, for every 8 words, we take one for the type.
%p Traditionally the class of the object is stored in the object. But this forces the dynamic lookup that is a good part of the performance problem. Instead we store the Object's Type. The Type then stores the Class, but it is the type that describes the memory layout of the object (and all objects with the same type).
%p This is in essence a level of indirection that gives us the space to have several Types for one class, and so we can evolve the class without having to change the Type (we just create new ones for every change).
%p
The memory layout of
%strong every
object is type word followed by “data”.
%p That leaves the length open and we can use the 8th 4-bit slot to store it. That gives a maximum of 16 Lines.
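%p
  As a rough picture of the layout described above (the exact bit assignment is my reading, not a spec):
%pre
  :preserve
    # an object occupies 1 to 16 "lines"; one line = 8 words (a cache line)
    #
    # word 0       : the type word, 8 slots of 4 bits each
    #                  - seven slots describe the basic type of the data words
    #                  - the 8th slot stores the object length in lines (4 bits -> max 16)
    # words 1 .. 7 : instance variable data, laid out as the object's Type dictates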
%h4#continuations Continuations
%p
But (i hear), ruby is dynamic, we must be able to add variables and methods to an object at any time.
So the type can't be fixed. Ok, we can change the Type every time, but when all empty slots have
been used up, what then?
%p
Then we use Continuations, so instead of adding a new variable to the end of the object, we use a
new object and store it in the original object. Thus extending the object.
%p
Continuations are pretty normal objects and it is just up to the object to manage the redirection.
Of course this may splatter objects a little, but in a running application this does not really happen much. Most instance variables are added quite soon after startup, just as functions are usually parsed in the beginning.
%p The good side of continuations is also that we can be quite tight on initial allocation, and even minimal with continuations. Continuations can be completely changed out after all.
%h3#pages-and-spaces Pages and Spaces
%p
Now that we have the smallest units taken care of, we need to store them and allocate and manage larger chunks. This is much
simpler, and we can use a fixed size Page of, say, 256 lines.
%p The highest order is a Space, which is just a list of Pages. Spaces manage Pages in a very similar way to how Pages manage Objects, ie as linked lists of free Objects/Pages.
%p
A Page, like a Space, is of course a normal object. The actual memory materialises out of nowhere, but then gets
filled immediately with objects. So no empty memory is managed, just objects that can be repurposed.
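%p
  A toy version of that recycling idea in plain ruby (illustrative only, not the actual Parfait classes):
%pre
  :preserve
    EmptyObject = Struct.new(:next_free)          # stand-in for a blank, reusable object

    class Page
      def initialize(count)
        @first_free = nil
        count.times { recycle(EmptyObject.new) }  # a page starts out full of reusable blanks
      end

      def get_object               # hand out the head of the free list
        object = @first_free
        @first_free = object.next_free
        object
      end

      def recycle(object)          # a freed object goes back onto the free list
        object.next_free = @first_free
        @first_free = object
      end
    end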

View File

@ -1,58 +0,0 @@
---
layout: rubyx
title: Types, memory layout and management
---

View File

@ -0,0 +1,76 @@
%hr/
%p
layout: rubyx
title: Optimisation ideas
%p I won't manage to implement all of these ideas in the beginning, so i just jot them down.
%h3#avoid-dynamic-lookup Avoid dynamic lookup
%p This of course is a broad topic, which may be seen under the heading of caching. Slightly wrongly though, in my view, as avoiding lookups is really the aim. Especially for variables.
%h4#i---instance-variables I - Instance Variables
%p Ruby has dynamic instance variables, meaning you can add a new one at any time. This is as it should be.
%p
But this can easily lead to a dictionary/hash type of implementation. As variable “lookup” is probably
%em the
most
common thing an OO system does, that leads to bad performance (unnecessarily).
%p
So instead we keep variables laid out c++ style, continuous, array style, at the address of the object. Then we have
to manage that in a dynamic manner. This (as i mentioned
= succeed ")" do
%a{:href => "memory.html"} here
is done by the indirection of the Type. A Type is a dynamic structure mapping names to indexes
(actually implemented as an array too, but the api is hash-like).
%p
When a new variable is added, we create a
%em new
Type and change the Type of the object. We can do this as the Type will
determine the Class of the object, which stays the same. The memory page mentions how this works with constant sized objects.
%p So, Problem one fixed: instance variable access is O(1).
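%p
  In effect the Type acts like a frozen little hash from names to slots (the names here are illustrative):
%pre
  :preserve
    # Type of a point-like object:   { :type => 0, :@x => 1, :@y => 2 }
    #
    # reading @y therefore compiles down to one indexed load at a known offset
    # (roughly object[2] in risc terms), with no hash lookup at runtime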
%h4#ii---method-lookup II - Method lookup
%p Of course that helps with Method access. All Methods are, in the end, variables on some (class) object. But as we can't very well have the same (continuous) index for a given method name on all classes, it has to be looked up. Or does it?
%p
Well, yes it does, but maybe not more than once: We can conceivably store the result, except of course not in a dynamic
structure as that would defeat the purpose.
%p
In fact there could be several caching strategies, possibly for different use cases, possibly determined by actual run-time
measurements, but for now I just describe a simple one using Data-Blocks, Plocks.
%p
So at a call-site, we know the name of the function we want to call, and the object we want to call it on, and so have to
find the actual function object, and by that the actual call address. In abstract terms we want to create a switch with
3 cases and a default.
%p
So the code is something like: if the first cache entry hits, call it; the same for the second and third, and if none hit, do the dynamic lookup.
The Plock can store those cache entries inside the code. So then we “just” need to get the cache loaded.
%p Initializing the cached values is by normal lazy initialization. Ie we check for nil and if so we do the dynamic lookup, and store the result.
%p
Remember, we cache Type against function address. Since Types never change, we're done. We could (as hinted above)
do things with counters or round robins, but that is for later.
%p
Alas: While Types are constant, darn the ruby, method implementations can actually change! And while it is tempting to
just create a new Type for that too, that would mean going through existing objects and changing the Type, which is not good.
So we need change notifications: when we cache, we must register a change listener and update the generated function,
or at least nullify it.
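%p
  Written out in plain ruby, the three-entry cache plus lazy fill would look roughly like this
  (a sketch only; the real thing is generated code, and the names here are made up):
%pre
  :preserve
    class CallSite
      def initialize(method_name)
        @name    = method_name
        @types   = [nil, nil, nil]
        @methods = [nil, nil, nil]
      end

      def call(receiver, *args)
        type = receiver.class                             # stand-in for the object's Type
        3.times do |i|
          return @methods[i].bind(receiver).call(*args) if @types[i].equal?(type)
        end
        method = type.instance_method(@name)              # the slow, dynamic lookup
        slot = @types.index(nil) || 0                     # lazy init: fill a free slot (or recycle one)
        @types[slot], @methods[slot] = type, method
        method.bind(receiver).call(*args)
      end
    end
%p
  A call like CallSite.new(:length).call("abc") would take the slow path once and hit the cache on every later call with the same Type.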
%h3#inlining Inlining
%p
Ok, this may not need too much explanation. Just work. It may be interesting to experiment how much this saves, and how much
inlining is useful. I could imagine at some point it's the register shuffling that determines the effort, not the
actual call.
%p Again the key is the update notifications when some of the inlined functions have changed.
%p
And it is important to code the functions so that they have a single exit point, otherwise it gets messy. Up to now this
was quite simple, but then blocks and exceptions are not done yet.
%h3#register-negotiation Register negotiation
%p
This is a little less baked, but it comes from the same idea as inlining. As calling functions is a lot of register
shuffling, we could try to avoid some of that.
%p More precisely, usually calling conventions have registers in which arguments are passed. And to call an “unknown”, ie any function, some kind of convention is necessary.
%p
But on “cached” functions, where the function is known, it is possible to do something else. And since we have the source
(ast) of the function around, we can do things previously impossible.
%p One such thing may be to recompile the function to accept arguments exactly where they are in the calling function. Well, now that it's written down, it does sound a lot like inlining, except without the inlining :-)
%p
An expansion of this idea would be to have a Negotiator on every function call. Meaning that the calling function would not
do any shuffling, but instead call a Negotiator, and the Negotiator does the shuffling and calling of the function.
This only really makes sense if the register shuffling information is encoded in the Negotiator object (and does not have
to be passed).
%p
Negotiators could do some counting and do the recompiling when it seems worth it. The Negotiator would remove itself from
the chain and connect caller and new receiver directly. How much is in this i couldn't say though.

View File

@ -1,84 +0,0 @@
---
layout: rubyx
title: Optimisation ideas
---

72
rubyx/threads.html.haml Normal file
View File

@ -0,0 +1,72 @@
%hr/
%p
layout: rubyx
title: Threads are broken
author: Torsten
%p
Having just read about ruby's threads, i was moved to collect my thoughts on the topic. How this will influence implementation
i am not sure yet. But good to get it out on paper as a basis for communication.
%h3#processes Processes
%p
I find it helps to consider why we have threads. Before threads, unix had only processes and ipc,
ie inter-process-communication.
%p
Processes were a good idea, keeping each program safe from the mistakes of others by restricting access to the process's
own memory. Each process had the view of “owning” the machine, being alone on the machine as it were. Each a small turing /
von neumann machine.
%p
But one had to wait for io or the network, and so it was difficult, or even impossible, to get one process to use the machine
to the hilt.
%p
IPC mechanisms were and are sockets, shared memory regions, files, each with their own sets of strengths, weaknesses and
apis, all deemed complicated and slow. Each exchange incurs a process switch, and processes are not lightweight structures.
%h3#thread Thread
%p
And so threads were born as a lightweight mechanism for getting more things done. Concurrently, because when one
thread is in a kernel call, only that thread is suspended.
%h4#green-or-fibre Green or fibre
%p
The first threads, which people did without kernel support, were quickly found not to solve the problem so well. Because when any
thread calls the kernel, all threads stop. Not really that much won, one might think, but wrongly.
%p
Now that Green threads are coming back into fashion as fibres, they are used for lightweight concurrency and actor programming, and
we find that the different viewpoint can help to express some solutions more naturally.
%h4#kernel-threads Kernel threads
%p
The real solution, where the kernel knows about threads and does the scheduling, took some while to become standard and
makes processes more complicated to a fair degree. Luckily we don't code kernels and don't have to worry.
%p
But we do have to deal with the issues that come up. The issue is of course data corruption. I don't even want to go into
how to fix this, or the different ways that have been introduced, because the main thrust becomes clear in the next chapter:
%h3#broken-model Broken model
%p
My main point about threads is that they are one of the worst hacks, especially in a c environment. Processes had a good
model of a program with a global memory. The equivalent of threads would have been shared memory with
%strong many
programs
connected. A nightmare. It even breaks that old turing idea and so it is very difficult to reason about what goes on in a
multi threaded program, and the only way this is achieved is by developing a more restrictive model.
%p
In essence the thread memory model is broken. Ideally i would not like to implement it, or if implemented, at least fix it
first.
%p But what is the fix? It is in essence what the process model was, ie each thread has its own memory.
%h3#thread-memory Thread memory
%p
In OO it is possible to fix the thread model, just because we have no global memory access. In effect the memory model
must be inverted: instead of almost all memory being shared by all threads and each thread having a small thread local
storage, threads must have mostly thread specific data and a small amount of shared resources.
%p
A thread would thus work as a process used to. In essence it can update any data it sees without restrictions. It must
exchange data with other threads through specified global objects, which take the role of what ipc used to be.
%p In an oo system this can be enforced by strict pass-by-value over thread borders.
%p
The itc (inter thread communication) objects are the only ones that need current thread synchronization techniques.
The one mechanism that could cover all needs could be simple lists.
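%p
  A toy sketch of such an itc object (ruby's Queue and Marshal stand in for whatever the real
  synchronised list and copying mechanism would be):
%pre
  :preserve
    class Channel
      def initialize
        @list = Queue.new                        # the only shared, synchronised object
      end

      def send_message(message)
        @list.push(Marshal.dump(message))        # strict pass-by-value: a copy crosses the border
      end

      def receive
        Marshal.load(@list.pop)
      end
    end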
%h3#rubyx RubyX
%p
The original problem of what a program does during a kernel call could be solved by a very small number of kernel threads.
Any kernel call would be put on a list, and “c” threads would pick it up, execute it and return the result.
%p
All other threads could be managed as green threads. Threads may not share objects, other than a small number of system
provided ones.

View File

@ -1,78 +0,0 @@
---
layout: rubyx
title: Threads are broken
author: Torsten
---
Having just read about rubys threads, i was moved to collect my thoughts on the topic. How this will influence implementation
i am not sure yet. But good to get it out on paper as a basis for communication.
### Processes
I find it helps to consider why we have threads. Before threads, unix had only processes and ipc,
so inter-process-communication.
Processes were a good idea, keeping each programm save from the mistakes of others by restricting access to the processes
own memory. Each process had the view of "owning" the machine, being alone on the machine as it were. Each a small turing/
von neumann machine.
But one had to wait for io, the network and so it was difficult, or even impossible to get one process to use the machine
to the hilt.
IPC mechnisms were and are sockets, shared memory regions, files, each with their own sets of strengths, weaknesses and
api's, all deemed complicated and slow. Each switch encurs a process switch and processes are not lightweight structures.
### Thread
And so threads were born as a lightweight mechanisms of getting more things done. Concurrently, because when the one
thread is in a kernel call, it is suspended.
#### Green or fibre
The first threads that people did without kernel support, were quickly found not to solve the problem so well. Because as any
thread is calling the kernel, all threads stop. Not really that much won one might think, but wrongly.
Now that Green threads are coming back in fashion as fibres they are used for lightweight concurrency, actor programming and
we find that the different viewpoint can help to express some solutions more naturally.
#### Kernel threads
The real solution, where the kernel knows about threads and does the scheduling, took some while to become standard and
makes processes more complicated a fair degree. Luckily we don't code kernels and don't have to worry.
But we do have to deal with the issues that come up. The isse is off course data corruption. I don't even want to go into
how to fix this, or the different ways that have been introduced, because the main thrust becomes clear in the next chapter:
### Broken model
My main point about threads is that they are one of the worse hacks, especially in a c environemnt. Processes had a good
model of a programm with a global memory. The equivalent of threads would have been shared memory with **many** programs
connected. A nightmare. It even breaks that old turing idea and so it is very difficult to reason about what goes on in a
multi threaded program, and the only ways this is achieved is by developing a more restrictive model.
In essence the thread memory model is broken. Ideally i would not like to implement it, or if implemented, at least fix it
first.
But what is the fix? It is in essence what the process model was, ie each thread has it's own memory.
### Thread memory
In OO it is possible to fix the thread model, precisely because we have no global memory access. In effect the memory model
must be inverted: instead of almost all memory being shared by all threads and each thread having a small thread-local
storage, threads must have mostly thread-specific data and only a small amount of shared resources.
A thread would thus work the way a process used to. In essence it can update any data it sees without restrictions. It must
exchange data with other threads through specified global objects that take the role of what IPC used to be.
In an OO system this can be enforced by strict pass-by-value over thread borders.
The ITC (inter-thread communication) objects are the only ones that need today's thread synchronization techniques.
A simple list may well be the one mechanism that covers all needs, as sketched below.
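What such an ITC object could look like, in plain Ruby (the class name ItcList is made up for illustration): the list is the
only shared object, values are deep-copied on the way in, and synchronization lives in exactly one place.

```ruby
# Hypothetical ItcList: the single shared object between two threads.
# Values are deep-copied (pass-by-value), so threads never share references.
class ItcList
  def initialize
    @list  = []
    @mutex = Mutex.new                        # the one place that needs locking
    @ready = ConditionVariable.new
  end

  def put(obj)
    copy = Marshal.load(Marshal.dump(obj))    # strict pass-by-value
    @mutex.synchronize { @list.push(copy) ; @ready.signal }
  end

  def take
    @mutex.synchronize do
      @ready.wait(@mutex) while @list.empty?
      @list.shift
    end
  end
end

inbox    = ItcList.new
consumer = Thread.new { 3.times { p inbox.take } }
3.times { |i| inbox.put(name: "job", id: i) }
consumer.join
```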
### RubyX
The original problem, of what a program does during a kernel call, could be solved by a very small number of kernel threads.
Any kernel call would be put on a list, and "c" threads would pick the calls up, execute them and return the results.
All other threads could be managed as green threads. Threads may not share objects, other than a small number of
system-provided ones. A rough sketch of that division of labour follows.
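The sketch below is plain Ruby and purely illustrative (KernelCall, the pool size and the file path are invented, and ordinary
threads stand in for both the green threads and the "c" threads): kernel calls are pushed onto a list, and a small pool of workers
executes them and hands back the results.

```ruby
# Sketch: kernel calls go onto a list; a small pool of "c" threads executes
# them and returns the results, so other threads never block in the kernel.
KernelCall = Struct.new(:name, :args, :reply)

call_list = Queue.new
WORKERS   = 2                     # "a very small number of kernel threads"

workers = WORKERS.times.map do
  Thread.new do
    while (call = call_list.pop)  # nil is the shutdown signal
      result = case call.name
               when :write then IO.write(*call.args)
               when :read  then IO.read(*call.args)
               end
      call.reply.push(result)     # hand the result back to the caller
    end
  end
end

# A caller: queue the call, then wait only on its own reply queue.
reply = Queue.new
call_list.push(KernelCall.new(:write, ["/tmp/rubyx_demo.txt", "hello\n"], reply))
puts "bytes written: #{reply.pop}"

WORKERS.times { call_list.push(nil) }
workers.each(&:join)
```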

View File

@ -1,53 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a class=here href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>Abstract</h1>
<p>This dissertation shows that operating systems can provide fundamental services an order of magnitude more efficiently than traditional implementations. It describes the implementation of a new operating system kernel, Synthesis, that achieves this level of performance.
<p>The Synthesis kernel combines several new techniques to provide high performance without sacrificing the expressive power or security of the system. The new ideas include:
<ul>
<li>Run-time code synthesis - a systematic way of creating executable machine code at runtime to optimize frequently-used kernel routines - queues, buffers, context switchers, interrupt handlers, and system call dispatchers - for specific situations, greatly reducing their execution time.
<li>Fine-grain scheduling - a new process-scheduling technique based on the idea of feedback that performs frequent scheduling actions and policy adjustments (at submillisecond intervals) resulting in an adaptive, self-tuning system that can support real-time data streams.
<li>Lock-free optimistic synchronization is shown to be a practical, efficient alternative to lock-based synchronization methods for the implementation of multiprocessor operating system kernels.
<li>An extensible kernel design that provides for simple expansion to support new kernel services and hardware devices while allowing a tight coupling between the kernel and the applications, blurring the distinction between user and kernel services.
</ul>
The result is a significant performance improvement over traditional operating system implementations in addition to providing new services.
</div>
</body>
</html>

View File

@ -1,58 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a class=here href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>Acknowledgements</h1>
<p>Many people contributed to making this research effort a success. First and foremost, I want to thank my advisor, Calton Pu. He was instrumental in bringing this thesis to fruition. He helped clarify the ideas buried in my "collection of fast assembly-language routines," and his dedication through difficult times encouraged me to keep pushing forward. Without him, this dissertation would not exist.
<p>I am also greatly indebted to the other members of my committee: Dan Duchamp, Bob Sproull, Sal Stolfo, and John Zahorjan. Their valuable insight and timely suggestions helped speed this dissertation to completion.
<p>My sincerest appreciation and deepest "Qua!"s go to Renate Valencia. Her unselfish love and affection and incredible amount of emotional support helped me through some of my darkest hours here at Columbia and gave me the courage to continue on. Thanks also to Matthew, her son, for letting me borrow Goofymeyer, his stuffed dog.
<p>Many other friends in many places have helped in many ways; I am grateful to Emilie Dao for her generous help and support trying to help me understand myself and for the fun times we had together; to John Underkoffler and Clea Waite for their ear in times of personal uncertainty; to Mike Hawley and Olin Shivers, for their interesting conversation, rich ideas, and untiring willingness to "look at a few more sentences"; to Ken Phillips, for the thoughts we shared over countless cups of coffee; to Mort Meyerson, whose generosity in those final days helped to dissipate some of the pressure; to Brewster Kahle, who always has a ready ear and a warm hug to offer; to Domenic Frontiere and family, who are some of the most hospitable people I know; and to all my friends at Cooper Union, who made my undergrad and teaching years there so enjoyable.
<p>I also wish to thank Ming Chiang, Tom Matthews, and Tim Jones, the project students who worked so hard on parts of the Synthesis system. Thanks also go to all the people in the administrative offices, particularly Germaine, who made sure all the paperwork flowed smoothly between the various offices and who helped schedule my thesis defense on record short notice. I particularly want to thank my friends here at Columbia - Cliff Beshers, Shu-Wie Chen, Ashutosh Dutta, Edward Hee, John Ioannidis, Paul Kanevsky, Fred Korz, David Kurlander, Jong Lim, James Tanis, and George Wolberg, to name just a few. The countless dinners, good times, and piggy-back rides we shared helped make my stay here that much more enjoyable.
<p>I also wish to extend special thanks to the people at the University of Washington, especially Ed Lazowska, Hank Levy, Ed Felten, David Keppel (a.k.a. Pardo), Dylan McNamee, and Raj Vaswani, whose boundless energy and happiness always gave me something to look forward to when visiting Seattle or traveling to conferences and workshops. Special thanks to Raj, Dylan, Ed Felten and Jan and Denny Prichard, for creating that `carry' tee shirt and making me feel special; and to Lauren Bricker, Denise Draper, and John Zahorjan for piggy-back rides of unparalleled quality and length.
<p>Thanks goes to Sony corporation for the use of their machine; to Motorola for supplying most of the parts used to build my computer, the Quamachine; and to Burr Brown for their generous donation of digital audio chips.
<p>And finally, I want to thank my family, whose patience endured solidly to the end. Thanks to my mother and father, who always welcomed me home even when I was too busy to talk to them. Thanks, too, to my sister Lucy, sometimes the only person with whom I could share my feelings, and to my brother, Peter, who is always challenging me to a bicycle ride.
<p>In appreciation, I offer to all a warm, heartfelt<br><br>
<center><font size=18pt>- Qua! -</font></center>
</div>
</body>
</html>

View File

@ -1,79 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a class=here href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>Appendix A <span class=smallcaps>Unix</span> Emulator Test Programs</h1>
<pre>
#define N 500000
int x[N];
main() { int i;
    for(i=5; i--; ) g();
    printf("%d\n%d\n", x[N-2], x[N-1]);
}
g() { int i;
    x[0] = x[1] = 1;
    for(i=2; i &lt; N; i++) x[i] = x[i-x[i-1]] + x[i-x[i-2]];
}
</pre>
<p>Figure A.1: Test 1: Compute
<pre>
#define N 1024    /* or 1 or 4096 */
char x[N];
main() { int fd[2], i;
    pipe(fd);
    for(i=10000; i--; ) { write(fd[1], x, N); read(fd[0], x, N); }
}
</pre>
<p>Figure A.2: Test 2, 3, and 4: Read/Write to a Pipe
<pre>
#include &lt;sys/file.h&gt;
#define Test_dev "/dev/null"    /* or /dev/tty */
main() { int f, i;
    for(i=10000; i--; ) { f = open(Test_dev, O_RDONLY); close(f); }
}
</pre>
<p>Figure A.3: Test 5 and 6: Opening and Closing
<pre>
#include &lt;sys/file.h&gt;
#define N 1024
char x[N];
main() { int f, i, j;
    f = open("file", O_RDWR | O_CREAT | O_TRUNC, 0666);
    for(j=1000; j--; ) {
        lseek(f, 0L, L_SET); for(i=10; i--; ) write(f, x, N);
        lseek(f, 0L, L_SET); for(i=10; i--; ) read(f, x, N);
    }
    close(f);
}
</pre>
<p>Figure A.4: Test 7: Read/Write to a File
</div>
</body>
</html>

View File

@ -1,344 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a class=here href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>Bibliography</h1>
<div class=bib-conference>
<span class=bib-number>[1]</span>
<span class=bib-author>M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young.</span>
<span class=bib-title>Mach: A New Kernel Foundation for <span class=smallcaps>Unix</span> Development.</span>
<span class=bib-source>Proceedings of the 1986 Usenix Conference</span>
<span class=bib-pages>pages 93-112.</span>
<span class=bib-publisher>Usenix Association,</span>
<span class=bib-date>1986.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[2]</span>
<span class=bib-author>Sarita V. Adve, Vikram S. Adve, Mark D. Hill, and Mary K. Vernon.</span>
<span class=bib-title>Comparison of Hardware and Software Cache Coherence Schemes.</span>
<span class=bib-source>The 18th Annual International Symposium on Computer Architecture</span>
<span class=bib-pages>volume 19, pages 298-308,</span>
<span class=bib-date>1991.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[3]</span>
<span class=bib-author>T.E. Anderson, B.N. Bershad, E.D. Lazowska, and H.M. Levy.</span>
<span class=bib-title>Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism.</span>
<span class=bib-source>Proceedings of the 13th ACM Symposium on Operating Systems Principles</span>
<span class=bib-pages>pages 95-109,</span>
Pacific Grove, CA,
<span class=bib-date>October 1991.</span> ACM.
</div>
<div class=bib-entry>
<span class=bib-number>[4]</span>
<span class=bib-author>James Arleth.</span>
<span class=bib-title>A 68010 multiuser development system.</span>
<span class=bib-source>Master's thesis</span>, The Cooper Union for the Advancement of Science and Art, New York City,
<span class=bib-date>1984.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[5]</span>
<span class=bib-author>Brian N. Bershad, Edward D. Lazowska, Henry M. Levy, and David B. Wagner.</span>
<span class=bib-title>An Open Environment for Building Parallel Programming Systems.</span>
<span class=bib-source>Symposium on Parallel Programming: Experience with Applications, Languages and Systems</span>
<span class=bib-pages>pages 1-9, </span>
New Haven, Connecticut (USA),
<span class=bib-date>July 1988.</span> ACM SIGPLAN.
</div>
<div class=bib-conference>
<span class=bib-number>[6]</span>
<span class=bib-author>A. Black, N. Hutchinson, E. Jul, and H. Levy.</span>
<span class=bib-title>Object Structure in the Emerald System.</span>
<span class=bib-source>Proceedings of the First Annual Conference on Object-Oriented Programming, Systems, Languages, and Applications</span>
<span class=bib-pages>pages 78-86.</span>
ACM,
<span class=bib-date>September 1986.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[7]</span>
<span class=bib-author>D.L. Black.</span>
<span class=bib-title>Scheduling Support for Concurrency and Parallelism in the Mach Operating System.</span>
<span class=bib-source>IEEE Computer</span>
<span class=bib-pages>23(5):35-43,</span>
<span class=bib-date>May 1990.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[8]</span>
<span class=bib-author>Min-Ih Chen and Kwei-Jay Lin.</span>
<span class=bib-title>A Priority Ceiling Protocol for Multiple-Instance Resources.</span>
<span class=bib-source>IEEE Real-Time Systems Symposium</span>, San Antonio, TX,
<span class=bib-date>December 1991.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[9]</span>
<span class=bib-author>David Cheriton.</span>
<span class=bib-title>An Experiment Using Registers for Fast Message-Based Interprocess Communication.</span>
<span class=bib-source>ACM SIGOPS Operating Systems Review</span>
<span class=bib-pages>18(4):12-20,</span>
<span class=bib-date>October 1984.</span>
</div>
<div class=bib-entry>
<span class=bib-number>[10]</span>
<span class=bib-author>F. Christian.</span>
<span class=bib-title>Probabilistic Clock Synchronization.</span>
<span class=bib-source>Technical Report RJ6432 (62550) Computer Science</span>, IBM Almaden Research Center,
<span class=bib-date>September 1988.</span>
</div>
<div class=bib-book>
<span class=bib-number>[11]</span>
<span class=bib-author>H.M. Deitel.</span>
<span class=bib-title>An Introduction to Operating Systems.</span>
Addison-Wesley Publishing Company, second edition,
<span class=bib-date>1989.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[12]</span>
<span class=bib-author>Richard P. Draves, Brian N. Bershad, Richard F. Rashid, and Randall W. Dean.</span>
<span class=bib-title>Using Continuations to Implement Thread Management and Communication in Operating Systems.</span>
<span class=bib-source>Proceedings of the 13th ACM Symposium on Operating Systems Principles</span>
<span class=bib-pages>pages 122-136,</span>
Pacific Grove, CA,
<span class=bib-date>October 1991.</span> ACM.
</div>
<div class=bib-journal>
<span class=bib-number>[13]</span>
<span class=bib-author>J. Feder.</span>
<span class=bib-title>The Evolution of <span class=smallcaps>Unix</span> System Performance.</span>
<span class=bib-source>AT&amp;T Bell Laboratories Technical Journal</span>
<span class=bib-pages>63(8):1791-1814,</span>
<span class=bib-date>October 1984.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[14]</span>
<span class=bib-author>P.M. Herlihy.</span>
<span class=bib-title>Wait-Free Synchronization.</span>
<span class=bib-source>ACM Transactions on Programming Languages and Systems</span>
<span class=bib-pages>13(1),</span>
<span class=bib-date>January 1991.</span>
</div>
<div class=bib-entry>
<span class=bib-number>[15]</span>
<span class=bib-author>Neil D. Jones, Peter Sestoft, and Harald Sondergaard.</span>
<span class=bib-title>Mix: A Self-Applicable Partial Evaluator for Experiments in Compiler Generation.</span>
<span class=bib-source>Lisp and Symbolic Computation</span>
<span class=bib-pages>2(9-50):10,</span>
1989.
</div>
<div class=bib-entry>
<span class=bib-number>[16]</span>
<span class=bib-author>David Keppel, Susan J. Eggers, and Robert R. Henry.</span>
<span class=bib-title>A Case for Runtime Code Generation.</span>
<span class=bib-source>Technical Report UW CS&amp;E 91-11-04</span>, University of Washington Department of Computer Science and Engineering,
<span class=bib-date>November 1991.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[17]</span>
<span class=bib-author>B.D. Marsh, M.L.Scott, T.J.LeBlanc, and E.P.Markatos.</span>
<span class=bib-title>First-Class User-Level Threads.</span>
<span class=bib-source>Proceedings of the 13th ACM Symposium on Operating Systems Principles</span>
<span class=bib-pages>pages 95-109,</span>
Pacific Grove, CA,
<span class=bib-date>October 1991.</span> ACM.
</div>
<div class=bib-conference>
<span class=bib-number>[18]</span>
<span class=bib-author>H. Massalin and C. Pu.</span>
<span class=bib-title>Threads and Input/Output in the Synthesis Kernel.</span>
<span class=bib-source>Proceedings of the Twelfth Symposium on Operating Systems Principles</span>
<span class=bib-pages>pages 191-201,</span>
Arizona,
<span class=bib-date>December 1989.</span>
</div>
<div class=bib-entry>
<span class=bib-number>[19]</span>
<span class=bib-author>Henry Massalin.</span>
<span class=bib-title>A 68010 Multitasking Development System.</span>
Master's thesis, The Cooper Union for the Advancement of Science and Art, New York City,
<span class=bib-date>1984.</span>
</div>
<div class=bib-book>
<span class=bib-number>[20]</span>
<span class=bib-author>Motorola.</span>
<span class=bib-title>MC68881 and MC68882 Floating-Point Coprocessor User's Manual.</span>
Prentice Hall, Englewood Cliffs, NJ, 07632,
<span class=bib-date>1987.</span>
</div>
<div class=bib-book>
<span class=bib-number>[21]</span>
<span class=bib-author>Motorola.</span>
<span class=bib-title>MC68030 User's Manual.</span>
Prentice Hall, Englewood Cliffs, NJ, 07632,
<span class=bib-date>1989.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[22]</span>
<span class=bib-author>J. Ousterhout.</span>
<span class=bib-title>Why Aren't Operating Systems Getting Faster as Fast as Hardware.</span>
<span class=bib-source>USENIX Summer Conference</span>
<span class=bib-pages>pages 247-256,</span>
Anaheim, CA,
<span class=bib-date>June 1990.</span>
</div>
<div class=bib-conference>
<span class=bib-number>[23]</span>
<span class=bib-author>Susan Owicki and Anant Agarwal.</span>
<span class=bib-title>Evaluating the Performance of Software Cache Coherence.</span>
<span class=bib-source>Proceedings of the 3rd Symposium on Programming Languages and Operating Systems</span>. ACM,
<span class=bib-date>1989.</span>
</div>
<div class=bib-entry>
<span class=bib-number>[24]</span>
<span class=bib-author>R. Pike, D. Presotto, K. Thompson, and H. Trickey.</span>
<span class=bib-title>Plan 9 from Bell Labs.</span>
<span class=bib-source>Technical Report CSTR # 158</span>, AT&amp;T Bell Labs,
<span class=bib-date>1991.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[25]</span>
<span class=bib-author>C. Pu, H. Massalin, and J. Ioannidis.</span>
<span class=bib-title>The Synthesis Kernel.</span>
<span class=bib-source>Computing Systems</span>
<span class=bib-pages>1(1):11-32,</span>
<span class=bib-date>Winter 1988.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[26]</span>
<span class=bib-author>J.S. Quarterman, A. Silberschatz, and J.L. Peterson.</span>
<span class=bib-title>4.2BSD and 4.3BSD as Examples of the <span class=smallcaps>Unix</span> System.</span>
<span class=bib-source>ACM Computing Surveys</span>
<span class=bib-pages>17(4):379-418,</span>
<span class=bib-date>December 1985.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[27]</span>
<span class=bib-author>D. Ritchie.</span>
<span class=bib-title>A Stream Input-Output System.</span>
<span class=bib-source>AT&amp;T Bell Laboratories Technical Journal</span>
<span class=bib-pages>63(8):1897-1910,</span>
<span class=bib-date>October 1984.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[28]</span>
<span class=bib-author>D.M. Ritchie and K. Thompson.</span>
<span class=bib-title>The <span class=smallcaps>Unix</span> Time-Sharing System.</span>
<span class=bib-source>Communications of ACM</span>
<span class=bib-pages>7(7):365-375,</span>
<span class=bib-date>July 1974.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[29]</span>
<span class=bib-author>J.A. Stankovic.</span>
<span class=bib-title>Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation Systems.</span>
<span class=bib-source>IEEE Computer</span>
<span class=bib-pages>21(10):10-19,</span>
<span class=bib-date>October 1988.</span>
</div>
<div class=bib-journal>
<span class=bib-number>[30]</span>
<span class=bib-author>M. Stonebraker.</span>
<span class=bib-title>Operating System Support for Database Management.</span>
<span class=bib-source>Communications of ACM</span>
<span class=bib-pages>24(7):412-418,</span>
<span class=bib-date>July 1981.</span>
</div>
<div class=bib-entry>
<span class=bib-number>[31]</span>
<span class=bib-author>Sun Microsystems Incorporated, 2550 Garcia Avenue, Mountain View, California 94043, 415-960-1300.</span>
<span class=bib-title>SunOS Reference Manual,
<span class=bib-date>May 1988.</span></span>
</div>
<div class=bib-conference>
<span class=bib-number>[32]</span>
<span class=bib-author>Peter Wegner.</span>
<span class=bib-title>Dimensions of Object-Based Language Design.</span>
<span class=bib-source>Norman Meyrowitz, editor, Proceedings of the OOPSLA'87 conference</span>
<span class=bib-pages>pages 168-182,</span>
Orlando FL (USA),
<span class=bib-date>1987.</span> ACM.
</div>
<div class=bib-conference>
<span class=bib-number>[33]</span>
<span class=bib-author>Mark Weiser, Alan Demers, and Carl Hauser.</span>
<span class=bib-title>The Portable Common Runtime Approach to Interoperability.</span>
<span class=bib-source>Proceedings of the 12th ACM Symposium on Operating Systems Principles</span>
<span class=bib-pages>pages 114-122,</span>
Litchfield Park AZ (USA),
<span class=bib-date>December 1989.</span> ACM.
</div>
<div class=bib-journal>
<span class=bib-number>[34]</span>
<span class=bib-author>W.A. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack.</span>
<span class=bib-title>Hydra: The Kernel of a Multiprocessing Operating System.</span>
<span class=bib-source>Communications of ACM</span>
<span class=bib-pages>17(6):337-345,</span>
<span class=bib-date>June 1974.</span>
</div>
</div>
</body>
</html>

View File

@ -1,137 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a class=here href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>1. Introduction</h1>
<div id="chapter-quote">
I must Create a System, or be enslav'd by another Man's;<br>
I will not Reason and Compare: my business is to Create.<br>
-- William Blake Jerusalem
</div>
<h2>1.1 Purpose</h2>
<p>This dissertation shows that operating systems can provide fundamental services an order of magnitude more efficiently than traditional implementations. It describes the implementation of a new operating system kernel, Synthesis, that achieves this level of performance.
<p>The Synthesis kernel combines several new techniques to provide high performance without sacrificing the expressive power or security of the system. The new ideas include:
<ul>
<li><i>Run-time code synthesis</i> - a systematic way of creating executable machine code at runtime to optimize frequently-used kernel routines - queues, buffers, context switchers, interrupt handlers, and system call dispatchers - for specific situations, greatly reducing their execution time.
<li><i>Fine-grain scheduling</i> - a new process-scheduling technique based on the idea of feedback that performs frequent scheduling actions and policy adjustments (at submillisecond intervals) resulting in an adaptive, self-tuning system that can support real-time data streams.
<li><i>Lock-free optimistic synchronization</i> is shown to be a practical, efficient alternative to lock-based synchronization methods for the implementation of multiprocessor operating system kernels.
<li>An extensible kernel design that provides for simple expansion to support new kernel services and hardware devices while allowing a tight coupling between the kernel and the applications, blurring the distinction between user and kernel services.
</ul>
The result is a significant performance improvement over traditional operating system implementations in addition to providing new services.
<p>The text is structured as follows: The remainder of this chapter summarizes the project. It begins with a brief history, showing how my dissatisfaction with the performance of computer software led me to do this research. It ends with an overview of the Synthesis kernel and the hardware it runs on. The intent is to establish context for the remaining chapters and reduce the need for forward references in the text.
<p>Chapter 2 examines the design decisions and tradeoffs in existing operating systems. It puts forth arguments telling why I believe some of these decisions and tradeoffs should be reconsidered, and points out how Synthesis addresses the issues.
<p>The next four chapters present the new implementation techniques. Chapter 3 explains run-time kernel code synthesis. Chapter 4 describes the structure of the Synthesis kernel. Chapter 5 explains the lock-free data structures and algorithms used in Synthesis. Chapter 6 talks about fine-grain scheduling. Each chapter includes measurements that prove the effectiveness of each idea.
<p>Application-level measurements of the system as a whole and comparisons with other systems are found in chapter 7. The dissertation closes with chapter 8, which contains conclusions and directions for further work.
<h2>1.2 History and Motivation</h2>
<p>This section gives a brief history of the Synthesis project. By giving the reader a glimpse of what was going through my mind while doing this research, I establish context and make the new ideas easier to grasp by showing the motivation behind them.
<p>. . . In 1983, the first <span class=smallcaps>Unix</span>-based workstations were being introduced. I was unhappy with the performance of computers of that day, particularly that of workstations relative to what DOS-based PCs could deliver. Among other things, I found it hard to believe that the workstations could not drive even one serial line at a full 19,200 baud - approximately 2000 characters per second <sup>1</sup>. I remember asking myself and others: "There is a full half-millisecond time between characters. What could the operating system possibly be doing for that long?" No one had a clear answer. Even at the relatively slow machine speed of that day - approximately one million machine instructions per second - the processor could execute 500 machine instructions in the time a character was transmitted. I could not understand why 500 instructions were not sufficient to read a character from a queue and have it available to write to the device's control register by the time the previous one had been transmitted.
<div class=footnote><sup>1</sup> This is still true today despite an order-of-magnitude speed increase in the processor hardware, and attests to a comparable increase in operating system overhead. Specifically, the Sony NEWS 1860 workstation, running release 4.0 of Sony's version of UNIX, places a software limit of 9600 baud on the machine's serial lines. If I force the line to go faster through the use of kernel hackery, the operating system loses data each time a burst longer than about 200 characters arrives at high speed.</div>
<p>That summer, I decided to try building a small computer system and writing some operating systems software. I thought it would be fun, and I wanted to see how far I could get. I teamed up with a fellow student, James Arleth, and together we built the precursor of what was later to become an experimental machine known as the Quamachine. It was a two-processor machine based on the 68000 CPU [4], but designed in such a way that it could be split into two independently-operating halves, so we each would have a computer to take with us after we graduated. Jim did most of the hardware design while I concentrated on software.
<p>The first version of the software [19] consisted of drivers for the machine's serial ports and 8-bit analog I/O ports, a simple multi-tasker, and an unusual debug monitor that included a rudimentary C-language compiler / interpreter as its front end. It was quite small - everything fit into the machine's 16 kilobyte ROM, and ran comfortably in its 16 kilobyte RAM. And it did drive the serial ports at 19,200 baud. Not just one, but all four of them, concurrently. Even though it lacked many fundamental services, such as a filesystem, and could not be considered a "real" operating system in the sense of <span class=smallcaps>Unix</span>, it was the precursor of the Synthesis kernel, though I did not know it at the time.
<p>After entering the PhD program at Columbia in the fall of 1984, I continued to develop the system in my spare time, improving both the hardware and the software, and also experimenting with other interests -- electronic music and signal processing. During this time, the CPU was upgraded several times as Motorola released new processors in the 68000 family. Currently, the Quamachine uses a 68030 processor rated for 33 MHz, but running at 50MHz, thanks to a homebrew clock circuit, special memory decoding tricks, a higher-than-spec operating voltage, and an ice-cube to cool the processor.
<p>But as the software was fleshed out with more features and new services, it became slower. Each new service required new code and data structures that often interacted with other, unrelated, services, slowing them down. I saw my system slowly acquiring the ills of <span class=smallcaps>Unix</span>, going down the same road to inefficiency. This gave me insight into the inefficiency of <span class=smallcaps>Unix</span>. I noticed that, often, the mere presence of a feature or capability incurs some cost, even when not being used. For example, as the number of services and options multiply, extra code is required to select from among them, and to check for possible interference between them. This code does no useful work processing the application's data, yet it adds overhead to each and every call.
<p>Suddenly, I had a glimmer of an idea of how to prevent this inefficiency from creeping into my system: runtime code generation! All along I had been using a monitor program with a C-language front end as my "shell." I could install and remove services as needed, so that no service would impose its overhead until it was used. I thought that perhaps there might be a way to automate the process, so that the correct code would be created and installed each time a service was used, and automatically removed when it was no longer needed. This is how the concept of creating code at runtime came to be. I hoped that this could provide relief from the inefficiencies that plague other full-featured operating systems.
<p>I was dabbling with these ideas, still in my spare time, when Calton Pu joined the faculty at Columbia as I was entering my third year. I went to speak with him since I was still unsure of my research plans and looking for a new advisor. Calton brought with him some interesting research problems, among them the efficient implementation of object-based systems. He had labored through his dissertation and knew where the problems were. Looking at my system, he thought that my ideas might solve that problem one day, and encouraged me to forge ahead.
<p>The project took shape toward the end of that semester. Calton had gone home for Christmas, and came back with the name Synthesis, chosen for the main idea: run-time kernel code synthesis. He helped package the ideas into a coherent set of concepts, and we wrote our first paper in February of 1987.
<p>I knew then what the topic of my dissertation would be. I started mapping out the structure of the basic services and slowly restructured the kernel to use code synthesis throughout. Every operation was subject to intense scrutiny. I recall the joy felt the day I discovered how to perform a "putchar" (place character into buffer) operation in four machine instructions rather than the five I had been using (or eight, using the common C-language macro). After all, "putchar" is a common operation, and I found it both satisfying and amusing that eliminating one machine instruction resulted in a 4% overall gain in performance for some of my music applications. I continued experimenting with electronic music, which by then had become more than a hobby, and, as shown in section 6.3, offered a convincing demonstration that Synthesis did deliver the kind of performance claimed.
<p>Over time, this type of semi-playful, semi-serious work toward a fully functional kernel inspired the other features in Synthesis - fine-grained scheduling, lock-free synchronization, and the kernel structure.
<p>Fine-grained scheduling was inspired by work in music and signal-processing. The early kernel's scheduler often needed tweaking in order to get a new music synthesis program to run in real-time. While early Synthesis was fast enough to make real-time signal processing possible by handling interrupts and context switches efficiently, it lacked a guarantee that real-time tasks got sufficient CPU as the machine load increased. I had considered the use of task priorities in scheduling, but decided against them, partly because of the programming effort involved, but mostly because I had observed other systems that used priorities, and they did not seem to fully solve the problem. Instead, I got the idea that the scheduler could deduce how much CPU time to give each stage of processing by measuring the data accumulation at each stage. That is how fine-grained scheduling was born. It seemed easy enough to do, and a few days later I had it working.
<p>The overall structure of the kernel was another idea developed over time. Initially, the kernel was an ad-hoc mass of procedures, some of which created code, some of which didn't. Runtime code generation was not well understood, and I did not know the best way to structure such a system. For each place in the kernel where code-synthesis would be beneficial, I wrote special code to do the job. But even though the kernel was lacking in overall structure, I did not see that as negative. This was a period where freedom to experiment led to valuable insights, and, as I found myself repeating certain things, an overall structure gradually became clear.
<p>Optimistic synchronization was a result of these experiments. I had started writing the kernel using disabled interrupts to implement critical sections, as is usually done in other single-processor operating systems. But the limitations of this method were soon brought out in my real-time signal processing work, which depends on the timely servicing of frequent interrupts, and therefore cannot run in a system that disables interrupts for too long. I therefore looked for alternatives to inter-process synchronization. I observed that in many cases, such as in a single-producer/single-consumer queue, the producer and consumer interact only when the queue is full or empty. During other times, they each work on different parts of the queue, and can do so independently, without synchronization. My interest in this area was further piqued when I read about the "Compare-and-Swap" instructions on the 68030 processor, which allows concurrent data structures to be implemented without using locks.
<h2>1.3 Synthesis Overview</h2>
<h3>1.3.1 Kernel Structure</h3>
<p>The Synthesis kernel is designed to support a real, full-featured operating system with functionality on the level of <span class=smallcaps>Unix</span> [28] and Mach [1]. It is built out of many small, independent modules called quajects. A quaject is an abstract data type -- a collection of code and data with a well-defined interface that performs a specific function. The interface encompasses not just the quaject's entry points, but also all its external invocations, making it possible to dynamically link quajects, thereby building up kernel services. Some examples of quajects include various kinds of queues and buffers, threads, TTY input and output editors, terminal emulators, and text and graphics windows.
<p>All higher-level kernel services are created by instantiating and linking two or more quajects through their interfaces. For example, a <span class=smallcaps>Unix</span>-like TTY device is built using the following quajects: a raw serial device driver, two queues, an input editor, an output format converter, and a system-call dispatcher. The wide choice of quajects and linkages allows Synthesis to support a wide range of different system interfaces at the user level. For example, Synthesis includes a (partial) <span class=smallcaps>Unix</span> emulator that runs some SUN-3 binaries. At the same time, a different application might use a different interface, for example, one that supports asynchronous I/O.
<h3>1.3.2 Implementation Ideas</h3>
<p>One of the ways Synthesis achieves order-of-magnitude gains in efficiency is through the technique of kernel code synthesis. Kernel code synthesis creates, on-the-fly, specialized (thus short and fast) kernel routines for specific situations, reducing the execution path for frequently used kernel calls. For example, queue quajects have their buffer and pointer addresses hard-coded using self-relative addressing; thread quajects have their system-call dispatch and context-switch code specially crafted to speed these operations. Section 3.3 illustrates the specific code created for these and other examples. This hard-coding eliminates indirection and reduces parameter passing, improving execution speed. Extensive use of the processor's self-relative addressing capability retains the benefits of relocatability and easy sharing. Shared libraries of non-specialized code handle less-frequently occurring cases and keep the memory requirements low. Chapter 3 explains this idea in detail and also introduces the idea of executable data structures, which are highly efficient "self-traversing" structures.
<p>Synthesis handles real-time data streams with fine-grain scheduling. Fine-grain scheduling measures process progress and performs frequent scheduling actions and policy adjustments at sub-millisecond intervals resulting in an adaptive, self-tuning system usable in a real-time environment. This idea is explained in chapter 6, and is illustrated with various music-synthesizer and signal-processing applications, all of which run in real time under Synthesis.
<p>Finally, lock-free optimistic synchronization increases concurrency within the multithreaded synthesis kernel and enhances Synthesis support for multiprocessors. Synthesis also includes a reentrant, optimistically-synchronized C-language runtime library suitable for use in multi-threaded and multi-processor applications written in C.
<h3>1.3.3 Implementation Language</h3>
<p>Synthesis is written in 68030 macro assembly language. Despite its obvious flaws - the lack of portability and the difficulty of writing complex programs - I chose assembler because no higher-level language provides both efficient execution and support for runtime code-generation. I also felt that it would be an interesting experiment to write a medium-size system in assembler, which allows unrestricted access to the machine's architecture, and perhaps discover new coding idioms that have not yet been captured in a higher-level language. Section 7.4.1 reports on the experience.
<p>A powerful macro facility helped minimize the difficulty of writing complex programs. It also let me postpone making some difficult system-wide design decisions, and let me easily change them after they were made. For example, quaject definition is a declarative macro in the language. The structure of this macro and the code it produced changed several times during the course of system development. Even the object-file ".o" format is defined entirely by source-code macros, not by the assembler itself, and allows for easy expansion to accommodate new ideas.
<h3>1.3.4 Target Hardware</h3>
<p>At the time of this writing, Synthesis runs on two machines: the Quamachine and the Sony NEWS 1860 workstation.
<p>The Quamachine is a home-brew, experimental 68030-based computer system designed to aid systems research and measurement. Measurement facilities include an instruction counter, a memory reference counter, hardware program tracing, and an interval timer with 20-nanosecond resolution. As their names imply, the instruction counter keeps a count of machine instructions executed by the processor, and the memory reference counter keeps a count of memory references issued by the processor. The processor can operate at any clock speed from 1 MHz up to 50 MHz. Normally it runs at 50 MHz. But by setting the processor speed to 16 MHz and introducing 1 wait-state into the memory access, the Quamachine closely matches the performance characteristics of the SUN-3/160, allowing direct measurements and comparisons with that machine and its operating system.
<p>Other features of the Quamachine include 256 kilobytes of no-wait-state ROM that holds the entire Synthesis kernel, monitor, and runtime libraries; 2 12 megabytes of no-waitstate main memory; a 2Kx2Kx8-bit framebuffer with graphics co-processor; and audio I/O devices: stereo 16-bit analog output, stereo 16-bit analog input, and a compact disc (CD) player digital interface.
<p>The Sony NEWS 1860 is a workstation with two 68030 processors. It is a commercially available machine, making Synthesis potentially accessible to other interested researchers. It has two processors, which, while not a large number, nevertheless demonstrates Synthesis multiprocessor support. While its architecture is not symmetric - one processor is the main processor and the other is the I/O processor - Synthesis treats it as if it were a symmetric multiprocessor, scheduling tasks on either processor without preference, except those that require something that is accessible from one processor and not the other.
<h3>1.3.5 <span class=smallcaps>Unix</span> Emulator</h3>
<p>A partial <span class=smallcaps>Unix</span> emulator runs on top of the Synthesis kernel and emulates certain SUNOS kernel calls [31]. Although the emulator supports only a subset of the <span class=smallcaps>Unix</span> system calls -- time constraints have forced an "implement-as-the-need-arises" strategy -- the set supported is sufficiently rich to provide many benefits. It helps with the problem of acquiring application software for a new operating system by allowing the use of SUN-3 binaries. It further demonstrates the generality of Synthesis by setting the lower bound -- emulating a widely used system. And, most important from the research point of view, it allows a direct comparison between Synthesis and <span class=smallcaps>Unix</span>. Section 7.2.1 presents measurements showing that the Synthesis emulation of <span class=smallcaps>Unix</span> is several times more efficient than native <span class=smallcaps>Unix</span> running the same set of programs on comparable hardware.
</div>
</body>
</html>

View File

@ -1,175 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a class=here href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>2. Previous Work</h1>
<div id="chapter-quote">
If I have seen farther than others, it is because<br>
I was standing on the shoulders of giants.<br>
-- Isaac Newton<br>
<br>
If I have not seen as far as others, it is because<br>
giants were standing on my shoulders.<br>
-- Hal Abelson<br>
<br>
In computer science, we stand on each other's feet.<br>
-- Brian K. Reid
</div>
<h2>2.1 Overview</h2>
<p>This chapter sketches an overview of some of the classical goals of operating system design and tells how existing designs have addressed them. This provides a background against which the new techniques in Synthesis can be contrasted. I argue that some of the classical goals need to be reconsidered in light of new requirements and point out the new goals that have steered the design of Synthesis.
<p>There are four areas in which Synthesis makes strong departures from classical designs: overall kernel structure, the pervasive use of run-time code generation, the management of concurrency and synchronization, and novel use of feedback mechanisms in scheduling. The rest of this chapter discusses each of these four topics in turn, but first, it is useful to consider some broad design issues.
<h2>2.2 The Tradeoff Between Throughput and Latency</h2>
<p>The oldest goal in building operating systems has been to achieve high performance. There are two common measures of performance: throughput and latency. Throughput is a measure of how much useful work is done per unit time. Latency is a measure of how long it takes to finish an individual piece of work. Traditionally, high performance meant increasing the throughput - performing the most work in the minimum time. But traditional ways of increasing throughput also tend to increase latency.
<p>The classic way of increasing throughput is by batching data into large chunks which are then processed together. This way, the high overhead of initiating the processing is amortized over a large quantity of data. But batching increases latency because data that could otherwise be output instead sits in a buffer, waiting while it fills, causing delays. This happens at all levels. The mainframe batch systems of the 1960's made efficient use of machines, achieving high throughput but at the expense of intolerable latency for users and grossly inefficient use of people's time. In the 1970's, the shift toward timesharing operating systems made for a slightly less efficient use of the machine, but personal productivity was enormously improved. However, calls to the operating system were expensive, which meant that data had to be passed in big, buffered chunks in order to amortize the overhead.
<p>This is still true today. For example, the Sony NEWS workstation, running Sony's version of <span class=smallcaps>Unix</span> release 4.0C (a derivative of Berkeley <span class=smallcaps>Unix</span>), takes 260 microseconds to write a single character to an I/O pipe connecting to another program. But writing 1024 characters takes 450 microseconds - a little more than twice the cost of writing a single character. Looking at it another way, over 900 characters can be written in the time taken by the invocation overhead. The reasons for using buffering are obvious. In fact, Sony's program libraries use larger, 8192-character buffers to further amortize the overhead and increase throughput. Such large-scale buffering greatly increases latency and indeed the general trend has been to parcel systems into big pieces that communicate with high overhead, compounding the delays.
<p>In light of these large overheads, it is interesting to examine the history of operating system performance, paying particular attention to the important, low-level operations that are exercised often, such as context switch and system call dispatch. We find that operating systems have historically exhibited large invocation overheads. Due to its popularity and wide availability, <span class=smallcaps>Unix</span> is one of the more-studied systems, and I use it here as a baseline for performance comparisons.
<table class=table>
<caption>
Table 2.1: Overhead of Various System Calls<br>
<small>Sony NEWS 1860 workstation, 68030 processor, 25MHz, 1 waitstate, <span class=smallcaps>Unix</span> Release 4.0C.</small>
</caption>
<tr class=head><th>System Function<th>Time for 1 char (&#181;s)<th>Time for 1024 chars (&#181;s)
<tr><th>Write to a pipe<td class=number>260<td class=number>450
<tr><th>Write to a file<td class=number>340<td class=number>420
<tr><th>Read from a pipe<td class=number>190<td class=number>610
<tr><th>Read from a file<td class=number>380<td class=number>460
<tr class=head><th>System Function<th>Time (&#181;s)
<tr><th>Dispatch system call (getpid)<td class=number>40
<tr><th>Context Switch<td class=number>170
</table>
<table class=table>
<caption>
Table 2.2: Overhead of Various System Calls, Mach<br>
<small>NeXT workstation, 68030 processor, 25MHz, 1 waitstate, Mach Release 2.1.</small>
</caption>
<tr class=head><th>System Function<th>Time for 1 char (&#181;s)<th>Time for 1024 chars (&#181;s)
<tr><th>Write to a pipe<td class=number>470<td class=number>740
<tr><th>Write to a file<td class=number>370<td class=number>600
<tr><th>Read from a pipe<td class=number>550<td class=number>760
<tr><th>Read from a file<td class=number>350<td class=number>580
<tr class=head><th>System Function<th>Time (&#181;s)
<tr><th>Dispatch system call (getpid)<td class=number>88
<tr><th>Context Switch<td class=number>510
</table>
<p>In one study, Feder compares the evolution of <span class=smallcaps>Unix</span> system performance over time and over different machines [13]. He studies the AT&amp;T releases of <span class=smallcaps>Unix</span> - System 3 and System 5 - spanning a time period from the mid-70's to late 1982 and shows that <span class=smallcaps>Unix</span> performance had improved roughly 25% during this time. Among the measurements shown is the time taken to execute the getpid (get process id) system call. This system call fetches a tiny amount of information (one integer) from the kernel, and its speed is a good indicator of the cost of the system call mechanism. For the VAX-11/750 minicomputer, Feder reports a time of 420 microseconds for getpid and 1000 microseconds for context switch.
<p>I have duplicated some of these experiments on the Sony NEWS workstation, a machine of roughly 10 times the performance of the VAX-11/750. Table 2.1 summarizes the results.<sup>1</sup> On this machine, getpid takes 40 microseconds, and a context switch takes 170 microseconds. These numbers suggest that, since 1982, the performance of <span class=smallcaps>Unix</span> has remained relatively constant compared to the speed of the hardware.
<div class=footnote><sup>1</sup> Even though the Sony machine has two processors, one of them is dedicated exclusively to handling device I/O and does not run any <span class=smallcaps>Unix</span> code. This second processor does not affect the outcome of the tests, which are designed to measure <span class=smallcaps>Unix</span> system overhead, not device I/O capacity. The file read and write benchmarks were to an in-core file system. There was no disk activity.</div>
<p>A study done by Ousterhout [22] shows that operating system speed has not kept pace with hardware speed. The reasons he finds are that memory bandwidth and disk speed have not kept up with the dramatic increases in processor speed. Since operating systems tend to make heavier use of these resources than the typical application, this has a negative effect on operating system performance relative to how the processor's speed is measured.
<p>But I believe there are further reasons for the large overhead in existing systems. As new applications demand more functionality, the tendency has been simply to layer on more functions. This can slow down the whole system because often the mere existence of a feature forces extra processing steps, regardless of whether that feature is being used or not. New features often require extra code or more levels of indirection to select from among them. Kernels become larger and more complicated, leading designers to restructure their operating systems to manage the complexity and improve understandability and maintainability. This restructuring, if not carefully done, can reduce performance by introducing extra layers and overhead where there was none before.
<p>For example, the Mach operating system offers a wide range of new features, such as threads and flexible virtual memory management, all packaged in a small, modular, easy-to-port kernel [1]. But it does not perform very well compared to Sony's <span class=smallcaps>Unix</span>. Table 2.2 shows the results of the previous experiment, repeated under Mach on the NeXT machine. Both the NeXT machine and the Sony workstation use the Motorola 68030 processor, and both run at 25MHz. All but one of the measurements show reduced performance compared to Sony's <span class=smallcaps>Unix</span>. Crucial low-level functions, such as context switch and system call dispatch, are two to three times slower in this version of Mach.
<p>Another reason for the large overheads might be that invocation overhead per se has not been subject to intense scrutiny. Designers tend to optimize the most frequently occurring special cases in the services offered, while the cases most frequently used tend to be those that were historically fast, since those are the ones people would have tended to use more. This self-reinforcing loop has the effect of encouraging optimizations that maintain the status quo with regard to relative performance, while eschewing optimizations that may have less immediate payoff but hold the promise of greater eventual return. Since large invocation overheads can usually be hidden with buffering, there has not been a large impetus to optimize in this direction.
<p>Instead of attacking the problem of high kernel overhead directly, performance problems are being solved with more buffering, applied in ever more ingenious ways to a wider array of services. Look, for example, at recent advances in thread management. A number of researchers begin with the premise that kernel thread operations are necessarily expensive, and go on to describe the implementation of a user-level threads package [17] [5] [33] [3]. Since much of the work is now done at the user-level by subscheduling one or more kernel-supplied threads, they can avoid many kernel invocations and their associated overhead.
<p>But there is a tradeoff: increased performance for operations at the user level comes with increased overhead and latency when communicating with the kernel. One reason is that kernel calls no longer happen directly, but first go through the user-level code. Another reason could be that optimizing kernel invocations is no longer deemed to be as important, since they occur less often. For example, while Anderson reports order-of-magnitude performance improvement for user-level thread operations on the DEC CVAX multiprocessor compared to the native Topaz kernel threads implementation, the cost of invoking the kernel thread operations had been increased by a factor of 5 over Topaz threads [3].
<p>The factor of 5 is significant because, ultimately, programs interact with the outside world through kernel invocations. Increasing the overhead limits the rate at which a program can invoke the kernel and therefore, interact with the outside world.
<p>Taken to the limit, the things that remain fast are those local to an application, those that can be done at user-level without invoking the kernel often. But in a world of increasing interactions and communications between machines -- all of which require kernel intervention -- I do not think this is a wise optimization strategy. Distributed computing stresses the importance of low latency, both because throughput can actually suffer if machines spend time waiting for each other's responses rather than doing work, and because there are so many interactions with other machines that even a small delay in each is magnified, leading to uneven response time to the user.
<p>Improvement is clearly required to ensure consistent performance and controlled latencies, particularly when processing richer media like interactive sound and video. For example, in an application involving 8-bit audio sampled at 8KHz, using a 4096-byte buffer leads to a 1/2-second delay per stage of processing (4096 bytes at 8000 bytes per second is roughly half a second). This is unacceptable for real-time, interactive audio work. The basic system overhead must be reduced so that time-sensitive applications can use smaller buffers, reducing latency while maintaining throughput. But there is little room for revolutionary increases in performance when the fundamental operating system mechanisms, such as system call dispatch and context switch, are slow, and furthermore, show no trend in becoming faster. In general, existing designs have not focused on lower-level, low-overhead mechanisms, preferring instead to solve performance problems with more buffering.
<p>This dissertation shows that the unusual goal of providing high throughput with low latency can be achieved. There are many factors in the design of Synthesis that accomplish this result, which will be discussed at length in subsequent chapters. But let us now consider four important aspects of the Synthesis design that depart from common precedents and trends.
<h2>2.3 Kernel Structure</h2>
<h3>2.3.1 The Trend from Monolithic to Diffuse</h3>
<p>Early kernels tended to be large, isolated, monolithic structures that were hard to maintain. IBM's MVS is a classic example [11]. <span class=smallcaps>Unix</span> initially embodied the "small is beautiful" ideal [28]. It captured some of the most elegant ideas of its day in a kernel design that, while still monolithic, was small, easy to understand and maintain, and provided a synergistic, productive, and highly portable set of system tools. However, its subsequent evolution and gradual accumulation of new services resulted in operating systems like System V and Berkeley's BSD 4.3, whose large, sprawling kernels hearken back to MVS.
<p>These problems became apparent to several research teams, and a number of new system projects intended to address the problem were begun. For example, recognizing the need for clean, elegant services, the Mach group at CMU started with the BSD kernel and factored services into user-level tasks, leaving behind a very small kernel of common, central services [1]. Taking a different approach, the Plan 9 group at AT&amp;T Bell Laboratories chose to carve the monolithic kernel into three sub-kernels, one for managing files, one for computation, and one for user interfaces [24]. Their idea is to more accurately and flexibly fit the networks of heterogeneous machines that are common in large organizations today.
<p>There are difficulties with all these approaches. In the case of Mach, the goal of kernelizing the system by placing different services into separate user-level tasks forces additional parameter passing and context switches, adding overhead to every kernel invocation. Communication between the pieces relies heavily on message passing and remote procedure call. This adds considerable overhead despite the research that has gone into making them fast [12]. While Mach has addressed the issues of monolithic design and maintainability, it exacerbates the overhead and latency of system services. Plan 9 has chosen to focus on a particular cut of the system: large networks of machines. While it addresses the chosen problem well and extends the productive virtues of <span class=smallcaps>Unix</span>, its arrangement may not be as suitable for other machine topologies or features, for example, the isolated workstation in a private residence, or those with richer forms of input and output, such as sound and video, which I believe will be common in the near future.
<p>In a sense, kernelized systems can hide ugliness by partitioning it away. The kernel alone is not useful without a great deal of affiliated user-level service. Many papers publish numbers touting small kernel sizes but these hide the large amount of code that has been moved to user-level services. Some people argue that the size of user-level services does not count as much, because they are pageable and are not constrained to occupy real memory. But I argue: is it really a good idea to page out operating system services? This can only result in increased latency and unpredictable response time.
<p>In general, I agree that the diffusion of the kernel structure is a good idea but find it unfortunate that current-generation kernelized systems tend to be slow, even in spite of ongoing efforts to make them faster. Perhaps people commonly accept that some loss of performance is the inevitable result of partitioning, and are willing to suffer that loss in return for greatly increased maintainability and extensibility.
<p>My dissertation shows that this need not be the case: Synthesis addresses the issues of structuring and performance. Its quaject-based kernel structure keeps the modularity, protection, and extensibility demanded of modern-day operating systems. At the same time Synthesis delivers performance an order of magnitude better than existing systems, as evidenced by the experiments in Chapter 7. Its kernel services are subdivided into even finer chunks than kernelized systems like Mach. Any service can be composed of pieces that run at either user- or kernel-level: the distinction is blurred.
<p>Synthesis breaks the batch-mode thinking that has led to systems that wait for all the data to arrive before any subsequent processing is allowed to take place, when in fact subsequent processing could proceed in parallel with the continuing arrival of data. Witness a typical system's handling of network packets: the whole packet is received, buffered, and checksummed before being handed over for further processing, when instead the address fields could be examined and lookups performed in parallel with the reception of the rest of the packet, reducing packet handling latency. Some network gateways do this type of cut-through routing for packet forwarding. But in a general-purpose operating system, the high overhead of system calls and context switches in existing systems discourages this type of thinking in preference to batching. By reconsidering the design, Synthesis compounds the savings. Low-overhead system calls and context switches encourage frequent use to better streamline processing and take advantage of the inherent parallelism achieved by a pipeline, reducing overhead and latency even further.
<h3>2.3.2 Services and Interfaces</h3>
<p>A good operating system provides numerous useful services to make applications easy to write and easy to interconnect. To this end, it establishes conventions for packaging applications so that formats and interfaces are reasonably well standardized. The conventions encompass two forms: the model, which refers to the set of abstractions that guide the overall thinking and design; and the interface, which refers to the set of operations supported and how they are invoked. Ideally, we want a simple model, a powerful interface, and high performance. But these three are often at odds.
<p>Witness the MVS I/O system, which has a complex model but offers a powerful interface and high performance. Its numerous options offer the benefit of detailed, precise control over each device, but with the drawback that even simple I/O requires complex programming.
<p><span class=smallcaps>Unix</span> is at the other end of the scale. <span class=smallcaps>Unix</span> promoted the idea of encapsulating I/O in terms of a single, simple abstraction. All common I/O is accomplished by reading or writing a stream of bytes to a file-like object, regardless of whether the I/O is meant to be viewed on the user's terminal, stored as a file on disk, or used as input to another program. Treating I/O in a common manner offers great convenience and utility. It becomes trivial to write and test a new program, viewing its output on the screen. Once the program is working, the output can be sent to the intended file on disk without changing a line of code or recompiling.
<p>But an oversimplified model of I/O brings with it a loss of precise control. This loss is not important for the great many <span class=smallcaps>Unix</span> tools -- it is more than compensated by the synergies of a diverse set of connectable programs. But other, more complex applications such as a database management system (DBMS) require more detailed control over I/O [30]. Minimally, for a DBMS to provide reasonable crash recovery, it must know when a write operation has successfully finished placing the data on disk; in <span class=smallcaps>Unix</span>, a write only copies the data to a kernel buffer, and movement of data from there to disk occurs later, asynchronously, so in the event of an untimely crash, data waiting in the buffers will be lost. Furthermore, a well-written DBMS has a good idea as to which areas of a file are likely to be needed in the future and its performance improves if this knowledge can be communicated to the operating system; by contrast, <span class=smallcaps>Unix</span> hides the details of kernel buffering, impeding such optimizations in exchange for a simpler interface.
<p>Later versions of <span class=smallcaps>Unix</span> extended the model, making up some of the loss, but these extensions were not "clean" in the sense of the original <span class=smallcaps>Unix</span> design. They were added piecemeal as the need arose. For example, ioctl (for I/O controls) and the select system call help support out-of-band stream controls and non-blocking (polled) I/O, but these solutions are neither general nor uniform. Furthermore, the granularity with which <span class=smallcaps>Unix</span> considers an operation "non-blocking" is measured in tens of milliseconds. While this was acceptable for the person-typing-on-a-terminal mode of user interaction of the early 1980's, it is clearly inappropriate for handling higher rate interactive data, such as sound and video.
<p>Interactive games and real-time processing are two examples of areas where the classic models are insufficient. <span class=smallcaps>Unix</span> and its variants have no asynchronous read, for example, that would allow a program to monitor the keyboard while also updating the player's screen. A conceptually simple application to record a user's typing along with its timing and later play it back with the correct timing takes several pages of code to accomplish under <span class=smallcaps>Unix</span>, and then it cannot be done well enough if, say, instead of a keyboard we have a musical instrument.
<p>The newer systems, such as Mach, provide extensions and new capabilities but within the framework of the same basic model, hence the problems persist. The result is that the finer aspects of stream control, of real-time processing, or of the handling of time-sensitive data in general have not been satisfactorily addressed in existing systems.
<h3>2.3.3 Managing Diverse Types of I/O</h3>
<p>The multiplexing of I/O and handling of the machine's I/O devices is one of the three most important functions of an operating system. (Managing the processor and memory are the other two.) It is perhaps the most difficult function to perform well, because there can be many different types of I/O devices, each with its own special features and requirements.
<p>Existing systems handle diverse types of I/O devices by defining a few common internal formats for I/O and mapping each device to the closest one. General-purpose routines in the kernel then operate on each format. <span class=smallcaps>Unix</span>, for example, has two major internal formats, which they call "I/O models": the block model for disk-like devices and the character model for terminal-like devices [26].
<p>But common formats force compromise. There is a performance penalty paid when mismatches between the native device format and the internal format make translations necessary. These translations can be expensive if the "distance" between the internal format and a particular device is large. In addition, some functionality might be lost, because common formats, however general, cannot capture everything. There could be some features in a device that do not map well into the chosen format and those features become difficult if not impossible to access. Since operating systems tend to be structured around older formats, chosen at a time when the prevalent I/O devices were terminals and disks, it is not surprising that they have difficulty handling the new rich media devices, such as music and video.
<p>Synthesis breaks this tradeoff. The quaject structuring of the kernel allows new I/O formats to be created to suit the performance characteristics of unusual devices. Indeed, it is not inconceivable that every device has its own format, specially tailored to precisely fit its characteristics. Differences between a device format and what the application expects are spanned using translation, as in existing systems. But unlike existing systems, where translation is used to map into a common format, Synthesis maps directly from the device format to the needs of the application, eliminating the intermediate, internal format and its associated buffering and translation costs. This lets knowledgeable applications use the highly efficient device-level interfaces when very high performance and detailed control are of utmost importance, but also preserves the ability of any application to work with any device, as in the <span class=smallcaps>Unix</span> common-I/O approach. Since the code is runtime-generated for each specific translation, performance is good. The efficient emulation of <span class=smallcaps>Unix</span> under Synthesis bears witness to this.
<h3>2.3.4 Managing Processes</h3>
<p>Managing the machine's processors is the second important function of an operating system. It involves two parts: multiplexing the processors among the tasks, and controlling task execution and querying its state. But in contrast to the many control and query functions offered for I/O, existing operating systems provide only limited control over task execution. For example, the <span class=smallcaps>Unix</span> system call for this purpose, ptrace, works only between a parent task and its children. It is archaic and terribly inefficient, meant solely for use by debuggers and apparently implemented as an afterthought. Mach threads, while supporting some rudimentary calls, sometimes lack desirable generality: a Mach thread cannot suspend itself, for example, and the thread controls do not work between threads in different tasks.
<p>In this sense, Mach threads only add parallelism to an existing abstraction - the <span class=smallcaps>Unix</span> process - and do not develop the thread idea to its fullest potential. Both these systems lack general functions to start, stop, query, and modify an arbitrary task's execution without arrangements having been made beforehand, for example, by starting the task from within a debugger.
<p>In contrast, Synthesis provides detailed thread control, comparable to the level of control found for other operating system services, such as I/O. Section 4.3.2 lists the operations supported, which work between any pair of threads, even between unrelated threads in different address spaces and even on the kernel's threads, if there is sufficient privilege. Because of their exceptionally low overhead - only ten to twenty times the cost of a null procedure call - they provide unprecedented data collection and measurement abilities and unparalleled support for debuggers.
</div>
</body>
</html>

View File

@@ -1,543 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Chapter 3</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a class=here href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Chapter 3
</div>
<div id="content">
<h1>3. Kernel Code Generator</h1>
<div id="chapter-quote">
For, behold, I create new heavens and a new earth.<br>
-- The Bible, Isaiah
</div>
<h2>3.1 Fundamentals</h2>
<p>Kernel code synthesis is the name given to the idea of creating executable machine code at runtime as a means of improving operating system performance. This idea distinguishes Synthesis from all other operating systems research efforts, and is what helps make Synthesis efficient.
<p>Runtime code generation is the process of creating executable machine code during program execution for use later during the same execution [16]. This is in contrast to the usual way, where all the code that a program runs has been created at compile time, before program execution starts. In the case of an operating system kernel like Synthesis, the "program" is the operating system kernel, and the term "program execution" refers to the kernel's execution, which lasts from the time the system is started to the time it is shut down.
<p>There are performance benefits in doing runtime code generation because there is more information available at runtime. Special code can be created based on the particular data to be processed, rather than relying on general-purpose code that is slower. Runtime code generation can extend the benefits of detailed compile-time analysis by allowing certain data-dependent optimizations to be postponed to runtime, where they can be done more effectively because there is more information about the data. We want to make the best possible use of the information available at compile-time, and use runtime code generation to optimize data-dependent execution.
<p>The goal of runtime code generation can be stated simply:
<blockquote>Never evaluate something more than once.</blockquote>
<p>For example, suppose that the expression, <em>A * A + A * B + B * B</em> is to be evaluated for many different A while holding B = 1. It is more efficient to evaluate the reduced expression obtained by replacing B with 1: <em>A * A + A + 1</em>. Finding opportunities for such optimizations and performing them is the focus of this chapter.
<p>The problem is one of knowing how soon we can know what value a variable has, and how that information can be used to improve the program's code. In the previous example, if it can be deduced at compile time that B = 1, then a good compiler can perform precisely the reduction shown. But usually we can not know ahead of time what value a variable will have. B might be the result of a long calculation whose value is hard if not impossible to predict until the program is actually run. But when it is run, and we know B, runtime code generation allows us to use the newly-acquired information to reduce the expression.
<p>Specifically, we create specialized code once the value of B becomes known, using an idea called partial evaluation [15]. Partial evaluation is the building of simpler, easier-to-evaluate expressions from complex ones by substituting variables that have a known, constant value with that constant. When two or more of these constants are combined in an arithmetic or logical operation, or when one of the constants is an identity for the operation, the operation can be eliminated. In the previous example, we no longer have to compute B * B, since we know it is 1, and we do not need to compute A * B, since we know it is A.
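<p>As a deliberately simplified illustration of the shape of this idea, the C sketch below specializes the expression once B is known. All names here are invented for the example, and portable C can only select among precompiled specializations and capture constants, whereas Synthesis emits machine code from templates; but the structure -- a creation step that runs once when B becomes known, and a cheap specialized function used thereafter -- is the same.
<div class=code>
<pre>
/* Hedged sketch, not Synthesis source: specialize A*A + A*B + B*B for a
 * fixed B.  poly_b1 is the reduced form A*A + A + 1 from the text. */
typedef long (*poly_fn)(long a);

static long B_val;                                    /* captured constant B */

static long poly_b0(long a)  { return a * a; }                 /* B == 0 */
static long poly_b1(long a)  { return a * a + a + 1; }         /* B == 1 */
static long poly_gen(long a) { return a*a + a*B_val + B_val*B_val; } /* any B */

/* The "create" step: run once, when B becomes known. */
static poly_fn specialize(long b)
{
    B_val = b;
    if (b == 0) return poly_b0;
    if (b == 1) return poly_b1;
    return poly_gen;
}
</pre>
</div>
<p>A caller would invoke <em>specialize(B)</em> once and then apply the returned function to each A, paying the reduction cost only once.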
<p>There are strong parallels between runtime code generation and compiler code generation, and many of the ideas and terminology carry over from one to the other. Indeed, anything that a compiler does to create executable code can also be performed at runtime. But because compilation is an off-line process, there is usually less concern about the cost of code generation and therefore one has a wider palette of techniques to choose from. A compiler can afford to use powerful, time-consuming analysis methods and perform sophisticated optimizations - a luxury not always available at runtime.
<p>Three optimizations are of special interest to us, not only because they are easy to do, but because they are also effective in improving code quality. They are: <em>constant folding</em>, <em>constant propagation</em>, and <em>procedure inlining</em>. Constant folding replaces constant expressions like 5 * 4 with the equivalent value, 20. Constant propagation replaces variables that have known, constant value with that constant. For example, the fragment <em>x = 5; y = 4 * x;</em> becomes <em>x = 5; y = 4 * 5;</em> through constant propagation; <em>4 * 5</em> then becomes <em>20</em> through constant folding. Procedure inlining substitutes the body of a procedure, with its local variables appropriately renamed to avoid conflicts, in place of its call.
<p>There are three costs associated with runtime code generation: creation cost, paid each time a piece of code is created; execution cost, paid each time the code is used; and management costs, to keep track of where the code is and how it is being used. The hope is to use the information available at runtime to create better code than would otherwise be possible. In order to win, the savings of using the runtime-created code must exceed the cost of creating and managing that code. This means that for many applications, a fast code generator that creates good code will be superior to a slow code generator that creates excellent code. (The management problem is analogous to keeping track of ordinary, heap-allocated data structures, and the costs are similar, so they will not be considered further.)
<p>Synthesis focuses on techniques for implementing very fast runtime code generation. The goal is to broaden its applicability and extend its benefits, making it cheap enough so that even expressions and procedures that are not re-used often still benefit from having their code custom-created at runtime. To this end, the places where runtime code generation is used are limited to those where it is clear at compile time what the possible reductions will be. The following paragraphs describe the idea, while the next section describes the specific techniques.
<p>A fast runtime code generator can be built by making full use of the information available at compile time. In our example, we know at compile time that B will be held constant, but we do not know what the constant will be. But we can predict at compile-time what form the reduced expression will have: <em>A * A + C1 * A + C2</em>. Using this knowledge, we can build a simple code generator for the expression that copies a code template representing <em>A * A + C1 * A + C2</em> into newly allocated memory and computes and fills the constants: <em>C1 = B</em> and <em>C2 = B * B</em>. A code template is a fragment of code which has been compiled but contains "holes" for key values.
<p>Optimizations to the runtime-created code can also be pre-computed. In this example, interesting optimizations occur when B is 0, 1, or a power of two. Separate templates for each of these cases allow the most efficient code possible to be generated. The point is that there is plenty of information available at compile time to allow not just simple substitution of variables by constants, but also interesting and useful optimizations to happen at runtime with minimal analysis.
<p>The general idea is: treat runtime code generation as if it were just another "function" to be optimized, and apply the idea of partial evaluation recursively. That is, just as in the previous example we partially-evaluate the expression <em>A * A + A * B + B * B</em> with respect to the variable held constant, we can partially-evaluate the optimizations with respect to the parameters that the functions will be specialized under, with the result being specialized code-generator functions.
<p>Looking at a more complex example, suppose that the compiler knows, either through static control-flow analysis, or simply by the programmer telling it through some directives, that the function <em>f(p1, ...) = 4 * p1 + ...</em> will be specialized at runtime for constant p1. The compiler can deduce that the expression <em>4 * p1</em> will reduce to a constant, but it does not know what particular value that constant will have. It can capture this knowledge in a custom code generator for f that computes the value <em>4 * p1</em> when p1 becomes known and stores it in the correct spot in the machine code of the specialized function f, bypassing the need for analysis at runtime. In another example, consider the function g, <em>g(p1, ...) = if(p1 != 10) S1; else S2;</em>, also to be specialized for constant parameter p1. Since parameter p1 will be constant, we know at compile time that the if-statement will be either always true, or always false. We just don't know which. But again, we can create a specialized generator for g, one that evaluates the conditional when it becomes known and emits either S1 or S2 depending on the result.
<p>The idea applies recursively. For example, once we have a code generator for a particular kind of expression or statement, that same generator can be used each time that kind of expression occurs, even if it is in a different part of the program. Doing this limits the proliferation of code generators and keeps the program size small. The resulting runtime code generator has a hierarchical structure, with generators for the large functions calling sub-generators to create the individual statements, which in turn call yet lower-level generators, and so on, until at the bottom we have very simple generators that, for example, move a constant into a machine register in the most efficient way possible.
<h2>3.2 Methods of Runtime Code Generation</h2>
The three methods Synthesis uses to create machine code are: <em>factoring invariants</em>, <em>collapsing layers</em>, and <em>executable data structures</em>.
<h3>3.2.1 Factoring Invariants</h3>
<p>The factoring invariants method is equivalent to partial evaluation where it is known at compile time the variables over which a function will be partially evaluated. It is based on the observation that a functional restriction is usually easier to calculate than the original function. Consider a general function:
<blockquote><em>
F<sub>big</sub>(p1, p2, ... , pn)
</em></blockquote>
If we know that parameter p1 will be held constant over a set of invocations, we can factor it out to obtain an equivalent composite function:
<blockquote><em>
[ F<sup>create</sup>(p1) ] (p2, ... , pn) &#8801; F<sub>big</sub>(p1, p2, ... , pn)
</em></blockquote>
F<sup>create</sup> is a second-order function. Given the parameter p1, F<sup>create</sup> returns another function, F<sub>small</sub>, which is the restriction of F<sub>big</sub> that has absorbed the constant argument p1:
<blockquote><em>F<sub>small</sub>(p2, ... , pn) &#8834; F<sub>big</sub>(p1, p2, ... , pn)</em></blockquote>
If F<sup>create</sup> is independent of global data, then for a given p1, F<sup>create</sup> will always compute the same F<sub>small</sub> regardless of global state. This allows F<sup>create</sup>(p1) to be evaluated once and the resulting F<sub>small</sub> used thereafter. If F<sub>small</sub> is executed m times, generating and using it pays off when
<blockquote><em>
Cost(F<sup>create</sup>) + m * Cost(F<sub>small</sub>) < m * Cost(F<sub>big</sub>)
</em></blockquote>
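<p>To make the tradeoff concrete, suppose (with purely illustrative numbers, not measurements from Synthesis) that generating F<sub>small</sub> costs 1000 cycles, that each call to F<sub>big</sub> costs 100 cycles, and that each call to F<sub>small</sub> costs 60 cycles. Then the inequality reads
<blockquote><em>
1000 + 60 m &lt; 100 m
</em></blockquote>
which holds for m &gt; 25, so the generated code pays for itself after 25 invocations; for code on a path exercised on every read or every interrupt, that point is reached almost immediately.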
As the "factoring invariants" name suggests, this method resembles the constant propagation and constant folding optimizations done by compilers. The analogy is strong, but the difference is also significant. Constant folding eliminates static code and calculations. In addition, Factoring Invariants can also simplify dynamic data structure traversals that depend on the constant parameter p1.
<p>For example, we can apply this idea to improve the performance of the read system function. When reading a particular file, constant parameters include the device that the file resides on, the address of the kernel buffers, and the process performing the read. We can use file open as F<sup>create</sup>; the F<sub>small</sub> it generates becomes our read function. F<sup>create</sup> consists of many small procedure templates, each of which knows how to generate code for a basic operation such as "read disk block", "process TTY input", or "enqueue data." The parameters passed to F<sup>create</sup> determine which of these code-generating procedures are called and in what order. The final F<sub>small</sub> is created by filling these templates with addresses of the process table, device registers, and the like.
<h3>3.2.2 Collapsing Layers</h3>
<p>The collapsing layers method is equivalent to procedure inlining where it is known at compile time which procedures might be inlined. It is based on the observation that in a layered design, separation between layers is a part of specification, not implementation. In other words, procedure calls and context switches between functional layers can be bypassed at execution time. Let us consider an example from the layered OSI model:
<blockquote><em>
F<sub>big</sub>(p1, p2, ... , pn) &#8801; F<sub>applica</sub>(p1, F<sub>present</sub>(p2, F<sub>session</sub>( ... F<sub>datalnk</sub>(pn) ... )))
</em></blockquote>
F<sub>applica</sub> is a function at the Application layer that calls successive lower layers to send a message. Through in-line code substitution of F<sub>present</sub> in F<sub>applica</sub>, we can obtain an equivalent flat function by eliminating the procedure call from the Application to the Presentation layer:
<blockquote><em>
F<sub>flatapplica</sub>(p1, p2, F<sub>session</sub>( ... )) &#8801; F<sub>applica</sub>(p1, F<sub>present</sub>(p2, F<sub>session</sub>( ... )))
</em></blockquote>
The process to eliminate the procedure call can be embedded into two second-order functions. F<sup>create</sup>present returns code equivalent to F<sub>present</sub> and suitable for in-line insertion. F<sup>create</sup>applica incorporates that code to generate F flatapplica.
<blockquote><em>
F<sup>create</sup><sub>applica</sub>(p1, F<sup>create</sup><sub>present</sub>(p2, ... )) &#8594; F<sub>flatapplica</sub>(p1, p2, ... )
</em></blockquote>
This technique is analogous to in-line code substitution for procedure calls in compiler code generation. In addition to the elimination of procedure calls, the resulting code typically exhibits opportunities for further optimization, such as Factoring Invariants and elimination of data copying.
<p>By induction, F<sup>create</sup><sub>present</sub> can eliminate the procedure call to the Session layer, and down through all layers. When we execute F<sup>create</sup><sub>flatapplica</sub> to establish a virtual circuit, the F<sub>flatapplica</sub> code used thereafter to send and receive messages may consist of only sequential code. The performance gain analysis is similar to the one for factoring invariants.
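<p>The following C fragment suggests, in toy form, what collapsing layers buys. The layer functions are made up for the illustration (they are not a real protocol stack): the layered version pays one procedure call per layer on every send, while the flattened version, which a layer-collapsing generator could emit once the composition is known, is straight-line code with further folding opportunities exposed.
<div class=code>
<pre>
/* Hedged sketch, not Synthesis source: a four-layer call chain and its
 * collapsed equivalent.  The per-layer "work" is a stand-in expression. */
static int datalnk_send(int x) { return x ^ 0x55; }
static int session_send(int x) { return datalnk_send(x) + 1; }
static int present_send(int x) { return session_send(x) * 2; }
static int applica_send(int x) { return present_send(x) - 3; }

/* Flat version a layer-collapsing generator could produce: one function,
 * no internal calls, equivalent to applica_send(x). */
static int flat_send(int x)    { return ((x ^ 0x55) + 1) * 2 - 3; }
</pre>
</div>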
<h3>3.2.3 Executable Data Structures</h3>
<p>The executable data structures method reduces the traversal time of data structures that are frequently traversed in a preferred way. It works by storing node-specific traversal code along with the data in each node, making the data structure self-traversing.
<p>Consider an active job queue managed by a simple round-robin scheduler. Each element in the queue contains two short sequences of code: <em>stopjob</em> and <em>startjob</em>. The <em>stopjob</em> saves the registers and branches into the next job's <em>startjob</em> routine (in the next element in queue). The <em>startjob</em> restores the new job's registers, installs the address of its own <em>stopjob</em> in the timer interrupt vector table, and resumes processing.
<p>An interrupt causing a context switch will execute the current program's <em>stopjob</em>, which saves the current state and branches directly into the next job's <em>startjob</em>. Note that the scheduler has been taken out of the loop. It is the queue itself that does the context switch, with a critical path on the order of ten machine instructions. The scheduler intervenes only to insert and delete elements from the queue.
<h3>3.2.4 Performance Gains</h3>
<p>Runtime code generation and partial evaluation can be thought of as a way of caching frequently visited states. It is interesting to contrast this type of caching with the caching that existing systems do using ordinary data structures. Generally, systems use data structures to capture state and remember expensive-to-compute values. For example, when a file is opened, a data structure is built to describe the file, including its location on disk and a pointer to the procedure to be used to read it. The read procedure interprets state stored in the data structure to determine what work is to be done and how to do it.
<p>In contrast, code synthesis encodes state directly into generated procedures. The resulting performance gains extend beyond just saving the cost of interpreting a data structure. To see this, let us examine the performance gains obtained from hard-wiring a constant directly into the code compared to fetching it from a data structure. Hardwiring embeds the constant in the instruction stream, so there is an immediate savings that comes from eliminating one or two levels of indirection and obviating the need to pass the structure pointer. These can be attributed to "saving the cost of interpretation." But hardwiring also opens up the possibility of further optimizations, such as constant folding, while fetching from a data structure admits no such optimizations. Constant folding becomes possible because once it is known that a parameter will be, say, 2, all pure functions of that parameter will likewise be constant and can be evaluated once and the constant result used thereafter. A similar flavor of optimization arises with IF-statements. In the code fragment "if(C) S1; else S2;", where the conditional, C, depends only on constant parameters, the generated code will contain either S1 or S2, never both, and no test. It is with this cascade of optimization possibilities that code synthesis obtains its most significant performance gains. The following section illustrates some of the places in the kernel where runtime code generation is used to advantage.
<h2>3.3 Uses of Code Synthesis in the Kernel</h2>
<h3>3.3.1 Buffers and Queues</h3>
<p>Buffers and queues can be implemented more efficiently with runtime code generation than without.
<div class=code>
<pre>
char buf[100], *bufp = &amp;buf[0], *endp = &amp;buf[100];
Put(c)
{
    *bufp++ = c;
    if(bufp == endp)
        flush();
}
Put: // (character is passed register d0)
move.l (bufp),a0 // (1) Load buffer pointer into register a0
move.b d0,(a0)+ // (2) Store the character and increment the a0 register
move.l a0,(bufp) // (3) Update the buffer pointer
cmp.l (endp),a0 // (4) Test for end-of-buffer
beq flush // ... if end, jump to flush routine
rts // ... otherwise return
</pre>
<p class=caption>Figure 3.1: Hand-crafted assembler implementation of a buffer</p>
</div>
<p>Figure 3.1 shows a good, hand-written 68030 assembler implementation of a buffer.
<p>The C language code illustrates the intended function, while the 68030 assembler code shows the work involved. The work consists of: (1) loading the buffer pointer into a machine register; (2) storing the character in memory while incrementing the pointer register; (3) updating the buffer pointer in memory; and (4) testing for the end-of-buffer condition. This fragment executes in 28 machine cycles not counting the procedure call overhead.
<div class=code>
<pre>
Put: // (character is passed register d0)
move.l (P),a0 // Load buffer pointer into register a0
move.b d0,(a0,D) // Store the character
addq.w #1,(P+2) // Update the buffer pointer and test if reached end
beq flush // ... if end, jump to flush routine
rts // ... otherwise return
</pre>
<p class=caption>Figure 3.2: Better buffer implementation using code synthesis</p>
</div>
<table class=table>
<caption>
Table 3.1: CPU Cycles for Buffer-Put<br>
<small>68030 CPU, 25MHz, 1-wait-state main memory</small>
</caption>
<tr class=head><th><th>Cold cache<th>Warm cache
<tr><th>Code-synthesis (CPU cycles)<td class=number>29<td class=number>20
<tr><th>Hand-crafted assembly (CPU cycles)<td class=number>37<td class=number>28
<tr><th>Speedup<td class=number>1.4<td class=number>1.4
</table>
<p>Figure 3.2 shows the code-synthesis implementation of a buffer, which is 40% faster. Table 3.1 gives the actual measurements. The improvement comes from the elimination of the cmp instruction, for a savings of 8 cycles. The code relies on the implicit test for zero that occurs at the end of every arithmetic operation. Specifically, we arrange that the lower 16 bits of the pointer variable be zero when the end of buffer is reached, so that incrementing the pointer also implicitly tests for end-of-buffer.
<p>This is done for a general pointer as follows. The original bufp pointer is represented as the sum of two quantities: a pointer-like variable, <em>P</em>, and a constant displacement, <em>D</em>. Their sum, <em>P + D</em>, gives the current position in the buffer, and takes the place of the original bufp pointer. The character is stored in the buffer using the "<em>move.b d0,(a0,D)</em>" instruction which is just as fast as a simple register-indirect store. The displacement, <em>D</em>, is chosen so that when <em>P + D</em> points to the end of the buffer, P is 0 modulo 2<sup>16</sup>, that is, the least significant 16 bits of <em>P</em> are zero. The "<em>addq.w #1,(P+2)</em>" instruction then increments the lower 16 bits of the buffer pointer and also implicitly tests for end-of-buffer, which is indicated by a 0 result. For buffer sizes greater than 2<sup>16</sup> bytes, the flush routine can propagate the carry-out to the upper bits, flushing the buffer when the true end is reached.
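<p>A portable C rendering of the pointer arrangement is sketched below; all names are invented for the illustration. Note that C cannot express the real payoff -- embedding <em>D</em> as an immediate constant in runtime-generated code -- so the sketch shows only how splitting the pointer into <em>P</em> and <em>D</em> lets the increment double as the end-of-buffer test.
<div class=code>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Hedged sketch of the P+D buffer trick.  The current position is P + D;
 * D points one past the end of the buffer and P runs from -BUFSIZE up to 0,
 * so the low 16 bits of P become zero exactly when the buffer is full. */
#define BUFSIZE 100
static char     buf[BUFSIZE];
static char    *D = buf + BUFSIZE;     /* constant displacement            */
static int32_t  P = -BUFSIZE;          /* counter; P + D == &amp;buf[0] at start */

static void flush(void)
{
    fwrite(buf, 1, BUFSIZE, stdout);   /* stand-in for the real flush      */
    P = -BUFSIZE;
}

static void put(char c)
{
    D[P] = c;                          /* store at position P + D          */
    if ((uint16_t)++P == 0)            /* increment also tests for the end */
        flush();
}
</pre>
</div>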
<table class=table>
<caption>
Table 3.2: Comparison of C-Language "stdio" Libraries
</caption>
<tr class=head><th>10<sup>7</sup> Executions of:<th>Execution time, seconds<th>Size: Bytes/Invocation
<tr><th><span class=smallcaps>Unix</span> "putchar" macro<td>21.4 user; 0.1 system<td class=number>132
<tr><th>Synthesis "putchar" macro<td>13.0 user; 0.1 system<td class=number>30
<tr><th>Synthesis "putchar" function<td>19.0 user; 0.1 system<td class=number>8
</table>
<p>This performance gain can only be had using runtime code generation, because <em>D</em> must be a constant, embedded in the buffer's machine code, to take advantage of the fast memory-reference instruction. Were <em>D</em> a variable, the loss of fetching its value and indexing would offset the gain from eliminating the compare instruction. The 40% savings is significant because buffers and queues are used often. Another advantage is improved locality of reference: code synthesis puts both code and data in the same page of memory, increasing the likelihood of cache hits in the memory management unit's address translation cache.
<p>Outside the kernel, the Synthesis implementation of the C-language I/O library, "stdio," uses code-synthesized buffers at the user level. In a simple experiment, I replaced the <span class=smallcaps>Unix</span> stdio library with the Synthesis version. I compiled and ran a simple test program that invokes the putchar macro ten million times, using first the native <span class=smallcaps>Unix</span> stdio library supplied with the Sony NEWS workstation, and then the Synthesis version. Table 3.2 shows the Synthesis macro version is 1.6 times faster, and over 4 times smaller, than the <span class=smallcaps>Unix</span> version.
<p>The drastic reduction in code size comes about because code synthesis can take advantage of the extra knowledge available at runtime to eliminate execution paths that cannot be taken. The putchar operation, as defined in the C library, actually supports three kinds of buffering: block-buffered, line-buffered and unbuffered. Even though only one of these can be in effect at any one time, the C putchar macro must include code to handle all of them, since it cannot know ahead of time which one will be used. In contrast, code synthesis creates only the code handling the kind of buffering actually desired for the particular file being written to. Since putchar, being a macro, is expanded in-line every time it appears in the source code, the savings accumulate rapidly.
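<p>The size reduction can be visualized with the following hedged C sketch; the structure and names are invented, not the Synthesis stdio source. The generic routine must test the buffering mode on every character, while the routine created for a block-buffered file contains only the path that file can actually take.
<div class=code>
<pre>
#include &lt;unistd.h&gt;

enum xmode { XBLOCK, XLINE, XNONE };                 /* buffering modes */
struct xfile { int fd; enum xmode mode; char *base, *p, *end; };

/* Generic version: carries code for all three modes. */
static int generic_putchar(struct xfile *f, int c)
{
    char ch = (char)c;
    if (f->mode == XNONE)                            /* unbuffered      */
        return write(f->fd, &amp;ch, 1) == 1 ? c : -1;
    *f->p++ = ch;
    if (f->p == f->end || (f->mode == XLINE &amp;&amp; ch == '\n')) {
        write(f->fd, f->base, (size_t)(f->p - f->base));
        f->p = f->base;
    }
    return c;
}

/* What a specialized block-buffered version reduces to: no mode tests. */
static int block_putchar(struct xfile *f, int c)
{
    *f->p++ = (char)c;
    if (f->p == f->end) {
        write(f->fd, f->base, (size_t)(f->p - f->base));
        f->p = f->base;
    }
    return c;
}
</pre>
</div>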
<p>Table 3.2 also shows that the Synthesis "putchar" function is slightly faster than the <span class=smallcaps>Unix</span> macro - a dramatic result: even with the added procedure call overhead, code synthesis still shows a speed advantage over conventional code in-lined with a macro.
<h3>3.3.2 Context Switches</h3>
<p>One reason that context switches are expensive in traditional systems like <span class=smallcaps>Unix</span> is that they always save and restore the entire CPU context, even though that may not be necessary. For example, a process that has not used floating point since it was switched in does not need to have its floating-point registers saved when it is switched out. Another reason is that saving context is often implemented as a two-step procedure: the CPU registers are first placed in a holding area, freeing them so they can be used to perform calculations and traverse data structures to find out where the context is to be put; the context is then copied there from the holding area.
<p>A Synthesis context switch takes less time because only the part of the context being used is preserved, not all of it, and because the critical path traversing the ready queue is minimized with an executable data structure.
<p>The first step is to know how much context to preserve. Context switches can happen synchronously or asynchronously with thread execution. Asynchronous context switches are the result of external events forcing preemption of the processor, for example, at the end of a CPU quantum. Since they can happen at any time, it is hard to know in advance how much context is being used, so we preserve all of it. Synchronous context switches, on the other hand, happen as a result of the thread requesting them, for example, when relinquishing the CPU to wait for an I/O operation to finish. Since they occur at specific, well-defined points in the thread's execution, we can know exactly how much context will be needed and therefore can arrange to preserve only that much. For example, suppose a read procedure needs to block and wait for I/O to finish. Since it has already saved some registers on the stack as part of the normal procedure-call mechanism, there is no need to preserve them again as they will only be overwritten upon return.
<div class=code>
<pre>
proc:
:
:
{Save necessary context}
bsr swtch
res:
{Restore necessary context}
:
:
swtch:
move.l (Current),a0 // (1) Get address of current thread's TTE
move.l sp,(a0) // (2) Save its stack pointer
bsr find_next_thread // (3) Find another thread to run
move.l a0,(Current) // (4) Make that one current
move.l (a0),sp // (5) Load its stack pointer
rts // (6) Go run it!
</pre>
<p class=caption>Figure 3.3: Context Switch</p>
</div>
<p>Figure 3.3 illustrates the general idea. When a kernel thread executes code that decides that it should block, it saves whatever context it wishes to preserve on the active stack. It then calls the scheduler, swtch; doing so places the thread's program counter on the stack. At this point, the top of stack contains the address where the thread is to resume execution when it unblocks, with the machine registers and the rest of the context below that. In other words, the thread's context has been reduced to a single register: its stack pointer. The scheduler stores the stack pointer into the thread's control block, known as the thread table entry (TTE), which holds the thread state when it is not executing. It then selects another thread to run, shown as a call to the find next thread procedure in the figure, but actually implemented as an executable data structure as discussed later. The variable Current is updated to reflect the new thread and its stack pointer is loaded into the CPU. A return-from-subroutine (rts) instruction starts the thread running. It continues where it had left off (at label res), where it pops the previously-saved state off the stack and proceeds with its work.
<p>Figure 3.4 shows two TTEs. Each TTE contains code fragments that help with context switching: <em>sw_in</em> and <em>sw_in_mmu</em>, which loads the processor state from the TTE; and <em>sw_out</em>, which stores processor state back into the TTE. These code fragments are created specially for each thread. To switch in a thread for execution, the processor executes the thread's <em>sw_in</em> or <em>sw_in_mmu</em> procedure. To switch out a thread, the processor executes the thread's <em>sw_out</em> procedure.
<!-- FIGURE (IMG) GOES HERE - - FINISH -->
<img src="finish.png">
<p class=caption>Figure 3.4: Thread Context</p>
<p>Notice how the ready-to-run threads (waiting for CPU) are chained in an executable circular queue. A <em>jmp</em> instruction at the end of the <em>sw_out</em> procedure of the preceding thread points to the <em>sw_in</em> procedure of the following thread. Assume thread-0 is currently running. When its time quantum expires, the timer interrupt is vectored to thread-0's <em>sw_out</em>. This procedure saves the CPU registers into thread-0's register save area (TT0.reg). The jmp instruction then directs control flow to one of two entry points of the next thread's (thread-1) context-switch-in procedure, <em>sw_in</em> or <em>sw_in_mmu</em>. Control flows to <em>sw_in_mmu</em> when a change of address space is required; otherwise control flows to <em>sw_in</em>. The switch-in procedure then loads the CPU's vector base register with the address of thread-1's vector table, restores the processor's general registers, and resumes execution of thread-1. The entire switch takes 10.5 microseconds to switch integer-only contexts between threads in the same address space, or 56 microseconds including the floating point context and a change in address space.<sup>1</sup>
<div class=footnote><sup>1</sup> Previous papers incorrectly cite a floating-point context switch time of 15 &#181;s [25] [18]. This error is believed to have been caused by a bug in the Synthesis assembler, which incorrectly filled the operand field of the floating-point move-multiple-registers instruction causing it to preserve just one register, instead of all eight. Since very few Synthesis applications use floating point, this bug remained undetected for a long time.</div>
<p>Table 3.3 summarizes the time taken by the various types of context switches in Synthesis, saving and restoring all the integer registers. These times include the hardware interrupt service overhead -- they show the elapsed time from the execution of the last instruction in the suspended thread to the first instruction in the next thread. Previously published papers report somewhat lower figures [25] [18]. This is because they did not include the interrupt-service overhead, and because of some extra overhead incurred in handling the 68882 floating point unit on the Sony NEWS workstation that does not occur on the Quamachine, as discussed later. For comparison, a call to a null procedure in the C language takes 1.4 microseconds, and the Sony <span class=smallcaps>Unix</span> context switch takes 170 microseconds.
<table class=table>
<caption>
Table 3.3: Cost of Thread Scheduling and Context Switch<br>
<small>68030 CPU, 25MHz, 1-wait-state main memory, cold cache</small>
</caption>
<tr class=head><th>Type of context switch<th>Time (&#181;s)
<tr><th>Integer registers only<td class=number>10.5
<tr><th>Floating-point<td class=number>52
<tr><th>Integer, change address space<td class=number>16
<tr><th>Floating-point, change address space<td class=number>56
<tr><th>Null procedure call (C language)<td class=number>1.4
<tr><th>Sony NEWS, <span class=smallcaps>Unix</span><td class=number>170
<tr><th>NeXT Machine, Mach<td class=number>510
</table>
<p>In addition to reducing ready-queue traversal time, specialized context-switch code enables further optimizations, to move only needed data. The previous paragraph already touched on one of the optimizations: bypassing the MMU address space switch when it is not needed. The other optimizations occur in the handling of floating point registers, described now, and in the handling of interrupts, described in the next section.
<p>Switching the floating point context is expensive because of the large amount of state that must be saved. The registers are 96 bits wide; moving all eight registers requires 24 transfers of 32 bits each. The 68882 coprocessor compounds this cost, because each word transferred requires two bus cycles: one to fetch it from the coprocessor, and one to write it to memory. The result is that it takes about 50 microseconds just to save and restore the hundred-plus bytes of information comprising the floating point coprocessor state. This is more than five times the cost of doing an entire context switch without the floating point.
<p>Since preserving floating point context is so expensive, we use runtime tests to detect whether floating point has been used, so that state which is not needed is never saved. Threads start out assuming floating point will not be used, and their context-switch code is created without it. When context-switching out, the context-save code checks whether the floating point unit had been used. It does this using the fsave instruction of the Motorola 68882 floating point coprocessor, which saves only the internal microcode state of the floating point processor [20]. If the saved state is not null, the user-visible floating-point state is saved as well, and the context-switch code is re-created to include the floating-point context in subsequent context switches. Since the majority of threads in Synthesis do not use floating point, the savings are significant.
<p>Unfortunately, after a thread executes its first floating point instruction, floating point context will have to be preserved from that point on, even if no further floating-point instructions are issued. The context must be restored upon switch-in because a floating point instruction might be executed. The context must be saved upon switch-out even if no floating point instructions had been executed since switch-in because the 68882 cannot detect a lack of instruction execution. It can only tell us if its state is completely null. This is bad because sometimes a thread may use floating-point at first, for example, to initialize a table, and then not again. But with the 68882, we can only optimize the case when floating point is never used.
<p>The Quamachine has hardware to alleviate the problem. Its floating-point unit - also a 68882 - can be enabled and disabled by software command, allowing a lazy evaluation of floating-point context switches. Switching in a thread for execution loads its integer state and disables the floating-point unit. When a thread executes its first floating point instruction since the switch, it takes an illegal instruction trap. The kernel then loads the necessary state, first saving any prior state that may have been left there, re-enables the floating-point unit, and the thread resumes with the interrupted instruction. The trap is taken only on the first floating-point instruction following a switch, and adds only 3 &#181;s to the overhead of restoring the state. This is more than compensated for by the other savings: integer context-switch becomes 1.5 &#181;s faster because there is no need for an fsave instruction to test for possible floating-point use; and even floating-point threads benefit when they block without a floating point instruction being issued since they were switched in, saving the cost of restoring and then saving that context. Indeed, if only a single thread is using floating point, the floating point context is never switched, remaining in the coprocessor.
<h3>3.3.3 Interrupt Handling</h3>
<p>A special case of context switching occurs in interrupt handling. Many systems, such as <span class=smallcaps>Unix</span>, perform a full context switch on each interrupt. For example, an examination of the running Sony <span class=smallcaps>Unix</span> kernel reveals that not only are all integer registers saved on each interrupt, but the active portion of the floating-point context as well. This is one of the reasons that interrupt handling is expensive on a traditional system, and the reason why the designers of those systems try hard to avoid frequent interrupts. As shown earlier, preserving the floating-point state can be very expensive. Doing so is superfluous unless the interrupt handler uses floating point; most do not.
<p>Synthesis interrupt handling is faster because it saves and restores only the part of the context that will be used by the service routine, not all of it. Code synthesis allows partial context to be saved efficiently. Since different interrupt procedures use different amounts of context, we can not, in general, know how much context to preserve until the interrupt is linked to its service procedure. Furthermore, it may be desirable to change service procedures, for example, when changing or installing new I/O drivers in the running kernel. Without code synthesis, we would have to save the union of all contexts used by all procedures that could be called from the interrupt, slowing down all of them because of the needs of a few.
<p>Examples taken from the Synthesis Sound-IO device driver illustrate the ideas and provide performance numbers. The Sound-IO device is a general-purpose, high-quality audio input and output device with stereo, 16-bit analog-to-digital and digital-to-analog converters, and a direct-digital input channel from a CD player. This device interrupts the processor once for every sound sample - 44100 times per second - a very high number by conventional measures. It is normally inconceivable to attach such high-rate interrupt sources to the main processor. Sony <span class=smallcaps>Unix</span>, for example, can service a maximum of 20,000 interrupts per second, and such a device could not be handled at all.<sup>2</sup> Efficient interrupt handling is mandatory, and the rest of this section shows how Synthesis can service high interrupt rates efficiently.
<div class=footnote><sup>2</sup>The Sony workstation has two processors, one of which is dedicated to I/O, including servicing I/O interrupts using a somewhat lighter-weight mechanism. They solve the overhead problem with specialized processors -- a trend that appears to be increasingly common. But this solution compounds latency, and does not negate my point, which is that existing operating systems have high overhead that discourage frequent invocations.</div>
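<p>To put the rate in perspective: at 44100 interrupts per second there are at most 1/44100 of a second - roughly 22.7 &#181;s - between samples, and that budget must also cover all the other work the machine is doing, not just the interrupt handler.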
<p>Several benefits of runtime code generation combine to improve the efficiency of interrupt handling in Synthesis: the use of the high-speed buffering code described in Section 3.3.1, the ability to create interrupt routines that save and restore only the part of the context being used, and the use of layer-collapsing to merge separate functions together.
<div class=code>
<pre>
intr: move.l a0,-(sp) // Save register a0
move.l (P),a0 // Get buffer pointer into reg. a0
move.l (cd_port),(a0,D) // Store CD data into address P+D
addq.w #4,(P+2) // Increment low 16 bits of P.
beq cd_done // ... flush buffer if full
move.l (sp)+,a0 // Restore register a0
rte // Return from interrupt
</pre>
<p class=caption>Figure 3.5: Synthesized Code for Sound Interrupt Processing - CD Active</p>
</div>
<p>Figure 3.5 shows the actual Synthesis code created to handle the Sound-IO interrupts when only the CD-player is active. It begins by saving a single register, a0, since that is the only one used. This is followed by the code for the specific sound I/O channels, in this case, the CD-player. The code is similar to the fast buffer described in 3.3.1, synthesized to move data from the CD port directly into the user's buffer. If the other input sources (such as the A-to-D input) also become active, the interrupt routine is re-written, placing their buffer code immediately following the CD-player's. The code ends by restoring the a0 register and returning from interrupt.
<div class=code>
<pre>
s.intr:
move.l a0,-(sp) // Save register a0
tst.b (cd_active) // Is the CD device active?
beq cd_no // ... no, jump
move.l (cd_buf),a0 // Get CD buffer pointer into reg. a0
move.l (cd_port),(a0)+ // Store CD data; increment pointer
move.l a0,(cd_buf) // Update CD buffer pointer
subq.l #1,(cd_cnt) // Decrement buffer count
beq cd_flush // ... jump if buffer full
cd_no:
tst.b (ad_active) // Is the AD device active?
beq ad_no // ... no, jump
:
: [handle AD device, similar to CD code]
:
ad_no:
tst.b (da_active) // Is the DA device active?
beq da_no // ... no, jump
:
: [handle DA device, similar to CD code]
:
da_no:
move.l (sp)+,a0 // Restore register a0
rte // Return from interrupt
</pre>
<p class=caption>Figure 3.6: Sound Interrupt Processing, Hand-Assembler</p>
</div>
<p>Figure 3.6 shows the best I have been able to achieve using hand-written assembly language, without the use of code synthesis. Like the Synthesis version, this uses only a single register, so we save and restore only that one.<sup>3</sup> But without code synthesis, we must include code for all the Sound-IO sources -- CD, AD, and DA -- testing and branching around the parts for the currently inactive channels. In addition, we can no longer use the fast buffer implementation of section 3.3.1 because that requires code synthesis.
<div class=footnote><sup>3</sup> Most existing systems neglect even this simple optimization. They save and restore all the registers, all the time.</div>
<p>Figure 3.7 shows another version, this one written in C, and invoked by a short assembly-language dispatch routine. It preserves only those registers clobbered by C procedure calls, and is representative of a carefully-written interrupt routine in C.
<div class=code>
<pre>
s_intr:
movem.l &lt;d0-d2,a0-a2&gt;,-(sp)
bsr _sound_intr
movem.l (sp)+,&lt;d0-d2,a0-a2&gt;
rte
sound_intr()
{
if(cd_active) {
*cd_buf++ = *cd_port;
if(--cd_cnt &lt; 0)
cd_flush();
}
if(ad_active) {
...
}
if(da_active) {
...
}
}
</pre>
<p class=caption>Figure 3.7: Sound Interrupt Processing, C Code</p>
</div>
<p>The performance differences are summarized in Table 3.4. Measurements are divided into three groups. The first group, consisting of just a single row, shows the time taken by the hardware to process an interrupt and immediately return from it, without doing anything else. The second group shows the time taken by the various implementations of the interrupt handler when just the CD-player input channel is active. The third group is like the second, but with two active sources: the CD-player and AD channels.
<table class=table>
<caption>
Table 3.4: Processing Time for Sound-IO Interrupts<br>
<small>68030 CPU, 25MHz, 1-wait-state main memory, cold cache</small>
</caption>
<tr class=head><th><th>Time in &#181;s<th>Speedup
<tr><th>Null Interrupt<td>2.0<td class=number>--
<tr><th>CD-in, code-synth<td>3.7<td class=number>--
<tr><th>CD-in, assembler<td>6.0<td class=number>2.4
<tr><th>CD-in, C<td>9.7<td class=number>4.5
<tr><th>CD-in, C &amp; <span class=smallcaps>Unix</span><td>17.1<td class=number>8.9
<tr><th>CD+DA, code-synth<td>5.1<td class=number>--
<tr><th>CD+DA, assembler<td>7.7<td class=number>1.8
<tr><th>CD+DA, C<td>11.3<td class=number>3.0
<tr><th>CD+DA, C &amp; <span class=smallcaps>Unix</span><td>18.9<td class=number>5.5
</table>
<p>Within each group of measurements, there are four rows. The first three rows show the time taken by the code synthesis, hand-assembler, and C implementations of the interrupt handler, in that order. The code fragments measured are those of figures 3.5, 3.6, and 3.7; the C code was compiled on the Sony NEWS workstation using "cc -O". The last row shows the time taken by the C version of the handler, but dispatched the way that Sony <span class=smallcaps>Unix</span> does, preserving all the machine's registers prior to the call. The left column gives the elapsed execution time, in microseconds. The right column gives the ratio of times between the code synthesis implementation and the others. The null-interrupt time is subtracted before computing the ratio to give a better picture of what the implementation-specific performance increases are.
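<p>For example, the one-channel, hand-assembler ratio is computed as (6.0 - 2.0) / (3.7 - 2.0) &#8776; 2.4, comparing only the time spent in the handlers themselves rather than in the hardware's fixed interrupt overhead.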
<p>As can be seen from the table, the performance gains of using code synthesis are impressive. With only one channel active, we get more than twice the performance of handwritten assembly language, almost five times more efficient than well-written C, and very nearly an order of magnitude better than traditional <span class=smallcaps>Unix</span> interrupt service. Furthermore, the non-code-synthesis versions of the driver support only the two-channel, 16-bit linear-encoding sound format. Extending support, as Synthesis does, to other sound formats, such as &#181;-Law, either requires more tests in the sound interrupt handler or an extra level of format conversion code between the application and the sound driver. Either option adds overhead that is not present in the code synthesis version, and would increase the time shown.
<p>With two channels active, the gain is still significant though somewhat less than that for one channel. The reason is that the overhead-reducing optimizations of code synthesis -- collapsing layers and preserving only context that is used -- become less important as the amount of work increases. But other optimizations of code synthesis, such as the fast buffer, continue to be effective and scale with the work load. In the limit, as the number of active channels becomes large, the C and assembly versions perform equally well, and the code synthesis version is about 40% faster.
<h3>3.3.4 System Calls</h3>
<p>Another use of code synthesis is to minimize the overhead of invoking system calls. In Synthesis the term "system call" is somewhat of a misnomer because the Synthesis system interface is based on procedure calls. A Synthesis system call is really a procedure call that happens to cross the protection boundary between user and kernel. This is important because, as we will see in Chapter 4, each Synthesis service has a set of procedures associated with it that delivers that service. Since the set of services provided is extensible, we need a more general way of invoking them. Combining procedure calls with runtime code generation lets us do this efficiently.
<div class=code>
<pre>
// --- User-level stub procedure ---
proc:
moveq #N,d2 // Load procedure index
trap #15 // Trap to kernel
rts // Return
// --- Dispatch to kernel procedure ---
trap15:
cmp.w #MAX,d2 // Check that procedure index is in range
bhs bad_call // ... jump if not
move.l (tab$,pc,d2*4),a2 // Get the procedure address
jsr (a2) // Call it
rte // Return to user-level
.align 4 // Table of kernel procedure addresses...
tab$:
dc.l fn0, fn1, fn2, fn3, ..., fnN
</pre>
<p class=caption>Figure 3.8: User-to-Kernel Procedure Call</p>
</div>
<p>Figure 3.8 shows how. The generated code consists of two parts: a user part, shown at the top of the figure, and a kernel part, shown at the bottom. The user part loads the procedure index number into the <em>d2</em> register and executes the trap instruction, switching the processor into kernel mode where it executes the kernel part of the code, beginning at label <em>trap15</em>. The kernel part begins with a limit check on the procedure index number, ensuring that the index is inside the table area and preventing cheating by buggy or malicious user code that may pass a bogus number. It then indexes the table and calls the kernel procedure. The kernel procedure typically performs its own checks, such as verifying that all pointers are valid, before proceeding with the work. It returns with the rte instruction, which takes the thread back into user mode, where it returns control to the caller. Since the user program can only specify an index into the procedure table, and not the procedure address itself, only the allowed procedures may be called, and only at the appropriate entry points. Even if the user part of the generated code is overwritten either accidentally or maliciously, it can never cause the kernel to do something that could not have been done through some other, untampered, sequence of calls.
<p>Runtime code generation gives the following advantages: each thread has its own table of vectors for exceptions and interrupts, including <em>trap 15</em>. This means that each thread's kernel calls vector directly to the correct dispatch procedure, saving a level of indirection that would otherwise have been required. This dispatch procedure, since it is thread-specific, can hard-wire certain constants, such as MAX and the base address of the kernel procedure table, saving the time of fetching them from a data structure.
<p>Furthermore, by thinking of kernel invocation not as a system call - which conjures up thoughts of heavyweight processing and large overheads - but as a procedure call, many other optimizations become easier to see. For example, ordinary procedures preserve only those registers which they use; kernel procedures can do likewise. Procedure calling conventions also do not require that all the registers be preserved across a call. Often, a number of registers are allowed to be "trashed" by the call, so that simple procedures can execute without preserving anything at all. Kernel procedures can follow this same convention. The fact that kernel procedures are called from user level does not make them special; one merely has to properly address the issues of protection, which is discussed further in Section 3.4.2.
<p>Besides dispatch, we also need to address the problem of how to move data between user space and kernel as efficiently as possible. There are two kinds of moves required: passing procedure arguments and return values, and passing large buffers of data. For passing arguments, the user-level stub procedures are generated to pass as many arguments as possible in the CPU registers, bypassing the expense of accessing the user stack from kernel mode. Return values are likewise passed in registers, and moved elsewhere as needed by the user-level stub procedure. This is similar in idea to using CPU registers for passing short messages in the V system [9].
<p>Passing large data buffers is made efficient using virtual memory tricks. The main idea is: when the kernel is invoked, it has the user address space mapped in. Synthesis reserves part of each address space for the kernel. This part is normally inaccessible from user programs. But when the processor executes the trap and switches into kernel mode, the kernel part of the address space becomes accessible in addition to the user part, and the kernel procedure can easily move data back and forth using the ordinary machine instructions. Prior to beginning such a move, the kernel procedure checks that no pointer refers to locations outside the user's address space - an easy check due to the way the virtual addresses are chosen: a single limit-comparison (two instructions) suffices.
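<p>A minimal sketch of that pointer check in C might look as follows. It assumes a single contiguous kernel region starting at <em>KERNEL_BASE</em> at the top of each address space; the constant and the function name are illustrative, not the actual kernel's.
<div class=code>
<pre>
#define KERNEL_BASE 0x80000000UL     /* start of the kernel region (assumed) */

/* true if [p, p+len) lies entirely within the user part of the space */
static int user_range_ok(unsigned long p, unsigned long len)
{
    unsigned long end = p + len;
    if (end &lt; p)                     /* extra caution: address wrap-around */
        return 0;
    return end &lt;= KERNEL_BASE;       /* the single limit comparison */
}
</pre>
</div>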
<p>Further optimizations are also possible. Since the user-level stub is a real procedure, it can be in-line substituted into its caller. This can be done lazily -- the stub is written so that each time a call happens, it fetches the return address from the stack and modifies that point in the caller. Since the stubs are small, space expansion is minimal. Besides being effective, this mechanism requires minimal support from the language system to identify potential in-lineable procedure calls.
<p>Another optimization bypasses the kernel procedure dispatcher. There are 16 possible traps on the 68030. Three of these are already used, leaving 13 free for other purposes, such as to directly call heavily-used kernel procedures. If a particular kernel procedure is expected to be used often, an application can invoke the cache procedure call, and Synthesis will allocate an unused trap, set it to call the kernel procedure directly, and re-create the user-level stub to issue this trap. Since this trap directly calls the kernel procedure, there is no longer any need for a limit check or a dispatch table. Pre-assigned traps can also be used to import execution environments. Indeed, the Synthesis equivalent of the <span class=smallcaps>Unix</span> concept of "stdin" and "stdout" is implemented with cached kernel procedure calls. Specifically, <em>trap 1</em> writes to stdout, and trap 2 reads from stdin.
<p>Combining both optimizations results in a kernel procedure call that costs just a little more than a trap instruction. The various costs are summarized in Table 3.5. The middle block of measurements shows the cost of various Synthesis null kernel procedure calls: the general-dispatched, non-inlined case; the general-dispatched, with the user-level stub inlined into the application's code; cached-kernel-trap, non-inlined; and cached-kernel-trap, inlined. For comparison, the cost of a null trap and a null procedure call in the C language is shown on the top two lines, and the cost of the trivial getpid system call in <span class=smallcaps>Unix</span> and Mach is shown on the bottom two lines.
<h2>3.4 Other Issues</h2>
<h3>3.4.1 Kernel Size</h3>
<p>Kernel size inflation is an important concern in Synthesis due to the potential redundancy in the many F<sup>small</sup> and F<sup>flat</sup> programs generated by the same F<sup>create</sup>. This could be particularly bad if layer collapsing were used too enthusiastically. To limit memory use, F<sup>create</sup> can generate either in-line code or subroutine calls to shared code. The decision of when to expand in-line is made by the programmer writing F<sup>create</sup>. Full, memory-hungry in-line expansion is usually reserved for specific uses where its benefits are greatest: the performance-critical, frequently-executed paths of a function, where the performance gains justify increased memory use. Less frequently executed parts of a function are stored in a common area, shared by all instances through subroutine calls.
<table class=table>
<caption>
Table 3.5: Cost of Null System Call<br>
<small>68030 CPU, 25MHz, 1-wait-state main memory</small>
</caption>
<tr class=head><th><th>&#181;s, cold cache<th>&#181;s, warm cache
<tr><th>C procedure call<td class=number>1.2<td class=number>1.0
<tr><th>Null trap<td class=number>1.9<td class=number>1.6
<tr><th>Kernel call, general dispatch<td class=number>4.2<td class=number>3.5
<tr><th>Kernel call, general, in-lined<td class=number>3.5<td class=number>2.9
<tr><th>Kernel call, cached-trap<td class=number>3.5<td class=number>2.7
<tr><th>Kernel call, cached and in-lined<td class=number>2.7<td class=number>2.1
<tr><th><span class=smallcaps>Unix</span>, getpid<td class=number>40<td class=number>--
<tr><th>Mach, getpid<td class=number>88<td class=number>--
</table>
<p>In-line expansion does not always cost memory. If a function is small enough, expanding it in-line can take the same or less space than calling it. Examples of functions that are small enough include character-string comparisons and buffer-copy. For functions with many runtime-invariant parameters, the size expansion of inlining is offset by a size decrease that comes from not having to pass as many parameters.
<p>In practice, the actual memory needs are modest. Table 3.6 shows the total memory used by a full kernel -- including I/O buffers, virtual memory, network support, and a window system with two memory-resident fonts.
<table class=table>
<caption>
Table 3.6: Kernel Memory Requirements
</caption>
<tr class=head><th>System Activity<th>Memory Use, as code + data (Kbytes)
<tr><th>Boot image for full kernel<td class=number>140
<tr><th>One thread running<td class=number>Boot + 0.4 + 8
<tr><th>File system and disk buffers<td class=number>Boot + 6 + 400
<tr><th>100 threads, 300 open files<td class=number>Boot + 80 + 1400
</table>
<h3>3.4.2 Protecting Synthesized Code</h3>
<p>The classic solutions used by other systems to protect their kernels from unauthorized tampering by user-level applications also work in the presence of synthesized code. Like many other systems, Synthesis needs at least two hardware-supported protection domains: a privileged mode that allows access to all the machine's resources, and a restricted mode that lets ordinary calculations happen but restricts access to resources. The privileged mode is called supervisor mode, and the restricted mode, user mode.
<p>Kernel data and code - both synthesized and not - are protected using memory management to make the kernel part of each address space inaccessible to user-level programs. Synthesized routines run in supervisor mode, so they can perform privileged operations such as accessing protected buffer pages.
<p>User-level programs enter supervisor mode using the trap instruction. This instruction provides a controlled - and the only - way for user-level programs to enter supervisor mode. The synthesized routine implementing the desired system service is accessed through a jump table in the protected area of the address space. The user program specifies an index into this table, ensuring the synthesized routines are always entered at the proper entry points. This protection mechanism is similar to Hydra's use of C-lists to prevent the forgery of capabilities [34].
<p>Once in kernel mode, the synthesized code handling the requested service can begin to do its job. Further protection is unnecessary because, by design, the kernel code generator only creates code that touches data the application is allowed to touch. For example, were a file inaccessible, its read procedure would never have been generated. Just before returning control to the caller, the synthesized code reverts to the previous (user-level) mode.
<p>There is still the question of invalidating the code when the operation it performs is no longer valid -- for example, invalidating the read procedure after a file has been closed. Currently, this is done by setting the corresponding function pointers in the KPT to an invalid address, preventing further calls to the function. The function's reference counter is then decremented, and its memory freed when the count reaches zero.
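<p>The invalidation step can be pictured with a short C sketch; the names (<em>kpt_slot</em>, <em>free_code</em>) and the structure are assumptions made only for illustration.
<div class=code>
<pre>
typedef long (*kproc_t)();          /* a generated kernel procedure */

#define INVALID_FN ((kproc_t)0)     /* stand-in for an address that faults when called */

struct synth_fn {
    kproc_t code;                   /* entry point of the generated code */
    int     refcnt;                 /* callers still holding a reference */
};

extern void free_code(kproc_t);     /* reclaims code memory (assumed) */

void invalidate(struct synth_fn *f, kproc_t *kpt_slot)
{
    *kpt_slot = INVALID_FN;         /* further calls through the KPT now fail */
    if (--f->refcnt == 0)
        free_code(f->code);         /* no references remain; free the code */
}
</pre>
</div>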
<h3>3.4.3 Non-coherent Instruction Cache</h3>
<p>A common assumption in the design of processors is that a program's instructions will not change as the program runs. For that reason, most processor's instruction caches are not coherent - writes to memory are not reflected in the cache. Runtime code generation violates this assumption, requiring that the instruction cache be flushed whenever code changes happen. Too much cache flushing reduces performance, both because programs execute slower when the needed instructions are not in cache and because flushing itself may be an expensive operation.
<p>The performance of self-modifying code, like that found in executable data structures, suffers the most from an incoherent instruction cache. This is because the ratio of code modification to use tends to be high. Ideally, we would like to flush with cache-line granularity to avoid losing good entries. Some caches provide only an all-or-nothing flush. But even line-at-a-time granularity has its disadvantages: it needs machine registers to hold the parameters, registers that may not be available during interrupt service without incurring the cost of saving and restoring them. Unfortunately for Synthesis, most cases of self-modifying code actually occur inside interrupt service routines where small amounts of data (e.g., one character for a TTY line) must be processed with minimal overhead. Fortunately, in all important cases the cost has been reduced to zero through careful layout of the code in memory using knowledge of the 68030 cache architecture to cause the subsequent instruction fetch to replace the cache line that needs flushing. But that trick is neither general nor portable.
<p>For the vast majority of code synthesis applications, an incoherent cache is not a big problem. The cost of flushing even a large cache contributes relatively little compared to the cost of allocating memory and creating the code. If code generation happens infrequently relative to the code's use, as is usually the case, the performance hit is small.
<p>Besides the performance hit from a cold cache, cache flushing itself may be slow. On the 68030 processor, for example, the instruction to flush the cache is privileged. Although this causes no special problems for the Synthesis kernel, it does force user-level programs that modify code to make a system call to flush the cache. I do not see any protection-related reason why that instruction must be privileged; perhaps making it so simplified processor design.
<h2>3.5 Summary</h2>
<p>This chapter showed:
<ol>
<li>that code synthesis allows important operating system functions such as buffering, context switching, interrupt handling, and system call dispatch to be implemented 1.4 to 2.4 times more efficiently than is possible using the best assembly-language implementation without code synthesis and 1.5 to 5 times better than well-written C code;
<li>that code synthesis is also effective at the user-level, achieving an 80% improvement for basic operations such as putchar; and
<li>that the anticipated size penalty does not, in fact, happen.
</ol>
<p>Before leaving this section, I want to call a moment's more attention to the interrupt handlers of Section 3.3.3. At first glance - and even on the second and third - the C-language code looks to be as minimal as it can get. There does not seem to be any fat to cut. Table 3.4 has shown otherwise. The point is that sometimes, sources of overhead are hidden, not so easy to spot and optimize. They are a result of assumptions made and the programming language used, whether it be in the form of a common calling convention for procedures, or in conventions followed to simplify linking routines to interrupts. This section has shown that code synthesis is an important technique that enables general procedure interfacing while preserving -- and often bettering -- the efficiency of custom-crafted code.
<p>The next chapter now shows how Synthesis is structured and how synergy between kernel code synthesis and good software engineering leads to a system that is general and easily expandable, but at the same time efficient.
</div>
</body>
</html>

View File

@ -1,411 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Chapter 4</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a class=here href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Chapter 4
</div>
<div id="content">
<h1>4. Kernel Structure</h1>
<div id="chapter-quote">
All things should be made as simple as possible, but no simpler.<br>
-- Albert Einstein
</div>
<h2>4.1 Quajects</h2>
<p><em>Quajects</em> are the building blocks out of which all Synthesis kernel services are composed. The name is derived from the term "object" of Object-Oriented (O-O) systems, which they strongly resemble [32]. The similarity is strong, but the difference is significant. Like objects, quajects encapsulate data and provide a well-defined interface to access it. Unlike objects, quajects use a code-synthesis implementation to achieve high performance, but lack high-level language support and inheritance.
<p>Kernel quajects can be broadly classified into four kinds: thread, memory, I/O, and device. Thread quajects encapsulate the unit of execution, memory quajects the unit of data storage, I/O quajects the unit of data movement, and device quajects the machine's interfaces to the outside world. Each kind of quaject is defined and implemented independently.
<p>Basic quajects implement fundamental services that cannot be had through any combination of other quajects. Threads and queues are two examples of basic quajects;
<table class=table>
<caption>
Table 4.1: List of Basic Quajects
</caption>
<tr class=head><th>Name<th>Purpose
<tr><th>Thread<td>Implements threads
<tr><th>Queue<td>Implements FIFO queues
<tr><th>Buffer<td>Data buffering
<tr><th>Dcache<td>Data caching (e.g., for disks)
<tr><th>FSmap<td>File to flat storage mapping
<tr><th>Clock<td>The system clock
<tr><th>CookTTYin<td>Keyboard input editor
<tr><th>CookTTYout<td>Output editor and format conversion
<tr><th>VT-100<td>Emulates DEC's VT100 terminal
<tr><th>Twindow<td>Text display window
<tr><th>Gwindow<td>Graphics (bit-mapped) display window
<tr><th>Probe<td>Measurements and statistics gathering
<tr><th>Symtab<td>Symbol table (associative mapping)
</table>
<p>Table 4.1 contains a list of the basic quajects in Synthesis. More complex kernel services are built out of the basic quajects by composition. For example, the Synthesis kernel has no pre-defined notion of a "process." But a <span class=smallcaps>Unix</span>-like process can be created by instantiating a thread quaject, a memory quaject, some I/O quajects, and interconnecting them in a particular way.
<h3>4.1.1 Quaject Interfaces</h3>
<p>The interface to a quaject consists of callentries, callbacks, and callouts. A client uses the services of a quaject by calling a callentry. Normally a callentry invocation simply returns. Exceptional situations return along callbacks. Callouts are places in the quaject where external calls to other quajects' callentries happen. Tables 4.2, 4.3, and 4.4 list the interfaces to the Synthesis basic kernel quajects.
<!-- - - FINISH - THIS SHOULD BE A FIGURE - - -->
<div class=code>
<pre>
+-----------+--------------------+-----------+
| Qput | | Qget |
+-----------+ ---+--+--+ +-----------+
| Qfull | o o | | | | Qempty |
+-----------+ ---+--+--+ +-----------+
| Qnotfull | | Qnotempty |
+-----------+--------------------+-----------+
</pre>
<p class=caption>Figure 4.1: Queue Quaject</p>
</div>
<p>Callentries are analogous to methods in object-oriented systems. The other two, callbacks and callouts, have no direct analogue in object-oriented systems. Conceptually, a callout is a function pointer that has been initialized to point to another quaject's callentry; callbacks point back to the invoker. Callouts are an important part of the interface because they specify what type of external call is needed, making it possible to dynamically link one of several different quajects' callentries to a particular callout, so long as the type matches. For example, the Synthesis buffer quaject has a flush callout which is invoked when the buffer is full. This enables the same buffer implementation to be used throughout the kernel simply by instantiating a buffer quaject and linking its flush callout to whatever downstream processing is appropriate for the instance.
<p>The quaject interface is better illustrated using a simple quaject as an example - the FIFO queue, shown in Figure 4.1. The Synthesis kernel supports four different types of queues, to optimize for the varying synchronization needs of different combinations of single or multiple producers and consumers (synchronization is discussed in Chapter 5). All four types support the same abstract type [6], defined by two callentry references, <em>Qput</em> and <em>Qget</em>, which put and get elements of the queue. Both these callentry references return synchronously under the normal condition (successful insertion or deletion). Under other conditions, the queue returns through the callbacks.
<p>The queue has four callbacks which are used to return queue-full and queue-empty conditions back to the caller. <em>Qempty</em> is invoked when a <em>Qget</em> fails because the queue is empty. <em>Qfull</em> is invoked when a <em>Qput</em> fails because the queue is full. <em>Qnotempty</em> is called after a previous <em>Qget</em> had failed and then an element was inserted. And <em>Qnotful</em> is called after a previous <em>Qput</em> had failed and then an element was deleted. The idea is: instead of returning a condition code for interpretation by the invoker, the queue quaject directly calls the appropriate handling routines supplied by the invoker, speeding execution by eliminating the interpretation of return status codes.
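<p>The calling discipline can be approximated in C with function pointers, as sketched below. This is only a model for exposition: in Synthesis the callentries and callbacks are synthesized machine code reached by direct jumps, not C function pointers, and the field names here are invented. Which behavior a client gets - blocking or non-blocking - is then entirely a matter of what it installs in the callbacks, as Section 4.1.4 shows.
<div class=code>
<pre>
struct queue;

struct queue_callbacks {             /* handlers supplied by the client */
    void (*Qfull)(struct queue *);
    void (*Qnotful)(struct queue *);
    void (*Qempty)(struct queue *);
    void (*Qnotempty)(struct queue *);
};

struct queue {
    int *buf;
    int  head, tail, size;
    struct queue_callbacks cb;
};

void Qput(struct queue *q, int v)    /* callentry: insert one element */
{
    int next = (q->head + 1) % q->size;
    if (next == q->tail) {           /* queue full: call back, no status code */
        q->cb.Qfull(q);
        return;
    }
    q->buf[q->head] = v;
    q->head = next;
}
</pre>
</div>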
<table class=table>
<caption>
Table 4.2: Interface to I/O Quajects
</caption>
<tr class=head><th>Quaject<th>Interface<th>Name<th>Purpose
<tr><td rowspan=6>Queue<td rowspan=2>Callentry<td><em>Qput</em><td>Insert element into queue
<tr><td><em>Qget</em><td>Remove element from queue
<tr><td rowspan=4>Callback<td><em>Qfull</em><td>Notify that the queue is full
<tr><td><em>Qnotful</em><td>Notify that the queue is no longer full
<tr><td><em>Qempty</em><td>Notify that the queue is empty
<tr><td><em>Qnotempty</em><td>Notify that the queue is no longer empty
<tr><td rowspan=4>BufferOut<td rowspan=3>Callentry<td><em>put</em><td>Insert an element into the buffer
<tr><td><em>write</em><td>Insert a string of elements into the buffer
<tr><td><em>flush</em><td>Force buffer contents to output
<tr><td rowspan=1>Callout<td><em>flush</em><td>Dump out the full buffer
<tr><td rowspan=3>BufferIn<td rowspan=2>Callentry<td><em>get</em><td>Get a single element from the buffer
<tr><td><em>read</em><td>Get a string of elements from the buffer
<tr><td rowspan=1>Callout<td><em>fill</em><td>Replenish the empty buffer
<tr><td rowspan=4>CookTTYin<td rowspan=2>Callentry<td><em>getchar</em><td>Read a processed character from the edit buffer
<tr><td><em>read</em><td>Read a string of characters from the edit buffer
<tr><td rowspan=2>Callout<td><em>raw_get</em><td>Get new characters from user's keyboard
<tr><td><em>echo</em><td>Echo user's typed characters
<tr><td rowspan=3>CookTTYout<td rowspan=2>Callentry<td><em>putchar</em><td>Send a character out for processing
<tr><td><em>write</em><td>Send a string of characters out for processing
<tr><td rowspan=1>Callout<td><em>raw_write</em><td>Write out processed characters to display
<tr><td rowspan=4>VT100<td rowspan=4>Callentry<td><em>putchar</em><td>Write a character to the virtual VT-100 screen
<tr><td><em>write</em><td>Write a string of characters
<tr><td><em>update</em><td>Propagate changes to the virtual screen image
<tr><td><em>refresh</em><td>Propagate the entire virtual screen image
<tr><td rowspan=4>FSmap<td rowspan=2>Callentry<td><em>aread</em><td>Asynchronous read from file
<tr><td><em>awrite</em><td>Asynchronous write to file
<tr><td rowspan=2>Callout<td><em>ca_read</em><td>Read from disk cache
<tr><td><em>ca_write</em><td>Write to disk cache
<tr><td rowspan=4>Dcache<td rowspan=2>Callentry<td><em>read</em><td>Read from data cache
<tr><td><em>write</em><td>Write to data cache
<tr><td rowspan=2>Callout<td><em>bk_read</em><td>Read from backing store
<tr><td><em>bk_write</em><td>Write to backing store
<tr><td rowspan=1>T_window<td rowspan=1>Callentry<td><em>write</em><td>Write a string of (character,attribute) pairs
<tr><td rowspan=1>G_window<td rowspan=1>Callentry<td><em>blit</em><td>Copy a rectangular array of pixels to window
</table>
<table class=table>
<caption>
Table 4.3: Interface to other Kernel Quajects
</caption>
<tr class=head><th>Quaject<th>Interface<th>Name<th>Purpose
<tr><td rowspan=11>Thread<td rowspan=8>Callentry<td><em>suspend</em><td>Suspends thread execution
<tr><td><em>resume</em><td>Resumes thread execution
<tr><td><em>stop</em><td>Prevents execution
<tr><td><em>step</em><td>Executes one instruction then stops
<tr><td><em>interrupt</em><td>Send a software interrupt
<tr><td><em>signal</em><td>Send a software signal
<tr><td><em>wait</em><td>Wait for an event
<tr><td><em>notify</em><td>Notify that event has happened
<tr><td rowspan=3>Callout<td><em>read[i]</em><td>Read from quaject i
<tr><td><em>write[i]</em><td>Write to quaject i
<tr><td><em>call[i][e]</em><td>Call callentry e in quaject i
<tr><td rowspan=5>Clock<td rowspan=4>Callentry<td><em>gettime</em><td>Get the time of day, in "ticks"
<tr><td><em>getunits</em><td>Learn how many "ticks" there are in a second
<tr><td><em>alarm</em><td>Set an alarm: call given procedure at given time
<tr><td><em>cancel</em><td>Cancel an alarm
<tr><td rowspan=1>Callout<td><em>call[i]</em><td>Call procedure i upon alarm expiration
<tr><td rowspan=2>Probe<td rowspan=2>Callentry<td><em>probe</em><td>Tell which procedure to measure
<tr><td><em>show</em><td>Display statistics
<tr><td rowspan=2>Symtab<td rowspan=2>Callentry<td><em>lookup</em><td>Lookup a string; return its associated value
<tr><td><em>add</em><td>Add entry to symbol table
</table>
<table class=table>
<caption>
Table 4.4: Interface to Device Quajects
</caption>
<tr class=head><th>Quaject<th>Interface<th>Name<th>Purpose
<tr><td rowspan=3>Serial_in<td rowspan=2>Callentry<td><em>enable</em><td>Enable input
<tr><td><em>disable</em><td>Disable input
<tr><td rowspan=1>Callout<td><em>putchar</em><td>Write received character
<tr><td rowspan=3>Serial_out<td rowspan=2>Callentry<td><em>enable</em><td>Enable output
<tr><td><em>disable</em><td>Disable output
<tr><td rowspan=1>Callout<td><em>getchar</em><td>Obtain character to send
<tr><td rowspan=3>Sound_CD<td rowspan=2>Callentry<td><em>enable</em><td>Enable input
<tr><td><em>disable</em><td>Disable input
<tr><td rowspan=1>Callout<td><em>put_sample</em><td>Store sound sample received from CD player
<tr><td rowspan=3>Sound_DA<td rowspan=2>Callentry<td><em>enable</em><td>Enable output
<tr><td><em>disable</em><td>Disable output
<tr><td rowspan=1>Callout<td><em>get_sample</em><td>Get new sound sample to send to A/D device
<tr><td rowspan=4>Framebuffer<td rowspan=2>Callentry<td><em>blit</em><td>Copy memory bitmap to framebuffer
<tr><td><em>intr_ctl</em><td>Enable or disable interrupts
<tr><td rowspan=2>Callout<td><em>Vsync</em><td>Vertical sync interrupt
<tr><td><em>Hsync</em><td>Horizontal sync interrupt
<tr><td rowspan=5>Disk<td rowspan=4>Callentry<td><em>aread</em><td>Asynchronous read
<tr><td><em>awrite</em><td>Asynchronous write
<tr><td><em>format</em><td>Format the disk
<tr><td><em>blk_size</em><td>Learn the disk's block size
<tr><td rowspan=1>Callout<td><em>new disk</em><td>(Floppy) disk has been changed
</table>
<h3>4.1.2 Creating and Destroying Quajects</h3>
<p>Each class of quaject has create and destroy callentries that instantiate and destroy members of that class, including creating all their runtime-generated code. Creating a quaject involves allocating a single block of memory for its data and code, then initializing portions of that memory. With few exceptions, all of a quaject's runtime-generated code is created during this initialization. This generally involves copying the appropriate code template, determined by the type of quaject being created and the situation in which it is to be used, and then filling in the address fields in the instructions that reference quaject-specific data items. There are two exceptions to the rule. One is when the quaject implementation uses self-modifying code. The other occurs during the handling of callouts when linking one quaject to another. This is covered in the next section.
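<p>The creation step can be sketched in C roughly as follows. The template symbols and the patch offset are placeholders for whatever a particular quaject class defines, and the real kernel patches 68030 instruction operand fields rather than copying a generic pointer as shown here.
<div class=code>
<pre>
#include &lt;stdlib.h>
#include &lt;string.h>

extern const unsigned char quaject_template[];   /* code template (assumed) */
extern const unsigned long quaject_template_size;
extern const unsigned long data_addr_patch_off;  /* operand field to fill in */

void *quaject_create(unsigned long data_size)
{
    /* one block holds the quaject's data followed by its generated code */
    unsigned char *blk = malloc(data_size + quaject_template_size);
    if (blk == 0)
        return 0;
    unsigned char *code = blk + data_size;
    memcpy(code, quaject_template, quaject_template_size);
    /* fill the address field of the instruction that references this
       quaject's data; the offset is known from the template's layout */
    memcpy(code + data_addr_patch_off, &amp;blk, sizeof blk);
    /* a real implementation would flush the instruction cache here */
    return blk;
}
</pre>
</div>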
<p>Kernel quajects are created whenever they are needed to build higher-level services. For example, opening an I/O pipe creates a queue; opening a text window creates three quajects: a window, a VT-100 terminal emulator, and a TTY-cooker. Which quajects get created and how they are interconnected is determined by the implementation of each service.
<p>Quajects may also be created at the user level, simply by calling the class's create callentry from a user-level thread. The effect is identical to creating kernel quajects, except that user memory is allocated and filled, and the resulting quajects execute in user mode, not kernel mode. The kernel does not concern itself with what happens to such user-level quajects. It merely offers creation and linkage services to applications that want to use them.
<p>Quajects are destroyed when they are no longer needed. Invoking the destroy callentry signals that a particular thread no longer needs a quaject. The quaject itself is not actually destroyed until all references to it are severed. Reference counts are used. There is the possibility that circular references prevent destruction of otherwise useless quajects, but this has not been a problem because quajects tend to be connected in cycle-free graphs. Destroying quajects does not immediately deallocate their memory. They are instead placed in the inactive list for their class. This speeds subsequent creation because much of the code-generation and initialization work had already been done.<sup>1</sup> As heap memory runs out, memory belonging to quajects on the inactive list is recycled.
<div class=footnote><sup>1</sup> Performance measurements in this dissertation were carried out without using the inactive list, but creating fresh quajects as needed.</div>
<h3>4.1.3 Resolving References</h3>
<p>The kernel resolves quaject callentry and callback references when linking quajects to build services. Conceptually, callouts and callbacks are function pointers that are initialized to point to other quajects' callentries when quajects are linked. For example, when attaching a queue to a source of data, the kernel fills the callouts of the data source with the addresses of the corresponding callentries in the queue and initializes the queue's callbacks with the addresses of the corresponding exception handlers in the data source. If the source of data is a thread, the address of the queue's <em>Qput</em> callentry is stored in the thread's write callout, the queue's <em>Qfull</em> callback is linked to the thread's suspend callentry, and the queue's <em>Qnotful</em> callback is linked to the thread's resume callentry. See Figure 4.2.
<p>In the actual implementation, a callout is a "hole" in the quaject's memory where linkage-specific runtime generated code is placed. Generally, this code consists of zero or more instructions that save any machine registers used by both caller and callee quajects, followed by a jsr instruction to invoke the target callentry, followed by zero or more instructions to restore the previously saved registers. The callout's code might also perform a context switch if the called quaject is in a different address space. Or, in the case when the code comprising the called quaject's callentry is in the same address space and is smaller than the space set aside for the callout, the callentry is copied in its entirety into the callout. This is how the layer-collapsing, in-line expansion optimization of Section 3.2.2 works. A flag bit in each callentry tells if it uses self-modifying code, in which case, the copy does not happen.
<p>Most linkage is done without referencing any symbol tables, but using information that is known at system generation time. Basically, the linking consists of blindly storing addresses in various places, being assured that they will always "land" in the correct place in the generated code. Similarly, no runtime type checking is required, as all such information has been resolved at system generation time.
<p>Not all references must be specified or filled. Each quaject provides default values for its callout and callbacks that define what happens when a particular callout or callback is needed but not connected. The action can be as simple as printing an error message and aborting the operation or as complicated as dynamically creating the missing quaject, linking the reference, and continuing.
<p>In addition, the kernel can also resolve references in response to execution traps that invoke the dynamic linker. Such references are represented by ASCII names. The name <em>Qget</em>, for example, refers to the queue's callentry. A symbol-table quaject maps the string names into the actual addresses and displacements. For example, the <em>Qget</em> callentry is represented in the symbol table as a displacement from the start of the queue quaject. Which quaject is being referenced is usually clear from context. For example, callentries are usually invoked using a register-plus-offset addressing mode; the register contains the address of the quaject in question. When not, an additional parameter disambiguates the reference.
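<p>A rough C rendering of that run-link step is given below; the function names are invented for illustration, and the usual resolution, as noted above, happens by storing addresses directly into the generated code without any symbol-table lookup.
<div class=code>
<pre>
extern long symtab_lookup(const char *name);   /* Symtab quaject callentry */

typedef void (*callentry_t)(void);

/* resolve a named callentry, e.g. "Qget", for a particular quaject */
callentry_t resolve(void *quaject_base, const char *name)
{
    long disp = symtab_lookup(name);           /* displacement from quaject start */
    return (callentry_t)((char *)quaject_base + disp);
}
</pre>
</div>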
<h3>4.1.4 Building Services</h3>
<p>Higher-level kernel services are built by composing several basic quajects. I now show, by means of an example, how a data channel is put together. The example illustrates the usage of queues and reference resolution. It also shows how a data channel can support two kinds of interfaces, blocking and non-blocking, using the same quaject building block. The queue quaject used is of type ByteQueue.<sup>2</sup>
<div class=footnote><sup>2</sup> The actual implementation of Synthesis V.1 uses an optimized version of ByteQueue that has a string-oriented interface to reduce looping, but the semantics is the same.</div>
<p>Figure 4.2 shows a producer thread using the <em>Qput</em> callentry to store bytes in the queue. The ByteQueue's <em>Qfull</em> callback is linked to the thread's suspend callentry; the ByteQueue's <em>Qnotful</em> callback is linked to the thread's resume callentry. As long as the queue is not full, calls to <em>Qput</em> enqueue the data and return normally. When the queue becomes full, the queue invokes the <em>Qfull</em> callback, suspending the producer thread. When the ByteQueue's reader removes a byte, the <em>Qnotful</em> callback is invoked, awakening the producer thread. This implements the familiar synchronous interface to an I/O stream.
<table class=fig>
<caption>
Figure 4.2: Blocking write
</caption>
<tr><td>Kind of Reference<td>User Thread<td><td colspan=2>ByteQueue<td>Device Driver<td>Hardware
<tr><td>callentry<td>write<td>&#8658;<td><em>Qput</em><td><em>Qget</em><td>&#8656;<td>send-complete interrupt
<tr><td>callback<td>suspend<td>&#8656;<td><em>Qfull</em><td><em>Qempty</em><td>&#8658;<td>turn off send-complete
<tr><td>callback<td>resume<td>&#8656;<td><em>Qnotful</em><td><em>Qnotempty</em><td>&#8658;<td>turn on send-complete
</table>
<table class=fig>
<caption>
Figure 4.3: Non-blocking write
</caption>
<tr><td>Reference<td>Thread<td><td>ByteQueue
<tr><td>callentry<td>write<td>&#8658;<td><em>Qput</em>
<tr><td>callback<td>return to caller<td>&#8656;<td><em>Qfull</em>
<tr><td>callback<td>if(more work) goto Qput<td>&#8656;<td><em>Qnotful</em>
</table>
<p>Contrast this with Figure 4.3, which shows a non-blocking interface to the same data channel implemented using the same queue quaject. Only the connections between ByteQueue and the thread change. The thread's write callout still connects to the queue's <em>Qput</em> callentry. But the queue's callbacks no longer invoke procedures that suspend or resume the producer thread. Instead, they return control back to the producer thread, functioning, in effect, like interrupts that signal events -- in this example, the filling and emptying of the queue. When the queue fills, the <em>Qfull</em> callback returns control back to the producer thread, freeing it to do other things without waiting for output to drain and without having written the bytes that did not fit. The thread knows the write is incomplete because control flow returns through the callback, not through <em>Qput</em>. After output drains, <em>Qnotful</em> is called, invoking an exception handler in the producer thread which checks whether there are remaining bytes to write, and if so, it goes back to <em>Qput</em> to finish the job.
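<p>In terms of the C sketch of Section 4.1.1, the two wirings differ only in the handlers installed in the queue's callbacks. The thread operations named here (<em>thread_suspend</em>, <em>thread_resume</em>) stand in for the thread quaject's callentries and are, again, illustrative names only.
<div class=code>
<pre>
struct queue;                                     /* as sketched in Section 4.1.1 */
extern void thread_suspend(void), thread_resume(void);

/* blocking write: suspend the producer when the queue fills */
void blocking_Qfull(struct queue *q)    { thread_suspend(); }
void blocking_Qnotful(struct queue *q)  { thread_resume(); }

/* non-blocking write: note the condition and return to the producer */
static int output_blocked;
void nonblocking_Qfull(struct queue *q)   { output_blocked = 1; }
void nonblocking_Qnotful(struct queue *q) { output_blocked = 0; /* retry Qput */ }
</pre>
</div>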
<p>Ritchie's Stream I/O system has a similar flavor: it too provides a framework for attaching stages of processing to an I/O stream [27]. But the Stream I/O queueing structure is fixed, the implementation is based on messages, and the I/O is synchronous. Unlike Stream I/O, quajects offer a finer level of control and expanded possibilities for connection. The previous example illustrates this by showing how the same queue quaject can be connected in different ways to provide either synchronous or asynchronous I/O. Furthermore, quajects extend the idea to include non-I/O services as well, such as threads.
<h3>4.1.5 Summary</h3>
<p>In the implementation of Synthesis kernel, quajects provide encapsulation and make all inter-module dependencies explicit. Although quajects differ from objects in traditional O-O systems because of a procedural interface and run-time code generation implementation, the benefits of encapsulation and abstraction are preserved in a highly efficient implementation.
<p>I have shown, using the data channel as an example, how quajects are composed to provide important services in the Synthesis kernel. That example also illustrates the main points of a quaject interface:
<ul>
<li>Callentry references implement object-oriented-like methods and bypass interpretation in the invoked quaject.
<li>Callback references implement return codes and bypass interpretation in the invoker.
<li>The operation semantics are determined dynamically by the quaject interconnections, independent of the quaject's implementation.
</ul>
<p>This last point is fundamental in allowing a true orthogonal quaject implementation, for example, enabling a queue to be implemented without needing any knowledge of how threads work - not even how to suspend and resume them.
<p>The next section shows how the quaject ideas fit together to provide user-level services.
<h2>4.2 Procedure-Based Kernel</h2>
<p>Two fundamental ideas underlie how Synthesis is structured and how the services are invoked:
<ul>
<li>Every callentry is a real, distinct procedure.
<li>Services are invoked by calling these procedures.
</ul>
<p>Quaject callentries are small procedures stored at known, fixed offsets from the base of the block of memory that holds the quaject's state. For simple callentries, the entire procedure is stored in the allocated space of the structure. Quajects such as buffers and queues have their callentries expanded in this manner, using all the runtime code-generation ideas discussed in Chapter 3. For more complex callentries, the procedures usually consist of some instance-specific code to handle the common execution paths, followed by code that loads the base pointer of the quaject's structure into a machine register and jumps to shared code implementing the rest of the callentry.
<p>This representation differs from that of methods in object-oriented languages such as C++. In these languages, the object's structure contains pointers to generic methods for that class of object, not the methods themselves. The language system passes a pointer to the object's structure as an extra parameter to the procedure implementing each method. This makes it hard to use an object's method as a real function, one whose address can be passed to other functions without also passing and dealing with the extra parameter.
<p>It is this difference that forms the basis of Synthesis quaject composition and extensible kernel service. Every callentry is a real procedure, each with a unique address and expecting no "extraneous" parameters. Each queue's <em>Qput</em>, for example, takes exactly one parameter: the data to be enqueued. This property is fundamental for easy quaject composition: each quaject in a chain simply calls the next, without passing an arbitrarily long array of structure pointers downstream, one for each quaject.
<h3>4.2.1 Calling Kernel Procedures</h3>
<p>The discussion until now assumes that the callentries reside in the same address space and execute at the same privilege level as their caller, so that direct procedure call is possible. But when user-level programs invoke kernel quajects, e.g., to read a file, the invocation crosses a protection boundary. A direct procedure call would not work because the kernel routine needs to run in supervisor mode.
<p>In a conventional operating system, such as <span class=smallcaps>Unix</span>, application programs invoke the kernel by making system calls. But while system calls provide a controlled, protected way for a user-level program to invoke procedures in the kernel, they are limited in that they allow access to only a fixed set of procedures in the kernel. For Synthesis to be extensible, it needs an extensible kernel call mechanism; a mechanism that supports a protected, user-level interface to arbitrary kernel quajects.
<p>The user-level interface is supplied with stub quajects. Stub quajects reside in the user address space and have the same callentries, with the same offsets, as the kernel quaject which they represent. Invoking a stub's callentry from user-level results in the corresponding kernel quaject's callentry being invoked and the results returned back.
<p>This is implemented in the following way. The stub's callentries consist of tiny procedures that load a number into a machine register and then execute a trap instruction. The number identifies the desired kernel procedure. The trap switches the processor into kernel mode, where it executes the kernel-procedure dispatcher. The dispatcher uses the procedure number parameter to index a thread-specific table of kernel procedure addresses. Simple limit checks ensure the index is in range and that only the allowed procedures are called. If the checks pass, the dispatcher invokes the kernel procedure on behalf of the user-level application.
<p>There are many benefits to this design. One is that it extends the kernel quaject interface transparently to user-level, allowing kernel quajects to be composed with user-level quajects. Its callentries are real procedures: their addresses can be passed to other functions or stored in tables; they can be in-line substituted into other procedures and optimized using the code-synthesis techniques of Section 3.2 applied at the user level. Another advantage, which has already been discussed in Section 3.3.4, is that a very efficient implementation exists. The result is that the protection boundary becomes fluid; what is placed in the kernel and what is done at user-level can be chosen at will, not dictated by the design of the system. In short, all the advantages of kernel quajects have been extended out to user level.
<h3>4.2.2 Protection</h3>
<p>Kernel procedure calls are protected because the user program can only specify indices into the kernel procedure table (KPT), so the kernel quajects are guaranteed to execute only from legitimate entry points, and because the index is checked before being used, only valid entries in the table can be accessed.
<h3>4.2.3 Dynamic Linking</h3>
<p>Synthesis supports two flavors of dynamic linking: load-link, which resolves external references at program load time, before execution begins; and run-link, which resolves references at runtime as they are needed. Run-link has the advantage of allowing execution of programs with undefined references as long as the execution path does not cross them, simplifying debugging and testing of unfinished programs.
<p>Dynamic linking does not prevent sharing or paging of executable code. It is possible to share dynamically-linked code because the runtime libraries always map to the same address in all address spaces. It is possible to page run-linked code and throw away infrequently used pages instead of writing them to backing store because the dynamic linker will re-link the references should the old page be needed again.
<h2>4.3 Threads of Execution</h2>
<p>Synthesis threads are light-weight processes, implemented by the thread quaject. Each Synthesis thread (called simply "thread" from now on) executes in a context, defined by the thread table entry (TTE), which is the data part of the thread quaject holding the thread state and which contains:
<ul>
<li>The register save area to hold the thread's machine registers when the thread is not executing.
<li>The kernel procedure table (KPT) - that table of callouts described in 4.2.1.
<li>The signal table, used to dispatch software signals.
<li>The address mapping tables for virtual memory.
<li>The vector table - the hardware-defined array of starting addresses of exception handlers. The hardware consults this table to dispatch the hardware-detected exceptions: hardware interrupts, error traps (like division by zero), memory faults, and software-traps (system calls).
<li>The context-switch-in and context-switch-out procedures comprising the executable data structure of the ready queue.
</ul>
<p>Of these, the last two are unusual. The context-switch-in and -out procedures were already discussed in Section 3.3.2, which explains how executable data structures are used to implement fast context switching. Giving each thread its own vector table also differs from usual practice, which makes the vector table a global structure, shared by all threads or processes. By having a separate vector table per thread, Synthesis saves the dispatching cost of thread-specific exceptions. Since most of the exceptions are thread specific, the savings is significant. Examples include all the error traps, such as division by zero, and the VM-related traps, such as translation fault.
<h3>4.3.1 Execution Modes</h3>
<p>Threads can execute in one of two modes: supervisor mode and user mode. When a thread calls the kernel by issuing the trap instruction, it changes modes from user to supervisor. This view of things is in contrast to having a kernel server process run the kernel call on the behalf of the client thread. Each thread's memory mapping tables are set so that as the thread switches to supervisor mode, the kernel memory space becomes accessible in addition to the user space, in effect, "unioning" the kernel memory space with the user memory space. (This implies the set of addresses used must be disjoint.) Consequently, the kernel call may move data between the user memory and the kernel memory easily, without using special machine instructions, such as "moves" (move from/to alternate address space), that take longer to execute. Other memory spaces are outside the kernel space, inaccessible even from supervisor mode except through special instructions. Since no quaject's code contains those special instructions, Synthesis can easily enforce memory access restrictions for its kernel calls by using the normal user-level memory-access checks provided by the memory management unit. It first checks that no pointer is in the kernel portion of the address space (an easy check), and then proceeds to move the data. If an illegal access happens, or if a non-resident page is referenced, the thread will take a translation-fault exception, even from supervisor mode; the fault handler then reads in the referenced page from backing store if it was missing or prints the diagnostic message if the access is disallowed. (All this works because all quajects are reentrant, and since system calls are built out of quajects, all system calls are reentrant.)
<p>Synthesis threads also provide a mechanism where routines executing in supervisor mode can make protected calls to user-mode procedures. It is mostly used to allow usermode handling of exceptions that arise during supervisor execution, for example, someone typing "Control-C" while the thread is in the middle of a kernel call. It is also expected to find use in a future implementation of remote procedure call. The hard part in allowing user-level procedure calls is not in making the call, but arranging for a protected return from user-mode back to supervisor. This is done by pushing a special, exception-causing return address on the user stack. When the user procedure finishes and returns, the exception is raised, putting the thread back into supervisor mode.
<h3>4.3.2 Thread Operations</h3>
<p>As a quaject, the thread supports several operations, defined by its callentries. They are: <em>suspend</em>, <em>resume</em>, <em>stop</em>, <em>step</em>, <em>interrupt</em>, <em>signal</em>, <em>setsignal</em>, <em>wait</em>, and <em>notify</em>. The last four overlap functionality with the first five, but are included for programmer convenience.<sup>3</sup>
<div class=footnote><sup>3</sup> In the current implementation, the thread quaject is really a composition of two lower-level quajects, neither of them externally visible: a basic thread quaject which supports the five fundamental operations listed; and a hi thread quaject, which adds the higher-level operations. I'm debating whether I want to make the basic thread quaject visible.</div>
<p><em>Suspend</em> and <em>resume</em> control thread execution, disabling or re-enabling it. They are often the targets of I/O quajects' callbacks, implementing blocking I/O. <em>Stop</em> and <em>step</em> support debuggers: <em>stop</em> prevents thread execution; <em>step</em> causes a stopped thread to execute a single machine instruction and then re-enter the stopped state. The difference between <em>stop</em> and <em>suspend</em> is that a suspended thread still executes in response to interrupts and signals while a stopped one does not. Resume continues thread execution from either the stopped or suspended state.
<p><em>Interrupt</em> causes a thread to call a specified procedure, as if a hardware interrupt had happened. It takes two parameters, an address and a mode, and it causes the thread to call the procedure at the specified address in either user or supervisor mode according to the mode parameter. Suspended threads can be interrupted: they will execute the interrupt procedure and then re-enter the suspended state.
<p><em>Signal</em> is like interrupt, but with a level of indirection for protection and isolation. It takes an integer parameter, the signal number, and indexes the thread's signal-table with it, obtaining the address and mode parameters that are then passed to interrupt. <em>Setsignal</em> associates signal numbers with addresses of interrupt procedures and execution modes. It takes three parameters: the signal number, an address, and a mode; and it fills the table slot corresponding to the signal number with the address and mode.
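<p>As a rough sketch of the indirection just described, setsignal fills a slot in the per-thread table and signal merely looks the slot up and hands it to interrupt. The structure layout, the NSIG constant, and the function names here are illustrative, not the actual Synthesis declarations.
<div class=code>
<pre>
#define NSIG 32                                  /* hypothetical table size */

extern void thread_interrupt(void (*proc)(void), int mode);   /* the interrupt callentry */

struct sigslot { void (*proc)(void); int mode; };  /* mode: user or supervisor */
static struct sigslot signal_table[NSIG];          /* lives in the TTE */

void thread_setsignal(int signo, void (*proc)(void), int mode)
{
	signal_table[signo].proc = proc;
	signal_table[signo].mode = mode;
}

void thread_signal(int signo)
{
	struct sigslot *s = &amp;signal_table[signo];
	thread_interrupt(s-&gt;proc, s-&gt;mode);        /* one level of indirection over interrupt */
}
</pre>
</div>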
<p><em>Wait</em> waits for events to happen. It takes one parameter, an integer representing an event, and it suspends the thread until that event occurs. <em>Notify</em> informs the thread of the occurrence of events. It too takes one parameter, an integer representing an event, and it resumes the thread if it had been waiting for this event. The thread system does not concern itself with what is an event nor how the assignment of events to integers is made.
<h3>4.3.3 Scheduling</h3>
<p>The Synthesis scheduling policy is round-robin with an adaptively adjusted CPU quantum per thread. Instead of priorities, Synthesis uses fine-grain scheduling, which assigns larger or smaller quanta to threads based on a "need to execute" criterion. A detailed explanation on fine-grain scheduling is postponed to Chapter 6. Here, I give only a brief informal summary.
<p>A thread's "need to execute" is determined by the rate at which I/O data flows through its I/O channels compared to the rate at which the running thread produces or consumes this I/O. Since CPU time consumed by the thread is an increasing function of the data flow, the faster the I/O rate, the faster the thread needs to run. Therefore, the scheduling algorithm assigns a larger CPU quantum to the thread. This kind of scheduling must have a fine granularity, since the CPU requirements for a given I/O rate, and the I/O rate itself, may change quickly, requiring the scheduling policy to adapt to the changes.
<p>Effective CPU time received by a thread is determined by the quantum assigned to that thread divided by the sum of quanta assigned to all threads. Priorities can be simulated and preferential treatment can be given to certain threads in two ways: raise a thread's CPU quantum and reorder the ready queue as threads block and unblock. As an event unblocks a thread, its TTE is placed at the front of the ready queue, giving it immediate access to the CPU. This minimizes response time to events. Synthesis' low-overhead context switch allows quanta to be considerably shorter than that of other operating systems without incurring excessive overhead. Nevertheless, to minimize time spent context switching, CPU quanta are adjusted to be as large as possible while maintaining the fine granularity. A typical quantum is on the order of a few hundred microseconds.
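<p>The share each thread receives is thus its own quantum divided by the sum of all quanta; for example, quanta of 400, 200, and 200 microseconds give the first thread half of the CPU. The fragment below just spells out that arithmetic; it is an illustration, not kernel code.
<div class=code>
<pre>
/* Effective CPU fraction of thread i, given each thread's quantum in microseconds. */
double cpu_share(unsigned long quantum[], int nthreads, int i)
{
	unsigned long total = 0;
	int j;

	for (j = 0; j &lt; nthreads; j++)
		total += quantum[j];
	return (double)quantum[i] / (double)total;
}
</pre>
</div>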
<h2>4.4 Input and Output</h2>
<p>In Synthesis, I/O includes all data flow among hardware devices and address spaces. Data move along logical channels called data channels, which connect sources of data with the destinations.
<h3>4.4.1 Producer/Consumer</h3>
<p>The Synthesis implementation of the channel model of I/O follows the well-known producer/consumer paradigm. Each data channel has a control flow that directs its data flow. Depending on the origin and scheduling of the control flow, a producer or consumer can be either active or passive. An active producer (or consumer) runs on a thread and calls functions submitting (or requesting) its output (or input). A thread performing writes is active. A passive producer (or consumer) does not run on its own; it sits passively, waiting for one of its I/O functions to be called, then using the thread that called the function to initiate the I/O. A TTY window is passive; characters appear on the window only in response to other threads' I/O. There are three cases of producer/consumer relationships, which we shall consider in turn.
<p>The simplest is an active producer and a passive consumer, or vice-versa. This case, called active-passive, has a simple implementation. When there is only one producer and one consumer, a procedure call does the job. If there are multiple producers, we serialize their access. If there are multiple consumers, each consumer is called in turn.
<p>The most common producer/consumer relationship has both an active producer and an active consumer. This case, called active-active, requires a queue to mediate the two. For a single producer and a single consumer, an ordinary queue suffices. For cases with multiple participants on either the producer or consumer side, we use one of the optimistically-synchronized concurrent-access queues described in section 5.2.2. Each queue may be synchronous (blocking) or asynchronous (using signals) depending on the situation.
<p>The last case is a passive producer and a passive consumer. Here, we use a pump quaject that reads data from the producer and writes it to the consumer. This works for multiple passive producers and consumers as well.
<h3>4.4.2 Hardware Devices</h3>
<p>Physical I/O devices are encapsulated in quajects called device servers. The device server interface generally mirrors the basic, "raw" interface of the physical device. Its I/O operations typically include asynchronous read and write of fixed-length data records and device-specific query and control functions. Each device server may or may not have its own thread(s). A polling I/O server runs continuously on its own thread. An interrupt-driven server blocks after initialization. A server without threads runs when its physical device generates an interrupt, invoking one of its callentries. Device servers are created at boot time, one server for each device, and persist until the system is shut down. Device servers can also be added as the system runs, but this must be done from a kernel thread -- currently there is no protected, user-level way to do this.
<p>Higher-level I/O streams are created by composing a device server with one or more filter quajects. There are three important functions that a filter quaject can perform: mapping one style of interface to another (e.g., asynchronous to synchronous), mapping one data format to another (e.g., EBCDIC to ASCII, byte-reversal), and editing data (e.g., backspacing). For example, the Synthesis equivalent of <span class=smallcaps>Unix</span> cooked tty interface is a filter that processes the output from the raw tty device server, buffers it, and performs editing as called for by the erase and kill control characters.
<h2>4.5 Virtual Memory</h2>
<p>A full discussion of virtual memory will not be presented in this dissertation because all the details have not been completely worked out as of the time of this writing. Here, I merely assert that Synthesis does support virtual memory, but the model and interface are still in flux.
<h2>4.6 Summary</h2>
<p>The positive experience in using quajects shows that a highly efficient implementation of an object-based system can be achieved. The main ingredients of such an implementation are:
<ul>
<li>a procedural interface using callout and callentry references,
<li>explicit callback references for asynchronous return,
<li>run-time code generation and linking.
</ul>
Chapter 7 backs this up with measurements. But now, we will look at issues involving multiprocessors.
</div>
</body>
</html>

View File

@@ -1,485 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a class=here href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>5. Concurrency and Synchronization</h1>
<div id="chapter-quote">
</div>
<h2>5.1 Synchronization in OS Kernels</h2>
<p>In single-processor machines, the need for synchronization within an operating system arises because of hardware interrupts. They may happen in the middle of sensitive kernel data structure modifications, compromising their integrity if not properly handled. Even if the operating system supports multiprogramming, as most do, it is always an interrupt that causes the task switch, leading to inter-task synchronization problems.
<p>In shared-memory multiprocessors, there is interference between processors accessing the shared memory, in addition to hardware interrupts. When different threads of control in the kernel need to execute in specific order (e.g., to protect the integrity of kernel data structures), they use synchronization mechanisms to ensure proper execution ordering. In this chapter, we discuss different ways to ensure synchronization in the kernel, with emphasis on the Synthesis approach based on lock-free synchronization.
<h3>5.1.1 Disabling Interrupts</h3>
<p>In a single-processor kernel (including most flavors of <span class=smallcaps>Unix</span>), all types of synchronization problems can be solved cheaply by disabling the hardware interrupts. While interrupts are disabled the executing procedure is guaranteed to continue uninterrupted. Since disabling and enabling interrupts cost only one machine instruction each, it is orders of magnitude cheaper than other synchronization mechanisms such as semaphores, therefore its use is widespread. For example, 112 of the 653 procedures that make up version 3.3 of the Sony NEWS kernel (a BSD 4.3 derivative) disable interrupts.
<p>But synchronization by disabling interrupts has its limitations. Interrupts cannot remain disabled for too long, otherwise frequent hardware interrupts such as a fast clock may be lost. This places a limit on the length of the execution path within critical regions protected by disabled interrupts. Furthermore, disabling interrupts is not always sufficient. In a shared-memory multiprocessor, data structures may be modified by different CPUs. Therefore, some explicit synchronization mechanism is needed.
<h3>5.1.2 Locking Synchronization Methods</h3>
<p>Mutual exclusion protects a critical section by allowing only one process at a time to execute in it. The many styles of algorithms and solutions for mutual exclusion may be divided into two kinds: busy-waiting (usually implemented as spin-locks) and blocking (usually implemented as semaphores). Spin-locks sit in tight loops while waiting for the critical region to clear. Blocking semaphores (or monitors) explicitly send a waiting process to a queue. When the currently executing process exits the critical section, the next process is dequeued and allowed into the critical section.
<p>The main problem with spin-locks is they waste CPU time while waiting. The justification in multiprocessors is that the process holding the lock is running and will soon clear the lock. This assumption may be false when multiple threads are mapped to the same physical processor, and results either in poor performance or in complicated scheduling to ensure the bad case does not happen. The main difficulty with blocking semaphores is the considerable overhead to maintain a waiting queue and to set and reset the semaphore. Furthermore, the waiting queue itself requires some kind of lock, resulting in a catch-22 situation that is resolved by disabling interrupts and spin-locks. Finally, having to choose between the two implementations leads to non-trivial decisions and algorithms for making the choice.
<p>Besides the overhead in acquiring and releasing locks, locking methods suffer from three major disadvantages: contention, deadlock, and priority inversion. Contention occurs when many competing processes all want to access the same lock. Important global data structures are often points of contention. In Mach, for example, a single lock serializes access to the global run-queue [7]. This becomes a point of contention if several processors want to access the queue at the same time, as would occur when the scheduler clocks are synchronized. One way to reduce the lock contention in Mach relies on scheduling "hints" from the programmer. For example, hand-off hints may give control directly to the destination thread, bypassing the run queue. Although hints may decrease lock contention for specific cases, their use is difficult and their benefits uncertain.
<p>Deadlock results when two or more processes both need locks held by the other. Typically, deadlocks are avoided by imposing a strict request order for the resources. This is a difficult solution because it requires system-wide knowledge to perform a local function; this goes against the modern programming philosophy of information-hiding.
<p>Priority inversion occurs when a low priority process in a critical section is preempted and causes other, higher priority processes to wait for that critical section. This can be particularly problematic for real-time systems, where rapid response to urgent events is essential. There are sophisticated solutions for the priority inversion problem [8], but they contribute to making locks more costly and less appealing.
<p>A final problem with locks is that they are state. In an environment that allows partial failure - such as parallel and distributed systems - a process can set a lock and then crash. All other processes needing that lock then hang indefinitely.
<h3>5.1.3 Lock-Free Synchronization Methods</h3>
<p>It is possible to perform safe updates of shared data without using locks. Herlihy [14] introduced a general methodology to transform a sequential implementation of any data structure into a wait-free, concurrent one using the <em>Compare-&amp;-Swap</em> primitive, which he shows is more powerful than test-and-set, the primitive usually used for locks. <em>Compare-&amp;-Swap</em> takes three parameters: a memory location, a compare value, and an update value. If the contents of the memory location are identical to the compare value, the update value is stored there and the operation succeeds; otherwise the memory location is left unchanged and the operation fails.
<p>Figure 5.1 shows how <em>Compare-&amp;-Swap</em> is used to perform an arbitrary atomic update of single-word data in a lock-free manner. Initially, the current value of the word is read into a private variable, old_val. This value is passed to the update function, which places its result in a private variable, new_val. <em>Compare-&amp;-Swap</em> then checks if interference happened by testing whether the word still has value old_val. If it does, then the word is atomically updated with new_val. Otherwise, there was interference, so the operation is retried. For reference, Figures 5.2 and 5.3 show the definitions of CAS, the <em>Compare-&amp;-Swap</em> function, and of CAS2, the two-word <em>Compare-&amp;-Swap</em> function, which is used later.
<div class=code>
<pre>
int data_val;
AtomicUpdate(update_function)
{
retry:
old_val = data_val;
new_val = update_function(old_val);
if(CAS(&amp;data_val, old_val, new_val) == FAIL)
goto retry;
return new_val;
}
</pre>
<p class=caption>Figure 5.1: Atomic Update of Single-Word Data</p>
</div>
<div class=code>
<pre>
CAS(mem_addr, compare_value, update_value)
{
if(*mem_addr == compare_value) {
*mem_addr = update_value;
return SUCCEED;
} else
return FAIL;
}
</pre>
<p class=caption>Figure 5.2: Definition of Compare-and-Swap</p>
</div>
<p>Updating data of arbitrary-length using <em>Compare-&amp;-Swap</em> is harder. Herlihy's general method works like this: each data structure has a "root" pointer, which points to the current version of the data structure. An update is performed by allocating new memory and copying the old data structure into the new memory, making the changes, using <em>Compare-&amp;-Swap</em> to swing the root pointer to the new structure, and deallocating the old.
<div class=code>
<pre>
CAS2(mem_addr1, mem_addr2, compare1, compare2, update1, update2)
{
if(*mem_addr1 == compare1 &amp;&amp; *mem_addr2 == compare2) {
*mem_addr1 = update1;
*mem_addr2 = update2;
return SUCCEED;
} else
return FAIL;
}
</pre>
<p class=caption>Figure 5.3: Definition of Double-Word Compare-and-Swap</p>
</div>
<p>He provides methods of partitioning large data structures so that not all of a structure needs to be copied, but in general, his methods are expensive.
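<p>The core of the root-pointer method can be sketched as follows, using the CAS primitive of Figure 5.2. The copy_obj and free_obj helpers are hypothetical, and the sketch glosses over the question of when the old version may safely be freed, which is exactly the problem Section 5.2.3 returns to.
<div class=code>
<pre>
struct obj;                                  /* some arbitrary data structure */
extern struct obj *copy_obj(struct obj *);   /* allocate and copy (hypothetical) */
extern void free_obj(struct obj *);

struct obj *root;                            /* points to the current version */

void atomic_update(void (*change)(struct obj *))
{
	struct obj *old_version, *new_version;
retry:
	old_version = root;
	new_version = copy_obj(old_version);     /* work on a private copy */
	change(new_version);                     /* apply the update to the copy */
	if (CAS(&amp;root, old_version, new_version) == FAIL) {
		free_obj(new_version);           /* another update won the race */
		goto retry;
	}
	/* old_version must eventually be freed, once no reader can still see it */
}
</pre>
</div>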
<p>Herlihy defines an implementation of a concurrent data structure to be wait-free if it guarantees that each process modifying the data structure will complete the operation in a finite number of steps. He defines an implementation to be non-blocking if it guarantees that some process will complete an operation in a finite number of steps. Both prevent deadlock. Wait-free also prevents starvation. In this dissertation, we use the term lock-free as synonymous with non-blocking. We have chosen to use lock-free synchronization instead of wait-free because the cost of wait-free is much higher and the chances of starvation in an OS kernel are low -- I was unable to construct a test case where that would happen.
<p>Even with the weaker goal of non-blocking, Herlihy's data structures are expensive, even when there is no interference. For example, updating a limited-depth stack is implemented by copying the entire stack to a newly allocated block of memory, making the changes on the new version, and switching the pointer to the stack with a <em>Compare-&amp;-Swap</em>. This cost is much too high, and we want to find ways to reduce it.
<h3>5.1.4 Synthesis Approach</h3>
<p>The Synthesis approach to synchronization is motivated by a desire to do each job using the minimum resources. The previous sections outlined the merits and problems of various synchronization methods. Here are the ideas that guided our search for a synchronization primitive for Synthesis:
<ul>
<li>We wanted a synchronization method that avoids the problem of priority inversion so as to simplify support of real-time signal processing.
<li>We did not want to disable interrupts because we wanted to support I/O devices that interrupt at a very high rate, such as the Sound-IO devices. Also, disabling interrupts by itself does not work for multiprocessors.
<li>We wanted a synchronization method that does not have the problem of deadlock. The reason is that we wanted as much flexibility as possible to examine and modify running kernel threads. We wanted to be able to suspend threads to examine their state without affecting the rest of the system.
</ul>
<p>Given these desires, lock-free synchronization is the method of choice. Lock-free synchronization does not have the problems of priority inversion and deadlock. I also feel it leads to more robust code because there can never be the problem of a process getting stuck and hanging while holding a lock.
<p>Unfortunately, Herlihy's general wait-free methods are too expensive. So instead of trying to implement arbitrary data structures lock-free, we take a different tack: We ask "what data structures can be implemented lock-free, efficiently?" We then build the kernel out of these structures. This differs from the usual way: typically, implementors select a synchronization method that works generally, such as semaphores, then use that everywhere. We want to use the cheapest method for each job. We rely on the quaject structuring of the kernel and on code synthesis to create special synchronization for each need.
<p>The job is made easier because the Motorola 68030 processor supports a two-word <em>Compare-&amp;-Swap</em> operation. It is similar in operation to the one-word <em>Compare-&amp;-Swap</em>, except that two words are compared, and if they both match, two updates are performed. Two-word <em>Compare-&amp;-Swap</em> lets us efficiently implement many basic data structures such as stacks, queues, and linked lists because we can atomically update both a pointer and the location being pointed to in one step. In contrast, Herlihy's algorithms, using single-word <em>Compare-&amp;-Swap</em>, must resort to copying.
<p>The first step is to see if synchronization is necessary at all. Many times the need for synchronization can be avoided through code isolation, where only specialized code that is known to be single-threaded handles the manipulation of data. An example of code isolation is in the run-queue. Typically a run-queue is protected by semaphores or spin-locks, such as in the <span class=smallcaps>Unix</span> and Mach implementations [7]. In Synthesis, only code residing in each element can change it, so we separate the run-queue traversal, which is done lock-free, safely and concurrently, from the queue element update, which is done locally by its associated thread. Another example occurs in a single-producer, single-consumer queue. As long as the queue is neither full nor empty, the producer and consumer work on different parts of it and need not synchronize.
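<p>The single-producer, single-consumer case can be made concrete with a circular buffer in which the producer writes only the head index and the consumer writes only the tail index; while the queue is neither full nor empty the two sides touch disjoint words and need no <em>Compare-&amp;-Swap</em> at all. This sketch is only an illustration of that observation (and assumes that single-word reads and writes are atomic, as on the 68030); it is not the Synthesis queue quaject.
<div class=code>
<pre>
#define QSIZE 64              /* hypothetical capacity, a power of two */
#define FULL  (-1)
#define EMPTY (-2)

static int buf[QSIZE];
static volatile unsigned int head;    /* written only by the producer */
static volatile unsigned int tail;    /* written only by the consumer */

int put(int x)                        /* producer side */
{
	if (head - tail == QSIZE)
		return FULL;
	buf[head % QSIZE] = x;
	head = head + 1;              /* single-word store publishes the item */
	return 0;
}

int get(int *x)                       /* consumer side */
{
	if (head == tail)
		return EMPTY;
	*x = buf[tail % QSIZE];
	tail = tail + 1;
	return 0;
}
</pre>
</div>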
<p>Once it has been determined that synchronization is unavoidable, the next step is to try to encode the shared information into one or two machine words. If that succeeds, then we can use <em>Compare-&amp;-Swap</em> on the one or two words directly. Counters, accumulators, and state-flags all fall in this category. If the shared data is larger than two words, then we try to encapsulate it in one of the lock-free quajects we have designed, explained in the next section: LIFO stacks, FIFO queues, and general linked lists. If that does not work, we try to partition the work into two pieces, one part that can be done lock-free, such as enqueueing the work and setting a "work-to-be-done" flag, and another part that can be postponed and done at a time when it is known interference will not happen (e.g., code isolation). Suspending of threads, which is discussed in Section 5.3.2, follows this idea - a thread is marked suspended; the actual removal of the thread from the run-queue occurs when the thread is next scheduled.
<p>When all else fails, it is possible to create a separate thread that acts as a server to serialize the operations. Communication with the server happens using lock-free queues to assure consistency. This method is used to update complex data structures, such as those in the VT-100 terminal emulator. Empirically, I have found that after all the other causes of synchronization have been eliminated or simplified as discussed above, the complex data structures that remain are rarely updated concurrently. In these cases, we can optimize, dispensing with the server thread except when interference occurs. Invoking an operation sets a "busy" flag and then proceeds with the operation, using the caller's thread to do the work. If a second thread now attempts to invoke an operation on the same data, it sees the busy-flag set, and instead enqueues the work. When the first thread finishes the operation, it sees a non-empty work queue, and spawns a server thread to process the remaining work. This server thread persists as long as there is work in the queue. When the last request has been processed, the server dies.
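<p>A sketch of this busy-flag scheme is shown below. The flag is claimed with the CAS of Figure 5.2 and pending requests go through one of the lock-free queues of Section 5.2.2; the names are made up for illustration, and a real implementation must also close the small window between testing the queue and clearing the flag.
<div class=code>
<pre>
struct work;                                /* one queued request (hypothetical) */
extern void do_operation(struct work *);
extern void enqueue(struct work *);         /* lock-free queue of Section 5.2.2 */
extern int  queue_empty(void);
extern void spawn_server_thread(void);      /* drains the queue, then exits */

static int busy;                            /* 0 = idle, 1 = operation in progress */

void invoke(struct work *w)
{
	if (CAS(&amp;busy, 0, 1) == FAIL) {     /* someone else is already inside */
		enqueue(w);                 /* leave the request for them */
		return;
	}
	do_operation(w);                    /* no interference: use the caller's thread */
	if (!queue_empty())
		spawn_server_thread();      /* leftover requests: hand off to a server */
	busy = 0;
}
</pre>
</div>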
<p>In addition to using only lock-free objects and optimistic critical sections, we also try to minimize the length of each critical section to decrease the probability of retries. The longer a process spends in the critical section, the greater the chance of outside interference forcing a retry. Even a small decrease in length can have a profound effect on retries. Sometimes a critical section can be divided into two shorter ones by finding a consistent intermediate state. Shifting some code between readers and writers will sometimes produce a consistent intermediate state.
<h2>5.2 Lock-Free Quajects</h2>
<p>The Synthesis kernel is composed of quajects, chunks of code with data structures. Some quajects represent OS abstractions, such as threads, memory segments, and I/O devices, described earlier in Chapter 4. Other quajects are instances of abstract data types such as stacks, queues, and linked lists, implemented in a concurrent, lock-free manner. This section describes them.
<div class=code>
<pre>
Insert(elem)
{
retry:
old_first = list_head;
*elem = old_first;
if(CAS(&amp;list_head, old_first, elem) == FAIL)
goto retry;
}
Delete()
{
retry:
old_first = list_head;
if(old_first == NULL)
return NULL;
second = *old_first;
if(CAS2(&amp;list_head, old_first, old_first, second, second, 0) == FAIL)
goto retry;
return old_first;
}
</pre>
<p class=caption>Figure 5.4: Insert and Delete at Head of Singly-Linked List</p>
</div>
<h3>5.2.1 Simple Linked Lists</h3>
<p>Figure 5.4 shows a lock-free implementation of insert and delete at the head of a singly-linked list. Insert reads the address of the list's first element into a private variable (old_first), copies it into the link field of the new element to be inserted, and then uses <em>Compare-&amp;-Swap</em> to atomically update the list's head pointer if it has not been changed since the initial read. Insert and delete at the end of the list can be carried out in a similar manner, by maintaining a list-tail pointer. This method is similar to that suggested in the 68030 processor handbook [21].
<h3>5.2.2 Stacks and Queues</h3>
<p>One can implement a stack using insert and delete to the head of a linked list, using the method of the previous section. But this requires node allocation and deallocation, which adds overhead. So I found a way of doing an array-based implementation of a stack using two-word <em>Compare-&amp;-Swap</em>. This implementation also has the advantage that it works on the hardware-defined processor stacks, which is important for delivering signals to threads. I believe this is a new result, though not a "big" one.
<div class=code>
<pre>
Push(elem)
{
retry:
old_SP = SP;
new_SP = old_SP - 1;
old_val = *new_SP;
if(CAS2(&amp;SP, new_SP, old_SP, old_val, new_SP, elem) == FAIL)
goto retry;
}
Pop()
{
retry:
old_SP = SP;
new_SP = old_SP + 1;
elem = *old_SP;
if(CAS2(&amp;SP, old_SP, old_SP, elem, new_SP, elem) == FAIL)
goto retry;
return elem;
}
</pre>
<p class=caption>Figure 5.5: Stack Push and Pop</p>
</div>
<p>Figure 5.5 shows a lock-free implementation of a stack. Pop is implemented in almost the same way as a counter increment. The current value of the stack pointer is read into a private variable, which is de-referenced to get the top item on the stack and incremented past that item. The stack pointer is then updated using <em>Compare-&amp;-Swap</em> to test for interfering accesses and retry when they happen.
<p>Push is more complicated because it must atomically update two things: the stack pointer and the top item on the stack. This needs a two-word <em>Compare-&amp;-Swap</em>. The current stack pointer is read into a private variable and decremented, placing the result into another private variable. This decremented stack pointer contains the memory address where the new item will be put. But first, the data at this address is read into a third private variable, then the new item is stored there and the stack pointer updated using a two-word <em>Compare-&amp;-Swap</em>. (The data must be read to give <em>Compare-&amp;-Swap-2</em> two comparison values. <em>Compare-&amp;-Swap-2</em> always performs two tests; sometimes one of them is undesirable.)
<p>Figure 5.6 shows a lock-free implementation of a circular queue. It is very similar to the stack implementation, and will not be discussed further.
<div class=code>
<pre>
Put(elem)
{
retry:
old_head = Q_head;
new_head = old_head + 1;
if(new_head &gt;= Q_end)
new_head = Q_begin;
if(new_head == Q_tail)
return FULL;
old_elem = *new_head;
if(CAS2(&amp;Q_head, new_head, old_head, old_elem, new_head, elem) == FAIL)
goto retry;
}
Get()
{
retry:
old_tail = Q_tail;
if(old_tail == Q_head)
return EMPTY;
elem = *old_tail;
new_tail = old_tail + 1;
if(new_tail &gt;= Q_end)
new_tail = Q_begin;
if(CAS2(&amp;Q_tail, old_tail, old_tail, elem, new_tail, elem) == FAIL)
goto retry;
return elem;
}
</pre>
<p class=caption>Figure 5.6: Queue Put and Get</p>
</div>
<h3>5.2.3 General Linked Lists</h3>
<p>The "simple" linked lists described earlier allow operations only on the head and tail of the list. General linked lists also allow operations on interior nodes.
<p>Deleting nodes at the head of a list is easy. Deleting an interior node of the list is much harder because the permanence of its neighbors is not guaranteed. Linked list traversal in the presence of deletes is hard for a similar reason - a node may be deleted and deallocated while another thread is traversing it. If a deleted node is then reallocated and reused for some other purpose, its new pointer values may cause invalid memory references by the other thread still traversing it.
<p>Herlihy's solution [14] uses reference counts. The idea is to keep deleted nodes "safe." A deleted node is safe if its pointers continue to be valid; i.e., pointing to nodes that eventually take it back to the main list where the <em>Compare-&amp;-Swap</em> will detect the change and retry the operation. Nodes that have been deleted but not deallocated are safe.
<div class=code>
<pre>
VisitNextNode(current)
{
nextp = &amp; current-&gt;next; // Point to current node's next-node field
retry:
next_node = *nextp; // Point to the next node
if(next_node != NULL) { // If node exists...
refp = &amp; next_node-&gt;refcnt; // Point to next node's ref. count
old_ref = *refp; // Get value of next node's ref. count
new_ref = old_ref + 1; // And increment
if(CAS2(nextp, refp, next_node, old_ref, next_node, new_ref) == FAIL)
goto retry;
}
return next_node;
}
ReleaseNode(current)
{
refp = &amp; current-&gt;refcnt; // Point to current node's ref. count field
retry:
old_ref = *refp; // Get value of current node's ref. count
new_ref = old_ref - 1; // ... Decrement
if(CAS(refp, old_ref, new_ref) == FAIL)
goto retry;
if(new_ref == 0) {
Deallocate(current);
return NULL;
} else {
return current;
}
}
</pre>
<p class=caption>Figure 5.7: Linked List Traversal</p>
</div>
<p>Figure 5.7 shows an implementation of Herlihy's idea, simplified by using a two-word <em>Compare-&amp;-Swap</em>. Visiting a node loads the pointer and increments the reference count. Leaving a node decrements the reference count. A deleted node is not actually freed until the reference count reaches zero. Deleting a node still requires the permanence of its neighbors. We do this in two steps: (1) mark the nodes to be deleted and leave them in the list; (2) if the previous node is not itself marked for deletion, hold on to it and unlink the node marked for deletion. Since step 2 may require going back through the list an arbitrary number of nodes, usually we do step 2 the next time we traverse the list, to avoid the overhead of traversing the list just for deletion.
<p>Optimizations are possible if we can eliminate some sources of interference. In the Synthesis run queue, for example, there is only one thread visiting a TTE at any time. So we simplify the implementation to use a binary marker instead of counters. We set the mark when we enter the node using a two-word <em>Compare-&amp;-Swap</em>. This is easier than incrementing a counter because we don't have to read the mark beforehand - it must be zero to allow entrance. Non-zero means that node is being visited by some other processor, so we skip to the next one and repeat the test.
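<p>A sketch of the binary-mark variant appears below: following the link to the next TTE and claiming its mark happen in one two-word <em>Compare-&amp;-Swap</em>, so no two dispatchers ever visit the same node at once; if the claim fails, the dispatcher simply moves on. The field and function names are illustrative, not the actual run-queue declarations.
<div class=code>
<pre>
struct tte { struct tte *next; int mark; };   /* mark: 0 = free, 1 = being visited */

struct tte *visit_next(struct tte *cur)
{
	struct tte *nxt = cur-&gt;next;
	while (nxt != NULL) {
		/* follow the link and claim the node in one atomic step;
		   the claim succeeds only if the link is unchanged and mark is 0 */
		if (CAS2(&amp;cur-&gt;next, &amp;nxt-&gt;mark, nxt, 0, nxt, 1) != FAIL)
			return nxt;
		nxt = nxt-&gt;next;              /* node busy (or list changed): skip it */
	}
	return NULL;
}

void leave_node(struct tte *t)
{
	t-&gt;mark = 0;                          /* only the visitor clears its own mark */
}
</pre>
</div>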
<h3>5.2.4 Lock-Free Synchronization Overhead</h3>
<table class=table>
<caption>
Table 5.1: Comparison of Different Synchronization Methods<br>
Times in microseconds<br>
<small>68030 CPU, 25MHz, 1-wait-state main memory, cold cache</small>
</caption>
<tr><th class=head>Operation<th>Non Sync<th>Locked<th>Lock-free<sub>noretry</sub><th>Lock-free<sub>oneretry</sub>
<tr><th>null procedure call in C<td class=number>1.4<td>-<td>-<td>-
<tr><th>Increment counter<td class=number>0.3<td class=number>2.4<td class=number>1.3<td class=number>2.3
<tr><th>Linked-list Insert<td class=number>0.6<td class=number>2.7<td class=number>1.4<td class=number>2.3
<tr><th>Linked-list Delete<td class=number>1.0<td class=number>3.2<td class=number>2.1<td class=number>4.1
<tr><th>Circular-Queue Insert<td class=number>2.0<td class=number>4.2<td class=number>3.3<td class=number>6.0
<tr><th>Circular-Queue Delete<td class=number>2.1<td class=number>4.3<td class=number>3.3<td class=number>6.0
<tr><th>Stack Push<td class=number>0.7<td class=number>2.8<td class=number>2.0<td class=number>3.5
<tr><th>Stack Pop<td class=number>0.7<td class=number>2.8<td class=number>2.0<td class=number>3.5
<tr><th>get_semaphore Sony NEWS, <span class=smallcaps>Unix</span><td class=number>8.74<td>-<td>-<td>-
</table>
Table 5.1 shows the overhead measured for the lock-free objects described in this section, and compares it with the overhead of two other implementations: one using locking and one that is not synchronized. The column labeled "Non Sync" shows the time taken to execute the operation without synchronization. The column labeled "Lock-free<sub>noretry</sub>" shows the time taken by the lock-free implementation when there is no interference. The column labeled "Locked" shows the time taken by a locking-based implementation of the operation without interference. The column labeled "Lock-free<sub>oneretry</sub>" shows the time taken when interference causes the first attempt to retry, with success on the second attempt.<sup>1</sup> For reference, the first line of the table gives the cost of a null procedure call in the C language, and the last line gives the cost of a get_semaphore operation in Sony's RTX kernel. (The RTX kernel runs in the I/O processor of Sony's dual-processor workstation, and is meant to be a light-weight kernel.)
<div class=footnote><sup>1</sup>This case is produced by generating an interfering memory reference between the initial read and the <em>Compare-&amp;-Swap</em>. The Quamachine's memory controller, implemented using programmable gate arrays, lets us do things like this. Otherwise the interference would be very difficult to produce and measure.</div>
<div class=code>
<pre>
retry: move.l (head),d1 // Get head node into reg. d1
move.l d1,a0 // ... copy to register 'a0'
beq empty // ... jump if list empty
lea (a0,next),a2 // Get address of head node's next ptr.
move.l (a2),d2 // Get 2nd node in list into reg. d2
cas2.l d1:d2,d2:d2,(head):(a2) // Update head if both nodes still same
bne retry // ... go try again if unsuccessful
</pre>
<p class=caption>Figure 5.8: Lock-Free Delete from Head of Singly-Linked List</p>
</div>
<p>The numbers shown are for in-line assembly-code implementation and assume a pointer to the relevant data structure is already in a machine register. The lock-free code measured is the same as that produced by the Synthesis kernel code generator. The non-synchronized code is the best I've been able to do writing assembler by hand. The lock-based code is the same as the non-synchronized, but preceded by some code that disables interrupts and then obtains a spinlock, and followed by code to clear the spinlock and reenable interrupts. The reasoning behind disabling interrupts is to make sure that the thread does not get preempted in its critical section, guaranteeing that the lock is cleared quickly. This represents good use of spin-locks, since any contention quickly passes.
<p>Besides avoiding the problems of locking, the table shows that the lock-free implementation is actually faster than the lock-based one, even in the case of no interference. In fact, the performance of lock-free in the presence of interference is comparable to locking without interference.
<div class=code>
<pre>
move.w %sr,d0 // Save CPU status reg. in reg. d0
or.w #0x0700,%sr // Disable interrupts.
spin: tas lock // Obtain lock
bne spin // ... busy wait
move.l (head),a0 // Get head node into reg. a0
tst.l a0 // Is there a node?
bne go // ... yes, jump
clr.l lock // ... no: Clear the lock
move.w d0,%sr // Reenable interrupts
bra empty // Go to 'empty' callback
go: move.l (a0,next),(head)// Make 'head' point to 2nd node
clr.l lock // Clear the lock
move.w d0,%sr // Reenable interrupts
</pre>
<p class=caption>Figure 5.9: Locked Delete from Head of Singly-Linked List</p>
</div>
<p>Let us study the reason for this surprising result. Figures 5.8 and 5.9 show the actual code that was measured for the linked-list delete operation. Figure 5.8 shows the lock-free code, while Figure 5.9 shows the locking-based code. The lock-free code closely follows the principles of operation described earlier. The lock-based code begins by disabling processor interrupts to guarantee that the process will not be preempted. It then obtains the spinlock; interference at this point is limited to that from other CPUs in a multiprocessor, and the lock should clear quickly. The linked-list delete is then performed, followed by clearing the lock and reenabling interrupts. (Don't be fooled by its longer length: part of the code is executed only when the list is empty.)
<p>Accounting for the costs, the actual process of deleting the element from the list takes almost the same time for both versions, with the lock-free code taking a few cycles longer. (This is because the <em>Compare-&amp;-Swap</em> instruction requires its compare-operands to be in D registers while indirection is best done through the A registers, whereas the lock-based code can use whatever registers are most convenient.) The cost advantage of lock-free comes from the much higher cost of obtaining and clearing the lock compared to the cost of <em>Compare-&amp;-Swap</em>. The two-word <em>Compare-&amp;-Swap</em> instruction takes 26 machine cycles to execute on the 68030 processor. By comparison, obtaining and then clearing the lock costs 46 cycles, with the following breakdown: 4 to save the CPU status register; 14 to disable interrupts; 12 to obtain the lock; 6 to clear the lock following the operation; and 10 to reenable interrupts. (For reference, fetching from memory costs 6 cycles and a single-word <em>Compare-&amp;-Swap</em> takes 13 cycles.)
<p>Some people argue that one should not disable interrupts when obtaining a lock. They believe it is better to waste time spin-waiting in the rare occasion that the process holding the lock is preempted, than to pay the disable/enable costs each time.<sup>2</sup> I disagree. I believe that in operating systems, it is better for an operation to always cost a little more than for it to be a little faster but occasionally exhibit very high cost. Repeatability and low variance are often as important as, if not more important than, low average cost. Furthermore, allowing interrupts in a critical section opens the possibility that a process which has been interrupted might not return to release the lock. The user may have typed Control-C, for example, terminating the program. Recovering from this situation or preventing it from happening requires tests and more code which adds to the cost - if not to the lock itself, then somewhere else in the system.
<div class=footnote><sup>2</sup> For the data structures of interest here, not disabling interrupts makes the cost of locking when no process is preempted very nearly identical to the cost of lock-free.</div>
<h2>5.3 Threads</h2>
<p>This section describes how thread operations can be implemented so they are lock-free and gives timing figures showing their cost.
<h3>5.3.1 Scheduling and Dispatching</h3>
<p>Section 3.3.2 described how thread scheduling and dispatching works using an executable data structure to speed context switching. Each thread is described by a thread table entry (TTE). The TTE contains the thread-specific procedures implementing the dispatcher and scheduler, the thread context save area, and other thread-specific data. The dispatcher is divided into two halves: the switch-out routine, which is executed from the currently running thread's TTE and which saves the thread's context; and the switch-in routine, which is executed from the new thread's TTE and which loads the new thread's context and installs its switch-out routine into the quantum clock interrupt handler.
<p>In the current version of Synthesis, the TTEs are organized into multiple levels of run-queues for scheduling and dispatching. The idea is that some threads need more frequent attention from the CPU than others, and we want to accommodate this while maintaining an overall round-robin-like policy that is easy to schedule cheaply. The policy works like this: on every second context switch, a thread from level 0 is scheduled, in round-robin fashion. On every fourth context switch, a thread from level 1 is scheduled, also in round-robin fashion. On every eighth context switch, a thread from level 2 is scheduled. And so on, for 8 levels. Each level gets half the attention of the previous level. If there are no threads at a particular level, that level's quanta are distributed among the rest of the levels.
<p>A global counter and a lookup table tells the dispatcher which level's queue is next. The lookup table contains the scheduling policy described above -- a 0 every other entry, 1 every fourth entry, 2 every eighth entry, like this: (0, 1, 0, 2, 0, 1, 0, 3, 0, 1, ... ). Using the counter to follow the priority table, the kernel dispatches a thread from level 0 at every second context-switch, from level 1 at every fourth context-switch, level 2 at every eighth, and so on.
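<p>A sketch of the selection step: the counter indexes a small table holding the pattern above, and the indexed value is the level whose queue the dispatcher searches next. The 16-entry table shown here covers only four levels for brevity; Synthesis uses eight, with a correspondingly longer table, and redistributes the slot if the chosen level is empty.
<div class=code>
<pre>
static unsigned int cswitch_count;          /* bumped at every context switch */

/* The "attention" pattern: level 0 every other slot, level 1 every fourth,
   level 2 every eighth, level 3 gets the rest (4-level illustration only). */
static const int level_of[16] =
	{ 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 3 };

int next_level(void)
{
	cswitch_count++;
	return level_of[cswitch_count &amp; 15];   /* wrap around the table */
}
</pre>
</div>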
<!-- Need Thread state transition diagram -->
<!-- Numbers in parenthesis: (STOPME,STOPPED) -->
<p class=caption>Figure 5.10: Thread State Transition Diagram</p> <!-- FINISH -->
<p>When multiple CPUs attempt to dispatch threads from the run-queues, each active dispatcher (switch-out routine) acquires a new TTE by marking it using <em>Compare-&amp;-Swap</em>. If successful, the dispatcher branches to the switch-in routine in the marked TTE. Otherwise, some other dispatcher has just acquired the attempted TTE, so this dispatcher moves on to try to mark the next TTE. The marks prevent other dispatchers from accessing a particular TTE, but not from accessing the rest of the run queues.
<h3>5.3.2 Thread Operations</h3>
<p>We now explain how the other thread operations are made lock-free. The general strategy is the same. First, mark the intended operation on the TTE. Second, perform the operation. Third, check whether the situation has changed. If it has not, the operation is done. If it has, retry the operation. An important observation is that all state transitions and markings are done atomically through <em>Compare-&amp;-Swap</em>.
<p>Figure 5.10 shows the thread state-transition diagram for the suspend and resume operations.
<p><em>Suspend</em>: The thread-suspend procedure sets the <em>STOPME</em> flag in the target thread's TTE indicating that it is to be stopped. If the target thread is currently running on a different CPU, a hardware interrupt is sent to that CPU by writing to a special I/O register, forcing a context-switch. We optimize the case when a thread is suspending itself by directly calling the scheduler instead. Thread-suspend does not actually remove the thread from the run-queue.
<p>When a scheduler encounters a thread with the <em>STOPME</em> flag set, it removes its TTE from the run-queue and sets the <em>STOPPED</em> flag to indicate that the thread has been stopped. This is done using the two-word compare-and-swap instruction to synchronize with other CPUs' schedulers that may be operating on the adjacent queue elements. The mark on the TTE guarantees that only one CPU is visiting each TTE at any given time. This also makes the delete operation safe.
<p><em>Resume</em>: First, the <em>STOPME</em> and <em>STOPPED</em> flags are read and the <em>STOPME</em> flag is cleared to indicate that the thread is ready to run. If the previously-read <em>STOPPED</em> flag indicates that the thread had not yet been removed from the run-queue, we are done. Otherwise, we re-insert the TTE directly into the run queue. The main problem we have to avoid is the case of a neighboring TTE being deleted due to the thread being killed. To solve that problem, when a thread is killed, we mark its TTE as "killed," but do not remove it from the run-queue immediately. When a dispatcher realizes the next TTE is marked "killed" during a context switch, it can safely remove it.
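<p>A sketch of the resume path in terms of these flags follows. It assumes <em>STOPME</em> and <em>STOPPED</em> are bits in a single word of the TTE, so one <em>Compare-&amp;-Swap</em> can clear <em>STOPME</em> while capturing whether the scheduler had already removed the thread from the run-queue; the names, bit values, and helper are illustrative.
<div class=code>
<pre>
#define STOPME  0x1     /* request: scheduler should take the thread off the run-queue */
#define STOPPED 0x2     /* fact: the scheduler has already taken it off */

struct tte { int flags; /* ... rest of the thread table entry ... */ };
extern void reinsert_in_runqueue(struct tte *);   /* also clears STOPPED (hypothetical) */

void resume(struct tte *t)
{
	int old_flags, new_flags;
retry:
	old_flags = t-&gt;flags;
	new_flags = old_flags &amp; ~STOPME;              /* runnable again */
	if (CAS(&amp;t-&gt;flags, old_flags, new_flags) == FAIL)
		goto retry;
	if (old_flags &amp; STOPPED)
		reinsert_in_runqueue(t);    /* it had left the run-queue; put it back */
	/* otherwise it never left the run-queue and will simply be dispatched */
}
</pre>
</div>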
<p><em>Signal</em>: Thread-signal is synchronized in a way that is similar to thread-resume. Each thread's TTE has a stack for pending signals which contains addresses of signal-handler procedures. Thread-signal uses a two-word <em>Compare-&amp;-Swap</em> to push a new procedure address onto this stack. It then sets a signal-pending flag, which the scheduler tests. The scheduler removes procedures from the pending-signal stack, one at a time, and constructs procedure call frames on the thread's runtime stack to simulate the thread having called that procedure.
<p><em>Step</em>: Thread-step is intended for instruction-at-a-time debugging; concurrent calls defeat its purpose. So we do not give any particular meaning to concurrent calls of this function except to preserve the consistency of the kernel. In the current implementation, all calls after the first fail. We implement this using an advisory lock.
<h3>5.3.3 Cost of Thread Operations</h3>
<p>Table 5.2 shows the time taken to perform the various thread operations implemented using the lock-free synchronization methods of this chapter. They were measured on the Sony NEWS 1860 machine, a dual-processor 68030 running at 25 MHz, with no interference from the other processor.
<table class=table>
<caption>
Table 5.2: Thread operations
</caption>
<tr class=head><th>Thread Operation<th>Time (&#181;s)
<tr><th>Create<sub>shared vector table</sub><td>19.2
<tr><th>Create<sub>separate vector table</sub><td>97
<tr><th>Destroy<td>2.2 + 6.1 in dispatcher
<tr><th>Suspend<td>2.2 + 3.7 in dispatcher
<tr><th>Resume<td>3.0 if in Q; 6.6 not in Q
<tr><th>Signal<td>4.6 + 4.4 in scheduler
<tr><th>Step (no FP, no VM switch)<td>25
</table>
<p>Thread suspend, destroy, and signal have been split into two parts: the part done by the requester and the part done by the dispatcher. The times for these are given in the form "X + Y": the first number is the time taken by the requester; the second number is the time taken by the dispatcher. Thread resume has two cases: the case where the thread had been stopped but the scheduler had not yet removed it from the run queue, shown by the first number, and the case where it was removed from the run queue and must be re-inserted, shown by the second number.
<p>Thread create has been made significantly faster with a copy-on-write optimization. Recall from Section 4.3 that each thread has a separate vector table. The vector table contains pointers to synthesized routines that handle the various system calls and hardware interrupts. These include the 16 system-call trap vectors, 21 program exception vectors, 19 vectors for hardware failure detection, and, depending on the hardware configuration, from 8 to 192 interrupt vectors. This represents a large amount of state information that had to be initialized - 1024 bytes.
<p>Newly-created threads point their vector table to the vector table of their creator and defer the creation of their own until they need to change the vector table. There are only two operations that change a thread's vector table: opening and closing quajects. If a quaject is not to be shared, open and close test if the TTE is being shared, and if so they first make a copy of the TTE and then modify the new copy. Alternatively, several threads may share the changes in the common vector table. For example, threads can now perform system calls such as open file and naturally share the resulting file access procedures with the other threads using the same vector table.
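<p>In outline, the copy-on-write rule is: keep pointing at the creator's table until the first operation that would modify it, then copy and redirect. The sketch below is illustrative only; the field names, the size constant, and the helpers are not the Synthesis declarations.
<div class=code>
<pre>
#define VEC_TABLE_SIZE 1024            /* bytes of vector-table state (see text) */

struct tte {
	void *vec_table;               /* possibly shared with the creator */
	int   vec_table_shared;        /* 1 until this thread owns a private copy */
	/* ... */
};

extern void *alloc_vector_table(void);
extern void  copy_bytes(void *dst, void *src, unsigned long n);

void *writable_vector_table(struct tte *t)
{
	if (t-&gt;vec_table_shared) {     /* first open/close that changes the table */
		void *copy = alloc_vector_table();
		copy_bytes(copy, t-&gt;vec_table, VEC_TABLE_SIZE);
		t-&gt;vec_table = copy;
		t-&gt;vec_table_shared = 0;
	}
	return t-&gt;vec_table;
}
</pre>
</div>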
<table class=table>
<caption>
Table 5.3: Overhead of Thread Scheduling and Context Switch
</caption>
<tr class=head><th>Type of context switch<th>Synthesis V.1 (&#181;s)
<tr><th>Integer registers only<td>14
<tr><th>Floating-point<td>56
<tr><th>Integer, change address space<td>20 + 1.6 * TLB<sub>fill</sub>
<tr><th>Floating-point, change address space<td>60 + 1.6 * TLB<sub>fill</sub>
</table>
<p>Table 5.3 shows the cost of context switching and scheduling. Context-switch is somewhat slower than shown earlier, in Table 3.3, because now we schedule from multiple run queues, and because there is synchronization that was not necessary in the single-CPU version discussed in Section 3.3.2. When changing address spaces, loading the memory management unit's translation table pointer and flushing the translation cache increases the context switch time. Extra time is then used up to fill the translation cache. This is the "<em>+1.6 * TLB<sub>fill</sub></em>" time. Depending on the thread's locality of reference, this can be as low as 4.5 microseconds for 3 pages (code, global data, and stack) to as high as 33 microseconds to fill the entire TLB cache.
<h2>5.4 Summary</h2>
<p>We have used only lock-free synchronization techniques in the implementation of the Synthesis multiprocessor kernel on a dual-68030 Sony NEWS workstation. This is in contrast to other implementations of multiprocessor kernels that use locking. Lock-based synchronization methods such as disabling interrupts, spin-locking, and waiting semaphores have many problems. Semaphores carry high management overhead and spin-locks may waste a significant amount of CPU time. (A typical argument for spin-locks is that the processor would be idle otherwise. This may not apply for synchronization inside the kernel.) A completely lock-free implementation of a multiprocessor kernel demonstrates that synchronization overhead can be reduced, concurrency increased, deadlock avoided, and priority inversion eliminated.
<p>This completely lock-free implementation is achieved with a careful kernel design using the following five-point plan as a guide:
<ul>
<li>Avoid synchronization whenever possible.
<li>Encode shared data into one or two machine words.
<li>Express the operation in terms of one or more fast lock-free data structure operations.
<li>Partition the work into two parts: a part that can be done lock-free, and a part that can be postponed to a time when there can be no interference.
<li>Use a server thread to serialize the operation. Communications with the server happens using concurrent, lock-free queues.
</ul>
<p>First we reduced the kinds of data structures used in the kernel to a few simple abstract data types such as LIFO stacks, FIFO queues, and linked lists. Then, we restricted the uses of these abstract data types to a small number of safe interactions. Finally we implemented efficient special-purpose instances of these abstract data types using single-word and double-word <em>Compare-&amp;-Swap</em>. The kernel is fully functional, supporting threads, virtual memory, and I/O devices such as window systems and file systems. The measured numbers show the very high efficiency of the implementation, competitive with user-level thread management systems.
<p>Two lessons were learned from this experience. The first is that a lock-free implementation is a viable and desirable alternative to the development of shared-memory multiprocessor kernels. The usual strategy -- to evolve a single-processor kernel into a multiprocessor kernel by surrounding critical sections with locks -- carries some performance penalty and potentially limits the system concurrency. The second is that single and double word <em>Compare-&amp;-Swap</em> are important for lock-free shared-memory multiprocessor kernels. Architectures that do not support these instructions may suffer performance penalties if operating system implementors are forced to use locks. Other synchronization instructions, such as the Load-Linked/Store-Conditional found on the MIPS processor, may also yield efficient lock-free implementations.
</div>
</body>
</html>

View File

@@ -1,385 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
<style type="text/css">
table.fig {
text-align: center;
border: thin solid #080;
margin-top: 1em;
width: 90%;
margin-left: 5%;
margin-right: 5%;
}
table.fig caption {
background-color: #8C8;
caption-side: bottom;
width: 90%;
margin-left: 5%;
margin-right: 5%;
}
td.fig-rd { background-color: #D98; border: thin solid black }
td.fig-wr { background-color: #8CD; border: thin solid black }
td.fig-p1 { background-color: #CC8; border: thin solid black }
td.fig-p2 { background-color: #8D8; border: thin solid black }
</style>
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a class=here href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>6. Fine-Grain Scheduling</h1>
<div id="chapter-quote">
The most exciting phrase to hear in science, the<br>
one that heralds new discoveries, is not "Eureka!"<br>
(I found it!) but "That's funny ..."<br>
-- Isaac Asimov
</div>
<h2>6.1 Scheduling Policies and Mechanisms</h2>
<p>There are two parts to scheduling: the policy and the mechanism. The policy determines when a job should run and for how long. The mechanism implements the policy.
<p>Traditional scheduling mechanisms have high overhead that discourages frequent scheduler decision making. Consequently, most scheduling policies try to minimize their actions. We observe that high scheduling and dispatching overhead is a result of implementation, not an inherent property of all scheduling mechanisms. We call scheduling mechanisms fine-grain if their scheduling/dispatching costs are much lower than a typical CPU quantum, for example, context switch overhead of tens of microseconds compared to CPU quanta of milliseconds.
<p>Traditional timesharing scheduling policies use some global property, such as job priority, to reorder the jobs in the ready queue. A scheduling policy is adaptive if the global property is a function of the system state, such as the total amount of CPU consumed by the job. A typical assumption in global scheduling is that all jobs are independent of each other. But in a pipeline of processes, where successive stages are coupled through their input and output, this assumption does not hold. In fact, a global adaptive scheduling algorithm may lower the priority of a CPU-intensive stage, making it the bottleneck and slowing down the whole pipeline.
<p>To make better scheduling decisions for I/O-bound processes, we take into account local information and coupling between jobs in addition to the global properties. We call such scheduling policies fine-grain because they use local information. An example of interesting local information is the amount of data in the job's input queue: if it is empty, dispatching the job will merely block for lack of input. This chapter focuses on the coupling between jobs in a pipeline, using the amount of data in the queues linking the jobs as the local information.
<p>Fine-grain scheduling is implemented in the Synthesis operating system. The approach is similar to feedback mechanisms in control systems. We measure the progress of each job and make scheduling decisions based on the measurements. For example, if the job is "too slow," say because its input queue is getting full, we schedule it more often and let it run longer. The measurements and adjustments occur frequently, accurately tracking each job's needs.
<p>The key idea in fine-grain scheduling policy is modeled after the hardware phase locked loop (PLL). A PLL outputs a frequency synchronized with a reference input frequency. Our software analogs of the PLL track a reference stream of interrupts to generate a new stable source of interrupts locked in step. The reference stream can come from a variety of sources, for example an I/O device, such as disk index interrupts that occur once every disk revolution, or the interval timer, such as the interrupt at the end of a CPU quantum. For readers unfamiliar with control systems, the PLL is summarized in Section 6.2.
<p>Fine-grain scheduling would be impractical without fast interrupt processing, fast context switching, and low dispatching overhead. Interrupt handling should be fast, since it is necessary for dispatching another process. Context switch should be cheap, since it occurs often. The scheduling algorithm should be simple, since we want to avoid a lengthy search or calculations for each decision. Chapter 3 already addressed the first two requirements. Section 6.2.3 shows that the scheduling algorithms are simple.
<!-- Need PLL picture (page 91) -->
<p class=caption>Figure 6.1: PLL Picture</p> <!-- FINISH -->
<h2>6.2 Principles of Feedback</h2>
<h3>6.2.1 Hardware Phase Locked Loop</h3>
<p>Figure 6.1 shows the block diagram of a PLL. The output of the PLL is an internally generated frequency synchronized to a multiple of the external input frequency. The phase comparator compares the current PLL output frequency, divided by N, to the input frequency. Its output is proportional to the difference in phase (frequency) between its two inputs, and represents an error signal that indicates how to adjust the output to better match the input. The filter receives the signal from the phase comparator and tailors the time-domain response of the loop. It ensures that the output does not respond too quickly to transient changes in the input. The voltage-controlled oscillator (VCO) receives the filtered signal and generates an output frequency proportional to it. The overall loop operates to compensate for variations in the input, so that if the output rate is lower than the input rate, the phase comparator, filter, and oscillator work together to increase the output rate until it matches the input. When the two rates match, the output rate tracks the input rate and the loop is said to be locked to the input rate.
<h3>6.2.2 Software Feedback</h3>
<p>The Synthesis fine-grain scheduling policies have the same three elements as the hardware PLL. They track the difference between the running rate of a job and the reference frame in a way analogous to the phase comparator. They use a filter to dampen the oscillations in the difference, like the PLL filter. And they re-schedule the running job to minimize its error compared to the reference, in the same way the VCO adjusts the output frequency.
<!-- Need PLL picture (page 92) -->
<p class=caption>Figure 6.2: Relationship between ILL and FLL</p> <!-- FINISH -->
<p>Let us consider a practical example from a disk driver: we would like to know which sector is under the disk head to perform rotational optimization in addition to the usual seek optimizations. This information is not normally available from the disk controller. But by using feedback, we can derive it from the index-interrupt that occurs once per disk revolution, supplied by some ESDI disk controllers. The index-interrupt supplies the input reference. The rate divider, N , is set to the number of sectors per track. An interval timer functions as the VCO and generates periodic interrupts corresponding to the passage of new sectors under the drive head. The phase comparator and filter are algorithms described in Section 6.2.3.
<p>When we use software to implement the PLL idea, we find more flexibility in measurement and control. Unlike hardware PLLs, which always measure phase differences, software can measure either the frequency of the input (events per second), or the time interval between inputs (seconds per event). Analogously, we can adjust either the frequency of generated interrupts or the intervals between them. Combining the two kinds of measurements with the two kinds of adjustments, we get four kinds of software locked loops. This dissertation looks only at software locked loops that measure and adjust the same variable. We call a software locked loop that measures and adjusts frequency an FLL (frequency locked loop) and a software locked loop that measures and adjusts time intervals an ILL (interval locked loop).
<p>In general, all stable locked loops minimize the error (feedback signal). Concretely, an FLL measures frequency by counting events, so its natural behavior is to maintain the number of events (and thus the frequency) equal to the input. An ILL measures intervals, so its natural behavior is to maintain the interval between consecutive output interrupts equal to the interval between inputs. At first, this seems to be two ways of looking at the same thing. And if the error were always zero, it would be. But when a change in the input happens, there is a period of time when the loop oscillates before it converges to the new output value. During this time, the differences between ILL and FLL show up. An FLL tends to maintain the correct number of events, although the interval between them may vary from the ideal. An ILL tends to maintain the correct interval, even though it might mean losing some events to do so.
<p>This natural behavior can be modified with filters. The overall response of a software locked loop is determined by the kind of filter it uses to transform measurements into adjustments. A low-pass filter makes the FLL output frequency or the ILL output intervals more uniform, less sensitive to transient changes in the input. But it also delays the response to important changes in the input. An integrator filter allows the loop to track linearly changing input without error. Without an integrator, only constant input can be tracked error-free. Two integrators allow the loop to track quadratically changing input without error. But too many integrators tend to make the loop less stable and lengthen the time it takes to converge. A derivative filter improves response to sudden changes in the input, but also makes the loop more prone to noise. Like their hardware analogs, these filters can be combined to improve both the response time and stability of the software locked loop.
<h3>6.2.3 FLL Example</h3>
<p>Figure 6.3 shows the general algorithm for an FLL that generates a stream of interrupts at four times the rate of a reference stream. The procedure <em>i1</em> services the reference stream of interrupts, while the procedure <em>i2</em> services the generated stream. The variable freq holds the frequency of <em>i2</em> interrupts and is updated whenever <em>i1</em> or <em>i2</em> runs. The variable <em>residue</em> keeps track of differences between <em>i1</em> and <em>i2</em>, serving the role of the phase comparator in a hardware PLL. Each time <em>i1</em> executes, it adds 4 to <em>residue</em>. Each time <em>i2</em> executes, it subtracts 1 from <em>residue</em>. The Filter function determines how the <em>residue</em> affects the frequency adjustments.
<div class=code>
<pre>
int residue=0, freq=0;
/* Master (reference frame) */ /* Slave (derived interrupt) */
i1() i2()
{ {
residue += 4; residue--;
freq += Filter(residue); freq += Filter(residue);
: :
: &lt;do work&gt;
&lt;do work&gt; :
: next_time = NOW + 1/freq;
: schedintr(i2, next_time);
return; return;
} }
</pre>
<p class=caption>Figure 6.3: General FLL</p>
</div>
<div class=code>
<pre>
LoPass(x)
{
static int lopass;
lopass = (7*lopass + x) / 8;
return lopass;
}
</pre>
<p class=caption>Figure 6.4: Low-pass Filter</p>
</div>
<p>If <em>i2</em> and <em>i1</em> were running at the perfect relative rate of 4 to 1, <em>residue</em> would tend to zero and freq would not be changed. But if <em>i2</em> is slower than 4 times <em>i1</em>, <em>residue</em> becomes positive, increasing the frequency of <em>i2</em> interrupts. Similarly, if <em>i2</em> is faster than 4 times <em>i1</em>, <em>i2</em> will be slowed down. As the difference in relative speeds increases, the correction becomes correspondingly larger. As <em>i1</em> and <em>i2</em> approach the exact ratio of 1:4, the difference decreases and we reach the minimum correction with <em>residue</em> being decremented by one and incremented by four, cycling between -2 and +2. Since <em>residue</em> can never converge to zero - only hover around it - the <em>i2</em> execution frequency will always jitter slightly. In practice, <em>residue</em> would be scaled down by an appropriate factor so that the jitter is negligible.
<div class=code>
<pre>
Integrate(x)
{
static int accum;
accum = accum + x;
return accum;
}
</pre>
<p class=caption>Figure 6.5: Integrator Filter</p>
</div>
<div class=code>
<pre>
Deriv(x)
{
static int old_x;
int dx;
dx = x - old_x;
old_x = x;
return dx;
}
</pre>
<p class=caption>Figure 6.6: Derivative Filter</p>
</div>
<p>Figures 6.4, 6.5, and 6.6 show some simple filters that can be used alone or in combination to improve the responsiveness and stability of the FLL. In particular, the low-pass filter shown in Figure 6.4 helps eliminate the jitter mentioned earlier at the expense of a longer settling time. The variable lopass keeps a "history" of what the most recent values of <em>residue</em> were. Each update adds 1/8 of the new <em>residue</em> to 7/8 of the old lopass. This has the effect of taking a weighted average of recent residues. When <em>residue</em> is positive for many iterations, as is the case when <em>i2</em> is too slow, lopass will eventually be equal to <em>residue</em>. But if <em>residue</em> oscillates rapidly, as in the situation described in the previous paragraph, lopass will go to zero. The derivative is never used alone, but can be used in combination with other filters to improve response to rapidly-changing inputs.
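<p>For illustration, the filters above can be chained into a single compensator. The following is a minimal C sketch of one way Figures 6.4, 6.5 and 6.6 might be combined; the 1/8 low-pass weight comes from Figure 6.4, but the other gains are illustrative assumptions, not values taken from Synthesis.
<div class=code>
<pre>
/* Minimal sketch: chaining the filters of Figures 6.4, 6.5 and 6.6.
   The 1/8 low-pass weight matches Figure 6.4; the gains in the final
   weighted sum are illustrative assumptions, not Synthesis values.  */
int Filter(int x)
{
    static int lopass, accum, old_lp;
    int lp, dx;

    lopass = (7*lopass + x) / 8;       /* low-pass: smooth the residue      */
    lp = lopass;
    accum += lp;                       /* integrator: remove steady error   */
    dx = lp - old_lp;                  /* derivative: react to sudden jumps */
    old_lp = lp;
    return (4*lp + accum + 2*dx) / 8;  /* weighted sum drives the adjustment */
}
</pre>
</div>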
<h3>6.2.4 Application Domains</h3>
<p>We choose between measuring and adjusting frequency and intervals depending on the desired accuracy and application. Accuracy is an important consideration because we can measure only integer quantities: either the number of events (frequency), or the clock ticks between events (interval). We would like to measure the larger quantity of the two since it carries higher accuracy.
<p>Let us consider a scenario that favors ILL. Suppose you have a microsecond-resolution interval timer and the input event occurs about once per second. To make the output interval match the input interval, the ILL measures second-long intervals with a microsecond resolution timer, achieving high accuracy with few events. Consequently, ILL stabilizes very quickly. In contrast, by measuring frequency (counting events), an FLL needs more events to detect and adjust the error signal. Empirically, it takes about 50 input events (in about 50 seconds) for the output to stabilize to within 10% of the desired value.
<p>A second scenario favors FLL. Suppose you have an interval timer with the resolution of one-sixtieth of a second. The input event occurs 30 times a second. Since the FLL is independent of timer resolution, its output will still stabilize to within 10% after seeing about 50 events (in about 1.7 seconds). However, since the event interval is comparable to the resolution of the timer, an ILL will suffer loss of accuracy. In this example, the measured interval will be either 1, 2 or 3 ticks, depending on the relative timing between the clock and input. Thus the ILL's output can have an error of as much as 50%.
<p>Generally, slow input rates and high resolution timers favor ILL, while high input rates and low resolution timers favor FLL. Sometimes the problem at hand forces a particular choice. For example, in queue handling procedures, the number of get-queue operations must equal the number of put-queue operations. This forces the use of an FLL, since the actual number of events controls the actions. In another example, subdividing a time interval (as in the disk sector finder), an ILL is best.
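<p>For comparison with the FLL of Figure 6.3, here is a rough sketch of what an ILL could look like, written against the same NOW and schedintr() interface. It only illustrates the idea of measuring and adjusting intervals; it is not code taken from Synthesis.
<div class=code>
<pre>
/* Sketch of an interval locked loop (ILL) that subdivides the reference
   interval by 4.  Uses the same hypothetical NOW / schedintr() interface
   as Figure 6.3; not actual Synthesis code.                              */
int last_ref, ref_interval, out_interval = 1;

i1()                                    /* reference interrupt         */
{
    ref_interval = NOW - last_ref;      /* measure the input interval  */
    last_ref = NOW;
    return;
}

i2()                                    /* derived interrupt           */
{
    int err = out_interval - ref_interval/4;
    out_interval -= Filter(err);        /* adjust toward 1/4 interval  */
    schedintr(i2, NOW + out_interval);
    return;
}
</pre>
</div>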
<h2>6.3 Uses of Feedback in Synthesis</h2>
<p>We have used feedback-based scheduling policies for a wide variety of purposes in Synthesis. These are:
<ul>
<li>An FLL in the thread scheduler to support real-time signal-processing applications.
<li>An ILL rhythm tracker for a special effects sound processing program.
<li>A digital oversampling filter for a CD player. An FLL adjusts the filter I/O rate to match the CD player.
<li>An ILL that adjusts itself to the disk rotation rate, generating an interrupt a few microseconds before each sector passes under the disk head.
</ul>
<h3>6.3.1 Real-Time Signal Processing</h3>
<p>Synthesis uses the FLL idea in its thread scheduler. This enables a pipeline of threads to process high-rate, real-time data streams and simplifies the programming of signal-processing applications. The idea is quite simple: if a thread's input queue is filling or if its output queue is emptying, increase its share of CPU. Conversely, if a thread's input queue is emptying or if its output queue is filling, decrease its share of CPU. The effect of this scheduling policy is to allocate enough CPU to each thread in the pipeline so it can process its data. Threads connected to the high-speed Sound-IO devices find their input queues being filled -- or their output queues being drained -- at a high rate. Consequently, their share of CPU increases until the rate at which they process data equals the rate that it arrives. As these threads run and produce output, the downstream threads find that their queues start to fill, and they too receive more CPU. As long as the total CPU necessary for the entire pipeline does not exceed 100%, the pipeline runs in real-time.
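<p>As an illustration only, the policy just described might be sketched in C along the following lines, reusing the low-pass filter of Figure 6.4. The struct fields and units are assumptions invented for the sketch, not Synthesis data structures.
<div class=code>
<pre>
/* Illustrative sketch of the queue-driven policy described above.
   The fields and the quantum units are assumptions made for the
   example, not Synthesis data structures.                         */
struct thread {
    int in_bytes,  in_size;    /* input queue fill and capacity    */
    int out_bytes, out_size;   /* output queue fill and capacity   */
    int quantum;               /* CPU share given at each dispatch */
};

void adjust_quantum(struct thread *t)
{
    /* positive error: input filling or output draining, so give more CPU */
    int err = (100 * t-&gt;in_bytes) / t-&gt;in_size
            + (100 * (t-&gt;out_size - t-&gt;out_bytes)) / t-&gt;out_size
            - 100;
    t-&gt;quantum += LoPass(err);   /* smooth with the filter of Figure 6.4 */
}
</pre>
</div>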
<div class=code>
<pre>
main()
{
char buf[100];
int n, fd1, fd2;
fd1 = open("/dev/cd", 0);
fd2 = open("/dev/speaker", 1);
for(;;) {
n = read(fd1, buf, 100);
write(fd2, buf, n);
}
}
</pre>
<p class=caption>Figure 6.7: Program to Play a CD</p>
</div>
<p>The simplification in applications programming that occurs using this scheduler cannot be overstated. One no longer needs to worry about assigning priorities to jobs, or of carefully crafting the inner loops so that everything is executed frequently enough. For example, in Synthesis, reading from the CD player is no different than reading from any other device or file. Simply open "/dev/cd" and read from it. To listen to the CD player, one could use the program in Figure 6.7. The scheduler FLL keeps the data flowing smoothly at the 44.1 KHz sampling rate -- 176 kilobytes per second for each channel -- regardless of how many CPU-intensive jobs might be executing in the background.
<p>Several music-oriented signal-processing applications have been written for Synthesis and run in real-time using the FLL-based thread scheduler. The Synthesis music and signal-processing toolkit includes many simple programs that take sound input, process it in some way, and produce sound output. These include delay elements, echo and reverberation filters, adjustable low-pass, band-pass and high-pass filters, Fourier transform, and a correlator and feature extraction unit. These programs can be connected together in a pipeline to perform more complex sound processing functions, in a similar way that text filters in <span class=smallcaps>Unix</span> can be cascaded using the shell's "|" notation. The thread scheduler ensures the pipeline runs in real-time.
<h3>6.3.2 Rhythm Tracking and The Automatic Drummer</h3>
<p>Besides scheduling, the feedback idea finds use in the actual processing of music signals. In one application, a correlator extracts rhythm pulses from the music on a CD. These are fed to an ILL, which subdivides the beat interval and generates interrupts synchronized to the beat of the music. These interrupts are then used to drive a drum synthesizer, which adds more drum beats to the original music. The interrupts also adjust the delay in the reverberation unit making it equal to the beat interval of the music. You can also get pretty pictures synchronized to the music when you plot the ILL input versus output on a graphics display.
<h3>6.3.3 Digital Oversampling Filter</h3>
<p>In another music application, an FLL is used to generate the timing information for a digital interpolation filter. A digital interpolator takes as input a stream of sampled data and creates additional samples between the original ones by interpolation. This oversampling increases the accuracy of analog reconstruction of digital signals. We use 4:1 oversampling, i.e. we generate 4 samples using interpolation from each CD sample. The CD player has a new data sample available 44,100 times per second, or one every 22.68 microseconds. The interpolated data output is four times this rate, or one every 5.67 microseconds.<sup>1</sup> We use an FLL to generate an interrupt source at this rate, synchronized with the CD player. This also serves as an example of just how fine-grained the timing can be: an interrupt every 5.67&nbsp;&#181;s corresponds to over 175,000 interrupts per second.
<div class=footnote><sup>1</sup> This program runs on the Quamachine at 50 MHz clock rate.</div>
<h3>6.3.4 Discussion</h3>
<p>A formal analysis of fine-grain scheduling is beyond the scope of this dissertation. However, I would like to give readers an intuitive feeling about two situations: saturation and cheating. As the CPU becomes saturated, the FLL-based scheduler degrades gracefully. The processes closest to externally generated interrupts (device drivers) will still get the necessary CPU time. The CPU-intensive processes away from I/O interrupts will slow down first, as they should at saturation.
<p>Another potential problem is cheating by consuming resources unnecessarily to increase priority. This is possible because fine-grain scheduling tends to give more CPU to processes that consume more. However, cheating cannot be done easily from within a thread or by cooperation of several threads. First, unnecessary I/O loops within a program do not help the cheater, since they do not speed up data flow in the pipeline of processes. Second, I/O within a group of threads only shifts CPU quanta within the group. A thread that reads from itself gains quanta for input, but loses the exact amount in the self-generated output. To increase the priority of a process, it must read from a real input device, such as the CD player. In this case, it is virtually impossible for the OS kernel to distinguish the real I/O from cheating I/O.
<h2>6.4 Other Applications</h2>
<h3>6.4.1 Clocks</h3>
<p>The FLL provides integral stability. This means the long-term drift between the reference frame and generated interrupts tends to zero, even though any individual interval may differ from the reference. This is in contrast with differential stability, in which the consecutive intervals are all the same, but any systematic error, no matter how small, will accumulate into a long-term drift. To illustrate, the interval timers found on many machines provide good differential stability: all the intervals are of very nearly the same length. But they do not provide good integral stability: they do not keep good time.
<p>The integral stability property of the FLL lets it increase the resolution of precise timing sources. The idea is to synchronize a higher-resolution but less precise timing device, such as the machine's interval timer, to the precise one. The input to the FLL would be an interrupt derived from a very precise source of timing, for example, from an atomic clock. The output is a new stream of interrupts occurring at some multiple of the input rate.
<p>Suppose the atomic clock ticks once a second. If the FLL's rate divider, N, is set to 1000, then the FLL will subdivide the second-long intervals into milliseconds. The FLL adjusts the interval timer so that each 1/1000-th interrupt occurs as close to the "correct" time of arrival as possible given the resolution of the interval timer, while maintaining integral stability -- N interrupts out for every interrupt in. If the interval timer used exhibits good differential stability, as most interval timers do, the output intervals will be both precise and accurate.
<p>But for this to work well, one must be careful to avoid the accumulation of round-off error when calculating successive intervals. A good rule-of-thumb to remember is: calculate based on elapsed time, not on intervals. Use differences of elapsed times whenever an interval is required. This is crucial to guaranteeing convergence. The sample FLL in Figure 6.3 follows these guidelines.
<p>To illustrate this, suppose that the hardware interval timer ticks every 0.543 microseconds.<sup>2</sup> Using this timer, a millisecond is 1843.2 ticks long. But when scheduling using intervals, 1843.2 is truncated to 1843 since interrupts can happen only on integer ticks. This gains time. One second later, the FLL will compensate by setting the interval to 1844. But now it loses time. The FLL ends up oscillating between 1843 and 1844, and never converging. Since the errors accumulate all in the same direction for the entire second before the adjustment occurs, the resulting millisecond subdivisions are not very accurate.
<div class=footnote><sup>2</sup> This is a common number on machines that derive timing from the baud-rate generator used in serial communications.</div>
<p>A better way to calculate is this: let the desired interval (1/frequency) be a floating-point number and accumulate intervals into an elapsed-time accumulator using floating-point addition. Interrupts are scheduled by taking the integer part of the elapsed-time accumulator and subtracting the previous elapsed-time from it to obtain the integer interval. Once convergence is reached, 4 out of 5 interrupts will be scheduled every 1843 ticks and 1 out of 5 every 1844 ticks, evenly interspersed, averaging to 1843.2. Each interrupt will occur as close to the 1-millisecond mark as possible given the resolution of the timer (e.g., they will differ by at most &#177;0.272&nbsp;&#181;s). In practice, the same effect can be achieved using appropriately scaled integer arithmetic, and floating point arithmetic would not be used.
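<p>A small sketch of that rule using scaled integer arithmetic: the per-millisecond advance of 1843.2 ticks is kept as 18432 tenths of a tick, and each interval is a difference of elapsed times. The numbers come from the 0.543 microsecond timer above; the code itself is only an illustration.
<div class=code>
<pre>
/* Sketch of interval generation from elapsed time, in scaled integers.
   1843.2 ticks per millisecond is carried as 18432 tenths of a tick.
   Illustration only.                                                   */
long elapsed10;        /* elapsed time, in tenths of a tick   */
long last_sched;       /* last scheduled time, in whole ticks */

int next_interval(void)
{
    long interval;
    elapsed10 += 18432;                    /* advance by 1843.2 ticks     */
    interval = elapsed10/10 - last_sched;  /* difference of elapsed times */
    last_sched += interval;
    return (int)interval;    /* 1843 four times out of five, then 1844    */
}
</pre>
</div>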
<h3>6.4.2 Real-Time Scheduling</h3>
<p>The adaptive scheduling strategy might be improved further, possibly encompassing many hard real-time scheduling problems. Hard real-time scheduling is a harder problem than the real-time stream processing problem discussed earlier. In stream processing, each job has a small queue where data can sit if the scheduler makes an occasional mistake. The goal of fine-grain scheduling is to converge to the correct CPU assignments for all the jobs before any of the queues overflow or underflow. In contrast, hard real-time jobs must meet their deadline, every single time. Nevertheless, I believe that the feedback-based scheduling idea will find useful application in this area. In this section, I only outline the general idea, without offering proof or examples. For a good discussion of issues in real-time computing, see [29].
<p>We divide hard-deadline jobs into two categories: the short ones and the long ones. A short job is one that must be completed in a time frame within an order of magnitude of interrupt and context switch overhead. For example, a job taking up to 100 microseconds would be a short job in Synthesis. Short jobs are scheduled as they arrive and run to completion without preemption.
<p>Long jobs take longer than 100 times the overhead of an interrupt and context switch. In Synthesis this includes all the jobs that take more than 1 millisecond, which includes most of the practical applications. The main problem with long jobs is the variance they introduce into scheduling. If we always take the worst scenario, the resulting hardware requirement is usually very expensive and unused most of the time.
<p>To use fine-grain scheduling policies for long jobs, we break down the long job into small strips. For simplicity of analysis we assume each strip to have the same execution time ET. We define the estimated CPU power to finish job J as:
<table class=equation>
<tr><td><td><td>(strips in J) * ET
<tr><td>Estimate(J)<td>=<td><hr>
<tr><td><td><td>Deadline(J) - Now
</table>
<p>For a long job, it is not necessary to know ET exactly since the locked loop "measures" it and continually adjusts the schedule in lock step with the actual execution time. In particular, if <em>Estimate(J) &gt; 1</em> then we know from the current estimate that J will not make the deadline. If we have two jobs, A and B, with <em>Estimate(A) + Estimate(B) &gt; 1</em> then we may want to consider aborting the less important one and calling a short emergency routine to recover.
<p>Unlike traditional hard-deadline scheduling algorithms, which either guarantee completion or nothing, fine-grain scheduling provides the ability to predict the deadline miss under dynamically changing system loads. I believe this is an important practical concern to real-time application programmers, especially in recovery from faults.
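<p>In code, the estimate might be used along the following lines. The job fields and the scaling by 100 are assumptions made for the sketch, not Synthesis structures.
<div class=code>
<pre>
/* Illustrative use of Estimate(J); scaled by 100 to stay in integer
   arithmetic.  Field names are assumptions for the sketch.          */
struct job {
    int strips_left;   /* strips of the job still to run        */
    int et;            /* execution time per strip (NOW units)  */
    int deadline;      /* absolute deadline (NOW units)         */
};

int estimate100(struct job *j)           /* Estimate(J) times 100 */
{
    return (100 * j-&gt;strips_left * j-&gt;et) / (j-&gt;deadline - NOW);
}

/* warn, or abort the less important job, when both cannot fit */
int will_miss(struct job *a, struct job *b)
{
    return estimate100(a) + estimate100(b) &gt; 100;
}
</pre>
</div>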
<h3>6.4.3 Multiprocessor and Distributed Scheduling</h3>
<p>I also believe the adaptiveness of the FLL promises good results in multiprocessor and distributed systems. But as in the previous section, the idea can only be offered at this writing, with little support. At the risk of oversimplification, I describe an example with fixed buffer size and execution time. Recognize that at a given load, we can always find the optimal scheduling statically by calculating the best buffer size and CPU quantum. But I emphasize the main advantage of feedback: the ability to dynamically adjust towards the best buffer size and CPU quantum. This is important when we have a variable system load, jobs with variable demands, or a reconfigurable system with a variable number of CPUs.
<table class=fig>
<caption>
Figure 6.8: Two Processors, Static Scheduling
</caption>
<tr>
<td>disk
<td class=fig-rd>read<td colspan=2>
<td class=fig-rd>read<td><td class=fig-wr>write
<td class=fig-rd>read<td><td class=fig-wr>write
<td>. . .
<tr>
<td>P1
<td><td class=fig-p1 colspan=2>execute
<td><td class=fig-p1 colspan=2>execute
<td><td class=fig-p1 colspan=2>execute
<td>. . .
<tr>
<td>P2
<td colspan=3>
<td class=fig-p2 colspan=2>execute<td>
<td class=fig-p2 colspan=2>execute<td>
<td>. . .
<tr>
<td>time (ms)
<td>50<td>100
<td>150<td>200
<td>250<td>300
<td>350<td>400
<td>450<td>500
</table>
<p>Figure 6.8 shows the static scheduling for a two-processor shared-memory system with a common disk (transfer rate of 2 MByte/second). We assume that both processes access the disk drive at the full transfer rate, e.g. reading and writing entire tracks. Process 1 runs on processor 1 (P1) and process 2 runs on processor 2 (P2). Process 1 reads 100 KByte from the disk into a buffer, takes 100 milliseconds to process them, and writes 100 KByte through a pipe into process 2. Process 2 reads 100 KByte from the pipe, takes another 100 milliseconds to process them, and writes 100 KByte out to disk. In the figure, process 1 starts to read at time 0. All disk activities appear in the bottom row, P1 and P2 show the processor usage, and shaded quadrangles show idle time.
<p>Figure 6.9 shows the fine-grain scheduling mechanism (using FLL) for the same system. We assume that process 1 starts by filling its 100 KByte buffer, but soon after it starts to write to the output pipe, process 2 starts. Both processes run until the buffer is exhausted, at which point process 1 reads from the disk again. After some settling time, depending on the filter used in the locked loop, the stable situation is for the disk to remain continuously active, alternately reading into process 1 and writing from process 2. Both processes will also run continuously, with the smallest buffer that maintains the nominal transfer rate.
<p>The above example illustrates the benefits of fine-grain scheduling policies in parallel processing. In a distributed environment, the analysis is more complicated due to network message overhead and variance. In those situations, calculating statically the optimal scheduling becomes increasingly difficult. We expect the fine-grain scheduling to show increasing usefulness as it adapts to an increasingly complicated environment.
<table class=fig>
<caption>
Figure 6.9: Two Processors, Fine-Grain Scheduling
</caption>
<tr>
<td>disk
<td class=fig-rd>r<td>&nbsp;<td class=fig-rd>r
<td>&nbsp;<td class=fig-rd>r<td class=fig-wr>w
<td class=fig-rd>r<td class=fig-wr>w<td class=fig-rd>r
<td class=fig-wr>w<td class=fig-rd>r<td class=fig-wr>w
<td class=fig-rd>r<td class=fig-wr>w<td class=fig-rd>r
<td class=fig-wr>w<td class=fig-rd>r<td class=fig-wr>w
<td class=fig-rd>r<td class=fig-wr>w<td class=fig-rd>r
<td class=fig-wr>w<td class=fig-rd>r<td class=fig-wr>w
<td class=fig-rd>r<td class=fig-wr>w<td class=fig-rd>r
<td>. . .
<tr>
<td>p1
<td><td colspan=2 class=fig-p1>ex
<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex
<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex
<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex
<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex<td colspan=2 class=fig-p1>ex
<td>. . .
<tr>
<td>p2
<td colspan=3>
<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex
<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex
<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex
<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex<td colspan=2 class=fig-p2>ex
<td>. . .
<tr>
<td>time (ms)
<td colspan=3>50
<td colspan=3>100
<td colspan=3>150
<td colspan=3>200
<td colspan=3>250
<td colspan=3>300
<td colspan=3>350
<td colspan=3>400
<td colspan=3>450
<td>500
</table>
<p>Another application of FLL to distributed systems is clock synchronization. Given some precise external clocks, we would like to synchronize the rest of the machines with the reference clocks. Many algorithms have been published, including a recent probabilistic algorithm by Cristian [10]. Instead of specialized algorithms, we use an FLL to synchronize clocks, where the external clock is the reference frame, the message delays introduce the jitter in the input, and we need to find the right combination of filters to adapt the output to the varying message delays. Since an FLL exhibits integral stability, the clocks will tend to synchronize with the reference once they stabilize. We are currently collecting data on the typical message delay distributions and finding the appropriate filters for them.
<h2>6.5 Summary</h2>
<p>We have generalized scheduling from job assignments as a function of time, to job assignments as a function of any source of interrupts. The generalized scheduling is most useful when we have fine-grain scheduling, which uses frequent state checks and dispatching actions to adapt quickly to system changes. Relevant new applications of the generalized fine-grain scheduling include I/O device management, such as a disk sector interrupt source, and adaptive scheduling, such as real-time scheduling and distributed scheduling.
<p>The implementation of fine-grain scheduling in Synthesis is based on feedback systems, in particular the phase locked loop. Synthesis' fine-grain scheduling policy means adjustments every few hundred microseconds based on local information, such as the number of characters waiting in an input queue. Very low overhead scheduling and context switch for dispatching form the foundation of our fine-grain scheduling mechanism. In addition, we have very low overhead interrupt processing to allow frequent checks on the job progress and quick, small adjustments to the scheduling policy.
<p>There are two main advantages of fine-grain scheduling: quick adjustment to changing situations, and early warning of potential deadline misses. Quick adjustments make better use of system resources, since we avoid queue/buffer overflow and other mismatches between the old scheduling policy and the new situation. Early warning of deadline misses allows real-time application programmers to anticipate a disaster and attempt an emergency recovery before the disaster strikes.
<p>We have only started exploring the many possibilities that generalized fine-grain scheduling offers. Distributed applications stand to benefit from the locked loops, since they can track the input interrupt stream despite jitters introduced by message delays. Concrete applications we are studying include load balancing, distributed clock synchronization, smart caching in memory management and real-time scheduling. To give one example, load balancing in a real-time distributed system can benefit greatly from fine-grain scheduling, since we can detect potential deadline misses in advance; if a job is making poor progress towards its deadline locally, it is a good candidate for migration.
</div>
</body>
</html>

View File

@ -1,293 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a class=here href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>7. Measurements and Evaluation</h1>
<div id="chapter-quote">
15. Everything should be built top-down, except the first time.<br>
-- Alan J. Perlis, Epigrams on Programming
</div>
<h2>7.1 Measurement Environment</h2>
<h3>7.1.1 Hardware</h3>
<p>The current implementation of Synthesis runs on two machines: the Quamachine and the Sony NEWS 1860 workstation. As described in section 1.3.4, the Quamachine is a home-brew, experimental 68030-based computer system designed to aid systems research and measurement. Its measurement facilities include an instruction counter, a memory reference counter, hardware program tracing, and a memory-mapped clock with 20-nanosecond resolution. The processor can operate at any clock speed from 1 MHz up to 50 MHz. Normally it runs at 50 MHz. But by changing the processor speed and introducing waitstates into the main memory access, the Quamachine can closely emulate the performance characteristics of common workstations, simplifying measurements and comparisons. The Quamachine also has special I/O devices that support digital music and audio signal processing: stereo 16-bit analog output, stereo 16-bit analog input, and a compact disc (CD) player digital interface.
<p>The Sony NEWS 1860 is a commercially-available workstation with two 68030 processors. Its architecture is not symmetric. One processor is meant to be the main processor and the other is meant to be the I/O processor. Synthesis tries to treat it as if it were a symmetric multiprocessor, scheduling most tasks on either processor without preference, except those that require something that is accessible from one processor and not the other. While this is not a large number of processors, it nevertheless helps demonstrate Synthesis multiprocessor support. But for measurement purposes of this chapter, only one processor -- the slower I/O processor -- was used. (With the kernel's multiprocessor support kept intact.)
<h3>7.1.2 Software</h3>
<p>A partial emulator for <span class=smallcaps>Unix</span> runs on top of the Synthesis kernel and emulates some of the SUNOS (version 3.5) kernel calls. This provides a direct way of measuring and comparing two otherwise very different operating systems. Since the executables are the same, the comparison is direct. The emulator further demonstrates the generality of Synthesis by setting the lower bound - Synthesis is at least as general as <span class=smallcaps>Unix</span> if it can emulate <span class=smallcaps>Unix</span>. It also helps with the problem of acquiring application software for a new operating system by allowing the use of SUN-3 binaries instead. Although the emulator supports a subset of the <span class=smallcaps>Unix</span> system calls - time constraints have forced an "implement-as-the-need-arises" strategy - the set supported is sufficiently rich to provide a good idea of what the relative times for the basic operations are.
<h2>7.2 User-Level Measurements</h2>
<h3>7.2.1 Comparing Synthesis with SUNOS 3.5</h3>
<p>This section describes a comparison between Synthesis and SUNOS 3.5. The benchmark programs consist of simple loops that exercise a particular system function many times. The source code for the programs is in appendix A. All benchmark programs were compiled on the SUN 3/160, using <em>cc -O</em> under SUNOS release 3.5. The executable <em>a.out</em> was timed on the SUN, then brought over to the Quamachine and executed using the <span class=smallcaps>Unix</span> emulator.
<table class=table>
<caption>
Table 7.1: Measured <span class=smallcaps>Unix</span> System Calls (in seconds)
</caption>
<tr class=head><th rowspan=2>Program<th colspan=4>Raw Sun Data<th rowspan=2>Sun usr+sys<th rowspan=2>Synthesis Emulator<th rowspan=2>Ratio<th rowspan=2>I/O Rate (MB/Sec)
<tr class=head><th>usr<th>sys<th>total<th>watch
<tr><th>1 Compute<td class=number>19.8<td class=number>0.5<td class=number>20<td class=number>20.9<td class=number>20.3<td class=number>21.42<td class=number>0.95<td class=number>-
<tr><th>2 R/W pipe (1)<td class=number>0.4<td class=number>9.6<td class=number>10<td class=number>10.2<td class=number>10.0<td class=number>0.18<td class=number>56.<td class=number>0.1
<tr><th>3 R/W pipe (1024)<td class=number>0.5<td class=number>14.6<td class=number>15<td class=number>15.3<td class=number>15.1<td class=number>2.42<td class=number>6.2<td class=number>8
<tr><th>4 R/W pipe (4096)<td class=number>0.7<td class=number>37.2<td class=number>38<td class=number>38.2<td class=number>37.9<td class=number>9.64<td class=number>3.9<td class=number>8
<tr><th>5 R/W file<td class=number>0.5<td class=number>20.1<td class=number>21<td class=number>23.4<td class=number>20.6<td class=number>2.91<td class=number>7.1<td class=number>6
<tr><th>6 open null/close<td class=number>0.5<td class=number>17.3<td class=number>17<td class=number>17.4<td class=number>17.8<td class=number>0.69<td class=number>26.<td class=number>-
<tr><th>7 open tty/close<td class=number>0.5<td class=number>42.1<td class=number>43<td class=number>43.1<td class=number>42.6<td class=number>0.88<td class=number>48.<td class=number>-
</table>
<p>Ideally, we would want to run both Synthesis and SUNOS on the same hardware. Unfortunately, we could not obtain detailed information about the Sun-3 machine, so Synthesis has not been ported to the Sun. Instead, we closely emulate the hardware characteristics of a Sun-3 machine using the Quamachine. This involves three changes: replace the 68030 CPU with a 68020, set the CPU speed to 16MHz, and introduce one wait-state into the main-memory access. To validate faithfulness of the hardware emulation, the first benchmark program is a compute-bound test. This test program implements a function producing a chaotic sequence.<sup>1</sup> It touches a large array at non-contiguous points, which ensures that we are not just measuring the "in-the-cache" performance. Since it does not use any operating system resources, the measured times on the two machines should be the same.
<div class=footnote><sup>1</sup> Pages 137-138 in Godel, Escher, Bach: An Eternal Golden Braid, by Douglas Hofstadter.</div>
<p>Table 7.1 summarizes the results of the measurements. The columns under "Raw SUN data" were obtained using the <span class=smallcaps>Unix</span> time command and verified with a stopwatch. The SUN was unloaded during these measurements and time reported more than 99% CPU available for them. The columns labeled "usr," "sys," and "total" give the time spent in the user's program, in the SUNOS kernel, and the total elapsed time, as reported by the time command. The column labeled "usr+sys" is the sum of the user and system times, and is the number used for comparisons with Synthesis. The Synthesis emulator data were obtained by using the microsecond-resolution real-time clock on the Quamachine, rounded to hundredths of a second. These times were also verified with stopwatch, sometimes by running each test 10 times to obtain a more easily measured time interval. The column labeled "Ratio" gives the ratio of the preceding two columns. The last column, labeled "I/O Rate", gives the overall Synthesis I/O rate in megabytes per second for those test programs performing I/O.
<p>The first program is a compute-intensive calibration function to validate the hardware emulation.
<p>Programs 2, 3, and 4 write and then read back data from a <span class=smallcaps>Unix</span> pipe in chunks of 1, 1024, and 4096 bytes. Program 2 shows a remarkable speed advantage - 56 times - for the single-byte read/write operations. Here, the low overhead of the Synthesis kernel calls really makes a difference, since the amount of data moved is small and most of the time is spent in overhead. But even as the I/O size grows to the page size, the difference remains significant -- 4 to 6 times. Part of the reason is that the SUNOS overhead is still significant even when amortized over more data. Another reason is the fast synthesized routines that move data across address spaces. The generated code loads words from one address space into registers and stores them back in the other address space. With unrolled loops this achieves the data transfer rate of about 8MB per second.
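<p>The shape of such a synthesized copy routine, written out in C only for illustration (the actual code is generated 68030 instructions, and the names here are invented), is roughly:
<div class=code>
<pre>
/* Rough C rendering of an unrolled word copy; the generated Synthesis
   code does this with 68030 move instructions across address spaces.  */
void copy_words(long *dst, long *src, int nwords)
{
    int i;
    for (i = 0; i + 4 &lt;= nwords; i += 4) {   /* unrolled by four */
        dst[i]   = src[i];
        dst[i+1] = src[i+1];
        dst[i+2] = src[i+2];
        dst[i+3] = src[i+3];
    }
    for (; i &lt; nwords; i++)                  /* remainder        */
        dst[i] = src[i];
}
</pre>
</div>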
<p>Program 5 reads and writes a file (cached in main memory) in chunks of 1K bytes. It too shows a remarkable speed improvement over SUNOS.
<p>Programs 6 and 7 repeatedly open and close /dev/null and /dev/tty. They show that Synthesis kernel code generation is very efficient. The open operations create executable code for later read and write, yet they are 20 to 40 times faster than the <span class=smallcaps>Unix</span> open that does not do code generation. Table 7.3 contains more details of file system operations that are discussed in the next section.
<h3>7.2.2 Comparing Window Systems</h3>
<p>A simple measurement gives an idea of the speed of interactive I/O on various machines running different window systems. We use "cat /etc/termcap" to a TTY window. The local termcap file is 110620 bytes long. The window size is 80 characters wide by 24 lines, using a 16 by 24 pixel font, and with scrollbars enabled.
<table class=table>
<caption>
Table 7.2: Time to "<em>cat /etc/termcap</em>" to a 80*24 TTY window
</caption>
<tr class=head><th>OS, Window System<th>Machine<th>CPU<th>Time (Seconds)
<tr><th>Synthesis<td>Sony NEWS<td>68030, 25mhz<td class=number>2.9
<tr><th><span class=smallcaps>Unix</span>, X11 R5<td>Sony NEWS<td>68030, 25mhz<td class=number>23
<tr><th><span class=smallcaps>Unix</span>, console<td>Sony NEWS<td>68030, 25mhz<td class=number>127
<tr><th>Mach, NextStep<td>NeXT<td>68030, 25mhz<td class=number>55
<tr><th>Mach, NextStep<td>NeXT<td>68040, 25mhz<td class=number>13
<tr><th>SUNOS, X11 R5<td>Sun SparcStation II<td>Sparc<td class=number>6.5
</table>
<p>Table 7.2 summarizes the times taken by the various machines and window systems. There are many good reasons why the other window systems are slow. The Sony console device driver, for example, scrolls the whole screen one line at a time, even when there are several lines of output waiting. The X window system uses RPC to communicate between client and server; no doubt this adds to the overhead. The NextStep window system is based on Postscript, which is overkill for the task at hand.
<p>The point is not to parade Synthesis speed nor justify the others' slowness. It is to point out that such speed is possible through careful thought and program structuring that provides just the right level of abstraction for each application. For example, one application that runs under Synthesis reads music data from the CD player, computes its Fourier transform (1024 point), and displays the result in a window, all in real-time. It displays 88200 data points per second. This is impossible to do today using any other single-processor workstation and operating system because the abstractions provided are too expensive and just plain wrong for this particular task. This is true even though the newer Sparc-based workstations from SUN are more than four times faster than the machine running Synthesis. Section 7.3.3 shows detailed measurements for the Synthesis window system.
<table class=table>
<caption>
Table 7.3: File and Device I/O (in microseconds)
</caption>
<tr class=head><th>Operation<th>Native Time<th><span class=smallcaps>Unix</span> Emulation
<tr><th>emulation trap<td>--<td>2
<tr><th>open /dev/null<td>43<td>49
<tr><th>open /dev/tty<td>62<td>68
<tr><th>open (disk file)<td>73<td>85
<tr><th>close<td>18<td>22
<tr><th>read 1 byte from file<td>9<td>10
<tr><th>read N bytes from file<td>9+N/8<td>10+N/8
<tr><th>read N from /dev/null<td>6<td>8
</table>
<h2>7.3 Detailed Measurements</h2>
<p>The Quamachine's 20-nanosecond resolution memory-mapped clock enables precise measurement of the time taken by each individual system call. To obtain direct timings in microseconds, we surround the system call to be measured with two "read clock" machine instructions and subtract to find the elapsed time.
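<p>The idiom looks roughly like the following C fragment. The clock address is a made-up placeholder, and the 20-nanosecond tick size is the one quoted above.
<div class=code>
<pre>
/* Sketch of the measurement idiom.  The address of the memory-mapped
   clock is a placeholder; the 20 ns tick size is from the text.      */
#define CLOCK (*(volatile unsigned long *)0xF0000000)

unsigned long time_one_write(int fd, char *buf)
{
    unsigned long t0, t1;
    t0 = CLOCK;                /* read clock just before the call */
    write(fd, buf, 1);         /* operation under measurement     */
    t1 = CLOCK;                /* read clock just after           */
    return (t1 - t0) * 20;     /* elapsed time in nanoseconds     */
}
</pre>
</div>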
<h3>7.3.1 File and Device I/O</h3>
<p>Table 7.3 gives the time taken by various file- and device-related I/O operations. It compares the timings measured for the native Synthesis system calls and for the equivalent call in SUNOS emulation mode. For these tests, the Quamachine was running at 25MHz using the 68030 CPU.
<p>Worth noting is the cost of <em>open</em>. The simplest case, <em>open /dev/null</em>, takes 49 microseconds, of which about 70% are used to find the name in the directory structure and 30% for memory allocation and code synthesis to create the null read and write procedures. The additional 19 microseconds in opening /dev/tty come from generating more involved code to read and write the TTY device. Finally, opening a file requires synthesizing more sophisticated code and buffer allocations, costing 17 additional microseconds.
<table class=table>
<caption>
Table 7.4: Low-level Memory Management Overhead (Page Size = 4KB)
</caption>
<tr class=head><th>Operation<th>Time (&#181;s)
<tr><th>Service Translation Fault<td>13.6
<tr><th>Allocate page (pre-zeroed)<td>2.4 + 13.6 = 16.0
<tr><th>Allocate page (needs zeroing)<td>152 + 13.6 = 166
<tr><th>Allocate page (none free; replace)<td>154 + 13.6 + T<sub>replace</sub> = 168 + T<sub>replace</sub>
<tr><th>Copy a page (4 Kbytes)<td>260 + 13.6 = 274
<tr><th>Free page<td>1.6
</table>
<h3>7.3.2 Virtual Memory</h3>
<p>Table 7.4 gives the time taken by various basic operations related to virtual memory. The first row, labeled "Service Translation Fault," gives the time taken to service a translation fault exception. It represents overhead that is always incurred, regardless of the reason for the fault. Translation faults happen whenever a memory reference can not be completed because the address could not be translated. The reasons are manifold: the page is not present, or it is copy-on-write, or it has not been allocated, or that reference is not allowed. This number includes the time taken by the hardware to detect the translation fault, save the machine state, and dispatch to the fault handler. It includes the time taken by the Synthesis fault handler to interpret the saved hardware state, determine the reason for the fault, and dispatch to the correct sub-handler. And it includes the time to re-load the machine state and retry the reference once the sub-handler has fixed the situation.
<p>Subsequent rows give the additional time taken by the various sub-handlers, as a function of the cause of the fault. The numbers are shown in the form "X + 13.6 = Y ," where X is the time taken by the sub-handler alone, and Y the total time including the fault overhead. The second row of the table gives the time to allocate a zeroed page when one already exists. (Synthesis uses idle CPU time to maintain a pool of pre-zeroed pages for faster allocation.) The third row gives the time taken to allocate and zero a free page. If no page is free, one must be replaced, and this cost is given in the fourth row.
<table class=table>
<caption>
Table 7.5: Selected Window System Operations
</caption>
<tr class=head><th>Quaject<th>&#181;s to Create<th>&#181;s to Write
<tr><th>TTY-Cooker<td>27<td>2.3 + 2.1/char
<tr><th>VT-100 terminal emulator<td>532<td>14.2 + 1.3/char
<tr><th>Text window <td>71<td>23.9 + 27.7/char
</table>
<h3>7.3.3 Window System</h3>
<p>A terminal window is composed of a pipeline of three quajects: a TTY-Cooker, a VT100 Terminal Emulator, and a Text-Window. Each quaject has a fixed cost of invocation and a per-character cost that varies depending on the character being processed. These costs are summarized in Table 7.5. The numbers are shown in the form "<em>X + Y/char</em>," where <em>X</em> is the invocation cost and <em>Y</em> the average per-character cost. The average is taken over the characters in <em>/etc/termcap</em>.
<p>The numbers in Table 7.5 can be used to predict the elapsed time for the "<em>cat /etc/termcap</em>" measurement done in Section 7.2.2. Performing the calculation, we get 3.4 seconds if we ignore the invocation overhead and use only the per-character costs (2.1 + 1.3 + 27.7, roughly 31.1&nbsp;&#181;s per character, times 110620 characters). Notice that this exceeds the elapsed time actually observed (Table 7.2). This unexpected result happens because the Synthesis kernel can optimize the data flow, resulting in fewer calls and less actual work than a straight concatenation of the three quajects would indicate. For example, in a fast window system, many characters may be scrolled off the screen between the consecutive vertical scans of the monitor. Since these characters would never be seen by a user, they need not be drawn. The Synthesis window manager bypasses the drawing of those characters by using fine-grained scheduling. It samples the content of the virtual VT100 screen 60 times a second, synchronized to the vertical retrace of the monitor, and draws the parts of the screen that have changed since the last time. This is a good example of how fine-grain scheduling can streamline processing, bypassing I/O that does not affect the visible result. The data is not lost, however. All the data is available for review using the window's scrollbars.
<h3>7.3.4 Other Figures</h3>
<p>Other performance figures at the same level of detail were already given in the previous chapters. In Table 5.2 on page 85, we see that Synthesis kernel threads are lightweight, with less than 20 microsecond creation time; Table 5.3 on page 86 shows that thread context switching is fast. Table 3.4 on page 40 gives the time taken to handle the high-rate interrupts from the Sound-IO devices.
<h2>7.4 Experience</h2>
<h3>7.4.1 Assembly Language</h3>
<p>The current version of Synthesis is written in 68030 macro assembly language. This section reports on the experience.
<p>Perhaps the first question people ask is, "Why is Synthesis written in assembler?" This is soon followed by "How much of Synthesis could be re-written in a high-level language?" and "At what performance loss?".
<p>There are several reasons why assembler language was chosen, some of them research-related, and some of them historical. One reason is that I felt it would be an interesting experiment to write a medium-size system in assembler, which allows unrestricted access to the machine's architecture, and perhaps to discover new coding idioms that have not yet been captured in a higher-level language. Later paragraphs talk about these. Another reason is that much of the early work involved discovering the most efficient way of working with the machine and its devices. It was a fast prototyping language, one in which I could write and test simple I/O drivers without the trouble of supporting a complex language runtime environment.
<p>But perhaps the biggest reason is that in 1984, at the time the seed ideas were being developed, I could not find a good, reliable (bug-free) C compiler for the 68000 processor. I had tried the compilers on several 68000-based <span class=smallcaps>Unix</span> machines and repeatedly found that compilation was slow, that the compilers were buggy, that they produced terrible machine code, and that their runtime libraries were not reentrant. These qualities interfered with my creativity and desire to experiment. Slow compilation dampens the enthusiasm for trying new ideas because the edit-compile-test cycle is lengthened. Buggy compilers make it that much harder to write correct code. Poor code-generation makes my optimization efforts seem meaningless. And non-reentrant runtime libraries make it harder to write a multithreaded kernel that can take advantage of multiprocessor architecture.
<p>Having started coding in assembler, it was easier to continue that way than to change. I had written an extensive library of utilities, including a fully reentrant C-language runtime library and subroutines for music and signal processing. In particular, I found my signal processing algorithms difficult to express in C. To achieve the high performance necessary for real-time operation, I use fixed-point arithmetic for the calculations, not floating-point. The C language provides poor support for fixed-point math, particularly multiply and divide. The Synthesis "printf" output conversion and formatting function provides a stunning example of the performance improvements that result from carefully-coded fixed-point math. This function converts a floating-point number into a fully-formatted ASCII string, <em>1.5 times faster than the machine instruction</em> on the 68882 floating-point coprocessor converts binary floating-point to unformatted BCD (binary-coded decimal).
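<p>For readers unfamiliar with the technique, a 16.16 fixed-point multiply and divide look something like the following C sketch. The format and helper names are examples only; Synthesis does the equivalent in 68030 assembly.
<div class=code>
<pre>
/* Illustrative 16.16 fixed-point multiply and divide.  The format and
   names are examples only, not the representation Synthesis uses.     */
typedef long fix16;           /* 16 integer bits, 16 fraction bits */

fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((long long)a * b) &gt;&gt; 16);   /* keep the full product */
}

fix16 fix16_div(fix16 a, fix16 b)
{
    return (fix16)(((long long)a &lt;&lt; 16) / b);
}
</pre>
</div>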
<p>Overall, the experience has been a positive one. A powerful macro facility helped minimize the difficulty of writing complex programs. The Synthesis assembler macro processor borrows heavily from the C-language macro processor, sharing much of the syntax and semantics. It provides important extensions, including macros that can define macros and quoting and "eval" mechanisms. Quaject definition, for example, is a declarative macro instruction in the assembler. It creates all the code and data structures needed by the kernel code generator, so the programmer need not worry about these details and can concentrate on the quaject's algorithms. Also, the Synthesis assembler (written in C, by the way) assembles 5000 lines per second. Complete system generation takes only 15 seconds. The elapsed time from making a change to the Synthesis source to having a new kernel booted and running is less than a minute. Since the turn-around time is so fast, I am much more likely to try different things.
<p>To my surprise, I found that there are some things that were distinctly easier to do using Synthesis assembler than using C. In many of these, the powerful macro processor played an important role, and I believe that the C language could be usefully improved with this macro processor. One example is the procedure that interprets receiver status code bits in the driver for the LANCE Ethernet controller chip. Interpreting these bits is a little tricky because some of the error conditions are valid only when present in conjunction with certain other conditions. One could always use a deeply-nested if-then-else structure to separate out the cases. It would work and also be quite readable and maintainable. But a jump-table implementation is faster. Constructing this table is difficult and error-prone. So we use macros to do it. The idea is to define a macro that evaluates the jump-address corresponding to a constant status-value passed as its argument. This macro is defined using preprocessor "#if" statements to evaluate the complex conditionals, which is just as readable and maintainable as regular if statements. The jump-table is then constructed by passing this macro to a counting macro which repeatedly invokes it, passing it 0, 1, 2, ... and so on, up to the largest status register value (128).
<p>The VT-100 terminal emulator is another place where assembly language made the job of coding easier. The VT-100 terminal emulator takes as input a buffer of data and interprets it, making changes to the virtual terminal screen. A problem arises when the input buffer runs out while in the middle of processing an escape sequence, for example, one which sets the cursor to an (X,Y ) position on the screen. When this happens, we must save enough state so that processing can resume where it left off when the emulator is called again with more data. Saving the state variables is easy. Saving the position within the program is harder. There is no way to access the program counter from the C language. This is a big problem because the VT-100 emulator is very complex, and there are many places where execution may be suspended. Using C, one must label all these places, and surround the whole piece of code with a huge switch statement to take execution flow to the right place when the function is called again. Using assembly language, this problem does not arise. We can encode the state machine directly, using the different program counter addresses to represent the different states.
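<p>For contrast, here is a sketch of the C workaround just described, not the Synthesis emulator and not the real VT-100 escape grammar (the two-character cursor-position sequence is simplified for illustration): the resume point has to live in an explicit state variable and be re-entered through a switch, where the assembly version simply saves the program counter.
<pre>
enum vt_state { S_TEXT, S_ESC, S_ROW, S_COL };

struct vt {
    enum vt_state state;      /* where to resume when more data arrives */
    int row, col;             /* partially collected arguments          */
};

/* feed a buffer of input; may run out in the middle of an escape sequence */
void vt_feed(struct vt *v, const char *buf, int len)
{
    while (len-- > 0) {
        char c = *buf++;
        switch (v->state) {               /* dispatch on the saved state */
        case S_TEXT: if (c == 033) v->state = S_ESC;   /* else: draw c  */
                     break;
        case S_ESC:  v->state = (c == 'Y') ? S_ROW : S_TEXT;
                     break;
        case S_ROW:  v->row = c - 040; v->state = S_COL;
                     break;
        case S_COL:  v->col = c - 040; v->state = S_TEXT;
                     /* move the cursor to (row, col) here */
                     break;
        }
    }
}
</pre>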
<p>I believe much of Synthesis could be re-written in C, or a C-like high-level language. Modern compilers now have much better code generators, and I feel that performance of the static runtime code would not degrade too much -- perhaps less than 50%. Runtime code-generation could be handled by writing machine instructions into integer arrays and this code would continue to be highly efficient but still unportable. However, with the code generator itself written in a high-level language, porting it might be easier.
<p>I feel that adding a few new features to the C language can simplify the rewriting of Synthesis and help minimize the performance loss. Features I would like to see include:
<ul>
<li>A code-address data type to hold program-counter values, and an expanded "goto" to transfer control to such addresses. State machines in particular can benefit from a "<em>goto a[i]</em>" programming construct; a sketch using a common compiler extension appears below.
<li>A concept of a subroutine within a procedure, analogous to the "<em>jsr...rts</em>" instructions in assembly language. These would allow the language to model the underlying hardware stack directly. They are useful for separating out common blocks of code within a procedure into subroutines, without the argument-passing and procedure-call overhead of ordinary functions, since subroutines implicitly inherit all local variables. Among other things, I have found that LALR(1) context-free parsers can be implemented very efficiently by representing the parser stack on the hardware stack, and using jsr and rts to perform the state transitions.
<li>Better support for fixed-point math. Even an efficient way of obtaining the full 64-bit result from a 32-bit integer multiplication would go a long way in this regard.
</ul>
<p>The inclusion of features like these does <em>not</em> mean that I encourage programmers to write spaghetti-code. Rather, these features are intended to supply the needed hooks for automatic program generators, for example, a state machine compiler, to take maximum benefit of the underlying hardware.
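<p>The first wish on that list exists today as a widely available compiler extension ("labels as values" in GCC and compatible compilers); the sketch below illustrates the construct and is not code from Synthesis.
<pre>
/* a tiny state machine driven by "goto a[i]"; prog holds opcode numbers
   0 = increment, 1 = decrement, 2 = halt */
int run(const int *prog)
{
    static void *const op[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0, pc = 0;

    goto *op[prog[pc]];                   /* the "goto a[i]" construct */
op_inc:  acc++; goto *op[prog[++pc]];
op_dec:  acc--; goto *op[prog[++pc]];
op_halt: return acc;
}
</pre>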
<h3>7.4.2 Porting Synthesis to the Sony NEWS Workstation</h3>
<p>Synthesis was first developed for the Quamachine, and like many substantial software systems, has gone through several revisions. The early kernel had several shortcomings. While the kernel showed impressive speed gains over conventional operating systems such as <span class=smallcaps>Unix</span>, its internal structure was not clean. The quaject structuring idea had come late in kernel development, so there were many parts that had been written in an ad hoc manner. Furthermore, the Quamachine kernel did not support virtual memory or networking.
<p>The goal of the Synthesis port to the Sony workstation was to alleviate the shortcomings, for example, by cleaning up the kernel structure and adding virtual memory and networking support. In particular, we wanted to show that the additional functionality would not significantly slow down the Synthesis kernel. This section reports on the experience and discusses the problems encountered while porting.
<p>The Synthesis port happened in three stages: first, a minimal Synthesis was ported to run under Sony's native <span class=smallcaps>Unix</span>. Then we wrote drivers for the keyboard and screen, and got minimal Synthesis to run on the raw hardware. This was followed by a full port, including all the devices.
<p>The first step went fast, taking two to three weeks. The reason is that most of the quajects do not need to run in kernel mode in order to work. The difference between Synthesis under <span class=smallcaps>Unix</span> and native Synthesis is that instead of connecting the final-stage I/O quajects to I/O device driver quajects (which are the only quajects that must be in the kernel), we connect them to <span class=smallcaps>Unix</span> read and write system calls on appropriately opened file descriptors. This is the ultimate proof that Synthesis services can run at user level as well as in the kernel.
<p>Porting to the raw machine was much harder, primarily because we chose to do our own device drivers. Some problems were caused by incomplete documentation on how to program the I/O devices on the Sony NEWS workstation. It was further complicated by the fact that each CPU has a different mapping of the I/O devices onto memory addresses and not everything is accessible by both CPUs. A simple program was written to patch the running <span class=smallcaps>Unix</span> kernel and install a new system call -- "execute function in kernel mode." Using this utility (carefully!), we were able to examine the running kernel and discover a few key addresses. After a bit more poking around, we discovered how to alter the page mappings so that sections of kernel and I/O memory were directly mapped into all user address spaces.<sup>2</sup> (The mmap system call on /dev/mem did not work.) Then using the Synthesis kernel monitor running on minimal Synthesis under a <span class=smallcaps>Unix</span> process, we were able to "hand access" the remaining I/O devices to verify their address and operation.
<div class=footnote><sup>2</sup> Talk about security holes!</div>
<p>(The Synthesis kernel monitor is basically a C-language parser front-end with direct access to the kernel code generators. It was crucial to both development and porting of Synthesis because it let us run and test sections of code without having the full kernel present. A typical debug cycle goes something like this: using the kernel monitor, we instantiate the quaject we want to test. We create a thread and point it at one of the quaject's callentries. We then single-step the thread and verify that the control flows where it is supposed to.)
<p>But the most difficult porting problems were caused by timing sensitivities in the various I/O devices. Some devices would "freeze" when accessed twice in rapid succession. These problems never showed up in the <span class=smallcaps>Unix</span> code because <span class=smallcaps>Unix</span> encapsulates device access in procedures. Calling a procedure to read a status value or change a control register allows enough time for the device to "recover" from the previous operation. But with code synthesis, device access frequently consists of a single machine instruction. Often the same device is accessed twice in rapid succession by two consecutive instructions, causing the timing problem. Once the cause of the problem was found, it was easy to correct: I made the kernel code generator insert an appropriate number of "nop" instructions between consecutive accesses.
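<p>A minimal sketch of that fix, assuming a hypothetical emitter interface (this is not the actual Synthesis code generator): the emitter remembers which device the previously emitted instruction touched and pads with "nop" when the same device is accessed again.
<pre>
#define NOP 0x4E71                    /* the 68030 nop opcode             */

static unsigned short *emit_ptr;      /* where generated code is written  */
static const void *last_device;       /* device touched by previous instr */

/* emit one device-access instruction, padding with nops if the same
   device was touched by the instruction emitted just before */
void emit_device_access(unsigned short opcode, const void *dev, int recovery)
{
    if (dev == last_device)
        while (recovery-- > 0)
            *emit_ptr++ = NOP;        /* give the device time to recover */
    *emit_ptr++ = opcode;
    last_device = dev;
}
</pre>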
<p>Once we had the minimal kernel running, getting the rest of the kernel and its associated libraries working was relatively easy. All of the code that did not involve the I/O devices ran without change. This includes the user-level shared runtime libraries, such as the C functions library and the signal-processing library. It also includes all the "intermediate" quajects that do not directly access the machine and its I/O devices, such as buffers, symbol tables (for name service), and mappers and translators (for file system mapping). Code involving I/O devices was harder, since that required writing new drivers. Finally, there are some unfinished drivers such as the SCSI disk driver.
<p>The thread system needed some changes to support the two CPUs on the Sony workstation; these were discussed in Chapter 5. Most of the changes were in the scheduling and dispatching code, to synchronize between the processors. This involved developing efficient, lock-free data structures which were then used to implement the algorithms. The scheduling policy was also changed from a single round-robin queue to one that uses a multiple-level queue structure. This helped guarantee good response time to urgent events even when there are many threads running, making it feasible to run thousands of threads on Synthesis.
<p>The most time-consuming part was implementing the new services: virtual memory, Ethernet driver, and window system. They were all implemented "from scratch," using all the performance-improving ideas discussed in this dissertation, such as kernel code generation. The measurements in this chapter show high performance gains in these areas as well. The Ethernet driver, for example, is fast enough to record all the packet traffic of a busy Ethernet (400 kilobytes/second, or about 3.2 megabits per second) into RAM using only 20% of a 25MHz, 68030 CPU's time. This is a problem that has been worked on and dismissed as impractical except when using special hardware.
<p>Besides the Sony workstation, the new kernel runs on the Quamachine as well. Of course, each machine must use the appropriate I/O drivers, but all the new services added to the Sony version work on the Quamachine.
<h3>7.4.3 Architecture Support</h3>
<p>Having worked very close to the hardware for so long, I have acquired some insight into what kinds of things would be useful for better operating systems support in future CPUs. Rather than pour out everything I ever thought useful for a machine to have, I will keep my suggestions to those that fit reasonably well with the "RISC" idea of processor design.
<ul>
<li>Better cache control to support runtime code generation. Ideally, I would like to see fully coherent instruction caches. But I recognize the expense involved, both in silicon area and degraded signal propagation times. Still, full coherence is probably not necessary. A cheap, non-privileged instruction to invalidate changed cache lines provides very good support at minimal cost for both hardware and code-modifying software. After all, if you've just modified an instruction, you know its address, and it is easy to issue a cache-line invalidate on that address.
<li>Faster interrupt handling. Chapter 6 discussed the advantages of fine-grained handling of computation, particularly when it comes to interrupts. Further benefits result from also reducing the hardware-imposed overhead of interrupt handling. Perhaps this can be achieved at not-too-great expense by replicating the CPU pipeline registers, much like register windows enable much faster procedure calls. I expect even a single level of duplication to really help, if we assume that interrupts are handled fast enough that the chances are small of receiving a second interrupt in the middle of processing the first.
<li>Hardware support for lock-free synchronization. Chapter 5 discussed the virtues of lock-free synchronization. But lock-free synchronization requires hardware support in the form of machine instructions that are more powerful than the test-and-set instruction used to implement locking. I have found that double-word Compare-&amp;-Swap is sufficient to implement an operating system kernel, and I conjecture that single-word Compare-&amp;-Swap is too weak. There may be other kinds of instructions that also work. (A sketch of the software side of this appears after this list.)
<li>Hardware support for fast context switching. As processors become faster and more complex, they have increasing amounts of state that must be saved and restored on every context switch. Earlier sections had discussed the cost of switching the floating-point context, which is high because of the large amount of data that must be moved: 8 registers, each 96 bits long, requires 24 memory cycles to save them, and another 24 cycles to re-load them. Newer architectures, for example, one that supports hardware matrix multiply, can have even more state. I claim that a lot of this state does not change between switch-in and switch-out. I propose hardware support to efficiently save and restore only the part of the state that was used: a modified-bit on each register, and selective disabling of hardware function units. Modified-bits on each register lets the operating system save only those registers that have been changed since switch-in. Selective disabling of function units lets the operating system defer loading that unit's state until it is needed. If a functional unit goes unused between switch-in and the subsequent switch-out, its state will not have been loaded nor saved.
<li>Faster byte-operations. Many I/O-related functions tend to be byte-oriented, whereas the CPU and memory tend to be word-oriented. This means it costs no more to fetch a full 32-bit word than it does to fetch a byte. We can take advantage of this with two new instructions: "load-4-bytes" and "store-4-bytes". These would move a word from memory into four registers, one byte to a register. The program can then operate on the four bytes in registers without referencing memory again. Another suggestion, probably less useful, is a "carry-suppress" option for addition, to suppress carry-out at byte-boundaries, allowing four additions or subtractions to take place simultaneously on four bytes packed into a 32-bit integer. I foresee the primary use of this to be in low-level graphics routines that deal with 8-bit pixels.
<li>Improved bit-wise operation support. The current complement of bitwise-logical operations and shifts is already pretty good; what is lacking is a perfect shuffle of bits in a register. This is very useful for bit-mapped graphics operations, particularly things like bit-matrix transpose, which is heavily used when unpacking byte-wide pixels into separate bit-planes, as is required by certain framebuffer architectures.
</ul>
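<p>As an illustration of the kind of primitive meant in the synchronization item above, here is a sketch of lock-free insertion using single-word compare-and-swap only; it is not the Synthesis code, which relies on the 68030's one- and two-word CAS/CAS2 instructions (the two-word form pairs a pointer with a version count, which is what makes safe removal practical).
<pre>
struct node { struct node *next; int value; };

/* stand-in for an atomic compare-and-swap; a real kernel would use the
   processor's CAS instruction here -- as written, this version is NOT
   atomic and only illustrates the calling pattern */
static int cas(struct node **addr, struct node *expected, struct node *desired)
{
    if (*addr != expected)
        return 0;
    *addr = desired;
    return 1;
}

/* lock-free insertion at the head of a list: build the link, then try to
   swing the head pointer; if another CPU got there first, retry */
void push(struct node **top, struct node *n)
{
    struct node *old;
    do {
        old = *top;
        n->next = old;
    } while (!cas(top, old, n));
}
</pre>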
<h2>7.5 Other Opinions</h2>
<p>In any line of research, there are often significant differences of opinion over what assumptions and ideas are good ones. Synthesis is no exception, and it has its share of critics. I feel it is my duty to point out where differences of opinion exist, to allow readers to come to their own conclusions. In this section, I try to address some of the more frequently raised objections regarding Synthesis, and rebut those that are, in my opinion, ill-founded.
<blockquote>Objection 1: "How much of the performance improvement is due to my ideas, and how much is due to writing in assembler, and tuning the hell out of the thing?"</blockquote>
<p>This is often asked by people who believe it to be much more of the latter and much less of the former.
<p>Section 3.3 outlined several places in the kernel where code synthesis was used to advantage. For data movement operations, it showed that code synthesis achieves 1.4 to 2.4 times better performance than the best assembly-language implementation not using code synthesis. For more specialized operations, such as context switching, code synthesis delivers as much as 10 times better performance. So, in a terse answer to the question, I would say "40% to 140%".
<p>But those figures do not tell the whole story. They are detailed measurements, designed to compare two versions of the same thing, in the same execution environment. Missing from those measurements is a sense of how the interaction between larger pieces of a program changes when code synthesis is used. For example, in that same section, I show that a procedural implementation of "putchar" using code synthesis is slightly faster than the C-language "putchar" macro, which is in-line expanded into the user's code. The fact that enough savings could be had through code synthesis to more than amortize the cost of a procedure call -- even in a simple, not-easily-optimized operation such as "putchar" -- changes the nature of how data is passed between modules in a program. Many modules that process streams of data are currently written to take as input a buffer of data and produce as output a new buffer of data. Chaining several such modules involves calling each one in turn, passing it the previous module's output buffer as the input. With a fast "putchar" procedure, it is no longer necessary to pass buffers and pointers around; we can now pass the address of the downstream module for "putchar," and the address of the upstream module for "getchar." Each module makes direct calls to its neighbors to get the data, eliminating the memory copy and all consequent pointer and counter manipulations.
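<p>A small sketch of that direct-call style (an illustration, not Synthesis code): the filter stage is handed its downstream "putchar" and calls it once per character, so no intermediate buffer, pointer, or counter ever exists between the modules.
<pre>
#include <stdio.h>
#include <ctype.h>

typedef void put_fn(int c);

/* the final consumer: draws (here: prints) one character */
static void tty_put(int c) { putchar(c); }

/* a filter stage: upcases its input and hands each character directly to
   the downstream module's "putchar" -- no buffer is built or copied */
static void upcase_stage(const char *src, put_fn *put)
{
    while (*src)
        put(toupper((unsigned char)*src++));
}

int main(void)
{
    upcase_stage("hello, world\n", tty_put);  /* stages linked by call, not by buffer */
    return 0;
}
</pre>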
<blockquote>Objection 2: "Self-modifying data structures are troublesome on pipelined machines, and code generation has problems with machines that don't allow fine-grained control of the instruction cache. In other words, Synthesis techniques are dependent on hardware features that aren't present in all machines, and, worse, are becoming increasingly scarce."</blockquote>
<p>Pipelined machines pose no special difficulties because Synthesis does not modify instructions ahead of the program counter. Code modification, when it happens, is restricted to patching just-executed code, or unrelated code. In both cases, even a long instruction pipeline is not a problem.
<p>The presence of a non-coherent and hard-to-flush instruction cache is the harder problem. By "hard-to-flush," I mean a cache that must be flushed whole instead of line-at-a-time, or one that cannot be flushed in user mode without taking a protection exception. Self-modifying code is still effective, but such a cache changes the breakeven point when it becomes more economical to interpret data than to modify code. For example, conditions that change frequently are best represented using a boolean flag, as is usually done. But for conditions that are tested much more frequently than changed, code modification remains the method of choice. The cost of flushing the cache determines at what ratio of testing to modification the decision is made.
<p>Relief may come from advances in the design of multiprocessors. Recent studies show that, for a wide variety of workloads, software-controlled caches are nearly as effective as fully coherent hardware caches and much easier to build, as they require no hardware [23] [2]. Further extensions to this idea stem from the observation that full coherency is often not necessary, and that it is beneficial to rely on the compiler to maintain coherency in software only when required [2]. This line of thinking leads to cache designs that have the necessary control to efficiently support code-modifying programs.
<p>But it is true that the assumption that code is read-only is increasingly common, and that hardware designs are more and more using this assumption. Hardware manufacturers design according to the needs of their market. Since nobody is doing runtime code generation, it is little wonder that it is not well supported. But then, isn't this what research is for? To open people's eyes and to point out possibilities, both new and overlooked. This dissertation points out certain techniques that increase performance. It happens that the techniques are unusual, and make demands of the hardware that are not commonly made. But just as virtual memory proved to be a useful idea and all new processors now support memory management, one can expect that if Synthesis ideas prove to be useful, they too will be better supported.
<blockquote>Objection 3: "Does this matter? Hardware is getting faster, and anything that is slow today will probably be fast enough in two years."</blockquote>
<p>Yes, it matters!
<p>There is more to Synthesis than raw speed. Cutting the cost of services by a factor of 10 is the kind of change that can fundamentally alter the structure of those services. One example is the PLL-based process scheduling. You couldn't do that if context switch was expensive -- driving the time way below one millisecond is what made it possible to move to a radically different scheduler, with nice properties, besides speed.
<p>For another example, I want to pose a question: if threads were as cheap as procedure calls, what would you do with them? One answer is found in the music synthesizer applications that run on Synthesis. Most of them create a new thread for every note! Driving the cost of threads to within a few factors of the cost of procedure call changes the way applications are structured. The programmer now only needs to be concerned that the waveform is synthesized correctly. The Synthesis thread scheduler ensures that each thread gets enough CPU time to perform its job. You could not do that if threads were expensive.
<p>Finally, hardware may be getting faster, but it is not getting faster fast enough. Look at the window-system figures given in Table 7.2. Synthesis running on 5-year-old hardware technology outperforms conventional systems running on the latest hardware. Even with faster hardware, it is not fast enough to overtake Synthesis.
<blockquote>Objection 4: "Why is Synthesis written in assembler? How much of the reason is that you wanted no extraneous instructions? How much of the reason is that code synthesis requires assembler? How much of Synthesis could be re-written in a high-level language?"</blockquote>
<p>Section 7.4.1 answers these questions in detail.
</div>
</body>
</html>

View File

@ -1,119 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="nav">
<a class=home href="../index.html">Alexia's Home</a>
<a href="index.html">Dissertation</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a class=here href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<h1>8. Conclusion</h1>
<div id="chapter-quote">
A dissertation is never finished.<br>
You just stop writing.<br>
-- Everyone with a Ph.D.
</div>
<p>This dissertation has described Synthesis, a new operating system kernel that provides fundamental services an order of magnitude more efficiently than traditional operating systems.
<p>Two options define the direction in which research of this nature may proceed. Firstly, an existing system may be adopted as a platform upon which incremental development may take place. Studying a piece of an existing system limits the scope of the work, ensures that one is never far from a functioning system that can be measured to guide development, and secures a preexisting base of users, upon completion. On the down side, such an approach may necessarily limit the amount of innovation and creativity brought to the process and possibly carry along any preexisting biases built in by the originators of the environment, reducing the impact the research might have on improving overall performance.
<p>Alternatively one can start anew. Such an effort removes the burden of preexisting decisions and tradeoffs, and allows use of knowledge and hindsight acquired from past systems to avoid making the same mistakes. The danger, however, is of making new, possibly fatal mistakes. In addition, so much valuable time can be spent building up the base and re-inventing wheels, that little innovation takes place.
<p>I have chosen the second direction. I felt that the potential benefits of an important breakthrough far outweighed the dangers of failure. Happily, I believe that a positive outcome may be reported. I would like to summarize both the major contributions and shortcomings of this effort.
<p>A basic assumption of this research effort has been that low overhead and low latency are important properties of an operating system. Supporting this notion is the prediction that as distributed computing becomes ubiquitous, responsiveness and overall performance will suffer at the hands of the high overhead and latency of current systems. Advances in networking technology, impressive as they are, will bear little fruit unless operating systems software is efficient enough to make full use of the higher bandwidths. Emerging application areas such as interactive sound, video, and the future panoply of interface technologies subsumed under the umbrella of "multi-media" place strict timing requirements on operating system services -- requirements that existing systems have difficulty meeting, in part, because of their high overhead and lack of real-time support.
<p>The current leading suggestion to address the performance problems is to move function out of the kernel, thus avoiding crossing the kernel boundary and allowing customization of traditional kernel services to individual applications. Synthesis shows that it is not necessary to accept that kernel services will be slow, and to find work-arounds to them, but rather that it is possible to provide very efficient kernel services. This is important, because ultimately communications with the outside world still must go through the kernel.
<p>With real-time support and an overhead factor ten times less than that of other systems, Synthesis may be considered a resounding success. Four key performance-improving dynamics differentiate Synthesis:
<ul>
<li>Large scale use of run-time code generation.
<li>Quaject-oriented kernel structure.
<li>Lock-free synchronization.
<li>Feedback-based process scheduling.
</ul>
<p>Synthesis constitutes the first large-scale use of run time code generation to specifically improve operating system performance. Chapter 3 demonstrates that common operating system functions run five times faster when implemented using runtime generated code than a typical C language implementation, and nearly ten times faster when compared with the standard <span class=smallcaps>Unix</span> implementation. The use of run time code generation not only improves the performance of existing services, but allows for the addition of new services without incremental systems overhead.
<p>Further differentiating Synthesis is its novel kernel structure, based around "quajects," forming the building blocks of all kernel services. In many respects, quajects resemble the objects of traditional Object-Oriented programming, including data encapsulation and abstraction. Quajects differ, however, in four important ways:
<ul>
<li>A procedural rather than message-based interface.
<li>Explicit declaration of exceptions and external calls.
<li>Runtime binding of the external calls, and
<li>Implementation using runtime code generation.
</ul>
<p>By making explicit the quaject's exceptions and external calls, the kernel may dynamically link quajects. Rather than providing services monolithically, Synthesis builds them through the use of one or more quajects eventually comprising the user's thread. This binding takes place dynamically, at runtime, allowing for both the expansion of existing services and for an enhanced capability for creating new ones. The traditional distinction between kernel and user services becomes blurred, allowing for applications' direct participation in the delivery of services. This is possible because a quaject's interface is extensible across the protection boundaries which divide applications from the kernel and from each other. Such an approach enjoys a further advantage: preserving the partitioning and modularity found in traditional systems centered around user-level servers, while bettering the performance of the monolithic kernels, which, while fast, are often difficult to understand and modify.
<p>The code generation implementation and procedural interface of quajects enhance performance by reducing argument passing and enabling in-line expansion of called quajects into their caller to happen at runtime. Quaject callentries, for example, require no "self" parameter, since it is implicit in their runtime-generated code. This shows, through quajects, that a highly efficient object-based system is possible.
<p>A further research contribution of Synthesis is to demonstrate that lock-free synchronization is a viable, efficient alternative to mutual exclusion for the implementation of multiprocessor kernels. Mutual exclusion and locking, the traditional forms of synchronization, suffer from deadlock and priority inversion. Lock-free synchronization avoids these problems. But until now, there has been no evidence that a sufficiently rich set of concurrent data structures could be implemented efficiently enough using lock-free synchronization to support a full operating system. Synthesis successfully implements a sufficient number of concurrent, lock-free data structures using one- and two-word Compare-&amp;-Swap instructions. The kernel is then carefully structured using only those data structures in the places where synchronization is needed. The lock-free concurrent data structures are then demonstrated to deliver better performance than locking-based techniques, further supporting the case for hardware that provides Compare-&amp;-Swap.
<p>New scheduling algorithms have been presented which generalize scheduling from job assignments as a function of time, to functions of data flow and interrupt rates. The algorithms are based upon feedback, drawing from control systems theory. The applications for these algorithms include support for real-time data streams and improved support for dealing with the flow of time. These applications have been illustrated by numerous descriptions of real-time sound and signal processing programs, a disk-sector finder program, and a discussion of clock synchronization.
<p>It is often said that good research raises more questions than it answers. I now explore some open questions, point out some of Synthesis' shortcomings, and suggest directions for future work.
<p>Clearly, we need better understanding of how to write programs that create code at run time. A more formal model of what it is and how it is used would be helpful in extending its applicability and in finding ways to provide a convenient interface to it.
<p>Subsidiary to this, a good cost/benefit analysis of runtime code generation is lacking. Because Synthesis is the first system to apply run time code generation on a large-scale basis, a strategic goal has simply been to get it to work and show that performance benefits do exist. Accordingly, the emphasis has been in areas where the benefits have been deemed to be greatest and where code generation was easiest to do, both in terms of programming difficulty and in CPU cycles. Intuition has been an important guide in the implementation process, resulting in an end product which performs well. What is not known is how much more improvement is possible. Perhaps applying runtime code generation more vigorously or structuring things in a different way will yield even greater benefits.
<p>Unfortunately, there is no high-level language available that makes programs using run time code generation easy to write and, at the same time, portable. Aside from the obvious benefit of making the technique more accessible to all members of the profession, a better understanding of the benefits of runtime code generation will surely accrue from developing such a language.
<p>An interesting direction to explore is to extend the ideas of runtime code generation to runtime reconfigurable hardware. Chips now exist whose function is "programmed" by downloading strings of bits that configure the internal logic function blocks and the routing of signals between blocks. Although the chips are generally programmed once, upon initialization, they could be reprogrammed at other times, optimizing the hardware as the environment changes. Some PGAs could be set aside for computation purposes: functions such as permuting bit vectors can be implemented much more efficiently with PGA hardware than in software. Memory operations, such as a fast memory-zero or fast page copy could be implemented operating asynchronously with the main processor. As yet unanticipated functions could be configured as research identifies the need. A machine architecture is envisaged having no I/O device controllers at all -- just a large array of programmable gate array (PGA) chips wired to the processor and to various forms of I/O connectors. Clearly, the types of I/O devices which the machine supports are a function of the bit patterns loaded into its PGAs, rather than the boards which alternatively would be plugged into its backplane. This is highly advantageous, for as new devices need to be supported, there is no need for new boards and the attendant expense and delay of acquiring them.
<p>Currently, under Synthesis, users cannot define their own services. Quaject composition is a powerful mechanism to define and implement kernel services, but this power has not yet been made accessible to the end user. At present, all services that exist do so because located somewhere in the kernel is a piece of code which knows which quajects to create and how to link them in order to provide each service. It would be better if this were not hard coded into the kernel, but made user-accessible via some sort of service description language. To support such a language, the quaject type system would need to be extended to provide runtime type checking, which is currently lacking.
<p>Another open question concerns the generality of lock-free synchronization. Lock-free synchronization has many desirable properties as discussed in this dissertation. Synthesis has demonstrated that lock-free synchronization is sufficient for the implementation of an operating system kernel. "Is this accomplished at the expense of required generality" is a question in need of an answer. Synthesis isolates all synchronization to a handful of concurrent data structures which have been shown to have an efficient lock-free implementation. Nonetheless, when lock-free data structures are used to implement systems policy, a loss of generality or efficiency may result. Currently, Synthesis efficiently supports a scheduling policy only if it has an efficient lock free implementation. One approach to this issue is to add to the list of efficient lock-free data structure implementations, thereby expanding the set of supportable policies. Another research direction is to determine when other synchronization methods are necessary so that a policy may be followed literally, but also when the policy can be modified to fit an efficient lock-free implementation. In addition, determining how best to support processors without a Compare-&amp;-Swap instruction would be valuable.
<p>The behavior of feedback-based, fine grained scheduling has not been fully explored. When the measurements and adjustments happen at regular intervals, the schedule can be modeled as a linear discrete time system and Z-transforms used to prove stability and convergence. In the general case, measurements and adjustments can occur at irregular intervals because they are scheduled as a function of previous measurements. It is not known whether this type of scheduler is stable for all work load conditions. While empirical observations of real-time signal processing applications indicate that the scheduler is stable under many interesting, real-life workloads, it would be nice if this could be formally proven.
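<p>For the regular-interval case mentioned above, the essence of such a loop can be shown in a few lines (a sketch of the general first-order update, not the Synthesis scheduler): with a constant gain the error shrinks by a factor of (1 - gain) each interval, so the loop converges exactly when the gain lies between 0 and 2; small gains filter noise, gains near 2 react quickly but ring.
<pre>
/* one step of a first-order software feedback loop: nudge the running
   estimate toward the value just measured */
double track(double estimate, double measured)
{
    const double gain = 0.25;          /* an assumed value, not from Synthesis */
    return estimate + gain * (measured - estimate);
}
</pre>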
<p>The current 68030 assembly language implementation limits Synthesis' portability. The amount of code is not inordinately large (20,000 lines) and much of it is macro invocations, rather than machine instructions, so an assembler-level port might not be nearly as difficult as it might first appear. A high level language implementation would be better. An open question is the issue of runtime code generation. While one could create code which inserts machine opcodes into memory, the result would be no more portable than the current assembly language version. A possible approach to this problem would be to abstract as many runtime code generation techniques as possible in simple, machine-independent extensions to an existing programming language such as C. Using the prototypic language and making performance comparisons along the way would go far toward identifying which direction the final language should take.
<p>While Synthesis holds out enormous promise, its readiness for public release is retarded by the following factors:
<ul>
<li>Known bugs need to be removed.
<li>The window system is incomplete, and lacks many 2-D graphics primitives and mouse tracking.
<li>The virtual memory model is not fully developed, and the pager interface should be exported to the user in order to enhance its utility.
<li>The network driver works, but no protocols have as yet been implemented.
</ul>
<p>All of these enhancements can be made without risk to either the measurements presented in this dissertation, or to the speed and efficiency of the primitive kernel. My confidence rests partly on the fact that the significant execution paths have been anticipated and measured, and partly on past experience, when the much more significant functionality of multiprocessor support, virtual memory, ethernet driver, and the window system were added to the then primitive kernel without slowing it down.
<p>I want to conclude by emphasizing that although this work has been done in the context of operating systems, the ideas presented in this dissertation -- runtime code generation, quaject structuring, lock-free methods of synchronization, and scheduling based on feedback -- can all be applied equally well to improving the performance, structuring, and robustness of user-level programs. The open questions, particularly those regarding runtime code generation, may make this difficult at times; nevertheless the potential is there.
<p>While countless philosophers throughout Western Civilization have all proffered advice against the practice of predicting the future, most have failed to resist the temptation. While computer scientists shall most likely fare no better at this art, I believe that Synthesis brings to the surface an operating system of such elegance and efficiency as to accelerate serious consideration and development of multiple microprocessor machine environments, particularly those allied with multi-media and communications. In short, Synthesis and the concepts upon which it rests are not likely to be eclipsed by any major occurrence on the event horizon of technology any time soon.
</div>
</body>
</html>

View File

@ -1,196 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN">
<html>
<head>
<title>Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract</title>
<link rel="stylesheet" type="text/css" href="../css/style.css">
<link rel="stylesheet" type="text/css" href="style.css">
<!-- CUSTOM STYLE GOES HERE -->
</head>
<body>
<div id="nav">
<a href="index.html">Title</a>
<a href="abs.html">Abstract</a>
<a href="ack.html">Acknowledgements</a>
<a class=here href="toc.html">Contents</a>
<a href="ch1.html">Chapter 1</a>
<a href="ch2.html">Chapter 2</a>
<a href="ch3.html">Chapter 3</a>
<a href="ch4.html">Chapter 4</a>
<a href="ch5.html">Chapter 5</a>
<a href="ch6.html">Chapter 6</a>
<a href="ch7.html">Chapter 7</a>
<a href="ch8.html">Chapter 8</a>
<a href="bib.html">Bibliography</a>
<a href="app-A.html">Appendix A</a>
</div>
<div id="running-title">
Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract
</div>
<div id="content">
<div class=toc>
<h1>Contents</h1>
<ul>
<li><a href="ch1.html">1. Introduction</a>
<ul>
<li><a href="ch1.html">1.1 Purpose</a>
<li><a href="ch1.html">1.2 History and Motivation</a>
<li><a href="ch1.html">1.3 Synthesis Overview</a>
<ul>
<li><a href="ch1.html">1.3.1 Kernel Structure</a>
<li><a href="ch1.html">1.3.2 Implementation Ideas</a>
<li><a href="ch1.html">1.3.3 Implementation Language</a>
<li><a href="ch1.html">1.3.4 Target Hardware</a>
<li><a href="ch1.html">1.3.5 <span class=smallcaps>Unix</span> Emulator</a>
</ul>
</ul>
<li><a href="ch2.html">2. Previous Work</a>
<ul>
<li><a href="ch2.html">2.1 Overview</a>
<li><a href="ch2.html">2.2 The Tradeoff Between Throughput and Latency</a>
<li><a href="ch2.html">2.3 Kernel Structure</a>
<ul>
<li><a href="ch2.html">2.3.1 The Trend from Monolithic to Diffuse</a>
<li><a href="ch2.html">2.3.2 Services and Interfaces</a>
<li><a href="ch2.html">2.3.3 Managing Diverse Types of I/O</a>
<li><a href="ch2.html">2.3.4 Managing Processes</a>
</ul>
</ul>
<li><a href="ch3.html">3. Kernel Code Generator</a>
<ul>
<li><a href="ch3.html">3.1 Fundamentals</a>
<li><a href="ch3.html">3.2 Methods of Runtime Code Generation</a>
<ul>
<li><a href="ch3.html">3.2.1 Factoring Invariants</a>
<li><a href="ch3.html">3.2.2 Collapsing Layers</a>
<li><a href="ch3.html">3.2.3 Executable Data Structures</a>
<li><a href="ch3.html">3.2.4 Performance Gains</a>
</ul>
<li><a href="ch3.html">3.3 Uses of Code Synthesis in the Kernel</a>
<ul>
<li><a href="ch3.html">3.3.1 Buffers and Queues</a>
<li><a href="ch3.html">3.3.2 Context Switches</a>
<li><a href="ch3.html">3.3.3 Interrupt Handling</a>
<li><a href="ch3.html">3.3.4 System Calls</a>
</ul>
<li><a href="ch3.html">3.4 Other Issues</a>
<ul>
<li><a href="ch3.html">3.4.1 Kernel Size</a>
<li><a href="ch3.html">3.4.2 Protecting Synthesized Code</a>
<li><a href="ch3.html">3.4.3 Non-coherent Instruction Cache</a>
</ul>
<li><a href="ch3.html">3.5 Summary</a>
</ul>
<li><a href="ch4.html">4. Kernel Structure</a>
<ul>
<li><a href="ch4.html">4.1 Quajects</a>
<ul>
<li><a href="ch4.html">4.1.1 Quaject Interfaces</a>
<li><a href="ch4.html">4.1.2 Creating and Destroying Quajects</a>
<li><a href="ch4.html">4.1.3 Resolving References</a>
<li><a href="ch4.html">4.1.4 Building Services</a>
<li><a href="ch4.html">4.1.5 Summary</a>
</ul>
<li><a href="ch4.html">4.2 Procedure-Based Kernel</a>
<ul>
<li><a href="ch4.html">4.2.1 Calling Kernel Procedures</a>
<li><a href="ch4.html">4.2.2 Protection</a>
<li><a href="ch4.html">4.2.3 Dynamic Linking</a>
</ul>
<li><a href="ch4.html">4.3 Threads of Execution</a>
<ul>
<li><a href="ch4.html">4.3.1 Execution Modes</a>
<li><a href="ch4.html">4.3.2 Thread Operations</a>
<li><a href="ch4.html">4.3.3 Scheduling</a>
</ul>
<li><a href="ch4.html">4.4 Input and Output</a>
<ul>
<li><a href="ch4.html">4.4.1 Producer/Consumer</a>
<li><a href="ch4.html">4.4.2 Hardware Devices</a>
</ul>
<li><a href="ch4.html">4.5 Virtual Memory</a>
<li><a href="ch4.html">4.6 Summary</a>
</ul>
<li><a href="ch5.html">5. Concurrency and Synchronization</a>
<ul>
<li><a href="ch5.html">5.1 Synchronization in OS Kernels</a>
<ul>
<li><a href="ch5.html">5.1.1 Disabling Interrupts</a>
<li><a href="ch5.html">5.1.2 Locking Synchronization Methods</a>
<li><a href="ch5.html">5.1.3 Lock-Free Synchronization Methods</a>
<li><a href="ch5.html">5.1.4 Synthesis Approach</a>
</ul>
<li><a href="ch5.html">5.2 Lock-Free Quajects</a>
<ul>
<li><a href="ch5.html">5.2.1 Simple Linked Lists</a>
<li><a href="ch5.html">5.2.2 Stacks and Queues</a>
<li><a href="ch5.html">5.2.3 General Linked Lists</a>
<li><a href="ch5.html">5.2.4 Lock-Free Synchronization Overhead</a>
</ul>
<li><a href="ch5.html">5.3 Threads</a>
<ul>
<li><a href="ch5.html">5.3.1 Scheduling and Dispatching</a>
<li><a href="ch5.html">5.3.2 Thread Operations</a>
<li><a href="ch5.html">5.3.3 Cost of Thread Operations</a>
</ul>
<li><a href="ch5.html">5.4 Summary</a>
</ul>
<li><a href="ch6.html">6. Fine-Grain Scheduling</a>
<ul>
<li><a href="ch6.html">6.1 Scheduling Policies and Mechanisms</a>
<ul>
<li><a href="ch6.html">6.2.1 Hardware Phase Locked Loop</a>
<li><a href="ch6.html">6.2.2 Software Feedback</a>
<li><a href="ch6.html">6.2.3 FLL Example</a>
<li><a href="ch6.html">6.2.4 Application Domains</a>
</ul>
<li><a href="ch6.html">6.3 Uses of Feedback in Synthesis</a>
<ul>
<li><a href="ch6.html">6.3.1 Real-Time Signal Processing</a>
<li><a href="ch6.html">6.3.2 Rhythm Tracking and The Automatic Drummer</a>
<li><a href="ch6.html">6.3.3 Digital Oversampling Filter</a>
<li><a href="ch6.html">6.3.4 Discussion</a>
</ul>
<li><a href="ch6.html">6.4 Other Applications</a>
<ul>
<li><a href="ch6.html">6.4.1 Clocks</a>
<li><a href="ch6.html">6.4.2 Real-Time Scheduling</a>
<li><a href="ch6.html">6.4.3 Multiprocessor and Distributed Scheduling</a>
</ul>
<li><a href="ch6.html">6.5 Summary</a>
</ul>
<li><a href="ch7.html">7. Measurements and Evaluation</a>
<ul>
<li><a href="ch7.html">7.1 Measurement Environment</a>
<ul>
<li><a href="ch7.html">7.1.1 Hardware</a>
<li><a href="ch7.html">7.1.2 Software</a>
</ul>
<li><a href="ch7.html">7.2 User-Level Measurements</a>
<ul>
<li><a href="ch7.html">7.2.1 Comparing Synthesis with SUNOS 3.5</a>
<li><a href="ch7.html">7.2.2 Comparing Window Systems</a>
</ul>
<li><a href="ch7.html">7.3 Detailed Measurements</a>
<ul>
<li><a href="ch7.html">7.3.1 File and Device I/O</a>
<li><a href="ch7.html">7.3.2 Virtual Memory</a>
<li><a href="ch7.html">7.3.3 Window System</a>
<li><a href="ch7.html">7.3.4 Other Figures</a>
</ul>
<li><a href="ch7.html">7.4 Experience</a>
<ul>
<li><a href="ch7.html">7.4.1 Assembly Language</a>
<li><a href="ch7.html">7.4.2 Porting Synthesis to the Sony NEWS Workstation</a>
<li><a href="ch7.html">7.4.3 Architecture Support</a>
</ul>
<li><a href="ch7.html">7.5 Other Opinions</a>
</ul>
<li><a href="ch8.html">8. Conclusion</a>
</ul>
</div>
</div>
</body>
</html>

View File

@ -0,0 +1,50 @@
%hr/
%p
layout: typed
title: Simple soml performance numbers
%p
These benchmarks were made to establish places for optimization. This early on, it is clear that
performance is not outstanding, but there were still some surprises.
%ul
%li loop - program does empty loop of same size as hello
%li hello - output hello world (to dev/null) to measure kernel calls (not terminal speed)
%li itos - convert integers from 1 to 100000 to string
%li add - run integer adds by linear fibonacci of 40
%li call - exercise calling by recursive fibonacci of 20
%p
Hello, itos and add run 100_000 iterations per program invocation to remove startup overhead.
Call only has 10_000 iterations, as it is much slower, executing about 10_000 calls per invocation.
%p Gcc was used to compile the c programs on the machine; the soml executables were produced by ruby (on another machine).
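%p
  The benchmark sources themselves are not shown here; as a rough sketch (an assumption about their
  shape, not the actual soml or c code), the add and call kernels correspond to something like:
%pre
  :preserve
    /* add: linear fibonacci of 40, exercises integer adds */
    long fib_linear(int n) {
      long a = 0, b = 1;
      while (n-- > 0) { long t = a + b; a = b; b = t; }
      return a;
    }
    /* call: recursive fibonacci of 20, exercises calling */
    long fib_rec(int n) {
      return n > 1 ? fib_rec(n - 1) + fib_rec(n - 2) : n;
    }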
%h3#results Results
%p
Results were measured by a ruby script. Mean and variance were measured until the variance was low,
always under one percent.
%p
The machine was a virtual arm running on a powerbook, with performance roughly equivalent to a raspberry pi.
Results should be seen as relative, not absolute (some were scaled).
%p
%img{:alt => "Graph", :src => "bench.png"}/
%h3#discussion Discussion
%p
Surprisingly, there are areas where soml code runs faster than c. Especially in the hello example this
may not mean too much: printf does caching and has a lot of functionality, so it may not be a straight
comparison. The loop example is surprising and needs to be examined.
%p
The add example is slower because of the different memory model and the lack of optimisation for soml.
Every result of an arithmetic operation is immediately written to memory in soml, whereas c will
keep things in registers as long as it can, which in this example is the whole time. This can
be improved upon with register code optimisation, which can cut loads after writes and writes
that are overwritten before calls or jumps are made.
%p
The call overhead was expected to be larger, as a typed model is used and runtime information (like the method
name) is made available. It is actually a small price to pay for the ability to generate code at runtime,
and will of course reduce drastically with inlining.
%p
The itos result was also to be expected, as it relies both on calling and on arithmetic. Itos also
relies heavily on division by 10 which, when coded in cpu specific assembler, may easily be sped up
by a factor of 2-3.
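%p
  For reference, the usual trick hinted at above replaces the division by 10 with a multiplication
  by a precomputed reciprocal; a small c sketch of the idea (my illustration, not code from the project):
%pre
  :preserve
    /* unsigned n / 10 as multiply-and-shift: 0xCCCCCCCD is the ceiling of
       2^35 / 10, so the 64-bit product shifted right by 35 gives the exact
       quotient for every 32-bit n */
    unsigned div10(unsigned n) {
      return (unsigned)(((unsigned long long)n * 0xCCCCCCCDu) >> 35);
    }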
%p
All in all the results are encouraging, as no optimization efforts have been made yet. Of course the
most encouraging fact is that the system works and thus may be used as the basis of a dynamic
code generator, as opposed to having to interpret.

View File

@ -1,54 +0,0 @@
---
layout: typed
title: Simple soml performance numbers
---
These benchmarks were made to establish places for optimizations. This early on it is clear that
performance is not outstanding, but still there were some surprises.
- loop - program does empty loop of same size as hello
- hello - output hello world (to dev/null) to measure kernel calls (not terminal speed)
- itos - convert integers from 1 to 100000 to string
- add - run integer adds by linear fibonacci of 40
- call - exercise calling by recursive fibonacci of 20
Hello and itos and add run 100_000 iterations per program invocation to remove startup overhead.
Call only has 10000 iterations, as it is much slower, executing about 10000 calls per invocation
Gcc used to compile c on the machine. soml executables produced by ruby (on another machine)
### Results
Results were measured by a ruby script. Mean and variance was measured until variance was low,
always under one percent.
The machine was a virtual arm run on a powerbook, performance roughly equivalent to a raspberry pi.
But results should be seen as relative, not absolute (some were scaled)
![Graph](bench.png)
### Discussion
Surprisingly there are areas where soml code runs faster than c. Especially in the hello example this
may not mean too much. Printf does caching and has a lot functionality, so it may not be a straight
comparison. The loop example is surprising and needs to be examined.
The add example is slower because of the different memory model and lack of optimisation for soml.
Every result of an arithmetic operation is immediately written to memory in soml, whereas c will
keep things in registers as long as it can, which in the example is the whole time. This can
be improved upon with register code optimisation, which can cut loads after writes and writes that
that are overwritten before calls or jumps are made.
The call was expected to be larger as a typed model is used and runtime information (like the method
name) made available. It is actually a small price to pay for the ability to generate code at runtime
and will off course reduce drastically with inlining.
The itos example was also to be expected as it relies both on calling and on arithmetic. Also itos
relies heavily on division by 10, which when coded in cpu specific assembler may easily be sped up
by a factor of 2-3.
All in all the results are encouraging as no optimization efforts have been made. Off course the
most encouraging fact is that the system works and thus may be used as the basis of a dynamic
code generator, as opposed to having to interpret.

89
typed/debugger.html.haml Normal file
View File

@ -0,0 +1,89 @@
%hr/
%p
layout: typed
title: Register Level Debugger / simulator
%h2#views Views
%p
From left to right there are several views showing different data and controls.
All of the green boxes are in fact pop-up menus and can show more information.
%br/
Most of these are implemented as a single class each, with the name reflecting the part it displays.
I wrote 2 base classes that handle element generation (ie there is hardly any html involved, just elements).
%p
%img{:alt => "Debugger", :src => "https://raw.githubusercontent.com/ruby-x/rubyx-debugger/master/static/debugger.png", :width => "100%"}/
%h3#switch-view Switch view
%p
At the top left is a little control to switch files.
The files need to be in the repository, but at least one can have several and switch between
them without stopping the debugger.
%p
Parsing is the only thing that opal chokes on, so the files are parsed by a server script and the
ast is sent to the browser.
%h3#classes-view Classes View
%p
The first column on the left is a list of classes in the system. Like on all boxes one can hover
over a name to look at the class and its instance variables (recursively)
%h3#source-view Source View
%p
Next is a view of the Soml source. The Source is reconstructed from the ast as html.
Soml (RubyX object machine language) is a statically typed language,
maybe in spirit close to c++ (without the c). In the future RubyX will compile ruby to soml.
%p While stepping through the code, those parts of the code that are active get highlighted in blue.
%p
Currently stepping is done only in register instructions, which means that depending on the source
constructs it may take many steps for the cursor to move on.
%p Each step will show progress on the register level though (next view)
%h3#register-instruction-view Register Instruction view
%p
RubyX defines a register machine level which is quite close to the arm machine, but with more
sensible names. It has 16 registers (below) and an instruction set that is useful for Soml.
%p
Data movement related instructions implement an indexed get and set. There are also constant load and
integer operators, and of course branches.
Instructions print their name and the registers they use (r0-r15).
%p The next instruction to be executed is highlighted in blue. A list of previous instructions is shown.
%p One can follow the effect of instruction in the register view below.
%h3#status-view Status View
%p
The last view at the top right shows the status of the machine (the interpreter, to be precise), the
instruction count and any stdout.
%p Current controls include stepping and three speeds of running the program.
%ul
%li
Next (green button) will execute exactly one instruction when clicked. Mostly useful when
debugging the compiler, ie inspecting the generated code.
%li
Crawl (first blue button) will execute at a moderate speed. One can still follow the
logic at the register level
%li
Run (second blue button) runs the program at a higher speed where register instructions just
whizz by, but one can still follow the source view. Mainly used to verify that the source executes
as expected and also to get to a specific place in the program (in the absence of breakpoints)
%li
Wizz (third blue button) makes the program run so fast that its only useful function is to
fast forward in the code (while debugging)
%h3#register-view Register view
%p
The bottom part of the screen is taken up by the 16 registers. As we execute an object oriented
language, we show the object contents when a register holds an object (rather than an integer).
%p
The (virtual) machine only uses objects, and specifically a linked list of Message objects to
make calls. The current message is always in register 0 (analogous to a stack pointer).
All other registers are scratch for statement use.
%p
In Soml expressions compile to the register that holds the expression's value and statements may use
all registers and may not rely on anything other than the message in register 0.
%p The Register view is now greatly improved, especially in its dynamic features:
%ul
%li when the contents update the register obviously updates
%li when the object that the register holds updates, the new value is shown immediately
%li
hovering over a variable will
%strong expand that variable
\.
%li the hovering works recursively, so it is possible to drill down into objects for several levels
%p
The last feature of inspecting objects is shown in the screenshot. This makes it possible
to very quickly verify the program's behaviour. As it is a pure object system, all data is in
objects, and all objects can be inspected.

View File

@ -1,97 +0,0 @@
---
layout: typed
title: Register Level Debugger / simulator
---
## Views
From left to right there are several views showing different data and controls.
All of the green boxes are in fact pop-up menus and can show more information.
Most of these are implemented as a single class with the name reflecting what part.
I wrote 2 base classes that handle element generation (ie there is hardly any html involved, just elements)
![Debugger](https://raw.githubusercontent.com/ruby-x/rubyx-debugger/master/static/debugger.png){: width="100%"}
### Switch view
Top left at the top is a little control to switch files.
The files need to be in the repository, but at least one can have several and switch between
them without stopping the debugger.
Parsing is the only thing that opal chokes on, so the files are parsed by a server script and the
ast is sent to the browser.
### Classes View
The first column on the left is a list of classes in the system. Like on all boxes one can hover
over a name to look at the class and its instance variables (recursively)
### Source View
Next is a view of the Soml source. The Source is reconstructed from the ast as html.
Soml (RubyX object machine language) is a statically typed language,
maybe in spirit close to c++ (without the c). In the future RubyX will compile ruby to soml.
While stepping through the code, those parts of the code that are active get highlighted in blue.
Currently stepping is done only in register instructions, which means that depending on the source
constructs it may take many steps for the cursor to move on.
Each step will show progress on the register level though (next view)
### Register Instruction view
RubyX defines a register machine level which is quite close to the arm machine, but with more
sensible names. It has 16 registers (below) and an instruction set that is useful for Soml.
Data movement related instructions implement an indexed get and set. There are also constant load and
integer operators and of course branches.
Instructions print their name and used registers r0-r15.
The next instruction to be executed is highlighted in blue. A list of previous instructions is shown.
One can follow the effect of instruction in the register view below.
### Status View
The last view at the top right shows the status of the machine (interpreter to be precise), the
instruction count and any stdout
Current controls include stepping and three speeds of running the program.
- Next (green button) will execute exactly one instruction when clicked. Mostly useful when
debugging the compiler, ie inspecting the generated code.
- Crawl (first blue button) will execute at a moderate speed. One can still follow the
logic at the register level
- Run (second blue button) runs the program at a higher speed where register instruction just
whizz by, but one can still follow the source view. Mainly used to verify that the source executes
as expected and also to get to a specific place in the program (in the absence of breakpoints)
- Wizz (third blue button) makes the program run so fast that its only useful function is to
fast forward in the code (while debugging)
### Register view
The bottom part of the screen is taken up by the 16 registers. As we execute an object oriented
language, we show the object contents if it is an object (not an integer) in a register.
The (virtual) machine only uses objects, and specifically a linked list of Message objects to
make calls. The current message is always in register 0 (analogous to a stack pointer).
All other registers are scratch for statement use.
In Soml expressions compile to the register that holds the expression's value and statements may use
all registers and may not rely on anything other than the message in register 0.
The Register view is now greatly improved, especially in its dynamic features:
- when the contents update the register obviously updates
- when the object that the register holds updates, the new value is shown immediately
- hovering over a variable will **expand that variable** .
- the hovering works recursively, so it is possible to drill down into objects for several levels
The last feature of inspecting objects is shown in the screenshot. This makes it possible
to very quickly verify the program's behaviour. As it is a pure object system, all data is in
objects, and all objects can be inspected.

36
typed/parfait.html.haml Normal file
View File

@ -0,0 +1,36 @@
%hr/
%p
layout: typed
title: Parfait, a minimal runtime
%h3#type-and-class Type and Class
%p
Each object has a type that describes the instance variables and basic types of the object.
Types also reference the class they implement.
Type objects are unique and constant, and may not be changed over their lifetime.
When a field is added to a class, a new Type is created. For a given class and combination
of instance names and basic types, only one instance ever exists describing that type (a bit
similar to symbols).
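%p
A minimal sketch of that uniqueness rule (illustrative only, the names are not Parfait's actual api): types are interned by class and attribute layout, so asking twice for the same combination yields the same object.
%pre
%code
:preserve
class Type
  @cache = {}
  def self.for(object_class, attributes)
    key = [object_class, attributes.to_a.sort]
    @cache[key] ||= new(object_class, attributes)
  end
  attr_reader :object_class, :attributes
  def initialize(object_class, attributes)
    @object_class = object_class
    @attributes = attributes.freeze    # e.g. { name: :Word, methods: :List }
  end
  private_class_method :new
end

t1 = Type.for(:Class, name: :Word, methods: :List)
t2 = Type.for(:Class, methods: :List, name: :Word)
t1.equal?(t2)   # => true, only one instance ever exists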
%p
A Class describes a set of objects that respond to the same methods (the methods source is stored
in the RubyMethod class).
A Type describes a set of objects that have the same instance variables.
%h3#method-message-and-frame Method, Message and Frame
%p
The TypedMethod class describes a callable method. It carries a name, argument and local variable
types, and several descriptions of the code.
The typed ast is kept for debugging, the register model instruction stream for optimisation
and further processing and finally the cpu specific binary
represents the executable code.
%p
When TypedMethods are invoked, a message object (an instance of the Message class) is populated.
Message objects are created at compile time and form a linked list.
The data in the Message holds the receiver, return addresses, arguments and a frame.
Frames are also created at compile time and just reused at runtime.
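%p
A rough sketch of that shape (field names assumed from the description above, not authoritative): messages are plain objects created up front and linked, so a call only fills in fields, it never allocates.
%pre
%code
:preserve
Message = Struct.new(:next_message, :caller, :receiver,
                     :return_address, :arguments, :frame)

# "compile time": build a fixed chain of messages to be reused at runtime
first = Message.new
20.times.inject(first) { |prev, _| prev.next_message = Message.new }

# "runtime": a call only populates the next message in the chain
call = first.next_message
call.caller = first
call.arguments = [42]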
%h3#space-and-support Space and support
%p
The single instance of Space holds a list of all Types and all Classes, which in turn hold
the methods.
Also the space holds messages and will hold memory management objects like pages.
%p Words represent short immutable text and other word processing (buffers, text) is still tbd.
%p Lists (aka Array) are number indexed, starting at one, and dictionaries (aka Hash) are mappings from words to objects.
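%p
As a small illustration of the one-based indexing (a sketch only, not Parfait's actual code):
%pre
%code
:preserve
class List
  def initialize
    @data = []
  end
  def set(index, value)
    raise "indexes start at one" if index < 1
    @data[index - 1] = value
  end
  def get(index)
    raise "indexes start at one" if index < 1
    @data[index - 1]
  end
  def get_length
    @data.length
  end
end

list = List.new
list.set(1, "first")
list.get(1)          # => "first"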

View File

@ -1,41 +0,0 @@
---
layout: typed
title: Parfait, a minimal runtime
---
### Type and Class
Each object has a type that describes the instance variables and basic types of the object.
Types also reference the class they implement.
Type objects are unique and constant, and may not be changed over their lifetime.
When a field is added to a class, a new Type is created. For a given class and combination
of instance names and basic types, only one instance ever exists describing that type (a bit
similar to symbols)
A Class describes a set of objects that respond to the same methods (the methods source is stored
in the RubyMethod class).
A Type describes a set of objects that have the same instance variables.
### Method, Message and Frame
The TypedMethod class describes a callable method. It carries a name, argument and local variable
types, and several descriptions of the code.
The typed ast is kept for debugging, the register model instruction stream for optimisation
and further processing and finally the cpu specific binary
represents the executable code.
When TypedMethods are invoked, a message object (instance of Message class) is populated.
Message objects are created at compile time and form a linked list.
The data in the Message holds the receiver, return addresses, arguments and a frame.
Frames are also created at compile time and just reused at runtime.
### Space and support
The single instance of Space holds a list of all Types and all Classes, which in turn hold
the methods.
Also the space holds messages and will hold memory management objects like pages.
Words represent short immutable text and other word processing (buffers, text) is still tbd.
Lists (aka Array) are number indexed, starting at one, and dictionaries (aka Hash) are mappings from words to objects.

191
typed/syntax.html.haml Normal file
View File

@ -0,0 +1,191 @@
%hr/
%p
layout: typed
title: Soml Syntax
%h4#top-level-class-and-methods Top level Class and methods
%p The top level declarations in a file may only be class definitions
%pre
%code
:preserve
class Dictionary < Object
int add(Object o)
... statements
end
end
%p
The class hierarchy is explained
= succeed "," do
%a{:href => "parfait.html"} here
but you can leave out the superclass and Object will be assumed.
%p
Methods must be typed, both arguments and return. Generally class names serve as types, but “int” can
be used as a shortcut for Integer.
%p
Code may not be outside method definitions, like in ruby. A compiled program starts at the builtin
method
= succeed "," do
%strong __init__
which does the initial setup and then jumps to
= succeed "." do
%strong Space.main
%p
Classes are represented by class objects (instances of class Class to be precise) and methods by
Method objects, so all information is available at runtime.
%h4#expressions Expressions
%p
Soml distinguishes between expressions and statements. Expressions have a value, statements perform an
action. Both are compiled to Register level instructions for the current method. Generally speaking
expressions store their value in a register and statements store those values elsewhere, possibly
after operating on them.
%p The subsections below correspond roughly to the parser's rule names.
%p
%strong Basic expressions
are numbers (integer or float), strings or names, either variable, argument,
field or class names. (normal details applicable). Special names include self (the current
receiver), and message (the currently executed method frame). These all resolve to a register
with contents.
%pre
%code
:preserve
23
"hi there"
argument_name
Object
%p
A
%strong field access
resolves to the field's value at the time. Fields must be defined by
field definitions, and are basically instance variables, but not hidden (see below).
The example below shows how to define local variables at the same time. Notice chaining, both for
field access and call, is not allowed.
%pre
%code
:preserve
Type l = self.type
Class c = l.object_class
Word n = c.name
%p
A
%strong Call expression
is a method call that resolves to the method's return value. If no receiver is
specified, self (the current receiver) is used. The receiver may be any of the basic expressions
above, so also class instances. The receiver type is known at compile time, as are all argument
types, so the class of the receiver is searched for a matching method. Many methods of the same
name may exist, but to issue a call, an exact match for the arguments must be found.
%pre
%code
:preserve
Class c = self.get_class()
c.get_super_class()
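%p
The exact-match rule could be sketched like this (illustrative names, not the compiler's actual classes): several methods may share a name, but the call binds only to the one whose argument types match exactly.
%pre
%code
:preserve
TypedMethod = Struct.new(:name, :argument_types)

def resolve(methods, name, argument_types)
  methods.find { |m| m.name == name && m.argument_types == argument_types } or
    raise "no exact match for " + name.to_s
end

methods = [TypedMethod.new(:add, [:Integer]), TypedMethod.new(:add, [:Object])]
resolve(methods, :add, [:Object])    # => the second definition
resolve(methods, :add, [:Word])      # => raises, there is no implicit conversion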
%p
An
%strong operator expression
is a binary expression, with either of the other expressions as left
and right operand, and an operator symbol between them. Operand types must be integer.
The symbols allowed are normal arithmetic and logical operations.
%pre
%code
:preserve
a + b
counter | 255
mask >> shift
%p
Operator expressions may be used in assignments and conditions, but not in calls, where the result
would have to be assigned beforehand. This is one of those cases where soml's low level approach
shines through, as soml has no auto-generated temporary variables.
%h4#statements Statements
%p
We have seen the top level statements above. In methods the most interesting statements relate to
flow control and specifically how conditionals are expressed. This differs somewhat from other
languages, in that the condition is expressed explicitly (not implicitly like in c or ruby).
This lets the programmer express more precisely what is tested, and also opens an extensible
framework for more tests than available in other languages. Specifically overflow may be tested in
soml, without dropping down to assembler.
%p
An
%strong if statement
is started with the keyword if_ and then contains the branch type. The branch
type may be
= succeed "." do
%em plus, minus, zero, nonzero or overflow
The condition must be in brackets and can be any expression.
%em If
may be continued with an
= succeed "," do
%em else
but does not have to be, and is ended with
= succeed "." do
%em end
%pre
%code
:preserve
if_zero(a - 5)
....
else
....
end
%p
A
%strong while statement
is very much like an if, with of course the normal loop semantics, and
without the possible else.
%pre
%code
:preserve
while_plus( counter )
....
end
%p
A
%strong return statement
returns a value from the current function. There are no void functions.
%pre
%code
:preserve
return 5
%p
A
%strong field definition
declares an instance variable on an object. It starts with the keyword
field, must be in class (not method) scope and may not be assigned to.
%pre
%code
:preserve
class Class < Object
field List instance_methods
field Type object_type
field Word name
...
end
%p
A
%strong local variable definition
declares, and possibly assigns to, a local variable. Local variables
are stored in frame objects, in fact they are instance variables of the current frame object.
When resolving a name, the compiler checks argument names first, and then local variables.
%pre
%code
:preserve
int counter = 0
%p
Any of the expressions may be assigned to the variable at the time of definition. After a variable is
defined it may be assigned to with an
%strong assignment statement
any number of times. The assignment
is like an assignment during definition, without the leading type.
%pre
%code
:preserve
counter = 0
%p Any of the expressions, basic, call, operator, field access, may be assigned.
%h3#code-generation-and-scope Code generation and scope
%p
Compiling generates two results simultaneously. The more obvious is code for a function, but also an
object structure of classes etc that capture the declarations. To understand the code part better
the register abstraction should be studied, and to understand the object structure the runtime.
%p
The register machine abstraction is very simple, and so is the code generation, in favour of a simple
model. Especially in the area of register assignment, there is no magic and only a few simple rules.
%p
The main one of those concerns main memory access ordering and states that object memory must
be consistent at the end of the statement. Since there is only object memory in soml, this
concerns all assignments, since all variables are either named or indexed members of objects.
Also local variables are just members of the frame.
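%p
To illustrate the rule (instruction names here are only illustrative, not the project's actual classes): a statement like counter = counter + 1 loads from object memory, works in registers, and writes back before the statement ends.
%pre
%code
:preserve
instructions = [
  [:slot_to_reg, :r1, :r0, :frame],     # load the frame object from the message (r0)
  [:slot_to_reg, :r2, :r1, :counter],   # load the local "counter" from the frame
  [:int_operator, :add, :r2, 1],        # arithmetic happens in registers only
  [:reg_to_slot, :r2, :r1, :counter],   # object memory is consistent again here
]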
%p
This obviously does leave room for optimisations as preliminary benchmarks show. But benchmarks also
show that it is not such a big issue and much more benefit can be achieved by inlining.

View File

@ -1,148 +0,0 @@
---
layout: typed
title: Soml Syntax
---
#### Top level Class and methods
The top level declarations in a file may only be class definitions
class Dictionary < Object
int add(Object o)
... statements
end
end
The class hierarchy is explained in [here](parfait.html), but you can leave out the superclass
and Object will be assumed.
Methods must be typed, both arguments and return. Generally class names serve as types, but "int" can
be used as a shortcut for Integer.
Code may not be outside method definitions, like in ruby. A compiled program starts at the builtin
method __init__, which does the initial setup and then jumps to **Space.main**
Classes are represented by class objects (instances of class Class to be precise) and methods by
Method objects, so all information is available at runtime.
#### Expressions
Soml distinguishes between expressions and statements. Expressions have value, statements perform an
action. Both are compiled to Register level instructions for the current method. Generally speaking
expressions store their value in a register and statements store those values elsewhere, possibly
after operating on them.
The subsections below correspond roughly to the parser's rule names.
**Basic expressions** are numbers (integer or float), strings or names, either variable, argument,
field or class names. (normal details applicable). Special names include self (the current
receiver), and message (the currently executed method frame). These all resolve to a register
with contents.
23
"hi there"
argument_name
Object
A **field access** resolves to the field's value at the time. Fields must be defined by
field definitions, and are basically instance variables, but not hidden (see below).
The example below shows how to define local variables at the same time. Notice chaining, both for
field access and call, is not allowed.
Type l = self.type
Class c = l.object_class
Word n = c.name
A **Call expression** is a method call that resolves to the method's return value. If no receiver is
specified, self (the current receiver) is used. The receiver may be any of the basic expressions
above, so also class instances. The receiver type is known at compile time, as are all argument
types, so the class of the receiver is searched for a matching method. Many methods of the same
name may exist, but to issue a call, an exact match for the arguments must be found.
Class c = self.get_class()
c.get_super_class()
An **operator expression** is a binary expression, with either of the other expressions as left
and right operand, and an operator symbol between them. Operand types must be integer.
The symbols allowed are normal arithmetic and logical operations.
a + b
counter | 255
mask >> shift
Operator expressions may be used in assignments and conditions, but not in calls, where the result
would have to be assigned beforehand. This is one of those cases where soml's low level approach
shines through, as soml has no auto-generated temporary variables.
#### Statements
We have seen the top level statements above. In methods the most interesting statements relate to
flow control and specifically how conditionals are expressed. This differs somewhat from other
languages, in that the condition is expressed explicitly (not implicitly like in c or ruby).
This lets the programmer express more precisely what is tested, and also opens an extensible
framework for more tests than available in other languages. Specifically overflow may be tested in
soml, without dropping down to assembler.
An **if statement** is started with the keyword if_ and then contains the branch type. The branch
type may be *plus, minus, zero, nonzero or overflow*. The condition must be in brackets and can be
any expression. *If* may be continued with an *else*, but doesn't have to be, and is ended with *end*
if_zero(a - 5)
....
else
....
end
A **while statement** is very much like an if, with of course the normal loop semantics, and
without the possible else.
while_plus( counter )
....
end
A **return statement** returns a value from the current function. There are no void functions.
return 5
A **field definition** declares an instance variable on an object. It starts with the keyword
field, must be in class (not method) scope and may not be assigned to.
class Class < Object
field List instance_methods
field Type object_type
field Word name
...
end
A **local variable definition** declares, and possibly assigns to, a local variable. Local variables
are stored in frame objects, in fact they are instance variables of the current frame object.
When resolving a name, the compiler checks argument names first, and then local variables.
int counter = 0
Any of the expressions may be assigned to the variable at the time of definition. After a variable is
defined it may be assigned to with an **assignment statement** any number of times. The assignment
is like an assignment during definition, without the leading type.
counter = 0
Any of the expressions, basic, call, operator, field access, may be assigned.
### Code generation and scope
Compiling generates two results simultaneously. The more obvious is code for a function, but also an
object structure of classes etc that capture the declarations. To understand the code part better
the register abstraction should be studied, and to understand the object structure the runtime.
The register machine abstraction is very simple, and so is the code generation, in favour of a simple
model. Especially in the area of register assignment, there is no magic and only a few simple rules.
The main one of those concerns main memory access ordering and states that object memory must
be consistent at the end of the statement. Since there is only object memory in soml, this
concerns all assignments, since all variables are either named or indexed members of objects.
Also local variables are just members of the frame.
This obviously does leave room for optimisations as preliminary benchmarks show. But benchmarks also
show that it is not such a big issue and much more benefit can be achieved by inlining.

57
typed/typed.html.haml Normal file
View File

@ -0,0 +1,57 @@
%hr/
%p
layout: typed
title: Typed intermediate representation
%h3#intermediate-representation Intermediate representation
%p
Compilers use different intermediate representations to go from the source code to a binary,
which would otherwise be too big a step.
%p
The
%strong typed
intermediate representation is a strongly typed layer, between the dynamically typed
ruby above, and the register machine below. One can think of it as a mix between c and c++,
minus the syntax aspect. While in 2015 this layer existed as a language (see soml-parser), it
is now a tree representation only.
%h4#object-oriented-to-the-core-including-calling-convention Object oriented to the core, including calling convention
%p
Types are modeled by the class Type and carry information about instance variable names
and their basic type.
%em Every object
stores a reference
to its type, and while
= succeed "," do
%strong types are immutable
the reference may change. The basic types every object is made up of include at least
integer and reference (pointer).
%p
The object model, ie the basic properties of objects that the system relies on, is quite simple
and explained in the runtime section. It involves a single reference per object.
Also the object memory model is kept quite simple in that object sizes are always small multiples
of the cache size of the hardware machine.
We use object encapsulation to build up larger looking objects from these basic blocks.
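%p
A hedged sketch of that size rule (the block size here is an assumption, not the project's actual number):
%pre
%code
:preserve
CACHE_BLOCK_SLOTS = 8   # assumed block size in machine words

# round the slot count (instance variables plus the type reference)
# up to the next multiple of the block size
def padded_length(instance_variables)
  slots = instance_variables + 1
  ((slots + CACHE_BLOCK_SLOTS - 1) / CACHE_BLOCK_SLOTS) * CACHE_BLOCK_SLOTS
end

padded_length(3)    # => 8
padded_length(9)    # => 16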
%p
The calling convention is also object oriented, not stack based*. Message objects are used to
define the data needed for invocation. They carry arguments, a frame and return address.
The return address is pre-calculated and determined by the caller, so
a method invocation may thus be made to return to an entirely different location.
*(A stack, as used in c, is not typed, not object oriented, and as such a source of problems)
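%p
A sketch of such an invocation (field names assumed, not the actual Parfait api): the caller fills the next message and decides itself where execution continues afterwards.
%pre
%code
:preserve
Message = Struct.new(:next_message, :caller, :receiver,
                     :arguments, :return_address)

def issue_call(current, receiver, arguments, continue_at)
  msg = current.next_message
  msg.caller = current
  msg.receiver = receiver
  msg.arguments = arguments
  msg.return_address = continue_at   # pre-calculated by the caller, may point anywhere
  msg                                # the callee runs with msg as its current message
end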
%p
There is no non- object based memory at all. The only global constants are instances of
classes that can be accessed by writing the class name in ruby source.
%h4#runtime--parfait Runtime / Parfait
%p
The typed representation layer depends on the higher layer to actually determine and instantiate
types (type objects, or objects of class Type). This includes method arguments and local variables.
%p
The typed layer is mainly concerned with defining TypedMethods, for which arguments and local variables
have a specified type (like in c). Basic Type names are the class names they represent,
but the “int” may be used for brevity
instead of Integer.
%p
The runtime, Parfait, is kept
to a minimum, currently around 15 classes, described in detail
= succeed "." do
%a{:href => "parfait.html"} here
%p
Historically Parfait has been coded in ruby, as it was first needed in the compiler.
This had the additional benefit of providing solid test cases for the functionality.

View File

@ -1,53 +0,0 @@
---
layout: typed
title: Typed intermediate representation
---
### Intermediate representation
Compilers use different intermediate representations to go from the source code to a binary,
which would otherwise be too big a step.
The **typed** intermediate representation is a strongly typed layer, between the dynamically typed
ruby above, and the register machine below. One can think of it as a mix between c and c++,
minus the syntax aspect. While in 2015, this layer existed as a language, (see soml-parser), it
is now a tree representation only.
#### Object oriented to the core, including calling convention
Types are modeled by the class Type and carry information about instance variable names
and their basic type. *Every object* stores a reference
to its type, and while **types are immutable**, the reference may change. The basic types every
object is made up of include at least integer and reference (pointer).
The object model, ie the basic properties of objects that the system relies on, is quite simple
and explained in the runtime section. It involves a single reference per object.
Also the object memory model is kept quite simple in that object sizes are always small multiples
of the cache size of the hardware machine.
We use object encapsulation to build up larger looking objects from these basic blocks.
The calling convention is also object oriented, not stack based*. Message objects are used to
define the data needed for invocation. They carry arguments, a frame and return address.
The return address is pre-calculated and determined by the caller, so
a method invocation may thus be made to return to an entirely different location.
\*(A stack, as used in c, is not typed, not object oriented, and as such a source of problems)
There is no non- object based memory at all. The only global constants are instances of
classes that can be accessed by writing the class name in ruby source.
#### Runtime / Parfait
The typed representation layer depends on the higher layer to actually determine and instantiate
types (type objects, or objects of class Type). This includes method arguments and local variables.
The typed layer is mainly concerned with defining TypedMethods, for which arguments and local variables
have a specified type (like in c). Basic Type names are the class names they represent,
but the "int" may be used for brevity
instead of Integer.
The runtime, Parfait, is kept
to a minimum, currently around 15 classes, described in detail [here](parfait.html).
Historically Parfait has been coded in ruby, as it was first needed in the compiler.
This had the additional benefit of providing solid test cases for the functionality.