genode/doc/architecture.txt

790 lines
36 KiB
Plaintext

====================================
Design of the Genode OS Architecture
====================================
Norman Feske and Christian Helmuth
Abstract
########
In the software world, high complexity of a problem solution comes along with a
high risk for bugs and vulnerabilities.
This correlation is particularly perturbing for todays commodity operating
systems with their tremendous complexity.
The numerous approaches to increase the user's confidence in the correct
functioning of software comprise exhaustive tests, code auditing, static code
analysis, and formal verification.
Such quality-assurance measures are either rather shallow or they scale badly
with increasing complexity.
The operating-system design presented in this document focuses on the root of the
problem by providing means to minimize the underlying system complexity for
each security-sensitive application individually.
On the other hand, we want to enable multiple applications to execute on the
system at the same time whereas each application may have different functional
requirements from the operating system.
Todays operating systems provide a functional superset of the requirements of
all applications and thus, violate the principle of minimalism for each single
application.
We resolve the conflict between the principle of minimalism and the versatility
of the operating system by decomposing the operating system into small
components and by providing a way to execute those components isolated and
independent from each other.
Components can be device drivers, protocol stacks such as file systems and
network stacks, native applications, and containers for executing legacy
software.
Each application depends only on the functionality of a bounded set of
components that we call _application-specific_trusted_computing_base_(TCB)_.
If the TCBs of two applications are executed completely _isolated_ and
_independent_ from each other, we consider both TCBs as minimal.
In practice however, we want to share physical resources between multiple applications
without sacrificing their independence.
Therefore, the operating-system design has to enable the assignment of physical
resources to each application and its TCB to maintain independence from other
applications.
Furthermore, rather than living in complete isolation, components require to
communicate with each other to cooperate.
The operating-system design must enable components to create other components
and get them to know each other while maintaining isolation from uninvolved
parts of the system.
First, we narrow our goals and pose our mayor challenges in Section [Goals and Challenges].
Section [Interfaces and Mechanisms] introduces our fundamental concepts and
protocols that apply to each component in the system.
In Section [Core - the root of the process tree], we present the one component
that is mandatory part of each TCB, enables the bootstrapping of the system,
and provides abstractions for the lowest-level resources.
We exercise the composition of the presented mechanisms by the means of process
creation in Section [Process creation].
;Section [Framework infrastructure]
Goals and Challenges
####################
The Genode architecture is designed to accommodate the following types
of components in a secure manner concurrently on one machine:
:Device drivers:
Device drivers translate the facilities of raw physical devices to
device-class-specific interfaces to be used by other components.
They contain no security policies and provide their services
to only one client component per device.
:Services that multiplex resources:
To make one physical resource (e.g., a device) usable by multiple
components at the same time, the physical resource must be translated
to multiple virtual resources. For example, a
frame buffer provided by a device driver can only be used by one
client at the same time. A window system multiplexes this physical
resource to make it available to multiple clients. Other examples
are an audio mixer or a virtual network hub.
In contrast to a device driver, a _resource multiplexer_ deals with multiple
clients and therefore, plays a crucial role for maintaining the independence
and isolation of its clients from each other.
:Protocol stacks:
Protocol stacks translate low-level protocols to a higher and more applicable
level.
For example, a file system translates a block-device protocol to a file
abstraction, a TCP/IP stack translates network packets to a socket
abstraction, or a widget set maps high-level GUI elements to pixels.
Compared to resource multiplexers, protocol stacks are typically an
order of magnitude more complex.
Protocol stacks may also act as resource multiplexers. In this case however,
high complexity puts the independence and isolation of multiple
clients at a high risk.
Therefore, our design should enable the instantiation of protocol stacks per
application.
For example, instead of letting a security-sensitive application share one
TCP/IP stack with multiple other (untrusted) applications, it could use a
dedicated instance of a TCP/IP stack to increase its independence and
isolation from the other applications.
:Containers for executing legacy software:
A _legacy container_ provides an environment for the execution of existing
legacy software. This can be achieved by the means of a virtual machine
(e.g., a Java VM, a virtual PC), a compatible programming API (e.g., POSIX,
Qt), a language environment (e.g., LISP), or a script interpreter.
In the majority of cases, we regard legacy software as an untrusted black box.
One particular example for legacy software are untrusted legacy device drivers.
In this case, the container has to protect the physical hardware from
potentially malicious device accesses by the untrusted driver.
Legacy software may be extremely complex and resource demanding, for example
the Firefox web browser executed on top of the X window system and the Linux
kernel inside a virtualized PC.
In this case, the legacy container may locally implement sophisticated
resource-management techniques such as virtual memory.
:Small custom security-sensitive applications:
Alongside legacy software, small custom applications implement crucial
security-sensitive functionality.
In contrast to legacy software, which we mostly regard as untrusted anyway,
a low TCB complexity for custom applications is of extreme importance.
Given the special liability of such an application, it is very carefully
designed to have low complexity and require as little infrastructure as
possible.
A typical example is a cryptographic component that protects credentials
of the user.
Such an application does not require swapping (virtual memory), a POSIX API,
or a complete C library.
Instead, the main objectives of such an application are to avoid as much as
possible code from being included in its TCB and to keep its requirements at
a minimum.
Our design must be able to create and destroy subsystems that are composed of
multiple such components.
The _isolation_ requirement as stated in the introduction raises the
question of how to organize the locality of name spaces and how to distribute
access from components to other components within the system.
The _independence_ requirement demands the assignment of physical resources
to components such that different applications do not interfere.
Instead of managing access control and physical resources from a central
place, we desire a distributed way for applying policy for trading and revocating
resources and for delegating rights.
Interfaces and Mechanisms
#########################
The system is structured as a tree.
The nodes of the tree are processes.
A node, for which sub-nodes exist, is called the _parent_ of these sub-nodes
(_children_).
The parent creates children out of its own resources and defines
their execution environment.
Each process can announce services to its parent.
The parent, in turn, can mediate such a service to its other children.
When a child is created, its parent provides the initial contact to the
outer world via the following interface:
! void exit(int exit_value);
!
! Session_capability session(String service_name,
! String args);
!
! void close(Session_capability session_cap);
!
! int announce(String service_name,
! Root_capability service_root_cap);
!
! int transfer_quota(Session_capability to_session_cap,
! String amount);
:'exit': is called by a child to request its own termination.
:'session': is called by a child to request a connection to the specified
service as known by its parent whereas 'service_name' is the name
of the desired service _interface_.
The way of resolving or even denying a 'session' request depends on
the policy of the parent.
The 'args' parameter contains construction arguments for the session
to be created.
In particular, 'args' contains a specification of resources that the
process is willing to donate to the server during the session lifetime.
:'close': is called by a child to inform its parent that the specified
session is no longer needed.
The parent should close the session and hand back donated
resources to the child.
:'announce': is called by a child to register a locally implemented
service at its parent. Hence, this child is a server.
:'transfer_quota': enables a child to extend its resource donation
to the server that provides the specified session.
We provide a detailed description and motivation for the different functions
in Sections [Servers] and [Quota].
Servers
=======
Each process may implement services and announce them via the 'announce'
function of the parent interface.
When announcing a service, the server specifies a _root_ capability for
the implemented service.
The interface of the root capability enables the parent to create, configure,
and close sessions of the service:
! Session_capability session(String args);
!
! int transfer_quota(Session_capability to_session_cap,
! String amount);
!
! void close(Session_capability session_cap);
[image announce 60%]
Announcement of a service by a child (server).
Colored circles at the edge of a component represent remotely accessible
objects. Small circles inside a component represent a reference (capability)
to a remote object. A cross-component reference to a remote object is
illustrated by a dashed arrow. An opaque arrow symbolizes a RPC call/return.
Figure [announce] illustrates an announcement of a service.
Initially, each child has a capability to its parent.
After Child1 announces its service "Service", its parent knows the
root capability of this service under the local name 'srv1_r' and stores
the root capability with the announced service name in its _root_list_.
The root capability is intended to be used and kept by the parent only.
[image request 60%]
Service request by a client.
When a parent calls the 'session' function of the root interface of a server
child, the server creates a new client session and returns the corresponding
'client_session' capability.
This session capability provides the actual service-specific interface.
The parent can use it directly or it may pass it to other processes, in
particular to another child that requested the session.
In Figure [request], Child2 initiates the creation of a "Service" session
by a 'session' call at its parent capability (1).
The parent uses its root list to look up the root capability that matches the
service name "Service" (2) and calls the 'session' function at the
server (3).
Child1 being the server creates a new session ('session1') and returns the
session capability as result of the 'session' call (4).
The parent now knows the new session under the local name 'srv1_s1' (5) and
passes the session capability as return value of Child2's initial 'session'
call (6).
The parent maintains a _session_list_, which stores the interrelation between
children and their created sessions.
Now, Child2 has a direct communication channel to 'session1' provided by
the server (Child1) (7).
The 'close' function of the root interface instructs the server to
destroy the specified session and to release all session-specific resources.
; Mittels 'set_quota' kann der Parent einen Dienst anweisen, die Ressourcennutzung
; für eine angegebene 'client_session' zu begrenzen. Eine nähere Beschreibung des
; Ressourcen-Accountings erfolgt in Kapitel [Quota].
[image twolevels 80%]
Announcement and request of a service in a subsystem.
For simplicity, parent capabilities are not displayed.
Even though the prior examples involved only one parent,
the announce-request mechanism can be used recursively for tree
structures of any depth and thus allow for partitioning
the system into subsystems that can cooperate with each other whereas
parents are always in complete control over the communication
and resource usage of their children (and their subsystems).
Figure [twolevels] depicts a nested subsystem on the left.
Child1 announces its service named "Service" at its parent that, in turn,
announces a service named "Service" at the Grandparent.
The service names do not need to be identical.
Their meaning spans to their immediate parent only and there
may be a name remapping on each hierarchy level.
Each parent can decide itself whether to further announce
services of their children to the outer world or not.
The parent can announce Child1's service to the grandparent
by creating a new root capability to a local service that forwards
session-creation and closing requests to Child1.
Both Parent and Grandparent keep their local root lists.
In a second step, Parent2 initiates the creation of a session to
the service by issuing a 'session' request at the Grandparent (1).
Grandparent uses its root list to look up the service-providing child (from
Grandparent's local view) Parent1 (2).
Parent1 in turn, implements the service not by itself but delegates
the 'session' request to Child1 by calling the 'session' function
of the actual "Service" root interface (3).
The session capability, created by Child1 (4), can now be passed to Parent2
as return value of nested 'session' calls (5, 6).
Each involved node keeps the local knowledge about the created session
such that later, the session can be closed in the same nested fashion.
Quota
=====
Each process that provides services to other processes consumes resources on
behalf of it clients.
Such a server requires memory to maintain session-specific state, processing
time to perform the actual service function, and eventually further system
resources (e.g., bus bandwidth) dependent on client requests.
To avoid denial-of-service problems, a server must not allocate such
resources from its own budget but let the client pay.
Therefore, a mechanism for donating resource quotas from the client to the
server is required.
Both client and server may be arbitrary nodes in the process tree.
In the following, we examine the trading of resource quotas within
the recursive system structure using memory as an example.
When creating a child, the parent assigns a part of its own memory quota
to the new child.
During the lifetime of the child, the parent can further transfer
quota back and forth between the child's and its own account.
Because the parent creates its children out of its own resources,
it has a natural interest to correctly manage child quotas.
When a child requests a session to a service, it can bind a part
of its quota to the new session by specifying a resource donation
as an argument.
When receiving a session request, the parent has to distinct
three different cases, dependent on where the corresponding server
resides:
:Parent provides service:
If the parent provides the requested services by itself,
it transfers the donated amount of memory quota from the
requesting child's account to its own account to compensate
the session-specific memory allocation on behalf of its own
child.
:Server is another child:
If there exists a matching entry in the parent's root list,
the requested service is provided by another child (or a
node within the child subsystem). In this case, the parent
transfers the donated memory quota from the requesting child
to the service-providing child.
:Delegation to grandparent:
The parent may decide to delegate the session request to
its own parent because the requested service is provided by
a lower node of the process tree.
Thus, the parent will request a session on behalf of its child.
The grandparent neither knows nor cares about the actual
origin of the request and will simply decrease the memory
quota of the parent.
For this reason, the parent transfers the donated memory
quota from the requesting child to its own account before
calling the grandparent.
This algorithm works recursively.
Once, the server receives the session request, it checks if
the donated memory quota suffices for storing the session-specific
data and, on success, creates the session.
If the initial quota donation turns out to be too scarce during
the lifetime of a session, the client may make further donations
via the 'transfer_quota' function of the parent interface that
works analogously.
If a child requests to close a session, the parent must distinguish
the three cases as above.
Once, the server receives the session-close request from its parent,
it is responsible to release all resources that were used for this session.
After the server releases the session-specific resources, the
server's quota can be decreased to the prior state.
However, an ill-behaving server may fail to release those resources by malice
or caused by a bug.
If the misbehaving service was provided by the parent himself,
it has the full authority to not hand back session-quota to
its child.
If the misbehaving service was provided by the grandparent,
the parent (and its whole subsystem) has to subordinate.
If, however, the service was provided by another child and the
child refuses to release resources, decreasing its quota after
closing the session will fail.
It is up to the policy of the parent to handle such a failure either by
punishing it (e.g., killing the misbehaving server) or by granting more of its
own quota.
Generally, misbehavior is against the server's own interests and
each server would obey the parent's 'close' request to avoid intervention.
Successive policy management
============================
For supporting a high variety of security policies for access control, we
require a way to bind properties and restrictions to sessions. For example,
a file service may want to restrict the access to files according to an
access-control policy that is specific for each client session.
On session creation, the 'session' call takes an 'args' argument that can be
used for that purpose. It is a list of tag-value pairs describing the session
properties. By convention, the list is ordered by attribute priority starting
with the most important property.
The server uses these 'args' as construction arguments for the new
session and enforces the security policy as expressed by 'args' accordingly.
Whereas the client defines its desired session-construction arguments, each
node that is incorporated in the session creation can alter these arguments in
any way and may add further properties.
This effectively enables each parent to impose any desired restrictions to
sessions created by its children.
This concept works recursively and enables each node in the process hierarchy
to control exactly the properties that it knows and cares about. As a side
note, the specification of resource donations as described in the Section
[Quota] is performed with the same mechanism. A resource donation is a property
of a session.
[image incremental_restrictions]
Successive application of policies at the creation time of a new session.
Figure [incremental_restrictions] shows an example scenario. A user
application issues the creation of a new session to the 'GUI' server and
specifies its wish for reading user input and using the string "Terminal" as
window label (1).
The parent of the user application is the user manager that introduces
user identities into the system and wants to ensure that each displayed window
gets tagged with the user and the executed program. Therefore, it overrides the
'label' attribute with more accurate information (2). Note that the modified
argument is now the head of the argument list.
The parent of the user manager, in turn, implements further policies. In the
example, Init's policy prohibits the user-manager subtree from reading
input (for example to disable access to the system beyond official working hours)
by redefining the 'input' attribute and leaving all other attributes unchanged (3).
The actual GUI server observes the final result of the successively changed
session-construction arguments (4) and it is responsible for enforcing the specified
policy for the lifetime of the session.
Once a session has been established, its properties are fixed and cannot be changed.
Core - the root of the process tree
###################################
Core is the first user-level program that takes control when starting up the
system. It has access to the raw physical resources and converts them to
abstractions that enable multiple programs to use these resources.
In particular, core converts the physical address space to higher-level
containers called _dataspaces_.
A dataspace represents a contiguous physical address space region with an
arbitrary size (at page-size granularity).
Multiple processes can make the same dataspace accessible in their
local address spaces.
The system on top of core never deals with physical memory pages but
uses this uniform abstraction to work with memory, memory-mapped I/O
regions, and ROM areas.
*Note:* _Using only contiguous dataspaces may lead to fragmentation of the_
_physical address space. This property is, however, only required by_
_a few rare cases (e.g., DMA transfers). Therefore, later versions of the_
_design will support non-contiguous dataspaces._
Furthermore, core provides all prerequisites to bootstrap the process tree.
These prerequisites comprise services for creating processes and threads,
for allocating memory, for accessing boot-time-present files, and for managing
address-space layouts.
Core is almost free from policy. There are no configuration options.
The only policy of core is the startup of the init process to which core
grants all available resources.
In the following, we explain the session interfaces of core's services in
detail.
RAM - allocator for physical memory
===================================
A RAM session is a quota-bounded allocator of blocks from physical memory.
There are no RAM-specific session-construction arguments.
Immediately after the creation of a RAM session, its quota is zero.
To make the RAM session functional, it must be loaded with quota from
another already existing RAM session, which we call the _reference account_.
The reference account of a RAM session can be defined initially via:
!int ref_account(Ram_session_capability ram_session_cap);
Once the reference account is defined, quota can be transferred back and
forth between the reference account and the new RAM session with:
!int transfer_quota(Ram_session_capability ram_session_cap,
! size_t amount);
Provided, the RAM session has enough quota, a dataspace of a given size
can be allocated with:
!Ram_dataspace_capability alloc(size_t size);
The result value of 'alloc' is a capability to the RAM-dataspace
object implemented in core. This capability can be communicated to other
processes and can be used to make the dataspace's physical-memory region
accessible from these processes.
An allocated dataspace can be released with:
!void free(Ram_dataspace_capability ds_cap);
The 'alloc' and 'free' calls track the used-quota information of the RAM
session accordingly.
Current statistical information about the quota limit and the
used quota can be retrieved by:
!size_t quota();
!size_t used();
Closing a RAM session implicitly destroys all allocated dataspaces.
ROM - boot-time-file access
===========================
A ROM session represents a boot-time-present read-only file. This may be a
module provided by the boot loader or a part of a static ROM image. On session
construction, a file identifier must be specified as a session argument using the
tag 'filename'. The available filenames are not fixed but depend on the actual
deployment. On some platforms, core may provide logical files for special memory
objects such as the GRUB multiboot info structure or a kernel info page. The
ROM session enables the actual read access to the file by exporting the file as
dataspace:
!Rom_dataspace_capability dataspace();
IO_MEM - memory mapped I/O access
=================================
With IO_MEM, core provides a dataspace abstraction for non-memory parts of the
physical address space such as memory-mapped I/O regions or BIOS areas. In
contrast to a memory block that is used for storing information of which the
physical location in memory is of no matter, a non-memory object has a special
semantics attached to its location within the physical address space. Its
location is either fixed (by standard) or can be determined at runtime, for
example by scanning the PCI bus for PCI resources. If the physical location of
such a non-memory object is known, an IO_MEM session can be created by
specifying 'base' and 'size' as session-construction arguments.
The IO_MEM session then provides the specified physical memory area as
dataspace:
!Io_mem_dataspace_capability dataspace();
IO_PORT - access to I/O ports
=============================
For platforms that rely on I/O ports for device access, core's IO_PORT service
enables fine-grained assignment of port ranges to individual processes.
Each IO_PORT session corresponds to the exclusive access right to a
port range as specified with the 'io_port_base' and 'io_port_size'
session-construction arguments. Core creates the new IO_PORT session
only if the specified port range does not overlap with an already existing
session. This ensures that each I/O port is driven by only one
process at a time. The IO_PORT session interface resembles the
physical I/O port access instructions. Reading from an I/O port
can be performed via an 8bit, 16bit, or 32bit access:
!unsigned char inb(unsigned short address);
!unsigned short inw(unsigned short address);
!unsigned inl(unsigned short address);
Vice versa, there exist functions for writing to an I/O port via
an 8bit, 16bit, or 32bit access:
!void outb(unsigned short address, unsigned char value);
!void outw(unsigned short address, unsigned short value);
!void outl(unsigned short address, unsigned value);
The address argument of I/O-port access functions are absolute
port addresses that must be within the port range of the session.
IRQ - handling device interrupts
================================
The IRQ service of core provides processes with an interface to
device interrupts. Each IRQ session corresponds to an attached
interrupt. The physical interrupt number is specified via the
'irq_number' session-construction argument. A physical interrupt
number can be attached to only one session. The IRQ session
interface provides a blocking function to wait for the next
interrupt:
!void wait_for_irq();
While the 'wait_for_irq' function blocks, core unmasks the
interrupt corresponding to the IRQ session.
On function return, the corresponding interrupt line is masked
and acknowledged.
;*Note:* _The interface of the IRQ service is going to be changed_
;_with the planed addition of signals to the framework._
RM - managing address space layouts
===================================
RM is a _region manager_ service that allows for constructing address space
layouts (_region map_) from dataspaces and that provides support for assigning
region maps to processes by paging the process' threads.
Each RM session corresponds to one region map. After creating a new RM session,
dataspaces can be attached to the region map via:
!void *attach(Dataspace_capability ds_cap,
! size_t size=0, off_t offset=0,
! bool use_local_addr = false,
! addr_t local_addr = 0);
The 'attach' function inserts the specified dataspace into the region map and
returns the actually used start position within the region map.
By using the default arguments, the region manager chooses an appropriate
position that is large enough to hold the whole dataspace.
Alternatively, the caller of 'attach' can attach any sub-range of the dataspace
at a specified target position to the region map by enabling 'use_local_addr'
and specifying an argument for 'local_addr'. Note that the interface allows for the
same dataspace to be attached not only to multiple region maps but also multiple
times to the same region map.
As the counterpart to 'attach', 'detach' removes dataspaces from the region map:
!void detach(void *local_addr);
The region manager determines the dataspace at the specified 'local_addr' (not
necessarily the start address) and removes the whole dataspace from the region
map.
To enable the use of a RM session by a process, we must associate it with
each thread running in the process. The function
!Thread_capability add_client(Thread_capability thread);
returns a thread capability for a _pager_ that handles the page faults of the
specified 'thread' according to the region map.
With subsequent page faults caused by the thread, the address-space layout
described by the region map becomes valid for the process that is executing the
thread.
CPU - allocator for processing time
===================================
A CPU session is an allocator for processing time that allows for the creation,
the control, and the destruction of threads of execution.
There are no session arguments used.
The functionality of starting and killing threads is provided by two functions:
!Thread_capability create_thread(const char* name);
!void kill_thread(Thread_capability thread_cap);
The 'create_thread' function takes a symbolic thread name (that is only used
for debugging purposes) and returns a capability to the new thread.
Furthermore, the CPU session provides the following functions for operating
on threads:
!int set_pager(Thread_capability thread_cap,
! Thread_capability pager_cap);
!int cancel_blocking(Thread_capability thread_cap);
!int start(Thread_capability thread_cap,
! addr_t ip, addr_t sp);
!int state(Thread_capability thread,
! Thread_state *out_state);
The 'set_pager' function registers the thread's pager whereas 'pager_cap'
(obtained by calling 'add_client' at a RM session) refers to the RM session to
be used as the address-space layout.
For starting the actual execution of the thread, its initial instruction
pointer ('ip') and stack pointer ('sp') must be specified for the 'start'
operation.
In turn, the 'state' function provides the current thread state including
the current instruction pointer and stack pointer.
The 'cancel_blocking' function causes the specified thread to cancel a
currently executed blocking operation such as waiting for an incoming message
or acquiring a lock. This function is used by the framework for gracefully
destructing threads.
*Note:* _Future versions of the CPU service will provide means to further control the_
_thread during execution (e.g., pause, execution of only one instruction),_
_acquiring more comprehensive thread state (current registers), and configuring_
_scheduling parameters._
PD - providing protection domains
=================================
A PD session corresponds to a memory protection domain. Together
with one or more threads and an address-space layout (RM session), it forms a
process.
There are no session arguments. After session creation, the PD contains no
threads. Once a new thread has been created from a CPU session, it can be assigned
to the PD by calling:
! int bind_thread(Thread_capability thread);
CAP - allocator for capabilities
================================
A capability is a system-wide unique object identity that typically refers to a
remote object implemented by a service. For each object to be made remotely
accessible, the service creates a new capability associated with the local
object. CAP is a service to allocate and free capabilities:
! Capability alloc(Capability ep_cap);
! void free(Capability cap);
The 'alloc' function takes an entrypoint capability as argument, which is the
communication receiver for invocations of the new capability's RPC interface.
LOG - debug output facility
===========================
The LOG service is used by the lowest-level system components such as the init
process for printing debug output.
Each LOG session takes a 'label' string as session argument,
which is used to prefix the debug output of this session.
This enables developers to distinguish multiple producers of debug output.
The function
! size_t write(const char *string);
outputs the specified 'string' to the debug-output backend of core.
Process creation
################
The previous section presented the services implemented by core.
In this section, we show how to combine these basic mechanisms to create and
execute a process.
Process creation serves as a prime example for our general approach to first
provide very simple functional primitives and then solve complex problems using
a composition of these primitives.
We use slightly simplified pseudo code to illustrate this procedure.
The 'env()' object refers to the environment of the creating process, which
contains its RM session and RAM session.
:Obtaining the executable ELF binary:
If the binary is available as ROM object, we can access its data by creating
a ROM session with the binary's name as argument and attaching its dataspace
to our local address space:
!Rom_session_capability file_cap;
!file_cap = session("ROM", "filename=init");
!Rom_dataspace_capability ds_cap;
!ds_cap = Rom_session_client(file_cap).dataspace();
!
!void *elf_addr = env()->rm_session()->attach(ds_cap);
The variable 'elf_addr' now points to the start of the binary data.
:ELF binary decoding and creation of the new region map:
We create a new region map using the RM service:
!Rm_session_capability rm_cap;
!rm_cap = session("RM");
!Rm_session_client rsc(rm_cap);
Initially, this region map is empty.
The ELF binary contains CODE, DATA, and BSS sections.
For each section, we add a dataspace to the region map.
For read-only CODE and DATA sections, we attach the corresponding ranges of
the original ELF dataspace ('ds_cap'):
!rsc.attach(ds_cap, size, offset, true, addr);
The 'size' and 'offset' arguments specify the location of the section within
the ELF image. The 'addr' argument defines the desired start position at the
region map.
For each BSS and DATA section, we allocate a read-and-writeable RAM dataspace
!Ram_dataspace_capability rw_cap;
!rw_cap = env()->ram_session()->alloc(section_size);
and assign its initial content (zero for BSS sections, copy of ELF DATA sections).
!void *sec_addr = env()->rm_session()->attach(rw_cap);
! ... /* write to buffer at sec_addr */
!env()->rm_session()->detach(sec_addr);
After iterating through all ELF sections, the region map of the new process
is completely initialized.
:Creating the first thread:
For creating the main thread of the new process, we create a
new CPU session from which we allocate the thread:
!CPU_session_capability cpu_cap = session("CPU");
!Cpu_session_client csc(cpu_cap);
!Thread_capability thread_cap = csc.create_thread();
When the thread starts its execution and fetches its first instruction, it
will immediately trigger a page fault. Therefore, we need to assign a
page-fault handler (pager) to the thread. With resolving subsequent page faults, the
pager will populate the address space in which the thread is executed with
memory mappings according to a region map:
!Thread_capability pager_cap = rsc.add_client(thread_cap);
!csc.set_pager(thread_cap, pager_cap);
:Creating a protection domain:
The new process' protection domain corresponds to a PD session:
!Pd_session_capability pd_cap = session("PD");
!Pd_session_client pdsc(pd_cap);
:Assigning the first thread to the protection domain:
!pdsc.bind_thread(thread_cap);
:Starting the execution:
Now that we defined the relationship of the process' region map, its main
thread, and its address space, we can start the process by specifying the
initial instruction pointer and stack pointer as obtained from the ELF
binary.
!csc.start(thread_cap, ip, sp);
; supplying the parent capability to the new process