====================================
Design of the Genode OS Architecture
====================================


Norman Feske and Christian Helmuth

Abstract
########

In the software world, high complexity of a problem solution comes along with a
high risk for bugs and vulnerabilities. This correlation is particularly
perturbing for today's commodity operating systems with their tremendous
complexity. The numerous approaches to increase the user's confidence in the
correct functioning of software comprise exhaustive tests, code auditing, static
code analysis, and formal verification. Such quality-assurance measures are
either rather shallow or they scale badly with increasing complexity.

The operating-system design presented in this document focuses on the root of
the problem by providing means to minimize the underlying system complexity for
each security-sensitive application individually. On the other hand, we want to
enable multiple applications to execute on the system at the same time whereas
each application may have different functional requirements from the operating
system. Today's operating systems provide a functional superset of the
requirements of all applications and thus violate the principle of minimalism
for each single application. We resolve the conflict between the principle of
minimalism and the versatility of the operating system by decomposing the
operating system into small components and by providing a way to execute those
components isolated and independent from each other. Components can be device
drivers, protocol stacks such as file systems and network stacks, native
applications, and containers for executing legacy software. Each application
depends only on the functionality of a bounded set of components that we call
_application-specific_trusted_computing_base_(TCB)_. If the TCBs of two
applications are executed completely _isolated_ and _independent_ from each
other, we consider both TCBs as minimal.

In practice, however, we want to share physical resources between multiple
applications without sacrificing their independence. Therefore, the
operating-system design has to enable the assignment of physical resources to
each application and its TCB to maintain independence from other applications.
Furthermore, rather than living in complete isolation, components need to
communicate with each other to cooperate. The operating-system design must
enable components to create other components and get them to know each other
while maintaining isolation from uninvolved parts of the system.

First, we narrow our goals and pose our major challenges in Section
[Goals and Challenges]. Section [Interfaces and Mechanisms] introduces our
fundamental concepts and protocols that apply to each component in the system.
In Section [Core - the root of the process tree], we present the one component
that is a mandatory part of each TCB, enables the bootstrapping of the system,
and provides abstractions for the lowest-level resources. We exercise the
composition of the presented mechanisms by means of process creation in
Section [Process creation].
;Section [Framework infrastructure]


Goals and Challenges
####################

The Genode architecture is designed to accommodate the following types of
components in a secure manner concurrently on one machine:

:Device drivers:

  Device drivers translate the facilities of raw physical devices to
  device-class-specific interfaces to be used by other components. They contain
  no security policies and provide their services to only one client component
  per device.

:Services that multiplex resources:

  To make one physical resource (e.g., a device) usable by multiple components
  at the same time, the physical resource must be translated to multiple
  virtual resources. For example, a frame buffer provided by a device driver
  can only be used by one client at a time. A window system multiplexes this
  physical resource to make it available to multiple clients. Other examples
  are an audio mixer or a virtual network hub. In contrast to a device driver,
  a _resource multiplexer_ deals with multiple clients and therefore plays a
  crucial role for maintaining the independence and isolation of its clients
  from each other.

:Protocol stacks:

  Protocol stacks translate low-level protocols to a higher and more applicable
  level. For example, a file system translates a block-device protocol to a
  file abstraction, a TCP/IP stack translates network packets to a socket
  abstraction, or a widget set maps high-level GUI elements to pixels.
  Compared to resource multiplexers, protocol stacks are typically an order of
  magnitude more complex. Protocol stacks may also act as resource
  multiplexers. In this case, however, high complexity puts the independence
  and isolation of multiple clients at a high risk. Therefore, our design
  should enable the instantiation of protocol stacks per application. For
  example, instead of letting a security-sensitive application share one TCP/IP
  stack with multiple other (untrusted) applications, it could use a dedicated
  instance of a TCP/IP stack to increase its independence and isolation from
  the other applications.

:Containers for executing legacy software:

  A _legacy container_ provides an environment for the execution of existing
  legacy software. This can be achieved by means of a virtual machine (e.g., a
  Java VM, a virtual PC), a compatible programming API (e.g., POSIX, Qt), a
  language environment (e.g., LISP), or a script interpreter. In the majority
  of cases, we regard legacy software as an untrusted black box. One particular
  example of legacy software is untrusted legacy device drivers. In this case,
  the container has to protect the physical hardware from potentially malicious
  device accesses by the untrusted driver. Legacy software may be extremely
  complex and resource demanding, for example the Firefox web browser executed
  on top of the X window system and the Linux kernel inside a virtualized PC.
  In this case, the legacy container may locally implement sophisticated
  resource-management techniques such as virtual memory.

:Small custom security-sensitive applications:

  Alongside legacy software, small custom applications implement crucial
  security-sensitive functionality. In contrast to legacy software, which we
  mostly regard as untrusted anyway, a low TCB complexity for custom
  applications is of extreme importance. Given the special liability of such an
  application, it is very carefully designed to have low complexity and require
  as little infrastructure as possible. A typical example is a cryptographic
  component that protects credentials of the user. Such an application does not
  require swapping (virtual memory), a POSIX API, or a complete C library.
  Instead, the main objectives of such an application are to keep as much code
  as possible out of its TCB and to keep its requirements at a minimum.

Our design must be able to create and destroy subsystems that are composed of
multiple such components. The _isolation_ requirement as stated in the
introduction raises the question of how to organize the locality of name spaces
and how to distribute access from components to other components within the
system. The _independence_ requirement demands the assignment of physical
resources to components such that different applications do not interfere.
Instead of managing access control and physical resources from a central place,
we desire a distributed way for applying policy for trading and revoking
resources and for delegating rights.


Interfaces and Mechanisms
#########################

The system is structured as a tree. The nodes of the tree are processes. A
node for which sub-nodes exist is called the _parent_ of these sub-nodes
(_children_).
The parent creates children out of its own resources and defines their
execution environment. Each process can announce services to its parent. The
parent, in turn, can mediate such a service to its other children. When a child
is created, its parent provides the initial contact to the outer world via the
following interface:

! void exit(int exit_value);
!
! Session_capability session(String service_name,
!                            String args);
!
! void close(Session_capability session_cap);
!
! int announce(String          service_name,
!              Root_capability service_root_cap);
!
! int transfer_quota(Session_capability to_session_cap,
!                    String             amount);

:'exit': is called by a child to request its own termination.

:'session': is called by a child to request a connection to the specified
  service as known by its parent whereas 'service_name' is the name of the
  desired service _interface_. The way of resolving or even denying a 'session'
  request depends on the policy of the parent. The 'args' parameter contains
  construction arguments for the session to be created. In particular, 'args'
  contains a specification of resources that the process is willing to donate
  to the server during the session lifetime.

:'close': is called by a child to inform its parent that the specified session
  is no longer needed. The parent should close the session and hand back
  donated resources to the child.

:'announce': is called by a child to register a locally implemented service at
  its parent. Hence, this child is a server.

:'transfer_quota': enables a child to extend its resource donation to the
  server that provides the specified session.

We provide a detailed description and motivation for the different functions in
Sections [Servers] and [Quota].


Servers
=======

Each process may implement services and announce them via the 'announce'
function of the parent interface. When announcing a service, the server
specifies a _root_ capability for the implemented service.
The interface of the root capability enables the parent to create, configure,
and close sessions of the service:

! Session_capability session(String args);
!
! int transfer_quota(Session_capability to_session_cap,
!                    String             amount);
!
! void close(Session_capability session_cap);

[image announce 60%]
  Announcement of a service by a child (server). Colored circles at the edge of
  a component represent remotely accessible objects. Small circles inside a
  component represent a reference (capability) to a remote object. A
  cross-component reference to a remote object is illustrated by a dashed
  arrow. An opaque arrow symbolizes an RPC call/return.

Figure [announce] illustrates an announcement of a service. Initially, each
child has a capability to its parent. After Child1 announces its service
"Service", its parent knows the root capability of this service under the local
name 'srv1_r' and stores the root capability with the announced service name in
its _root_list_. The root capability is intended to be used and kept by the
parent only.

[image request 60%]
  Service request by a client.

When a parent calls the 'session' function of the root interface of a server
child, the server creates a new client session and returns the corresponding
'client_session' capability. This session capability provides the actual
service-specific interface. The parent can use it directly or it may pass it to
other processes, in particular to another child that requested the session. In
Figure [request], Child2 initiates the creation of a "Service" session by a
'session' call at its parent capability (1). The parent uses its root list to
look up the root capability that matches the service name "Service" (2) and
calls the 'session' function at the server (3). Child1, being the server,
creates a new session ('session1') and returns the session capability as result
of the 'session' call (4).
The parent now knows the new session under the local name 'srv1_s1' (5) and
passes the session capability as return value of Child2's initial 'session'
call (6). The parent maintains a _session_list_, which stores the interrelation
between children and their created sessions. Now, Child2 has a direct
communication channel to 'session1' provided by the server (Child1) (7).

The 'close' function of the root interface instructs the server to destroy the
specified session and to release all session-specific resources.

; Using 'set_quota', the parent can instruct a service to limit the resource
; usage for a specified 'client_session'. A more detailed description of the
; resource accounting follows in Section [Quota].

[image twolevels 80%]
  Announcement and request of a service in a subsystem. For simplicity, parent
  capabilities are not displayed.

Even though the prior examples involved only one parent, the announce-request
mechanism can be used recursively for tree structures of any depth and thus
allows for partitioning the system into subsystems that can cooperate with each
other whereas parents are always in complete control over the communication and
resource usage of their children (and their subsystems).

Figure [twolevels] depicts a nested subsystem on the left. Child1 announces its
service named "Service" at its parent that, in turn, announces a service named
"Service" at the Grandparent. The service names do not need to be identical.
Their meaning extends to their immediate parent only and there may be a name
remapping on each hierarchy level. Each parent can decide for itself whether or
not to further announce services of its children to the outer world. The parent
can announce Child1's service to the grandparent by creating a new root
capability to a local service that forwards session-creation and closing
requests to Child1. Both Parent and Grandparent keep their local root lists.
In a second step, Parent2 initiates the creation of a session to the service by
issuing a 'session' request at the Grandparent (1). Grandparent uses its root
list to look up the service-providing child (from Grandparent's local view)
Parent1 (2). Parent1, in turn, does not implement the service by itself but
delegates the 'session' request to Child1 by calling the 'session' function of
the actual "Service" root interface (3). The session capability, created by
Child1 (4), can now be passed to Parent2 as return value of nested 'session'
calls (5, 6). Each involved node keeps the local knowledge about the created
session such that later, the session can be closed in the same nested fashion.


Quota
=====

Each process that provides services to other processes consumes resources on
behalf of its clients. Such a server requires memory to maintain
session-specific state, processing time to perform the actual service function,
and possibly further system resources (e.g., bus bandwidth) dependent on client
requests. To avoid denial-of-service problems, a server must not allocate such
resources from its own budget but let the client pay. Therefore, a mechanism
for donating resource quotas from the client to the server is required. Both
client and server may be arbitrary nodes in the process tree. In the following,
we examine the trading of resource quotas within the recursive system structure
using memory as an example.

When creating a child, the parent assigns a part of its own memory quota to the
new child. During the lifetime of the child, the parent can further transfer
quota back and forth between the child's and its own account. Because the
parent creates its children out of its own resources, it has a natural interest
to correctly manage child quotas. When a child requests a session to a service,
it can bind a part of its quota to the new session by specifying a resource
donation as an argument.
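
The quota assignment and donation described above can be modelled in a few
lines. The following is a minimal sketch, not Genode code: the class 'Account',
its member 'transfer_to', and the helper 'example_donation' are hypothetical
names chosen for illustration, and memory quota is reduced to a plain byte
counter.

```cpp
#include <cassert>
#include <cstddef>

/* Hypothetical model of a memory-quota account (not part of the Genode API) */
class Account
{
	private:

		std::size_t _quota;

	public:

		explicit Account(std::size_t quota) : _quota(quota) { }

		std::size_t quota() const { return _quota; }

		/*
		 * Move 'amount' from this account to 'to'. The transfer is refused
		 * if it exceeds the quota of the source account, so quota is never
		 * created out of thin air.
		 */
		bool transfer_to(Account &to, std::size_t amount)
		{
			if (amount > _quota)
				return false;

			_quota    -= amount;
			to._quota += amount;
			return true;
		}
};

/*
 * The parent creates a child out of its own quota. The child then donates
 * a part of its quota to a server when requesting a session.
 */
inline bool example_donation()
{
	Account parent(1024), child(0), server(0);

	bool const created = parent.transfer_to(child, 256);    /* child creation  */
	bool const donated = child.transfer_to(server, 64);     /* session request */
	bool const refused = !child.transfer_to(server, 4096);  /* over budget     */

	return created && donated && refused
	    && parent.quota() == 768
	    && child.quota()  == 192
	    && server.quota() == 64;
}
```

The sum of all accounts stays constant across transfers, which is the property
that lets a parent reason about the resource usage of its whole subtree.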

When receiving a session request, the parent has to distinguish three different
cases, dependent on where the corresponding server resides:

:Parent provides service:

  If the parent provides the requested service by itself, it transfers the
  donated amount of memory quota from the requesting child's account to its own
  account to compensate for the session-specific memory allocation on behalf of
  its own child.

:Server is another child:

  If there exists a matching entry in the parent's root list, the requested
  service is provided by another child (or a node within the child subsystem).
  In this case, the parent transfers the donated memory quota from the
  requesting child to the service-providing child.

:Delegation to grandparent:

  The parent may decide to delegate the session request to its own parent
  because the requested service is provided by a lower node of the process
  tree. Thus, the parent will request a session on behalf of its child. The
  grandparent neither knows nor cares about the actual origin of the request
  and will simply decrease the memory quota of the parent. For this reason, the
  parent transfers the donated memory quota from the requesting child to its
  own account before calling the grandparent.

This algorithm works recursively. Once the server receives the session request,
it checks if the donated memory quota suffices for storing the session-specific
data and, on success, creates the session. If the initial quota donation turns
out to be too scarce during the lifetime of a session, the client may make
further donations via the 'transfer_quota' function of the parent interface,
which works analogously.

If a child requests to close a session, the parent must distinguish the three
cases as above. Once the server receives the session-close request from its
parent, it is responsible for releasing all resources that were used for this
session. After the server releases the session-specific resources, the server's
quota can be decreased to the prior state.
However, an ill-behaving server may fail to release those resources, either out
of malice or because of a bug.

If the misbehaving service was provided by the parent itself, it has the full
authority to not hand back session quota to its child. If the misbehaving
service was provided by the grandparent, the parent (and its whole subsystem)
has to subordinate itself. If, however, the service was provided by another
child and the child refuses to release resources, decreasing its quota after
closing the session will fail. It is up to the policy of the parent to handle
such a failure either by punishing it (e.g., killing the misbehaving server) or
by granting more of its own quota. Generally, misbehavior is against the
server's own interests and each server would obey the parent's 'close' request
to avoid intervention.


Successive policy management
============================

For supporting a high variety of security policies for access control, we
require a way to bind properties and restrictions to sessions. For example, a
file service may want to restrict the access to files according to an
access-control policy that is specific for each client session. On session
creation, the 'session' call takes an 'args' argument that can be used for that
purpose. It is a list of tag-value pairs describing the session properties. By
convention, the list is ordered by attribute priority starting with the most
important property. The server uses these 'args' as construction arguments for
the new session and enforces the security policy as expressed by 'args'
accordingly. Whereas the client defines its desired session-construction
arguments, each node that is involved in the session creation can alter these
arguments in any way and may add further properties. This effectively enables
each parent to impose any desired restrictions on sessions created by its
children.
This concept works recursively and enables each node in the process hierarchy
to control exactly the properties that it knows and cares about. As a side
note, the specification of resource donations as described in Section [Quota]
is performed with the same mechanism. A resource donation is a property of a
session.

[image incremental_restrictions]
  Successive application of policies at the creation time of a new session.

Figure [incremental_restrictions] shows an example scenario. A user application
issues the creation of a new session to the 'GUI' server and specifies its wish
for reading user input and using the string "Terminal" as window label (1). The
parent of the user application is the user manager that introduces user
identities into the system and wants to ensure that each displayed window gets
tagged with the user and the executed program. Therefore, it overrides the
'label' attribute with more accurate information (2). Note that the modified
argument is now the head of the argument list. The parent of the user manager,
in turn, implements further policies. In the example, Init's policy prohibits
the user-manager subtree from reading input (for example to disable access to
the system beyond official working hours) by redefining the 'input' attribute
and leaving all other attributes unchanged (3). The actual GUI server observes
the final result of the successively changed session-construction arguments (4)
and it is responsible for enforcing the specified policy for the lifetime of
the session. Once a session has been established, its properties are fixed and
cannot be changed.


Core - the root of the process tree
###################################

Core is the first user-level program that takes control when starting up the
system. It has access to the raw physical resources and converts them to
abstractions that enable multiple programs to use these resources.
In particular, core converts the physical address space to higher-level
containers called _dataspaces_. A dataspace represents a contiguous physical
address space region with an arbitrary size (at page-size granularity).
Multiple processes can make the same dataspace accessible in their local
address spaces. The system on top of core never deals with physical memory
pages but uses this uniform abstraction to work with memory, memory-mapped I/O
regions, and ROM areas.

*Note:* _Using only contiguous dataspaces may lead to fragmentation of the_
_physical address space. This property is, however, only required by_
_a few rare cases (e.g., DMA transfers). Therefore, later versions of the_
_design will support non-contiguous dataspaces._

Furthermore, core provides all prerequisites to bootstrap the process tree.
These prerequisites comprise services for creating processes and threads, for
allocating memory, for accessing boot-time-present files, and for managing
address-space layouts. Core is almost free from policy. There are no
configuration options. The only policy of core is the startup of the init
process to which core grants all available resources.

In the following, we explain the session interfaces of core's services in
detail.


RAM - allocator for physical memory
===================================

A RAM session is a quota-bounded allocator of blocks from physical memory.
There are no RAM-specific session-construction arguments. Immediately after the
creation of a RAM session, its quota is zero. To make the RAM session
functional, it must be loaded with quota from another already existing RAM
session, which we call the _reference account_. The reference account of a RAM
session can be defined initially via:

!int ref_account(Ram_session_capability ram_session_cap);

Once the reference account is defined, quota can be transferred back and forth
between the reference account and the new RAM session with:

!int transfer_quota(Ram_session_capability ram_session_cap,
!                   size_t                 amount);

Provided the RAM session has enough quota, a dataspace of a given size can be
allocated with:

!Ram_dataspace_capability alloc(size_t size);

The result value of 'alloc' is a capability to the RAM-dataspace object
implemented in core. This capability can be communicated to other processes and
can be used to make the dataspace's physical-memory region accessible from
these processes. An allocated dataspace can be released with:

!void free(Ram_dataspace_capability ds_cap);

The 'alloc' and 'free' calls track the used-quota information of the RAM
session accordingly. Current statistical information about the quota limit and
the used quota can be retrieved by:

!size_t quota();
!size_t used();

Closing a RAM session implicitly destroys all allocated dataspaces.


ROM - boot-time-file access
===========================

A ROM session represents a boot-time-present read-only file. This may be a
module provided by the boot loader or a part of a static ROM image. On session
construction, a file identifier must be specified as a session argument using
the tag 'filename'. The available filenames are not fixed but depend on the
actual deployment. On some platforms, core may provide logical files for
special memory objects such as the GRUB multiboot info structure or a kernel
info page. The ROM session enables the actual read access to the file by
exporting the file as dataspace:

!Rom_dataspace_capability dataspace();


IO_MEM - memory mapped I/O access
=================================

With IO_MEM, core provides a dataspace abstraction for non-memory parts of the
physical address space such as memory-mapped I/O regions or BIOS areas. In
contrast to a memory block that is used for storing information whose physical
location in memory does not matter, a non-memory object has special semantics
attached to its location within the physical address space.
Its location is either fixed (by standard) or can be determined at runtime, for
example by scanning the PCI bus for PCI resources. If the physical location of
such a non-memory object is known, an IO_MEM session can be created by
specifying 'base' and 'size' as session-construction arguments. The IO_MEM
session then provides the specified physical memory area as dataspace:

!Io_mem_dataspace_capability dataspace();


IO_PORT - access to I/O ports
=============================

For platforms that rely on I/O ports for device access, core's IO_PORT service
enables fine-grained assignment of port ranges to individual processes. Each
IO_PORT session corresponds to the exclusive access right to a port range as
specified with the 'io_port_base' and 'io_port_size' session-construction
arguments. Core creates the new IO_PORT session only if the specified port
range does not overlap with an already existing session. This ensures that each
I/O port is driven by only one process at a time. The IO_PORT session interface
resembles the physical I/O port access instructions. Reading from an I/O port
can be performed via an 8-bit, 16-bit, or 32-bit access:

!unsigned char  inb(unsigned short address);
!unsigned short inw(unsigned short address);
!unsigned       inl(unsigned short address);

Vice versa, there exist functions for writing to an I/O port via an 8-bit,
16-bit, or 32-bit access:

!void outb(unsigned short address, unsigned char  value);
!void outw(unsigned short address, unsigned short value);
!void outl(unsigned short address, unsigned       value);

The address arguments of the I/O-port access functions are absolute port
addresses that must be within the port range of the session.


IRQ - handling device interrupts
================================

The IRQ service of core provides processes with an interface to device
interrupts. Each IRQ session corresponds to an attached interrupt. The physical
interrupt number is specified via the 'irq_number' session-construction
argument.
A physical interrupt number can be attached to only one session. The IRQ
session interface provides a blocking function to wait for the next interrupt:

!void wait_for_irq();

While the 'wait_for_irq' function blocks, core unmasks the interrupt
corresponding to the IRQ session. On function return, the corresponding
interrupt line is masked and acknowledged.

;*Note:* _The interface of the IRQ service is going to be changed_
;_with the planned addition of signals to the framework._


RM - managing address space layouts
===================================

RM is a _region manager_ service that allows for constructing address space
layouts (_region map_) from dataspaces and that provides support for assigning
region maps to processes by paging the process' threads. Each RM session
corresponds to one region map. After creating a new RM session, dataspaces can
be attached to the region map via:

!void *attach(Dataspace_capability ds_cap,
!             size_t size=0, off_t offset=0,
!             bool use_local_addr = false,
!             addr_t local_addr = 0);

The 'attach' function inserts the specified dataspace into the region map and
returns the actually used start position within the region map. By using the
default arguments, the region manager chooses an appropriate position that is
large enough to hold the whole dataspace. Alternatively, the caller of 'attach'
can attach any sub-range of the dataspace at a specified target position to the
region map by enabling 'use_local_addr' and specifying an argument for
'local_addr'. Note that the interface allows for the same dataspace to be
attached not only to multiple region maps but also multiple times to the same
region map. As the counterpart to 'attach', 'detach' removes dataspaces from
the region map:

!void detach(void *local_addr);

The region manager determines the dataspace at the specified 'local_addr' (not
necessarily the start address) and removes the whole dataspace from the region
map.
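
The 'attach' and 'detach' semantics described above can be illustrated with a
small model. This is only a sketch under simplifying assumptions, not core's
implementation: 'Dataspace' is a hypothetical plain struct instead of a
capability, addresses are returned as integers where 0 signals failure, the
first-fit placement starting at a made-up 'BASE' address is an arbitrary
choice, and 'num_regions' exists only for inspection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

/* Hypothetical stand-ins for core's types, for illustration only */
using addr_t = std::uintptr_t;

struct Dataspace { int id; std::size_t size; };

class Region_map
{
	private:

		struct Region { int ds_id; std::size_t size; std::size_t offset; };

		std::map<addr_t, Region> _regions;   /* keyed by start address */

		enum : addr_t { BASE = 0x1000 };     /* assumed lowest usable address */

		/* true if [a, a + size) overlaps the region starting at 'start' */
		static bool _overlaps(addr_t a, std::size_t size,
		                      addr_t start, Region const &r)
		{
			return a < start + r.size && start < a + size;
		}

	public:

		/* returns the start address used within the region map, 0 on failure */
		addr_t attach(Dataspace const &ds, std::size_t size = 0,
		              std::size_t offset = 0, bool use_local_addr = false,
		              addr_t local_addr = 0)
		{
			if (offset >= ds.size) return 0;
			if (size == 0) size = ds.size - offset;   /* default: rest of dataspace */
			if (size > ds.size - offset) return 0;

			addr_t addr = use_local_addr ? local_addr : addr_t(BASE);
			if (addr == 0) return 0;

			for (auto const &r : _regions) {
				if (!_overlaps(addr, size, r.first, r.second))
					continue;
				if (use_local_addr)
					return 0;                      /* requested range is occupied */
				addr = r.first + r.second.size;    /* first fit: skip behind region */
			}
			_regions[addr] = Region { ds.id, size, offset };
			return addr;
		}

		/* remove the whole region that contains 'local_addr' */
		bool detach(addr_t local_addr)
		{
			for (auto it = _regions.begin(); it != _regions.end(); ++it)
				if (local_addr >= it->first
				 && local_addr <  it->first + it->second.size) {
					_regions.erase(it);
					return true;
				}
			return false;
		}

		std::size_t num_regions() const { return _regions.size(); }
};
```

With this model, attaching the same dataspace twice with default arguments
yields two distinct, non-overlapping regions, and 'detach' accepts any address
inside a region rather than only its start.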

To enable the use of an RM session by a process, we must associate it with each
thread running in the process. The function

!Thread_capability add_client(Thread_capability thread);

returns a thread capability for a _pager_ that handles the page faults of the
specified 'thread' according to the region map. With subsequent page faults
caused by the thread, the address-space layout described by the region map
becomes valid for the process that is executing the thread.


CPU - allocator for processing time
===================================

A CPU session is an allocator for processing time that allows for the creation,
the control, and the destruction of threads of execution. There are no session
arguments used. The functionality of starting and killing threads is provided
by two functions:

!Thread_capability create_thread(const char* name);
!void kill_thread(Thread_capability thread_cap);

The 'create_thread' function takes a symbolic thread name (that is only used
for debugging purposes) and returns a capability to the new thread.
Furthermore, the CPU session provides the following functions for operating on
threads:

!int set_pager(Thread_capability thread_cap,
!              Thread_capability pager_cap);
!int cancel_blocking(Thread_capability thread_cap);
!int start(Thread_capability thread_cap,
!          addr_t ip, addr_t sp);
!int state(Thread_capability thread,
!          Thread_state *out_state);

The 'set_pager' function registers the thread's pager whereas 'pager_cap'
(obtained by calling 'add_client' at an RM session) refers to the RM session to
be used as the address-space layout. For starting the actual execution of the
thread, its initial instruction pointer ('ip') and stack pointer ('sp') must be
specified for the 'start' operation. In turn, the 'state' function provides the
current thread state including the current instruction pointer and stack
pointer.

The 'cancel_blocking' function causes the specified thread to cancel a
currently executed blocking operation such as waiting for an incoming message
or acquiring a lock. This function is used by the framework for gracefully
destructing threads.

*Note:* _Future versions of the CPU service will provide means to further control the_
_thread during execution (e.g., pause, execution of only one instruction),_
_to acquire more comprehensive thread state (current registers), and to configure_
_scheduling parameters._


PD - providing protection domains
=================================

A PD session corresponds to a memory protection domain. Together with one or
more threads and an address-space layout (RM session), it forms a process.
A PD session takes no session arguments. After session creation, the PD
contains no threads. Once a new thread has been created from a CPU session,
it can be assigned to the PD by calling:

! int bind_thread(Thread_capability thread);


CAP - allocator for capabilities
================================

A capability is a system-wide unique object identity that typically refers to
a remote object implemented by a service. For each object to be made remotely
accessible, the service creates a new capability associated with the local
object. CAP is a service to allocate and free capabilities:

! Capability alloc(Capability ep_cap);
! void free(Capability cap);

The 'alloc' function takes an entrypoint capability as argument, which is the
communication receiver for invocations of the new capability's RPC interface.


LOG - debug output facility
===========================

The LOG service is used by the lowest-level system components such as the
init process for printing debug output. Each LOG session takes a 'label'
string as session argument, which is used to prefix the debug output of this
session. This enables developers to distinguish multiple producers of debug
output. The function

! size_t write(const char *string);

outputs the specified 'string' to the debug-output backend of core.


Process creation
################

The previous section presented the services implemented by core. In this
section, we show how to combine these basic mechanisms to create and execute
a process. Process creation serves as a prime example of our general approach
to first provide very simple functional primitives and then solve complex
problems using a composition of these primitives. We use slightly simplified
pseudo code to illustrate this procedure. The 'env()' object refers to the
environment of the creating process, which contains its RM session and RAM
session.

:Obtaining the executable ELF binary:

  If the binary is available as ROM object, we can access its data by
  creating a ROM session with the binary's name as argument and attaching its
  dataspace to our local address space:

  !Rom_session_capability file_cap;
  !file_cap = session("ROM", "filename=init");
  !Rom_dataspace_capability ds_cap;
  !ds_cap = Rom_session_client(file_cap).dataspace();
  !
  !void *elf_addr = env()->rm_session()->attach(ds_cap);

  The variable 'elf_addr' now points to the start of the binary data.

:ELF binary decoding and creation of the new region map:

  We create a new region map using the RM service:

  !Rm_session_capability rm_cap;
  !rm_cap = session("RM");
  !Rm_session_client rsc(rm_cap);

  Initially, this region map is empty. The ELF binary contains CODE, DATA,
  and BSS sections. For each section, we add a dataspace to the region map.
  For read-only CODE and DATA sections, we attach the corresponding ranges of
  the original ELF dataspace ('ds_cap'):

  !rsc.attach(ds_cap, size, offset, true, addr);

  The 'size' and 'offset' arguments specify the location of the section
  within the ELF image. The 'addr' argument defines the desired start
  position in the region map.

  For each BSS section and each writable DATA section, we allocate a readable
  and writable RAM dataspace

  !Ram_dataspace_capability rw_cap;
  !rw_cap = env()->ram_session()->alloc(section_size);

  and assign its initial content (zero for BSS sections, copy of ELF DATA
  sections).

  !void *sec_addr = env()->rm_session()->attach(rw_cap);
  ! ... /* write to buffer at sec_addr */
  !env()->rm_session()->detach(sec_addr);

  After iterating through all ELF sections, the region map of the new process
  is completely initialized.

:Creating the first thread:

  For creating the main thread of the new process, we create a new CPU
  session from which we allocate the thread:

  !Cpu_session_capability cpu_cap = session("CPU");
  !Cpu_session_client csc(cpu_cap);
  !Thread_capability thread_cap = csc.create_thread();

  When the thread starts its execution and fetches its first instruction, it
  will immediately trigger a page fault. Therefore, we need to assign a
  page-fault handler (pager) to the thread. By resolving subsequent page
  faults, the pager will populate the address space in which the thread is
  executed with memory mappings according to a region map:

  !Thread_capability pager_cap = rsc.add_client(thread_cap);
  !csc.set_pager(thread_cap, pager_cap);

:Creating a protection domain:

  The protection domain of the new process corresponds to a PD session:

  !Pd_session_capability pd_cap = session("PD");
  !Pd_session_client pdsc(pd_cap);

:Assigning the first thread to the protection domain:

  !pdsc.bind_thread(thread_cap);

:Starting the execution:

  Now that we have defined the relationship of the process' region map, its
  main thread, and its address space, we can start the process by specifying
  the initial instruction pointer and stack pointer as obtained from the ELF
  binary.

  !csc.start(thread_cap, ip, sp);

; supplying the parent capability to the new process
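
The way a pager can resolve page faults "according to a region map" can be
illustrated with a toy lookup structure. The following is a minimal sketch
under stated assumptions, not the mechanism implemented by core:
'Toy_region_map', 'Dataspace_id', 'Region', and 'Resolution' are hypothetical
names, dataspaces are reduced to integer identifiers, and the 'attach'
signature is simplified compared to the pseudo code above. The model records
which range of which dataspace is attached at which address and translates a
fault address into a dataspace and an offset within it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

using addr_t = std::uintptr_t;

/* Assumption: a dataspace capability is modelled as an integer identifier */
using Dataspace_id = unsigned;

struct Region
{
    Dataspace_id ds;      /* dataspace backing this region */
    std::size_t  size;    /* size of the attached range in bytes */
    std::size_t  offset;  /* start of the range within the dataspace */
};

struct Resolution
{
    bool         valid;      /* false if no region covers the address */
    Dataspace_id ds;         /* dataspace that backs the fault address */
    std::size_t  ds_offset;  /* offset of the fault within that dataspace */
};

/* Hypothetical toy model of a region map, for illustration only */
class Toy_region_map
{
    private:

        std::map<addr_t, Region> _regions;  /* keyed by region start address */

    public:

        /* Attach a dataspace range at 'addr', fails on empty or overlapping ranges */
        bool attach(Dataspace_id ds, std::size_t size, std::size_t offset,
                    addr_t addr)
        {
            if (size == 0) return false;

            /* the next region must not start inside the new range */
            auto next = _regions.lower_bound(addr);
            if (next != _regions.end() && next->first < addr + size)
                return false;

            /* the previous region must end at or before 'addr' */
            if (next != _regions.begin()) {
                auto prev = std::prev(next);
                if (prev->first + prev->second.size > addr)
                    return false;
            }

            _regions[addr] = Region{ds, size, offset};
            return true;
        }

        /* Translate a fault address into (dataspace, offset), as a pager might */
        Resolution resolve(addr_t fault_addr) const
        {
            auto it = _regions.upper_bound(fault_addr);
            if (it == _regions.begin()) return Resolution{false, 0, 0};
            --it;

            std::size_t delta = fault_addr - it->first;
            if (delta >= it->second.size) return Resolution{false, 0, 0};

            return Resolution{true, it->second.ds, it->second.offset + delta};
        }
};
```

With a read-only ELF range of dataspace 1 attached at 0x400000 (size 0x2000,
offset 0x1000) and a RAM dataspace 2 attached at 0x402000 (size 0x1000), a
fault at 0x400010 resolves to dataspace 1 at offset 0x1010, whereas a fault
at 0x403000 lies outside every region and cannot be resolved.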