Release notes for version 15.02
This commit is contained in:
parent
e866bf3fa3
commit
2392700c79
|
@ -0,0 +1,899 @@
|
|||
|
||||
|
||||
===============================================
|
||||
Release notes for the Genode OS Framework 15.02
|
||||
===============================================
|
||||
|
||||
Genode Labs
|
||||
|
||||
|
||||
|
||||
Genode's [http://genode.org/about/road-map - roadmap] for this year puts a
|
||||
strong emphasis on the consolidation and cultivation of the existing feature
|
||||
set. With the first release of the year, version 15.02 pays tribute to this
|
||||
mission by stepping up to extensive and systematic automated testing. As
|
||||
a precondition for scaling up Genode's test infrastructure, the release
|
||||
features a highly modular tool kit for exercising system scenarios on a growing zoo
|
||||
of test machines. Section [Modular tool kit for automated testing] explains
|
||||
the new tools in detail. In the spirit of improving the existing feature
|
||||
set, Genode 15.02 vastly improves the performance and stability of our version of
|
||||
VirtualBox running on the NOVA microhypervisor, solves long-standing shortcomings
|
||||
of memory management on machines with a lot of RAM, addresses NOVA-related
|
||||
scalability limitations, stabilizes our Rump-kernel-based file-system server,
|
||||
and refines the configuration interface of the Intel wireless driver.
|
||||
|
||||
As the most significant new feature, the new version introduces virtualization
|
||||
support for ARM to our custom base-hw kernel. Section [Virtualization on ARM]
|
||||
outlines the design and implementation of this feature, which was greatly
|
||||
inspired by NOVA's virtualization architecture and has been developed over the
|
||||
time span of more than a year.
|
||||
|
||||
With respect to platform support, we are happy to accommodate the upcoming
|
||||
USB-Armory board, which is a computer in the form factor of a USB
|
||||
stick especially geared towards security applications. Section
|
||||
[Support for the USB-Armory board] covers the background and the current
|
||||
state of this line of work.
|
||||
|
||||
|
||||
Virtualization on ARM
|
||||
#####################
|
||||
|
||||
The ARMv7 architecture of recent processors like Cortex-A7, Cortex-A15, or
|
||||
Cortex-A17 CPUs support hardware extensions to facilitate virtualization of
|
||||
guest operating systems. With the current release, we enable the use of these
|
||||
virtualization extensions in our custom base-hw kernel when running on the
|
||||
Cortex-A15-based Arndale board.
|
||||
|
||||
While integrating ARM's virtualization extension, we aimed to strictly follow
|
||||
microkernel-construction principles. The primary design is inspired by the
|
||||
[http://hypervisor.org/ - NOVA OS Virtualization Architecture]. It is based on a
|
||||
microhypervisor that provides essential microkernel mechanisms along with
|
||||
basic primitives to switch between virtual machines (VMs). On top of the
|
||||
microhypervisor, classical OS services are implemented as
|
||||
ordinary, unprivileged user-level components. Those services can be used by other
|
||||
applications. Services may be shared between applications or instantiated
|
||||
separately, according to security and safety needs. Correspondingly,
|
||||
following the NOVA principles, each VM has its own associated virtual-machine
|
||||
monitor (VMM) that runs as an unprivileged user-level component. VMM implementations
|
||||
can range from simple ones that just emulate primary device requirements to highly
|
||||
complex monitors including sophisticated device models, like VirtualBox. The
|
||||
NOVA approach allows to decouple the TCB complexity of one VM with respect to
|
||||
another, as well as with respect to all components not related to
|
||||
virtualization at all.
|
||||
|
||||
Along those lines, we extended the base-hw kernel/core conglomerate with API
|
||||
extensions that enable user-level VMM components to create and control virtual
|
||||
machines.
|
||||
|
||||
|
||||
Design
|
||||
======
|
||||
|
||||
The ARM virtualization extensions are based on the so-called security
|
||||
extensions, commonly known as
|
||||
[http://genode.org/documentation/articles/trustzone - TrustZone].
|
||||
The ARM designers did not follow the
|
||||
Intel approach to split the CPU into a "root" and a "guest" world while having all prior
|
||||
existing CPU modes available in both worlds. Instead, ARM added a new privilege level
|
||||
to the non-secure side of TrustZone that sits underneath the ordinary kernel
|
||||
and userland privilege levels. It is subjected to a hypervisor-like kernel. All
|
||||
instructions used to prepare a VM's environment have to be executed in this so
|
||||
called "hyp" mode. In hyp mode, some instructions
|
||||
differ from their regular behaviour on the kernel-privilege level.
|
||||
For this reason, prior-existing kernel code cannot simply be reused in
|
||||
hyp mode without modifications.
|
||||
|
||||
The base-hw kernel is meant to execute Genode's core component on bare hardware.
|
||||
Core, which is an ordinary user-level component, is
|
||||
linked together with a slim kernel library that is executed in privileged kernel
|
||||
mode. To enable ARM hardware virtualization, we pushed this approach
|
||||
even further by executing core in three different privilege levels. Thereby,
|
||||
core shares the same view on hardware resources and virtual memory across all
|
||||
levels. A code path is executed on a higher privilege level only if the code
|
||||
would fail to execute on a lower privilege level.
|
||||
Following this approach, we were able to keep most of the existing kernel code
|
||||
with no modifications.
|
||||
|
||||
[image avirt_overview]
|
||||
Genode's ARM kernel (core) runs across all privilege levels
|
||||
|
||||
The hypervisor part of core is solely responsible to switch between VMs and the
|
||||
host system. Therefore, it needs to load/store additional CPU state that
|
||||
normally remains untouched during context switches of ordinary tasks. It also needs to
|
||||
configure the VM's guest-physical to host-physical memory translations. Moreover, the
|
||||
virtualization extensions of the ARMv7 architecture are not related to the CPU
|
||||
cores only. The interrupt controller and the CPU-local timers are also
|
||||
virtualization-aware. Therefore, the hypervisor has to load/store state specific
|
||||
to those devices, too. Nevertheless, the hypervisor merely reloads those
|
||||
devices. It does not interpret their state.
|
||||
|
||||
In contrast to the low-complexity hypervisor, a user-level VMM can be complex
|
||||
without putting the system's security at risk. It contains potentially complex
|
||||
device-emulation code and assigns hardware resources such as memory and
|
||||
interrupts to the VM. The VMM is an ordinary user-level component running
|
||||
unprivileged. Of course, as a plain user-level component, it is not able to
|
||||
directly access hardware resources. Hence an interface between VMMs and the
|
||||
kernel is needed to share the state of a virtual machine. In the past, we faced a similar
|
||||
problem when building a VMM for our former TrustZone experiments. It was natural
|
||||
to build upon the available solution and to extend it where necessary. Core
|
||||
provides a so-called VM service. Each VM corresponds to a session of this
|
||||
service. The session provides the following extended interface:
|
||||
|
||||
:CPU state:
|
||||
The CPU-state function returns a dataspace containing the virtual machine's
|
||||
state. The state is initialized by the VMM before bootstrapping the VM, gets updated
|
||||
by the hypervisor whenever it switches away from the VM, and can be used by
|
||||
the VMM to interpret the behavior of the guest OS. Moreover, the CPU state can be
|
||||
updated after the virtual machine monitor emulated instructions
|
||||
for the VM.
|
||||
|
||||
:Exception handler:
|
||||
The second function is used to register a signal handler that gets informed
|
||||
whenever the VM produces a virtualization fault.
|
||||
|
||||
:Run:
|
||||
The run function starts or resumes the execution of the VM.
|
||||
|
||||
:Pause:
|
||||
The pause function removes the VM from the kernel's scheduler.
|
||||
|
||||
:Attach:
|
||||
This function attaches a given RAM dataspace to a designated area of the
|
||||
guest-physical address space.
|
||||
|
||||
:Detach:
|
||||
The detach function invalidates a designated area of the guest-physical
|
||||
address space.
|
||||
|
||||
:Attach_pic: Tells the hypervisor to attach the CPU's virtual interface of the
|
||||
virtualization-aware interrupt controller to a designated area of the
|
||||
guest-physical address space.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
By strictly following the micro-kernel construction principles when integrating the
|
||||
hypervisor into the base-hw kernel, we reached a minimally invasive solution. In
|
||||
doing so, we took the time to separate TrustZone-specific code that was formerly
|
||||
an inherent part of the kernel on ARMv7 platforms. Now, TrustZone- and
|
||||
virtualization-specific aspects are incorporated into the kernel only if
|
||||
actually used. The change in complexity of the whole core component expressed in
|
||||
lines of code is shown in the table below. As can be seen, the additional code in
|
||||
the root of the trusted computing base when using virtualization is about 700-800
|
||||
LOC.
|
||||
|
||||
Platform | with TrustZone, no VT | TrustZone/VT optional
|
||||
-----------------------------------------------------------------
|
||||
hw_arndale | 17970 LOC | 18730 LOC
|
||||
----------------------------------------------------------------
|
||||
hw_imx53_qsb | 17900 LOC | 17760 LOC
|
||||
----------------------------------------------------------------
|
||||
hw_imx53_qsb_tz | 18260 LOC | 18320 LOC
|
||||
----------------------------------------------------------------
|
||||
hw_rpi | 17500 LOC | 17430 LOC
|
||||
----------------------------------------------------------------
|
||||
hw_panda | 18040 LOC | 17880 LOC
|
||||
----------------------------------------------------------------
|
||||
hw_odroid_xu | 17980 LOC | 18050 LOC
|
||||
|
||||
Besides the VM world switch, we enabled support for the so-called "large
|
||||
physical address extension" (LPAE), which is obligatory when using
|
||||
virtualization. It allows for addressing a 40-bit instead of only 32-bit physical
|
||||
address space. Moreover, to execute in hypervisor mode, the bootstrap code of
|
||||
the kernel had to be set up properly. Hence, when booting on the Arndale board,
|
||||
the kernel now prepares the non-secure TrustZone world first, and finally leaves the
|
||||
secure world forever.
|
||||
|
||||
To test and showcase the ARM virtualization features integrated in base-hw, we
|
||||
implemented a minimal, exemplary VMM. It can be found in
|
||||
_repos/os/src/server/vmm_. The VMM emulates a simplified variant of ARM's
|
||||
Versatile Express Cortex-A15 development platform. Currently, it only comprises
|
||||
support for the CPU, the timer, the interrupt controller, and a UART device. It is
|
||||
written in 1100 lines of C++ in addition to the base Genode libraries. The VMM
|
||||
is able to boot a vanilla Linux kernel compiled with a slightly modified
|
||||
standard configuration (no-SMP), and a device tree description stripped down to
|
||||
the devices provided by the VMM. This release includes an automated run test that
|
||||
executes the Linux kernel on top of the VMM on Genode. It can be started via:
|
||||
|
||||
! make run/vmm
|
||||
|
||||
[image avirt_screen]
|
||||
Three Linux serial consoles running in parallel on top of Genode
|
||||
|
||||
|
||||
Modular tool kit for automated testing
|
||||
######################################
|
||||
|
||||
In
|
||||
[http://genode.org/documentation/release-notes/13.05#Automated_quality-assurance_testing - Genode version 13.05],
|
||||
we already introduced comprehensive support for the automated testing of
|
||||
Genode scenarios. Since then, Genode Labs has significantly widened the scope
|
||||
of its internal test infrastructure, both in terms of the coverage of the test
|
||||
scenarios as well as the variety of the used hardware platforms.
|
||||
|
||||
The centerpiece of our test infrastructure is the so-called run tool. Steered
|
||||
by a script (run script), it performs all the steps necessary to test drive
|
||||
a Genode system scenario. Those steps are:
|
||||
|
||||
# *Building* the components of a scenario
|
||||
# *Configuration* of the init component
|
||||
# Assembly of the *boot directory*
|
||||
# Creation of the *boot image*
|
||||
# *Powering-on* the test machine
|
||||
# *Loading* of the boot image
|
||||
# Capturing the *LOG output*
|
||||
# *Validation* of the scenario behavior
|
||||
# *Powering-off* the test machine
|
||||
|
||||
Each of those steps depends on various parameters such as the
|
||||
used kernel, the hardware platform used to run the scenario, the
|
||||
way the test hardware is connected to the test infrastructure
|
||||
(e.g., UART, AMT, JTAG, network), the way the test hardware is powered or
|
||||
reseted, or the way of how the scenario is loaded into the test hardware.
|
||||
Naturally, to accommodate the growing variety of combinations of those
|
||||
parameters, the complexity of the run tool increased over time.
|
||||
This growth of complexity prompted us to eventually turn the run tool into a
|
||||
highly modular and extensible tool kit.
|
||||
|
||||
Originally, the run tool consisted of built-in rules that could be
|
||||
extended and tweaked by a kernel-specific supplement called run environment.
|
||||
The execution of a run script used to depend on the policies built into
|
||||
the run tool, the used run environment, and optional configuration
|
||||
parameters (run opts).
|
||||
|
||||
The new run tool kit replaces most of the formerly built-in policies by the
|
||||
ability to select and configure different modules for the various steps.
|
||||
The selection and configuration of the modules is expressed in the run-tool
|
||||
configuration. There exist the following types of modules:
|
||||
|
||||
:boot-dir modules:
|
||||
These modules contain the functionality to populate the boot directory
|
||||
and are specific to each kernel. It is mandatory to always include the
|
||||
module corresponding to the used kernel.
|
||||
|
||||
_(the available modules are: linux, hw, okl4, fiasco, pistachio, nova,_
|
||||
_codezero, foc)_
|
||||
|
||||
:image modules:
|
||||
These modules are used to wrap up all components used by the run script
|
||||
in a specific format and thereby prepare them for execution.
|
||||
Depending on the used kernel, different formats can be used. With these
|
||||
modules, the creation of ISO and disk images is also handled.
|
||||
|
||||
_(the available modules are: uboot, disk, iso)_
|
||||
|
||||
:load modules:
|
||||
These modules handle the way the components are transfered to the
|
||||
target system. Depending on the used kernel there are various options
|
||||
to pass on the components. For example, loading from TFTP or via JTAG is handled
|
||||
by the modules of this category.
|
||||
|
||||
_(the available modules are: tftp, jtag, fastboot)_
|
||||
|
||||
:log modules:
|
||||
These modules handle how the output of a currently executed run script
|
||||
is captured.
|
||||
|
||||
_(the available modules are: qemu, linux, serial, amt)_
|
||||
|
||||
:power_on modules:
|
||||
These modules are used for bringing the target system into a defined
|
||||
state, e.g., by starting or rebooting the system.
|
||||
|
||||
_(the available modules are: qemu, linux, softreset, powerplug, amt)_
|
||||
|
||||
:power_off modules:
|
||||
These modules are used for turning the target system off after the
|
||||
execution of a run script.
|
||||
|
||||
_(the available modules are: powerplug)_
|
||||
|
||||
When executing a run script, only one module of each category must be used.
|
||||
|
||||
Each module has the form of a script snippet located under the
|
||||
_tool/run/<step>/_
|
||||
directory where _<step>_ is a subdirectory named after the module type.
|
||||
Further instructions about the use of each module (e.g., additional
|
||||
configuration arguments) can be found in the form of comments inside the
|
||||
respective script snippets.
|
||||
Thanks to this modular structure,
|
||||
the extension of the tool kit comes down to adding a file at the corresponding
|
||||
module-type subdirectory. This way, custom work flows (such as tunneling JTAG
|
||||
over SSH) can be accommodated fairly easily.
|
||||
|
||||
|
||||
Usage examples
|
||||
==============
|
||||
|
||||
To execute a run script, a combination of modules may be used. The combination
|
||||
is controlled via the RUN_OPT variable used by the build framework. Here are a
|
||||
few common exemplary combinations:
|
||||
|
||||
Executing NOVA in Qemu:
|
||||
|
||||
!RUN_OPT = --include boot_dir/nova \
|
||||
! --include power_on/qemu --include log/qemu --include image/iso
|
||||
|
||||
Executing NOVA on a real x86 machine using AMT for resetting the target system
|
||||
and for capturing the serial output while loading the files via TFTP:
|
||||
|
||||
!RUN_OPT = --include boot_dir/nova \
|
||||
! --include power_on/amt --power-on-amt-host 10.23.42.13 \
|
||||
! --power-on-amt-password 'foo!' \
|
||||
! --include load/tftp --load-tftp-base-dir /var/lib/tftpboot \
|
||||
! --load-tftp-offset-dir /x86 \
|
||||
! --include log/amt --log-amt-host 10.23.42.13 \
|
||||
! --log-amt-password 'foo!'
|
||||
|
||||
Executing Fiasco.OC on a real x86 machine using AMT for resetting, USB serial
|
||||
for output while loading the files via TFTP:
|
||||
|
||||
!RUN_OPT = --include boot_dir/foc \
|
||||
! --include power_on/amt --amt-host 10.23.42.13 --amt-password 'foo!' \
|
||||
! --include load/tftp --tftp-base-dir /var/lib/tftpboot \
|
||||
! --tftp-offset-dir /x86 \
|
||||
! --include log/serial --log-serial-cmd 'picocom -b 115200 /dev/ttyUSB0'
|
||||
|
||||
Executing base-hw on a Raspberry Pi using powerplug to reset the hardware,
|
||||
JTAG to load the image and USB serial to capture the output:
|
||||
|
||||
!RUN_OPT = --include boot_dir/hw \
|
||||
! --include power_on/powerplug --power-on-powerplug-ip 10.23.42.5 \
|
||||
! --power-on-powerplug-user admin \
|
||||
! --power-on-powerplug-password secret \
|
||||
! --power-on-powerplug-port 1
|
||||
! --include power_off/powerplug --power-off-powerplug-ip 10.23.42.5 \
|
||||
! --power-off-powerplug-user admin \
|
||||
! --power-off-powerplug-password secret \
|
||||
! --power-off-powerplug-port 1
|
||||
! --include load/jtag \
|
||||
! --load-jtag-debugger /usr/share/openocd/scripts/interface/flyswatter2.cfg \
|
||||
! --load-jtag-board /usr/share/openocd/scripts/interface/raspberrypi.cfg \
|
||||
! --include log/serial --log-serial-cmd 'picocom -b 115200 /dev/ttyUSB0'
|
||||
|
||||
After the run script was executed successfully, the run tool will print the
|
||||
string 'Run script execution successful.". This message can be used to check
|
||||
for the successful completion of the run script when doing automated testing.
|
||||
|
||||
|
||||
Meaningful default behaviour
|
||||
============================
|
||||
|
||||
To maintain the ease of use of creating and using a build directory, the
|
||||
'create_builddir' tool equips a freshly created build directory with a meaningful
|
||||
default configuration that depends on the selected platform. For example, if
|
||||
creating a build directory for the Linux base platform, RUN_OPT
|
||||
is initially defined as
|
||||
|
||||
! RUN_OPT = --include boot_dir/linux \
|
||||
! --include power_on/linux --include log/linux
|
||||
|
||||
|
||||
Low-level OS infrastructure
|
||||
###########################
|
||||
|
||||
Improved management of physical memory
|
||||
======================================
|
||||
|
||||
On machines with a lot of memory, there exist constraints with regard to
|
||||
the physical address ranges of memory:
|
||||
|
||||
* On platforms with a non-uniform memory architecture, subsystems should
|
||||
preferably use memory that is local to the CPU cores the subsystem is using.
|
||||
Otherwise the performance is impeded by costly memory accesses to
|
||||
the memory of remote computing nodes.
|
||||
|
||||
* Unless an IOMMU is used, device drivers program physical addresses
|
||||
into device registers to perform DMA operations. Legacy devices such as
|
||||
USB UHCI controllers expect a 32-bit address. Consequently, the memory
|
||||
used as DMA buffers for those devices must not be allocated above 4 GiB.
|
||||
|
||||
* When using an IOMMU on NOVA, Genode represents the address space
|
||||
accessible by devices (by the means of DMA) using a so-called device PD
|
||||
([http://genode.org/documentation/release-notes/13.02#DMA_protection_via_IOMMU]).
|
||||
DMA transactions originating from PCI devices are subjected to the virtual
|
||||
address space of the device PD.
|
||||
All DMA buffers are identity-mapped with their physical addresses within
|
||||
the device PD. On 32-bit systems with more than 3 GiB of memory, this
|
||||
creates a problem. Because the device PD is a regular user-level component, the
|
||||
upper 1 GiB of its virtual address space is preserved for the kernel. Since
|
||||
no user-level memory objects can be attached to this
|
||||
area, the physical address range to be used for DMA buffers is limited
|
||||
to the lower 3 GiB.
|
||||
|
||||
Up to now, Genode components had no way to influence the allocation of
|
||||
memory with respect to physical address ranges. To solve the problems outlined
|
||||
above, we extended core's RAM services to take allocation constraints
|
||||
as session arguments when a RAM session is created. All dataspaces created
|
||||
from such a session are subjected to the specified constraints. In particular,
|
||||
this change enables the AHCI/PCI driver to allocate DMA buffers at suitable
|
||||
physical address ranges.
|
||||
|
||||
This innocent looking feature to constrain RAM allocations raises a problem
|
||||
though: If any component is able to constrain RAM allocations in
|
||||
arbitrary ways, it would become able to scan the physical address space for
|
||||
allocated memory by successively opening RAM sessions with the constraints set
|
||||
to an individual page and observe whether an allocation succeeds or not. Two
|
||||
conspiring components could use this information to construct a covert storage
|
||||
channel.
|
||||
|
||||
To prevent such an abuse, the init component filters out allocations
|
||||
constrains from RAM-session requests unless explicitly permitted. The
|
||||
permission is granted by supplementing the RAM resource assignment of
|
||||
a component with a new 'constrain_phys' attribute. For example:
|
||||
|
||||
! <resource name="RAM" quantum="3M" constrain_phys="yes"/>
|
||||
|
||||
|
||||
Init component
|
||||
==============
|
||||
|
||||
Most of Genode's example scenarios in the form of run scripts support
|
||||
different platforms. However, as the platform details vary, the run scripts
|
||||
have to tweak the configuration of the init component according to the
|
||||
features of the platform.
|
||||
For example, when declaring an explicit route to a framebuffer driver named
|
||||
"fb_drv", the run script won't work on Linux because on this platform, the
|
||||
framebuffer driver is called "fb_sdl".
|
||||
Another example is the role of the USB driver. Depending on the platform, the
|
||||
USB driver is an input driver, a block driver, a networking driver, or a
|
||||
combination of those.
|
||||
Consequently, run scripts with support
|
||||
for a great variety of platforms tend to become convoluted with
|
||||
platform-specific conditionals.
|
||||
|
||||
To counter this problem, we enhanced init to support aliases for component
|
||||
names. By defining the following aliases in the init configuration
|
||||
! <alias name="nic_drv" child="usb_drv"/>
|
||||
! <alias name="input_drv" child="usb_drv"/>
|
||||
! <alias name="block_drv" child="usb_drv"/>
|
||||
the USB driver becomes reachable for session requests routed to either "usb_drv",
|
||||
"nic_drv", "input_drv", and "block_drv". Consequently, the routing
|
||||
configuration of components that use either of those drivers does no longer
|
||||
depend on any platform-intrinsic knowledge.
|
||||
|
||||
|
||||
RTC session interface
|
||||
=====================
|
||||
|
||||
Until now, the RTC session interface used an integer to return the current
|
||||
time. Although this is preferable when performing time-related
|
||||
calculations, a structured representation is more convenient to use, i.e., if
|
||||
the whole purpose is showing the current time. This interface change is only
|
||||
visible to components that use the RTC session directly.
|
||||
|
||||
Since the current OS API of Genode lacks time-related functions, most users
|
||||
end up using the libc, which already converts the structured time stamp
|
||||
internally, or provide their own time related functions.
|
||||
|
||||
|
||||
Update of rump-kernel-based file systems
|
||||
========================================
|
||||
|
||||
We updated the rump-kernel support to a newer rump-kernel version (as of mid of
|
||||
January 2015). This way, Genode is able to benefit from upstream stability
|
||||
improvements related to the memory management. Furthermore, we revised the
|
||||
Genode backend to allow the rump_fs server to cope well with a large amount of
|
||||
memory assigned to it. The latter is useful to utilize the block cache of the
|
||||
NetBSD kernel.
|
||||
|
||||
|
||||
Libraries and applications
|
||||
##########################
|
||||
|
||||
As a stepping stone in the
|
||||
[https://github.com/genodelabs/genode/issues/1399 - forthcoming community effort]
|
||||
to bring the Nix package manager to Genode, ports of libbz2 and sqlite have
|
||||
been added to the _repos/libports/_ repository.
|
||||
|
||||
|
||||
Runtime environments
|
||||
####################
|
||||
|
||||
VirtualBox on NOVA
|
||||
==================
|
||||
|
||||
Whereas our previous efforts to run VirtualBox on Genode/NOVA were mostly
|
||||
concerned with enabling principal functionality and with the addition of
|
||||
features, we took the release cycle of Genode 15.02 as a chance to focus
|
||||
on performance and stability improvements.
|
||||
|
||||
|
||||
:Performance:
|
||||
|
||||
Our goal with VirtualBox on NOVA is to achieve a user experience
|
||||
comparable to running VirtualBox on Linux. Our initial port of VirtualBox used
|
||||
to cut a lot of corners with regards to performance and timing accuracy
|
||||
because we had to concentrate on more fundamental issues of the porting
|
||||
work first. Now, with the feature set settled, it was time to revisit
|
||||
and solidify our interim solutions.
|
||||
|
||||
The first category of performance improvements is the handling of timing,
|
||||
and virtual guest time in particular. In our original version,
|
||||
we could observe a substantial drift of the guest time compared to the host time.
|
||||
The drift is not merely inconvenient but may even irritate the guest OS
|
||||
because it violates its assumptions about the behaviour of certain virtual devices.
|
||||
The drift was caused by basing the timing on a simple jiffies counter
|
||||
that was incremented by a thread after sleeping for a fixed period. Even
|
||||
though the thread almost never executes, there is still a chance that it gets
|
||||
preempted by the kernel and resumed only after the time slices of
|
||||
concurrently running threads have elapsed. This can take tens of milliseconds.
|
||||
During this time, the jiffies counter remains unchanged. We could
|
||||
significantly reduce the drift by basing the timing on absolute time values
|
||||
requested from the timer driver. Depending on the used guest OS, however,
|
||||
there is still a residual inaccuracy left, which is subject to ongoing
|
||||
investigations.
|
||||
|
||||
The second type of improvements is related to the handling of virtual
|
||||
interrupts. In its original habitat, VirtualBox relies on so-called
|
||||
external-interrupt virtualization events. If a host interrupt occurs while the
|
||||
virtual machine is active, the virtualization event is forwarded by the
|
||||
VirtualBox hypervisor to the virtual machine monitor (VMM).
|
||||
On NOVA, however, the kernel does not propagate this
|
||||
condition to the user-level VMM because the occurrence of host interrupts should
|
||||
be of no matter to the VMM. In the event of a host interrupt, NOVA takes
|
||||
a normal scheduling decision (eventually activating the user-level device driver
|
||||
the interrupt belongs to) and leaves the virtual CPU (vCPU) in a runnable
|
||||
state - to be rescheduled later. Once the interrupt is handled, the vCPU gets
|
||||
resumed. The VMM remains out of the loop. Because the update of the VirtualBox
|
||||
device models ultimately relies on the delivery of external-interrupt
|
||||
virtualization events, the lack of this kind of event introduced huge delays
|
||||
with respect to the update of device models and the injection of virtual
|
||||
interrupts. We solved this problem by exploiting a VirtualBox-internal
|
||||
mechanism called POKE. By setting the so-called POKE flag, an I/O thread is
|
||||
able to express its wish to force the virtual machine into the VMM. We only
|
||||
needed to find the right spots to set the POKE flag.
|
||||
|
||||
Another performance-related optimization is the caching of RTC time
|
||||
information inside VirtualBox. The original version of the gettimeofday
|
||||
function used by VirtualBox contacted the RTC server for obtaining the
|
||||
wall-clock time on each call. After the update to VirtualBox 4.3, the rate of those
|
||||
calls increased significantly. To reduce the costs of these calls, our
|
||||
new version of gettimeofday combines infrequent calls to the RTC driver
|
||||
with a component-local time source based on the jiffies mechanism mentioned above.
|
||||
|
||||
With these optimizations in place,
|
||||
simple benchmarks like measuring the boot time of Window 7 or the time of
|
||||
compiling Genode within a Debian VM suggest that our version of VirtualBox
|
||||
has reached a performance that is roughly on par with the Linux version.
|
||||
|
||||
|
||||
:Stability:
|
||||
|
||||
Since the upgrade to VirtualBox 4.3.16 in release 14.11, we fixed several
|
||||
regression issues caused by the upgrade. Beside that, we completed the
|
||||
support to route serial output of guests to Genode, lifted the restriction
|
||||
to use just one fixed VESA mode, and enabled support for 32-bit Windows 8
|
||||
guests on 64-bit Genode/NOVA. The 64-bit host restriction stems from
|
||||
the fact that Windows 8 requires support for the non-executable bit (NX)
|
||||
feature of page tables. The 32-bit version of the NOVA kernel does not leverage
|
||||
the physical address extension (PAE) feature, which is a pre-requisite for
|
||||
using NX on 32-bit.
|
||||
|
||||
In the course of the adaptation, our port of VirtualBox now evaluates the
|
||||
PAE and HardwareVirtExUX XML tags of .vbox files:
|
||||
|
||||
!<VirtualBox xmlns=...>
|
||||
! <Machine uuid=...>
|
||||
! <Hardware ..>
|
||||
! <CPU ...>
|
||||
! <HardwareVirtExUX enabled="true"/>
|
||||
! <PAE enabled="true"/>
|
||||
! ...
|
||||
|
||||
The PAE tag specifies whether to report PAE capabilities to the guest
|
||||
or not. The HardwareVirtExUx tag is used by our port to decide whether to stay
|
||||
for non-paged x86 modes in Virtualbox's recompiler (REM) or not. Until now, we used REM
|
||||
to emulate execution when the guest was running in real mode and protected mode
|
||||
with paging disabled. However, newer Intel machines support the unrestricted guest
|
||||
feature, which makes the usage of REM in non-paged modes not strictly
|
||||
necessary anymore. Setting the HardwareVirtExUx tag to false accommodates
|
||||
older machines with no support for the unrestricted-guest feature.
|
||||
|
||||
|
||||
Device drivers
|
||||
##############
|
||||
|
||||
iPXE-based network drivers
|
||||
==========================
|
||||
|
||||
We enabled and tested the driver with Intel I218-LM and I218-V PCI devices.
|
||||
|
||||
|
||||
Intel wireless stack
|
||||
====================
|
||||
|
||||
In this release, several small issues regarding the wireless stack are fixed.
|
||||
From now on, the driver only probes devices on the PCI bus that correspond to
|
||||
the PCI_CLASS_NETWORK_OTHER device class. Prior to that, the driver probed all
|
||||
devices attached to the bus resulting in problems with other devices, e.g.
|
||||
the GPU, when accessing their extended PCI config space.
|
||||
Since the driver uses cooperative scheduling internally, it must never block
|
||||
or, in case it blocks, must schedule another task. Various sleep functions
|
||||
lacked this scheduling call and are now fixed. Furthermore, a bug in the timer
|
||||
implementation has been corrected, which caused the scheduling of wrong timeouts.
|
||||
In addition to these fixes, patches for enabling the support for
|
||||
Intel 7260 cards were incorporated.
|
||||
|
||||
Up to now, the configuration of the wireless driver was rather inconvenient because
|
||||
it did not export any information to the system. The driver now creates two
|
||||
distinct reports to communicate its state and information about the wireless
|
||||
infrastructure to other components. The first one is a list of all available
|
||||
access points. The following exemplary report shows its structure:
|
||||
|
||||
!<wlan_accesspoints>
|
||||
! <accesspoint ssid="skynet" bssid="00:01:02:03:04:05" quality="40"/>
|
||||
! <accesspoint ssid="foobar" bssid="01:02:03:04:05:06" quality="70" protection="WPA-PSK"/>
|
||||
! <accesspoint ssid="foobar" bssid="01:02:03:04:05:07" quality="10" protection="WPA-PSK"/>
|
||||
!</wlan_accesspoints>
|
||||
|
||||
Each '<accesspoint>' node has attributes that contain the SSID and the BSSID
|
||||
of the access point as well as the link quality (signal strength). These
|
||||
attributes are mandatory. If the network is protected, the node will also
|
||||
have an attribute describing the type of protection in addition.
|
||||
|
||||
The second report provides information about the state of the connection
|
||||
with the currently associated access point:
|
||||
|
||||
!<wlan_state>
|
||||
! <accesspoint ssid="foobar" bssid="01:02:03:04:05:06" quality="70"
|
||||
! protection="WPA-PSK" state="connected"/>
|
||||
!</wlan_state>
|
||||
|
||||
Valid state values are 'connected', 'disconnected', 'connecting' and
|
||||
'disconnecting'.
|
||||
|
||||
The driver obtains its configuration via a ROM module. This ROM
|
||||
module contains the selected access point and can be updated during runtime.
|
||||
To connect to an access point, a configuration like the following is used:
|
||||
|
||||
!<selected_accesspoint ssid="foobar" bssid="01:02:03:04:05:06"
|
||||
! protection="WPA-PSK" psk="foobar123!"/>
|
||||
|
||||
To disconnect from an access point, an empty configuration can be set:
|
||||
|
||||
!<selected_accesspoint/>
|
||||
|
||||
For now, the prevalent WPA/WPA2 protection using a pre-shared key is supported.
|
||||
|
||||
|
||||
Improved UART driver for Exynos5
|
||||
================================
|
||||
|
||||
The UART driver for the Exynos5 SoC has been enhanced by enabling the RX
|
||||
channel. This improvement was motivated by automated tests, where a run script
|
||||
needs to interact with some component via a terminal connection.
|
||||
|
||||
|
||||
Touchscreen support
|
||||
===================
|
||||
|
||||
We enabled support of Wacom USB touchscreen devices via dde_linux - a port of
|
||||
Linux USB driver to Genode. In order to make touchscreen coordinates
|
||||
usable by Genode's input services, they must be calibrated
|
||||
to screen-absolute coordinates. The screen resolution is not determined
|
||||
automatically by the USB driver. It can, however, be configured as a sub
|
||||
node of the '<hid>' XML tag of the USB driver's configuration:
|
||||
|
||||
!<start name="usb_drv">
|
||||
! ...
|
||||
! <config uhci=... ohci=... xhci=...>
|
||||
! <hid>
|
||||
! <screen width="1024" height="768"/>
|
||||
! </hid>
|
||||
! ...
|
||||
|
||||
|
||||
USB session interface
|
||||
=====================
|
||||
|
||||
We enhanced our USB driver with the support of remote USB sessions. This
|
||||
feature makes it possible to implement USB-device drivers outside the USB
|
||||
server using a native Genode API. The new USB session can be found under
|
||||
_repos/os/include/usb_session_ and can be used to communicate with the USB
|
||||
server, which merely acts as a host controller and HUB driver in this scenario.
|
||||
Under _repos/os/include/usb_, there are a number of convenience
|
||||
and wrapper functions that operate directly on top of a USB session. These
|
||||
functions are meant to ease the task of USB-device-driver programming by hiding
|
||||
most of the USB session management, like packet-stream handling.
|
||||
|
||||
We also added a USB terminal server, which exposes a Genode terminal session to
|
||||
its clients and drives the popular PL2303 USB to UART adapters using the new
|
||||
USB-session interface.
|
||||
A practical use case for this component is the transmission of logging data on
|
||||
systems where neither UART, AMT, nor JTAG are available. A run script
|
||||
showcasing this feature can be found at _repos/dde_linux/run/usb_terminal.run_.
|
||||
|
||||
|
||||
RTC proxy driver for Linux
|
||||
==========================
|
||||
|
||||
There are a handful of run scripts that depend on the RTC service. So far,
|
||||
it was not possible to run these tests on Linux due to the lack of an RTC
|
||||
driver on this platform. To address this problem, we created a proxy driver
|
||||
that uses the time() system call to provide a
|
||||
reasonable base period on Linux.
|
||||
|
||||
|
||||
Platforms
|
||||
#########
|
||||
|
||||
Execution on bare hardware (base-hw)
|
||||
====================================
|
||||
|
||||
Support for the USB-Armory board
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
With [https://www.crowdsupply.com/inverse-path/usb-armory - USB Armory],
|
||||
there is an intriguing hardware platform for Genode on the horizon.
|
||||
In short, USB Armory is a computer in the form factor of a USB
|
||||
stick. It is meant for security applications such as VPNs,
|
||||
authentication tokens, and encrypted storage. It is based on the
|
||||
FreeScale i.MX53 SoC, which is well supported by Genode, i.e.,
|
||||
Genode can be used as secure-world OS besides Linux running in the
|
||||
normal world.
|
||||
Apart from introducing a novel form factor, this project is
|
||||
interesting because it strives to be an 100% open platform, which
|
||||
includes hardware, software, and firmware. This motivated us to
|
||||
bring Genode to this platform.
|
||||
|
||||
The underlying idea is to facilitate
|
||||
[http://genode.org/documentation/articles/trustzone - ARM TrustZone] to
|
||||
use Genode as a companion to a Linux-based OS on the platform.
|
||||
Whereas Linux would run in the normal world of TrustZone, Genode runs
|
||||
in the secure world. With Linux, the normal world will control the
|
||||
communication over USB and provide a familiar environment to implement
|
||||
USB-Armory applications. However, security-critical functions and data like
|
||||
cryptographic keys will reside exclusively in the secure world. Even in
|
||||
the event that Linux gets compromised, the credentials of the user
|
||||
will stay protected.
|
||||
|
||||
The support of the USB Armory platform was added in two steps:
|
||||
First, we enabled our base-hw kernel to run as TrustZone monitor with
|
||||
Genode on the "secure side". Since the USB Armory is based on the
|
||||
FreeScale i.MX53 SoC, which Genode already supported, this step went
|
||||
relatively straight-forward.
|
||||
|
||||
Second, we enabled a recent version of the Linux kernel (3.18) to run in the
|
||||
normal world. The normal world is supervised by a user-level Genode component
|
||||
called tz_vmm (TrustZone Virtual Machine Monitor). The tz_vmm is, among
|
||||
others, responsible for providing startup and hardware information to the
|
||||
non-secure guest. The Linux kernel version we used previously as TrustZone
|
||||
guest on i.MX53 boards expected this information to be communicated via
|
||||
so-called ATAGs. The new version, however, expects this to be done via a
|
||||
device tree blob. As a consequence, the tz_vmm had to be adapted to properly
|
||||
load this blob into the non-secure RAM. The original USB-Armory device tree
|
||||
was modified to blind out the RAM regions that get protected by the TrustZone
|
||||
hardware. This way, Linux won't attempt to access them. Furthermore,
|
||||
to keep basic user interaction simple, our device tree tells Linux to use the
|
||||
same non-secure UART as Genode for console I/O.
|
||||
|
||||
The kernel itself received some modifications, for two reasons. First,
|
||||
we don't want Linux to rely on resources that are protected to keep
|
||||
the secure world secure. This is why the driver for the interrupt controller
|
||||
that originally made use of the TrustZone interrupt configuration, had to be
|
||||
adapted. Second, to prevent Linux from disturbing Genode activities, we
|
||||
disabled most of the dynamic clock and power management as it may sporadically
|
||||
gear down or even disable hardware that Genode relies on. Furthermore, we
|
||||
disabled the Linux drivers for I2C interfaces and the GPIO configuration as
|
||||
these are reserved for Genode.
|
||||
|
||||
|
||||
IPC helping
|
||||
~~~~~~~~~~~
|
||||
|
||||
In traditional L4 microkernels, scheduling parameters (like time-slice
|
||||
length and priority) used to be bound to threads. Usually, those parameters
|
||||
are defined at thread creation time. The initial version
|
||||
of base-hw followed this traditional approach. However, it has a few problems:
|
||||
|
||||
* For most threads, the proper *choice of scheduling parameters* is very
|
||||
difficult if not impossible. For example, the CPU-time demands of a
|
||||
server thread may depend on the usage patterns of its clients. Most
|
||||
theoretical work in the domain of scheduling presumes the knowledge of
|
||||
job lengths in advance of computing a schedule. But in practice and in
|
||||
particular in general-purpose computing, job lengths are hardly known a priori.
|
||||
As a consequence, in most scenarios, scheduling parameters are
|
||||
set to default values.
|
||||
|
||||
* With each thread being represented as an independent schedulable entity,
|
||||
the kernel has to take a scheduling decision each time a thread performs an
|
||||
IPC call because the calling thread gets blocked and the called thread
|
||||
may get unblocked. In a microkernel-based system, those events occur at a
|
||||
much higher rate than the duration of typical time slices, which puts the
|
||||
scheduler in a *performance-critical* position.
|
||||
|
||||
* Regarding IPC calls, a synchronous flow of control along IPC call chains is
|
||||
desired. Ideally, an IPC call should have the same characteristics as
|
||||
a function call with respect to scheduling. When a client thread performs an
|
||||
IPC call, it expects the server to immediately become active to
|
||||
handle the request. But if the kernel treats each thread independently,
|
||||
it may pick any other thread and thereby introduce *high latencies* into
|
||||
IPC operations.
|
||||
|
||||
To counter those problems, the NOVA microhypervisor introduced a new approach
|
||||
that decouples scheduling parameters from threads. Instead of selecting
|
||||
threads for execution, the scheduler selects so-called scheduling contexts.
|
||||
For a selected scheduling context, the kernel dynamically determines a
|
||||
thread to execute by taking IPC relationships into account. When a thread
|
||||
performs an IPC, the thread's scheduling context will be used to execute
|
||||
the called server. In principle, a server does not need CPU time on its own
|
||||
but always works with CPU resources provided by clients.
|
||||
|
||||
The new version of the base-hw kernel adapts NOVA's approach with slight
|
||||
modifications. Each thread owns exactly one scheduling context for its entire
|
||||
lifetime. However, by the means of "helping" during an IPC call, the caller
|
||||
lends its scheduling context to the callee. Even if the callee is still busy
|
||||
and cannot handle the IPC request right away, the caller helps because it
|
||||
wants the callee to become available for its request as soon as
|
||||
possible. Consequently, a thread has potentially many scheduling contexts at
|
||||
its disposal, its own scheduling context plus all scheduling contexts
|
||||
provisioned by helpers. This works transitively.
|
||||
|
||||
|
||||
Purged outdated platforms
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
We removed the support for two stale platforms that remained unused for
|
||||
more than a year, namely FreeScale i.MX31 and the TrustZone variant
|
||||
of the Coretile Versatile Express board.
|
||||
|
||||
|
||||
NOVA
|
||||
====
|
||||
|
||||
On Genode/NOVA, we used to employ one pager thread in core for each thread
|
||||
in the system. We were forced to do so because not every page
|
||||
fault can be resolved immediately. In some situations, core asynchronously
|
||||
propagates the fault to an external component for the resolution.
|
||||
In the meantime, the
|
||||
pager thread leaves the page fault unanswered. Unfortunately, the kernel
|
||||
provides no mechanism to support this scenario besides just blocking the
|
||||
pager thread using a semaphore. This, in turn, means that the pager thread is not
|
||||
available for other page-fault requests. Ultimately, we had to setup a
|
||||
dedicated pager per thread.
|
||||
|
||||
This implementation has the downside of "wasting" memory for a lot of
|
||||
pager threads. Moreover, it becomes a denial-of-service vector as soon as more
|
||||
threads get created than core can accommodate. The number of threads is
|
||||
limited per address space - also for core - by the size of Genode's context
|
||||
area, which typically means 256 threads.
|
||||
|
||||
To avoid the downsides mentioned, we extended the NOVA IPC reply syscall to
|
||||
specify an optional semaphore capability. The NOVA kernel validates the
|
||||
capability and blocks the faulting thread in the semaphore. The faulted thread
|
||||
remains blocked even after the pager has replied to the fault message. But
|
||||
the pager immediately becomes available for other
|
||||
page-fault requests. With this change, it suffices to maintain only one pager
|
||||
thread per CPU for all client threads.
|
||||
|
||||
The benefits are manifold. First, the base-nova implementation converges more
|
||||
closely to other Genode base platforms. Second, core can not run out of threads
|
||||
anymore as the number of threads in core is fixed for a given setup. And the
|
||||
third benefit is that the helping mechanism of NOVA can be leveraged for
|
||||
concurrently faulting threads.
|
||||
|
||||
|
||||
Build system and tools
|
||||
######################
|
||||
|
||||
Tools for convenient handling of port contrib directories
|
||||
=========================================================
|
||||
|
||||
We supplemented our tools for the ports mechanism with two convenient
|
||||
scripts:
|
||||
|
||||
:_tool/ports/shortcut_:
|
||||
|
||||
Creates a symbolic link from _contrib/<port-name>-<hash>_ to
|
||||
_contrib/<port-name>_. This is useful when working on the third-party
|
||||
code contained in the _contrib_ directory.
|
||||
|
||||
:_tool/ports/current_:
|
||||
|
||||
Prints the current contrib directory of a port. When switching
|
||||
branches back and forth, the hash of the used port might change.
|
||||
The script provides a shortcut to looking up the hash file for a
|
||||
specific port within the repositories and printing its content.
|
||||
|
Loading…
Reference in New Issue