Benchmarking boot latency on x86

Gilad Ben-Yossef July 8th, 2008

One of the tasks embedded system developers face is managing boot latency, or the time it takes for a device to become functional after power up.

After all, a set top box that will take more then 60 seconds from the time the “On” button is pressed until the end customer can at least interact with the menus, for example, will most likely be returned to the store for a refund, no matter what feature set it has. When it come to our gadgets, we are all hungry for immediate satisfaction, it seems.

There are many tricks one can employ to to achieve boot time nirvana, but as Knuth taught us, premature optimization is the root of all evil. Therefore, before we turn our efforts to optimize boot latency, we should first try to measure what that boot latency really is and what part of the boot sequence contribute to it, lest we optimize the wrong thing.

Generally speaking, on a x86 (32bit or 64 bit), the boot process of a Linux system is comprised of the following phases and milestones:

  • The power up milestone: when the power is set to on
  • The BIOS phase: which includes the POST (or Power On Self Test), device initialization, running of option ROMs and loading the boot loader form the MBR (or Master Boot Record).
  • The boot loader phase: loading of an operating system kernel and ancillary data (such as Linux initrd or initramfs) into RAM.
  • Kernel initialization phase: initialization of CPU, peripherals and kernel data structures, including the bring up of the non boot cores in the case of a multi-core machine.
  • First user application first line of code milestone: the time when the first line of user application source code is executed.

Unfortunately, measuring the contribution of each of these phases to the overall boot latency is not an easy task to accomplish. At each of the phases different kind of code is executing on the machine: from the 16 bit BIOS code which is part of the machine firmware, via the 16 or 32 bit boot loader code, the 32 bit kernel code and the finally user applications - each of these is executing in a completely different software environment from the other and it is hard to find a common ground to compare the time each phase takes.

Luckily, the x86 architecture provides a useful tool: the TSC (or Time Stamp Clock) register. The TSC register was introduced in the original Intel Pentium CPU and counts the number of clock ticks from the last processor reset. Reading the current value of the TSC register is done using the RDTSC instruction.

Assuming no processor frequency changes (for example via SpeedStep technology) takes place and that we always sample the register of the same core in a multi core environment, both of which are easy to guarantee during the boot phase, the TSC register provides us with an accurate hi-res timer through which we can benchmarks the various phases in the boot process.

As of Linux kernel version 2.6.11, the Linux kernel supports a build time configuration options called CONFIG_PRINTK_TIMES, which adds a time stamp based on the TSC value to each kernel log message, including during the boot process.

Turning on this option is the first step towards boot latency benchmark, as it provides a break down of the latency of the various points within the kernel initialization phase and is very easy to set up. While we are configuring the kernel, turning off CPU frequency scaling OS support is also required, as if this feature is turned on, it can render the the TSC register useless.

While the PRINTK_TIMES feature is useful to get latency contributions of various kernel subsystems to the overall kernel initialization latency, is does not suffice: the printk timestamps counts from the kernel boot and includes only kernel log messages, and thus it does not take into account the latency induced by BIOS, boot loader code or the time to the first payload application source code line.

In order to measure those we will need to add code that queries and prints the TSC value to run during the boot loader execution and as the first user space program. The code to query the TSC and print it is rather simple. Here is an example (taken from the Wikipedia article on RDTSC) that demonstrates how it is done:

#include <stdint.h>
 
   inline uint64_t rdtsc() {
       uint32_t lo, hi;
       /*
        * cpuid will serialize the following rdtsc with respect to all other
        * instructions the processor may be handling.
        */
       __asm__ __volatile__ (
         "xorl %%eax, %%eax\n"
         "cpuid\n"
         "rdtsc\n"
         : "=a" (lo), "=d" (hi)
         :
         : "%ebx", "%ecx");
     return (uint64_t)hi &lt;&lt; 32 | lo;
   }

In order to measure boot time latency contribution of the various phases other then the kernel initialization, we will need to incorporate a call to this function and print the results in both our boot loader and a user program that will run at boot.

Let us start with the user program first, as it is easier. Here is the user program:

#include <stdint.h>
#include <stdio.h>
 
inline uint64_t rdtsc() {
   ...
}
 
int main(void) {
 
        printf("TSC: %lld\n",  rdtsc());
t1:
        goto t1;
 
        return 0;
}

The rdtsc function implementation which is missing from the listing above is the same function provided earlier in this article. Also, note we added an infinite loop at the end of the program - this hack is meant from keeping our program from existing when it has finished, this making it easier for us to run it as an “init” program without having the kernel panics when the program exists (more about this below).

Yet another issue to bare in mind that in a multi core board, it is prudent to pin the program to run on the boot CPU, as the value of the TSC register of different cores is not consistent, something which have neglected in the example above.

To build the program, the following one liner can be used:

gcc timestamp.c -static -o timestamp

After building the program, we shall copy the timestamp program binary on to the root file system of the target board and instruct the kernel to run the program during boot using the kernel command line parameter “init”, which tells the kernel to run the specified program at boot, rather then the default of “/sbin/init”.

Alternatively, we can use the initramfs mechanism of the Linux kernel to include the small user space program as part of the kernel binary image itself.

As you’ve seen, writing a simple user space program to print the value of the TSC register and have it run as the first program during boot is not a complicated affair. Let us see how we can produce a similar affect with the boot loader.

Unfortunately, doing so is less simple then with the user application example above. First and foremost, there are quite a few boot loaders that can be used to load Linux on the x86 architecture (e.g. GRUB, LILO, syslinux). In addition, x86 boot loader code can be notoriously complex, usually written in 16 bit assembly and part of it runs under less then idle conditions.

Luckily, the syslinux boot loader offers a very easy way to integrate 32 bit C code that can be run as part of the boot process, via its support for the Comboot API which enables us to write a small 32 bit C program and integrate it as part of the boot loader phase.

Here is our timestamp program once again, this time presented as a Comboot 32 module. It is derived from the hello.c example program supplied as part of the syslinux source code:

#include <string.h>
#include <stdio.h>
#include <console.h>
 
inline uint64_t rdtsc() {
...
}
 
int main(int argc, char *argv[])
{
   uint32_t lo, hi;
 
  openconsole(&amp;dev_stdcon_r, &amp;dev_stdcon_w);
 
  printf("TSC: %lld\n", rdtsc());
 
  syslinux_run_command("linux");
 
  return 0;
}

As before, the rdstc function is the same one we presented earlier in this article. Also, you may note that we end our little comboot32 program by invoking the syslinux_run_command() comboot 32 API call, which will cause the Linux kernel image with the name of “linux” to be executed when out little timestamp program will end.

To build our little comboot32 program, have a look inside the com32/samples/ directory that is part of the syslinux source code and follow the Makefile example provided there.

Now that we have our little Comboot32 program, named timestamp.c32, we need to instruct the syslinux boot loader to run the program at boot time. The program will then print the value of the TSC register and proceed to load the Linux kernel.

The syslinux boot loader reads a configuration file, called syslinux.cfg, located in same partition into which it is installed and executes the command in the file. In our case, the file content will be:

default timestamp.c32
prompt 0

Let us recap the different stages of the boot process and how we plan to benchmark them:

When the power is turned on, the board firmware will run the BIOS which will initialize the hardware and load and run syslinux, out boot loader.

syslinux will read the syslinux.cfg file and according to the commands in it, will execute our timestamp.c32 Comboot32 program, which will print the current value of the TSC register to the VGA screen and proceed to instruct syslinux to load the Linux kernel.

As the kernel boots, every printk log message send by it to the kernel log buffer, will be stamped with a timestamp, relative to the kernel start time, which is to based on the TSC register.

Finally, after the kernel initialization has complete, the kernel will run the first user space program, which in our case will be the user program version of the timestamp program, which once again will print the current value of the TSC register and halt.

At the end of this proccess, we will have the following values:

The value of the TSC register when the boot loader run. By dividing this value by the number of cycles per second the machine has, we can convert this cycle count, to the the number of seconds since the CPU reset up to this point.

The time it took the kernel to finish initialization, using the PRINTK_TIMES tiemstamps in the log messages. This value is already in seconds, so not conversion is needed.

The final TSC register value when the first user code program runs. Here too we will divide the raw cycle count as reported by the TSC register by the number of cycles per second the machine is running at to obtain the number of seconds.

From these measurements we can easily construct a somewhat crude, but useful, table of the contribution each of the stages has to the overall machine boot latency:

The latency of the BIOS phase is measured by the time we measure by reading the TSC register at the start of the boot loader code in our timestamp.c32 program.

The latency of the boot loader actions (reading the kernel image to RAM and compressing the image), can be computed by subtracting the number of seconds which took the kernel to initialize as reported by the PRINTK_TIMES feature, from the number of seconds reported by the user space version of our timestamp program and the number of seconds recorded at the boot loader beginning.

In addition, the various timestamps in the log messages throughout the kernel initialization process provided further insight as to the contribution of the various kernel subsystems and device driver to the overall latency.

Finally, the full boot latency, which is the the overall time which is needed to get to the first user program first line of source code running, is the number of seconds computed from the value of the TSC reported by the user program version of our timestamp program.

As an example, here are the values from one type of CoreDuo board, running at 1.5 Ghz:

The value reported from the timestamp.c32 program was 90933051222 cycles. Dividing the number of cycles by the number of cycles per second, or 1,500,000,000 on this specific boards, gives a BIOS latency of 60.6 seconds (the BIOS on this particular run was wasting a long time scanning the IDE bus).

The timestamp provided by the PRINTK_TIMES feature in the kernel log messages indicated it took the kernel 0.44 seconds to initialize.

Finally, the user program timestamp reported the first line of code ran at 93907247022 cycles, or 62.6 seconds after CPU reset, which are 2.0 second since the boot loader started.

If we subtract the 0.44 seconds it took the kernel to initialize from the over 2 seconds latency between boot loader and first user program line of code, we can reach the conclusion that it took the boot loader roughly 1.66 seconds to read the kernel binary from the CompactFlash on which it was stored into memory and decompress it.

As is evident, these numbers are not 100% accurate. Certainly, we have neglected to attribute some kernel initialization time before the first printk to the kernel initialization phase, as well as attributing some latency caused by the boot loader code initialization to the BIOS phase.

However, these numbers are accurate enough to be used as the basis of evaluating the effect of various changes on overall boot time and give us a clue as to where out optimization efforts needs to be focused.

For example, deom the measurements provided above it is clear that a faster storage medium can save up to 1.66 seconds from the boot process, simply by the virtue that it will take less time for the bootloader to load the kernel image into memory.

For further information about reducing and measuring boot latency, the embedded Linux wiki entry BootTime is highly recommended.

Gilad Ben-Yossef

Trackback URI | Comments RSS

Leave a Reply

You must be logged in to post a comment.