Archive for the 'Applications' Category

Do Androids Dreap of Electronic Sheep?

Tuxology team May 24th, 2009

We’ve added a new lecture about Android, the Open Source mobile stack, from Google and the Open handset Alliance to our list of available lectures.

You are all invited to grab the slides or watch the video (in Hebrew only right now, sorry) at the lecture page.

Helping The Compiler Help You

Tuxology team May 10th, 2009

GCC and G++, respectively, the GNU C compiler collection C and C++ language front ends, are a wondrous duo that usually do a decent job translating our C and C++ source into executable code. Well, most of the time at least… ;-)

Wonderful as these tools are, however, they are not omnipotent and cannot guess what’s on our minds. Would this function we just wrote get called very often or no? is that if statement where we are going to take the if branch more often then the else one, or is it the other way around? the compiler simply can’t tell, without our help that is.

Supplying the compiler some hints when writing code, can be very beneficial in enabling the compiler to properly optimise the code it produced for best performance, how ever that is defined.

In this post we shall describe several hints that we can provide to GCC and G++ compiler to help them understand our program run time behaviour better and ultimately produce better code.

There are two types of constructs to provide hints to GCC and G++ about the run time attributes of the code you write for the purpose of optimisation: function attributes and built-ins.

Function Attributes

Function attributes are modifiers you add to function  definition that supply the compiler with additional information about that function run time behaviour. The canonical form of an attribute in GCC and G++ is:

__attribute__((name))

Where “name” is one of the reserved attributes names. Many such attributes exists and you are encouraged to read the GCC documentation about them.

For example, this is how a function defined with the “pure” attribute will look like (more on the pure attribute itself later):

int my_func(int i)  __attribute__((pure));

One very common shorthand form  that is often used in large project (for example, the Linux kernel), is to use a pre-processor define to make the attribute use less verbose, like so:

#define __pure __attribute__((pure))
 
int  my_func(int i) __pure;

Or, as I prefer:

int   __pure my_func(int i);

I shall use this shorthand form hence forth.

Now that we have established what are function attributes and how to use them, let us go over a list of such attributes which can be used to provide hints to the compiler and allow it to better optimise our code:

The pure function attribute

Many functions that we write are pure functions. A pure function is a function whose only effect is it’s return value and the return value, in turn, depends solely on the parameters to the function and/or gloval variables, such as that if we call the function with the same parameters and the same values in global variables we are guaranteed to get the same return value.

As an example, the functions strlen(3) and memcmp(3) are good examples for pure functions; their return value is solely dependant on their input parameters. On the other hand, the feof(3) function is a not a pure function, because, in a multi-threaded application, some other thread might have changed the state of the file the file descriptor is referencing between two calls of the function.

Compilers can make use of the knowledge that a certain function is a pure function by performing such optimisations such as common sub-expression elimination in the same way that this is done with arithmetic operators such as “+” or “*”.

Using the pure attribute, we can tell GCC and G++ that a certain function is a pure function and therefore might be a good candidate to optimisation. For example:

size_t __pure strlen(const char *) ;

Note that the pure attribute is not implemented in GCC versions earlier than 2.96.

The hot and cold function attributes

An additional pair of function attributes which you can use to help GCC and G++ emit better code are the hot and code function attributes:

__attribute__((hot))
__attribute__((cold))

As their name suggests, these function attributes are used to hint the compiler that the corresponding functions are called often in your code (hot) or seldom called (cold).

The compiler can then order the code in branches, such as if statements, to favour branches that call these hot functions and disfavour functions cold functions, under the assumption that it is more likely that that the branch that will be taken will call a hot function and less likely to call a cold one.

In addition, the compiler can choose to group together functions marked as hot in a special section in the generated binary, on the premise that since data and instruction caches work based on locality, or the relative distance of related code and data, putting all the often called function together will result in better caching of their code for the entire application.

Good candidates for the hot attribute are core functions which are called very often in your code base. Good candidates for the cold attribute are internal  error handling functions which are called only in case of errors.

Here is a usage example:

int __hot call_me_often(void);
int __cold handle_error(void);

The hot and cold attributes are available GCC version from 4.3 and onwards.

Built-ins

Another form of hints that can be a programmer may provide to the compiler about the code run time behaviour for the purpose of optimisation, are calling a built-in compiler function.

Data pre-fetching and cache pre-population

As you are most probably aware, fetching information from RAM is a costly business in the terms of a computer program. This is why an elaborate system of internals on CPU and off CPU caches are used to store recently accessed data and instructions in the hope that these will be needed again soon and thus the cost of fetching them from RAM can be saved.

Most chip architectures and compilers do a pretty decent job working together to automatically populated the various caches with the right data and instructions in order to achieve better performance. Sometime however, a programmer can provide a hint to the compiler about what data might be needed in advance of time, thus allowing the compiler to emit instruction that will fetch the data from RAM to the cache while the CPU is busy doing other processing, thus making sure the data is available in the cache when the CPU gets around to needing it.

The way to do this with GCC and G++ is to use the built-in pre-fetch function:

void __builtin_prefetch( const void *addr, int rw, \
  int locality );

The built-in pre-fetch function will cause the compiler to emit instructions that will pre-populate the cache with the data at address addr. Note that it is allowed to provide an illegal address to pre-fetch.

The optional rw parameter indicated whether the requested access is for read purposes or write (1 denotes read/write access while 0 denotes read only), whereas the optional locality parameter  describes the temporal locality of the data. That is, the degree to which it is likely the same data will be used again soon and therefore whether or not to keep the data in the cache after it is used (0 means “do not keep at all” and 3 denotes “keep as long as possible”).

As an usage example, consider the common operation of traversing a linked list. Typically, we iterate over the list, processing each item in turn and then progressing to the next list item:

while (!item) {
  item->field++;
  // more processing...
  item++
}

Using the built-in pre-fetch function, we can write the same code like this:

while (!item) {
  // Not NULL check since invalid addr OK
  __builtin_prefetch(item->next, 1, 1) ;
  item->field++;
  // more processing...
  item++
}

Adding the pre-fetch call will allow the compiler to issue instructions that will get the CPU to start the transaction that will get the data for the next item fetched from the main RAM to the cache while processing the current field which is in the cache. With any luck, by the time the CPU finished processing the current item, the next item will have already been pre-fetched and populated the cache, just in time for it being processed.

Proper use of pre-fetching and cache pre-populating can sometime have a dramatic effect on the performance of some kinds of computer programs.

Branch annotation

Branches in program code path, such as those that happen due to an if statement, are challenging for modern CPUs. As most modern day CPUs makes use of multiple pipelines of instructions that execute in parallel, a branch in a program is problematic since the CPU does not know for sure in advance which branch will be taken until it has executed the entire instructions that lead up to the branch, thus stalling the pipeline.

The solution taken by CPU makers for this problem is the introduction of branch prediction logic into the CPU which tries to estimate which branch will be taken and try to advance with the pipeline based on that estimation. If the assumption is later proved to be wring, the CPU aborts the pipeline and starts over; a costly business.

As you probably have guessed by now though, we can use the knowledge we have as programmers about the program flow and hint the compiler about the chances that a certain branch, or if statement, will be taken or not. The compiler in turn, will use this information to bias the CPU to prefer to proceed with the most probable branch, the one least likely to cause a hazard and an abort of the pipeline.

Other benefits are that the compiler can re-order the binary code such that the code most likely to execute in a branch will be closest to the branch instruction itself, thus preserving code locality and therefore assisting caching.

Providing the GCC and G++ compilers with the probable outcome of a branch, we use the built-in expect function:

__builtin_expect(!!(x), e)

The function is used by applying it to a branch predicate, such as in the following example:

if(__builtin_expect(!!(condition), 1)) { ... }

The x parameter to the built-in expect function is the predicate we test in the if statement and the e parameter is the most likely outcome of evaluating the predicate as a Boolean value: true or false.

Often, the built-in expect function is wrapped for convenience using two macros. Such is the case in the Linux kernel source, for example -

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

Using these macros, the usage becomes the following, which is much more readable:

if(likely(condition)) {...}

It should be noted that GCC also provide a run time parameter -fprofile-arcs, which can profile the code for the actual statistics for each branch and the use of it should be prefered above guessing.

Conclusion

We have detailed above several note worthy way to providing hints to GCC and G++ compilers about your code behaviour in run time, hints that allow the compiler to better optimise your code and therefore achieve superior performance. Making a habit of using these hints as part of your normal code writing routine can easily provide substantial returns on a long enough scale.

Care should be taken however, to not get these hints wrong, as there is nothing worse then providing your compiler with the wring hint and thus “pessimising” your code instead of optimising it.

glibc 2.10 news

Tuxology team April 23rd, 2009

Ulrich Drepper, maintainer of glibc, published an interesting blog entry detailing some of the changes going into the glibc 2.10 release. Well worth a read.

IBM DeveloperWorks: high availability for composite applications

Tuxology team January 15th, 2009

IBM DeveloperWorks published a great article by Mahesh Viswanathan (maheshv@us.ibm.com), Senior Technical Staff Member at IBM and
Suraj Subramanian (suraj@us.ibm.com), Senior Integration Architect at IBM about an implementation of high availability for a composite application using Linux-HA.

sys_clone: Beyond Processes and Threads

Gilad Ben-Yossef January 1st, 2009

Most Linux developers are aware of the two library calls for creating a new context of execution in Linux:

fork() and related call vfork(), which create a new process, complete with it’s own process id, private address space and a private copy  set file descriptors, file system attributes (such as working directory) and signal handlers, all distinctly separate from those of the parent that called the fork() function.

And pthread_create(), which creates a new thread inside the existing process, which shares the same process id, address space, file descriptors, file system attributes and signal handlers with the caller.

Looking at these two options it would seem that were are faced with a “share all or share nothing” attitude – either we go the new process route and get a private copy of everything  (actually Linux employs copy on write semantics to make more efficient use of memory), or we can go the new thread route share all resources and for the most, these options are sufficient for most needs.

Sometimes, however, a less black and white approach is called for – such as a case where we would like to create a new context of execution sharing the file descriptors and file system attributes of the creator, but not it’s address space (or maybe juts a portion of it) and process id, for example.

For these cases exactly Linux offers clone(2). Clone is a library function implemented inside the C library, Glibc, which layered on top of the underlying sys_clone system call.

Clone is similar to fork(2) and pthread_create(3) in that it creates a new execution context which are scheduled independently from the creator.

Unlike fork and pthread_create, however, clone provides a fine grained level of control of the properties of this new context:

  • What it will and what it will not share with its creator.
  • Which parent process will it belong to.
  • Which signal, if any, will be delivered to its parent when it terminates.
  • Where is the location of the new task call stack.

As an example, here a code snippet that creates a new task (for lack of a better word), which implements our previous example – a new process which has its own process ID, a private address space (with a copy on write semantic of the creator address space) and a separate set of signal handlers, but shares with its creating process the table of file descriptors, file system attributes:

#include <ched.h>
 
#define STACK_SIZE 4096
 
void * stack = mmap(NULL, STACK_SIZE , \
   PROT_READ | PROT_WRITE, MAP_PRIVATE | \
   MAP_ANON | MAP_GROWSDOWN, -1, 0);
 
/* check for alloc errors... */
 
clone(test, stack, CLONE_FS | CLONE_FILES \
    | SIGCHLD| CLONE_PARENT, NULL);
/* check for clone errors... */
 
munmap(stack, STACK_SIZE);

As you can see in the example above, we first allocate a stack for the new task using an anonymous private memory mapping and instruct the kernel to grow the mapping downwards as need be (this is actually not true on all architectures).

Them, we use the clone library call to create the new task, sharing the file descriptor table and file system attributes, but not the address space with the creator.

We also ask that the new task will inherit the same parent as the creator and that a SIGCHLD signal will be sent to the parent of the new task, as normally would be the case.

As you can see, the clone library call and the sys_clone are a powerful tool that can be used to create unique execution contexts beyond the standard thread or process variety. Many more options are detailed in the clone(2) man page.

Gilad

Using ldd and nm to locate crashing function

Gilad Ben-Yossef August 4th, 2008

A new video post tutorial, showing how to locate the function where your Linux application crashed if all you know is the address where the crash happened using two common Linux utilities: ldd and nm.

Enjoy!

Benchmarking boot latency on x86

Gilad Ben-Yossef July 8th, 2008

One of the tasks embedded system developers face is managing boot latency, or the time it takes for a device to become functional after power up.

After all, a set top box that will take more then 60 seconds from the time the “On” button is pressed until the end customer can at least interact with the menus, for example, will most likely be returned to the store for a refund, no matter what feature set it has. When it come to our gadgets, we are all hungry for immediate satisfaction, it seems.

There are many tricks one can employ to to achieve boot time nirvana, but as Knuth taught us, premature optimization is the root of all evil. Therefore, before we turn our efforts to optimize boot latency, we should first try to measure what that boot latency really is and what part of the boot sequence contribute to it, lest we optimize the wrong thing.

Generally speaking, on a x86 (32bit or 64 bit), the boot process of a Linux system is comprised of the following phases and milestones:

  • The power up milestone: when the power is set to on
  • The BIOS phase: which includes the POST (or Power On Self Test), device initialization, running of option ROMs and loading the boot loader form the MBR (or Master Boot Record).
  • The boot loader phase: loading of an operating system kernel and ancillary data (such as Linux initrd or initramfs) into RAM.
  • Kernel initialization phase: initialization of CPU, peripherals and kernel data structures, including the bring up of the non boot cores in the case of a multi-core machine.
  • First user application first line of code milestone: the time when the first line of user application source code is executed.

Unfortunately, measuring the contribution of each of these phases to the overall boot latency is not an easy task to accomplish. At each of the phases different kind of code is executing on the machine: from the 16 bit BIOS code which is part of the machine firmware, via the 16 or 32 bit boot loader code, the 32 bit kernel code and the finally user applications – each of these is executing in a completely different software environment from the other and it is hard to find a common ground to compare the time each phase takes.

Luckily, the x86 architecture provides a useful tool: the TSC (or Time Stamp Clock) register. The TSC register was introduced in the original Intel Pentium CPU and counts the number of clock ticks from the last processor reset. Reading the current value of the TSC register is done using the RDTSC instruction.

Assuming no processor frequency changes (for example via SpeedStep technology) takes place and that we always sample the register of the same core in a multi core environment, both of which are easy to guarantee during the boot phase, the TSC register provides us with an accurate hi-res timer through which we can benchmarks the various phases in the boot process.

Continue Reading »

A new lecture: Crash N’ Burn

Tuxology team June 30th, 2008

We’ve just added a brand new tutorial to the lectures section on the site, entitled: “Crash and burn: Writing Linux application fault handlers“. Check out the full description, slides and example code on the lecture page.

Or, if you’d rather see our very own Gilad Ben-Yossef present the tutorial in front of a live audience, you’re welcome to attend one of the following venues:

Hope to see you there!

Tuxology team

Remote Debugging with Eclipse

Tuxology team June 5th, 2008

Eclipse is an open source development platform comprised of extensible frameworks, tools and run times for building, deploying and managing software across the life cycle.

CDT is the name of the C/C++ development plug-in. It includes a graphical GDB front end.

The following slides are a short visual “how to” demonstrating configuring and using CDT to debug a remote target with GDB.

Read this doc on Scribd: eclipse debug

Reducing the size of dynamic libraries

Gilad Ben-Yossef May 15th, 2008

Today we have a special reader request from an anonymous reader (slightly edited):


Hello,

Can you help me with some link issue which I face?

I need to compile tree of c-sources which produce .so files and exe
files. I want to decrease the sizes of .so by throwing away unused symbols.

I can compile my tree statically and used as compiler flags –static -ffunction-sections -fdata-sections and link with –gc-sections option and it reduces all the unneeded symbols but I want to achieve the same effect in dynamic linking.

Do you know some efficient way to do it?

Thanks,
Anonymous Reader

So the question is: how to make a minimal set of dynamic libraries for a known set of executables that only contain the code for those symbols which the executables actually use, thus saving expensive storage?

The quite simple answer is that there exists a Python utility that does exactly what you want called mklibs – It produces cut-down shared libraries that contain only the routines required by a particular set of executables.

On Debian just go “sudo apt-get install mklibs”, or you can get the source straight from the Buildroot source repository here.

If you don’t like Python (you really should!) you can try an older, shell script based variation on the same theme found here.

Note that you can pass the cross compiler prefix with the “–target” option, which is of course needed for supporting cross compilation.

Hope you found it useful and if you have more questions about Linux development you’d like answered, just let us know.

This post originally appeared in the Codefidence Technoblog

Next »