Welcome to Tuxology!

Tuxology is a web site dedicated to the Linux training courses and consultancy provided by Codefidence and Hi-Tech College. It aims to provide an on-line counterpart to the courses material for students who wish to expand their knowledge and keep their skills up to date both during and after the training courses.

In the spirit of the Free and Open Source software world which Linux is a part of, all the site facilities are free for all to access and enjoy, including updated course slides, forums and articles about the different course topics.

New lecture slides: the good, the bad and the ugly

Gilad Ben-Yossef October 19th, 2009

We’ve added slides for a new lecture, entitled: “The Good, the bad and the ugly: on threads, processes and co-processes“. This presentation was originally delivered at CELF ELCE Europe 2009 conference about the trade off involved when choosing threads vs. processes for an application design.

Do Androids Dreap of Electronic Sheep?

Tuxology team May 24th, 2009

We’ve added a new lecture about Android, the Open Source mobile stack, from Google and the Open handset Alliance to our list of available lectures.

You are all invited to grab the slides or watch the video (in Hebrew only right now, sorry) at the lecture page.

Helping The Compiler Help You

Tuxology team May 10th, 2009

GCC and G++, respectively, the GNU C compiler collection C and C++ language front ends, are a wondrous duo that usually do a decent job translating our C and C++ source into executable code. Well, most of the time at least… ;-)

Wonderful as these tools are, however, they are not omnipotent and cannot guess what’s on our minds. Would this function we just wrote get called very often or no? is that if statement where we are going to take the if branch more often then the else one, or is it the other way around? the compiler simply can’t tell, without our help that is.

Supplying the compiler some hints when writing code, can be very beneficial in enabling the compiler to properly optimise the code it produced for best performance, how ever that is defined.

In this post we shall describe several hints that we can provide to GCC and G++ compiler to help them understand our program run time behaviour better and ultimately produce better code.

There are two types of constructs to provide hints to GCC and G++ about the run time attributes of the code you write for the purpose of optimisation: function attributes and built-ins.

Function Attributes

Function attributes are modifiers you add to function  definition that supply the compiler with additional information about that function run time behaviour. The canonical form of an attribute in GCC and G++ is:

__attribute__((name))

Where “name” is one of the reserved attributes names. Many such attributes exists and you are encouraged to read the GCC documentation about them.

For example, this is how a function defined with the “pure” attribute will look like (more on the pure attribute itself later):

int my_func(int i)  __attribute__((pure));

One very common shorthand form  that is often used in large project (for example, the Linux kernel), is to use a pre-processor define to make the attribute use less verbose, like so:

#define __pure __attribute__((pure))
 
int  my_func(int i) __pure;

Or, as I prefer:

int   __pure my_func(int i);

I shall use this shorthand form hence forth.

Now that we have established what are function attributes and how to use them, let us go over a list of such attributes which can be used to provide hints to the compiler and allow it to better optimise our code:

The pure function attribute

Many functions that we write are pure functions. A pure function is a function whose only effect is it’s return value and the return value, in turn, depends solely on the parameters to the function and/or gloval variables, such as that if we call the function with the same parameters and the same values in global variables we are guaranteed to get the same return value.

As an example, the functions strlen(3) and memcmp(3) are good examples for pure functions; their return value is solely dependant on their input parameters. On the other hand, the feof(3) function is a not a pure function, because, in a multi-threaded application, some other thread might have changed the state of the file the file descriptor is referencing between two calls of the function.

Compilers can make use of the knowledge that a certain function is a pure function by performing such optimisations such as common sub-expression elimination in the same way that this is done with arithmetic operators such as “+” or “*”.

Using the pure attribute, we can tell GCC and G++ that a certain function is a pure function and therefore might be a good candidate to optimisation. For example:

size_t __pure strlen(const char *) ;

Note that the pure attribute is not implemented in GCC versions earlier than 2.96.

The hot and cold function attributes

An additional pair of function attributes which you can use to help GCC and G++ emit better code are the hot and code function attributes:

__attribute__((hot))
__attribute__((cold))

As their name suggests, these function attributes are used to hint the compiler that the corresponding functions are called often in your code (hot) or seldom called (cold).

The compiler can then order the code in branches, such as if statements, to favour branches that call these hot functions and disfavour functions cold functions, under the assumption that it is more likely that that the branch that will be taken will call a hot function and less likely to call a cold one.

In addition, the compiler can choose to group together functions marked as hot in a special section in the generated binary, on the premise that since data and instruction caches work based on locality, or the relative distance of related code and data, putting all the often called function together will result in better caching of their code for the entire application.

Good candidates for the hot attribute are core functions which are called very often in your code base. Good candidates for the cold attribute are internal  error handling functions which are called only in case of errors.

Here is a usage example:

int __hot call_me_often(void);
int __cold handle_error(void);

The hot and cold attributes are available GCC version from 4.3 and onwards.

Built-ins

Another form of hints that can be a programmer may provide to the compiler about the code run time behaviour for the purpose of optimisation, are calling a built-in compiler function.

Data pre-fetching and cache pre-population

As you are most probably aware, fetching information from RAM is a costly business in the terms of a computer program. This is why an elaborate system of internals on CPU and off CPU caches are used to store recently accessed data and instructions in the hope that these will be needed again soon and thus the cost of fetching them from RAM can be saved.

Most chip architectures and compilers do a pretty decent job working together to automatically populated the various caches with the right data and instructions in order to achieve better performance. Sometime however, a programmer can provide a hint to the compiler about what data might be needed in advance of time, thus allowing the compiler to emit instruction that will fetch the data from RAM to the cache while the CPU is busy doing other processing, thus making sure the data is available in the cache when the CPU gets around to needing it.

The way to do this with GCC and G++ is to use the built-in pre-fetch function:

void __builtin_prefetch( const void *addr, int rw, \
  int locality );

The built-in pre-fetch function will cause the compiler to emit instructions that will pre-populate the cache with the data at address addr. Note that it is allowed to provide an illegal address to pre-fetch.

The optional rw parameter indicated whether the requested access is for read purposes or write (1 denotes read/write access while 0 denotes read only), whereas the optional locality parameter  describes the temporal locality of the data. That is, the degree to which it is likely the same data will be used again soon and therefore whether or not to keep the data in the cache after it is used (0 means “do not keep at all” and 3 denotes “keep as long as possible”).

As an usage example, consider the common operation of traversing a linked list. Typically, we iterate over the list, processing each item in turn and then progressing to the next list item:

while (!item) {
  item->field++;
  // more processing...
  item++
}

Using the built-in pre-fetch function, we can write the same code like this:

while (!item) {
  // Not NULL check since invalid addr OK
  __builtin_prefetch(item->next, 1, 1) ;
  item->field++;
  // more processing...
  item++
}

Adding the pre-fetch call will allow the compiler to issue instructions that will get the CPU to start the transaction that will get the data for the next item fetched from the main RAM to the cache while processing the current field which is in the cache. With any luck, by the time the CPU finished processing the current item, the next item will have already been pre-fetched and populated the cache, just in time for it being processed.

Proper use of pre-fetching and cache pre-populating can sometime have a dramatic effect on the performance of some kinds of computer programs.

Branch annotation

Branches in program code path, such as those that happen due to an if statement, are challenging for modern CPUs. As most modern day CPUs makes use of multiple pipelines of instructions that execute in parallel, a branch in a program is problematic since the CPU does not know for sure in advance which branch will be taken until it has executed the entire instructions that lead up to the branch, thus stalling the pipeline.

The solution taken by CPU makers for this problem is the introduction of branch prediction logic into the CPU which tries to estimate which branch will be taken and try to advance with the pipeline based on that estimation. If the assumption is later proved to be wring, the CPU aborts the pipeline and starts over; a costly business.

As you probably have guessed by now though, we can use the knowledge we have as programmers about the program flow and hint the compiler about the chances that a certain branch, or if statement, will be taken or not. The compiler in turn, will use this information to bias the CPU to prefer to proceed with the most probable branch, the one least likely to cause a hazard and an abort of the pipeline.

Other benefits are that the compiler can re-order the binary code such that the code most likely to execute in a branch will be closest to the branch instruction itself, thus preserving code locality and therefore assisting caching.

Providing the GCC and G++ compilers with the probable outcome of a branch, we use the built-in expect function:

__builtin_expect(!!(x), e)

The function is used by applying it to a branch predicate, such as in the following example:

if(__builtin_expect(!!(condition), 1)) { ... }

The x parameter to the built-in expect function is the predicate we test in the if statement and the e parameter is the most likely outcome of evaluating the predicate as a Boolean value: true or false.

Often, the built-in expect function is wrapped for convenience using two macros. Such is the case in the Linux kernel source, for example -

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

Using these macros, the usage becomes the following, which is much more readable:

if(likely(condition)) {...}

It should be noted that GCC also provide a run time parameter -fprofile-arcs, which can profile the code for the actual statistics for each branch and the use of it should be prefered above guessing.

Conclusion

We have detailed above several note worthy way to providing hints to GCC and G++ compilers about your code behaviour in run time, hints that allow the compiler to better optimise your code and therefore achieve superior performance. Making a habit of using these hints as part of your normal code writing routine can easily provide substantial returns on a long enough scale.

Care should be taken however, to not get these hints wrong, as there is nothing worse then providing your compiler with the wring hint and thus “pessimising” your code instead of optimising it.

glibc 2.10 news

Tuxology team April 23rd, 2009

Ulrich Drepper, maintainer of glibc, published an interesting blog entry detailing some of the changes going into the glibc 2.10 release. Well worth a read.

IBM DeveloperWorks: high availability for composite applications

Tuxology team January 15th, 2009

IBM DeveloperWorks published a great article by Mahesh Viswanathan (maheshv@us.ibm.com), Senior Technical Staff Member at IBM and
Suraj Subramanian (suraj@us.ibm.com), Senior Integration Architect at IBM about an implementation of high availability for a composite application using Linux-HA.

sys_clone: Beyond Processes and Threads

Gilad Ben-Yossef January 1st, 2009

Most Linux developers are aware of the two library calls for creating a new context of execution in Linux:

fork() and related call vfork(), which create a new process, complete with it’s own process id, private address space and a private copy  set file descriptors, file system attributes (such as working directory) and signal handlers, all distinctly separate from those of the parent that called the fork() function.

And pthread_create(), which creates a new thread inside the existing process, which shares the same process id, address space, file descriptors, file system attributes and signal handlers with the caller.

Looking at these two options it would seem that were are faced with a “share all or share nothing” attitude – either we go the new process route and get a private copy of everything  (actually Linux employs copy on write semantics to make more efficient use of memory), or we can go the new thread route share all resources and for the most, these options are sufficient for most needs.

Sometimes, however, a less black and white approach is called for – such as a case where we would like to create a new context of execution sharing the file descriptors and file system attributes of the creator, but not it’s address space (or maybe juts a portion of it) and process id, for example.

For these cases exactly Linux offers clone(2). Clone is a library function implemented inside the C library, Glibc, which layered on top of the underlying sys_clone system call.

Clone is similar to fork(2) and pthread_create(3) in that it creates a new execution context which are scheduled independently from the creator.

Unlike fork and pthread_create, however, clone provides a fine grained level of control of the properties of this new context:

  • What it will and what it will not share with its creator.
  • Which parent process will it belong to.
  • Which signal, if any, will be delivered to its parent when it terminates.
  • Where is the location of the new task call stack.

As an example, here a code snippet that creates a new task (for lack of a better word), which implements our previous example – a new process which has its own process ID, a private address space (with a copy on write semantic of the creator address space) and a separate set of signal handlers, but shares with its creating process the table of file descriptors, file system attributes:

#include <ched.h>
 
#define STACK_SIZE 4096
 
void * stack = mmap(NULL, STACK_SIZE , \
   PROT_READ | PROT_WRITE, MAP_PRIVATE | \
   MAP_ANON | MAP_GROWSDOWN, -1, 0);
 
/* check for alloc errors... */
 
clone(test, stack, CLONE_FS | CLONE_FILES \
    | SIGCHLD| CLONE_PARENT, NULL);
/* check for clone errors... */
 
munmap(stack, STACK_SIZE);

As you can see in the example above, we first allocate a stack for the new task using an anonymous private memory mapping and instruct the kernel to grow the mapping downwards as need be (this is actually not true on all architectures).

Them, we use the clone library call to create the new task, sharing the file descriptor table and file system attributes, but not the address space with the creator.

We also ask that the new task will inherit the same parent as the creator and that a SIGCHLD signal will be sent to the parent of the new task, as normally would be the case.

As you can see, the clone library call and the sys_clone are a powerful tool that can be used to create unique execution contexts beyond the standard thread or process variety. Many more options are detailed in the clone(2) man page.

Gilad

Herzelinux lecture slides : Initramfs – boot your Linux WELL

Tuxology team December 29th, 2008

On 18/12/2008 Zaar Hai gave a lecture in Herzelinux named “Initramfs – boot your Linux WELL”
the slides for the lecture is now available here.

Thanks you Zaar and we look forward to hosting you again in Herzelinux club meeting as well as here.

Tuxology Team.

IBM DeveloperWorks: Anatomy of Linux process management

Tuxology team December 25th, 2008

IBM DeveloperWorks web site just published a good article by Tim Jones discussing the anatomy of processes in Linux.

To the article…

Building an embedded Linux system emulator

Gilad Ben-Yossef December 14th, 2008

One of the hallmarks of embedded system programming is working with specialized hardware. Unfortunately, embedded system developers do not always have the luxury to develop and test their code on the actual hardware they target. Often, the hardware is developed in tandem with the system software and therefore it it is not available for much of the embedded system software development cycle.

While one can develop and test much of our code on a PC running Linux, such a PC is a very different environment from the target board. More often then not, the target board is not even of the same architecture as the PC. A solution to this problem can be obtained by using an emulator – a software tool that executes software code of our target platform in a virtual machine that is running on our development host, and running our system software in it.

The following article describes how to build an embedded Linux system running inside an emulator which can be used to develop, test and debug target code even without access to target hardware.

Continue Reading »

Rami Rosen on IPv6 in Linux

Tuxology team November 10th, 2008

Last Thursday Rami Rosen, a Computer Science graduate of The Technion Israel Institute of Technology in Haifa, Israel and a Linux kernel programmer for a networking startup, gave an excellent talk about IPv6 and it’s implementation in the Linux kernel at monthly the Herzlinux meeting.

For those of our reader which were absent, fear not! not only are the slides for the lecture available here, Rami posted an excellent article based on the lecture on LinuxDevices.com.

Thanks you Rami and we look forward to hosting you again in Herzelinux club meeting as well as here.

Tuxology Team.

Next »