Optimization: Your Worst Enemy


A classic blunder in optimization was committed some years ago by one of the major software vendors. We had their first interactive timesharing system, and it was a "learning experience" in a variety of ways. One such experience was that of the FORTRAN compiler group. Any compiler writer knows that the larger the hash table you use for a symbol table, the better your lookup performance will be. When you are writing a multipass compiler on a 32K mainframe, you end up using a relatively small symbol table, but you create a really, really good hash algorithm so that the probability of a hash collision is reduced. (Unlike a binary search, which is O(log n), a good hash table has constant-time performance up to a certain density, so as long as you keep the density below that threshold you can typically expect a cost of one or two probes to enter or look up a symbol, on average. A perfect hash table, which is usually precomputed for constant symbols, has a constant cost between 1.0 and 1.3 probes or thereabouts; if it gets to 1.5, you rework the hashing to get it lower.)
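
To make the density argument concrete, here is a minimal sketch of an open-addressed symbol table in C. The table size, the hash function, and the linear probing are illustrative choices of mine, not the compiler's actual scheme; the point is only that, as long as the table stays sparse, a lookup or insertion costs one or two probes.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative open-addressing symbol table; 1021 is just a convenient
   prime, and the sketch assumes the table never gets close to full. */
#define TABLE_SIZE 1021

static const char *table[TABLE_SIZE];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Find or enter a symbol, reporting how many probes it took.  At low
   density the probe count stays at 1 or 2; as the table fills, collisions
   push it up, which is why you keep the density below a threshold. */
static int lookup(const char *name, int *probes)
{
    unsigned i = hash(name);
    *probes = 1;
    while (table[i] != NULL) {
        if (strcmp(table[i], name) == 0)
            return (int)i;               /* already present */
        i = (i + 1) % TABLE_SIZE;        /* linear probe on collision */
        (*probes)++;
    }
    table[i] = name;                     /* not present: enter it */
    return (int)i;
}

int main(void)
{
    int probes;
    lookup("COUNT", &probes);
    printf("entered COUNT in %d probe(s)\n", probes);
    lookup("COUNT", &probes);
    printf("found COUNT in %d probe(s)\n", probes);
    return 0;
}
```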

So anyway, this compiler group discovered that they no longer had 32K, or 128K, or 512K. Instead, they now had a 4GB virtual address space. "Hey, let's use a really big hash table!" you can hear them saying. "Like, how about a 1MB table?" So they did. But they also had a very sophisticated compiler technology designed for small and relatively dense hash tables. The result was that the symbols were fairly uniformly distributed over the 256 4K pages in that 1MB, which meant that every symbol access caused a page fault. The compiler was a dog. When they finally went back to a 64K symbol table, they found that although the algorithm had poorer "absolute" performance from a purely algorithmic viewpoint (taking many more instructions to look up a symbol), it ran over an order of magnitude faster, because it did not cause nearly as many page faults. So third-order effects do matter.
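
A rough back-of-the-envelope sketch of why that was fatal. The 256 pages come from the story's 1MB table and 4K pages; the 16-page resident set and the crude residency model are my own assumptions, purely for illustration. If symbols hash uniformly over 256 pages but only a handful of those pages can stay resident, almost every lookup faults.

```c
#include <stdio.h>
#include <stdlib.h>

/* Crude model, not a real pager: pretend the first 'resident' pages stay
   in memory and every reference to any other page faults. */
int main(void)
{
    const int pages    = 256;       /* 1MB table / 4K pages            */
    const int resident = 16;        /* assumed working-set budget       */
    const int lookups  = 100000;

    int faults = 0;
    for (int i = 0; i < lookups; i++) {
        int page = rand() % pages;  /* uniform hashing scatters symbols */
        if (page >= resident)
            faults++;
    }
    printf("%d of %d lookups fault (about %.0f%%)\n",
           faults, lookups, 100.0 * faults / lookups);
    return 0;
}
```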

Also, beware of C. No, not the speed of light. When we talk about performance, the algorithmic performance for n elements is expressed as a function C × f(n). Thus an n² algorithm is formally C × n², meaning that the performance is a constant multiple of the square of the number of elements being considered. We shorten this to O(n²), meaning "order of n²", and in common parlance just drop the "order of" designation. But never forget that the C is there.

Some years ago, I was doing a project that produced a summary set of listings, sorted in a variety of ways. In the first attempt (this was in the days before C and qsort) I just did an ordinary bubble sort, an O(n²) algorithm. After initial testing, I fed it some live data. Ten minutes after it had printed the console message "Starting reports", it had not yet produced any reports. A series of probes showed that most of the time was in the sort routine. OK, I was done in by my laziness. So I pulled out my trusty heapsort, an O(n log n) algorithm, and spent an hour modifying it to work in my application (remember, I said qsort did not yet exist). Having solved the problem, I started running it again. Seven minutes into the report phase, nothing had yet appeared. Some probes revealed something significant: it was spending most of its time in the equivalent of strcmp, comparing the strings. While I'd fixed the O issue, I had seriously neglected the C issue. So what I did was one sort of the composite symbol table, all of the names, and then assigned an integer to each symbol structure. Thereafter, when I had to sort a substructure, I just did an integer sort on its ordinal position. This reduced C to the point where less than 30 seconds were required to do the entire report phase. A second-order effect, but a significant one.
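
Here is a minimal sketch of that trick in modern C, using today's qsort (which, as noted, did not exist at the time); the struct symbol layout and the sample names are hypothetical. One expensive string sort assigns each name an ordinal rank, and every later sort compares only integers, which is what shrinks C.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical symbol record: a name plus a precomputed ordinal rank. */
struct symbol {
    const char *name;
    int rank;        /* position of name in the one-time sort of all names */
};

/* Expensive comparison: every call walks the two strings. */
static int cmp_by_name(const void *a, const void *b)
{
    return strcmp(((const struct symbol *)a)->name,
                  ((const struct symbol *)b)->name);
}

/* Cheap comparison: a single integer compare, because the string work
   was done once, up front, when the ranks were assigned. */
static int cmp_by_rank(const void *a, const void *b)
{
    int ra = ((const struct symbol *)a)->rank;
    int rb = ((const struct symbol *)b)->rank;
    return (ra > rb) - (ra < rb);
}

int main(void)
{
    struct symbol tab[] = { { "gamma", 0 }, { "alpha", 0 }, { "beta", 0 } };
    size_t n = sizeof tab / sizeof tab[0];

    /* One expensive sort of the composite table by name... */
    qsort(tab, n, sizeof tab[0], cmp_by_name);
    /* ...then assign each symbol its ordinal position. */
    for (size_t i = 0; i < n; i++)
        tab[i].rank = (int)i;

    /* Every later sort of a substructure uses the cheap integer compare. */
    qsort(tab, n, sizeof tab[0], cmp_by_rank);

    for (size_t i = 0; i < n; i++)
        printf("%s\n", tab[i].name);
    return 0;
}
```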

So algorithmic performance, particularly paging performance, can matter. Unfortunately, we have proper tools neither for measuring paging hits nor for reorganizing code to minimize the paging of code pages.

Some performance tools measure the total time spent in user space and treat kernel time as free. This can mask the impact the application has on kernel time. For example, a few years ago we were measuring a program whose performance was exceptionally poor. No "hotspot" showed up in terms of program time wasted. However, at one point I was looking at the trace data and noticed that the input routine was called about a million times, which is not surprising when you are reading a megabyte of data, but something seemed odd to me. I looked. Each time it was called, it called the kernel to read a single byte of the file! I changed it to read 8K bytes and unpack the buffer itself, and got a factor of 30 performance improvement! Note the lesson here: kernel time matters, and kernel transitions matter. It is not accidental that the GDI in NT 4.0 is no longer a user-level process but is integrated into the kernel. The kernel transitions dominated the performance.
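
A minimal POSIX-flavored sketch of the before and after. The byte-counting bodies stand in for whatever the real input routine did with the data, and the 8K buffer size follows the story; nothing here is the original program's code.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* The slow pattern from the story: one kernel transition per byte. */
static long count_bytes_slow(int fd)
{
    char c;
    long n = 0;
    while (read(fd, &c, 1) == 1)    /* a system call for every single byte */
        n++;
    return n;
}

/* The fix: one kernel transition per 8K, unpacking the buffer in user space. */
static long count_bytes_fast(int fd)
{
    char buf[8192];
    ssize_t got;
    long n = 0;
    while ((got = read(fd, buf, sizeof buf)) > 0)
        n += got;                   /* the "unpack" step is just counting here */
    return n;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    printf("slow: %ld bytes\n", count_bytes_slow(fd));
    lseek(fd, 0, SEEK_SET);                 /* rewind and do it again */
    printf("fast: %ld bytes\n", count_bytes_fast(fd));

    close(fd);
    return 0;
}
```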

So what to optimize is easy: those parts of the program that are consuming the time. But local optimizations that ignore global performance issues are meaningless. And first-order effects (total time spent in the allocator, for example) may not be the dominant effects. Measure, measure, measure.
