Optimization: Your Worst Enemy

Optimization

A reasonably skilled programmer will not write a grossly inefficient program. At least not deliberately. Optimization is what you do when the performance is insufficient. Sometimes the optimizations are easy, sometimes they are hard. Sometimes they fly in the face of your original design, and sometimes they require that you grossly violate your beautiful abstractions in your class system. But always, and I repeat, always, my experience has been that no programmer has ever been able to predict or analyze where performance bottlenecks are without data. No matter where you think the time is going, you will be surprised to discover that it is going somewhere else. 

You optimize because you have a performance problem. Sometimes it is computational optimization: your bitmap manipulation is just too slow. Sometimes it is data access optimization: it just takes too long to get the data into the machine. And sometimes it is algorithmic optimization: you're doing it wrong. If you don't understand the difference between an n² sort and an n log n sort, you're probably already in trouble, but that knowledge alone is not useful.
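
To make that last point concrete, here is a small, entirely hypothetical C sketch (it is not from any project described in this article): for tiny inputs, a textbook n² insertion sort can run neck-and-neck with, or even beat, the library's n log n qsort. Knowing the complexity class tells you nothing until you measure.

    /*
     * Hypothetical illustration, not from the project described below: for
     * small n, a textbook n-squared insertion sort can compete with the
     * library's n log n qsort.  Knowing the complexity class does not tell
     * you which is faster for your data; only measurement does.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static void insertion_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int key = a[i];
            size_t j = i;
            while (j > 0 && a[j - 1] > key) {
                a[j] = a[j - 1];
                j--;
            }
            a[j] = key;
        }
    }

    static int cmp_int(const void *p, const void *q)
    {
        int a = *(const int *)p, b = *(const int *)q;
        return (a > b) - (a < b);
    }

    int main(void)
    {
        enum { N = 16, REPS = 1000000 };
        int data[N], work[N];
        for (int i = 0; i < N; i++)
            data[i] = rand();

        clock_t t0 = clock();
        for (int r = 0; r < REPS; r++) {
            memcpy(work, data, sizeof data);
            insertion_sort(work, N);
        }
        clock_t t1 = clock();
        for (int r = 0; r < REPS; r++) {
            memcpy(work, data, sizeof data);
            qsort(work, N, sizeof work[0], cmp_int);
        }
        clock_t t2 = clock();

        printf("insertion sort: %.2fs  qsort: %.2fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }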

Some years ago, I was working on a complex program which had to perform a semantic cross-check between the "statements" of the program and the "declarations" (it was actually a 4GL constraint equation system, but the details don't matter). I discovered that the operation was of n³ complexity (well, actually m·n², but m and n would be of comparable size most of the time). There are three paths that you can follow here:

  • The naive path. You don't even realize you've got an n³ problem. You're probably in trouble, because if it is the bottleneck, you won't know it is there.
  • The formal academic path. You realize you've got an n³ problem, know it is intrinsically evil, and rewrite your algorithms.
  • The engineering path. You realize you've got an n³ problem, but you instrument the system to discover its actual impact.

The only valid path to optimization is the engineering path. I measured the performance, and on the largest "real" example we had, I discovered that n was almost always 1, sometimes 2, rarely 3, and in exactly one instance 4. This was too small to matter. Sure, the algorithm was n³, but with n that small, there was no need to rewrite the code. Rewriting the code would have been incredibly complex, delayed the whole project for a couple of weeks, and used up a couple of pointers in each node of the tree in an already-tight minicomputer address space.
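
For illustration, here is a minimal C sketch of what "instrument the system" can mean in practice. The names and structure are invented, not the original 4GL code; the idea is simply to record the problem sizes the program actually encounters and dump a histogram at exit.

    /*
     * Hypothetical instrumentation sketch; the names are invented and this
     * is not the original 4GL code.  Before rewriting an n-cubed algorithm,
     * record the n values the program actually sees.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_TRACKED_N 64

    static unsigned long n_histogram[MAX_TRACKED_N + 1];

    static void dump_histogram(void)
    {
        for (int i = 0; i <= MAX_TRACKED_N; i++)
            if (n_histogram[i] != 0)
                fprintf(stderr, "n = %d occurred %lu times\n", i, n_histogram[i]);
    }

    /* Call this at the top of the suspect routine with its problem size. */
    void record_n(size_t n)
    {
        static int registered;
        if (!registered) {
            atexit(dump_histogram);
            registered = 1;
        }
        n_histogram[n > MAX_TRACKED_N ? MAX_TRACKED_N : n]++;
    }

The histogram costs one array increment per call, cheap enough to leave enabled for an entire run; the report at exit is what turns a suspicion about complexity into data.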

I also wrote the storage allocator that everyone used. I spent a lot of hours tweaking its performance to make it the fastest allocator of its class. These adventures are detailed in the book IDL: The Language and its Implementation (Nestor, Newcomer, Gianinni and Stone, Prentice-Hall, 1990), now, alas, out of print. One group that used the allocator ran a "PC sampling" performance tool on Unix. This is a tool that samples where the program counter (PC) is executing and, over a long execution, derives a "density histogram" of where the program is spending its time. It clearly showed that a massive amount of time was being spent in the storage allocator. This made no sense to me, but fingers were being pointed in my direction.

So I wrote a little hook in the allocator that counted the number of times it was called. It turned out it was called over 4,000,000 times. No call took longer than the minimum measurement interval of 10µs (approximately ten instructions on our 1-MIPS machine), but 4,000,000 calls at 10µs each is 40,000,000 microseconds, which is 40 seconds. Actually, it was more than that, because there were 4,000,000 free operations as well; those were even faster, but the resulting time was still more than 50% of the total execution time.
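
The hook itself can be as trivial as a counter wrapped around the real entry points. The following is a hypothetical reconstruction in C, not the actual IDL allocator code:

    /*
     * Sketch of a counting hook: a hypothetical wrapper, not the actual IDL
     * allocator.  Every allocation and free bumps a counter, and the totals
     * are reported when the program exits.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned long alloc_calls;
    static unsigned long free_calls;

    static void report_counts(void)
    {
        fprintf(stderr, "allocate: %lu calls, free: %lu calls\n",
                alloc_calls, free_calls);
    }

    void *counted_malloc(size_t size)
    {
        static int registered;
        if (!registered) {
            atexit(report_counts);
            registered = 1;
        }
        alloc_calls++;
        return malloc(size);
    }

    void counted_free(void *p)
    {
        free_calls++;
        free(p);
    }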

Why did this happen? Because, unknown to the programmers, a critical function they were calling in the inner loop of several algorithms would actually allocate a 5-to-10 byte working buffer, do its thing, and release it. When we changed this to be a 10-byte stack local, the time spent in the storage allocator dropped to about 3% of the total program time.
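
The shape of that fix, sketched in hypothetical C (the real function belonged to a library and is not shown here): the per-call heap buffer in the "before" version becomes a stack local in the "after" version, and the allocator drops out of the inner loop entirely.

    /*
     * Hypothetical reconstruction of the pattern, not the actual code
     * (error checking omitted).  The only change is where the 10-byte
     * working buffer lives.
     */
    #include <stdlib.h>
    #include <string.h>

    /* Before: every call in the inner loop pays for an allocate and a free. */
    void convert_before(const char *in, char *out)
    {
        char *work = malloc(10);   /* tiny working buffer from the heap */
        strncpy(work, in, 9);
        work[9] = '\0';
        /* ... do its thing with work ... */
        strcpy(out, work);
        free(work);
    }

    /* After: the same 10 bytes live on the stack; the allocator is never called. */
    void convert_after(const char *in, char *out)
    {
        char work[10];             /* 10-byte stack local */
        strncpy(work, in, 9);
        work[9] = '\0';
        /* ... do its thing with work ... */
        strcpy(out, work);
    }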

Without the data, we would not have known why the allocator accounted for so much time. PC-sampling performance tools are a very weak class of tools and their results are intrinsically suspect. See my article "Profiling for Performance," Dr. Dobb's Journal 18(1), January 1993, pp. 80–87.
