Grand designs

This article was originally published in VSJ, which is now part of Developer Fusion.
The phrase “think global, act local” may be a bit overused, but it encapsulates one of the biggest issues in software architecture – do you go for a local or global architecture? At this point you may be a little vague as to what local or global architecture is, but don’t worry – everyone is in the same position. Most of the time you can tell which direction a decision moves your design – more local or more global – but the point at which it deserves the label “local” or “global” remains ill-defined.

Let’s try to be a little more concrete. As projects become more complicated, there are increasingly decision points where you can choose between a local and a global design. We can ignore simple, self-contained programs that just take in some data, do some sums and terminate. After all, such programs are so simple they don’t really need a grand design. The interesting projects are the ones that work with multiple, possibly complex, data sources, interact with multiple entities and present fresh data to yet more entities. To be even more concrete, let’s consider a simple case study.

Collecting the garbage

Suppose you need to implement a document garbage collection scheme. Each document is either ready to be deleted or not, and the condition that determines this can either change with time or be fixed. For example, some documents might be marked for deletion by the user; others might need to satisfy a more complex condition, such as five newer versions existing and the document being six months old. Think about how you might implement the garbage collector. I hope you can see at once that your first major architectural decision is one of those critical decision points that selects for either a local or a global design. This is a critical decision because, once made, even if made by accident, it is very difficult to un-make. In this sense the decision determines the overall architecture and the rest comes down to decoration and furnishings!

I hope you noticed, before you started to refine your design, that you really do have a choice. You can opt to use a database to record the details of the documents or you can record the details along with the documents as metadata. At the current state of technology it is worth observing that local versus global often does reduce to database versus metadata. To be clear, you could set up a database that records the file’s location and an indicator of its state or you could store the state information about the file as part of its metadata, i.e. store the state either in or “alongside” the file in some way. Exactly how this is done is important because the association between the metadata and file should make them effectively a single entity.
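
To make the metadata option concrete, here is a minimal sketch in Python; the language is incidental, and a .NET implementation might use NTFS alternate data streams or extended file attributes instead. The sidecar file name and the simple “delete after this date” condition are illustrative assumptions, not part of the original article.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

META_SUFFIX = ".gcmeta"  # hypothetical sidecar extension holding the condition

def set_condition(doc: Path, delete_after: datetime) -> None:
    """Store the garbage collection condition alongside the document itself.

    delete_after should be a timezone-aware datetime.
    """
    meta = {"delete_after": delete_after.isoformat()}
    doc.with_name(doc.name + META_SUFFIX).write_text(json.dumps(meta))

def is_collectable(doc: Path) -> bool:
    """Evaluate the condition stored next to the document, if there is one."""
    meta_path = doc.with_name(doc.name + META_SUFFIX)
    if not meta_path.exists():
        return False
    meta = json.loads(meta_path.read_text())
    return datetime.now(timezone.utc) >= datetime.fromisoformat(meta["delete_after"])
```

Because the condition lives next to the file, an ordinary copy or move of the document and its sidecar carries the condition with it; the same cannot be said of a row in a remote database.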

Although I have named the database approach “local” and the metadata approach “global”, you might want to argue with these tags. Surely the metadata is local, because the information is stored locally to the file it applies to? It is certainly local when you think of it as applying to a single file, but when you think of all of the files and all of their metadata as comprising a single information store spread out all over the place, it is surely a global approach compared with gathering all of the data together in a single database-oriented store. What is local and what is global depends on where you draw the system boundaries, and this is one of the reasons why the problem doesn’t arise in small projects. When storing the state of a single file the global/local issue hardly arises – you simply get on with the job and do it.

In this case, however, the example serves to highlight a slightly different point and a better usage of the terminology. You really shouldn’t think of a solution as global or local. You should – returning to the well-worn expression – always think global and, wherever possible, act local. You need to implement a database of global garbage collection status but you should store the data locally in each file rather than collect it together into a single database. What could be a better example of global thinking and local action? In this sense the metadata solution is a local action implementing a global objective and the database is a global action doing the same job.

Pros and cons

So what are the advantages and disadvantages of each approach? The database storage solution has the apparently huge advantage that it is efficient. You don’t have to search for the documents that are potential candidates for garbage collection. All you have to do is scan the list of documents, evaluate their associated conditions and, if appropriate, delete the document. Compare this to the global-thinking/local-action approach. In this case you have to search the entire file system to locate documents that have garbage collection conditions in their metadata, evaluate the conditions and delete the document if appropriate. This sounds terrible!
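
The contrast is easy to see in code. The sketch below reuses the helper functions from the earlier metadata example; the documents table queried in the first loop is an assumed schema, shown only to make the comparison concrete.

```python
import sqlite3
from pathlib import Path

def collect_from_database(db: sqlite3.Connection) -> None:
    """Database approach: only the central list of candidates is visited."""
    for (path_str,) in db.execute("SELECT path FROM documents WHERE collectable = 1"):
        Path(path_str).unlink(missing_ok=True)  # the file may have moved since registration

def collect_from_metadata(root: Path) -> None:
    """Metadata approach: every file is visited and asked to evaluate its own condition."""
    for doc in root.rglob("*"):
        if doc.is_file() and not doc.name.endswith(META_SUFFIX) and is_collectable(doc):
            doc.unlink()
```

The second loop is the one that sounds terrible: its cost grows with the size of the whole file system rather than with the number of candidates.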

There is another old software saying: “efficiency is mostly about hardware”. While it is true that choosing the wrong algorithm can make a task take longer than the lifetime of the universe, efficiency isn’t the main aim of software. Software design is about sophistication and reliability. First it should do the job well, and only then should we worry about how long the job takes.

Of course, in the real world we do have to worry about how long the job takes, but this should be treated as a separate, orthogonal, design factor. For example, what do we get if we adopt the seemingly inefficient global/local solution? The first thing is sophistication at little extra cost. If the document moves, the garbage collection condition follows it. Consider how you would arrange for the database to be updated to reflect a change in a document’s location. Equally, the user can query and change the condition without needing access to a central database. If the user has access to the file, they have access to the garbage collection condition.

What really matters here is that the file is the object that the user regards as the document. For example, if the user deletes the file and creates a new file with the same name – does it have the same garbage collection condition? Clearly if the condition is stored in the document’s metadata it is automatically destroyed along with the file – if it isn’t then the metadata mechanism is fundamentally wrong. In the same way the metadata automatically follows the document as it is copied, moved and generally manipulated in ways that the user finds easy to understand.

Compare this to a reference to the file stored in a separate database, possibly not even on the same machine as the file. There is no such “natural” association between the file reference and its garbage collection condition, and the reference certainly doesn’t automatically track any changes to the file.

Now consider how much complexity you have to add to the database mechanism to make it track the state of the file. If you want to reproduce the close relationship between file and metadata you can’t even use a relaxed “fix the reference when it fails” pattern to keep the two in sync. If you store the references in the database and then wait for them to fail, i.e. to generate a file-not-found error when you try to garbage-collect the file, you can’t simply search for the file because you haven’t tracked its history. It might now represent an entirely different document.

At this point you should be inventing lots of ways of keeping the object and the reference in sync without sacrificing efficiency – but all the ways you can think of are essentially brittle. For example, if you use a “file watcher” pattern and ask the OS to pass each file change to your application you can attempt to keep in sync – but think of all the things that can go wrong! If your “watcher” task fails to load, or fails to interact with the OS correctly, you have an almost unrecoverable loss of sync – you cannot reconstruct the lost history of the document if there is any interruption of the update mechanism.
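
As a sketch of how fragile this gets, the following uses the third-party watchdog package (an assumption; the article names no particular API) to keep an illustrative path-keyed condition table in step with renames and deletions. It inherits exactly the weakness described above: any events missed while the watcher is not running are lost for good.

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class SyncHandler(FileSystemEventHandler):
    """Mirror file moves and deletions into a path-keyed condition table."""

    def __init__(self, table: dict):
        super().__init__()
        self.table = table  # path -> garbage collection condition

    def on_moved(self, event):
        if event.src_path in self.table:
            self.table[event.dest_path] = self.table.pop(event.src_path)

    def on_deleted(self, event):
        self.table.pop(event.src_path, None)

observer = Observer()
observer.schedule(SyncHandler({}), path="/documents", recursive=True)  # path is illustrative
observer.start()  # if this process ever dies, the table silently drifts out of sync
```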

Also consider what happens if there is some sort of glitch in the storage of the data. In the metadata case the problem might affect a document or two, but in the database case a logical inconsistency could destroy all of the information. The database approach has one very big single point of failure: the database itself.

In short, the metadata forms a distributed database, complete with all the advantages and disadvantages that implies. It is a good choice of architecture if you can solve the efficiency problem. But can you solve it?

The spreadsheet paradigm

Currently it has to be admitted that implementing distributed local action in the sense described above is a real problem. To discover exactly what sort of challenge is posed by a distributed implementation it is a good idea to conduct a thought experiment where hardware is in abundance. What sort of hardware would you need to make a distributed architecture really work?

You might be thinking in terms of ultra-fast processors and ultra-fast disk drives. All of these help but there is another way. Imagine that all of the files are represented by people – one person per file. Now imagine that all of the people are gathered together on a football pitch and you simply ask everyone to put up their hand if their file’s “condition” evaluates to true – and then please leave the pitch. This is an example of the “spreadsheet” paradigm, perhaps the simplest approach to parallel processing ever invented. One processor is allocated to each item of data and each processor has a, usually simple, set of instructions to obey. It is also an object-oriented approach in the sense that data and process are tied together. In an ideal computing environment all data would be encapsulated within an object that provided its first level of processing. In an even more ideal world every object would not only have its own thread of execution but its own processor taking care of that thread. There are some cases where this is indeed a possible architecture.
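
As a toy illustration of the paradigm, the sketch below (my own, under the same assumptions as the earlier examples) gives every “cell” its own worker from a process pool and asks it whether its time is up; real hardware rarely offers one processor per data item, so the pool only approximates the ideal.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def cell(doc: Path):
    """One 'cell' of the spreadsheet: a document evaluating its own condition."""
    return doc, is_collectable(doc)  # helper from the earlier metadata sketch

def hands_up(docs):
    """Ask every document, in parallel, whether it is ready to leave the pitch."""
    with ProcessPoolExecutor() as pool:
        return [doc for doc, ready in pool.map(cell, docs) if ready]
```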

Of course, in practice the spreadsheet usually isn’t implemented as a single processor per data item; instead, whatever processing power is available is used to simulate the parallelism. That said, it is worth mentioning that machines have been built that implement the spreadsheet approach to parallel processing with thousands of processors. But back to the real world – in practice the importance of the spreadsheet paradigm is that it provides us with good ways of thinking about distributed processing. The point is that each of the files “knows” if its time is up, but we have to use serial methods to simulate the ideal parallel implementation.

In this case things are simple enough not to pose any real problems. The best solution is to use a lazy approach to garbage collection and run a file scanner in processor-idle time. This is just a simulated implementation of the spreadsheet algorithm visiting each data cell in turn and computing the result – but in the background. Of course as you allocate more and more threads to the process the simulation becomes increasingly parallel.
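
A minimal sketch of the lazy scanner, again with assumed names: a daemon thread crawls the tree and sleeps between files so that the scan only soaks up otherwise idle time.

```python
import threading
import time
from pathlib import Path

def lazy_scan(root: Path, pause: float = 0.05) -> None:
    """Simulate the spreadsheet in the background, one cell at a time."""
    while True:
        for doc in root.rglob("*"):
            if doc.is_file() and not doc.name.endswith(META_SUFFIX) and is_collectable(doc):
                doc.unlink()
            time.sleep(pause)  # hand the processor back between cells

threading.Thread(target=lazy_scan, args=(Path("/documents"),), daemon=True).start()
```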

The result is not a sophisticated solution but it’s a fairly standard one for distributed systems where the outcome isn’t time critical. In this case it doesn’t matter exactly when a document is garbage collected as long as it happens sometime. The same paradigm applies to web crawlers, disk defragmenters, indexing software and so on. Only when the outcome is time critical or the results could depend on the outcome of multiple data items – such as a seat booking database or solving a mathematical problem – do we have to confront the challenge of real-time distributed systems and true parallel processing.

Redundancy is good

Moving on to more general situations, we quickly find it difficult to generalise. There are two distinct aspects to any global solution, corresponding to time and space or, more accurately, processing and data. A global/local approach can involve parallel processing, but more likely involves simulated parallel processing. It also involves distributed data which, in the absence of true parallel processing, is often perceived as an inefficiency and hence, as already discussed, a problem.

It could be that this perception is causing us to miss some good approaches to well-known problems. For example, currently we regard data redundancy as a serious problem. It wastes space and it risks inconsistencies. This is the rationale behind the normalisation of relational databases. Don’t store anything more than once. If you need something in more than a single logical place then use a reference not a copy.

The same argument is currently being played out at a higher level in the vogue for “de-duping” software that removes multiple copies of the same document fragment from large data stores. Reducing redundancy must be good because it reduces storage needs and enforces consistency – a single copy cannot contradict itself. However, redundancy has its good points. It is the basis of all error-correcting codes, for example. Put simply, having multiple copies of data means that you still have it even if you lose copies repeatedly. Redundancy can also make data easier to process. Putting the data into a more amenable format can be worth doing even if it wastes storage.

Many procedures in artificial intelligence make use of distributed coding, i.e. spreading the information over more bits than strictly necessary. This makes the detection and extraction of patterns easier. Certainly it is the case that the need for storage efficiency often works against a sophisticated, flexible and distributed design.

SOA – global or local?

Currently the most popular approaches to distributed architecture are SOA and web services. Microsoft’s upcoming SOA project, code-named ‘Oslo’, is set to bring SOA to mainstream .NET coding. Essentially the “services” idea promises to distribute a system across servers in such a way that the solution is loosely coupled and can scale without a software redesign.

Services promise to end the “silo” mentality, where data is piled high in a single, all-encompassing, but relatively inaccessible database. However, services can only provide a robust distributed system if there isn’t a single choke point – either due to data or due to process. The problem is that simply splitting a system up into services doesn’t necessarily provide a solution that puts enough thought into the global.

It is too easy to decompose a system into atomic services that look distributed but in fact simply concentrate the data and processing in one place. Imagine a garbage collection service consisting of a database behind a service interface. A client could register the collection status of a file simply by connecting. The garbage collector itself could connect to the service to discover if a file needed attention. It’s a nice design, but… it suffers from the same central database problem discussed earlier.

Implementing services encourages you to think in terms of provisioning the service with exactly what it needs to do its job, and this in turn tends to emphasise the use of a central database that gathers in all the information, ready to be used as soon as a client requests service. Consider for a moment how you might build a service that provides a client with garbage collection of data. Actually, once you have considered the proposal, it’s not that difficult. All you have to do is shift the lazy scanner algorithm into a service. Allow the scanner to work as and when appropriate and allow it to build up its own private database of garbage collection conditions. Now a client can query the service to discover the state of any file and can either go and update the file or perform the collection. The service database isn’t guaranteed to be up-to-date in any way – the client might find that the file isn’t where it is supposed to be, and there may be files ready for collection that are not yet listed – but it doesn’t matter in this case. The real data is still managed at a local level along with each file, but the service provides a cached snapshot that allows clients to deal with most of the outstanding work.
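
A sketch of that service, with assumed names and no particular transport: the lazy scanner feeds a private cache, and clients query the cache in the knowledge that it is only a best-effort snapshot while the authoritative condition stays with each file.

```python
import threading
import time
from pathlib import Path

class GarbageCollectionService:
    """Expose a cached snapshot of garbage collection state built by a lazy scanner."""

    def __init__(self, root: Path):
        self._root = root
        self._cache = {}  # path -> last known "ready to collect" state
        threading.Thread(target=self._scan_forever, daemon=True).start()

    def _scan_forever(self) -> None:
        while True:
            for doc in self._root.rglob("*"):
                if doc.is_file() and not doc.name.endswith(META_SUFFIX):
                    self._cache[str(doc)] = is_collectable(doc)  # helper from earlier sketch
                time.sleep(0.05)  # stay lazy: only soak up spare time

    def query(self, path: str) -> bool:
        """Best-effort answer; the real condition still lives with the file."""
        return self._cache.get(path, False)
```

Put the class behind whatever service interface suits (a web service, WCF and so on); the important point is that the authoritative data never leaves the files.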

Conclusion

This is a difficult argument to sum up, but if challenged to express it in ten words or fewer: think global, act local, find ways of making it efficient.


Dr Mike James has over 20 years of programming experience, both as a developer and lecturer, and has written numerous books on programming and IT-related subjects. His PhD is in computer science.
