Database efficiency

This article was originally published in VSJ, which is now part of Developer Fusion.
The period from the late 1990s to the beginning of the 21st century was the era of the in-memory representation of data. Dozens of toolkits became available to provide code-centric abstractions over your relational data. The object-oriented view of data was the highest goal. Databases were nothing more than a cumbersome means for persistence – just dumb data stores. But are these tools – and the associated thought patterns – the best choice for highly scalable, high-performance applications?

Basically all databases in broad use today have been optimised to work with tabular data and its relations. SQL itself is designed around set-based operations rather than the manipulation of single records. This is the direct opposite of the object-oriented view, which is based on working with distinct objects or single entities. Even though you can use a diverse set of collections to represent relationships and sets in your object-oriented applications, one question still remains: does it really make sense to keep and modify in-memory copies of your data in high-transaction environments?

The end of automated tools

In fact, it is quite likely that you will reach the limits of any automated tool – DataSets, DataAdapters or custom object/relational mappers – as soon as your application requires highly scalable transactional database access. But you don’t need to take my word for it – let me instead work with you on a sample application in a number of iterations.

As the basis for this demonstration, I’ll look at a small fragment of an application for maintaining article information, inventory levels, and orders. This application has to track each article’s ID, its name and its available inventory. When an order is placed, your program has to verify the article’s inventory level, because your business analysts have identified a business rule which states that orders may only be accepted for articles which are in stock. In addition, the available amount has to be decremented when an order is accepted.

Step One: Separate your data

In the first iteration, some developers might think about creating a table “Article” with fields “ArticleID”, “Name”, and “InventoryLevel”. But wait! To create highly scalable systems, you have to design your database with more than strict normalisation in mind. You should also identify the transaction volume for each piece of data and group fields accordingly. In the application above, this means that you should actually create two different tables: “Article” with the fields “ArticleID” and “Name”, and a second table “Inventory” with the fields “ArticleID” and “Amount”. This allows your application to always read from “Article” without being affected by any transaction locks on table “Inventory”.
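
The exact column types and constraints are of course up to you and are only assumptions here; as a rough sketch, the two tables could look like this:

CREATE TABLE Article (
	ArticleID int NOT NULL PRIMARY KEY,
	Name nvarchar(100) NOT NULL
)

CREATE TABLE Inventory (
	ArticleID int NOT NULL PRIMARY KEY
		REFERENCES Article (ArticleID),
	Amount int NOT NULL
)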

After identifying the best table structure, you now have to choose the right database access strategy for your code. Should you use an object/relational mapper? DataSets? Or just DataCommands and their Execute*-methods? I tend to recommend a very pragmatic solution for most applications: for all data which is mostly read – and which seldom participates in high-volume transactional UPDATEs – you should simply go for the most convenient approach. Based on your personal preferences and project focus, this might include the use of an object/relational mapper or DataSets.

For all data which is changed at high transaction rates, you should however consider either writing all UPDATE and DELETE statements by hand, or using appropriately optimised stored procedures. Please always keep in mind that automatically generated SQL statements or stored procedures will usually not allow you to achieve the best possible scalability and performance. The reason for this caveat is that the underlying tools (DataAdapters or O/R mappers) have to generate very generic SQL without any possibility of application-specific optimisation.

Optimistic concurrency

One of the core problems of all automatically generated database access methods is that they usually rely on a generic optimistic concurrency model. The main assumption behind these optimistic models is that most of the time only one user will change a given piece of data. Corrective action will only be taken in the negative case – if a previous change by another user is detected during the update.

This assumption is perfectly valid for most near-static data like articles, customers, and so on. The optimistic approach will however have very negative effects if you are working with data which is changed in high-transaction environments. It can cause excessively long database locks or even deadlocks. But what exactly does an UPDATE look like under optimistic concurrency, and why is it bad for transactions?

When changing data in this way, the first step is to load the current value of a record into some in-memory representation, for example into a DataSet. After this, your application can modify the data to represent the target state which should be stored in the database after your transaction. As soon as your application wants to persist the change, the tooling generates an UPDATE statement in which the WHERE clause contains not just the record’s primary key, but also a timestamp or the previous values of the changed fields.

If your Inventory table for article 42 shows a current inventory level of 15, and you would like to store an order for 5 pieces, SQL similar to the following would be generated:

UPDATE Inventory SET Amount = 10
	WHERE ArticleID = 42
	AND Amount = 15
The second part of the WHERE clause (Amount = 15) makes sure that the UPDATE fails – or returns zero hits – if some other user has modified the amount in the meantime. In this case, your application receives an Exception and has to re-load the record, re-apply the changes and try again with a new UPDATE statement.

This is the first major disadvantage of using automatically generated SQL with data which is modified at high transaction rates: if a large number of users are running the same application and modifying the same data, these additional re-SELECT/re-UPDATE cycles will happen quite often.
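
To make this concrete, here is a minimal sketch of such a cycle, based on a SqlDataAdapter whose UPDATE statement is generated by a SqlCommandBuilder. The class and method names are purely illustrative, and the error handling is reduced to the bare minimum:

using System;
using System.Data;
using System.Data.SqlClient;

class InventoryGateway   // hypothetical helper class, for illustration only
{
    public void Decrement(string connectionString, int articleId, int quantity)
    {
        using (SqlConnection connection = new SqlConnection(connectionString))
        {
            SqlDataAdapter adapter = new SqlDataAdapter(
                "SELECT ArticleID, Amount FROM Inventory WHERE ArticleID = @id",
                connection);
            adapter.SelectCommand.Parameters.AddWithValue("@id", articleId);

            // The command builder generates an UPDATE whose WHERE clause also
            // compares the originally selected Amount (optimistic concurrency).
            new SqlCommandBuilder(adapter);

            while (true)
            {
                DataSet data = new DataSet();
                adapter.Fill(data, "Inventory");          // SELECT round trip
                DataRow row = data.Tables["Inventory"].Rows[0];

                int current = (int) row["Amount"];
                if (current < quantity)
                    throw new InvalidOperationException("Article is not in stock.");

                row["Amount"] = current - quantity;
                try
                {
                    adapter.Update(data, "Inventory");    // UPDATE ... WHERE Amount = <old value>
                    return;                               // success
                }
                catch (DBConcurrencyException)
                {
                    // Another user changed the row in the meantime: the loop
                    // re-SELECTs, re-applies the change and tries again.
                }
            }
        }
    }
}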

Step Two: Avoid deadlocks

Let’s further suppose that two purchase orders are entered at the same time. The first PO requires two pieces each of articles 42, 43 and 44. The second PO is also for two pieces of each article, but entered in the reverse order: 44, 43, and 42. Both purchase orders are stored at the same time in two different transactions. Processing starts with the first order and the UPDATE of article 42, which succeeds. Shortly afterwards, depending on the time delay, the second order changes the inventory level of article 44. The first transaction then updates article 43. As its next step, the second transaction tries to update article 43 as well – but this UPDATE cannot proceed, because the corresponding record is still locked by the first transaction. Meanwhile, the first transaction continues and tries to update its last article, 44. This change also blocks, because article 44 is still locked by the second transaction.

This is a classic deadlock scenario: the first transaction waits for article 44, the second waits for article 43, and each record is locked by the other transaction. Most modern databases detect this situation automatically, choose one transaction as the deadlock victim and forcibly abort it. Your application receives a corresponding Exception and has to re-start the transaction from the beginning. The interesting question is whether these deadlocks can be avoided altogether, or at least made less likely.

A general way to avoid this situation is to define and follow a specific update strategy. In the application above, it would be enough to execute the UPDATEs sorted by the affected articles’ IDs: you would then always UPDATE the articles in the sequence 42, 43, 44 – no matter in which sequence the user entered the lines of the two purchase orders. When the processing of the second PO hits the lock on article 42, no deadlock can occur. The first PO completes successfully because neither article 43 nor 44 is locked, the lock on article 42 is released when the first transaction completes, and the second PO can then be stored successfully without any need for a retry.

You can in most cases achieve this kind of UPDATE strategy only with explicit, application-specific database access code. Automated tools will generally process all rows in the exact sequence in which they have been entered, leading to database code which is inevitably prone to deadlocks.
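
A minimal hand-written sketch of this strategy might look as follows. The class name and the use of a dictionary of order lines are assumptions for the example, and the UPDATE statement is kept to a plain decrement – the stock-level check is added in the next step:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class OrderProcessor   // hypothetical class, for illustration only
{
    // quantities: ordered amount per ArticleID, in whatever sequence the user entered the lines
    public void StoreOrder(string connectionString, IDictionary<int, int> quantities)
    {
        using (SqlConnection connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (SqlTransaction transaction = connection.BeginTransaction())
            {
                // Always touch the Inventory rows in ascending ArticleID order,
                // no matter how the purchase order lines were entered.
                List<int> articleIds = new List<int>(quantities.Keys);
                articleIds.Sort();

                foreach (int articleId in articleIds)
                {
                    using (SqlCommand command = new SqlCommand(
                        "UPDATE Inventory SET Amount = Amount - @qty WHERE ArticleID = @id",
                        connection, transaction))
                    {
                        command.Parameters.AddWithValue("@qty", quantities[articleId]);
                        command.Parameters.AddWithValue("@id", articleId);
                        command.ExecuteNonQuery();
                    }
                }
                transaction.Commit();
            }
        }
    }
}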

But even with this ordered UPDATE strategy, the in-memory, optimistic concurrency approach described above still has one drawback. When the second order is processed, your application has to send additional SELECTs and UPDATEs to compensate for the inventory levels which were changed by the first order in the meantime. The total database interaction for these two purchase orders therefore consists of nine SELECTs and nine UPDATEs.

Step Three: Reduce database interactions

The original reason for the first SELECT was the need to read the current inventory level, to verify the stock level (according to the “business rule”), and to decrement the in-memory representation so that it reflects the new stock level after the transaction. Interestingly enough, this complete operation can be handed over entirely to the database. If, for example, you have 15 pieces of article 42 in stock and receive an order for 4 pieces, you could in the worst case work as follows:
  • First send a SELECT to the database
  • Transform the result into a DataSet using a DataAdapter
  • Change the inventory level in the DataSet to the new value of 11
  • Finally use a DataAdapter to send a corresponding UPDATE statement to the database
This requires at least two database round-trips – and two more if someone has changed the value in the meantime – as well as at least two (again perhaps four) transformations into and out of the DataSet’s in-memory representation.

Let’s think the unthinkable: how about not using any in-memory representation of the inventory level at all? Instead of using a DataSet, you would call the necessary SQL statement or stored procedure directly via an IDbCommand object. Instead of pushing your data through SELECT->DataAdapter->DataSet->DataAdapter->UPDATE, you would just send one simple SQL statement to your database.

Sounds too good to be true? Actually, I would suspect that this is the way most people used to design their software before Resultsets, Recordsets, DataSets, and similar constructs were available. In fact, you can implement the same behaviour by just sending this SQL statement to your database:

UPDATE Inventory
	SET Amount = Amount - 4
	WHERE ArticleID = 42
	AND Amount >= 4
This statement decrements the inventory level (Amount = Amount - 4) only if the current stock level is sufficient (Amount >= 4).

After executing this statement – for example via IDbCommand.ExecuteNonQuery() – the return value contains the number of affected rows. If this number equals 1, you automatically know that the business rule has been satisfied and that the inventory level has been successfully decremented. You only have to send an additional SELECT statement if you receive 0 as the return value, which means that there were not enough units in stock when the statement was executed.
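
In code, a minimal sketch of this check could look like the following – the surrounding class, method and parameter names are assumptions for the example:

using System;
using System.Data;
using System.Data.SqlClient;

class InventoryCommands   // hypothetical class, for illustration only
{
    // Returns true if the order was accepted and the inventory level decremented.
    public bool TryDecrement(string connectionString, int articleId, int quantity)
    {
        using (SqlConnection connection = new SqlConnection(connectionString))
        using (IDbCommand command = new SqlCommand(
            "UPDATE Inventory SET Amount = Amount - @qty " +
            "WHERE ArticleID = @id AND Amount >= @qty",
            connection))
        {
            command.Parameters.Add(new SqlParameter("@qty", quantity));
            command.Parameters.Add(new SqlParameter("@id", articleId));

            connection.Open();
            // ExecuteNonQuery returns the number of affected rows:
            // 1 means the business rule was satisfied and the decrement succeeded,
            // 0 means there were not enough units in stock.
            return command.ExecuteNonQuery() == 1;
        }
    }
}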

In fact, if you use SQL Server, you could even send a batched statement consisting of the UPDATE and the SELECT, acquiring the necessary data in just one roundtrip. It is important to note that a change to the inventory level made by a different user in the meantime does not require any additional SELECT/UPDATE statements. Different processing is only necessary if the business rule has not been satisfied.
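
One possible shape for such a batch – the variable and column alias names are assumptions, not requirements – is:

DECLARE @rows int
UPDATE Inventory
	SET Amount = Amount - 4
	WHERE ArticleID = 42
	AND Amount >= 4
SET @rows = @@ROWCOUNT
SELECT @rows AS RowsAffected, Amount AS CurrentAmount
	FROM Inventory
	WHERE ArticleID = 42
Executed via a single ExecuteReader call, this returns both the number of affected rows and the current inventory level without a second round trip.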

When processing two orders at the same time – as in the example above – this allows you to reduce the number of database interactions from 18 SQL statements (nine SELECTs plus nine UPDATEs) to just six UPDATEs.

This automatically leads to shorter transaction times, resulting in a quicker release of database locks. As a consequence, your application will achieve better performance and scalability. And all this by performing just three easy steps: separate your data according to transaction volume, choose an UPDATE strategy to eliminate deadlocks, and reduce the number of database round trips with application-specific, optimised SQL.


Ingo Rammer is an independent consultant, mentor, and developer based in Austria. He helps software architects and developers working with .NET and Web Services technologies. Apart from his consulting services, he is a regular speaker at developer conferences around the world, and has authored two best-selling books, Advanced .NET Remoting and Advanced .NET Remoting in VB.NET (both published by Apress). Ingo is also the Microsoft Regional Director for Austria and can be reached at www.ingorammer.com.
