The Developer and Database Architecture: How much should you know?

This article was originally published in VSJ, which is now part of Developer Fusion.
As a SQL Server developer, you face the task not only of storing data for safekeeping, but also of building performance into the database. Add to this the expectation that the more successful the application, the more it will be used, meaning a greater load on the database, which in turn affects performance.

This article covers some of the basic areas of database development that are foundational to both good database design and good performance. As with most things, it is much easier to build performance into the database while it is being built than to retrofit it later. And of course, the application layer must be carefully designed to make good use of the database: even the fastest database in the world can be rendered unusable by a less-than-optimal application.

Basic normalization

First and foremost, my recommendation is to take the time to build a good relational design, including proper normal form (at least third normal form) and narrow tables. Normalization will not only improve performance, it will make more efficient use of your storage, since redundant data is reduced. The simplest example of normalization is customer information. If you do not normalize customer information into a separate Customer table, you must store both the customer name and address in your Orders table. This wastes storage space by repeatedly storing the same information on every order for that customer, and the Orders table now has the overhead of maintaining that same data repeatedly. Since the customer address is stored multiple times, when an address update is required, it must be updated everywhere it exists. Normalizing the data into separate tables means each customer has only one place (row) where that address is maintained. If you find yourself with wide tables (i.e. a large number of columns), stop and ask this question: “Does every single non-key column in this table relate directly to the primary key?” If not, you should normalize into multiple tables.
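
As a minimal sketch of this (the table and column names are illustrative, not taken from a real schema), the normalized design might look like this:

    CREATE TABLE Customer
    (
        CustomerID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerName nvarchar(100) NOT NULL,
        Address      nvarchar(200) NOT NULL
    );

    CREATE TABLE Orders
    (
        OrderID    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerID int NOT NULL REFERENCES Customer (CustomerID), -- foreign key instead of repeated name/address
        OrderDate  datetime NOT NULL
    );

    -- An address change now touches exactly one row:
    UPDATE Customer SET Address = N'1 New Street' WHERE CustomerID = 42;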

Know your application

Understand your application and how it will utilize the database. Will the application be performing significantly more reads than writes? Is it a transactional (OLTP) application, or is it more of a querying/reporting application? This understanding will help guide you in determining how, and how much, to index your tables. Remember that indexes are the key to fast data retrieval, but they must be maintained on every data change in order to stay useful. Having too many indexes can slow down INSERT statements, because the index entries must be maintained in addition to the actual table insert; the same principle applies to updates and deletes. So if fast transaction times are critical, choose your indexes judiciously. Conversely, if your requirements lean towards reporting, irrespective of data entry speed, then more indexes are warranted.
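
For example (the index is hypothetical, on the illustrative Orders table above), a reporting-heavy workload might justify a nonclustered index on a commonly searched column, accepting the write-time cost:

    -- Speeds up reports that filter or sort on OrderDate, at the cost
    -- of extra maintenance work on every INSERT, UPDATE and DELETE
    CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
        ON Orders (OrderDate);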

Every table should have a clustered index. This is important from an indexing/speed perspective, and also from the physical data-storage perspective, since table rows are physically arranged in the order of the clustered index. By carefully choosing your clustered index, you can pre-determine where in the table your users will be inserting or updating rows. Choose the wrong order and you potentially create a hot spot in your table: if all the activity occurs in one area of the table because of similar clustered key values, you risk having two users needing to insert or update the same data page. This introduces resource contention for that page, as well as a potential for page splits, both of which will have a noticeable effect on transaction response times.
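
As a sketch (the detail table is hypothetical), you can state the clustered index choice explicitly rather than accepting the default; whether you pick an ever-increasing key, which appends new rows at the end of the table, or a key that spreads inserts out is exactly the trade-off described above:

    -- PRIMARY KEY CLUSTERED states the clustering choice explicitly;
    -- an IDENTITY key is ever-increasing, so new rows append at the end
    CREATE TABLE OrderDetail
    (
        OrderDetailID int IDENTITY(1,1) NOT NULL,
        OrderID       int NOT NULL,
        Quantity      int NOT NULL,
        CONSTRAINT PK_OrderDetail PRIMARY KEY CLUSTERED (OrderDetailID)
    );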

Scaling For Growth

Always keep in mind that the database is a shared resource; your app user/session will not be the only one accessing it, so everything you do with database resources must take account of other users. This is an obvious point, but it is worth stating because any efficiency you gain at the database level improves things for all users of the database. For example, if you have a stored procedure that takes 1.0 second to run and you optimize it to run in 0.5 seconds, the difference may be imperceptible to an individual user. However, if that stored procedure is called hundreds of times per minute, the savings add up: half a second saved on each of, say, 200 calls per minute frees 100 seconds of database time every minute, which could allow the user load to increase without changing the hardware configuration. That is, the same database/server is able to support greater use/load. So always be as efficient as possible.

Unlike the application, which can be relatively easily replicated in a multi-server/farm scenario, the data layer is normally a central repository to which all users must come to get or put data. Scaling out the database layer is possible, but it can be complex without creating duplicate copies of the data.

Don’t make the mistake of assuming that you can just add more hardware resources to the database server to increase performance. Although faster CPUs and/or additional memory might be an option in some cases, there is usually a physical limit to how much of these resources you can add, and your hardware budget may already rule such options out. So database efficiency does not always translate into faster client/app response times; sometimes it translates into scalability, letting the same hardware serve more users.

Stored procedures: more than a necessity

Stored procedures are a marvellous invention. They allow SQL Server to determine the best possible execution plan for your query ahead of time, avoiding that cost on every call. SQL Server is able to cache execution plans in memory just as it does data, so the plans are readily available when needed. Stored procedures also represent a layer of data-access logic that does not reside in the compiled application code. By putting the data-access logic in stored procedures, you allow the database to use the optimal data-access methods to process the data, and you gain an abstraction layer that can be tuned and optimized without requiring changes to the application. As long as the inputs and outputs of a stored procedure remain the same, the calling app code is not affected. Of course, you must still ensure that the results of your procedure are what the application expects, but you can make any number of optimizations or changes within the stored procedure without the client or app layer having to be changed or redeployed.
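
A minimal sketch of that abstraction, with illustrative procedure and table names: as long as the parameter and result set stay the same, the body can be rewritten or re-tuned freely:

    CREATE PROCEDURE dbo.GetCustomerOrders
        @CustomerID int
    AS
    BEGIN
        -- The body can be re-tuned at any time (different joins,
        -- indexes, logic); callers only depend on the parameter
        -- and the shape of the result set
        SELECT OrderID, OrderDate
        FROM   dbo.Orders
        WHERE  CustomerID = @CustomerID;
    END;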

Stored procedures should be kept as short as possible. This is especially important for procedures that manipulate tables (i.e. contain DML). If you find yourself with a stored procedure that is several hundred lines long, you should probably take a step back and rethink the task. It is not a crime to have long, complex stored procedures, but that complexity usually results in excessive CPU or I/O use, or it may indicate that business logic (which should normally reside in the application layer) has been placed in the stored procedure. Although SQL Server allows this, it can lead to having to maintain numerous single-purpose stored procedures, and even to multiple procedures that are simply different versions of the same logic. The disadvantage is increased maintenance as well as poor use of the procedure cache. The purpose of the procedure cache is similar to that of the data cache: you want the database to cache commonly used stored procedure query plans so that no time is wasted recompiling and recreating them.

With SQL Server 2005 you now have Common Language Runtime (CLR) integration, which means you can write stored procedures (as well as other types and functions) using any .NET Framework language. For data-intensive tasks, Transact-SQL (T-SQL) is still the best method; however, for CPU-intensive tasks you now have the option of writing stored procedures in Microsoft Visual C# or Visual Basic .NET, for example. So you can choose the appropriate language for your stored procedure, depending on the task requirements.
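
On the T-SQL side, a CLR procedure is registered from a compiled assembly and then called like any other procedure. A sketch, with hypothetical assembly, path, class and method names:

    -- CLR integration must first be enabled on the server
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;
    GO
    -- Register the compiled .NET assembly (path and names are hypothetical)
    CREATE ASSEMBLY MyClrRoutines FROM 'C:\Assemblies\MyClrRoutines.dll';
    GO
    -- Expose a static method in the assembly as a stored procedure
    CREATE PROCEDURE dbo.ProcessImages
    AS EXTERNAL NAME MyClrRoutines.[StoredProcedures].ProcessImages;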

Just the facts

Avoid unnecessary communication between the application and the database. SQL Server returns an “n rows affected” message for each command processed. In most cases this message is not needed, and most applications ignore it, yet it still travels over the network to the app server. Using SET NOCOUNT ON at the beginning of your stored procedures eliminates these messages and saves the network bandwidth they would otherwise consume on the way to the app servers or clients.
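
In practice this is a one-line addition at the top of each procedure; a sketch with illustrative names:

    CREATE PROCEDURE dbo.UpdateCustomerAddress
        @CustomerID int,
        @Address    nvarchar(200)
    AS
    BEGIN
        SET NOCOUNT ON;  -- suppresses the "n rows affected" messages

        UPDATE dbo.Customer
        SET    Address = @Address
        WHERE  CustomerID = @CustomerID;
    END;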

Also, do not send data back and forth between the database and the app server just to be manipulated and stored again. Often the application must manipulate a large set of rows. The actual change needed may be simple or complex, but the overhead of sending all the data over the wire to the app layer, manipulating it, and then sending it back to the database can be both slow and expensive. In these cases it makes sense for a stored procedure to manipulate the data in place in the database, avoiding the time and bandwidth needed to move the data round-trip between the database and app layers.
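
For example, rather than pulling every row to the app layer, changing a value, and writing each row back, a single set-based statement does the work in place (the table and the pricing rule here are purely illustrative):

    -- One set-based UPDATE replaces fetching every row to the app layer,
    -- changing it there, and writing each row back
    DECLARE @CategoryID int;
    SET @CategoryID = 3;  -- hypothetical parameter value

    UPDATE dbo.Products
    SET    UnitPrice = UnitPrice * 1.10  -- illustrative 10% price increase
    WHERE  CategoryID = @CategoryID;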

In your SELECT queries, always explicitly name the columns you want returned. This is especially important when doing joins.

For example, if you have three tables with 10 columns each and you join them with SELECT *, you will get all 30 columns back, even if you only needed seven of them. By naming only the columns you need, you also give SQL Server the chance to satisfy the query from an index alone, if those columns are covered by one. Using SELECT * is inefficient and simply lazy coding. Furthermore, if the table schema later changes, the result set of SELECT * changes with it, possibly breaking your application.
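
A sketch of the difference, using the illustrative tables from earlier:

    -- Wasteful: returns every column from the three-table join
    SELECT *
    FROM   dbo.Orders      AS o
    JOIN   dbo.Customer    AS c ON c.CustomerID = o.CustomerID
    JOIN   dbo.OrderDetail AS d ON d.OrderID = o.OrderID;

    -- Better: name only the columns the application needs; if an index
    -- covers them, SQL Server may answer the query from the index alone
    SELECT o.OrderID, o.OrderDate, c.CustomerName, d.Quantity
    FROM   dbo.Orders      AS o
    JOIN   dbo.Customer    AS c ON c.CustomerID = o.CustomerID
    JOIN   dbo.OrderDetail AS d ON d.OrderID = o.OrderID;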

Conclusion

Although the database architecture topics covered here may seem rudimentary, I have mentioned them to illustrate how they affect performance and scalability. If you design and build your database for heavy load, you should see good performance under a normal load, and you will have a measure of confidence that your database can grow to meet future demands.


Noah Gomez is a Mentor with Solid Quality Learning in the Relational Engine Practice. He specialises in Very Large Databases (VLDB) and Data Architecture. He is a member of the Professional Association For SQL Server (PASS). He can be reached at [email protected].
