Why you want to be good at data modeling
To be really successful at working with databases, there are many different disciplines to become proficient in - things like disaster recovery, security management, data integration, and more. But there are less than a handful of things you want to be really good at; becoming super-skilled in them will help you enormously in your database career. And believe it or not, one of these key areas is data modeling and its kissing cousin, physical database design.
That's right - in the pursuit for better overall database performance, many professionals are ignoring what is perhaps the number one contributor to excellent RDBMS speed - the physical database design. This article addresses this issue and demonstrates how database pro's need to return to their foundational practices of building the right database for the right type of job.
Why is Design Overlooked?
Simply, for two reasons:
- It's difficult to perform correctly
- It takes time (and sometimes lots of it)
Designing a high performance database is complicated work. It takes skill and experience to build a good data model that's forward engineered into a physical design that runs as fast as a lightning. But sadly, experienced personnel are at a premium these days so junior or completely green workers are called upon to design and build a database. The mindset of needing a staff of experienced logical data modelers was thrown out in the early nineties when CASE tools that promised the world cracked under the strain of increasing business workloads. Since many CASE tools failed to deliver what they had promised, and because many stressed logical design as the necessary forerunner of a good system, logical design was discounted with respect to its importance. Corporations had endured enough projects that never got off the drawing board and so shortcuts to production deployment became the accepted mode of development. The end result was - and still is - that data modeling and design aren't taken nearly as serious in overall system development as they should be.
The second reason quality designs are overlooked when the topic of performance is discussed is that a lot of up-front time is needed to create a good design. And time isn't what a lot of companies have these days. The application lifecycle has never been shorter in corporations than it is right now. Projects that would have taken years to complete just five years ago are being thrown up in six months or less. Obviously, to accomplish such a feat requires one of two things (1) superior personnel using state-of-the art software tools or (2) the elimination of necessary tasks from the application construction equation. Usually, one of the first to go is the database modeling/design phase. Instead of sitting down and intelligently laying out the necessary components and objects of a database, the database structure is built in the development phase alongside the code base used to run the application. The end result is a design that never had a chance to succeed.
Instead of concentrating on good physical database design, database professionals look to other methods to enhance performance. However, when they do, they risk missing the boat entirely and could end up dazed and confused with a database that simply won't perform.
Exposing the Number One Performance Myth of Today
Whether it's in the realm of database technology or any other discipline, some maxims are whispered around the campfire so much that they are taken for gospel on face value and never questioned. Especially when supposed "experts" mouth the words. Such is the case with a database performance myth that has been around for as long as I can remember. It goes something like this:
"70 - 80% to seventy percent of a database's overall performance is derived from the code that is written against it."
This is a complete untruth, or at the very least, an overestimation of the impact that properly written SQL code has against a running physical database. Good coding practices definitely count (many times heavily) toward the success of any database application, but to state affirmatively that they make a contribution of over two-thirds is a stretch. The reason this proverb cannot pass the reality test is that it is stated independent of what good or bad code can do in the face of poor physical design. For example ...
Many years ago, I was called into a life insurance company to investigate the cause of a poorly performing management reporting system. The database was Oracle that resided on a fairly robust IBM AIX machine. The front end consisted of a GUI report application that constructed a number of management summary reports. The problem was that most of the reports took an abnormally long time to run, with the average response time being ninety minutes from start to finish for a single report. Since the reports needed to be rebuilt several times a day, this was a completely unacceptable response time scenario for the end users. After ruling out typical "quick-fix" solutions of giving Oracle more memory and ensuring no hardware bottlenecks existed (swapping, I/O contention, etc.), I asked to see the code being used to create one of the reports. After all, I had (like most other IS professionals) been brought up through the database ranks being told that code was what caused a system to live or die from a performance standpoint. On the surface, nothing appeared wildly out of place in the SQL code. A fairly sophisticated join predicate linked together what seemed to be only six tables. The database itself was only about 500MB or so in size, so volume was not an issue. I then asked an important question: "Are these tables or views?" "Views" was the response. I then decided to extract the definition of the first view used in the code, and what I discovered was nothing short of amazing. The first view used in the code consisted of a join of 33 tables with 27 of the join predicates being outer joins!
I can say without hesitation that no amount of code rework or any SQL rewrite techniques could salvage that situation. The developers were doing the absolute best they could, however they were up against a database that had been normalized to the nth degree by an overzealous, logical-model minded DBA. The physical design had been implemented in a way that made no sense from a performance standpoint and nothing short of a major design change was going to alter the situation.
The physical design constrains all code - good or bad - and has the capability to turn even the best written SQL into molasses. After all, how can a SQL developer obtain unique key index access unless the physical index has been created and is in place? How can a database coder scan only the parts of a table that they need unless that table has been partitioned to accommodate such a request? Only when a solid physical data model and design is put in place - a design that fits the application like a glove - can SQL code really take off and make for some impressive response times. But good design comes first.
So what did I do with the poorly performing management reporting system? The DBA steadfastly refused to modify any of his maze-like design so changing the primary physical database structure was out. I grabbed one of the developers and, using a 4GL-development environment, created a small, customized ETL product that the clients could utilize. The user could bring up a GUI front end, and with a few mouse clicks, create a set of denormalized reporting tables and then build the reports they needed. Amazingly, the extract, transform, and load procedure, coupled with the report creation, crossed the finish line in less than seven minutes. A 94% reduction in response time was achieved in the same hardware environment, but with a different, improved physical design.
The Link between Performance Monitoring and Physical Design
Every database professional I know wants to be thought of as an expert in database tuning. The consultants that make the most money out in the field are the ones who can miraculously transform a sluggish, wheezing database into one that runs fast and efficiently. The books that fly off the shelf in the technical bookstores are the ones that promise secret hidden tips on accelerating the performance of database systems. And almost every database administrator covets their complicated SQL scripts that dig into the heart of a database's internals and regurgitate mountains of difficult to interpret statistics. But do those down in the database trenches really know what to do with all the information produced through performance monitors and SQL scripts? How does one really monitor a database for performance and become good at making a difference in the response times end users experience?
The key to understanding the discipline of performance monitoring is this: When you monitor a database for performance, you are really validating your physical design implementation. If the performance monitor you choose to use is blasting you with flashing lights, alarm bells, and pager alerts, it's probably because your physical design is failing. If all is quiet on the scene in your performance monitor, then your physical design is likely a current success. It really is almost as simple as that.
To be sure, there are performance situations that really aren't impacted by the physical design directly. Lock contention, for example, is mostly an application or coding issue. But on a grand scale, your performance monitoring output speaks volumes to your talents as a database designer. Got physical I/O contention problems in your database? Then you likely didn't segment or partition the tables, indexes, and storage structures properly in your physical design. Observing too many large table scans in your database? Chances are you didn't adhere to the proper indexing strategy. I could continue down this path for some time, but you get the idea.
The tragic thing is that much of today's mindset dismisses the idea that altering and improving a database's physical design will yield the largest possible performance benefit. Part of the reason for this is that modifying the design of a physical database - especially one that is currently in production - is no easy task and oftentimes requires healthy amounts of off-hours work by the administrator. So instead, many take the quick fix approach to performance problems, which equates to throwing hardware at the situation in most cases. Either the server box itself is upgraded, more processors are introduced to the mix, or a decent amount of RAM is added. In the short term, things appear to get better, and if the database is relatively static in nature, things may remain that way. But if the database is dynamic and the data/user load continues to grow, the situation will slide back to the point where it once was.
The reason for this is a foundational one. If the foundation is flawed, then the house needs to be put in order at that level before anything else is done. But much of the way performance monitoring and problem resolution is performed today isn't handled that way. It's like a homeowner discovering that his or her house has a cracked foundation so they put a new coat of paint on the outside to temporarily cover up all the cracks and then they declare all is well. Even worse, the homeowner could attempt to add on to their home in hopes of improving the value or appeal. But let's face it - with a cracked foundation, who will buy it? The same thing holds true for adding more hardware onto a poorly designed database. You may throw more RAM, etc., at a badly performing database and for a while those performance cracks get covered up. But over time, as more data and users are added, those foundational cracks will reappear and must be dealt with yet again. Regardless of the effort involved, it's much better to attack the foundation problem in order to correct the problems permanently.
As an example, a database administrator may use his or her performance monitor to find out that excessive physical I/O is occurring on a system. The DBA may erroneously conclude from the situation that more RAM is needed or that the data caches should be enlarged to improve the scenario. But what if the problem instead stems from the fact that too many large table scans are occurring? Most RDBMS's will quickly recycle the data obtained from large table scan operations to keep stale data out of the cache. To be sure, the problem could be a coding problem where developers aren't using the right indexes in the SQL predicates. Or, more likely, the database may not have the correct indexes in place to assist the code in avoiding the many large table scans. If this physical design flaw can be correctly identified, then no extra RAM may be needed at all.
What about the link between availability and design? According to Oracle Corporation's own studies of client downtime some years back, the largest percentage, up to 36%, are design-related issues. If that isn't a wake-up call to get serious about design, I don't know what is.
How to Make the Biggest Performance Impact in your Database
If this article has convinced you that proper physical design should be your number one performance goal as a MySQL pro, then it is time to get serious about how you manage your physical design lifecycle. So how do you get started in making a noticeable difference in the physical designs of the databases currently under your care, and those you are destined to encounter and/or build in the future? The first step to take is a mental one and involves making the commitment to pay more attention to excellent physical design and data modeling techniques. As an aside, I might mention that all project management personnel need to make this same commitment as the effort involved in guaranteeing a solid physical design foundation will take more up-front resources. But make no mistake, it is an understatement to say that it is time well spent.
The next step involves education on the part of the database designer. Of course, the best way to become a design guru is to put time in the trenches and work with every style of database - heavy OLTP, data warehousing, and Web 2.0 apps. You will learn very quickly which designs stand and which physical foundations crack when you go up against heavy-duty Web and mega-user systems. Of course, there are also a variety of good educational classes and books on the subject of physical design to aid in the learning process.
Creating robust efficient models and physical designs can be difficult and intricate work. You will need to arm yourself with some serious power tools that have the capability to slice through the difficulties involved in building and retrofitting complex physical database designs. Long gone are the days when a DBA or modeler could handle most of their work with a SQL query interface and a drawing tool. Today, relational databases are just too robust and contain too many complexities for such primitive aids.
At a minimum, you will need two things flanking both ends of your arsenal: a serious data modeling tool and a robust performance monitoring product. We have already established the fact that performance monitoring is really the validation of a database's physical design. When foundational cracks are identified with the monitor, you will need a high-quality design tool to aid in rectifying the situation.
Another great benefit of using good data modeling techniques along with a tool to create your physical database designs is that the same tool can be used for change control purposes. The tool that is normally used in an ad-hoc manner for creating the original database design is also good for graphically redesigning a database in real-time mode. Further, keep in mind that you should always have a way for protecting designs that are in place and are working that goes beyond traditional backup and recovery situations. Having such "snapshot backups" of your database's schemas will prove invaluable when disaster strikes.
A DBA that I once worked with was managing a large packaged financial application when she learned the value of practicing this type of change control management. She had to make a complex change to one of the database's critical tables and had thought she had built the right script to do the job. Unfortunately, she didn't have everything in place and when she ran her change job, she ended up losing a number of important indexes that existed on the table. Worse yet, she also lost a lot of complex security grants that were on the table. The next day, many parts of the application slowed down to a snail's pace as queries that used to complete in an instant now were taking forever, and some users couldn't access any data at all. The changed table was identified as the source of the problem, but while my DBA friend discovered that the table now had no indexes, she didn't know which columns had been indexed (something not uncommon in huge financial applications) and didn't know how the security grants had been set up. Through trial and error, she was able to get her indexing and security scheme back in place, but not before a lot of time had been lost.
This is one case where a good data modeling tool that has change control tool utilities built in can save you. Nearly every good tool in this category offers a synchronization feature that allows a DBA to compare an up-and-running database with a saved snapshot of that database's object definitions. Once differences are identified, a click of the mouse can restore any missing objects and security privileges.
But a change control tool can also help you in your physical design iterations. By periodically capturing changes you make to the physical design of your database, you can learn what worked and what didn't. And if you make an "oops" and actually cause more harm than good, you can instruct your data modeling tool to automatically put things back to the way they were.
It's clear that being good at data modeling and physical DB design is key to having a well running database-driven app, so if you haven't made it a priority in your MySQL career, you might want to reconsider things. The good news is, you don't have to break the bank to get a good MySQL-based data modeling and design tool. You can download and use MySQL Workbench right now and get started with either your next MySQL application or reverse-engineer any existing DB and start using the change control aspects of the tool.
You can download MySQL Workbench from the MySQL web site at http://dev.mysql.com/downloads/workbench/. The first version of the tool only supports Windows, but other platforms are coming soon. Give the tool a try and let us know what you think - your feedback and comments help ensure we continue to build high-quality tools at MySQL. You can also visit the MySQL Workbench forums and both get help and post feedback at http://forums.mysql.com/index.php?151.
And as always, thanks for your support of MySQL and Sun!