Go On, Live a Little. Denormalize Your Data

I have years of professional experience with and completed several academic courses about relational databases. This is a pretty familiar topic for me, and I’ve worked with a bunch of them: SQL Server, SQLLite, MySQL, PostgreSQL, Oracle, and even MS Access. I’ve always found them, paradoxically, to be both comfortable and awkward. They’re comfortable because of my familiarity with them and awkward because I’ve used them almost exclusively with OO languages and incurred the impedance mismatch.

While RDBMS are in my comfort zone, NoSQL alternatives are comparably alien but also intriguing. Ayende has made some pretty convincing arguments as to why one should think outside the RDBMS, including a .NET Rocks episode in which he talks about constraints, such as expensive disk space, that applied decades ago when RDBMS were designed and are no longer of concern. I’m receptive, my interest is piqued, and I’ve played a bit with MongoDB To understand how these things work. But I think it really clicked for me the other day when I was staring at a schema that had some definite complexity issues. I think I finally groked the use case for document databases.

I was working with a pre-existing schema recently in which (mildly obfuscated) you have a “customer” who can be part of one or more “programs.” The entities “customer” and “program” each have their own properties and, as any diligent normalizer will tell you, that information should not be repeated. So they get their own tables. Since a customer can participate in multiple programs and multiple customers can be in the same program, this is an “M to N” relationship — represented, predictably, by a linking table. What’s more, customers can participate repeatedly in programs so the linking table will sometimes have multiple participation records per customer-program pair, differing by a participation timestamp. I’m sure those of us who have done any significant database development have played out similar scenarios countless times.

As I wrangled Entity Framework to deal with this and other similar relationships that were making it relatively complicated to pull certain information, I started thinking that this seemed harder than it needed to be. There was no use case for handling “what if the program name or another attribute changes in the middle of its offering,” and there really are no good answers to that question. Right now, the implementation is “you can’t do that,” but this is hardly satisfying. To administrative users, it seems arbitrary and like a system shortcoming. But the alternative is unpleasant, too — additional complexity to allow “partial programs” in the schema or the code.

I started thinking about normalization and whether it really mattered here. Allowing duplication is a cardinal RDBMS sin, but it sure would make life easier for a lot of use cases. I mean, imagine a scheme where there was just a table of customers, and participation was recorded as you went, with the program information being duplicated in each participation record. The downside is the information duplication, but the upside is that changing programs midstream is trivial and there is less table overhead and complexity to maintain. No M to N relationships and worrying about whether a program is current or replaced by a new one.

And that’s when it hit me. RDBMS are great for maintaining, well, relational information. For instance, if I work at a company, I have a boss, and my boss has many reports. Modeling the manager-line employee relationship is a great fit for a relational database because it’s about modeling relationships, like “manager to reports.” Tree structures of composition tend to be a good fit as well, such as assemblies like cars and other pieces of machinery. Parts consist of sub-parts and so on. These are relationships in the sense that operations like “delete” and “replace” make sense.

But what about the one-and-done concept of a customer participating in a program one day? That isn’t a relationship any more than I form a relationship with a waiter that I tip and then never see again. That’s something that happened once — not an ongoing arrangement. And that’s where things really snapped into focus for me. RDBMS and the normalized model is great for managing relationships but is not necessarily the be-all-end-all for managing extremely transactional data in the sense of event recording.

And while I’m probably not in a position for this project to quickly introduce document databases and stir up the works too much, I am considering modeling what I might do storage-wise after a document database there in SQL Server. Perhaps it’s time to throw caution to the winds and start allowing some information repetition in there for historical modeling. If the experiment fails, I can always normalize it. But if it succeeds, adding document databases to the mix will be a lot easier since I’ll just be learning the syntax of the API as I go rather than the fundamental underlying concepts and use cases.