In religion, politics, and other endeavors, Truth is an elusive goal. Depending on your beliefs, Truth might be found in the Bible, the Torah, the Koran, the Democratic Party platform, or the lessons you learned while traipsing through the woods. Truth, in most endeavors, is highly subjective.
Truth is subjective in programming, too. If you have any doubt, just ask a dozen different programmers to name the best programming language or the best indentation style, or to say whether domain-driven design is a good idea, or whether inversion of control is just a fancy way to say, “do things in the most complicated way possible.” There are plenty of “truths” in programming.
But in a computer program, there can be only one source of Truth. That is, there can be many representations of the data that your program relies on, but only one representation can be considered the Authority. If you create different views of the data or cache some data in order to speed access, you are making a copy that at some point will differ from the Authority. It is no longer Truth.
Once you do this, you have to make a decision. Your choices are:
- Periodically invalidate the cache so that it will be updated from the Authority. This ensures that your cache will reflect the Authority with a maximum latency of some given period of time. The cache represents Truth as it existed the last time the cache was refreshed. This technique works well if your program can function with data that is slightly out of date. We use this technique in the crawler to cache robots.txt files. If we always required the most up-to-date robots.txt, the crawler would have to issue two Web requests for every page it downloaded (one for robots.txt, and then one for the page). Instead, our crawler caches a site’s robots.txt file for a maximum of 24 hours. Truth, in this case, is “as it existed the last time I downloaded the robots.txt,” which will never be more than 24 hours out of date.
- Update the cache whenever the Authority changes. This sounds like a good idea, but there are drawbacks.

  First, the Authority has to be built with caching in mind, and must supply an API that clients can plug into. The clients have to accept the Authority’s caching API, which might be overly restrictive.

  This can also put an unacceptable performance burden on Authority updates, especially if more than one client is updating its cache. If the Authority has to call each client’s update method, then update speed is limited by the speed of all the subscribed clients. If, instead, the Authority posts updates to a message queue, then there won’t be a perceptible delay in Authority updates, but there will be a non-zero and potentially large latency in the cache updates.

  There are many ways of reacting to an update message posted by the Authority. The simplest is to invalidate any cache of the affected data. That can be quite effective, but you have to be careful that your caching layer knows exactly what data it’s holding on to. That turns out to be a rather difficult task, at times.

  This update strategy is usually used when you want to maintain an up-to-date view of the Authority data, but with a different organization. It works best when updates are infrequent. If you’re doing frequent updates to the view, you probably want to re-think the Authority and have it maintain a view that’s more amenable to however you’re querying it.
- Understand that your alternate view is a snapshot of Truth as it existed at some point in time, and it is never updated. This works well if you’re reporting on a snapshot, but it’s not a good general caching solution.
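To make the first two choices concrete, here is a minimal sketch in Python. It is not the crawler's actual code: the `TTLCache` class, its `fetch` callback, and the injectable `clock` are all names invented for this illustration, and a plain dictionary stands in for the Authority. Entries expire after a fixed period (the first choice), and `invalidate` is the hook an Authority's update message could call (the second choice).

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TTLCache(Generic[V]):
    """Hypothetical cache whose entries are re-fetched after `ttl` seconds."""

    def __init__(self, fetch: Callable[[Hashable], V], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # asks the Authority for the current value
        self._ttl = ttl          # maximum age of a cached copy, in seconds
        self._clock = clock      # injectable so the example can fake time
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get(self, key: Hashable) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # Truth as of the last refresh
        value = self._fetch(key)  # expired or missing: go to the Authority
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop a cached copy, e.g. in response to an update message."""
        self._entries.pop(key, None)


# Stand-in Authority: in the crawler this would be a Web request.
robots = {"example.com": "Disallow: /private"}
now = [0.0]
cache = TTLCache(lambda host: robots[host], ttl=24 * 60 * 60,
                 clock=lambda: now[0])

first = cache.get("example.com")
robots["example.com"] = "Disallow: /"   # the Authority changes
stale = cache.get("example.com")        # cache still holds the old copy
now[0] += 24 * 60 * 60 + 1              # a little over 24 hours later
fresh = cache.get("example.com")        # expired, so re-fetched
```

Between the change and the expiry, `stale` is exactly the "Truth as it existed the last time the cache was refreshed" described above. Calling `cache.invalidate("example.com")` right after the change would close that gap, but only if something reliably tells the cache that the change happened.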
There are hybrid solutions that combine the first two options, but in general they’re pretty rare. It seems like the height of folly to implement the third option if you’re working with live data, but it’s distressingly easy to fall into that trap inadvertently. For example, you might build a denormalized view of some data in your database because querying the normalized view is prohibitively expensive. You initially use that denormalized view for reporting purposes, but then you foolishly decide that you can use it for other things, too. Pretty soon, large parts of your system are depending on the denormalized view, and changes to the Authority aren’t reflected, or aren’t reflected quickly enough. At that point, your system is broken because your user interface isn’t reflecting Truth.
My experience with relational databases has been that if you denormalize the data, you cannot rely on it reflecting any further changes. You can try to write your code so that it maintains the denormalized view whenever updates are made to the normalized data, but those efforts will almost certainly fail. This is especially true over time, when the original developer moves off the project and somebody new who doesn’t understand all of the denormalized structures is assigned to the project. The result is … well, it’s not pretty. I’ve never seen a case in which trying to maintain two separate views of a database worked well over the long term. Don’t try it!
Where relational databases are concerned, your best bet is to design your database so that you can update and query it efficiently. If it’s still too slow after you’re sure that your design is as good as it can be, then you throw hardware at the problem: more memory, a faster processor, faster drives, or a distributed database.
Note that I’m not necessarily advocating a fully normalized database design. There are very good and compelling reasons to design your database to be partially denormalized. What I’m arguing against is maintaining a denormalized view in addition to a fully normalized view. I know that it’s possible with triggers and other such database machinery. It can even be done well if you fully understand the ramifications of what you’re doing and if you are meticulous in adding and maintaining your triggers. I’ve found, though, that most development teams are incapable of that level of attention to detail.
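To give a sense of what that trigger machinery involves, here is a small sketch using Python's built-in `sqlite3` module. The schema is invented for illustration: `orders` plays the Authority, and `customer_totals` is a denormalized summary that three hypothetical triggers try to keep in step. It is an assumption-laden toy, not a recommended design.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build an in-memory database with a trigger-maintained summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE orders (            -- the Authority (normalized)
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount INTEGER NOT NULL      -- cents, to avoid float rounding
    );
    CREATE TABLE customer_totals (   -- denormalized view of orders
        customer TEXT PRIMARY KEY,
        total INTEGER NOT NULL
    );

    -- Every write path on the Authority needs its own trigger.
    CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
        INSERT OR IGNORE INTO customer_totals (customer, total)
            VALUES (NEW.customer, 0);
        UPDATE customer_totals SET total = total + NEW.amount
            WHERE customer = NEW.customer;
    END;

    CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
        UPDATE customer_totals SET total = total - OLD.amount
            WHERE customer = OLD.customer;
    END;

    CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
        UPDATE customer_totals SET total = total - OLD.amount
            WHERE customer = OLD.customer;
        INSERT OR IGNORE INTO customer_totals (customer, total)
            VALUES (NEW.customer, 0);
        UPDATE customer_totals SET total = total + NEW.amount
            WHERE customer = NEW.customer;
    END;
    """)
    return conn


def totals_from_view(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT customer, total FROM customer_totals"))


def totals_from_authority(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))


conn = make_db()
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 [("alice", 1000), ("bob", 500), ("alice", 250)])
conn.execute("UPDATE orders SET amount = 700 WHERE id = 2")
conn.execute("DELETE FROM orders WHERE id = 3")
```

After these statements, `totals_from_view(conn)` and `totals_from_authority(conn)` agree. Even so, this toy already has a gap: deleting a customer's last order leaves a zero-total row in `customer_totals` that the query against `orders` would not return. Multiply that by every table and every new developer, and the level of attention required becomes clear.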
Segal’s Law states, “A man with a watch knows what time it is. A man with two watches is never sure.” The same holds true when you have more than one source of Truth in your system. You have to understand that, unless you’re querying the Authority, the data you get back will be, at best, slightly out of date. At worst, it will be so wildly out of date that it’s just plain wrong.