Friday, 26 August 2011

Designing for Database Replication

If you are writing a web application by yourself or in a small team, the temptation is usually to get it finished as soon as possible because even small sites don't take 5 minutes and once it is finished, there is a sense of satisfaction. The problem with this is that many things aren't really considered early on. Sites like Twitter didn't expect so many people to sign up so quickly and if the site wasn't able to scale, people would have got fed up and left or gone to a rival site. We should be thinking about scalability early on in the design and production cycle.
You might decide that due to the nature of the system, it is unlikely ever to exceed a certain size. Perhaps it is an internal company site and would never require more than, say, 1000 users doing simple things. If this is the case, fine. However, any public site runs the risk that it will become popular and an increase in user numbers means an increase in database, memory, network and processor load. If this is the case, you should design in the ability to scale as early as possible, even if your site will not be scaled initially.
One of the realistic possibilities is that you will need multiple database servers and if you need the ability to write to more than one of these at the same time, you will need some way to replicate the data. The difficulty with this relates to the actual design and might not be too bad.
For example, take the most basic replication issue of multiple inserts into the same table. If you have an index column (which you should) and you create user John Smith with ID 1000 in DB1 and then someone else creates user Michael Jones with ID 1000 in DB2, what happens when these databases replicate? In this case, although we might presume that the different names are different people, it would be possible for 2 John Smiths to create an account at the same time (in the gap between replication cycles). In this scenario, the best idea is to use a GUID for the id instead of an integer. Since these are globally unique, you would never have 2 users with ID 1000 on 2 different DBs. This also applies to foreign keys which would also use the GUID and would not be in danger of incorrect linking when copied from one table to its counterpart in the second database.
Updates can be more tricky. If someone sets a piece of data in the row for John Smith in DB1 at the same time as someone changes it in DB2, replication will do 1 of 2 things. If different fields have changed and this can be detected, it will simply take both changes and merge them. If the same field has changed or there is no way in your DB to detect which field has changed, you have the following options:

  1. Flag it to a user who needs to manually decide what to do
  2. Accept the low risk of this happening and simply take the latest change as the correct one
  3. Create a manual system which records the changes so they can be automatically or manually merged
  4. Apply a design which avoids the chance of this happening.
Really, to design out of these scenarios is always best because it is less maintenance. In our example, suppose we run a site like Facebook. One way we could avoid the problem is only to allow writes on a single database server, the others are read-only. The second option would be to ensure that updating user "John Smith" is only permitted on a specific 'home' database in which case multiple changes are applied sequentially and are replicated to other databases correctly.
Really the options depends heavily on the type of application you are creating. You might need to perform some load testing on your environment to find out where it is stressed under high-load. Don't spend time improving things that are working OK.
Post a Comment