Friday, 22 January 2016

Designing for a distributed scalable web application

Scalable Applications

At PixelPin, we produce an authentication mechanism that uses pictures instead of passwords. It is fairly lightweight, since each login only requires a couple of page hits (sometimes only one), but because of the potential for massive uptake across the globe, it has been designed from the outset to be scalable.

However, scaling within a single data centre only gets you so far. As our income increases, we can afford more web servers in the cloud, higher-spec web servers and a higher database service tier, which could provide all of the performance we ever need for the UK and Ireland. It might even work for Western Europe, which is not very far electronically from the Microsoft data centre in Ireland. What it won't do is work globally. Although it might work acceptably for many people, it will never be great with the latency incurred when criss-crossing the globe.

Designing for scale therefore involves a number of decisions, compromises, costs and complexities that at some point you will need to decide on - hopefully before your site becomes really popular. No-one wants to be rushed into making a major redesign to cope with a sudden increase in the number of users.

The first stage involves reasonably low cost and simple measures and will allow a basic scalability of your web application to a reasonable level. Of course, each method described will depend on the exact balance of your application's CPU, network and database loading.

Stage 1

Stage 1 involves a basic understanding of how an application works in a scalable data centre. Put simply, your application needs to work across more than one web server, which means a couple of things if you want to stay sane and have a working application.

Firstly, you want the servers to be copies of each other, not each with their own configuration. This is because you want to scale upwards automatically by cloning web servers, not by manually configuring each one.

Secondly, you cannot store session data in-memory on the web server. Even if you try to send your users to the same server each time (sticky sessions), this cannot usually be guaranteed, and if a web server falls over, which they do, you want another to start up automatically without dropping a load of sessions in the process. This requires either a shared-session mechanism or designing the application to be stateless, which is possible but usually slightly more complicated to implement. Shared session is easy on Azure because they provide a mechanism that works invisibly in .Net and shares session across N instances (3 in our case), which provides resilience and sharing. You can share using file systems or the database, but these are less optimal for performance reasons since the database is often a bottleneck and there are timing/consistency issues if you write anything asynchronously. You might be surprised to find out that there is no real open-source option for shared session; even things like memcache are not designed for session storage and have several failure modes that make them less than ideal for the job!
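To make the stateless option a bit more concrete, here is a minimal sketch in Python (not what PixelPin runs, and not the Azure mechanism). It assumes every web server shares one secret key and that the only session data needed is a user id and an expiry time; the names SECRET_KEY, issue_token and read_token are made up for the example.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical key: in practice it would come from configuration that is
# identical on every cloned web server.
SECRET_KEY = b"replace-with-a-key-shared-by-all-web-servers"


def issue_token(user_id, ttl_seconds=1200, now=None):
    """Build a signed token the browser can hold instead of server memory."""
    now = time.time() if now is None else now
    payload = json.dumps({"uid": user_id, "exp": int(now) + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload)
    signature = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest().encode()
    return (body + b"." + signature).decode()


def read_token(token, now=None):
    """Return the user id, or None if the token is malformed, tampered or expired."""
    now = time.time() if now is None else now
    try:
        body, signature = token.encode().split(b".")
    except ValueError:
        return None
    expected = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(signature, expected):
        return None
    data = json.loads(base64.urlsafe_b64decode(body))
    if data["exp"] < now:
        return None
    return data["uid"]
```

Because any server holding the key can check the token, it does not matter which instance handles the next request. The trade-off is that you cannot easily revoke a token before it expires, which is part of why this route is usually more work than shared session.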

You should always make other basic optimisations, or at least consider them: correct caching of assets (including cache busting), Content Delivery Networks, bundling and minification, and just good discipline when creating pages to reduce the overall load on the network.
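Cache busting is the least obvious item in that list, so here is a small Python sketch of one common way to do it: put a hash of the file's content into its name, so that the URL changes whenever the content does and the asset can otherwise be cached for a long time. The function name and the 10-character digest length are arbitrary choices for the example.

```python
import hashlib
from pathlib import PurePosixPath


def busted_name(path, content):
    """Return the asset path with a short content hash inserted before the extension."""
    digest = hashlib.sha256(content).hexdigest()[:10]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))
```

A build step would rename the files this way and write the new names into the pages; most bundling tools can do the equivalent for you.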

At the database level, you should make the database as dumb as possible and move any processing into a software layer that can be scaled up, the database being the hardest thing to scale. It is possible, with correct design, to defer scaling the database as long as possible or even not at all if your application is not database heavy.

Good use of memory caching can also speed things up and reduce CPU, disk access and network traffic. Various people have said that hardware is cheap compared to people, so upgrading a database server to use 2TB of RAM might avoid having to employ a database team to change the architecture! You should test and monitor cache usage, however, since you can easily use up memory this way and not necessarily get many cache hits for the effort. You might be better off limiting the caching to the few areas that make a big difference or that are used most often.
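As a rough illustration of monitoring cache hits, the Python standard library's functools.lru_cache keeps hit and miss counters that you can read back. In this sketch load_profile is a made-up stand-in for an expensive database lookup and the maxsize of 1024 is an arbitrary cap on memory use.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)  # bounded, so the cache cannot grow without limit
def load_profile(user_id):
    # Stand-in for a slow database query.
    return {"id": user_id}


def hit_rate(cached_func):
    """Fraction of calls served from the cache (0.0 if it has not been used yet)."""
    info = cached_func.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0
```

If the hit rate stays low under real traffic, the cache is costing memory without saving much work and is probably in the wrong place.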

Stage 2

So at stage 1, you can scale up in a single data centre and get reasonable performance across a reasonably large geographic area but if you need a global presence you have a couple of options. This is stage 2 time and will involve more cost and complexity since you will probably have to add some techniques that you haven't needed until now.

The first question is whether you can have a geo-presence by simply duplicating the systems across the continents and simply using DNS to send traffic to the "closest" data centre. Clearly, this involves no more work than before but has a severe limitation that you cannot share data directly between the systems, particularly user data for people who travel the globe. You can implement workarounds like having a single login location and then forwarding the user to their "home data centre" but this means that travellers experience poor performance sometimes just to make your life easier! Maybe that is acceptable if it is an edge case but if not, you will need some ability to connect the separate systems together!

There are a few basic guidelines about making geo-located systems work correctly and with the least amount of effort.

  1. All synchronous reading and writing should occur locally to the data centre because latency is not only poor globally but more importantly, it is volatile and might seem OK one day and not the next. Anything that needs to be sent to another data centre needs to be done asynchronously.
  2. Database replication is a hard problem. Design around it wherever possible and prefer to have a single master (authority) and multiple read-only replicas.
  3. Remove or reduce the need to write to the master database by separation of concerns, either into multiple master databases or preferably to local databases that do not require replication.
  4. If you are designing for this scale, you will need to consider how your system will fail and include facilities such as automatic failover. You can use buffers to help you when a destination system has gone down but these won't work forever so you need to consider an efficient disaster recovery process. Would you even know if your application went down? Have you ever tested it?

So at PixelPin, our architecture is currently a single database with failover in a single data centre. What we are going to do as we design for the future involves a couple of techniques, although some of the work we built in from the beginning makes this much easier.
  1. We plan to have a single master database located in Ireland and several read-only replicas, probably using Azure geo-replication so we don't need to manage it. We can only have 4 replicas so this is slightly limiting but will buy us time for now.
  2. Any temporary data for sessions will no longer live in the main database but will instead live in a local database, probably using Azure DocumentDB.
  3. Any logging/audit type data will no longer live in the main database but will be passed via a local message queue to a system running on an in-house server, which will pull this data back to base into an audit database. It will never need to go into the main database.

A local queue is essential in this case because the message sending needs to avoid the latency of the global internet and therefore the call to the queue needs to be local. This means that the receiver of these messages will need to pull them from multiple queues but this way, the receiver is the system that experiences the delays and latencies and not the user.
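The shape of that arrangement can be sketched in a few lines of Python. This is only an in-process model using queue.Queue as a stand-in for a real message queue service, and the region names are invented for the example.

```python
import queue

# One queue per data centre (hypothetical region names).
local_queues = {"ireland": queue.Queue(), "us-east": queue.Queue()}


def log_event(region, event):
    """Called in the request path: only ever touches the queue in the same region."""
    local_queues[region].put_nowait(event)


def drain_all(queues, store):
    """Run by the central receiver: it visits every region and absorbs the latency."""
    moved = 0
    for region, q in queues.items():
        while True:
            try:
                event = q.get_nowait()
            except queue.Empty:
                break
            store.append((region, event))
            moved += 1
    return moved
```

The web application only ever calls log_event, which returns immediately. Everything slow happens inside drain_all, which runs back at base on its own schedule.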

Our existing CDN will be enabled for more global locations to help feed our static assets and the only question left is how we handle image downloads which could either always come from the origin server (since they are cached anyway) or they could be effectively replicated to other sites on-demand for situations where a user has travelled to another part of the world.

One step at a time!

Thursday, 21 January 2016

Facebook and Google (and other) OAuth2 Logins are NOT always authentication

As people are trying to understand more about web security, we have confused two concepts that are related but not the same: authentication and logging in. You might naturally think that authentication is required to login (at least in systems that have some kind of "login" system) but that is not true and it has come out of the OAuth2 myth.

The OAuth2 myth is that OAuth2 is Open Authentication 2. It is not. It is Open Authorisation 2. If you do not understand the difference, let me define it: Authentication is making a (semi-provable) claim about who you are to the system. Authorisation is asking what an entity is allowed to do or delegating permissions on behalf of someone else. Note that you do not need to be authenticated to use authorisation since an anonymous user will still have some types of permissions on a site.

Firstly, why is authentication only semi-provable? Well, as a person (more generally, an entity), you are supposedly proving who you are by one of the 3 factors of authentication, e.g. a password, but that does not prove to any high level of assurance who you are, since a password is transferable. In fact, in most systems, even 2nd and 3rd factor logins are not always proof at any absolute level since even if the hardware is reliable, it is sometimes possible to break a system elsewhere. We accept that risk to provide something usable. It would be impractical to require some kind of ultimate proof of identity to access our GMail although, of course, people are working on ways to make this proof of identity process more secure.

So let's start from the basis that we will trust that knowledge of a password is an acceptable level of proof for authentication. If we write our own authentication code, we are likely to have a username (or email address) and a password. When done this way, we specifically tie authentication to logging in but as Dominick Baier stated, "we don't do our own auth any more do we? This is the 21st century". There are many reasons why delegating our authentication to another provider is desirable from the convenience of not needing people to sign up to our service, through to the fact that a single-sign-on provider is more likely to be implemented securely than doing it ourselves (I am not aware of any SSO providers who have been hacked).

So we can just use SSO right? Facebook, Google and several others provide OAuth2-based login capability, it is pretty easy to integrate and it works. However, we have unknowingly believed the myth that OAuth2 is providing us with authentication. In most cases it isn't!

A few years back, there were three seemingly competing systems for our use in centralised authentication. SAML-based systems were designed for the corporate world and are heavyweight, XML-based (with all the pain that involves) and largely impractical for the public web. Even corporates struggle with SAML! The second, OpenID, was designed for the web but was also very complicated, both to understand and to implement - perhaps because it was trying to do too many different things at once. Sure, some of these might have been optional but perception is everything, and people decided they didn't like it. Along came Twitter, who were trying to create their OpenID implementation and realised that there was no system to delegate API access, something that was important at the time as APIs were becoming all the rage. They created OAuth v1 with a group of other people and produced something that was much simpler than OpenID. Somehow, people got confused about the distinction and started to see and use OAuth (and its simpler and more popular version 2) as an authentication mechanism.

For some reason, this myth has never been adequately dispelled and Facebook, Google and other providers are very commonly used for authenticating into web sites - many times, not using the optional mechanism that adds authentication to the protocol (see later).

You could argue that authentication is a specific subset of authorisation i.e. User X has permission to log in and although that is true, it is the breadth of the spec that provides various other ways in which a system can still "authenticate" using OAuth2 without the user being present and therefore without true authentication taking place!

Let us consider a simple way in which we can see that Facebook login is not authentication. We assume that if someone enters their password in a normal situation, they are present and have proved to an acceptable level that they are who they claim to be - they have authenticated. What happens when you login to the same site with Facebook? You might have noticed that if you are already logged into Facebook in the browser (which we all are, right?), and since the Facebook session seems to last forever, you are NEVER asked for a Facebook password. You simply give permission for the calling application to access your details, return to the site and are logged in! In other words, for most of us, all an attacker needs is access to our computer and they can "authenticate" into thousands of web sites without ever needing to know our password - it's not authentication at all!

Why is it this way? OAuth2 is not designed for authentication but for API access. If a site is accessing our FB timeline or our Twitter feed, it would be unusable to ask the user to enter their password every time access is made. In fact, the access is allowed to be made when the user is not even present in most cases and can be made over several days or weeks. The user has the ability to revoke this permission but it still holds that the concept of the user being present is never required in the protocol - it is up to the provider to decide how they want to give the user the choice to allow or disallow access.

So can OAuth2 never be used for authentication?

There are two ways that it can. The first of these is to do what we have done at PixelPin. We have limited the parts of the protocol that have been implemented (for instance, there is no long term or unattended access to our system) and we also require the user to authenticate every time PixelPin is invoked. This way, we remove the vectors that would allow the data to be retrieved automatically at a later date. It is also for this reason that we do not need to revoke permissions for applications since they are only given access exactly once per session after authentication.

The second way, which is necessary for sites that do have APIs and do need to allow long-term access, is to use OpenID Connect. The OpenID group have presumably noticed that OAuth2 was being used incorrectly and insecurely and have produced a mechanism designed to sit above OAuth2 so that it can be easily introduced into existing providers. It is more specific about the fields that are used in the handshake (OAuth2 is deliberately vague and encourages non-standard implementations!) and it also provides a signed authentication token as part of the handshake, which gives the calling site a provable way to tell what authentication was used for this user. The mechanism either allows the calling site to require authentication during the OAuth2 handshake or at least lets it know what has or hasn't taken place, so that certain future actions might trigger a full authentication.
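As a rough sketch of that last point, the OpenID Connect ID token can carry an auth_time claim (when the user last actually authenticated, in seconds since the epoch), and the calling site can compare it against its own maximum age. The Python below assumes the token's signature has already been verified by a proper library and that its claims have been decoded into a dictionary; needs_fresh_login is a made-up name.

```python
import time


def needs_fresh_login(id_token_claims, max_age_seconds, now=None):
    """Decide whether the site should send the user back for a full authentication."""
    now = time.time() if now is None else now
    auth_time = id_token_claims.get("auth_time")
    if auth_time is None:
        # The provider did not say when the user authenticated, so do not
        # assume they are present.
        return True
    return now - auth_time > max_age_seconds
```

A site might accept an old login for browsing but call this with a short maximum age before something sensitive, such as changing an email address.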

At PixelPin, we are looking to implement OpenID Connect over the next few months, despite not needing to on our system, to ensure that we are using the latest security best-practices and to give people the option to plug PixelPin into a standard OpenID Connect plugin (whereas currently, most plugins require specific code for each authentication provider).

Watch this space!

Monday, 11 January 2016

Client authentication broken (Azure) - cert expired

I woke up to a message on my phone, "The system is not working". Gulp, we only have one system so if it isn't working, that's pretty serious.

Ignoring the red herring caused by Virgin Media returning random IP addresses to DNS requests, I realised that the client certificate used for authentication between web app and web service had expired. I didn't actually realise this was an issue for client auth, although it is fair enough. I then scrambled around for an hour or so trying to fix it, test it and deploy all the changes. Here is what I learned that should save you some time!

Note: You will often need to use mmc.exe to manage certificates. When you run it, choose File->Add/Remove snap-in->certificates and either choose Local Machine or Current User (or both if you need to).

  1. You need a cert that has not expired!
  2. Your certificate needs to have Client Authentication as one of its permitted uses.
  3. It needs to be in the relevant store (usually LocalMachine/My, which needs to be matched on your local machine to test it with).
  4. You might need to give permission for all users to read the private key of the cert in mmc.exe depending on what testing you are doing.
  5. The certificate needs to chain to a root certificate. Theoretically, you could do this on Azure by installing your own root cert on the cloud service but this is not directly supported by Visual Studio and would need to be done via Powershell or similar in the startup script.
  6. You need to make sure that you have uploaded the new certificate into the certificates tab of the portal (for app and web service).
  7. You need to reference the new certificate in the Azure project (role) settings so that Azure installs it into the cloud instance from the certificates tab.
  8. You need to change the thumbprint in the service settings of the web service to reference the new thumbprint. It is easiest to copy this from the Azure portal because the thumbprint is displayed in a single block of text.
  9. Upload the modified web service and access it from the browser. You should be offered a dialog to select a client cert; choosing it should allow you to access the svc of the web service. If it is not in the list you are shown, it is either expired, not present in the store for the Local User or does not have Client Authentication as one of its uses. It might also be missing if the certificate does not chain on your local machine. You can check this in mmc.exe by double-clicking the certificate and choosing the Certification Path tab.
  10. You will probably need to refresh the service references for the web app (see below).
  11. If you have used Windows credential manager for client certs, you will need to update this to use the new cert otherwise svcutil.exe will fail with 403 (forbidden).
  12. Refresh the service references in the web app.
  13. Change the thumbprint for the client cert in the web app (probably web.config).
  14. If you can test the web app locally, it will save you the upload time to find out if it doesn't!
  15. If you get a 403 at any point, it means the certificate cannot be found or read. This might mean the certificate is not in the correct store (remember to differentiate between CurrentUser and LocalMachine) or you do not have permission to read its private key from the store.
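The real lesson of step 1 is to find out about expiry before your users do. As a minimal sketch, this Python function takes a certificate's notAfter date in the text format that the standard ssl module uses (for example in the dictionary returned by getpeercert()) and says how worried you should be. The 30-day warning window and the function names are arbitrary choices for the example.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before the certificate expires (negative if already expired)."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def expiry_status(not_after, warn_days=30, now=None):
    """Classify a certificate as 'expired', 'renew soon' or 'ok'."""
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "expired"
    if days < warn_days:
        return "renew soon"
    return "ok"
```

Run from a scheduled job that emails on anything other than "ok", something like this would have replaced the early-morning phone message with a reminder a month earlier.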

Friday, 8 January 2016

Android "This certificate is not from a trusted authority" or "no peer certificate" errors - fine on desktop!

Moving from Azure VMs back to Cloud Services

Sooooo, I am running a cloud service in a pair of VMs. It started life as a cloud service (PaaS) on Azure but after someone broke the Azure Powershell tools, it no longer deployed and so I went traditional and installed it manually.

Recently, I decided to change it back. VMs are OK in that you can deploy very quickly using Powershell/SVN etc. but they also require regular maintenance and monitoring and they are also slightly more expensive than the equivalent cloud services.

The Problem

Anyway, deployed it all, tested it in Chrome and it looked fine so I changed over the DNS to point to the new cloud service. Our SSO service seemed to work OK but the Android app didn't so after a hasty swap back, I opened the Android app in the debugger to find out what was happening.

Getting to the web service call showed the exception "No peer certificate", which I understand but which didn't make sense. I visited the site in the browser and even ran a couple of SSL tests like the Qualys one and they reported no problems. Clearly there was a chain problem. As a quick check, I also tried to visit the same URL in the Android browser and got another error, which was more useful: "This certificate is not from a trusted authority". It also showed the certificate chain and the fact that the chain was somehow broken - again, I knew what this meant in theory but didn't understand why it was OK from the desktop and from the online test tools.

A clue was that the Qualys test showed 2 certificate paths, one that pointed to a new CA root certificate and another longer chain that used something called a cross-root to point to an older root certificate, something done for backward compatibility reasons (but one which causes problems!).

The Cause

Windows (and other servers?) use the issuer and subject names to match certificate chains up. It turns out that although I had my own "COMODO RSA Certification Authority" intermediate certificate (which used the cross-root and old root cert), it was also the name of a trusted root certificate in Windows - a newer cert.

Windows scores the paths (apparently) and all things being equal, chooses the shorter one as the standard certificate path to use in the SSL handshake - or at least, the validation process on the client does this.

In this case, the shorter path used a newer root certificate that simply isn't present on Android (not sure how often these are updated). You can see what is supported under settings -> security -> trusted credentials.
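To see how one name can give two paths, here is a toy model in Python. It is not how Windows actually builds or scores chains (I only know what I observed); it just matches issuer names to subject names the way described above. Apart from "COMODO RSA Certification Authority", all of the names are invented for the example.

```python
from collections import namedtuple

Cert = namedtuple("Cert", "subject issuer label")


def build_paths(leaf, pool):
    """Return every chain from leaf to a self-signed cert, matching by name only."""
    paths = []

    def walk(cert, path):
        path = path + [cert]
        if cert.subject == cert.issuer:  # self-signed, so treat it as a root
            paths.append(path)
            return
        for candidate in pool:
            if candidate.subject == cert.issuer and candidate not in path:
                walk(candidate, path)

    walk(leaf, [])
    return paths


leaf = Cert("www.example.com", "Intermediate CA", "leaf")
intermediate = Cert("Intermediate CA", "COMODO RSA Certification Authority", "intermediate")
new_root = Cert("COMODO RSA Certification Authority",
                "COMODO RSA Certification Authority", "new root")
cross = Cert("COMODO RSA Certification Authority", "Old Root", "cross-signed")
old_root = Cert("Old Root", "Old Root", "old root")
pool = [intermediate, new_root, cross, old_root]

paths = build_paths(leaf, pool)
shortest = min(paths, key=len)
```

Here paths contains two chains and shortest ends at the new root, which is the one an older Android device does not have. Taking new_root out of the pool leaves only the longer chain through the cross-signed certificate to the old root.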

For some reason, desktops can handle this, probably because their root certs are more up to date but they also cache intermediate certificates, which might make a site work because of a previously visited site.

The Solution

You have to break the path you don't want by deleting the relevant certificates (the ones whose names conflict). In this case, I deleted the newer root cert by logging in with remote desktop. I also REBOOTED and then only a single path gets returned and it all works again.

Clearly, I have to be aware that there is a chance this problem will rear its head again if the cloud services are ever deployed again from scratch. I should probably write a script to delete the offending certificate but for now I will add it to the checklist for deployment so a quick check can ascertain if the problem is still resolved or not.

Thursday, 7 January 2016

Publish-AzureServiceProject - There is an error in XML document (12, 90)

It would be helpful if errors had something more useful in them. This is a generic error in an XML parser and it doesn't even tell you which document has failed.

It turns out that I had changed the thumbprintAlgorithm to SHA2 because I had a new SSL cert, but even SHA2 certs use SHA1 for the thumbprint, so the setting should stay as sha1 in the cscfg file.
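The reason is that the thumbprint is not part of the certificate at all: it is a SHA-1 hash calculated over the certificate's DER-encoded bytes, whatever algorithm was used to sign it. A small Python sketch shows the idea, along with a helper for tidying up a thumbprint pasted from somewhere that adds spaces or invisible characters. Both function names are made up for the example.

```python
import hashlib


def thumbprint(der_bytes):
    """SHA-1 over the DER-encoded certificate, as upper-case hex."""
    return hashlib.sha1(der_bytes).hexdigest().upper()


def normalise_thumbprint(pasted):
    """Keep only the hex digits from a pasted thumbprint, in upper case."""
    return "".join(ch for ch in pasted.upper() if ch in "0123456789ABCDEF")
```

You would feed thumbprint the raw bytes of a DER-encoded .cer file and compare the result with the normalised value from your config.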

Anyway, I found out what the error was partly due to what I know I had changed and also by passing the -debug flag to Publish-AzureServiceProject which showed the specific error even though -verbose didn't!