Thursday, 20 October 2016

Azure App Services not ready for the big time yet?

I really like the idea of App Services for web apps. It basically looks like cloud services but with shared tenancy so lower cost per "server" than a cloud service.

The other really big winner for me is the much quicker scalability when scaling out. Cloud services understandably take a long time to scale out and cannot handle short traffic bursts for that reason.

App Services on the other hand can scale out in seconds. I think they do this by sharing the code and just increasing the number of virtual servers.

It looks great on paper and deployment is as easy as Right-Click->Publish from visual studio, using web deploy and therefore taking seconds. Cloud service deployment is still cryingly slow!

So that's the good news, what's the bad news?

It doesn't seem to work properly!

We set up an app service plan (Standard) and deployed a web application that calls a web service, which accesses a database. We also use Azure Redis Cache.

We fired a load test at the system to see what sort of load we could handle and setup auto-scale for a CPU of 80% to allow form 2 (minimum) to 6 (maximum) instances.

So what happens when we run the test? Loads of Redis timeout errors. Hmm, the requests per second are only about 100/150 and the response times are less than 1 second so why is the redis 1.5second timeout occurring?

We had a few pains, where re-deployments didn't seem to fix the problem and even though we could see the remote files easily using the Server Explorer in Visual Studio, the presence of the machine key element didn't seem to remove the errors related to "could not decrypt anti-forgery token". Hmmm.

To be fair, it did expose a problem related to the default number of worker threads. In .Net, this is set to the number of cores (1 in our case) and any new threads can take a minimum of 500mS time to create, easily tipping the request over the 1.5 seconds. So I increased the number of threads available but still we were getting timeouts even when the number of threads was way below the threshold for delays.

Maybe we were pegging CPU or memory causing the delays? No idea. The shared tenancy of App Service hosting does not currently allow the real time metrics to display CPU or memory for the individual virtual server so that's no help. It also wasn't scaling up so I guessed it wasn't a problem with resources and why should it be at such a low user load?

I finally got fed up and created a cloud service project, set the instances to the same size as the app service instances, using the SAME redis connection and ran the same test. Only a single error and this pointed to an issue with a non-thread safe method we were using (which is now fixed). No redis timeouts and a CPU and memory barely breaking a sweat.

We ended up in a weird situation where we couldn't seem to get much more than about 100 requests per second from our AWS load test but that is another problem.

Why does a cloud service running the same project work where the same thing on App Services doesn't? I couldn't quite work out what was wrong. A problem with Kudu deployment? Some kind of low-level bug in App Services? A setting in my project that was not playing nice with App Services?

No idea but there is NO way I am going to use App Services in production for another few months until I can properly understand what is wrong with it.

These PaaS tools are great but when they go wrong, there is almost no way to debug them usefully. The problem might have been me (most likely!) but how can I work that out?
Post a Comment