How Azure Web Sites Sucked in Production

R&D engineer and solutions architect at ELEKS

A couple of months ago, we started a major redesign of our blog. Since the new theme required a much higher level of customizability, we had to migrate the blog from Blogger to WordPress and choose a new hosting provider.

Having played around with Windows Azure quite a lot in the past, we decided to give Azure Web Sites a shot. Until then, we had only used the old compute model: for example, our main website, eleks.com, is a regular Azure Web Role on ASP.NET MVC 3. When Azure Web Sites came out, it looked like a service with all the features we had been missing for a long time.

And it delivered! In just two minutes, we created a site and a database, deployed a WordPress template, set up Git deployment, and were ready to proceed with development. Since we needed a custom domain and did not expect heavy traffic, the Shared plan ($10/mo) was a reasonable option.

Everything was awesome. We implemented and tested the new design and were ready to go live. And then it began.

Strike One: Memory Quota

Being aware that the Shared plan had certain quotas for CPU, storage and RAM, we added:

  • define('WP_MEMORY_LIMIT', '128M'); to wp-config.php.
  • php_value memory_limit 128M to .htaccess (edit: which was our own mistake, since WAWS ignores .htaccess. Instead, it uses web.config for server configuration and .user.ini for PHP settings)

One day, we were adding some final strokes, when suddenly we saw this:

Azure: Site Not Available

We went to the Management Portal and…

Azure: Web Site Suspended Message

Azure: RAM Quota Exceeded

So, despite the configuration values, one developer and two editors were able to take down a Shared website due to a RAM quota. The troubling thing is that we received no prior notification whatsoever. The site just went offline for an hour.

With traditional hosting, even if you are on a free plan, you can at least expect an email like: “Dear Customer. Not cool. Do something about your resource usage or next time we’ll suspend you”. In our case, we were on a paid $10 plan, and Azure just pulled the plug. To see a warning before this happens, you need to be on the Management Portal. Luckily, this happened at a pre-release stage, so we went to the Scale tab and added 3 more instances, which, I suppose, increased our monthly bill from $10 to $40.

Good thing we did. When we launched and served about 10K pageviews during the first day, the RAM meter fluctuated between 1.25 GB and 1.7 GB (edit: this is not our typical daily load; we owed it to the HN effect, and usually the traffic is much lower).

All in all, the mechanics of memory measurement are still a mystery to us. It looks like Azure hosts PHP applications on IIS and measures the memory footprint of w3wp.exe. And if the PHP handler leaks under IIS in production, there’s very little you can do as a user.

Strike Two: DB Outages

Unlike a traditional hosting environment, Azure Web Sites relies on another cloud provider, ClearDB, to provide MySQL infrastructure. ClearDB describes their platform as ‘Completely Fault Tolerant, with Global Multi-Master Design, 100% uptime guarantee’. So we were a bit surprised to see this:

Azure: DB Outage

We looked into the logs, and…

[10-Apr-2014 20:58:59 UTC] WordPress database error Server shutdown in progress for query SELECT t.term_id FROM wp_terms [...text truncated...]

Okay, accidents happen. But then we noticed that almost every day the blog could not establish a DB connection for 30-60 minutes.

Is there a quota for DB queries that was not mentioned on the Azure website? Sure enough:

[11-Mar-2014 11:15:44 UTC] WordPress database error User 'b73a2dfc60748c' has exceeded the 'max_questions' resource (current value: 3600) for query [...text truncated...]

From the ClearDB FAQ:

The max_questions resource is defined by how many queries you may issue to your database in an hour. Our free plans start with 3,600 queries per hour and increase to 18,000 upon purchasing a paid plan with us.
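To get a feel for what that quota means in practice, here is a minimal sketch that converts an hourly `max_questions` limit into an average query rate and a page-view budget. The 3,600 and 18,000 figures come from the FAQ quote above; the number of queries per page view (30 here) and the function names are assumptions made purely for illustration, not something we measured on our blog.

```python
# Rough budget for an hourly query quota such as ClearDB's max_questions.
# QUERIES_PER_PAGE is a hypothetical value chosen for illustration only.

SECONDS_PER_HOUR = 3600
QUERIES_PER_PAGE = 30  # assumption, not a measured figure


def average_queries_per_second(max_questions: int) -> float:
    """Average sustained query rate that the hourly quota allows."""
    return max_questions / SECONDS_PER_HOUR


def page_views_per_hour(max_questions: int,
                        queries_per_page: int = QUERIES_PER_PAGE) -> int:
    """Whole page views that fit into one hour before the quota runs out."""
    if queries_per_page <= 0:
        raise ValueError("queries_per_page must be positive")
    return max_questions // queries_per_page


if __name__ == "__main__":
    # Quota values taken from the ClearDB FAQ quoted above.
    for plan, quota in (("free", 3600), ("paid", 18000)):
        print(plan, average_queries_per_second(quota), page_views_per_hour(quota))
```

Under that assumption, the free plan averages one query per second, or roughly 120 uncached page views an hour. Once the budget is spent, further queries fail until the hourly window resets, which would be consistent with the 30-60 minute outages we kept seeing.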

Strike Three: Phantom HTTP 500 Errors on AJAX

Our new design utilises AJAX heavily, and the developers noticed that, from time to time, while testing the site hosted on Azure, the AJAX calls returned HTTP 500 with no detectable patterns or steps to reproduce. Of course, this was true not only for the developers:

Azure: HTTP Logs with phantom 500 Error

My only theory was that it had something to do with 4 instances of our site running simultaneously (or is it just 4 times the quota, but 1 worker process?). Anyway, we had very little information to reason about this issue, so we simply copied the same code and DB to a traditional host. Sure enough, everything ran smoothly there, and we haven’t noticed any errors since.

Sorry WAWS, You’re Out

One of the most disturbing things about these issues was the fact that we could not contact Azure support about any of them, because in order to ask technical questions you need a $29/mo support plan, which we didn’t have on that particular subscription. If you tally up the figures, here’s what it would look like to make everything work:

4 x $10 (instances) + $29 (support) = $69/month

With traditional hosting, you get the same hosting model, with support, for a tenth of the price, though probably with fewer features. We absolutely didn’t mind paying a few extra bucks to be more productive with Azure (see Penny Pinching in the Cloud: When do Azure Websites make sense? by Scott Hanselman for details), provided that the primary features work reliably.

I’ve always loved Microsoft for their excellent developer products and tools, and I still do. But when it comes to production environments, having a good night’s sleep would take priority over any productivity boosting feature. So WAWS will still be at the top of my development list, but for production needs I’d give it a couple of years to mature.

Comments

  • WordPress on Windows is almost always a bad idea. But to your point, these limitations need to be spelled out ahead of time with user warnings via email. I don’t run any high traffic (or even moderate traffic) WordPress sites so shared plans work (I like NearlyFreeSpeech). Media Temple has a compelling WordPress plan where core updates are auto applied and there is a staging environment: http://mediatemple.net/webhosting/wordpress/

  • I don’t see a problem here. The PHP limit is per process/thread (practically the same thing in Linux), which is generally a thread per request. With a dog like WordPress issuing serial MySQL queries per request, and say each request takes 250ms to complete, then just 16 req/sec would fcuk. Don’t blame Azure for the PHP request model sucking.

    • Agree, we should have load-tested better. But what I blame Azure for, though, is not mentioning the query quota at all, as well as not mentioning the fact that the associated ClearDB plan is free. Basically the only thing the customer sees in the documentation is the 20MB storage limit. Getting to the ClearDB portal from the Management Portal is not that obvious, unfortunately.

  • The ClearDB freemium plan of course has limits, just like Firebase, as noted in the FAQ. Of course, if you missed this you’d have picked it up in real-world testing (with blazemeter or otherwise). A 500 generally indicates a problem with your code, possibly session handling, presuming the frontend had sticky sessions, which may not have been the case (hence the periodic issues?). I think you’re fast to blame Azure here.

  • WordPress is a system hog. Without W3Supercache or TotalCache it won’t handle any kind of moderate load. This is especially noticeable on Windows, as it was never optimized for IIS and barely runs. Perhaps the title should be “How Azure Web Sites Sucked in Production for WordPress” because for .NET applications they’re awesome.

  • For what it’s worth, I’ve had pretty good luck with Azure websites, but I’ve always run .NET code on them. Given that probably 99% of all WP websites run on Linux, I would be surprised if running it on Windows didn’t have some significant gotchas. LAMP is supported on Azure, but it doesn’t seem to be quite the first-class citizen that .NET is.

  • As Damien mentions above – you can’t use WordPress in any capacity (especially in production!) without rolling in some kind of caching. Cloudflare, TotalCache – *something*. This would have mitigated much of your trouble. Also – yeah a shared plan on a production site with 10K visits using WordPress… I might suggest reading the limits of said plans before you push live. I have been bitten as well, but the shared plan is rather stingy.

    • @Rob: I must admit that the 10K pageviews statement sounded a bit too loud. This was the kind of traffic we got because of a HN effect that day. Usually, when we don’t publish new content, the traffic is much, much lower.

  • We’ve been very successfully running Azure Websites for large scale customers for over 1 year now. The trick is, you have to get a proper VM, standard instance. As with any cloud deployment, database retry is a necessary evil, as is understanding the usage patterns of the site so that you can choose the appropriate VM size. I agree that the free version isn’t acceptable for most anything if you have traffic, quota shutdown sucks.

    • Yep, we use that sort of environment for our corporate website. Though we launched it when WAWS was still in preview, so it’s just a standard Web Role on a dedicated VM.

  • Yuriy, the recommended way to run a production site on WAWS is to scale it to Basic or Standard mode, which puts the site on a separate VM and removes any quota limits. With regard to the database outages, I had the same problem when I was migrating my WordPress-based blog onto WAWS. I eventually upgraded the database to a paid option. It looked like the free ClearDB option was only suitable for development use and ran into limitations under more or less real production load.
    Thanks for the feedback on the quotas functionality for shared sites – we will look into ways to improve it.

    • Thanks for the advice Ruslan. The main motivation for using a shared plan was a modest volume of traffic we normally serve, as well as the fact that for two years our blog has been able to work on a shared hosting (I assume Blogger is) without interruptions. Looking back at all this: we made quite a few mistakes on our part, although I wish Azure documentation provided more details about ClearDB limitations (unfortunately, the pricing page doesn’t even mention the existence of ClearDB at all). This might have eliminated a lot of confusion for us as first-time ClearDB users. Anyway, we’ll be very happy to see the shared mode evolve since it’s still an awesome place to start building new projects.

  • We were just in the same boat. As far as all the IIS/Windows comments, or the you-should-have-read-the-limits comments… 1) I’ve run WordPress on Windows shared hosting on many other low-cost providers without these issues. I always have had Windows hosting plans. 2) Yeah, ok, the limits are spelled out somewhere. But the point is, the limits are too low, making it an uneconomic solution for a small-traffic site. Other shared hosting providers offer a much better value. On Azure, there is basically no way to do WordPress other than a VM where you can install your own MySQL instance (even though that’s less than ideal for a number of reasons.) I hit ClearDB’s low query limit even on a PAID plan, JUST with developer traffic.