On the Quest for Blazing Performance – NginX, DNS, and caching (Part 1)
As you may know, I am eminently concerned with ad delivery (here: economics of ad delivery), and as part of that, I have recently embarked on a quest for blazing, blistering web performance out of my webservers. After all, more pages delivered means more ads, and more returning visitors means more ads.
When most people think of optimizing server performance, they think about speeding up page composition time (image 2), sometimes delivery time, and usually page rendering time (image 3). But the performance clock starts the moment a person types your site name into the location bar or clicks on you in Google, and there are plenty of things that happen before the server even gets hit. If we think about optimizing all of these things at once, we'll see that if we do it right, we'll get failover for free.
Now, before talking about server optimization, it is important to at least nod my head towards optimizing page rendering speed. Yahoo has a number of excellent tips on how to get your existing content and page to show up more quickly, and has correctly pointed out that it is an 80-20 relationship: it is easier to change page elements to speed up display than to fix up webserver performance (low-hanging fruit and all). That said, usually not all elements can be successfully optimized, and even an embedded image can stall rendering. Consider, for example, multiply nested full-width tables used for layout, with images stuck in for spacing (yes, people still do this). Believe it or not, without height and width attributes explicitly set, the browser has to request each image before it can complete the layout. That "top 20%" can contribute substantially to page load time, especially when there are many document elements. Now, if your page takes 10 seconds to load, chances are people won't really notice that 20% of performance, but as Google published [link], as little as half a second of extra latency cost them 20% of their traffic.
The Web Transaction, the top 20%
If one looks at the image below, there are two connection stages that happen before a person can even think about issuing their HTTP request and getting their content back.
Right at this top level, there are three times involved in the transaction: Tdns, the time to query the DNS servers; Tc, the one-way connect and transmission time between visitor and server (Tc*3 is roughly the TCP handshake); and Td, the time to deliver the multiple packets of content.
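As a rough back-of-envelope sketch (my notation, not a formal model), those three times combine like this:

```python
def transaction_time(tdns, tc, td, requests=1, keepalive=True):
    """Rough model of total web-transaction time, in seconds.

    tdns: DNS lookup time; tc: one-way packet time to the server
    (a TCP handshake costs roughly 3 * tc); td: content delivery time.
    Without keep-alive, every request pays its own handshake.
    """
    handshakes = 1 if keepalive else requests
    return tdns + handshakes * 3 * tc + td

# e.g. 150ms DNS, 40ms one-way, 200ms delivery, 20 page elements
total = transaction_time(0.150, 0.040, 0.200, requests=20, keepalive=False)
```

The example numbers are illustrative only; plug in your own measurements from the tools below.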
Tdns – People generally think this cannot be improved directly, but surprisingly, it can be. In the worst case, the visitor's machine contacts its local DNS server; the local server doesn't have your site in its records, so it contacts the top-level root servers, which redirect to slightly more local servers, which in the end redirect to YOUR nameserver, which returns the result back to your visitor's resolver. Estimated time? 100-300ms. That's right, .3 seconds — long enough for them to swat the gun out of an assailant's hands and kick him a few times. What about DNS caching? Right. DNS servers have limited memory and thus primarily cache the most popular web sites. Problem is, your visitor's local Windows machine has probably already cached those anyway, so the shared cache helps least where you need it most. Their local DNS servers will miss far more often than you'd like to admit, and almost certainly if your site is not in the Alexa top 100,000. Do I have any stats for this? No. But I can tell you that in visiting my own site StudentsReview (1M people a month, in the Alexa top 50,000), about 50% of the time my local MIT DNS misses and has to query SR's nameservers directly. Cost? 500ms. To improve it, measure the delay, and subscribe to a distributed DNS service that provides the locally fastest answer while also being redundant.
- DNSWATCH.info – measures the DNS lookup time of your site; click a few times to get an average.
- DNSmadeeasy.com – We use these guys because they are nationally distributed, are way faster than ultradns [link], and provide failover.
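If you'd rather measure Tdns from code than from a web tool, a quick sketch in Python (this times the OS resolver, so results vary with whatever is already in your local cache — repeat the lookup to see cached vs uncached behavior):

```python
import socket
import time

def dns_lookup_time(hostname, port=80):
    """Time a single DNS resolution via the OS resolver, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, port)
    return (time.perf_counter() - start) * 1000.0

# First lookup may be the miss; subsequent ones are usually cached.
samples = [dns_lookup_time("localhost") for _ in range(3)]
```

Run it against your own domain from a few different networks to approximate what geographically distributed visitors see.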
Tc – This is the server connect time. Usually this is in the sub-100ms range, so you might think, "what's the point of optimizing something so fast?" Problem is, a normal page can consist of 20 requests, and a 20ms saving on Tc can result in nearly 400ms of overall savings (more with HTTP/1.0 or keep-alive turned off). Tc is the average minimum time for a packet to go from your visitor TO your server. You'll notice that this is just ping/2. Problem is, if you ping from your own machine, your server probably seems fast, but you really want it to be fast for ALL (or most) of your visitors, who are geographically distributed. This means you have to spend a great deal of time finding the hosting company that is fastest for all of your visitors. We have mostly used ThePlanet.com, but were in for a big surprise to find out they weren't the fastest: visitors from the West Coast and Florida were hitting 70ms delays, just from the backbone.
- Just-Ping.com – JustPing does exactly that: it pings whatever IP address you want from servers around the world. The 8 US locations will tell you what your US visitors see as the best-case delay.
- Server4You.com – These guys have (after testing about 30 hosting providers) the fastest average and national upper limit on their ping, coming in at around 40-50ms worst case and 30ms average. They are a bit of a pain to sign up with, and their prices are a little high for what you get, but the network performance gain is worth it.
- Aside: Another method is to go with your own CDN, and redirect users to your locally closest server, thus reducing Tc to sub 20ms. Huge gains, but not everyone can do it.
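The "what's the point of optimizing 20ms?" arithmetic above works out as follows (a sketch; 20 requests per page is the assumption from the text, and with keep-alive on, most requests reuse a connection so the saving is paid roughly once):

```python
def page_connect_savings(requests, tc_saving_ms, keepalive=False):
    """Per-page saving (ms) from shaving tc_saving_ms off each connect.

    Without keep-alive, every request pays a fresh TCP connect,
    so a small per-connect saving multiplies across the whole page.
    """
    connects = 1 if keepalive else requests
    return connects * tc_saving_ms

# 20 requests x 20ms saved per connect = 400ms saved per page
saving = page_connect_savings(20, 20)
```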
Td – The time to deliver full webpages. There is a lot written on this already, so I will just say: turn on gzip compression, and turn Keep-Alive on. Keep-Alive saves you the extra TCP handshake on each page item, and gzip turns a 60KB page into a 15KB one. Or, more importantly, it turns 40 1500-byte packets into 10.
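In NginX terms, that boils down to a few lines in nginx.conf (a minimal sketch; the compression level, MIME types, and timeout values here are my own illustrative choices — tune to taste):

```nginx
http {
    # Compress text content: a 60KB page becomes roughly a 15KB one
    gzip            on;
    gzip_comp_level 5;
    gzip_types      text/plain text/css text/xml application/x-javascript;

    # Reuse connections so each page element skips the TCP handshake
    keepalive_timeout  15;
    keepalive_requests 100;
}
```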
Ok, now that you have optimized those problems, or at least have an idea of them, it's time for the server-side speed concerns. You'll notice that the "theoretical" web transaction rarely reflects real life, even with one server. Here's what that looks like:
Before I go any further, I should note that this article was originally going to include some absurdly fast setup using NginX with Varnish as a front-side cache, but in all my testing, I found that for static content, NginX alone is slightly faster than Varnish, with a lower standard deviation in response time as well. That said, I don't want to bash Varnish, since I did learn some very important things from its authors, such as "don't fight the kernel", which I realized should be extended to "don't fight the webserver" as well (also, "don't fight the disk").
Now, looking at the above graph, you'll see that page delivery consists of several more stages than simply requesting the file. Sockets have to be opened, the necessary files have to be looked up, database queries have to be run, the page has to be composed and delivered, and the process repeated.
Don’t Fight the Webserver
The webserver's job is this and simply this: to handle an incoming request, allocate the socket, look up the file, and return it. Understanding that this is the only job of the webserver is to understand the power and elegance of NginX (instead of Apache). Contrary to popular belief, NginX is not a reverse proxy — it is a webserver heavily optimized around its single job, with proxying capabilities. To understand the Varnish mantra, "don't fight the X", is to understand that NginX should not be used for its proxying abilities, nor for SSI, nor for FastCGI PHP. It should only be used for delivering static content. Not that it isn't good at all those other things, but they are a waste of time to optimize around.
Why not Apache? Well, Apache is pretty much tapped out. To compete with NginX, Apache <= 2.2 has to use its newer worker threading module instead of prefork. The problem is, many of the modules Apache runs are not guaranteed to be threadsafe (notably PHP), meaning that to gain all that dynamic (and supposedly "correct/stable") power of Apache, you are basically forced to use prefork, forking off processes to handle connections. How bad is the performance comparison to NginX? In prefork, my Apache handles 100 incoming connections at 200ms each, easily reaching memory limits. NginX, however, reaches 1000 simultaneous connections before dropping to 100ms response time. At 10 connections, it is 1ms. Yes, 1. At 100, 10ms. Apache then gets even worse, because prefork runs afoul of the memory limit, forcing swapping and stalling out more incoming connections.
Alright, so how do we NOT FIGHT the Webserver?
- Make all your content static. Run wget -m -N --restrict-file-names=windows http://yoursite on a cron job to generate a daily, or twice-daily, hot static mirror.
- Use rsync to PUSH the mirror over to your fast new NginX server, located on its central, fat, high-speed network pipe.
- Don't run anything else on the server. Set up NginX to fail over to PHP or SSI whenever a static copy of dynamic content is missing, and if that fails, to fail over to proxying to your dynamic Apache servers.
- Leave caching to disk & kernel.
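Put together, the static-first, fail-over-to-dynamic setup looks roughly like this in NginX (a sketch of the proxy-fallback branch; the paths and the `dynamic_backends` upstream name are assumptions for illustration, and you'd define that upstream to point at your Apache servers):

```nginx
server {
    listen 80;
    root   /var/www/hot-mirror;   # the rsync'd static mirror

    location / {
        # Serve the static copy if it exists; otherwise fall through
        try_files $uri $uri/index.html @dynamic;
    }

    location @dynamic {
        # Static copy missing: proxy back to the dynamic Apache servers
        proxy_pass http://dynamic_backends;
    }
}
```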
How is a hot mirror different from caching? Isn't that what a cache does, build a local repository dynamically? Yes and no. The difference is subtle:
- A cache requires substantial extra logic to choose when to invalidate content; with a hot mirror, you choose.
- A cache is rarely optimized to handle sockets & incoming connections as well as a webserver.
- A cache has to manage what to “keep in” and what to flush — monitoring usage; performance optimization/delivery is dependent upon hitting the cache.
- A cache has the extra job of "managing its own cache": keeping highly demanded content in in-memory pages and not letting it be swapped out. The webserver leaves all caching to the kernel, and will never have a miss of its own.
- A cache can (depending on the cache) end up with "two content caches": the cache itself, and then the kernel's disk cache of the cache's content repository (mirrored on the disk). Basically, it can incur two cache misses when the cache hash misses.
- A cache cannot always guarantee complete disk locality of the local mirror (depending upon implementation, and upon extra writes to the disk, such as logging).
In short, a cache is generally used to make up for a poor design — by the time you really start needing caching, you should start thinking about re-engineering your entire delivery system.
Don’t Fight the Disk
Disk seeks can slow the webserver down badly, especially on busy sites, where lookups for files in very disparate locations miss the kernel-maintained disk cache. Basically, let the disk do its job and don't get in its way.
- Maintain your hot mirror in contiguous disk space and don't fragment files — let disk readahead and the drive's 16MB buffer do their job.
- Write logs and database tables to a separate disk. One disk for high speed reads, another for writes.
- Use a faster disk — specifically, one without a read head that has to move around your hot mirror, i.e. an SSD. Our benchmarks show that a regular disk yields around 140 IOPS (I/Os per second); with uniformly distributed requests, that can mean one seek per page request. Our SSD RAID delivers 40,000 IOPS.
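Those IOPS numbers translate directly into a ceiling on requests per second when each request costs a seek (a sketch, using the figures from our benchmarks; real workloads mix cached and uncached reads, so treat this as a worst-case bound):

```python
def max_requests_per_sec(iops, seeks_per_request=1):
    """Worst-case upper bound on requests/sec when every request
    misses the kernel cache and costs seeks_per_request disk seeks."""
    return iops / seeks_per_request

spinning = max_requests_per_sec(140)     # a regular disk's ceiling
ssd_raid = max_requests_per_sec(40_000)  # our SSD RAID's ceiling
```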
Page Rendering Speed, Extra Tips
Just a few extra tips that I would add:
- Lazy Database Connections – Don't connect to the database until you have a query to run. Don't put mysql_connect in your header files or as a standard include; put it in a query wrapper that checks whether the DB is connected and, if not, connects. You'd be surprised at the performance gain we saw from this simple thing.
- Put layout elements, script loads, css loads within the first 1500 bytes of your html file & header. This will allow the browser to start requesting the additional content in parallel as soon as the first packet is received.
- Order the elements such that layout elements are loaded first, and as quickly as possible. If you are especially tweak-minded, you can order them such that no two keep-alive connections are opened simultaneously to the same server when another server can be used.
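The lazy-connection tip above, sketched in Python rather than PHP (the wrapper idea is the same; `connect_fn` stands in for your mysql_connect call and is an assumption for illustration):

```python
class LazyDB:
    """Defer the database connection until the first query is run.

    Pages that never run a query never pay the connect cost at all.
    """

    def __init__(self, connect_fn):
        self._connect_fn = connect_fn  # e.g. your mysql connect call
        self._conn = None

    def query(self, sql):
        if self._conn is None:          # connect on first use only
            self._conn = self._connect_fn()
        return self._conn.execute(sql)
```

Include the wrapper everywhere in place of a top-of-file connect; the connection happens only when (and if) a query actually fires.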
To be continued….
Failover, Rendering, the connect and handle chain, and database optimization… Though, to be fair, with a hot mirror, we’ve taken a lot of the database compose time out of the picture.