Atom 330 – Configuring an Atomic Satellite?

In my quest for absurd performance out of my webserver cluster, I came up with the idea of using an extremely powerful central node (dual quad-core i7s with 12G of triple-channel memory, baby!) for building and storing content, and then pushing the content out to several satellite nodes (Atom 330s), whose sole job would be to deliver static content at a super-high rate. Without getting into overall cluster configuration, these are some of my experiences configuring some Atom 330s and their performance.

Why Atom 330s?
Most obviously, the power. The CPUs use only ~7W, and an entire board uses ~40W. That means I could pack a 2U case full of Atoms without overrunning my desired power and thermal envelope. (Originally I wanted a 1U case, but the vertical S/PDIF/audio jack doesn't fit in a 1U case without desoldering it entirely.) Secondly, my overall belief was that since these nodes would be exclusively running nginx, which uses something like 3% of CPU on my other servers, content delivery would not be CPU bound but disk and interface bound, so as long as I paired the Atoms with GigE and OCZ Vertex SSDs, they could deliver at extremely high rates for very, very low cost ($80 for the mb/Atom combo). Finally, and somewhat esoterically, at such low power I could use some DC-DC PicoPSUs and power the boards off the power supply's standard 12/5V rails – no special connector.

So I bought them, they were delivered, and I started to set them up. First of all, Gentoo is the Linux of choice, partly because I can totally avoid other distributions' bloat-crap, partly because it's easy to recompile a highly tuned system for one task, and because all the reasons "not to use Gentoo" that other distros' proponents claim are reasons not to use those distros as well (read: Gentoo is better than all the other distros combined. All the Linux that makes Linux awesome, and none of the crap that makes Linux suck.).

The kernel of choice is 2.6.28 (with no forced preemption), Gentoo ships GCC 4.3.2, the webserver is nginx, storage is an OCZ Vertex 60G SSD plus a 1TB WD Caviar, and the platform is Intel's BOXD945GCLF2 (here) with 2G of RAM.

Compilation opts, setup & benchmarks
Because I couldn't find it anywhere else: the -march setting for GCC 4.3.2 is -march=core2. Gentoo wikis suggest -march=nocona, which everyone else seems to be using, but the Atom 330 is in the Core 2 family, and benchmarks with -march=core2 show an average 15% improvement. I tested this with nbench (here), but subsequently lost the results – I'll tell you why in a minute, but here's what they were, estimated from memory. Also, nbench only stresses one core, so to obtain the total score, two nbench instances have to be run; they run fairly independently with no crossover slowdown (the total score is both nbench scores added together).

  • score: -march=nocona : -march=core2
  • mem: ~13 : ~14

That's what I remember; the biggest gain was from FP ops, ~15% for int, and 5% for mem. I only care about integer ops, since most memory address arithmetic is handled by integer ops (unless special instructions exist), never by the FPU. I recompiled the kernel, then re-emerged world (-D) with -march=core2 and rebooted without incident. The whole system feels about 15% snappier.
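For anyone following along at home, the Gentoo side of that switch is just the CFLAGS line in /etc/make.conf plus a world rebuild; a minimal sketch (the exact -O level is my choice, not gospel):

```shell
# /etc/make.conf -- tune builds for the Atom 330 under GCC 4.3.2
CFLAGS="-O2 -march=core2 -pipe"
CXXFLAGS="${CFLAGS}"
```

After changing the flags, `emerge -e world` rebuilds the entire tree against them (the kernel gets reconfigured and rebuilt separately).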

Second, Gentoo does not come with irqbalance enabled by default, and with the CPU hotplug kernel option (CONFIG_HOTPLUG_CPU) on, all interrupts get bound to one core (see here). After emerging irqbalance and recompiling the kernel with hotplug off, I saw another 15% on the number of requests/sec the Atom could handle. No benchmarks to show on this one, just take my word for it.
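The setup itself is short; this sketch assumes Gentoo's standard init scripts:

```shell
# install irqbalance and start it at boot
emerge irqbalance
rc-update add irqbalance default
/etc/init.d/irqbalance start

# sanity check: under load, the per-interrupt counters in BOTH the
# CPU0 and CPU1 columns should climb, not just one of them
cat /proc/interrupts
```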

Why are there no hard numbers?
Yes, I suck, I got it. What actually happened is that I built all of the benchmarks in the /tmp directory without realizing that Gentoo flushes /tmp periodically – in this case on reboot – so I basically lost all the results. They took forever to build and I'm not wasting that time again. But since this isn't a "scientific" study, use at your own risk; it was mostly for my own testing benefit anyway.

nginx & ab (ApacheBench)
Now onto the real results. First, these results are with nginx and ab compiled with -march=core2. I did run tests with a nocona kernel and nginx, and just as with the benchmarks, I saw about 15% more requests/sec with core2. For this test, I set up nginx to serve an empty GIF using the empty_gif module: 43 bytes in size, no hits to disk or memory, ab on localhost. Meaning, this test exercises the Atom's ability to service an interrupt and pass the request off to nginx, which then responds with the minimal amount of work – it should tell us the theoretical max number of requests per second the Atom can service. I'll be varying $X to see how it handles concurrency too:
ab -c $X -n 10000 http://localhost/1ptrans.gif
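For reference, the empty_gif setup is only a few lines; this is a sketch of what mine looked like rather than the exact config (the server_name is a placeholder):

```nginx
# hand back a 1x1 transparent GIF from memory -- zero disk or upstream I/O
server {
    listen 80;
    server_name localhost;

    location = /1ptrans.gif {
        empty_gif;
    }
}
```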

So at $X, with one ab process, the Atom responded with:
concurrency : requests/sec : avg response time
10 : 3K : 2ms
20 : 3K : 7ms
50 : 3K : 15ms

This is a bit higher than the numbers you'll see quoted online (~2K), but that's because I am using core2 compiler options in the kernel and have irqbalance on.

Now, you'll remember that the Atom 330 is dual core, but being an in-order processor, it can't reorder threads and operations to make maximal use of the pipeline. This means one core services an interrupt, which makes some data available for nginx to read, in order, and then write, which then generates another interrupt for ab to read, all happening in order. This leaves a lot of the Atom's pipes empty, since every request becomes dependent and ordered.

So I add another ab process, just to see if the overall max requests/sec goes up. This is a totally fine assumption because most traffic comes in from distributed sources and can be handled by different cores anyway.

So at $X, with two ab processes running simultaneously, the Atom responded with (totals; effective concurrency in parentheses):
concurrency : requests/sec : avg response time
10 (20) : 4.5K : 3ms
20 (40) : 5K : 10ms
50 (100) : 4.5K : 20ms

There you see it: requests/sec went up by almost 2×, but response time unfortunately also went up. This puts the Atom 330 about 10% faster than my current main SR server (P4 2GHz), which serves about 4K/sec.

I also briefly ran some bandwidth tests with wget, pulling a 500MB file through nginx off the SSD onto a 1G ramdisk. A single wget through nginx read at 120MB/s; adding a second wget added another 110MB/s, for a combined read rate of 230MB/s, which is very close to the SSD's rated max of 250MB/s. Additional wgets divided the available 230-250MB/s evenly among the requests. While running, 30% of the CPU was spent in system, while less than 1% was spent in user/nginx.
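The test itself was nothing fancy; roughly this, with the file name and mount point from memory rather than exact:

```shell
# 1G tmpfs as the download target, so writes can't be the bottleneck
mkdir -p /mnt/ram
mount -t tmpfs -o size=1024m tmpfs /mnt/ram

# bigfile.bin is a stand-in name for the 500MB test file on the SSD;
# one reader first, then a second in parallel for the combined rate
wget -O /mnt/ram/a.bin http://localhost/bigfile.bin &
wget -O /mnt/ram/b.bin http://localhost/bigfile.bin &
wait
```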

Analysis of these results is pretty tricky. First of all, most tests run ab on the same system as the webserver to calculate a theoretical max, but I don't think anyone has noticed how much ab starves the CPU. For instance, during these tests, about 20% of CPU went to nginx, 30% to system (interrupt handling), and 100+% to ab (out of 200% for 2 cores). I ran some trials with ab's process priority set low (10) and nginx's high (-10), without any appreciable change in results. That said, I think without ab running, the theoretical max could get better or stay the same: ab effectively generates 2 interrupts – one on the socket for nginx to read, then one on the response for ab to read. In my full running system, the second interrupt would be replaced by an interrupt from the disk controller, saying the data is ready to be grabbed and sent by nginx.

That said, my main goal/belief with this CPU was that most of web serving is not CPU bound, which is still mostly true – with the exception that interrupt handling is CPU bound, and that limits the max theoretical requests/sec the Atom can handle. This is somewhat supported by the limits on wget bandwidth, which generates read interrupts near the processor limit as well. There's no reason a single wget shouldn't have reached full SSD throughput; to reach it, both cores and both cores' interrupt handling had to be used.

Overall I have mixed feelings about the results. I am impressed by the CPU's performance per watt when limited to the core (7W), and by the fact that such an underpowered CPU can deliver slightly higher numbers than my main P4 server. That said, I'm really disappointed that, given all the advanced hardware, the requests/sec isn't substantially higher. Considering I'm pairing this with an SSD that I benchmarked a month ago at 22,000 completed read requests/sec, the 4,000/sec IRQ-handling limit leaves about 18,000 IOPS of the SSD unutilized. That feels like a waste to me. Also, considering that I'm really targeting sub-10ms latency for as much of the US as possible, 20ms added onto every request when there are 100 active connections sort of defeats the purpose.

Finally, if you calculate requests/sec per dollar, the Atom comes in at about 4500/$80, or 56/$.
I estimate that a Core 2 Duo would come in at about 72/$ (18,000/$250). (This is a fair estimate considering that my Athlon 64 3000+ single core comes in at 9,000/sec, and a Core 2 Duo is about 3x faster than that.)
Also, on a per-watt basis, the Atom system is great if you aren't going to utilize it to its peak, since it uses so little power. But if you do utilize it fully, 4500/40W (~112 requests/sec/W) yields less request-handling performance per watt than a Core 2 Duo (18,000/100W, or 180/W). So a server farm could save some money using Atoms for a bunch of smaller customers on dedicated servers, but would lose money on power and boards using them as virtual hosts or for bigger customers.
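Spelling out the back-of-envelope math above (the Core 2 Duo figures are my estimates, not measurements):

```shell
# req/s per dollar and per watt; int() truncates the fractional parts
atom_per_dollar=$(awk 'BEGIN { printf "%d", int(4500/80) }')    # 56
c2d_per_dollar=$(awk 'BEGIN { printf "%d", int(18000/250) }')   # 72
atom_per_watt=$(awk 'BEGIN { printf "%d", int(4500/40) }')      # 112
c2d_per_watt=$(awk 'BEGIN { printf "%d", int(18000/100) }')     # 180
echo "per dollar: atom=$atom_per_dollar c2d=$c2d_per_dollar; per watt: atom=$atom_per_watt c2d=$c2d_per_watt"
```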

So, I guess I'll be returning these for a faster machine with more requests/sec and requests/$ that is closer to the SSD's capability, and save some money by not buying too many Atoms as satellite nodes. Though I was hoping to use 4+ of them for failover/redundancy. Oh well.

Sorry this was so long.