From: Michael Bacon (no email)
Date: Mon Oct 19 2009 - 16:38:52 EDT
Today we're enjoying our first full work day of independence from the old
monolithic cyrus server installed in 1999 (Sun 6800 -- it's had new CPU
boards since then, but that's it), running instead on our shiny new cluster
of T5220s that are mostly happily operating as a murder.
I say mostly because while most of the time the thing handles our 80,000
users and 14,000+ simultaneous connections like a champ, some of the time,
we get some extreme pain, mostly due to syncs between the MUPDATE master
and the front-end servers.
When we spec'ed out our servers, we didn't put much I/O capacity into the
front-end servers -- just a pair of mirrored 10k disks doing the OS, the
logging, the mailboxes.db, and all the webmail action going on in another
Solaris zone on the same hardware. We thought this was sufficient given
the fact that no real permanent data lives on these servers, but it turns
out that while most of the time it's fine, if the mupdate processes ever
decide they need to re-sync with the master, we've got 6 minutes of trouble
ahead while it downloads and stores the 800k entries in the mailboxes.db.
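For a rough sense of scale, the two figures above can be turned into a
per-entry rate. This is only a back-of-the-envelope sketch: it assumes the
6 minutes and 800k entries are round numbers and that the work is spread
evenly across the sync, which it probably isn't. The function name is made
up for illustration.

```python
def sync_rate(entries: int, seconds: float) -> tuple[float, float]:
    """Return (entries per second, milliseconds per entry) for a full re-sync."""
    per_second = entries / seconds
    ms_per_entry = 1000.0 * seconds / entries
    return per_second, ms_per_entry


# Figures quoted in this message: ~800k mailboxes.db entries in ~6 minutes.
per_second, ms_per_entry = sync_rate(800_000, 6 * 60)
print(f"{per_second:.0f} entries/s, {ms_per_entry:.2f} ms per entry")
```

That is on the order of a couple of thousand database writes per second,
sustained for several minutes, on disks that are also serving the OS, the
logs, and the webmail zone.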
During these sync periods, we see two negative impacts. The first is
lockup on the mailboxes.db on the front-end servers, which slows down both
accepting new IMAP/POP connections and the reception of incoming messages.
(The front-ends also accept LMTP connections from a separate pair of
queueing hosts, then proxy those to the back-ends.) The second is that,
because the front-ends go into a
It's awfully frustrating to have a system that, as my boss says, performs like
a Camaro most of the time until you hit a little rock in the road, and then
suddenly turns into a Pinto. It's also frustrating that this seems like
one of the less complicated aspects of the system -- publishing replicas of
a read-only database to a few worker boxes.
I suppose this is why Fastmail and others ripped out the proxyd's and replaced
them with nginx or perdition. Currently we still support GSSAPI as an auth
mechanism, which kept me from going that direction, but given the problems
we're seeing, I'd be open to architectural suggestions on either how to tie
perdition or nginx to the MUPDATE master (because we don't have the
back-ends split along any discernible lines at this point), or suggestions
on how to make the master-to-frontend propagation faster or less painful.
Sorry for the long message, but it's not a simple problem we're fighting.
UNC Chapel Hill
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html