Making Replication Robust

From: Bron Gondwana (no email)
Date: Wed Oct 03 2007 - 22:36:57 EDT


    Hi,

    As I've mentioned on the mailing list, we have had to put
    quite a lot of infrastructure around Cyrus to make
    replication robust in all cases.

    While the core replication protocol seems pretty stable now,
    and with GUID stuff it will be easier to do integrity checks,
    it's still very much not a turn-key solution. Every site
    will have to put a lot of effort into understanding how
    everything works and building their own systems for keeping
    things running.

    ... while there is, in theory, some commercial advantage in
    us knowing how to do this and not telling anyone else, we
    think that's outweighed by having replication so good that
    lots of people are using it and helping improve Cyrus ...

    So I'd like to start a dialogue on the topic of making Cyrus
    replication robust across failures with the following goals:

    a) MUST never lose a message that's been accepted for
       delivery except in the case of total drive failure.

    b) MUST have a standard way to integrity check and
       repair a replica-pair after a system crash.

    c) MUST have a clean process to "soft-failover" to the
       replica machine, making sure that all replication
       events from the ex-master have been synchronised.

    d) MUST have replication start/restart automatically when
       the replica is available rather than requiring it be
       online at master start time.

    e) SHOULD be able to copy back messages which only exist
       on the replica due to a hard-failover, handling UIDs
       gracefully (more on this later). Alternatively, at least
       MUST (to satisfy point 'a') notify the administrator
       that the message has different GUIDs on the two copies
       and something will need to be done about it (to satisfy
       point 'd' this must be done without bailing out
       replication for the remaining messages in the folder).

    f) SHOULD keep replicating in the face of an error which
       affects a single mailbox, keeping track of that mailbox
       so that a sysadmin can fix the issue and then replicate
       that mailbox by hand.

    g) MAY have a method to replicate to two different replicas
       concurrently (replay the same sync_log messages twice)
       allowing one replica to be taken out of service and
       a new one created while having no "gaps" in which there
       is no second copy alive (we use rsync, rsync again,
       stop replication, rsync a third time, start replication
       to the new site - but it's messy and gappy)

    Does that sound like a reasonable set of goals to everyone?
    Have I missed anything important? I'd like to see a
    situation where everyone who wants to run replication can
    turn it on and pretty much forget about it, trusting that
    it will just_work[tm] (though reading the log files just in
    case!)

    COPYING BACK and UIDs:
    ======================

    The easy approach:

    * give both messages a new UID after uid_last and delete the
      old UIDs on both machines
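
    A minimal sketch of the easy approach, not Cyrus code. It
    assumes each copy of a folder can be modelled as a dict
    mapping UID to GUID, and that uid_last is the highest UID
    used on either machine. The function name and data layout
    are hypothetical, chosen only for illustration.

```python
def renumber_conflict(master, replica, uid, uid_last):
    """Resolve one UID whose GUID differs between the two copies.

    master, replica: dict of UID -> GUID (hypothetical in-memory model).
    uid_last: highest UID handed out so far on either machine.
    Returns the new uid_last.
    """
    # Delete the old UID on both machines, remembering each message.
    guid_master = master.pop(uid)
    guid_replica = replica.pop(uid)

    # Give both messages a new UID after uid_last, on both machines.
    for guid in (guid_master, guid_replica):
        uid_last += 1
        master[uid_last] = guid
        replica[uid_last] = guid
    return uid_last


if __name__ == "__main__":
    master = {1: "guid-a", 2: "guid-b"}
    replica = {1: "guid-a", 2: "guid-c"}
    print(renumber_conflict(master, replica, uid=2, uid_last=2))
    print(master == replica, master)
```

    After the call both copies hold both messages, and no UID
    that a client might have cached points at a different
    message than before.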

    The tricky approach:

    * track the highest UID that has ever been presented to an
      IMAP client. UIDs above that point are "soft" UIDs, and
      you don't need to worry about changing them, so select
      whichever end has actually had the UID seen and keep the
      message there with the same UID, injecting only the other
      message with a new UID.

    * if both ends have been viewed via IMAP then your client is
      probably already confused. Fall back to nuking that UID
      and injecting both messages again with higher UIDs.
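
    The tricky approach could be sketched as below. This is an
    illustration under stated assumptions, not Cyrus code: each
    copy is modelled as a dict of UID to GUID, and seen_master /
    seen_replica are hypothetical per-machine counters holding
    the highest UID ever presented to an IMAP client. The case
    where neither end has shown the UID is not covered above, so
    keeping the master's message in that case is an assumption.

```python
def resolve_conflict(master, replica, uid, uid_last,
                     seen_master, seen_replica):
    """Resolve one UID whose GUID differs between the two copies.

    master, replica: dict of UID -> GUID (hypothetical model).
    seen_master, seen_replica: highest UID each machine has ever
    presented to an IMAP client (hypothetical counters).
    Returns the new uid_last.
    """
    master_seen = uid <= seen_master
    replica_seen = uid <= seen_replica
    guid_master, guid_replica = master[uid], replica[uid]

    if master_seen and replica_seen:
        # Both ends were viewed: nuke the UID and inject both
        # messages again with higher UIDs.
        del master[uid]
        del replica[uid]
        for guid in (guid_master, guid_replica):
            uid_last += 1
            master[uid_last] = replica[uid_last] = guid
        return uid_last

    # At most one end was viewed: that end keeps the UID. If neither
    # was viewed the UID is "soft" and the master wins (assumption).
    if replica_seen:
        keep, move = guid_replica, guid_master
    else:
        keep, move = guid_master, guid_replica
    master[uid] = replica[uid] = keep

    # Inject only the other message with a new UID.
    uid_last += 1
    master[uid_last] = replica[uid_last] = move
    return uid_last
```

    For example, with master = {5: "m"}, replica = {5: "r"},
    uid_last = 5, seen_master = 4 and seen_replica = 5, the
    replica's message keeps UID 5 and the master's message is
    injected as UID 6 on both machines.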

    Regards,

    Bron.

