Re: LARGE single-system Cyrus installs?

From: Pascal Gienger (no email)
Date: Fri Nov 09 2007 - 02:24:34 EST


    Vincent Fox wrote:

    > Our working hypothesis is that CYRUS is what is choking up at a certain
    > activity level due to bottlenecks with simultaneous access to some shared
    > resource for each instance.

    Did you run

    lockstat -Pk sleep 30

    (adding "-x destructive" if it complains about the system being
    unresponsive)?

    Among other results, we got this:

    Adaptive mutex block: 2339 events in 30.052 seconds (78 events/sec)

    Count indv cuml rcnt       nsec Lock               Caller
    -------------------------------------------------------------------------------
      778  79%  79% 0.00  456354473 0xffffffffa4867730 zfs_zget
       61   6%  85% 0.00  466021696 0xffffffffa4867130 zfs_zget
        8   1%  87% 0.00  748812180 0xffffffffa4867780 zfs_zget
       26   1%  88% 0.00  200187703 0xffffffff9cf97598 dmu_object_alloc
        2   1%  89% 0.00 1453472066 0xffffffffa4867de0 zfs_zget
       12   1%  89% 0.00  204437906 0xffffffffa4863ad8 dmu_object_alloc
        4   1%  90% 0.00  575866919 0xffffffffa4867838 zfs_zinactive
        5   1%  90% 0.00  458982547 0xffffffffa48677b8 zfs_zget
        4   1%  91% 0.00  563367350 0xffffffffa4867868 zfs_zinactive
        3   0%  91% 0.00  629688255 0xffffffffa48677b0 zfs_zinactive

    Nearly all of the lock contention is caused by ZFS. The SAN disk system
    is NOT the bottleneck, though: its average service times are 5-8 ms and
    there is no wait queue.

    456354473 ns is 0.456 s, and that is *LONG* for a mutex block.
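
To put the table above in perspective, it can help to multiply each row's event count by its nsec value and add the results up per caller. What follows is only a minimal Python sketch, not part of the original analysis. It assumes that the nsec column is the average duration per event, and the name `blocked_seconds_by_caller` is made up for illustration. The rows are copied from the lockstat output above.

```python
from collections import defaultdict

# Rows copied from the lockstat output above.
# Columns: Count indv cuml rcnt nsec Lock Caller
LOCKSTAT_ROWS = """\
778 79% 79% 0.00 456354473 0xffffffffa4867730 zfs_zget
61 6% 85% 0.00 466021696 0xffffffffa4867130 zfs_zget
8 1% 87% 0.00 748812180 0xffffffffa4867780 zfs_zget
26 1% 88% 0.00 200187703 0xffffffff9cf97598 dmu_object_alloc
2 1% 89% 0.00 1453472066 0xffffffffa4867de0 zfs_zget
12 1% 89% 0.00 204437906 0xffffffffa4863ad8 dmu_object_alloc
4 1% 90% 0.00 575866919 0xffffffffa4867838 zfs_zinactive
5 1% 90% 0.00 458982547 0xffffffffa48677b8 zfs_zget
4 1% 91% 0.00 563367350 0xffffffffa4867868 zfs_zinactive
3 0% 91% 0.00 629688255 0xffffffffa48677b0 zfs_zinactive
"""


def blocked_seconds_by_caller(rows):
    """Sum count * nsec per caller and return the totals in seconds.

    Assumption: nsec is the average duration of one event in nanoseconds.
    """
    totals = defaultdict(float)
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip anything that is not a data row
        count, nsec, caller = int(fields[0]), int(fields[4]), fields[6]
        totals[caller] += count * nsec / 1e9
    return dict(totals)


if __name__ == "__main__":
    totals = blocked_seconds_by_caller(LOCKSTAT_ROWS)
    for caller, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{caller:20s} {secs:8.1f} s blocked")
```

Under that assumption, the zfs_zget rows alone add up to close to 400 seconds of blocked thread time inside a 30-second sample, which is only possible when many threads are waiting at the same time.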

    What's also interesting is tracing open() calls with dtrace.
    Just use this:

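    #!/usr/sbin/dtrace -s
    /* destructive mode keeps dtrace from aborting on an unresponsive system */
    #pragma D option destructive
    #pragma D option quiet

    syscall::open:entry
    {
            self->ts = timestamp;       /* start time in nanoseconds */
            self->filename = arg0;      /* user-space address of the path */
    }

    syscall::open:return
    /self->ts > 0/
    {
            /* clause-local variable; a global would be shared by all CPUs */
            this->zeit = timestamp - self->ts;
            printf("%10d %s\n", this->zeit, copyinstr(self->filename));
            @["open duration"] = quantize(this->zeit);
            self->ts = 0;
            self->filename = 0;
    }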

    It will show you every file that is opened and the time (in nanoseconds)
    the open took. After you hit CTRL-C, it prints a summary:

      open duration
               value  ------------- Distribution ------------- count
                1024 |                                         0
                2048 |@                                        80
                4096 |@@@@@@@@@@@@@@@@@@@@@                    1837
                8192 |@@@@@@                                   521
               16384 |@@@@@@@                                  602
               32768 |@@@                                      229
               65536 |@                                        92
              131072 |                                         2
              262144 |                                         0
              524288 |                                         1
             1048576 |                                         1
             2097152 |                                         1
             4194304 |                                         3
             8388608 |                                         12
            16777216 |@                                        51
            33554432 |                                         38
            67108864 |                                         25
           134217728 |                                         9
           268435456 |                                         2
           536870912 |                                         3
          1073741824 |                                         0

    You can see the ARC (in-memory cache) activity in the 4-65 microsecond
    range and the disk activity in the 8-33 ms range. You can also see some
    "big hits" of 0.13-0.5 s (!). That is far too long, and I have not figured
    out why it happens. As more users connect, these "really long opens"
    become more and more frequent.
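
The ranges quoted above follow from how quantize() labels its rows: a row labelled N counts values from N up to (but not including) 2N nanoseconds. As a rough Python sketch, with the counts copied from the histogram and the three range boundaries picked here purely for illustration, the opens per range can be totalled like this (`human` and `count_in_range` are made-up helper names):

```python
# (bucket label in ns, count), copied from the histogram above.
# A bucket labelled N holds values in [N, 2N) nanoseconds.
BUCKETS = [
    (1024, 0), (2048, 80), (4096, 1837), (8192, 521), (16384, 602),
    (32768, 229), (65536, 92), (131072, 2), (262144, 0), (524288, 1),
    (1048576, 1), (2097152, 1), (4194304, 3), (8388608, 12),
    (16777216, 51), (33554432, 38), (67108864, 25), (134217728, 9),
    (268435456, 2), (536870912, 3), (1073741824, 0),
]

# Illustrative ranges, given as inclusive bucket labels in ns.
RANGES = {
    "cache": (4096, 65536),
    "disk": (8388608, 33554432),
    "long opens": (134217728, 536870912),
}


def human(ns):
    """Format a nanosecond value with a readable unit."""
    if ns < 1_000:
        return f"{ns} ns"
    if ns < 1_000_000:
        return f"{ns / 1e3:.1f} us"
    if ns < 1_000_000_000:
        return f"{ns / 1e6:.1f} ms"
    return f"{ns / 1e9:.2f} s"


def count_in_range(buckets, lo, hi):
    """Total the counts of all buckets whose label lies in [lo, hi]."""
    return sum(count for label, count in buckets if lo <= label <= hi)


if __name__ == "__main__":
    total = sum(count for _, count in BUCKETS)
    for name, (lo, hi) in RANGES.items():
        n = count_in_range(BUCKETS, lo, hi)
        print(f"{name:10s} {human(lo):>9s} .. {human(hi):>9s}: {n} of {total}")
```

With the counts above, 14 of the 3509 traced opens fall into the "long opens" range, so they are rare in this sample but each one stalls a process for a noticeable time.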

    We have a Postfix spool running on the same machine, and we got some
    relief by deactivating its directory hashing scheme. ZFS seems to be very
    "angry" about deep directory structures. But these "long opens" still
    occur.

    Pascal

    ----
    Cyrus Home Page: http://cyrusimap.web.cmu.edu/
    Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
    List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
    
