  1. Fabric8
  2. FABRIC-1194

Fabric ensemble does not recover after VM disconnection

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 7.3.0.redhat-61
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Red Hat JBoss Fuse 6.1.0 GA

    • Steps to Reproduce:

      1. On three virtual machines, create a fabric ensemble of three nodes, and ensure that all are running and connected.
      2. Shut down one node in an orderly way. The problem only takes two nodes to reproduce, but you can't create a two-node ensemble.
      3. Suspend one of the remaining VMs using the virtual machine's facilities (not via the guest operating system). On VMware, this can be achieved simply by closing the VM's window and waiting for VMware to suspend it.
      4. After a few minutes, restart the VM.
      5. Note that the awoken VM never rejoins the ensemble. On the machine that has never been shut down, container-list shows the following:

      JBossFuse:karaf@root> container-list
      [id]                           [version] [connected] [profiles]                                         [provision status]
      root*                          1.0       true        fabric, jboss-fuse-full, fabric-ensemble-0004-1    success
      toot                           1.0       false       fabric                                             success
      zoot                           1.0       false       fabric, fabric-ensemble-0003-3                     success
      

      In this case, the machine that was suspended, 'toot', is no longer shown as part of the ensemble, even though it is running.
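      The check in step 5 can be scripted. The following is a minimal sketch, not part of the original report: it assumes the container-list output has been captured as text lines, and that the first three whitespace-separated columns are id, version and connected, as in the listing above. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContainerListCheck {

    /** Returns the ids of containers whose [connected] column reads "false". */
    static List<String> disconnected(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            // Assumed layout: id, version, connected, then profiles and status.
            // The prompt line and the header row never have "false" in column 3.
            if (cols.length >= 3 && "false".equals(cols[2])) {
                // A trailing "*" is assumed to mark the local container; strip it.
                ids.add(cols[0].replace("*", ""));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> output = Arrays.asList(
            "[id]   [version] [connected] [profiles]   [provision status]",
            "root*  1.0  true   fabric, jboss-fuse-full, fabric-ensemble-0004-1  success",
            "toot   1.0  false  fabric  success",
            "zoot   1.0  false  fabric, fabric-ensemble-0003-3  success");
        System.out.println(disconnected(output)); // prints [toot, zoot]
    }
}
```

      Note that the sketch only reports what container-list shows. It cannot tell a container that was shut down deliberately (zoot) from one that is running but failed to rejoin (toot).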


      Description

      A large Fuse fabric installation runs on a collection of virtual machines. After an outage at the VM networking level, the customer reported that the ensemble did not recover normal operation, and a complete restart of the installation was required.

      While I can't reproduce the customer's exact problem, I can reproduce what I believe is a very similar one. All that is needed is to create a 3-node ensemble on virtual machines, and suspend one of the VMs for some time, then wake it up.

      If I run container-list on the machine that was suspended, it fails completely: the command does not exist. This is the expected result for a container that does not consider itself part of a fabric. However, the VM and the container are live, and there is network connectivity between the VMs.

      Looking at the logs for the container that gets resumed, I see a whole slew of ZooKeeper-related network exceptions. "java.lang.IllegalStateException: Client has been stopped" seems to be particularly relevant here. It looks as if there are some connection-related problems from which ZooKeeper simply never recovers.

      per.server.quorum.LearnerHandler  562 | 53 - io.fabric8.fabric-zookeeper - 1.0.0.redhat-379 | Unexpected exception causing shutdown while sock still open
      java.net.SocketTimeoutException: Read timed out
      	at java.net.SocketInputStream.socketRead0(Native Method)[:1.7.0_55]
      	at java.net.SocketInputStream.read(SocketInputStream.java:152)[:1.7.0_55]
       
      orum.QuorumCnxManager$RecvWorker  762 | 53 - io.fabric8.fabric-zookeeper - 1.0.0.redhat-379 | Connection broken for id 1, my id = 2, error = 
      java.net.SocketException: Connection reset
      	at java.net.SocketInputStream.read(SocketInputStream.java:196)[:1.7.0_55]
      	at java.net.SocketInputStream.read(SocketInputStream.java:122)[:1.7.0_55]
       
      2014-05-15 18:56:02,521 | ERROR | ZooKeeperGroup-0 | ConnectionState                  | g.apache.curator.ConnectionState  194 | 53 - io.fabric8.fabric-zookeeper - 1.0.0.redhat-379 | Connection timed out for connection string (lars:2182,toot:2181,zoot:2181) and timeout (15000) / elapsed (15004)
      org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
      	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:191)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:86)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:456)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:214)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:203)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at 
              ...
       
      2014-05-15 18:57:12,460 | WARN  | 0:0:0:0:0:0:2181 | Learner                          | zookeeper.server.quorum.Follower   89 | 53 - io.fabric8.fabric-zookeeper - 1.0.0.redhat-379 | Exception when following the leader
      java.net.SocketException: Connection reset
      	at java.net.SocketInputStream.read(SocketInputStream.java:196)[:1.7.0_55]
      	at java.net.SocketInputStream.read(SocketInputStream.java:122)[:1.7.0_55]
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)[:1.7.0_55]
      	
       
      2014-05-15 18:57:12,520 | INFO  | 0:0:0:0:0:0:2181 | Learner                          | zookeeper.server.quorum.Follower  166 | 53 - io.fabric8.fabric-zookeeper - 1.0.0.redhat-379 | shutdown called
      java.lang.Exception: shutdown Follower
      	at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
       
       
      2014-05-15 19:05:31,695 | WARN  | pool-61-thread-1 | GitDataStore                     | abric8.git.internal.GitDataStore 1208 | 85 - io.fabric8.fabric-git - 1.0.0.redhat-379 | Failed to perform a pull java.lang.IllegalStateException: Client has been stopped
      java.lang.IllegalStateException: Client has been stopped
      	at com.google.common.base.Preconditions.checkState(Preconditions.java:150)[83:com.google.guava:15.0.0]
      	at org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:320)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:105)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:252)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:239)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:39)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
      	at io.fabric8.zookeeper.utils.ZooKeeperUtils.setData(ZooKeeperUtils.java:204)[53:io.fabric8.fabric-zookeeper:1.0.0.redhat-379]
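      To separate "the ZooKeeper server is unreachable" from "the client inside the container has stopped", each ensemble member can be probed directly, outside the container. The following is a rough sketch, not part of the original report. It sends ZooKeeper's four-letter command "ruok" to every server in a connection string; a running server answers "imok". The default connection string is copied from the log above, while the class name, method names and the 3000 ms timeout are hypothetical choices. Newer ZooKeeper releases may require the command to be explicitly enabled, and an "imok" reply does not prove that the server has joined the quorum.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EnsembleProbe {

    /** Splits a connection string such as "a:2181,b:2181" into host/port pairs. */
    static List<InetSocketAddress> parse(String connectString) {
        List<InetSocketAddress> servers = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String[] parts = entry.trim().split(":");
            // createUnresolved avoids a DNS lookup while parsing.
            servers.add(InetSocketAddress.createUnresolved(
                    parts[0], Integer.parseInt(parts[1])));
        }
        return servers;
    }

    /** Sends "ruok"; returns the reply text, or null on any I/O failure. */
    static String ruok(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            byte[] buf = new byte[4];
            int n = socket.getInputStream().read(buf);
            return n > 0 ? new String(buf, 0, n, StandardCharsets.US_ASCII) : null;
        } catch (IOException e) {
            return null; // unreachable, refused, timed out, or unknown host
        }
    }

    public static void main(String[] args) {
        String connectString =
                args.length > 0 ? args[0] : "lars:2182,toot:2181,zoot:2181";
        for (InetSocketAddress server : parse(connectString)) {
            String reply = ruok(server.getHostString(), server.getPort(), 3000);
            System.out.println(server.getHostString() + ":" + server.getPort()
                    + " -> " + (reply == null ? "no reply" : reply));
        }
    }
}
```

      If every server answers "imok" while the resumed container still logs "Client has been stopped", that would point at the client side (Curator) rather than at the servers, which is consistent with the suspicion stated above but does not confirm it.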
      

                People

                • Assignee:
                  sonicaaaa Paolo Antinori
                • Reporter:
                  kboone Kevin Boone
                • Votes:
                  0
                • Watchers:
                  11