Uploaded image for project: 'Red Hat Fuse'
  1. Red Hat Fuse
  2. ENTESB-5816

Changing ZooKeeper settings invalidates ZK client session

    Details

    • Steps to Reproduce:
      Hide

      Create a 3 node ensemble
      Change the zk settings:

      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.retry.interval=11000 default 1.0
      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.retry.max=340 default 1.0
      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.session.timeout=310000 default 1.0
      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.connection.time.out=65000 default 1.0
      

      Shutdown 1 server
      Start server

      Fabric is split

      Show
      Create a 3 node ensemble Change the zk settings: fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.retry.interval=11000 default 1.0 fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.retry.max=340 default 1.0 fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.session.timeout=310000 default 1.0 fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.connection.time.out=65000 default 1.0 Shutdown 1 server Start server Fabric is split

      Description

      Setting zookeeper timeout and retry options lead to a failure of the zookeeper session.

      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.retry.interval=11000 default 1.0
      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.retry.max=340 default 1.0
      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.session.timeout=310000 default 1.0
      fabric:profile-edit --pid io.fabric8.zookeeper/zookeeper.connection.time.out=65000 default 1.0
      

      When the profile edit is done, the connection is recreated followed by a number of client exceptions.

      2016-07-06 18:19:56,161 | INFO  | MCF-1-thread-1   | CuratorFrameworkImpl             | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Starting
      2016-07-06 18:19:56,162 | INFO  | MCF-1-thread-1   | ZooKeeper                        | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Initiating client connection, connectString=fusefabric1.gsslab.rdu2.redhat.com:2182,fusefabric2.gsslab.rdu2.redhat.com:2182,fusefabric3.gsslab.rdu2.redhat.com:2182 sessionTimeout=310000 watcher=org.apache.curator.ConnectionState@694f15db
      2016-07-06 18:19:56,226 | INFO  | ad-1-EventThread | ConnectionStateManager           | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | State change: CONNECTED
      2016-07-06 18:01:33,838 | ERROR | ZooKeeperGroup-0 | ZooKeeperGroup                   | 143 - io.fabric8.fabric-groups - 1.2.0.redhat-621107 |
      java.lang.IllegalStateException: Client is not started
              at com.google.common.base.Preconditions.checkState(Preconditions.java:176)[138:com.google.guava:17.0.0]
              at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:474)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:214)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:203)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:199)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:191)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
              at org.apache.curaf
              at io.fabric8.groups.internal.ZooKeeperGroup$4.run(ZooKeeperGroup.java:157)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)[:1.7.0_60]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)[:1.7.0_60]
              at java.lang.Thread.run(Thread.java:745)[:1.7.0_60]
      

      And

      2016-07-06 18:01:33,852 | WARN  | MCF-1-thread-1   | ServletEventDispatcher           | 92 - org.ops4j.pax.web.pax-web-runtime - 3.2.9 | Listener caused an exception, will be ignored
      java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Client is not started
              at java.util.concurrent.FutureTask.report(FutureTask.java:122)[:1.7.0_60]
              at java.util.concurrent.FutureTask.get(FutureTask.java:188)[:1.7.0_60]
              at java.util.concurrent.AbstractExecutorService.doInvokeAny(AbstractExecutorService.java:193)[:1.7.0_60]
              at java.util.concurrent.AbstractExecutorService.invokeAny(AbstractExecutorService.java:225)[:1.7.0_60]
              at org.ops4j.pax.web.service.internal.ServletEventDispatcher.callListener(ServletEventDispatcher.java:185)
              at org.ops4j.pax.web.service.internal.ServletEventDispatcher.callListeners(ServletEventDispatcher.java:174)
              at org.ops4j.pax.web.service.internal.ServletEventDispatcher.servletEvent(ServletEventDispatcher.java:135)
              at org.ops4j.pax.web.service.internal.HttpServiceStarted.servletEvent(HttpServiceStarted.java:193)
              at org.ops4j.pax.web.service.internal.HttpServiceStarted.unregister(HttpServiceStarted.java:270)
              at org.ops4j.pax.web.service.internal.HttpServiceProxy.unregister(HttpServiceProxy.java:77)
              at io.fabric8.maven.proxy.impl.MavenProxyRegistrationHandler.destroy(MavenProxyRegistrationHandler.java:145)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)[:1.7.0_60]
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)[:1.7.0_60]
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)[:1.7.0_60]
              at java.lang.reflect.Method.invoke(Method.java:606)[:1.7.0_60]
      

      If you shutdown one of the ensemble servers, state changes to suspended, even though there are still 2 or 4 up and only 1 down.

      2016-07-06 18:21:03,076 | WARN  | SendWorker:3     | QuorumCnxManager                 | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Send worker leaving thread
      2016-07-06 18:21:03,978 | INFO  | ad-1-EventThread | ConnectionStateManager           | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | State change: SUSPENDED
      

      Bring up the shutdown node and it reconnects to the other 2 fabric nodes, but they do not recognize it. Fabric commands do not work.

      18:22:04,793 | INFO  | redhat.com:2182) | ClientCnxn                       | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Session establishment complete on server fusefabric1.gsslab.rdu2.redhat.com/10.10.183.203:2182, sessionid = 0x155c23b4fbd0007, negotiated timeout = 40000
      18:22:04,798 | INFO  | ad-1-EventThread | ConnectionStateManager           | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | State change: CONNECTED
      
      18:22:05,124 | INFO  | 0:0:0:0:0:0:2182 | Learner                          | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | FOLLOWING - LEADER ELECTION TOOK - 90
      18:22:05,139 | INFO  | 0:0:0:0:0:0:2182 | Learner                          | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Getting a diff from the leader 0x2000008b2
      18:22:05,143 | WARN  | 0:0:0:0:0:0:2182 | Learner                          | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Got zxid 0x2000008ab expected 0x1
      18:22:05,158 | INFO  | 0:0:0:0:0:0:2182 | FileTxnSnapLog                   | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Snapshotting: 0x2000008b2 to /opt/fuse/jboss-fuse-6.2.1.redhat-107/data/zookeeper/0001/version-2/snapshot.2000008b2
      18:22:05,206 | INFO  | SyncThread:3     | FileTxnLog                       | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Creating new log file: log.2000008b3
      

      The other server known it got the message, but continues in a connection failure state

      2016-07-06 18:22:14,175 | INFO  | .10.183.203:3888 | QuorumCnxManager                 | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Received connection request /10.10.183.205:50322
      2016-07-06 18:22:14,192 | INFO  | Receiver[myid=1] | FastLeaderElection               | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Notification: 1 (message format version), 3 (n.leader), 0x2000008aa (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x2 (n.peerEpoch) FOLLOWING (my state)
      2016-07-06 18:23:27,003 | ERROR | ZooKeeperGroup-0 | ConnectionState                  | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Connection timed out for connection string (fusefabric1.gsslab.rdu2.redhat.com:2182,fusefabric2.gsslab.rdu2.redhat.com:2182,fusefabric3.gsslab.rdu2.redhat.com:2182) and timeout (66000) / elapsed (143025)
      org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
      	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:474)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)[149:io.fabric8.fabric-zookeeper:1.2.0.redhat-621107]
      	at io.fabric8.groups.internal.ZooKeeperGroup.getDataAndStat(ZooKeeperGroup.java:400)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
      	at io.fabric8.groups.internal.ZooKeeperGroup.processChildren(ZooKeeperGroup.java:476)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
      	at io.fabric8.groups.internal.ZooKeeperGroup.refresh(ZooKeeperGroup.java:378)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
      	at io.fabric8.groups.internal.RefreshOperation.invoke(RefreshOperation.java:32)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
      	at io.fabric8.groups.internal.ZooKeeperGroup.mainLoop(ZooKeeperGroup.java:496)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
      	at io.fabric8.groups.internal.ZooKeeperGroup.access$200(ZooKeeperGroup.java:66)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
      	at io.fabric8.groups.internal.ZooKeeperGroup$4.run(ZooKeeperGroup.java:157)[143:io.fabric8.fabric-groups:1.2.0.redhat-621107]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)[:1.7.0_60]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)[:1.7.0_60]
      	at java.lang.Thread.run(Thread.java:745)[:1.7.0_60]
      

      Eventually, when the session timeout happens, the session is destroyed and things reform.

      2016-07-06 18:26:45,854 | WARN  | l Console Thread | ConnectionState                  | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Connection attempt unsuccessful after 341876 (greater than max timeout of 310000). Resetting connection and trying again with a new connection.
      2016-07-06 18:26:45,864 | INFO  | l Console Thread | ZooKeeper                        | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Session: 0x355c23be86f0002 closed
      2016-07-06 18:26:45,864 | INFO  | l Console Thread | ZooKeeper                        | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | Initiating client connection, connectString=fusefabric1.gsslab.rdu2.redhat.com:2182,fusefabric2.gsslab.rdu2.redhat.com:2182,fusefabric3.gsslab.rdu2.redhat.com:2182 sessionTimeout=310000 watcher=org.apache.curator.ConnectionState@694f15db
      2016-07-06 18:26:45,881 | INFO  | read-EventThread | ConnectionStateManager           | 149 - io.fabric8.fabric-zookeeper - 1.2.0.redhat-621107 | State change: RECONNECTED
      2016-07-06 18:26:45,884 | INFO  | onStateManager-0 | GitDataStoreImpl                 | 144 - io.fabric8.fabric-git - 1.2.0.redhat-621107 | Shared Counter (Re)connected, doing a pull
      

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

                People

                • Assignee:
                  sonicaaaa Paolo Antinori
                  Reporter:
                  mrobson Matt Robson
                  Tester:
                  Roman Jakubco
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  2 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved: