WildFly Core / WFCORE-3302

Intermittent protocol and controller module unit test failures since move to JBoss Remoting 5

Details

    Description

      This bug is about problems in WF Core management tests. I believe it exposes a flaw in how remoting handles server sockets, but AFAIK there is no impact on WF Core remoting server sockets.

      Since the move to JBoss Remoting 5 we've seen intermittent failures in the protocol and controller module testsuites, in the tests that use their respective copies of the ChannelServer + RemoteChannelPairSetup test fixture. These tests all set up and tear down the fixture for each test method (i.e. @Before and @After). The failure occurs when a test tries to create a remoting server and gets an error indicating that the server from a previous test hasn't completely shut down yet:

      java.lang.RuntimeException: java.net.BindException: Address already in use: bind
          at sun.nio.ch.Net.bind0(Native Method)
          at sun.nio.ch.Net.bind(Net.java:433)
          at sun.nio.ch.Net.bind(Net.java:425)
          at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
          at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
          at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
          at org.xnio.nio.NioXnioWorker.createTcpConnectionServer(NioXnioWorker.java:181)
          at org.xnio.XnioWorker.createStreamConnectionServer(XnioWorker.java:282)
          at org.jboss.remoting3.remote.RemoteConnectionProvider$ProviderInterface.createServer(RemoteConnectionProvider.java:372)
          at org.jboss.as.controller.support.ChannelServer.create(ChannelServer.java:92)
          at org.jboss.as.controller.support.RemoteChannelPairSetup.setupRemoting(RemoteChannelPairSetup.java:88)
          at org.jboss.as.controller.ModelControllerClientTestCase.setupTestClient(ModelControllerClientTestCase.java:94)
          at org.jboss.as.controller.ModelControllerClientTestCase.testCloseInputStreamEntry(ModelControllerClientTestCase.java:346)
      

      These failures have been mildly annoying on ci.wildfly.org, but now that the same code is being run on other test machines, e.g. brontes used for EAP testing, they are completely intolerable, affecting a high percentage of CI runs for pull requests.

      I believe the issue arises from changes to these fixtures that came in as part of the Remoting 5 upgrade, such that a remoting Endpoint is no longer created and shut down for each test method. This causes a problem because the AcceptingChannel<StreamConnection> created by Endpoint.getConnectionProviderInterface(...).createServer(...) does not synchronously close the underlying socket as part of a call to its close() method.

      The socket is not closed synchronously because the ServerSocketChannel impl of close() does not close the socket if there are any registered keys. Debugging shows the socket is not closed until this stack happens:

      "XNIO-1 Accept@1562" daemon prio=5 tid=0xf nid=NA runnable
        java.lang.Thread.State: RUNNABLE
      	  at sun.nio.ch.ServerSocketChannelImpl.kill(ServerSocketChannelImpl.java:307)
      	  - locked <0xc0d> (a java.lang.Object)
      	  at sun.nio.ch.KQueueSelectorImpl.implDereg(KQueueSelectorImpl.java:229)
      	  at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149)
      	  - locked <0xc38> (a java.util.HashSet)
      	  at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:107)
      	  at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
      	  - locked <0xc2b> (a sun.nio.ch.KQueueSelectorImpl)
      	  - locked <0xc39> (a java.util.Collections$UnmodifiableSet)
      	  - locked <0xc3a> (a sun.nio.ch.Util$2)
      	  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
      	  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
      	  at org.xnio.nio.WorkerThread.run(WorkerThread.java:519)
      

      That thread is not under the control of the test fixture, which means there's a race between it closing the socket and the test moving on to the next setup, where it tries to open the socket again.
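
      The deferred close can be sketched with plain java.nio, without Remoting or XNIO. This is only an illustration under assumptions: the class and method names are hypothetical, the selector stands in for the one driven by the XNIO accept thread, and whether the port actually stays bound between close() and the next select appears to vary by platform. The sketch therefore only shows that a rebind works once the selector has processed the cancelled key.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical demo class; not part of WildFly Core, Remoting or XNIO.
final class DeferredCloseDemo {

    /** Opens a non-blocking server channel on a loopback port (0 = any free port). */
    static ServerSocketChannel open(int port) throws IOException {
        ServerSocketChannel ch = ServerSocketChannel.open();
        ch.configureBlocking(false);
        ch.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        return ch;
    }

    /**
     * Closes a channel that is registered with a selector, runs one select
     * pass, then rebinds the same port. Returns true if the rebind succeeded.
     */
    static boolean rebindAfterSelect() {
        try (Selector selector = Selector.open()) {
            ServerSocketChannel first = open(0);
            int port = ((InetSocketAddress) first.getLocalAddress()).getPort();
            first.register(selector, SelectionKey.OP_ACCEPT);

            // close() cancels the key, but the key stays registered until the
            // selector next runs, so the socket may not be released yet.
            first.close();

            // In the real fixture this step happens on the "XNIO-1 Accept"
            // thread at a time the test does not control.
            selector.selectNow();

            try (ServerSocketChannel second = open(port)) {
                return second.isOpen();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("rebind after selectNow(): " + rebindAfterSelect());
    }
}
```

      In the failing tests nothing in the fixture plays the role of the selectNow() call, so the next setup can reach the bind before the accept thread has deregistered the key.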

      I think the only solution for this is to bring the endpoint lifecycle back under the control of the test fixture, so that the fixture knows everything is shut down. I don't see anything else the test can block on to ensure the server socket is closed.
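
      As a rough analogy for that approach, here is a plain java.nio sketch in which the fixture owns the selector as well as the server channel, so teardown can release the socket before it returns. It is not the Remoting API: OwnedSelectorFixture and its methods are hypothetical names, and the selector merely stands in for the Endpoint and its worker.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical stand-in for a fixture that owns its whole I/O lifecycle.
final class OwnedSelectorFixture implements AutoCloseable {
    private final Selector selector;
    private final ServerSocketChannel server;

    OwnedSelectorFixture(int port) throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    int port() throws IOException {
        return ((InetSocketAddress) server.getLocalAddress()).getPort();
    }

    @Override
    public void close() throws IOException {
        server.close();
        // Closing the selector deregisters its keys on this thread, so the
        // socket is released by the time close() returns.
        selector.close();
    }

    /** Runs setup/teardown cycles on one port; returns how many completed. */
    static int cycles(int count) {
        try {
            int port;
            try (OwnedSelectorFixture probe = new OwnedSelectorFixture(0)) {
                port = probe.port(); // pick a free port, then reuse it
            }
            int completed = 0;
            for (int i = 0; i < count; i++) {
                try (OwnedSelectorFixture fixture = new OwnedSelectorFixture(port)) {
                    completed++;
                }
            }
            return completed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("completed cycles: " + cycles(20));
    }
}
```

      The point of the design is that teardown has something to block on: once the fixture owns the component that performs the deregistration, it no longer depends on a thread outside its control.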

      I think this would be a bug for any use of remoting where a server may be shut down and then quickly recreated.

            People

              dlloyd@redhat.com David Lloyd
              bstansbe@redhat.com Brian Stansberry