  1. JBoss Enterprise Application Platform 4 and 5
  2. JBPAPP-7205

HornetQ - HA with disconnected journal


Details

    • Bug
    • Resolution: Done
    • Critical
    • EAP_EWP 5.1.1, EAP_EWP 5.1.2 CR1, EAP_EWP 5.1.2 CR3, EAP_EWP 5.1.2 CR4
    • HornetQ
    • None
    • RHEL 6 x86-64 with GFS2/SAN

    • In a cluster of HornetQ instances, when a node had its journal disconnected (for example, when the server lost its connection to the SAN), the other nodes did not take over from the failed node. This problem has now been fixed: failover from a failed HornetQ node occurs without interruption to the client.
    • Documented as Resolved Issue
    • NEW

    Description

      Hi Clebert,

      As we agreed, we've started developing tests with a disconnected journal according to the HornetQ test plan (section 10). For now, all test scenarios fail because the HornetQ architecture was not initially designed to handle such a situation. I'd like to share the current test results here, along with some information about the testing environment.

      Test Scenario - "Node is disconnected from journal" - collocated backup (corresponds to section 10.1.1):
      1. Start the cluster - EAP servers A and B
      2. Start a "live" producer and a "live" consumer connected to server A and exchanging messages through "liveQueue" - both stay active for the whole duration of the test
      3. Start a producer - send 1000 messages to "testQueue" on server A
      4. Disconnect the SAN from server A
      5. Start a consumer - read from "testQueue" on server B

      Pass criteria:
      After step 4, the backup node takes over the live role.
      Clients are reconnected to the backup node and are able to continue their work.
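
      The report does not say how the test harness decides whether clients were "able to continue". A minimal sketch of one possible check, assuming each message carries a numeric id, compares the ids sent in step 3 with the ids read in step 5. The class name, the method name and the id scheme are hypothetical and are not taken from the attached clients.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical harness helper; not part of HornetQ or the attached test clients. */
public class FailoverCheckSketch {

    /** Returns the ids that were sent but never received, in send order. */
    public static Set<Integer> missing(List<Integer> sent, List<Integer> received) {
        Set<Integer> lost = new LinkedHashSet<>(sent);
        lost.removeAll(received);
        return lost;
    }

    public static void main(String[] args) {
        // Step 3: 1000 messages, identified here by the assumed ids 0..999.
        List<Integer> sent = IntStream.range(0, 1000).boxed().collect(Collectors.toList());

        // Made-up outcome for illustration only: every second message arrives.
        List<Integer> received = sent.stream()
                .filter(id -> id % 2 == 0)
                .collect(Collectors.toList());

        Set<Integer> lost = missing(sent, received);
        System.out.println(lost.isEmpty()
                ? "PASS"
                : "FAIL: " + lost.size() + " messages missing");
    }
}
```

      With an empty result from missing(...) the scenario would count as passed under this assumed criterion.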

      Test results:
      After step 4:

      • EAP server B does not take over - the backup does not become live
      • The "live" producer and consumer end with an exception (see the attached logs) and do not fail over to EAP server B

      In step 5, the consumer on EAP node B is able to read only half of the messages sent to "testQueue" in step 3 (load balancing).
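
      The "half of the messages" figure is what one would expect if the cluster spreads the 1000 messages evenly over both nodes. The sketch below is a simplified model under that assumption (strict round-robin over two journals); it is not HornetQ code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of round-robin load balancing; not HornetQ code. */
public class LoadBalancingSketch {

    /** Splits message ids 0..count-1 across nodeCount journals in round-robin order. */
    public static List<List<Integer>> distribute(int count, int nodeCount) {
        List<List<Integer>> journals = new ArrayList<>();
        for (int n = 0; n < nodeCount; n++) {
            journals.add(new ArrayList<>());
        }
        for (int id = 0; id < count; id++) {
            journals.get(id % nodeCount).add(id);
        }
        return journals;
    }

    public static void main(String[] args) {
        // Index 0 stands for server A, index 1 for server B (assumed mapping).
        List<List<Integer>> journals = distribute(1000, 2);

        // After the SAN is disconnected, only B's journal is still readable.
        System.out.println("stranded on A:   " + journals.get(0).size());
        System.out.println("readable from B: " + journals.get(1).size());
    }
}
```

      In this model 500 messages sit in A's unreachable journal and 500 remain readable from B, which matches the observed behaviour.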

      Note about the testing environment:
      GFS2/SAN uses the "fenced" daemon, which powers off nodes that have failed. Disconnecting the SAN triggers this, but it takes a couple of minutes. For our test scenario this means that after step 4 new clients can still connect to EAP server A but fail to send or read any messages. EAP server B only delivers the messages that are already in its journal when clients connect to it.

      Do we already have a solution for this?

      Thank you,

      Mirek

      Attachments

        1. jmsClient.zip
          127 kB
        2. live_consumer.log
          5 kB
        3. live_producer.log
          10 kB
        4. logs.zip
          3.86 MB
        5. logs.zip
          1.03 MB
        6. newJmsClient.zip
          92 kB
        7. reproducer.zip
          9.04 MB
        8. san_consumer_threaddump.txt
          12 kB
        9. serverA.log
          79 kB
        10. server-A-threaddump.txt
          157 kB
        11. serverB.log
          37 kB
        12. server-B-threaddump.txt
          187 kB

            People

              csuconic@redhat.com Clebert Suconic
              mnovak1@redhat.com Miroslav Novak
              Russell Dickenson (Inactive)