FUSE Message Broker
MB-511

KahaDB store should handle missing or corrupt log files gracefully

    Details

    • Type: Enhancement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Done
    • Affects Version/s: 5.3.0.2-fuse
    • Fix Version/s: 5.3.0.5-fuse
    • Component/s: broker
    • Labels: None
    • Environment: Solaris 10

      Description

      If a log file in KahaDB is missing or corrupt, KahaDB comes to a halt for consumers. It should handle this situation gracefully and try to recover as many messages as it can.

        Activity

        Hiram Chirino added a comment -

        Could you provide specifics on what kind of corruption it is and when/how it happens?

        Joe Luo added a comment -

        Here are the instructions to reproduce the problem:
        1. Copy the attached activemq.xml configuration file to the "conf" directory of the Fuse MB 5.3.0.2-fuse installation.
        Note: I am using KahaDB with the maximum journal file length set to only 1MB, so it creates a lot of db log files (see the sketch after this list).
        2. Unzip mb511-testcase.zip into the Fuse MB 5.3.0.2-fuse installation directory.
        3. Start the consumer with "ant consumer".
        4. Start the producer with "ant producer".
        5. Wait until at least 4 or 5 db log files appear under the "data\amq" directory.
        6. Stop the ActiveMQ broker and then delete the db-3.log file.
        7. Restart the ActiveMQ broker.
        8. Restart the consumer; it continues receiving messages until it finishes.
        9. Restart the producer to send more messages.
        10. The consumer remains halted.
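
        For reference, a minimal sketch of the kind of store configuration these steps assume. The bean style matches the store bean referenced in the comments below; the persistenceAdapter wrapper and the directory value are illustrative, and 1048576 bytes corresponds to the 1MB journal file length from the note above:

        <persistenceAdapter>
          <bean xmlns="http://www.springframework.org/schema/beans"
                class="org.apache.activemq.store.kahadb.KahaDBStore">
            <!-- illustrative journal directory, matching the data\amq path in the steps -->
            <property name="directory" value="data/amq"/>
            <!-- roll to a new db-N.log after ~1MB so journal files accumulate quickly -->
            <property name="journalMaxFileLength" value="1048576"/>
          </bean>
        </persistenceAdapter>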

        Hiram Chirino added a comment -

        Added an ignoreMissingJournalfiles option to the KahaDB store to allow graceful recovery even when active journal files have been removed.

        In the example activemq configuration attached to this issue, you would just need to add a property element to the store bean like:

        <property name="ignoreMissingJournalfiles" value="true"/>
        

        This should become available in the 5.3.0.5-fuse release.
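
        Putting that in context, the resulting store bean would look roughly like the sketch below. The surrounding elements and the other properties are carried over from the reproduction setup above; only the ignoreMissingJournalfiles property is the new addition:

        <persistenceAdapter>
          <bean xmlns="http://www.springframework.org/schema/beans"
                class="org.apache.activemq.store.kahadb.KahaDBStore">
            <property name="directory" value="data/amq"/>
            <property name="journalMaxFileLength" value="1048576"/>
            <!-- new: skip missing journal files instead of halting recovery -->
            <property name="ignoreMissingJournalfiles" value="true"/>
          </bean>
        </persistenceAdapter>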

        Joe Luo added a comment -

        There were two issues:
        1. Missing journal files.
        2. Corrupt journal files (in this case the file exists but for some reason is unreadable).

        I ran a test with this new property, and it works for the first issue, missing journal files. However, it does not work for the second one: the broker still could not recover if a journal file became corrupted.

        Here are some ways to corrupt a journal file:
        1. Open it in an editor and randomly delete portions of the file. Since the file is binary, it doesn't really matter what I delete. What I saw was that the broker could not recover from this condition.
        Or
        2. Randomly append some junk to the end of the file (with output redirection like >>). This also results in a condition where the broker cannot recover, and I have to delete all data files to make things work again.

        Below are the exact steps I took:
        Repeat the steps from my previous comment, with these changes:
        5. Wait until at least 3 or 4 db log files appear under the "data\amq" directory.
        6. Stop the ActiveMQ broker, open the db-2.log file in a text editor, randomly delete some parts of the file, and save it.
        7. Restart the ActiveMQ broker.
        8. Restart the consumer; it continues receiving messages until it stops.
        Note: the consumer will only consume a small part of the messages before it stops, and the broker starts throwing the exceptions below:

        2009-08-18 09:56:54,684 [ue://test.queue] ERROR Queue - Failed to page in more queue messages
        java.lang.OutOfMemoryError: Java heap space
        at org.apache.activemq.openwire.v5.BaseDataStreamMarshaller.looseUnmarshalByteSequence(BaseDataStreamMarshaller.java:639)
        at org.apache.activemq.openwire.v5.MessageMarshaller.looseUnmarshal(MessageMarshaller.java:229)
        at org.apache.activemq.openwire.v5.ActiveMQMessageMarshaller.looseUnmarshal(ActiveMQMessageMarshaller.java:101)
        at org.apache.activemq.openwire.v5.ActiveMQTextMessageMarshaller.looseUnmarshal(ActiveMQTextMessageMarshaller.java:101)
        at org.apache.activemq.openwire.OpenWireFormat.doUnmarshal(OpenWireFormat.java:368)
        at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:279)
        at org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:536)
        at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:230)
        at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
        at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:223)
        at org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:81)
        at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:227)
        at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:99)
        at org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
        at org.apache.activemq.broker.region.Queue.doPageIn(Queue.java:1279)
        at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1419)
        at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1103)
        at org.apache.activemq.thread.DeterministicTaskRunner.runTask(DeterministicTaskRunner.java:84)
        at org.apache.activemq.thread.DeterministicTaskRunner$1.run(DeterministicTaskRunner.java:41)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
        2009-08-18 09:56:54,980 [ue://test.queue] ERROR Queue - Failed to page in more queue messages
        java.lang.OutOfMemoryError: Java heap space
        2009-08-18 09:56:55,277 [ue://test.queue] ERROR Queue - Failed to page in more queue messages
        java.lang.OutOfMemoryError: Java heap space
        2009-08-18 09:56:55,574 [ue://test.queue] ERROR Queue - Failed to page in more queue messages
        java.lang.OutOfMemoryError: Java heap space
        ...

        9. Restart the producer to send more messages.
        10. The consumer remains halted, and the broker keeps throwing the above exception and never recovers.

        So in the case of a corrupted journal file, the broker should at least be able to recover from the condition first and then recover/process as many messages as possible.

        Hiram Chirino added a comment - edited

        The second part of this issue is now implemented.

        Two new options have been added to the kahaDB configuration element:

        • checksumJournalFiles: if true, enables storing checksums of the journal data so corruption can be detected.
        • checkForCorruptJournalFiles: if true, all the data file contents are validated against their checksums.

        And as before, ignoreMissingJournalfiles should be set to true if you want to gracefully ignore corrupted data files.

        So the new config should have:

        <property name="ignoreMissingJournalfiles" value="true"/>
        <property name="checksumJournalFiles" value="true"/>
        <property name="checkForCorruptJournalFiles" value="true"/>
        
        Joe Luo added a comment -

        Further update on the test of the new properties:
        A minor mistake in the activemq.xml configuration file relating to the two new properties had caused my test to fail. After correcting that mistake, the test went well.

        The instructions I described above for corrupting a KahaDB log file were not a typical way of getting into a corrupted DB file situation, so they were inappropriate. This time I corrupted a log file by using a hex editor to apply random edits to the file: not deleting or inserting in the middle of the file, but only overwriting. It is also OK to delete blocks of bytes from the end of the file with a hex editor. With a KahaDB log file corrupted in this way, the two new properties worked fine; I can see that only the messages in the affected area were skipped, and the broker recovered well.

        I also ran some tests to measure broker startup time, because adding checksums to the DB log files might make the broker start up more slowly. For comparison, I ran two tests, with and without the two new properties defined in the "org.apache.activemq.store.kahadb.KahaDBStore" bean. For each test, I produced around 1.25 to 1.27 million messages into a single queue, each message about 700 bytes in size. My tests showed that broker startup was not really slower with checksums written into the DB log files versus the default setting; most of the time, the two tests produced quite similar startup times. Only once did I get a 40-second startup for the broker with checksummed logs, whereas 25 seconds was the longest startup for the broker with non-checksummed logs (the default).

        I believe this issue can be closed.

        Hiram Chirino added a comment -

        Issue resolved and validated by CS.


          People

          • Assignee: Hiram Chirino
          • Reporter: Dhiraj Bokde
          • Votes: 0
          • Watchers: 2
