Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Fix Version/s: 2.12
Affects Version/s: 2.11
Labels: None

Workaround: Workaround Exists
Workaround Description: Disconnect from the JChannel and connect again.
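
The workaround could be automated roughly as follows. This is only a sketch under stated assumptions, not JGroups code: `ClusterChannel` is a hypothetical interface standing in for the two JChannel operations the workaround relies on (disconnect, then connect with the cluster name), and `Rejoiner`, `rejoin`, `maxAttempts` and `pauseMillis` are names invented for illustration.

```java
/** Hypothetical stand-in for the two channel operations used by the workaround. */
interface ClusterChannel {
    void disconnect();
    void connect(String clusterName) throws Exception;
}

final class Rejoiner {
    /**
     * Disconnects once, then tries to connect up to maxAttempts times,
     * pausing between failed attempts.
     * Returns the number of the successful attempt, or -1 if none succeeded.
     */
    static int rejoin(ClusterChannel ch, String clusterName, int maxAttempts, long pauseMillis) {
        ch.disconnect();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ch.connect(clusterName);
                return attempt;
            } catch (Exception connectFailed) {
                if (attempt == maxAttempts) {
                    break; // no point in sleeping after the last attempt
                }
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

In a real application the interface would presumably be implemented by a thin adapter around the existing JChannel instance, and `rejoin` would be called once the node notices it has been excluded from the cluster.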
Steps to Reproduce:

Start several nodes (more than 3, I think) and set Xmx for the JVM to 8 or even 16 GB, then use the jmap tool (see the command in the description) to take a memory dump. Set the config so that the other nodes update their view while the dump is being taken. In my tests, the problematic node on which I took the dump was NOT the coordinator.

In our production system I can see that a node disappears from the cluster when its server is heavily loaded. That is OK, but the node never comes back to the cluster, even after its server is working normally again, without load. I can easily reproduce the problem in 2 cases:

1) by taking a memory dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM, so the other members exclude this node from the view.

2) GC (garbage collection): if the JVM is doing GC constantly (and can hardly do any other work)

In both situations the stuck node never reappears in the cluster (even after 1 h). Below are more details.
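
Both reproductions come down to the JVM being stalled for longer than the other members are willing to wait. One way to observe such stalls from inside the node is a small pause detector: sleep for a fixed interval and measure how late the wake-up is. The sketch below is an illustration only; `PauseDetector`, its method names and the threshold are assumptions, and a sensible threshold would depend on the failure-detection settings in the attached config, which are not shown here.

```java
final class PauseDetector {
    /** Returns how many milliseconds later than intervalMillis the wake-up happened (never negative). */
    static long overshootMillis(long startNanos, long endNanos, long intervalMillis) {
        long elapsedMillis = (endNanos - startNanos) / 1_000_000L;
        return Math.max(0L, elapsedMillis - intervalMillis);
    }

    /** True if the overshoot is at least the (assumed) threshold. */
    static boolean looksLikeStall(long overshootMillis, long thresholdMillis) {
        return overshootMillis >= thresholdMillis;
    }

    /** Sleeps once for intervalMillis and reports the measured overshoot. */
    static long sampleOnce(long intervalMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // an early wake-up simply yields an overshoot of 0
        }
        return overshootMillis(start, System.nanoTime(), intervalMillis);
    }
}
```

Calling `sampleOnce` in a loop on a daemon thread and logging whenever `looksLikeStall` returns true would show whether a heap dump or a GC storm kept the node silent long enough to be excluded.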

We have 12 nodes in our cluster; the problematic node is "gate5".

View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]

View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
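
The difference between the two views can be checked mechanically. The following sketch works on plain strings in the format printed above ("[coordinator|id] [member, member, ...]"); it does not use any JGroups classes, and `ViewDiff` and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class ViewDiff {
    /** Extracts the member names from a view printed as "[coord|id] [m1, m2, ...]". */
    static List<String> members(String view) {
        int open = view.lastIndexOf('[');
        int close = view.lastIndexOf(']');
        String body = view.substring(open + 1, close).trim();
        List<String> result = new ArrayList<>();
        if (body.isEmpty()) {
            return result;
        }
        for (String member : body.split(",")) {
            result.add(member.trim());
        }
        return result;
    }

    /** Extracts the numeric view id that follows the '|' in the first bracket. */
    static long viewId(String view) {
        int bar = view.indexOf('|');
        int close = view.indexOf(']');
        return Long.parseLong(view.substring(bar + 1, close).trim());
    }

    /** Members present in the older view but absent from the newer one. */
    static List<String> missingFrom(String older, String newer) {
        List<String> missing = new ArrayList<>(members(older));
        missing.removeAll(members(newer));
        return missing;
    }
}
```

Applied to the two views above, `missingFrom(viewOnGate5, viewOnGate11)` should contain only gate5.mydomain, and the ids 869 and 870 confirm that gate5 is one view behind the coordinator.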

The coordinator (gate11) is sending GET_MBRS_REQ periodically, and I can see these requests arriving on gate5. But I do NOT see any response to them!
All JGroups threads are alive, not dead (I took stack traces).
Another strange thing is that the problematic gate5 sends messages to the other nodes and even receives messages from SOME of them! How is this possible? I double-checked that ALL other nodes have view_id=870 (without gate5).

My only guess is a race condition that occurs (as always) under high load.
In normal situations, such as a temporary network failure, everything works perfectly: gate5 rejoins the cluster.

Attachment: jgroups-tcp.xml (2 kB, added 2010/12/30 3:43 AM by Victor N)

Assignee: Bela Ban
Reporter: Victor N (Inactive)
Votes: 0
Watchers: 6
Created: 2010/12/30 3:08 AM
Updated: 2021/10/24 5:47 AM
Resolved: 2011/01/16 7:07 AM