Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: 4.0.10
Affects Version/s: 4.0.8
Labels:
None

Steps to Reproduce:

Run JGroups with JDBC_PING in the AWS cloud and crash the leader (coordinator) instance several times.

The ping table keeps growing, and so does logical_addr_cache. When new nodes start, they log a series of TQ Bundler errors.

Description

1) In AWS cloud environments, a node that crashes and is recreated as a new cluster node comes back with a different IP address.
2) In this situation, JGroups does not clear the stale entries from logical_addr_cache, so the crashed members' addresses are still in use when the cluster nodes restart.
3) logical_addr_cache_max_size and eviction did not help, because the cache is repopulated from the ping table, so the stale entries never end up marked as removable.
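
One plausible reading of point 3 can be sketched in plain Java. This is a simplified stand-in, not the JGroups cache class: the class name `LazyCacheSketch`, its methods, and the addresses are all hypothetical. It assumes that only entries marked removable can be evicted, and that re-adding an address from the ping table creates a fresh, non-removable entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a size-bounded address cache.
// It is NOT the JGroups implementation; it only illustrates the reported effect.
class LazyCacheSketch {
    static final class Entry {
        final String physicalAddr;
        boolean removable; // only removable entries may be evicted
        Entry(String physicalAddr) { this.physicalAddr = physicalAddr; }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final int maxSize;

    LazyCacheSketch(int maxSize) { this.maxSize = maxSize; }

    // Adding (or re-adding) an address stores a fresh, non-removable entry.
    void add(String logicalAddr, String physicalAddr) {
        entries.put(logicalAddr, new Entry(physicalAddr));
        evictIfNeeded();
    }

    // On a view change, entries outside the view are only marked removable.
    void retainAll(Set<String> members) {
        entries.forEach((addr, e) -> { if (!members.contains(addr)) e.removable = true; });
        evictIfNeeded();
    }

    // Eviction runs only above maxSize and only drops removable entries.
    private void evictIfNeeded() {
        if (entries.size() > maxSize)
            entries.values().removeIf(e -> e.removable);
    }

    boolean contains(String logicalAddr) { return entries.containsKey(logicalAddr); }
    int size() { return entries.size(); }

    public static void main(String[] args) {
        LazyCacheSketch cache = new LazyCacheSketch(2);
        cache.add("A", "10.0.0.1"); // A crashes later
        cache.add("B", "10.0.0.2");
        cache.add("C", "10.0.0.3");

        cache.retainAll(Set.of("B", "C")); // view change: A is marked and evicted
        System.out.println("after view change, A cached: " + cache.contains("A"));

        cache.add("A", "10.0.0.1"); // discovery reads the stale row back in
        System.out.println("after discovery, A cached: " + cache.contains("A")
                + ", size: " + cache.size());
    }
}
```

In this model the size limit alone cannot shrink the cache: as long as the stale row stays in the ping table, every discovery round brings the crashed member back as a non-removable entry.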

I think the issue is this:

The handleView method always rewrites the entire cache to the database on a view change. So even if the table is cleared with the help of the flags remove_all_data_on_view_change or remove_old_coords_on_view_change, the stale entries are written straight back to the table.

    // remove all files which are not from the current members
    protected void handleView(View new_view, View old_view, boolean coord_changed) {
        if(is_coord) {
            if(coord_changed) {
                if(remove_all_data_on_view_change)
                    removeAll(cluster_name);
                else if(remove_old_coords_on_view_change) {
                    Address old_coord=old_view != null? old_view.getCreator() : null;
                    if(old_coord != null)
                        remove(cluster_name, old_coord);
                }
            }
            if(coord_changed || View.diff(old_view, new_view)[1].length > 0) {
                writeAll();
                if(remove_all_data_on_view_change || remove_old_coords_on_view_change)
                    startInfoWriter();
            }
        }
        else if(coord_changed) // I'm no longer the coordinator
            remove(cluster_name, local_addr);
    }
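
The effect of clearing the table and then calling writeAll() can be reproduced with a small stand-alone model. This is only a sketch: the two maps stand in for the ping table and logical_addr_cache, and `PingTableSketch` and `writeMembers` are hypothetical names. `writeMembers` is one possible alternative for comparison, not necessarily how the issue was resolved.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: 'table' stands in for the JDBC_PING table,
// 'cache' for logical_addr_cache (logical name -> physical address).
class PingTableSketch {
    final Map<String, String> table = new TreeMap<>();
    final Map<String, String> cache = new TreeMap<>();

    // Mirrors the intent of removeAll(cluster_name): wipe the table.
    void removeAll() { table.clear(); }

    // Mirrors the reported behaviour: every cached entry is written back.
    void writeAll() { table.putAll(cache); }

    // Assumed alternative: write only the members of the current view.
    void writeMembers(Collection<String> members) {
        for (String member : members) {
            String physicalAddr = cache.get(member);
            if (physicalAddr != null)
                table.put(member, physicalAddr);
        }
    }

    public static void main(String[] args) {
        PingTableSketch s = new PingTableSketch();
        s.cache.put("A", "10.0.0.1"); // crashed, address no longer exists
        s.cache.put("B", "10.0.0.2");
        s.cache.put("C", "10.0.0.3");
        s.table.putAll(s.cache);
        List<String> view = List.of("B", "C");

        s.removeAll();
        s.writeAll();
        System.out.println("clear + writeAll:     " + s.table.keySet());

        s.removeAll();
        s.writeMembers(view);
        System.out.println("clear + writeMembers: " + s.table.keySet());
    }
}
```

With writeAll() the row for the crashed member A reappears right after the table was cleared, which matches the growing ping table described above. Restricting the write to the current view keeps A out in this model.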

4) Because the crashed members' IP addresses no longer exist, we are getting a lot of socket timeouts.

TP's sendToMembers tries to send messages to the old, crashed members and writes error logs during startup.
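
Why startup in particular is affected can be illustrated as follows. The sketch assumes that a transport without an installed view falls back to every address in its cache; that assumption, the class `SendTargetsSketch`, and the method `targets` are illustrative and not taken from the JGroups source.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of choosing unicast targets for a cluster-wide message.
class SendTargetsSketch {
    // Assumption: with no view yet, every cached logical address is a target.
    static Collection<String> targets(Collection<String> members, Map<String, String> cache) {
        Collection<String> logical =
                (members == null || members.isEmpty()) ? cache.keySet() : members;
        List<String> physical = new ArrayList<>();
        for (String member : logical) {
            String addr = cache.get(member);
            if (addr != null)
                physical.add(addr);
        }
        return physical;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new LinkedHashMap<>();
        cache.put("A", "10.0.0.1"); // crashed member, stale address
        cache.put("B", "10.0.0.2");
        cache.put("C", "10.0.0.3");

        // During startup there is no view yet, so the stale address is included.
        System.out.println("no view:   " + targets(List.of(), cache));
        // Once a view is installed, only current members are targeted.
        System.out.println("with view: " + targets(List.of("B", "C"), cache));
    }
}
```

Under that assumption, each stale cache entry costs one send to an unreachable address per message until the first view arrives, which would account for the timeouts and error logs seen during startup.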

Attachments

Issue Links

is related to

JGRP-2232 Using NATIVE_S3_PING old members doesn't seem to get removed

Resolved

Activity

People

Assignee: Bela Ban

Reporter: Sibin Karnavar (Inactive)

Votes: 0

Watchers: 2

Dates

Created: 2018/01/18 5:22 PM

Updated: 2018/02/02 10:33 AM

Resolved: 2018/01/31 10:27 AM