  1. JGroups
  2. JGRP-2239

AUTH + ASYM_ENCRYPT causes problem with re-joining cluster (MERGE)


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version: 4.0.11
    • Affects Version: 4.0.6
    • None

      This problem was hard to reproduce on the Linux production server, so I reproduced it on my Windows dev machine:

      1. Deployed 3 nodes (all 7 are not needed for the test; 3 is enough, but I could not reproduce the issue with a 2-node cluster).
      2. Ran a "CPU Stress" utility with maximum thread priority and load on all CPU cores of my PC for 3-5 minutes.
      3. Stopped "CPU Stress".
      4. Checked the logs: one node has left the cluster, and after the "merge" request the cluster looks strange, as it now has 2 subgroups. The kicked-out node cannot communicate with any other node, and its log fills up with error messages such as "unrecognized cipher".

    Description

      Hello,
      I am using the following configuration:

      <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      	xmlns="urn:org:jgroups" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
      	<UDP />
      	<PING />
      	<MERGE3 />
      	<FD />
      	<VERIFY_SUSPECT />
      
      	<ASYM_ENCRYPT encrypt_entire_message="true" sym_keylength="128"
      		sym_algorithm="AES/ECB/PKCS5Padding" asym_keylength="2048"
      		asym_algorithm="RSA" />
      
      	<pbcast.NAKACK2 />
      	<UNICAST3 />
      	<pbcast.STABLE />
      	<FRAG2 />
      	<AUTH auth_class="org.jgroups.auth.X509Token" auth_value="auth"
      		keystore_path="keystore.jks" keystore_password="pwd" cert_alias="alias"
      		cipher_type="RSA" />
      
      	<pbcast.GMS />
      </config>
      

      I have 7 services, but I will show logs for only 2 of them: the coordinator and one arbitrary node. All the other nodes behave similarly.

      Initially, when these nodes join the cluster, everything is fine.
      The server is a shared machine with a slow CPU and a slow HDD, so when other applications are busy with their tasks, my whole cluster can freeze for 3-5 minutes. During or at the end of such a freeze, a service may log the following:

      org.jgroups.protocols.FD up
      WARNING: node-26978: I was suspected by node-27291; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      WARNING: node-26978: unrecognized cipher; discarding message from node-27291
      org.jgroups.protocols.Encrypt handleEncryptedMessage
      WARNING: node-26978: unrecognized cipher; discarding message from node-27291
       org.jgroups.protocols.Encrypt handleEncryptedMessage
      WARNING: node-26978: unrecognized cipher; discarding message from node-36734
      org.jgroups.protocols.Encrypt handleEncryptedMessage
      

      So the node was kicked out of the cluster because it became a "suspect", but the node itself does not accept that. The cluster coordinator has already changed the symmetric key, so the subsequent logs of this node show "unrecognized cipher".

      In cluster coordinator logs I see the following:

      INFO: ISPN100000: Node node-26978 joined the cluster
      ****
      WARN: node-27291: unrecognized cipher; discarding message from node-26978
      org.jgroups.logging.Slf4jLogImpl error
      ERROR: key requester  node-26978 is not in current view [***]; ignoring key request
      org.jgroups.logging.Slf4jLogImpl warn
      WARN: node-27291: unrecognized cipher; discarding message from node-26978
      
      INFO: ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[node-26978|8] (7) [node-26978, node-12721, node-17625, node-45936, node-56674, node-36734, node-27291], 2 subgroups: [node-27291|7] (6) [node-27291, node-12721, node-17625, node-45936, node-56674, node-36734], [node-27291|6] (7) [node-27291, node-26978, node-12721, node-17625, node-45936, node-56674, node-36734]
      
      

      My understanding of what happened:
      Suppose I have 3 nodes {A, B, C} in the cluster. The cluster freezes for some minutes, so node {C} becomes suspected and is kicked out of the cluster by the coordinator. For some reason {C} ignores that fact. Later, once the cluster is responsive again, it starts ignoring messages from {C}, because it uses ASYM encryption and the shared secret key has been regenerated by the coordinator. Also, for some reason the MERGE operation does not work: {C} cannot rejoin the cluster, which now has 2 subgroups that do not communicate with each other. I do not fully understand why this happens.

      How I temporarily worked around this issue: I changed ASYM_ENCRYPT to SYM_ENCRYPT, and now any node can rejoin the cluster after a freeze, because the key does not change.
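
      A rough sketch of that workaround, with SYM_ENCRYPT taking the place of ASYM_ENCRYPT at the same position in the stack. The attribute names are those of the JGroups 4.0 SYM_ENCRYPT protocol as far as I know; the keystore file name, password and alias are placeholders (assumptions, not values from the real setup), and the keystore is assumed to be a JCEKS store holding a 128-bit AES secret key that is distributed to every node:

      	<!-- Placeholder values: replace the keystore file, password and alias with your own -->
      	<SYM_ENCRYPT encrypt_entire_message="true"
      		sym_algorithm="AES/ECB/PKCS5Padding"
      		keystore_name="symkey.jceks" store_password="changeit"
      		alias="clusterKey" />

      Because the key comes from the keystore rather than from the coordinator, a view change does not alter it, which would explain why nodes can rejoin after a freeze.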

      Also, I have not tested it, but I think change_key_on_leave="false" would help; however, that is not how I want to run the cluster.
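
      For reference, that untested alternative would be a one-attribute change to the ASYM_ENCRYPT element from the configuration above. This is only a sketch: it would stop the coordinator from generating a new secret key when a member leaves, at the cost of letting departed members keep a key that still decrypts later traffic:

      	<!-- Untested: keeps the same secret key when a member leaves the view -->
      	<ASYM_ENCRYPT encrypt_entire_message="true" sym_keylength="128"
      		sym_algorithm="AES/ECB/PKCS5Padding" asym_keylength="2048"
      		asym_algorithm="RSA" change_key_on_leave="false" />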

      So this looks like a problem with the AUTH + ASYM_ENCRYPT protocol combination: in some cases a node cannot rejoin the cluster.

          People

            rhn-engineering-bban Bela Ban
            brutallio Boris Sh (Inactive)