JGroups / JGRP-1532

Heartbeats not received in NIC teaming configuration after NIC switch


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • 3.3
    • 2.12.2
    • None
    • High

      We have no problems in a single-card configuration without NIC teaming.
      But when all machines have dual cards with NIC teaming enabled, we get "haven't received a heartbeat" failures.
      Using the Wireshark analyzer, we observed that as long as the heartbeat multicast packets stay on the same card there is no problem, but when they switch to the second card, failure detections appear in the log files.

      For example, the first heartbeat failures appear in the logs at 03:41:25 and continue until 05:03:20:
      2012-10-23 03:41:25.234 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 11061 ms, adding it to suspect list
      2012-10-23 03:41:25.234 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0), ctc804291084-11401(de9a6a421087)]
      2012-10-23 03:41:28.245 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 14072 ms, adding it to suspect list
      2012-10-23 03:41:28.245 [FINE] - FD_ALL: haven't received a heartbeat from ctc804291084-11401(de9a6a421087) for 12044 ms, adding it to suspect list
      2012-10-23 03:41:28.245 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0), ctc804291084-11401(de9a6a421087)]
      2012-10-23 03:41:31.255 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 17082 ms, adding it to suspect list
      2012-10-23 03:41:31.255 [FINE] - FD_ALL: haven't received a heartbeat from ctc804291084-11401(de9a6a421087) for 15054 ms, adding it to suspect list
      2012-10-23 03:41:31.255 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0), ctc804291084-11401(de9a6a421087)]
      2012-10-23 03:41:34.266 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 20093 ms, adding it to suspect list
      2012-10-23 03:41:34.266 [FINE] - FD_ALL: haven't received a heartbeat from ctc804291084-11401(de9a6a421087) for 18065 ms, adding it to suspect list
      2012-10-23 03:41:34.266 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0), ctc804291084-11401(de9a6a421087)]
      2012-10-23 03:41:37.277 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 23104 ms, adding it to suspect list
      2012-10-23 03:41:37.277 [FINE] - FD_ALL: haven't received a heartbeat from ctc804291084-11401(de9a6a421087) for 21076 ms, adding it to suspect list
      2012-10-23 03:41:37.277 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0), ctc804291084-11401(de9a6a421087)]
      2012-10-23 03:41:40.288 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 26115 ms, adding it to suspect list
      2012-10-23 03:41:40.288 [FINE] - FD_ALL: haven't received a heartbeat from ctc804291084-11401(de9a6a421087) for 24087 ms, adding it to suspect list
      ...
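To see how long each member stayed silent, the FD_ALL lines above can be summarized per member. The following is a minimal sketch and not part of JGroups: the class name `FdAllLogSummary` and the regular expression are assumptions derived only from the log format shown in this report.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a JGroups class): summarizes FD_ALL
// "haven't received a heartbeat" log lines per member.
final class FdAllLogSummary {
    // Pattern assumed from the log excerpt: member name, then elapsed milliseconds.
    private static final Pattern MISSED = Pattern.compile(
            "FD_ALL: haven't received a heartbeat from (\\S+) for (\\d+) ms");

    /** Returns the longest reported silence, in milliseconds, for each member. */
    static Map<String, Long> maxSilenceByMember(List<String> lines) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = MISSED.matcher(line);
            if (m.find()) {
                // Keep the largest value seen so far for this member.
                result.merge(m.group(1), Long.parseLong(m.group(2)), Long::max);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "2012-10-23 03:41:25.234 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 11061 ms, adding it to suspect list",
                "2012-10-23 03:41:28.245 [FINE] - FD_ALL: haven't received a heartbeat from ctc809091084-27510(5ae571864ef0) for 14072 ms, adding it to suspect list",
                "2012-10-23 03:41:28.245 [FINE] - FD_ALL: haven't received a heartbeat from ctc804291084-11401(de9a6a421087) for 12044 ms, adding it to suspect list");
        maxSilenceByMember(sample).forEach((member, ms) ->
                System.out.println(member + " silent for up to " + ms + " ms"));
    }
}
```

Lines that do not match the pattern, such as the "suspecting [...]" lines, are ignored.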

      The logs of card 1 during the period:
      ----------------------------------------------------
      2012-10-23 03:41:15.563 MULTICAST id=321 src=/10.120.180.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      ----------------------------------------------------
      2012-10-23 03:41:15.996 MULTICAST id=7481 src=/10.120.120.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=17da3e81-158b-4440-50c7-412aebce41e2(de9a6a421087) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      ----------------------------------------------------
      2012-10-23 04:25:49.221 MULTICAST id=2868 src=/10.120.180.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      Card 1 was in standby between 03:41:15 and 04:25:49.
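The length of the silent period on card 1 can be checked directly from the capture timestamps. This is a small sketch under the assumption that the two entries from 10.120.180.64 quoted above are consecutive in the card 1 capture; the class name `CaptureGap` is hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: computes the gap between two capture timestamps.
final class CaptureGap {
    // Timestamp layout as it appears in the capture excerpts of this report.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** Returns the time elapsed between two timestamps in the capture format. */
    static Duration between(String earlier, String later) {
        return Duration.between(LocalDateTime.parse(earlier, TS),
                                LocalDateTime.parse(later, TS));
    }

    public static void main(String[] args) {
        // The two heartbeats from 10.120.180.64 seen on card 1.
        Duration gap = between("2012-10-23 03:41:15.563", "2012-10-23 04:25:49.221");
        System.out.println("gap = " + gap.toMillis() + " ms");
    }
}
```

For these two timestamps the gap is 2673658 ms, roughly 44.5 minutes, compared with the 3-second heartbeat spacing visible on card 0.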

      The logs of card 0 during the period:
      ----------------------------------------------------
      2012-10-23 03:41:25.029 MULTICAST id=74b1 src=/10.120.120.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=17da3e81-158b-4440-50c7-412aebce41e2(de9a6a421087) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      ----------------------------------------------------
      2012-10-23 03:41:25.961 MULTICAST id=5adb src=/10.120.220.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      ----------------------------------------------------
      2012-10-23 03:41:26.874 MULTICAST id=5ae0 src=/10.120.220.64:45588 dest=/228.8.8.8:45588 (91 bytes)

      Msg1 src=f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0) dest=ALL
      flags=[OOB]
      headers=[
      PingHeader:[PING: type=GET_MBRS_REQ, cluster=REPL, view_id=[f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0)|2]]
      ]

      ----------------------------------------------------
      2012-10-23 03:41:27.607 MULTICAST id=362 src=/10.120.180.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      ----------------------------------------------------
      2012-10-23 03:41:28.040 MULTICAST id=74bf src=/10.120.120.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=17da3e81-158b-4440-50c7-412aebce41e2(de9a6a421087) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      ----------------------------------------------------
      2012-10-23 03:41:28.962 MULTICAST id=5ae8 src=/10.120.220.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      ----------------------------------------------------
      2012-10-23 03:41:30.617 MULTICAST id=36f src=/10.120.180.64:45588 dest=/228.8.8.8:45588 (47 bytes)

      Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
      flags=[OOB]
      headers=[
      HeartbeatHeader:heartbeat
      ]

      etc. Heartbeats were received every 3 seconds until 06:00.

      Both cards are configured with the same IP address (10.120.180.64), as is the virtual NIC (10.120.180.64).
      We tested this configuration with Mcast.exe without problems.
      Everything behaves as if JGroups (or Java) were "plugged" into card 1 only.
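To separate a JGroups problem from a Java or driver problem, the Mcast.exe test can be repeated from plain Java without JGroups. The sketch below is only an assumption about how such a probe could look: the class name `McastProbe` is hypothetical, and the defaults (group 228.8.8.8, port 45588, interface eth10) are taken from the configuration in this report. It uses only standard `java.net` calls.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.net.UnknownHostException;

// Hypothetical probe (not part of JGroups): joins a multicast group on one
// named interface and prints every datagram it receives.
final class McastProbe {
    /** True if the literal IP address is a multicast address. */
    static boolean isMulticast(String literalAddress) {
        try {
            return InetAddress.getByName(literalAddress).isMulticastAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("bad address: " + literalAddress, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Defaults assumed from the UDP settings in this report.
        String group = args.length > 0 ? args[0] : "228.8.8.8";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 45588;
        String nicName = args.length > 2 ? args[2] : "eth10";

        if (!isMulticast(group)) {
            throw new IllegalArgumentException(group + " is not a multicast address");
        }
        NetworkInterface nic = NetworkInterface.getByName(nicName);
        if (nic == null) {
            throw new IllegalArgumentException("no such interface: " + nicName);
        }

        try (MulticastSocket socket = new MulticastSocket(port)) {
            // Join the group explicitly on the teamed interface.
            socket.joinGroup(new InetSocketAddress(InetAddress.getByName(group), port), nic);
            byte[] buf = new byte[65535];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                System.out.println(System.currentTimeMillis() + " " + packet.getLength()
                        + " bytes from " + packet.getSocketAddress());
            }
        }
    }
}
```

If this probe also stops printing heartbeats once the team switches to the second card, the problem is likely below JGroups (the JVM, the teaming driver, or the switch); if it keeps printing, the JGroups binding settings are the more likely place to look.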

      JGroups was configured with these parameters:
      <?xml version="1.0" encoding="UTF-8"?>
      <config xmlns="urn:org:jgroups">
      <UDP bind_addr="10.120.180.64" bind_interface="eth10" bind_port="7800"
           diagnostics_addr="224.0.75.75" discard_incompatible_packets="true"
           enable_bundling="true" enable_diagnostics="true" ip_ttl="10" loopback="true"
           max_bundle_size="64K" max_bundle_timeout="30"
           mcast_group_addr="228.8.8.8" mcast_port="45588"
           mcast_recv_buf_size="25M" mcast_send_buf_size="640K"
           oob_thread_pool.enabled="true" oob_thread_pool.keep_alive_time="5000"
           oob_thread_pool.max_threads="8" oob_thread_pool.min_threads="1"
           oob_thread_pool.queue_enabled="false" oob_thread_pool.queue_max_size="100"
           oob_thread_pool.rejection_policy="Run"
           singleton_name="UDP" thread_naming_pattern="pl"
           thread_pool.enabled="true" thread_pool.keep_alive_time="5000"
           thread_pool.max_threads="8" thread_pool.min_threads="2"
           thread_pool.queue_enabled="false" thread_pool.queue_max_size="100"
           thread_pool.rejection_policy="Run"
           tos="8" ucast_recv_buf_size="20M" ucast_send_buf_size="640K"/>
      <PING num_initial_members="3" timeout="2000"/>
      <MERGE2 max_interval="30000" min_interval="10000"/>
      <FD_SOCK bind_addr="10.120.180.64" bind_interface="eth10"/>
      <FD_ALL/>
      <VERIFY_SUSPECT bind_addr="10.120.180.64" bind_interface="eth10" timeout="1500"/>
      <pbcast.NAKACK discard_delivered_msgs="false" exponential_backoff="150" gc_lag="0" retransmit_timeout="300,600,1200" use_mcast_xmit="true" use_stats_for_retransmission="false"/>
      <UNICAST timeout="300,600,1200"/>
      <pbcast.STABLE desired_avg_gossip="50000" max_bytes="4M" stability_delay="1000"/>
      <pbcast.GMS join_timeout="5000" print_local_addr="true" view_bundling="true"/>
      <UFC max_credits="2M" min_threshold="0.4"/>
      <MFC max_credits="2M" min_threshold="0.4"/>
      <FRAG2 frag_size="60K"/>
      <pbcast.STREAMING_STATE_TRANSFER bind_addr="10.120.180.64" bind_interface="eth10" bind_port="7810" socket_buffer_size="16384" use_default_transport="false"/>
      </config>

      Have you ever heard of problems with NIC teaming?

      Thanks.
      Pascal BROUWET

            Assignee: Bela Ban (rhn-engineering-bban)
            Reporter: Pascal Brouwet (pbrouwet, inactive)