RHEL-28749

pcsd processes are not terminated with SIGTERM

    • Bug
    • Resolution: Unresolved
    • Major
    • rhel-9.5
    • rhel-9.3.0
    • pcs
    • Major
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 13
    • 26
    • 8
    • False
    • Yes
    • Red Hat Enterprise Linux
    • Bug Fix
    • Cause (the user action or circumstances that trigger the bug):
      The default process creation method (fork) of the Python multiprocessing library can cause a deadlock during process termination.

      Consequence (what the user experience is when the bug occurs):
      The pcsd processes are stuck in a deadlock and are therefore not terminated correctly. They are killed after the systemd stop timeout, which is 90 seconds by default.

      Fix (what has changed to fix the bug; do not include overly technical details):
      Change the process creation method of the Python multiprocessing library to 'forkserver'.

      Result (what happens now that the patch is applied):
      No deadlock occurs when the pcsd processes are stopped, and pcsd stops correctly within a short time.
    • Proposed
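
The fix named in the release note above (switching the Python multiprocessing start method from fork to forkserver) can be sketched as follows. This is a minimal illustration only, not the actual pcs patch: the helper name get_worker_context and the pool size are hypothetical, and pcs may configure the start method in a different place or in a different way.

```python
import multiprocessing


def get_worker_context():
    """Return a multiprocessing context that starts workers via 'forkserver'.

    Hypothetical helper for illustration; not taken from the pcs sources.
    """
    # get_context() returns a context object and leaves the process-wide
    # default start method untouched, unlike set_start_method().
    return multiprocessing.get_context("forkserver")


if __name__ == "__main__":
    ctx = get_worker_context()
    # Workers are forked from a separate server process instead of from the
    # (possibly multi-threaded) parent, so they do not start with a copy of
    # locks that other parent threads happened to hold at fork time.
    with ctx.Pool(processes=2) as pool:
        print(pool.map(abs, [-1, 2, -3]))
```

Using a context object keeps the choice local to the code that creates the workers. Inheriting a lock held by another thread is a common reason why plain fork can deadlock; whether that is the exact mechanism in pcsd is not stated in this report.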

      What were you trying to do that didn't work?

      systemctl stop pcsd takes 90 seconds, and the following messages are shown:

      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: State 'stop-sigterm' timed out. Killing.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4423 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4426 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4440 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4427 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4435 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4468 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4484 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Killing process 4486 (pcsd) with signal SIGKILL.
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Main process exited, code=killed, status=9/KILL
      Feb 27 11:58:24 node1 systemd[1]: pcsd.service: Failed with result 'timeout'.

      When pcsd.service is being stopped, some pcsd processes are not terminated, as shown below:

      # ps -ef
        UID          PID    PPID  C STIME TTY          TIME CMD
        [...]
        root        1848       1  0 14:43 ?        00:00:50 /usr/bin/python3 -Es /usr/sbin/pcsd
        root        2565    1848  0 14:43 ?        00:00:15 /usr/bin/python3 -Es /usr/sbin/pcsd
        root        2578    1848  0 14:43 ?        00:00:00 /usr/bin/python3 -Es /usr/sbin/pcsd
        root        2580    1848  0 14:43 ?        00:00:00 [pcsd] <defunct>
        root        2581    1848  0 14:43 ?        00:00:00 [pcsd] <defunct>
        root        2583    1848  0 14:43 ?        00:00:00 [pcsd] <defunct>
        root        2584    1848  0 14:43 ?        00:00:00 [pcsd] <defunct>
        root        2585    1848  0 14:43 ?        00:00:00 [pcsd] <defunct>
        root        2588    1848  0 14:43 ?        00:00:00 [pcsd] <defunct>
        root        2590    1848  0 14:43 ?        00:00:00 [pcsd] <defunct>
        [...]
        root        9652    4487  0 17:24 pts/0    00:00:00 systemctl stop pcsd
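
To spot the same state on another node, the ps -ef listing above can be filtered for defunct pcsd children. The following is a rough sketch; the function name find_defunct is hypothetical, and it assumes the standard eight-column ps -ef layout (UID PID PPID C STIME TTY TIME CMD).

```python
def find_defunct(ps_lines, name="pcsd"):
    """Return (pid, ppid) tuples for defunct processes named `name`.

    Expects lines in `ps -ef` layout: UID PID PPID C STIME TTY TIME CMD.
    """
    zombies = []
    for line in ps_lines:
        # Split into at most 8 fields; CMD (the 8th) may contain spaces.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # blank lines and "[...]" markers
        pid, ppid, cmd = fields[1], fields[2], fields[7]
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # header line
        if "<defunct>" in cmd and f"[{name}]" in cmd:
            zombies.append((int(pid), int(ppid)))
    return zombies


if __name__ == "__main__":
    # Two lines copied from the listing in this report.
    sample = [
        "root 1848 1 0 14:43 ? 00:00:50 /usr/bin/python3 -Es /usr/sbin/pcsd",
        "root 2580 1848 0 14:43 ? 00:00:00 [pcsd] <defunct>",
    ]
    for pid, ppid in find_defunct(sample):
        print(f"defunct pcsd pid={pid} parent={ppid}")
```

On a live system, the input could come from the output of ps -ef split into lines.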

      Please provide the package NVR for which bug is seen:

      pcs-0.11.4-6.el9 or later

      How reproducible:

      Randomly. The problem does not occur every time pcsd is stopped.

      Steps to reproduce

      According to the user, the issue can be reproduced with the following command (however, the RH support team did not manage to reproduce it this way with the same pcs version):

      # while true; do date; time systemctl stop pcsd; systemctl start pcsd; echo; sleep 10; done
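
Because the problem occurs only randomly, it may help to let the loop above stop by itself once a slow stop is seen. The sketch below does this in Python. The function names are hypothetical, and the 60-second threshold is an arbitrary assumption chosen to sit below the 90-second systemd timeout mentioned in this report. It must run as root on a test node.

```python
import subprocess
import time


def timed_run(command):
    """Run `command` and return (exit_code, elapsed_seconds)."""
    started = time.monotonic()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.monotonic() - started


def find_slow_stop(iterations, threshold=60.0, pause=10.0, unit="pcsd"):
    """Stop and start `unit` repeatedly.

    Returns (iteration, elapsed_seconds) for the first stop that took longer
    than `threshold` seconds, or None if every stop was faster.
    """
    for i in range(1, iterations + 1):
        _, elapsed = timed_run(["systemctl", "stop", unit])
        timed_run(["systemctl", "start", unit])
        if elapsed > threshold:
            return i, elapsed
        time.sleep(pause)  # same pause as the shell loop
    return None
```

For example, find_slow_stop(100) would cycle the service up to 100 times and return as soon as one stop exceeds the threshold, so that the journal and process list can be inspected right away.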

      Expected results

      pcsd stops successfully, including its internal processes.

      Actual results

            Michal Pospisil (mpospisi@redhat.com)
            Pepa Zimek (rhn-support-pzimek1)
            Miroslav Lisik
            Cluster QE