RHEL / RHEL-6103

systemd deadlocks waiting for its child forever when receiving SIGQUIT signal

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • rhel-8.8.0
    • systemd
    • Major
    • sst_cs_plumbers
    • ssg_core_services
    • False

    Description

      Description of problem:

      A customer observed that systemd stopped reaping its children, so zombies accumulated.
      The vmcore showed PID 1 waiting forever for a child it had spawned (still named "systemd"), while that child was itself blocked writing to a pipe.

      We believe a SIGQUIT signal was sent to PID 1, because doing so leads to exactly the same observation:
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

      # kill -QUIT 1
      # cat /proc/1/stack
        [<0>] do_wait+0x165/0x2f0
        [<0>] kernel_waitid+0x118/0x180
        [<0>] __do_sys_waitid+0x120/0x130
        [<0>] do_syscall_64+0x5b/0x1b0
        [<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
      # cat /proc/2037/stack
        [<0>] pipe_wait+0x6c/0xc0
        [<0>] pipe_write+0x16c/0x3f0
        [<0>] new_sync_write+0x112/0x160
        [<0>] __kernel_write+0x4f/0x100
        [<0>] dump_emit+0x91/0xd0
        [<0>] elf_core_dump+0x890/0xa50
        [<0>] do_coredump+0x73f/0xf52
        [<0>] get_signal+0x14f/0x870
        [<0>] do_signal+0x36/0x690
        [<0>] exit_to_usermode_loop+0x89/0x100
        [<0>] do_syscall_64+0x19c/0x1b0
        [<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
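
      To check another system for the same state, the two stacks above can be compared mechanically. The following is only a sketch: the function names frame_symbols and looks_like_coredump_deadlock are hypothetical, and the heuristic (parent in do_wait, child in both pipe_write and do_coredump) is an assumption derived solely from the traces in this report. Reading /proc/<pid>/stack typically requires root, so the sketch takes the stack text as input rather than reading it itself.

```python
def frame_symbols(stack_text):
    """Return the function names found in /proc/<pid>/stack-style text."""
    symbols = []
    for line in stack_text.splitlines():
        line = line.strip()
        if not line.startswith("[<"):
            continue  # skip prompts, blank lines and anything that is not a frame
        # "[<0>] do_wait+0x165/0x2f0" -> "do_wait+0x165/0x2f0" -> "do_wait"
        symbol = line.split("]", 1)[1].strip()
        symbols.append(symbol.split("+", 1)[0])
    return symbols


def looks_like_coredump_deadlock(parent_stack, child_stack):
    """Heuristic (assumption): parent waits for a child, child dumps core into a pipe."""
    parent = frame_symbols(parent_stack)
    child = frame_symbols(child_stack)
    parent_waiting = "do_wait" in parent
    child_dumping = "pipe_write" in child and "do_coredump" in child
    return parent_waiting and child_dumping
```

      Passing the contents of /proc/1/stack and of the stuck child's stack file should return True for the traces shown in this report; a match is only a hint and does not prove the same root cause.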

      The root cause is that systemd, upon receiving SIGQUIT, dumps core.
      To do that, it spawns a child and then waits for it to terminate.
      The child sends itself the SIGQUIT signal.

      This causes the kernel to execute the core_pattern handler, create a pipe for the coredump, and start writing data to that pipe immediately.
      The core_pattern handler runs and sends to the /run/systemd/coredump Unix socket, from which the coredump service is supposed to read the coredump (the pipe created by the kernel).

      Normally systemd would spawn an instance of the systemd-coredump service, which in turn reads from the pipe, but it cannot do so because it is still waiting for its child to terminate.

      --> Deadlock
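
      The circular wait can be modelled in user space. The sketch below is not systemd code; it is a simplified model under stated assumptions: the parent stands in for PID 1, the child stands in for the process whose core is being written, and a plain pipe stands in for the kernel's coredump pipe. The function name demo_unread_pipe_deadlock and its parameters are hypothetical. Unlike PID 1, the parent here waits with a timeout so that the demonstration terminates, and it then drains the pipe itself to show that the child exits as soon as somebody reads.

```python
import os
import time


def demo_unread_pipe_deadlock(payload_size=4 * 1024 * 1024, wait_seconds=0.5):
    """Return (stuck, exit_code): whether the child was still blocked while
    nobody read the pipe, and its exit code once the pipe was drained."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: write more than the pipe can buffer; blocks until a reader drains it.
        os.close(read_fd)
        try:
            os.write(write_fd, b"x" * payload_size)
        finally:
            os._exit(0)

    # Parent: the only process that could get the data read, but it just waits.
    os.close(write_fd)
    stuck = True
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            stuck = False  # child finished without any reader (not expected)
            break
        time.sleep(0.01)

    if stuck:
        # Break the cycle: once the pipe is read, the child can finish and be reaped.
        while os.read(read_fd, 65536):
            pass
        _, status = os.waitpid(pid, 0)
    os.close(read_fd)
    return stuck, os.WEXITSTATUS(status)
```

      In the reported bug there is no timeout and no fallback reader: PID 1 blocks in waitid() and is therefore never able to start the service that would read from the pipe, so neither side can make progress.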

      Version-Release number of selected component (if applicable):

      systemd-239

      How reproducible:

      Always

      Steps to Reproduce:
      1. Send SIGQUIT to PID 1

      Actual results:

      Deadlock of PID 1 and its child

      Expected results:

      PID 1 functional

      Additional info:

      Because PID 1 is critical for RHEL systems, it has to be extremely robust.

          People

            msekleta@redhat.com Michal Sekletar
            rhn-support-rmetrich Renaud Metrich
            Michal Sekletar
            Frantisek Sumsal