RHOAIENG-5314

Pipeline Server fails to deploy due to network policies


    • 1
    • False
    • None
    • False
    • Release Notes
    • No
    • Data science pipeline server fails to deploy in fresh cluster due to network policies
      When you create a data science pipeline server on a fresh cluster, the user interface remains in a loading state and the pipeline server does not start. A “Pipeline server failed” error message might be displayed.
      Workaround:
      1. Log in to the OpenShift Container Platform web console as a cluster administrator.
      2. Click Networking > NetworkPolicies.
      3. Click the Project list and select your project.
      4. Click the Create NetworkPolicy button.
      5. For Configure via, select YAML view and define the network policy as shown [see comments].
      6. Click Create.
    • Known Issue
    • Done
    • No
    • RHOAI DSP Sprint 1, RHOAI DSP Sprint 2
    • Testable

      Today, while working with rhn-gpte-dtorresf in a customer environment, we ran into the exact same symptoms as those described in this previous Jira: https://issues.redhat.com/browse/RHOAIENG-2099

      In a customer's disconnected RHOAI 2.8.0 environment, the pipeline server would not get created.

      We observed that the MariaDB pod started fine, and fairly fast too, but no other pipeline pod was created.

      When we looked at the events of the DSPA, we could see a successful Object Storage check, but a failed database check. 

       

      Now, we have seen this happen intermittently in the past (https://issues.redhat.com/browse/RHOAIENG-2099), but in those cases, 99% of the time, deleting and re-creating the pipeline server would fix everything on the second attempt.

      Not in this scenario. Deleting and re-creating got us the exact same result. 

      The only thing that helped here was to disable the Health Check for the database. 

      (set spec.database.disableHealthCheck = true) 

      Doing so immediately fixed the issue, got the other pods to start and the DS pipeline server to work. 
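
As a sketch of the change described above, the setting can be written as a JSON merge patch against the DSPA resource. Only the field path spec.database.disableHealthCheck comes from this report; the helper name and the idea of applying it as a merge patch are assumptions for illustration.

```python
import json


def disable_db_health_check_patch() -> dict:
    """Build a merge patch that sets spec.database.disableHealthCheck to true.

    The field path is taken from the report; everything else here is a
    hypothetical helper, not part of the product.
    """
    return {"spec": {"database": {"disableHealthCheck": True}}}


if __name__ == "__main__":
    # Print the patch as compact JSON so it can be pasted into a patch command.
    print(json.dumps(disable_db_health_check_patch()))
```

The printed JSON could then be applied with a merge-type patch (for example via `oc patch ... --type merge -p`), with the resource name and namespace filled in for the environment.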

       

      We tried deleting and re-creating 3 times in a row and the behavior was always the same. 

      The customer will likely soon be opening a case about this, at which time we'll link to this Jira.  

       

      I'd like to better understand the nature of the health check for the database, and what could be causing this.

      So far, my only theory is that the health check is a network connection, initiated from another namespace, and that it can't reach the mariadb service inside the project, potentially because the customer has network policies that are more restrictive than what we usually see. (Just a theory; by all means, poke holes in it.)

      If there is a manual equivalent to the health check, it would be worth testing it in the customer's environment. The MariaDB pod was fully running (although we never checked its logs).
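
One possible manual test, assuming the theory above is right and the health check boils down to a network connection to the database service, is a plain TCP connect attempt. This is only a sketch: the function name is hypothetical, the host and port shown in the comment are placeholders (3306 is the MariaDB default), and the actual health check may do more than open a socket.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (hypothetical service name; replace with the real one):
#   can_connect("mariadb-<dspa-name>.<project>.svc.cluster.local", 3306)
```

To say anything about network policies, the check would need to run from a pod in the namespace where the health check originates, not from inside the project itself. A False result from there, combined with True from inside the project, would be consistent with the network-policy theory.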

       

       

            vmudadla@redhat.com Vani Haripriya Mudadla
            egranger@redhat.com Erwan Granger
            Jorge Garcia Oncins Jorge Garcia Oncins
            RHOAI Data Science Pipelines