  1. Maistra
  2. MAISTRA-862

Galley can drop watches on Istio CRs

Details

    • Bug
    • Resolution: Done
    • Major
    • maistra-1.0.2
    • maistra-1.0.0
    • None
    • None
    • MAISTRA 1.0.2

    Description

      Originally reported by maschmid@redhat.com as a comment on MAISTRA-833:

      On my cluster I am reproducing a similar problem,

      I also get a 503 from the ingressgateway; in my case the ingressgateway logs contain "unknown cluster outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local":

      [2019-08-28 11:53:39.001][27][debug][router] [external/envoy/source/common/router/router.cc:308] [C9047][S6470669494928115903] unknown cluster 'outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local'
      [2019-08-28 11:53:39.001][27][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
      [2019-08-28 11:53:39.001][27][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1322] [C9047][S6470669494928115903] encoding headers via codec (end_stream=true):
      ':status', '503'
      'date', 'Wed, 28 Aug 2019 11:53:38 GMT'
      'server', 'istio-envoy'
      

      Note that autoscale-go-ljzb2 is indeed something that no longer exists. The service is currently named autoscale-go-2zjgp:

       oc get services -n myproject
      NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP                                           PORT(S)             AGE
      autoscale-go               ExternalName   <none>           istio-ingressgateway.istio-system.svc.cluster.local   <none>              21m
      autoscale-go-2zjgp         ClusterIP      172.30.252.247   <none>                                                80/TCP              22m
      autoscale-go-2zjgp-dpghl   ClusterIP      172.30.110.171   <none>                                                9090/TCP,9091/TCP   22m
      autoscale-go-2zjgp-hv68c   ClusterIP      172.30.23.93     <none>    
      

      Grepping the ingressgateway config dump shows that the config contains both the new and some old version [...]
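The comparison described above can be sketched as a small script. This is a minimal illustration, not a tool used in the investigation: it assumes the route cluster names have already been extracted from the config dump into a list, and the helper names `parse_cluster_name` and `stale_route_clusters` are hypothetical. The first cluster name is taken from the log line above; the second is an assumed name for the current revision, built from the service list.

```python
def parse_cluster_name(name):
    """Split a cluster name of the form 'outbound|80||host' into its parts."""
    direction, port, subset, host = name.split("|", 3)
    return {"direction": direction, "port": int(port), "subset": subset, "host": host}


def stale_route_clusters(route_clusters, live_services, namespace, domain="svc.cluster.local"):
    """Return route cluster names whose host is not a live service in `namespace`."""
    suffix = f".{namespace}.{domain}"
    live_hosts = {f"{svc}{suffix}" for svc in live_services}
    stale = []
    for name in route_clusters:
        host = parse_cluster_name(name)["host"]
        # Only judge hosts in the namespace we have a service list for.
        if host.endswith(suffix) and host not in live_hosts:
            stale.append(name)
    return stale


# Cluster names as they might appear in the routes of the config dump.
route_clusters = [
    "outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local",  # from the log line
    "outbound|80||autoscale-go-2zjgp.myproject.svc.cluster.local",  # assumed current name
]
# Service names from the `oc get services -n myproject` output.
live_services = [
    "autoscale-go",
    "autoscale-go-2zjgp",
    "autoscale-go-2zjgp-dpghl",
    "autoscale-go-2zjgp-hv68c",
]

for name in stale_route_clusters(route_clusters, live_services, "myproject"):
    print("stale route target:", name)
```

Run against the names above, only the autoscale-go-ljzb2 entry is reported, which matches the "unknown cluster" message in the ingressgateway log.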

      Marek's logs are attached to this issue. I checked the operator logs on the cluster itself, there was nothing out of the ordinary.

      I investigated a bit and found the following additional information:

      • Istio configuration was accurate
      • the route information in the Envoy config was outdated and still pointed to the old, nonexistent service
      • cluster information was up to date
      • Pilot was still pushing config updates
      • Galley reported far fewer "Underlying Result Channel closed" messages (one every few seconds before the incident, one every few minutes afterwards)
      • Just prior to the error, two namespaces were deleted in quick succession
      • a test VirtualService I created never ended up in the envoy configuration
      • restarting Galley fixed the issue

      Current theory:

      • Just prior to the error, two namespaces were deleted in quick succession
      • Galley started re-creating watches after the first deletion had propagated
      • As the second namespace was already being terminated, it failed to create a bunch of watches
      • Apparently, it was never able to recover and recreate the missing watches
      • As a consequence, Pilot would never receive resource updates for Istio objects
      • The ingress gateway would never receive proper route information. The clusters were still created, though, because they are built from Pilot's own Kubernetes watches
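To make the suspected failure mode concrete, here is a toy model of the theory above. It is not Galley's code and makes no claim about how the kube source or MultiNamespaceListerWatcher is implemented; `WatchManager`, `flaky_create` and the namespace names are all hypothetical. The point it illustrates is that a watch whose creation fails has to be remembered and retried, otherwise the component silently stops receiving updates for that namespace.

```python
class WatchManager:
    """Toy model of a component that keeps one watch per namespace."""

    def __init__(self, create_watch):
        self._create_watch = create_watch  # callable(namespace) -> handle, may raise
        self.active = {}       # namespace -> watch handle
        self.pending = set()   # namespaces whose watch still has to be created

    def recreate_all(self, namespaces):
        """Drop all watches and create them again, remembering any failures."""
        self.active.clear()
        self.pending = set(namespaces)
        return self.retry_pending()

    def retry_pending(self):
        """Try to create every pending watch; return the namespaces still pending."""
        for ns in sorted(self.pending):  # sorted() copies, so discarding below is safe
            try:
                self.active[ns] = self._create_watch(ns)
            except RuntimeError:
                continue  # stays in `pending`, so a later pass retries it
            self.pending.discard(ns)
        return set(self.pending)


# Hypothetical failure injection: the first attempt for "ns-b" fails,
# standing in for a watch that cannot be created during a namespace deletion.
failures = {"ns-b": 1}


def flaky_create(ns):
    if failures.get(ns, 0) > 0:
        failures[ns] -= 1
        raise RuntimeError(f"cannot create watch for {ns}")
    return f"watch:{ns}"


mgr = WatchManager(flaky_create)
print("still pending after first pass:", mgr.recreate_all(["ns-a", "ns-b"]))
print("still pending after retry:", mgr.retry_pending())
print("active watches:", sorted(mgr.active))
```

Without the `pending` set, the failed watch for "ns-b" would simply be missing after the first pass, which is the behaviour the theory attributes to Galley. A real implementation would also need backoff between retries and a way to stop retrying namespaces that have been deleted for good.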

      I'm currently trying to reproduce this to find out whether it can be traced to, e.g., the galley/pkg/source/kube or MultiNamespaceListerWatcher code.

            People

              dgrimm@redhat.com Daniel Grimm