Apigee X cross-region failover setting in TargetEndpoints

We are using the system.region.name variable to identify the region to which the GCLB sent the traffic for our MIGs / Apigee X runtime. Using that, we keep the traffic in the same region all the way to the backend.
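
For context, the ProxyEndpoint routing looks roughly like this; the region value and RouteRule names below are illustrative, not our exact config:

<!-- route to the TargetEndpoint of the region the runtime is running in -->
<RouteRule name="route-central">
  <Condition>system.region.name = "us-central1"</Condition>
  <TargetEndpoint>central</TargetEndpoint>
</RouteRule>
<!-- default: everything else goes to the east TargetEndpoint -->
<RouteRule name="route-east">
  <TargetEndpoint>east</TargetEndpoint>
</RouteRule>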

Recently, when we had an issue with the backend in one region (pod restart), the Apigee X runtime was not able to move the traffic to the other region. Apigee was throwing 503 Service Unavailable errors passed through from the backend L4 ILB / GKE cluster, since the IsFallback option does not take the HTTP response code into account.

Would the config below, with a HealthMonitor, help in this case?

 

<TargetEndpoint name="central">
  <Description/>
  <FaultRules/>
  <PreFlow name="PreFlow">
    <Request/>
    <Response/>
  </PreFlow>
  <PostFlow name="PostFlow">
    <Request/>
    <Response/>
  </PostFlow>
  <Flows/>
  <HTTPTargetConnection>
    <LoadBalancer>
      <Algorithm>RoundRobin</Algorithm>
      <Server name="central"/>
      <Server name="east">
        <IsFallback>true</IsFallback>
      </Server>
    </LoadBalancer>
    <HealthMonitor>
      <IsEnabled>true</IsEnabled>
      <IntervalInSec>5</IntervalInSec>
      <HTTPMonitor>
        <Request>
          <ConnectTimeoutInSec>10</ConnectTimeoutInSec>
          <SocketReadTimeoutInSec>30</SocketReadTimeoutInSec>
          <Port>443</Port>
          <Verb>GET</Verb>
          <Path>/health</Path>
        </Request>
        <SuccessResponse>
          <ResponseCode>200</ResponseCode>
        </SuccessResponse>
      </HTTPMonitor>
    </HealthMonitor>
  </HTTPTargetConnection>
</TargetEndpoint>

 

 

The above is for one region; a similar config exists for the second region, with the primary region as its fallback.

Also, is there any log in Apigee X that shows a fallback occurred?


You asked about HealthMonitor, but I think the relevant portion of the configuration is the LoadBalancer.

    <LoadBalancer>
      <Algorithm>RoundRobin</Algorithm>
      <Server name="central"/>
      <Server name="east">
        <IsFallback>true</IsFallback>
      </Server>
    </LoadBalancer>

This is independent of the HealthMonitor. This load balancer should do what you want: route to central when it's available, and to east when central is not available.

Have you tried it?  Does it behave the way you're expecting?

We are currently using IsFallback, but that option only helps in the case of I/O exceptions or timeouts from the backend, which is not what happens when a backend pod is restarting or there is a network issue in one region, so it is not helping our use case right now. Based on the documentation, we were wondering whether a HealthMonitor would route traffic to the other region when the health check HTTP response code is not 200 because the backend is down.

Yes I understand.

There's something I forgot to suggest previously: ServerUnhealthyResponse. Have you tried this? It looks like this:

      <LoadBalancer>
        <Algorithm>RoundRobin</Algorithm>
        <Server name="central" />
        <Server name="east">
          <IsFallback>true</IsFallback>
        </Server>
        <MaxFailures>2</MaxFailures> <!-- as you wish -->
        <ServerUnhealthyResponse>
            <ResponseCode>500</ResponseCode>
            <ResponseCode>502</ResponseCode>
            <ResponseCode>503</ResponseCode>
        </ServerUnhealthyResponse>
      </LoadBalancer>
      ...

In this case, the LoadBalancer counts a response code of 500, 502, or 503 as a failure and takes the server out of rotation once MaxFailures is reached. You can add more codes there as you like. The LoadBalancer will not re-add the unhealthy server automatically! For that you need a HealthMonitor.

 

The HealthMonitor will check all servers, with the HTTP request you configure.  This includes the IsFallback server. 

When the HealthMonitor gets an "I am healthy" response from the target, it will put that server back into rotation. At that point, your "RoundRobin" rotation will select the "central" server unless and until it responds with a 5xx code, or until the HealthMonitor gets an unhealthy response in the future.
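
Putting those pieces together, a minimal sketch of the HTTPTargetConnection might look like the following; the server names, /health path, and timing values are just carried over from your earlier config, and MaxFailures plus the response-code list are placeholders to tune:

<HTTPTargetConnection>
  <LoadBalancer>
    <Algorithm>RoundRobin</Algorithm>
    <Server name="central"/>
    <Server name="east">
      <IsFallback>true</IsFallback>
    </Server>
    <MaxFailures>2</MaxFailures>
    <!-- these codes count as failures and take the server out of rotation -->
    <ServerUnhealthyResponse>
      <ResponseCode>500</ResponseCode>
      <ResponseCode>502</ResponseCode>
      <ResponseCode>503</ResponseCode>
    </ServerUnhealthyResponse>
  </LoadBalancer>
  <!-- the HealthMonitor is what puts an unhealthy server back into rotation -->
  <HealthMonitor>
    <IsEnabled>true</IsEnabled>
    <IntervalInSec>5</IntervalInSec>
    <HTTPMonitor>
      <Request>
        <ConnectTimeoutInSec>10</ConnectTimeoutInSec>
        <SocketReadTimeoutInSec>30</SocketReadTimeoutInSec>
        <Port>443</Port>
        <Verb>GET</Verb>
        <Path>/health</Path>
      </Request>
      <SuccessResponse>
        <ResponseCode>200</ResponseCode>
      </SuccessResponse>
    </HTTPMonitor>
  </HealthMonitor>
</HTTPTargetConnection>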

 

The only problem with the above solution is MaxFailures. Over what time interval does the MaxFailures count reset? We have regular 500s due to bad data or application exceptions, which occur in small numbers throughout the day, and that should not cause a region failover from the Apigee X end. Also, what is the variable that holds the backend HTTP response code in Apigee X? We are seeing only 500 being logged in our Splunk, so we need to correct it if we are using the wrong variable.

The count resets when the server is placed back into rotation, e.g. when the HealthMonitor determines that the server is healthy.

If there is a way for you to distinguish "server is not available" from "transient server error due to application exception" by status code, that would be good. For example, 503 for server unavailable versus 500 for an application exception. In that case, just omit 500 from the list you specify in ServerUnhealthyResponse.
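
For example, something like this, assuming the backend reliably uses 502/503 when it is actually unavailable and 500 only for application exceptions:

<ServerUnhealthyResponse>
  <!-- 500 intentionally omitted so ordinary application errors don't trigger failover -->
  <ResponseCode>502</ResponseCode>
  <ResponseCode>503</ResponseCode>
</ServerUnhealthyResponse>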

The variable that holds the status code is response.status.code.
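
If you want to log both variables side by side for comparison, a rough sketch of a MessageLogging policy would be something like the following; the host and port are placeholders, not your real Splunk collector, and attaching it in the PostClientFlow ensures the response/error variables are already populated:

<MessageLogging name="ML-LogStatusCodes">
  <Syslog>
    <!-- log both variables so you can compare what Splunk receives -->
    <Message>status={response.status.code} errorStatus={error.status.code} target={target.url}</Message>
    <Host>splunk.example.com</Host>
    <Port>514</Port>
    <Protocol>TCP</Protocol>
  </Syslog>
</MessageLogging>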

Based on our logging, we see response.status.code as always 500 for a non-success response from the backend, irrespective of the actual backend HTTP response, but error.status.code seems to show the codes correctly (400, 500, 502 & 503) in Apigee X. Is this a bug in the product as of now?

I think that behavior is expected. The Apigee runtime sets error.status.code to the status code that came back from the target. Apigee then enters "fault status" and invokes your FaultRules. If you have no FaultRules handling this condition, then I believe Apigee may just return a 500 error.

So if you use FaultRules, you can allow the response.status.code to coincide with the error.status.code, or ... do whatever you like with it.
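
As a rough sketch of that idea (the policy name is made up, and you may need to adapt where it attaches in your proxy), a DefaultFaultRule with an AssignMessage step could copy error.status.code onto the response that goes back to the client:

<!-- in the ProxyEndpoint or TargetEndpoint, alongside FaultRules -->
<DefaultFaultRule name="propagate-target-status">
  <Step>
    <Name>AM-SetErrorStatus</Name>
  </Step>
  <AlwaysEnforce>true</AlwaysEnforce>
</DefaultFaultRule>

<!-- AM-SetErrorStatus: set the outgoing status to whatever the target returned -->
<AssignMessage name="AM-SetErrorStatus">
  <Set>
    <StatusCode>{error.status.code}</StatusCode>
  </Set>
  <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</AssignMessage>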