We have identified an issue with the load balancer and HealthMonitor configuration on Apigee proxies, specifically in how MaxFailures is interpreted. We have a hypothesis and would appreciate it if someone could validate our understanding.

Our Code Snippet

<HTTPTargetConnection>
  <LoadBalancer>
    <Server name="lisa_server"/>
    <MaxFailures>5</MaxFailures>
  </LoadBalancer>
  <Path>/</Path>
  <HealthMonitor>
    <IsEnabled>true</IsEnabled>
    <IntervalInSec>20</IntervalInSec>
    <HTTPMonitor>
      <Request>
        <ConnectTimeoutInSec>3</ConnectTimeoutInSec>
        <SocketReadTimeoutInSec>30</SocketReadTimeoutInSec>
        <Verb>GET</Verb>
        <Path>/health_check/</Path>
      </Request>
      <SuccessResponse>
        <ResponseCode>200</ResponseCode>
      </SuccessResponse>
    </HTTPMonitor>
  </HealthMonitor>
</HTTPTargetConnection>


Here is our hypothesis; we need your help proving it out:

The MaxFailures counter is managed by both the load balancer/target server and the HealthMonitor attached to the target server.

The load balancer/target server considers only an I/O error (5XX series) to be a failure. So every time a target server returns an HTTP code other than 5XX, it believes the target server is working correctly.

It also resets the number of consecutive failures to 0 every time it receives such a response. The HealthMonitor attached to the target server, on the other hand, recognizes the target server as healthy only when it receives a 200 response. Anything else is considered unhealthy.

We believe that when a target server is actively receiving traffic and returning a response other than 5XX, the health monitor is not able to take the target server out of rotation even though its health checks are failing.

When the target server is not actively receiving traffic and only the health monitor is tracking it, the health monitor is able to remove the target server from rotation when the health check fails.

Sequence of events as we see them:

We assume that a variable failureCount tracks the failures for a target server and that MaxFailures is 5.

  1. Health monitor pings target server. Health check fails; failureCount++; failureCount is less than MaxFailures. Do not remove target server from rotation.
  2. Health monitor pings target server. Health check fails; failureCount++; failureCount is less than MaxFailures. Do not remove target server from rotation.
  3. API call is made from an API client; response = 400; response is not an I/O error. Set failureCount = 0.
  4. Health monitor pings target server. Health check fails; failureCount++; failureCount is less than MaxFailures. Do not remove target server from rotation.
  5. Health monitor pings target server. Health check fails; failureCount++; failureCount is less than MaxFailures. Do not remove target server from rotation.
  6. Health monitor pings target server. Health check fails; failureCount++; failureCount is less than MaxFailures. Do not remove target server from rotation.
  7. API call is made from an API client; response = 400; response is not an I/O error. Set failureCount = 0.

As you can see from the above, as long as client traffic keeps resetting the counter, the target server will never be removed from rotation despite the failing health checks.

Have you seen this issue before? If yes, can you suggest how we can get around it?


@Ajay Naidu Can you confirm if this is on 17.09?

The behavior can also be observed on 17.09.

A new feature was provided that directly addresses this situation:
the <ServerUnhealthyResponse> element under <LoadBalancer>.

See docs here: https://docs.apigee.com/api-platform/deploy/load-balancing-across-backend-servers#settingloadbalance...

Minor challenge is to use an HTTP response code that is unique to health checks and not part of normal traffic (e.g. 420).
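
As a rough sketch of what that could look like with the snippet from the original post: the fragment below adds <ServerUnhealthyResponse> to the same <LoadBalancer>, so that responses with the listed code are counted as failures instead of resetting the counter. The choice of 420 is only the example code mentioned above, and it assumes the backend can be made to return that code when it is unhealthy; check the linked docs for the exact behavior on your release.

```xml
<HTTPTargetConnection>
  <LoadBalancer>
    <Server name="lisa_server"/>
    <MaxFailures>5</MaxFailures>
    <!-- Responses with these codes count toward MaxFailures.
         420 is an assumed, illustrative code that normal traffic never returns. -->
    <ServerUnhealthyResponse>
      <ResponseCode>420</ResponseCode>
    </ServerUnhealthyResponse>
  </LoadBalancer>
  <Path>/</Path>
  <!-- HealthMonitor block unchanged from the original post -->
</HTTPTargetConnection>
```

Several <ResponseCode> elements can be listed if more than one code should mark the server as unhealthy.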

Or maybe 418?

So, for the people who were impacted by this, how does it affect the SLA?

I don't know the answer to that. Can you contact your sales team to inquire?