MaxFailures configuration in load balancing target not working as expected

I have two targets, Target-A and Target-B, configured with the Weighted algorithm. I brought Target-B down. Based on my configuration, I expected the first 5 calls to Target-B to fail, and only from the 6th call onward for traffic to go to Target-A. But as soon as Target-B goes down, the very next request and all requests after it go only to Target-A. Is this expected? The code snippet is as follows:

    <LoadBalancer>
        <Algorithm>Weighted</Algorithm>
        <Server name="Target-A">
            <Weight>1</Weight>
        </Server>
        <Server name="Target-B">
            <Weight>2</Weight>
        </Server>
        <MaxFailures>5</MaxFailures>
    </LoadBalancer>

@Dino Any insights please?

Checking...

That is indeed strange behavior. Do you have a HealthMonitor configured in the TargetEndpoint?

Yes, it is present in the target, and I have also configured the health monitor in the HTTPTargetConnection. The behavior is still the same.

Could you post the HealthMonitor configuration as well? I'm wondering if the health checks could be bringing Target-B down before the next call. What is the interval between health checks?
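
To be concrete, this is the shape I'm asking about (a minimal sketch; the interval, path, and response code here are placeholders, not your values):

    <HealthMonitor>
        <IsEnabled>true</IsEnabled>
        <IntervalInSec>10</IntervalInSec>
        <HTTPMonitor>
            <Request>
                <Verb>GET</Verb>
                <Path>/status</Path>
            </Request>
            <SuccessResponse>
                <ResponseCode>200</ResponseCode>
            </SuccessResponse>
        </HTTPMonitor>
    </HealthMonitor>

A short IntervalInSec could mark the server down between your test calls, which would explain never seeing the failures.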

I would also love an answer on this. I'm seeing the exact same behavior in one of our environments. We've got multiple clients actively using the Apigee LB feature, so we're not completely unfamiliar with how it should behave. With MaxFailures set > 0 and NO HEALTH MONITOR, a downed target should produce (# of MPs) failures, THEN drop out of rotation so traffic goes to the available target. I'm also opening a case for this same issue.

Interesting that it happens even without a health monitor. Let me just clarify the behavior: the target goes down, and then zero requests (not even the first one) attempt to go to the downed target? With retries enabled, even if a request does hit the downed target, the response returned from the proxy should still be a success, because it will retry against the other healthy targets.

ZERO requests that we can see to the downed target. We don't have retries enabled or a health monitor. What's even MORE perplexing is that once Target_1 is brought back up, it's immediately thrown back into the rotation. This (as per my understanding) should only be possible when the Health Monitor is enabled. With no health monitor, a redeployment should be required to add the node back into rotation.

Yeah, this is very strange. There really isn't anything else that would be monitoring the health of the targets. @wesley.scott5, would you mind posting your LoadBalancer config?

Also, it would be good to know if you are using a self-managed Apigee installation (and if so, what version) or Apigee cloud.

Meena Gupta-Iwasaki - Config pasted below.

Dino - This is happening on SaaS. I created a case with details and in-depth explanation of the scenarios - 1473072

    <HTTPTargetConnection>
        <LoadBalancer>
            <Algorithm>RoundRobin</Algorithm>
            <Server name="Njs-mock-target1"/>
            <Server name="Njs-mock-target2"/>
            <MaxFailures>1</MaxFailures>
            <ServerUnhealthyResponse>400</ServerUnhealthyResponse>
        </LoadBalancer>
        <!--
        <HealthMonitor>
            <IsEnabled>true</IsEnabled>
            <IntervalInSec>5</IntervalInSec>
            <HTTPMonitor>
                <Request>
                    <ConnectTimeoutInSec>60</ConnectTimeoutInSec>
                    <SocketReadTimeoutInSec>60</SocketReadTimeoutInSec>
                    <Verb>GET</Verb>
                    <Payload contentType="text/xml">
                        <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:sch="http://www.tibco.com/schemas/WebservicesGateway/SharedResources/Schema/Canonical/Schema.xsd">
                            <soapenv:Header/>
                            <soapenv:Body>
                                <sch:healthCheck>TEST</sch:healthCheck>
                            </soapenv:Body>
                        </soapenv:Envelope>
                    </Payload>
                    <Path>/Implementation/WebServiceGateway/WebServiceGatewayEndpoint</Path>
                    <Header name="SOAPAction">/Implementation/WholesaleServicesV2Service.serviceagent/HealthCheck</Header>
                    <Header name="contentType">text/xml</Header>
                </Request>
                <SuccessResponse>
                    <ResponseCode>200</ResponseCode>
                </SuccessResponse>
            </HTTPMonitor>
        </HealthMonitor>
        -->
        <Path>/v1/mock-target</Path>
    </HTTPTargetConnection>
    </TargetEndpoint>

A few comments. It seems like retries are actually enabled (they're enabled by default). How are you measuring whether a request hits the downed target server? Is it based on the returned response?
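
One way to measure it directly, rather than inferring from the response: attach an AssignMessage policy to the target response flow that echoes which target served the call (a sketch; the policy name and header name are mine, not anything standard):

    <AssignMessage name="AM-TagTarget">
        <Set>
            <Headers>
                <!-- target.url is populated on the target request/response path -->
                <Header name="X-Served-By">{target.url}</Header>
            </Headers>
        </Set>
        <AssignTo createNew="false" transport="http" type="response"/>
    </AssignMessage>

Then each client response carries the URL of the target that actually handled it, retries included.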

Also, it seems the docs are currently incorrect. The correct way to specify ServerUnhealthyResponse is the same as for SuccessResponse, like so:

    <ServerUnhealthyResponse>
        <ResponseCode>400</ResponseCode>
    </ServerUnhealthyResponse>
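
Applied to the config you pasted (same placement, corrected format), that section of the LoadBalancer would become:

    <LoadBalancer>
        <Algorithm>RoundRobin</Algorithm>
        <Server name="Njs-mock-target1"/>
        <Server name="Njs-mock-target2"/>
        <MaxFailures>1</MaxFailures>
        <ServerUnhealthyResponse>
            <ResponseCode>400</ResponseCode>
        </ServerUnhealthyResponse>
    </LoadBalancer>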

ps: We have filed a doc bug.

How can you tell retries are enabled by default? I don't see that in the documentation either. As for the downed target, we are just watching traces and seeing that once it's brought down, no calls go to that instance (at least not in the trace). Once the target comes back up, it starts taking traffic again. That's weird, right?

@Dino-at-Google - Hey Dino, not EXACTLY the subject of this thread, but related to LB behavior and documentation. I filed another case (1473715) for <IsFallback>true</IsFallback>. Basically traffic is sent to all 3 targets even though target_3 is defined as fallback. fyi.
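
For anyone following along, the config in that case is shaped roughly like this (a sketch; the server names and MaxFailures value are illustrative), where the expectation is that target_3 takes traffic only after the other two are marked down:

    <LoadBalancer>
        <Algorithm>RoundRobin</Algorithm>
        <Server name="target_1"/>
        <Server name="target_2"/>
        <Server name="target_3">
            <IsFallback>true</IsFallback>
        </Server>
        <MaxFailures>5</MaxFailures>
    </LoadBalancer>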

Is it possible that some of this functionality has changed or is no longer supported? We have users applying this in production, and lately it seems like I'm fielding many inconsistent-behavior questions.