Fault rule conditional policies failing to execute

Our proxy fault rules (simplified) look like this:

    <FaultRules>
        <FaultRule name="api_key_faults">
            <Step>
                <Name>fault_invalid_key</Name>
                <Condition>(fault.name = "InvalidApiKey")</Condition>
            </Step>
            <Step>
                <Name>fault_missing_key</Name>
                <Condition>(fault.name = "FailedToResolveAPIKey")</Condition>
            </Step>
        </FaultRule>
        <FaultRule name="no_routes_matched_rule">
            <Step>
                <Name>fault_no_route</Name>
            </Step>
            <Condition>(fault.name = "NoRoutesMatched" || fault.name = "RouteFailed" || fault.name = "NotFound")</Condition>
        </FaultRule>
        <FaultRule name="quota_rule">
            <Step>
                <Name>fault_developer_quota_exceeded</Name>
                <Condition>(ratelimit.check_developer_quota.exceed.count > 0)</Condition>
            </Step>
            <Step>
                <Name>fault_global_quota_exceeded</Name>
                <Condition>(ratelimit.check_global_quota.exceed.count > 0)</Condition>
            </Step>
            <Condition>(fault.name = "QuotaViolation")</Condition>
        </FaultRule>
        <FaultRule name="spike_arrest">
            <Step>
                <Name>fault_spike_arrest</Name>
                <Condition>spikearrest.spike_protection.failed = "true"</Condition>
            </Step>
            <Step>
                <Name>fault_revalidate_spike_arrest</Name>
                <Condition>(spikearrest.revalidate_spike_protection.failed = "true")</Condition>
            </Step>
            <Condition>(fault.name = "SpikeArrestViolation")</Condition>
        </FaultRule>
    </FaultRules>
    <DefaultFaultRule name="default_fault_rule">
        <Step>
            <Name>fault_generic_error</Name>
            <Condition>(wf.fault_assigned != "true")</Condition>
        </Step>
        <Step>
            <Name>calculate_response_headers</Name>
        </Step>
        <AlwaysEnforce>true</AlwaysEnforce>
    </DefaultFaultRule>

Note that every fault_xxxx policy sets (wf.fault_assigned = "true") so if any error has been assigned the fault_generic_error policy will not execute. This works.
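
For context, a fault policy following this convention might look roughly like the sketch below. It is an illustrative AssignMessage policy, not our actual code: the status code, reason phrase, and payload text are assumptions, and only the policy name and the wf.fault_assigned variable come from the rules above.

```xml
<AssignMessage name="fault_invalid_key">
    <Set>
        <!-- Assumed values, for illustration only -->
        <StatusCode>401</StatusCode>
        <ReasonPhrase>Unauthorized</ReasonPhrase>
        <Payload contentType="text/plain">Invalid API key</Payload>
    </Set>
    <!-- Flag that a specific fault response has been assigned, so the
         fault_generic_error step in the DefaultFaultRule is skipped -->
    <AssignVariable>
        <Name>wf.fault_assigned</Name>
        <Value>true</Value>
    </AssignVariable>
    <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</AssignMessage>
```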

As described here, conditions can go inside or outside the Step tag. I'm seeing a couple of issues:

If I move the CONDITION tag from outside to inside the STEP tag on "no_routes_matched_rule", then *all* error fault conditions fail and the fault_generic_error executes. Since there is only one step, the execution behaviour would seem that it should not change, not to mention fail.

The same failure happens if I remove from "spike_arrest" the outer condition (fault.name = "SpikeArrestViolation") that is actually not needed.

Additionally, in its existing form, "spike_arrest" never executes "fault_revalidate_spike_arrest" (at least), even though the condition appears to be correct.

Note that the "quota_rule" above seems to execute as desired, so its basic structure of nested conditionals seems to work.

I suspect that this is a bug, otherwise, what is wrong here?

Thanks, George

> If I move the CONDITION tag from outside to inside the STEP tag on "no_routes_matched_rule", then *all* error fault conditions fail and the fault_generic_error executes. Since there is only one step, the execution behaviour would seem that it should not change, not to mention fail.

No, I think that's not quite true. The documentation describes how FaultRules work: the LAST FaultRule for which the outer Condition statement evaluates to true, is executed.

It isn't quite clear from that statement, but when you omit the outer Condition element, the evaluation result is implicitly true. In other words, moving the condition from outside to inside the Step element changes the semantics of the FaultRule: it means the FaultRule always applies, and no further (prior) FaultRules will be evaluated.

Let's take a step back: what are you really trying to do? I think what you want is ONE FaultRule to apply. But consider that the FaultRules are not like a chain of "if" statements. The Conditions are evaluated in reverse order, and the first FaultRule whose outer Condition is true or omitted is the one that executes. At that point, evaluation of Conditions ceases. It's similar to the rules regarding Flows, except the order is reversed.
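
To make that concrete, here is a sketch of the two alternative forms of the rule from the question (same rule and policy names). It reflects the behavior described in this thread for a ProxyEndpoint and has not been verified beyond that. The two fragments are alternatives, not meant to be deployed together.

```xml
<!-- Variant A: outer Condition. The rule is selected only for routing faults;
     for any other fault, evaluation moves on to the rules above it. -->
<FaultRule name="no_routes_matched_rule">
    <Step>
        <Name>fault_no_route</Name>
    </Step>
    <Condition>(fault.name = "NoRoutesMatched" || fault.name = "RouteFailed" || fault.name = "NotFound")</Condition>
</FaultRule>

<!-- Variant B: Condition moved inside the Step. The rule now has no outer
     Condition, so it is implicitly true and is selected for every fault that
     reaches it. The Step is skipped when its own Condition is false, and the
     rules above it (here api_key_faults) are never evaluated. -->
<FaultRule name="no_routes_matched_rule">
    <Step>
        <Name>fault_no_route</Name>
        <Condition>(fault.name = "NoRoutesMatched" || fault.name = "RouteFailed" || fault.name = "NotFound")</Condition>
    </Step>
</FaultRule>
```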

Helpful?

Thank you. It took a bit of poking alongside your comments to see what was happening. By happenstance of the code layout, the first fault rule, "api_key_faults", works because the later rules all happen to have outer conditionals. That makes "api_key_faults" the last fault rule evaluated; it is implicitly true, but that does no harm because there are no earlier rules for it to block.

Oh man, thanks for this.

I was racking my brain wondering what was going on. I inadvertently had a mix of inner and outer conditions. The FaultRules with outer conditions didn't match as expected, but when the next FaultRule with an inner condition (that also didn't match) was evaluated, trace showed the inner step was skipped and the next FaultRule wasn't even evaluated.

I'm keeping it simple, all my FaultRules have outer conditions.

Thanks!

Good, because they MUST, or the rule is implicitly true!

Wow, this is super confusing. Sorry for the hassle, @George Shaw. I'll update the docs (saw the note you left). Just want to double-check what you found to make sure I'm on the same page.

The outer conditions determine which fault rule gets executed. In your code block above:

  1. spike_arrest outer condition is evaluated first (bottom to top)
  2. quota_rule outer condition is next
  3. no_routes_matched_rule is next
  4. api_key_faults has no outer condition, so it's implicitly true, and it gets executed. Each *step* condition is now evaluated. Each one that's true gets executed (including any that don't have conditions).

So for fault rule design best practices, each should have an outer condition to make sure each gets evaluated (bottom to top). Those outer conditions should be more general, like, "I found a QuotaViolation, so I execute (if I'm the last one that's true). Now we'll check the step conditions to see what kind of specific quota violation it is. Execute all that apply." If nothing matches anywhere, DefaultFaultRule.
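
A sketch of that practice applied to the first rule in the question, assuming the behavior described in this thread: the outer Condition is the general test that selects the rule, and the Step conditions pick the specific response. The combined outer Condition is an addition for illustration; the rule and policy names come from the original listing.

```xml
<FaultRule name="api_key_faults">
    <Step>
        <Name>fault_invalid_key</Name>
        <Condition>(fault.name = "InvalidApiKey")</Condition>
    </Step>
    <Step>
        <Name>fault_missing_key</Name>
        <Condition>(fault.name = "FailedToResolveAPIKey")</Condition>
    </Step>
    <!-- Outer Condition added so the rule is no longer implicitly true -->
    <Condition>(fault.name = "InvalidApiKey" or fault.name = "FailedToResolveAPIKey")</Condition>
</FaultRule>
```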

Sound about right?

"The outer conditions determine which fault rule gets executed." is not quite complete. Maybe "The outer conditions determine which is the one fault rule that gets executed when evaluated in reverse order." But discussing reverse order conflicts with the execution model in the docs at present.

The docs currently say that the LAST fault rule that evaluates to true is the one executed (implying top-to-bottom evaluation). What is unclear in that statement is that it is the LAST rule whose outer Condition evaluates to true (not the condition inside the Step). Trace shows, however, that the rules are actually evaluated in reverse order, from bottom to top. I'm not sure which conceptual model is better.

As to best practices, that is hard to say. Using the nested conditionals allowed me to reduce some of the XML structure noise by cutting out many FaultRule tags and also associating faults into named blocks with nicely named fault.name tests. But it also hid a poorly understood (by me) execution issue. I think just using the inner conditionals as outer conditionals in individual fault rules would be sufficient, but I seem to recall some issue in originally getting the quota rule to work, so I need to retest with my new understanding.

Also, the "spike_arrest" FaultRule does not seem to execute correctly as the revalidate_spike_arrest occurs several times an hour but is processed by fault_generic_error. So, it is not clear at present if the nested conditionals work at all here. I'll be working with it tomorrow. The quota_rule has worked, but it is possible it was in an earlier form, so I need to recheck that. I'll let you know what I find.

Thanks, George. Then that does line up with my updated understanding. I'll do my own testing as well. WRT revalidate_spike_arrest not firing, I'm not sure, but maybe try putting parens around the spikearrest.spike_protection.failed condition in the step above if they're not in your live code. Maybe conditions are invalid without parens and causing an execution abort before revalidate_spike_arrest can fire? May also be issues with steps not firing in order (https://community.apigee.com/questions/5600/policies-in-faultrules-are-not-executing-in-sequen.html).

Also, shouldn't the spikearrest.*.failed be ratelimit.*.failed? My trace on spike arrest calls only has ratelimit.* variables. But oddly, when I exceed my limit and get a 500, the .failed variable is still false. I'll need to check with engineering on what's expected there.

Thanks. I just discovered the same naming. The few usages I've seen imply:

policy-type.policy-name.failed

However, it appears to be

policy-variable-namespace.policy-name.failed

I suggest adding a line to each policy that can fail with the correct spelling for the above variable.

Changing to ratelimit does work in my case.
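
For reference, a sketch of the "spike_arrest" rule from the question with only the variable namespace changed from spikearrest to ratelimit. The policy names and the = "true" comparison are carried over from the original listing, and parentheses are added around the first Step condition for consistency. Whether the .failed flag is populated as expected should be confirmed in trace.

```xml
<FaultRule name="spike_arrest">
    <Step>
        <Name>fault_spike_arrest</Name>
        <!-- ratelimit.{policy-name}.failed instead of spikearrest.{policy-name}.failed -->
        <Condition>(ratelimit.spike_protection.failed = "true")</Condition>
    </Step>
    <Step>
        <Name>fault_revalidate_spike_arrest</Name>
        <Condition>(ratelimit.revalidate_spike_protection.failed = "true")</Condition>
    </Step>
    <Condition>(fault.name = "SpikeArrestViolation")</Condition>
</FaultRule>
```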

I suspect I had the same issue with Quota: it is in the ratelimit namespace, hence I'm testing the exceed counts rather than the failed values.

@George Shaw - Some updated findings based on my testing. It looks like the evaluation of FaultRules moves *downward*. For example, I have 2 fault rules that evaluate to true. The second one, *below* the first one, gets executed. Here's mine:

3171-faultruleexecution.png

If I reverse the order in the XML config, invalid_key_rule fires.

@Floyd Jones Your graphic did not appear. The ordering I see, in reference to my listing above, is:

  1. spike_arrest
  2. quota_rule
  3. no_routes_matched_rule
  4. api_key_faults

which is the reverse of the physical order. Within each fault-rule the steps are executed in physical order (from top to bottom).

@George Shaw - Below are my 2 fault rules, both of which are true. In the following, "random" executes. When I reverse the order (with random first physically), "invalid_key_rule" executes.

<FaultRules>
  <FaultRule name="invalid_key_rule">
    <Step>
      <Name>Invalid-key-raise-fault</Name>
      <Condition>(oauthV2.Verify-API-Key-1.failed = true)</Condition>
    </Step>
    <Condition>(fault.name = "FailedToResolveAPIKey")</Condition>
  </FaultRule>
  <!-- If both are true, the following executes for me -->
  <FaultRule name="random">
    <Step>
      <Name>Random-message</Name>
    </Step>
  </FaultRule>
</FaultRules>

Another thing I found is that I'm not allowed to use operator symbols (such as < and >) in my conditions, because it causes invalid XML. Instead, I need to use the word versions of operators (shown here http://docs.apigee.com/api-services/reference/conditions-reference#operators).

For example:

(response.status.code GreaterThan 299)

instead of

(response.status.code > 299)

Maybe try that to see if your conditions are triggered differently. (The equals symbol '=' works, though. Probably can't use characters that conflict with reserved XML characters.)
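
A sketch of the spellings involved, using a condition from the quota rule earlier in the thread. As far as XML itself is concerned, a bare ">" is legal in element text, while a bare "<" is not and has to be escaped or replaced by a word operator. The word forms shown are the ones I understand the conditions reference to list; please check that page before relying on them.

```xml
<!-- Word form of the operator -->
<Condition>(ratelimit.check_developer_quota.exceed.count GreaterThan 0)</Condition>

<!-- Symbol form, escaped; the XML parser turns &gt; back into ">" -->
<Condition>(ratelimit.check_developer_quota.exceed.count &gt; 0)</Condition>

<!-- "<" must be escaped in XML text, or written with the word form -->
<Condition>(response.status.code &lt; 300)</Condition>
<Condition>(response.status.code LesserThan 300)</Condition>
```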

@Floyd Jones Interesting about the operator symbols. In the quota_rule code above, the > in the > 0 comparison is highlighted in my editor, but the policy deploys and seemed to execute correctly. I don't use that code at present, now that I know the namespace is 'ratelimit'. I just check the 'failed' variable on the policy.

> It looks like the evaluation of FaultRules moves *downward*. For example, I have 2 fault rules that evaluate to true. The second one, *below* the first one, gets executed.

Yes, that is as documented.

Hi @George Shaw / @Dino - FYI, I've done an overhaul of the Fault Handling topic. Hoping it's clearer than it was.

Another interesting thing I found was that request.* and response.* variables were available, but only those that were present when the error occurred.

Please holler back at me with any questions about or issues with the updated content.

Thanks!

@Floyd Jones Just skimmed through it, very comprehensive.

It may be worth noting how to differentiate among faults for the same fault.name:

<Condition>(oauthV2.{policy-name}.failed = true)</Condition>
<Condition>(servicecallout.{policy-name}.failed = true)</Condition>
<Condition>(javascript.{policy-name}.failed = true)</Condition>

Is that pattern of using the condition: ({policy-type}.{policy-name}.failed = true) consistent?
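
As a sketch of how that could look inside a single FaultRule: two key-verification policies can raise the same fault.name, and the Step conditions tell them apart. The policy names (Verify-Partner-Key, Verify-Public-Key, fault_partner_key, fault_public_key) are hypothetical. Note that elsewhere in this thread the prefix turned out to be the policy's variable namespace rather than its type (for example ratelimit for Spike Arrest), so check the prefix in trace for each policy.

```xml
<FaultRule name="invalid_key_rule">
    <!-- Both hypothetical VerifyAPIKey policies raise InvalidApiKey;
         the .failed flag shows which one actually failed -->
    <Step>
        <Name>fault_partner_key</Name>
        <Condition>(oauthV2.Verify-Partner-Key.failed = true)</Condition>
    </Step>
    <Step>
        <Name>fault_public_key</Name>
        <Condition>(oauthV2.Verify-Public-Key.failed = true)</Condition>
    </Step>
    <Condition>(fault.name = "InvalidApiKey")</Condition>
</FaultRule>
```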

See my updated discussion here: https://community.apigee.com/questions/26917/policy-fault-handling.html#answer-29240

Great call, @Kurt Kanaskie. I've added a brief section for that here. Next step is to update the policy variables reference with more info about that.

Also, in the best practices section, I've documented a couple of patterns: one for handling multiple policies of the same type and one for handling one policy of one type. (I just added a link to the community thread you mention!) Thanks again.

@jonesfloyd Your links go to 404 errors...

Thanks! Appreciated.

@Floyd Jones- I think I recently observed the Fault Rules are evaluated Bottom to Top in the ProxyEndpoint, but Top to Bottom in the TargetEndpoint. Might be worth double checking!

Thanks for the heads up about that, @Sean Davis. I'll check it out.

@Floyd Jones, @Sean Davis

I have confirmed that this is weirdly true.

I added the following to both proxy and target endpoint FaultRules and observed the test flow in the UI and in trace downloads.

<FaultRule name="raise-fault-1">
    <Condition>(fault.name = "RaiseFault")</Condition>
    <Step>
        <FaultRules/>
        <Name>Assign-Message-Raise-Fault-1</Name>
    </Step>
</FaultRule>
<FaultRule name="raise-fault-2">
    <Condition>(fault.name = "RaiseFault") and ("fault.name" = "RaiseFault")</Condition>
    <Step>
        <FaultRules/>
        <Name>Assign-Message-Raise-Fault-2</Name>
    </Step>
</FaultRule>
<FaultRule name="raise-fault-3">
    <Condition>(fault.name = "RaiseFault") and ("fault.name" = "RaiseFault") and ("fault.name" = "RaiseFault")</Condition>
    <Step>
        <FaultRules/>
        <Name>Assign-Message-Raise-Fault-3</Name>
    </Step>
</FaultRule>

In the proxy trace download I see:

    <Point id="Condition">
        <DebugInfo>
            <Timestamp>02-08-16 14:11:24:763</Timestamp>
            <Properties>
                <Property name="ExpressionResult">false</Property>
                <Property name="Expression">((fault.name equals "RaiseFault") and (("fault.name" equals "RaiseFault") and ("fault.name" equals "RaiseFault")))</Property>
                <Property name="Tree">FAULT_raise-fault-3</Property>
            </Properties>
        </DebugInfo>
        <VariableAccess>
            <Get value="InvalidApiKey" name="fault.name"/>
        </VariableAccess>
    </Point>
    <Point id="Condition">
        <DebugInfo>
            <Timestamp>02-08-16 14:11:24:763</Timestamp>
            <Properties>
                <Property name="ExpressionResult">false</Property>
                <Property name="Expression">((fault.name equals "RaiseFault") and ("fault.name" equals "RaiseFault"))</Property>
                <Property name="Tree">FAULT_raise-fault-2</Property>
            </Properties>
        </DebugInfo>
        <VariableAccess>
            <Get value="InvalidApiKey" name="fault.name"/>
        </VariableAccess>
    </Point>
    <Point id="Condition">
        <DebugInfo>
            <Timestamp>02-08-16 14:11:24:763</Timestamp>
            <Properties>
                <Property name="ExpressionResult">false</Property>
                <Property name="Expression">(fault.name equals "RaiseFault")</Property>
                <Property name="Tree">FAULT_raise-fault-1</Property>
            </Properties>
        </DebugInfo>
        <VariableAccess>
            <Get value="InvalidApiKey" name="fault.name"/>
        </VariableAccess>
    </Point>

In the target trace download I see:

<Point id="Condition">
        <DebugInfo>
            <Timestamp>02-08-16 14:11:41:101</Timestamp>
            <Properties>
                <Property name="ExpressionResult">false</Property>
                <Property name="Expression">(fault.name equals "RaiseFault")</Property>
                <Property name="Tree">FAULT_raise-fault-1</Property>
            </Properties>
        </DebugInfo>
        <VariableAccess>
            <Get value="ErrorResponseCode" name="fault.name"/>
        </VariableAccess>
    </Point>
    <Point id="Condition">
        <DebugInfo>
            <Timestamp>02-08-16 14:11:41:101</Timestamp>
            <Properties>
                <Property name="ExpressionResult">false</Property>
                <Property name="Expression">((fault.name equals "RaiseFault") and ("fault.name" equals "RaiseFault"))</Property>
                <Property name="Tree">FAULT_raise-fault-2</Property>
            </Properties>
        </DebugInfo>
        <VariableAccess>
            <Get value="ErrorResponseCode" name="fault.name"/>
        </VariableAccess>
    </Point>
    <Point id="Condition">
        <DebugInfo>
            <Timestamp>02-08-16 14:11:41:101</Timestamp>
            <Properties>
                <Property name="ExpressionResult">false</Property>
                <Property name="Expression">((fault.name equals "RaiseFault") and (("fault.name" equals "RaiseFault") and ("fault.name" equals "RaiseFault")))</Property>
                <Property name="Tree">FAULT_raise-fault-3</Property>
            </Properties>
        </DebugInfo>

@Kurt Kanaskie and @Sean Davis, thank you so much for pointing this out! My testing confirms this as well, and I agree it's weird. But I've updated the docs accordingly. And the *first* FaultRule that is true gets executed in both cases.
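
A sketch of what that means for rule layout, assuming the behavior described in this thread. The rule and policy names (general_rule, specific_rule, fault_general, fault_specific) are hypothetical, and the endpoint fragments omit everything except the FaultRules. When both conditions are true, the rule evaluated first wins, so the more specific rule goes physically last in a ProxyEndpoint and physically first in a TargetEndpoint.

```xml
<!-- ProxyEndpoint: evaluated bottom to top, so specific_rule is checked first -->
<ProxyEndpoint name="default">
    <FaultRules>
        <FaultRule name="general_rule">
            <Step>
                <Name>fault_general</Name>
            </Step>
            <Condition>(fault.name = "InvalidApiKey" or fault.name = "FailedToResolveAPIKey")</Condition>
        </FaultRule>
        <FaultRule name="specific_rule">
            <Step>
                <Name>fault_specific</Name>
            </Step>
            <Condition>(fault.name = "InvalidApiKey")</Condition>
        </FaultRule>
    </FaultRules>
</ProxyEndpoint>

<!-- TargetEndpoint: evaluated top to bottom, so specific_rule is checked first -->
<TargetEndpoint name="default">
    <FaultRules>
        <FaultRule name="specific_rule">
            <Step>
                <Name>fault_specific</Name>
            </Step>
            <Condition>(fault.name = "InvalidApiKey")</Condition>
        </FaultRule>
        <FaultRule name="general_rule">
            <Step>
                <Name>fault_general</Name>
            </Step>
            <Condition>(fault.name = "InvalidApiKey" or fault.name = "FailedToResolveAPIKey")</Condition>
        </FaultRule>
    </FaultRules>
</TargetEndpoint>
```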

I'm going to file a product bug. IMO, ProxyEndpoint should be top-to-bottom as well, unless there's some logical reason for this that I'm missing.

Thanks again!

My only concern is backwards compatibility now. There are lots of projects that will break if this behaviour is changed. Definitely a bit frustrating!

No problem, add an attribute to <FaultRules> element in each proxy and default it to the current direction. LOL

So when a new proxy is generated, the developer will see the following and be confused, but clear.

<ProxyEndpoint name="default">
    <FaultRules executionDirection="up">
    ...
<TargetEndpoint name="default">
    <FaultRules executionDirection="down">
    ...

Great point about the backwards compatibility on existing implementations. So how's about I don't file a bug at this point. Now that we've got the behavior explicitly described, let's see if it causes any hue and cry. Thanks again, fellas. It takes a village!

@Kurt Kanaskie

Hi Kurt, does attribute executionDirection exist for FaultRule element? I tried searching documentation for this and couldn't find it.