Re: Regarding SLA of Apigee X

amitkhosla

Hi Experts,

As per https://cloud.google.com/apigee/sla, the availability SLA for enterprise is 99.9% for a single instance and 99.99% for multiple instances. Please correct me if that is wrong.

My question is about the reasons I should expect availability to be lost. Is it because of the chance of a regional failure? Or are there other causes as well, such as a zonal failure or other internals of Apigee?

Thanks & Regards

Amit

dchiesa1

@amitkhosla wrote:

My question is about the reasons I should expect availability to be lost. Is it because of the chance of a regional failure? Or are there other causes as well, such as a zonal failure or other internals of Apigee?

I don't quite understand your question. I think you are asking Apigee experts to describe the possible causes for the availability of Apigee to be less than 100%. Is that right?

Apigee, like most modern cloud-based services, is a distributed computer system. Distributed systems are complex, and many things at multiple layers are potential sources of disruption. Physical issues, power issues, network issues, software issues... But none of that is spelled out in the Apigee SLA. The SLA is just "service level agreement" - it is just a commitment to provide service at a particular level, and some agreement on what happens if Google does not meet that committed service level. Google does not detail all the possible causes for disruption; there are too many.

It seems to me, as a user of the service, the potential causes of disruption should be completely irrelevant to you. What's pertinent is the SLA.
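To make the SLA figures from the question concrete, here is a rough sketch that converts an availability percentage into a downtime budget. It assumes a 30-day measurement window purely for illustration; the SLA document defines its own measurement period and what counts as downtime, so treat the numbers as ballpark only. The function name `allowed_downtime_minutes` is made up for this example.

```python
# Rough downtime budget implied by an availability percentage.
# Assumption (not from the SLA text): a 30-day measurement window.

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Return the downtime budget, in minutes, for the given availability."""
    return window_minutes * (1 - availability_percent / 100)


# The two figures quoted in the question above.
for label, pct in [("single instance", 99.9), ("multiple instances", 99.99)]:
    budget = allowed_downtime_minutes(pct)
    print(f"{label}: {pct}% -> about {budget:.1f} minutes per 30 days")
```

Under that assumption, 99.9% works out to roughly 43 minutes of downtime per 30 days, and 99.99% to roughly 4 minutes.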

Maybe you're just curious? I am no expert, but I can imagine some possible sources of service disruption. If a hurricane strikes and knocks out power to an availability zone in a Google datacenter at the same time that a load spike occurs, then I suppose it is possible that a service disruption in Apigee will occur. If there is a failure in a redundant network component, that can cause disruption. If one of the software libraries Apigee depends on exhibits a novel bug, that can cause a disruption. As with failures in any complex system, the causes of disruption are usually novel and subtle, with a chain of causal events.

There was an incident some years ago in which a team inside a Google datacenter was running a routine test of the backup power generator. This particular time, there was a fuel leak, which led to a fire in the generator room, which led to fire suppression (water). But because the generator was not on the ground floor, the water sprayed in the generator room found its way to computer systems, and THAT led to service disruption in the datacenter.

This kind of causal chain is typical of failures in a complex system. If you're interested, you can read more about failures in complex systems, here.

amitkhosla

Thanks for the reply!!

I agree that all systems fail, which is why we take steps to mitigate the risks. I am working on a project where, to keep latency as low as possible, I am trying to connect to an Apigee X instance in the same region as the source app. The app will be spread across different regions, and each app instance will call the Apigee X instance in its own region. Apigee X itself will also call apps only in the same region. These app instances (north or south of Apigee X) are independent of the deployments in other regions.

So I am trying to figure out whether there can be any situation where my apps keep running but Apigee X crashes. As per my understanding, Apigee must be running across multiple zones to withstand a zonal failure.
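A back-of-the-envelope way to frame that concern: when a request has to pass through the source app, the regional Apigee X instance, and the target app in sequence, the availability of the whole path is roughly the product of the individual availabilities, assuming the components fail independently. The sketch below is only illustrative. The 0.999 figure for Apigee X is the single-instance number quoted earlier in the thread; the app figures are hypothetical placeholders, and `serial_availability` is a made-up helper name.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes component failures are independent, which real systems
    sharing a region or zone may not satisfy.
    """
    return prod(availabilities)


# Hypothetical figures for one region: source app, Apigee X, target app.
source_app = 0.999   # placeholder, not a measured value
apigee_x = 0.999     # single-instance figure quoted in this thread
target_app = 0.999   # placeholder, not a measured value

path = serial_availability(source_app, apigee_x, target_app)
print(f"End-to-end availability of the regional path: {path:.4%}")
```

With these placeholder numbers the path comes out slightly below 99.9%, which is the point: even if the apps keep running, an outage in any one link of the chain takes the regional path down.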