Apigee Edge-GCP Service Account Key Rotation

tgp

Hi Team, 

I am currently working with Apigee Edge and a GCP backend.

In a couple of use cases, we need to configure a GCP Service Account (SA) JSON key file in Apigee Edge, in places such as the following:

1. In a GCP Logging Extension

2. In an encrypted KVM

For security reasons, the backend team plans to rotate the SA private key at a regular interval.

What are the recommendations, best practices and approaches in Apigee Edge for configuring this SA key without downtime (manually or automatically) while the key is being rotated?

TIA.

 

3 REPLIES

It should be no problem rotating keys. Regarding your desire to avoid downtime (or, maybe we can call it "service disruption"), there's nothing special about key rotation in the scenario of Apigee and GCP Service Accounts.  The concerns there are the same concerns you would have in any key rotation scenario. 

The main requirement is that the lifetime of the new key must overlap the lifetime of the old key.  

The process is to load the key into the KVM.  Then within Apigee there is some logic to use the .json key to sign a JWT, and use THAT to  obtain an opaque OAuth token.  When you have a new key, just update the key within the KVM. 
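
To make the "sign a JWT, then exchange it for a token" step more concrete, here is a rough Python sketch of the claim set and the token request body. It assumes the standard Google service account key fields `client_email` and `token_uri`, and uses the Cloud Logging write scope as an example. The function names are hypothetical, and the RS256 signing step (which GenerateJWT performs in the proxy) is left out.

```python
import time
from urllib.parse import urlencode

# Example scope for writing to Cloud Logging (assumed to be the one needed here).
LOGGING_SCOPE = "https://www.googleapis.com/auth/logging.write"

def build_jwt_claims(sa_key, scope, now=None, lifetime_s=3600):
    """Build the claims for the self-signed JWT from the parsed SA key JSON."""
    issued_at = int(time.time() if now is None else now)
    return {
        "iss": sa_key["client_email"],   # the service account identity
        "scope": scope,
        "aud": sa_key["token_uri"],      # normally https://oauth2.googleapis.com/token
        "iat": issued_at,
        "exp": issued_at + lifetime_s,   # at most one hour after iat
    }

def build_token_request_body(signed_jwt):
    """Form-encoded body for the request-for-access-token call."""
    return urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "assertion": signed_jwt,
    })

# Hypothetical key contents, for illustration only.
example_key = {
    "client_email": "logger@example-project.iam.gserviceaccount.com",
    "token_uri": "https://oauth2.googleapis.com/token",
}
claims = build_jwt_claims(example_key, LOGGING_SCOPE, now=1_700_000_000)
body = build_token_request_body("header.payload.signature")
```

Nothing in the claims depends on which "version" of the key signs the JWT, which is why swapping the key in the KVM is enough.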

For the purposes of this discussion, let's introduce the idea of a "version" of the key. Suppose your goal is to rotate the key every month: on the 1st day of every month, you want to begin using a new key. In that case your key lifetime should be something like 31 days plus 1 day. We can call that extra day the "lifetime overlap". It can be 1 hour, 1 day, etc. It's up to you. One or more days feels safer to me; it allows you to roll back if there are any problems.
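
The arithmetic can be sketched in a few lines of Python. The dates and the helper name `key_validity_window` are made up for illustration; the only idea taken from the discussion is that each key lives for one rotation period plus the overlap.

```python
from datetime import datetime, timedelta, timezone

def key_validity_window(created_at, rotation_period, overlap):
    """Return (start, end): the key is valid for one rotation period plus the overlap."""
    return created_at, created_at + rotation_period + overlap

# Hypothetical schedule: a new key on the 1st of each month, with a 1-day overlap.
overlap = timedelta(days=1)
jan_start, jan_end = key_validity_window(
    datetime(2024, 1, 1, tzinfo=timezone.utc), timedelta(days=31), overlap)
feb_start, feb_end = key_validity_window(
    datetime(2024, 2, 1, tzinfo=timezone.utc), timedelta(days=29), overlap)

# Both keys are valid during this window, so the KVM can be updated
# (or rolled back) at any point inside it.
both_valid = jan_end - feb_start
print(both_valid)  # 1 day, 0:00:00
```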

The process I am imagining is: 

  1. The Apigee proxy checks the cache for an opaque token that it can use with GCP logging (LookupCache).
  2. If there is no token in cache, then:
    1. The proxy retrieves the .json key from the KVM (via KeyValueMapOperations GET).
    2. The proxy generates a self-signed JWT (via GenerateJWT).
    3. The proxy transmits that JWT in a request-for-access-token to the oauth2.googleapis.com endpoint (ServiceCallout).
    4. The proxy receives the access_token in response and inserts it into cache, with a lifetime that is just a little less than the token lifetime (via PopulateCache). Typically the lifetime of the access_token is 1 hour, 3600 seconds. So you can use something like 3550 as the cache expiry for this token.
    5. The proxy puts the received token into the same variable that it would have loaded from LookupCache (Step 1).
  3. At some point later in processing, the proxy uses the access token to perform logging.
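
The steps above can be sketched outside of Apigee as a small Python simulation. This is only an illustration of the control flow, not how the policies are implemented: `TtlCache` stands in for the Apigee cache, and `load_sa_key` and `exchange_for_token` are hypothetical callables standing in for the KVM Get and for the GenerateJWT plus ServiceCallout steps.

```python
import time

class TtlCache:
    """Tiny in-memory stand-in for the Apigee cache (LookupCache / PopulateCache)."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # expired: behave like a cache miss
            return None
        return value

    def put(self, key, value, ttl_seconds):
        self._entries[key] = (value, self._clock() + ttl_seconds)

def get_access_token(cache, load_sa_key, exchange_for_token, cache_ttl=3550):
    token = cache.get("gcp-logging-token")                 # step 1
    if token is None:                                      # step 2
        sa_key = load_sa_key()                             # step 2.1
        token = exchange_for_token(sa_key)                 # steps 2.2 and 2.3
        cache.put("gcp-logging-token", token, cache_ttl)   # step 2.4
    return token                                           # steps 2.5 and 3

# Simulated rotation with a fake clock and a fake token exchange.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
kvm = {"sa-key": "old-key"}
fake_exchange = lambda key: "token-from-" + key

first = get_access_token(cache, lambda: kvm["sa-key"], fake_exchange)
kvm["sa-key"] = "new-key"    # administrator loads the new key
second = get_access_token(cache, lambda: kvm["sa-key"], fake_exchange)
now[0] = 3551                # the cached token has expired
third = get_access_token(cache, lambda: kvm["sa-key"], fake_exchange)
```

In this simulation the token minted from the old key keeps being served until its cache entry expires, and the next refresh picks up the new key. The sketch ignores the separate cache in front of the KVM Get.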

 

You can update the KVM to insert a new .JSON key.  When the TTL for the access_token expires  (3550 seconds), the proxy will retrieve the NEW .json key from the KVM, and then will generate an access_token using that new json key. 

One complication you must take care of: the KVM Get operation also takes advantage of the cache. You should cache the KVM Get for less than the "lifetime overlap". If your old SA key and your new SA key have a lifetime overlap of 1 day, then the cache TTL for the KVM Get must be less than one day. This ensures that in the worst-case scenario, in other words the case where the proxy retrieves the "old" .json key from the KVM just before you administratively load the new .json key into the KVM, the "old" key will be evicted from the Apigee cache before it becomes invalid. You will want a margin of safety here, too. So if your key overlap is 1 day, then you may wish to use a 4-hour TTL for the KVM Get.

The implication of a shorter TTL for the cache that works with KVM Get is that, in some cases when the proxy gets a new access token for GCP logging, it will first have to read the KVM. This will consume 8ms or so, which introduces some additional latency for the client app. Depending on how critical it is for you to have low latency, you may wish to shorten or extend the TTL for the KVM Get.

Keep in mind the overall latency for the "get a new token" operation (Step 2 above) will be significantly more than 8ms. GenerateJWT will take 6ms or so, transmitting the JWT to the oauth2.googleapis.com endpoint and receiving the response will take 80ms or so, and so on. So 8ms for a KVM Get may be irrelevant.
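
As a quick sanity check on those numbers, here is a small Python sketch. The `safety_factor` of 0.5 is an arbitrary assumption of mine to express "a margin of safety"; the latency figures are the rough estimates quoted above, not measurements.

```python
HOUR = 3600
DAY = 24 * HOUR

def kvm_ttl_is_safe(kvm_cache_ttl_s, key_overlap_s, safety_factor=0.5):
    """True if the KVM Get cache TTL leaves a margin before the old key expires."""
    return kvm_cache_ttl_s <= key_overlap_s * safety_factor

print(kvm_ttl_is_safe(4 * HOUR, DAY))  # True: 4-hour TTL with a 1-day overlap
print(kvm_ttl_is_safe(DAY, DAY))       # False: no margin at all

# Rough latency budget for one token refresh, in milliseconds.
refresh_latency_ms = {"kvm_get": 8, "generate_jwt": 6, "token_call": 80}
total_ms = sum(refresh_latency_ms.values())
kvm_share = refresh_latency_ms["kvm_get"] / total_ms
print(total_ms)  # 94
```

With these estimates the KVM read is under a tenth of the refresh cost, and a refresh only happens about once per 3550 seconds.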

In summary, 

  • use a key overlap
  • keep the TTL for the cache for KVM GET smaller than the  key overlap
  • don't worry

ps: You didn't ask, but... I would suggest investigating the use of a ServiceCallout to perform the logging to GCP Logging.  It will be faster and more efficient, simpler and more reliable, than the GCP extension.  It will use the same token as you would use with the GCP Extension. 

 

Thanks @dchiesa1 for your quick response and guidance.

I happened to find your example: https://github.com/DinoChiesa/Apigee-GCP-Logging-Example

Sorry, I haven't gone through it fully. Does it cover this SA key rotation?

To update the key (credentials) in the GCP Extension, is it a good idea to use the Edge Extensions API to patch the credentials without service disruption? If that is feasible, the backend team could call this API from their Terraform module.

The GCP Logging example I published does not deal with key rotation. 

To update the key, I think all you need to do is use the KVM Put operation.  This is possible via the Apigee Edge administrative API. 

For an environment-scoped KVM, you can use THIS API to update a KVM entry.
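
For illustration, here is a Python sketch that builds (but does not send) such an update request. The URL path and the `{"name", "value"}` body are my recollection of the Edge management API for updating a single entry in an environment-scoped KVM on CPS-enabled organizations; treat them as assumptions and verify against the API documentation. The function name and argument values are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed base URL of the Apigee Edge management API.
EDGE_API = "https://api.enterprise.apigee.com/v1"

def build_kvm_entry_update(org, env, kvm_name, entry_name, sa_key_json, bearer_token):
    """Build a request that overwrites one KVM entry with the new SA key JSON."""
    url = (
        f"{EDGE_API}/organizations/{quote(org, safe='')}"
        f"/environments/{quote(env, safe='')}"
        f"/keyvaluemaps/{quote(kvm_name, safe='')}"
        f"/entries/{quote(entry_name, safe='')}"
    )
    body = json.dumps({"name": entry_name, "value": sa_key_json}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",  # assumed verb for updating an existing entry
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
    )

# Hypothetical names; the key content would be the full SA key file as a string.
req = build_kvm_entry_update(
    "my-org", "prod", "gcp-secrets", "logging-sa-key",
    '{"type": "service_account"}', "ACCESS_TOKEN")
# urllib.request.urlopen(req) would send it; omitted here on purpose.
```

A rotation job could issue this call as soon as the new key exists, then rely on the cache TTLs discussed above for the proxies to pick it up.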