Avoiding tombstone issues with high-volume OAuth

Not applicable

When generating > 2 million access tokens per day, what settings should we use for Cassandra?

The concern is that we'd rapidly start accumulating tombstones, causing performance to suffer, or perhaps run into issues with the 100k tombstone query limit.

My initial thoughts are:

  • reduce the oauth_20_access_tokens gc_grace_seconds from 10 days to ~1 day
  • increase the tombstone_failure_threshold from 100k to 1M?

Would that be a good idea? Is there any documentation on this?
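
For concreteness, the first change is a per-table setting applied through cqlsh; the default gc_grace_seconds is 864000 seconds (10 days), and 86400 would be roughly 1 day. A rough sketch only (assuming the default kms keyspace; confirm the current value before changing it):

  cqlsh> ALTER TABLE kms.oauth_20_access_tokens WITH gc_grace_seconds = 86400;

The second change (tombstone_failure_threshold) is not a table setting; it lives in cassandra.yaml on each node and only takes effect after a restart.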

Solved
1 ACCEPTED SOLUTION

Hello,

I came across this post and wanted to provide more feedback on what you can do to avoid this issue.

  • Set expiration on access token and refresh token (this is set in the OAuth policy using <ExpiresIn>)
  • Upgrade to 4.16.01+ (supports autopurge option)
  • Use SSD disks
  • Set Autopurge settings
  • Set the compaction strategy
    • "LeveledCompactionStrategy"
    • cqlsh> ALTER COLUMNFAMILY kms.oauth_20_access_tokens WITH compaction = {'class': 'LeveledCompactionStrategy'};
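
Building on the compaction bullet above: the same statement can also carry Cassandra's tombstone-related compaction sub-options. This is a sketch only (generic Cassandra compaction options, not an Apigee-documented recommendation; unchecked_tombstone_compaction defaults to false and tombstone_threshold to 0.2):

  cqlsh> ALTER TABLE kms.oauth_20_access_tokens WITH compaction = {'class': 'LeveledCompactionStrategy', 'tombstone_threshold': '0.2', 'unchecked_tombstone_compaction': 'true'};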

To help with the number of access tokens and tombstones, we recommend the following:

  • Adding 3 more Cassandra nodes will help distribute the data across the cluster, reducing the amount of data per node.
  • To help with the tombstone issues, try the following:
    • lower gc_grace_seconds in Cassandra
    • the default is 10 days
    • lower it to 3 or 5 days (see the sketch after this list)
    • this allows compaction to remove tombstones sooner
  • Increase the tombstone limit in Cassandra
    • the default is 100k
    • increase it to 300k or 500k
    • this can affect read performance slightly
  • Be aware that if you change these settings in Cassandra, any node that goes down needs to be fixed or replaced within 24 hours; otherwise it will cause data consistency problems, since deleted data can reappear once its tombstones have been purged from the other replicas.
  • If tombstones continue to be a problem, you can trigger a manual compaction at any time using nodetool (see the example below)
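
For reference, a manual compaction of just this table (rather than the whole keyspace) can be run like this; a sketch, assuming the default kms keyspace:

  $ nodetool compact kms oauth_20_access_tokens

Lowering gc_grace_seconds is the same kind of ALTER TABLE statement shown earlier (259200 seconds for 3 days, 432000 for 5 days), and the tombstone limit is the tombstone_failure_threshold property in cassandra.yaml, which must be changed on every node and followed by a rolling restart.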

    7 REPLIES

    Not applicable

    Upon further research, it looks like the 100k limit is specifically related to intensive queries. But I suspect Apigee's queries use the indexes to retrieve only one token record at a time rather than many - in which case the tombstones may not be an issue. Can anyone help clarify this concern?

    Not applicable

    http://docs.apigee.com/api-services/content/oauthv2-policy talks about how to purge expired tokens. By default they are purged 180 days after expiry (for access tokens as well as refresh tokens).

    Thanks, but that does not answer the question. That document does describe how to enable the automatic token purge, but purging still creates tombstones, with possible performance problems as a result.

    • reduce the oauth_20_access_tokens gc_grace_seconds from 10 days to ~1 day
    • increase the tombstone_failure_threshold from 100k to 1M?

    Both are the right approaches for this scenario. However, we cannot officially document/support either option.

    Option 1 will work as long as you can guarantee that a node failure will be handled/restored within a few hours.

    Option 2 could potentially introduce huge latency, as reads would now have to traverse up to 1M tombstones.

    I would start by looking at the legitimate business case for generating and expiring X number of OAuth tokens (presuming it is over a few hundred thousand per day).
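
    For context, the limit being discussed maps to properties in cassandra.yaml. A sketch of what raising it could look like, to be applied on every node and followed by a rolling restart (the defaults are 1000 and 100000 respectively):

      # cassandra.yaml (per node) - tombstone scan thresholds
      tombstone_warn_threshold: 1000
      tombstone_failure_threshold: 1000000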

    > Option 2 could potentially introduce huge latency, as reads would now have to traverse up to 1M tombstones.

    My understanding is that the huge latency would only occur if queries were issued that require scanning through many records - i.e., not directly querying based on the primary key (the token). Does Apigee perform queries like that?

    > Both are the right approaches for this scenario. However, we cannot officially document/support either option.

    Are there no customers who expire more than 100k tokens per 10-day span?
