Error while fetching object -Organization

We have a single-node Cassandra in dev/QA, and 3 nodes in dc1 and 3 nodes in dc2.

We recently had a server maintenance window and saw strange behavior involving data loss and (partial) recovery.

(This was our first such incident, and we want to share it.)

Steps to replicate the situation:

1. As part of maintenance, the servers running the Apigee Cassandra (CS) node were rebooted.
2. As part of hardening, the server was rebooted with /tmp mounted without exec permissions.
3. The Apigee services start up during server boot. Check the Apigee Cassandra logs for the error below:
==
Caused by: org.apache.cassandra.exceptions.ConfigurationException: SnappyCompressor.create() threw an error: java.lang.NoClassDefFoundError Could not initialize class org.xerial.snappy.Snappy
	at org.apache.cassandra.io.compress.CompressionParameters.createCompressor(CompressionParameters.java:179)
	at org.apache.cassandra.io.compress.CompressionParameters.<init>(CompressionParameters.java:71)
	at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:95)
	... 11 more
==
4. Change the permissions on /tmp back to allow execution.
5. Restart the Apigee services and verify the state of recently worked-on API proxies; in our case we lost all of the recently worked-on proxy information.
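The stack trace points at snappy-java: it extracts its native library into java.io.tmpdir and loads it from there, so with /tmp mounted noexec that load fails and the Snappy class never initializes. A minimal Java sketch of a possible workaround, assuming snappy-java is on the classpath and that it honors the org.xerial.snappy.tempdir system property (the /var/tmp/snappy path is just an example):
==
import org.xerial.snappy.Snappy;

public class SnappyTmpCheck {
    public static void main(String[] args) throws Exception {
        // Point snappy-java at an exec-friendly directory *before* first use;
        // the native loader reads this property when extracting the library.
        // /var/tmp/snappy is a hypothetical path - use any executable mount.
        System.setProperty("org.xerial.snappy.tempdir", "/var/tmp/snappy");

        // If the native library loads, compression works; on a noexec tmpdir
        // this is where the NoClassDefFoundError would surface instead.
        byte[] compressed = Snappy.compress("hello".getBytes("UTF-8"));
        System.out.println("Snappy OK, " + compressed.length + " bytes");
    }
}
==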

What we found was that the data was no longer in Cassandra (CS) but was still present in ZooKeeper (ZK).

We opened a support case and found there was no way to recover, as it is a single node with no backup. Support recommended clearing the entries in ZK and re-creating the org.

After we cleared the entries, re-created the org, and restarted the services, all the proxies were visible again (other things were missing, like the KVMs in the org, but that's fine).

Has anyone seen this behavior? Can a Cassandra expert explain it?

How frequently do commits happen, how can data loss occur, and how does recovery work? How does it work internally?

-Vinay


Running your dev and QA environments on a single-node Cassandra ring is not recommended. Cassandra is designed to run as a cluster for data replication, and we are not sure what the behavior would be for a single-node Cassandra. We don't advise this; we recommend a minimum of a 3-node Cassandra ring, even in testing environments.
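On the commit question: Cassandra appends each write to a commit log and applies it to an in-memory memtable; with the default commitlog_sync: periodic setting, the commit log is fsynced every 10 seconds, and memtables are later flushed to immutable SSTables. On a single node there are no replicas, so if the node cannot read its compressed SSTables at startup (which is what the CompressionMetadata stack trace above points at), there is nothing to fall back on. A toy model of that write path (illustrative only, not Cassandra's actual code):
==
import java.util.*;

/** Toy model of Cassandra's write path: commit log + memtable -> SSTable flush. */
public class ToyWritePath {
    static final List<String> commitLog = new ArrayList<>();                    // append-only, replayed on restart
    static final SortedMap<String, String> memtable = new TreeMap<>();          // in-memory, lost on a crash
    static final List<SortedMap<String, String>> sstables = new ArrayList<>();  // immutable "on-disk" files

    static void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. durability: append to the commit log first
        memtable.put(key, value);         // 2. speed: apply to the in-memory memtable
    }

    static void flush() {
        sstables.add(new TreeMap<>(memtable)); // 3. memtable is flushed to an immutable SSTable
        memtable.clear();
        commitLog.clear();                     // flushed commit-log segments can be recycled
    }

    public static void main(String[] args) {
        write("proxy-rev-1", "bundle-v1");
        flush();                             // rev-1 is now in an SSTable
        write("proxy-rev-2", "bundle-v2");   // rev-2 exists only in the commit log + memtable
        // A crash here loses the memtable; recovery depends on replaying the
        // commit log and on being able to read the existing SSTables. If either
        // fails at startup, the most recent writes never come back.
        System.out.println("SSTables:  " + sstables);
        System.out.println("Unflushed: " + memtable);
    }
}
==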

Best practices that we suggest from our experience in the cloud:
1. Patch your test environments first before promoting patches to other environments. Never patch everything at once.
2. When patching Cassandra, patch one Cassandra node in the cluster at a time, bring it back up, verify, then move on to the next node in the cluster.
3. In case of failures, take backups of all data. Our backups run on a cron job at regular intervals so that we can restore if necessary (a sketch of one such step follows this list).
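A minimal sketch of one such backup step, written as a Java wrapper around nodetool (the install path and the kms keyspace name are just examples; nodetool snapshot hard-links the current SSTables so they can be copied off-box):
==
import java.io.IOException;

/** Toy backup step: invoke `nodetool snapshot` for one keyspace.
 *  Real setups run this from cron and then ship the snapshot files off-box. */
public class SnapshotBackup {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "/opt/apigee/apigee-cassandra/bin/nodetool", // hypothetical install path
                "snapshot", "-t", "nightly",                 // snapshot tag
                "kms");                                      // example keyspace
        pb.inheritIO(); // stream nodetool output to our stdout/stderr
        int exit = pb.start().waitFor();
        System.exit(exit); // non-zero exit lets the cron wrapper alert on failure
    }
}
==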

Thanks, Janice.

We have seen the same behaviour with 3 nodes in dc1 and 3 nodes in dc2: the latest revisions of the proxies are missing.

We would appreciate it if someone could replicate the issue, as it might be a bug, and confirming it might help other customers.

-Vinay