Apigee OPDK - attempted to delete non-existing file CommitLog

Apigee OPDK version 4.50.00

Has anyone else seen a similar error? We have a setup with three Cassandra/ZooKeeper nodes in DC-1 and three Cassandra/ZooKeeper nodes in DC-2. When the following error occurred, the three DC-1 nodes were up and the three DC-2 nodes were down.

Error message (I removed the VM IP address from the log pasted here and replaced it with {MY CASSANDRA VM IP}):

INFO [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:39,722 ColumnFamilyStore.java:917 - Enqueuing flush of cache_sequence_id_r24: 331736 (0%) on-heap, 848398 (0%) off-heap
INFO [MemtableFlushWriter:13273] 2021-12-13 16:35:39,724 Memtable.java:347 - Writing Memtable-cache_sequence_id_r24@2012015790(220.559KiB serialized bytes, 29117 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:13273] 2021-12-13 16:35:39,747 Memtable.java:382 - Completed flushing /opt/apigee/data/apigee-cassandra/data/cache/cache_sequence_id_r24-2708b0f0308711e9b3b3cddb55735b1c/cache-cache_sequence_id_r24-tmp-ka-47-Data.db (102.357KiB) for commitlog position ReplayPosition(segmentId=1617774795705, position=6805)
ERROR [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:39,776 StorageService.java:453 - Stopping gossiper
WARN [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:39,776 StorageService.java:359 - Stopping gossip by operator request
INFO [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:39,776 Gossiper.java:1456 - Announcing shutdown
INFO [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:39,776 StorageService.java:1715 - Node /{MY CASSANDRA VM IP} state jump to shutdown
ERROR [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:41,955 StorageService.java:458 - Stopping RPC server
INFO [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:41,955 ThriftServer.java:142 - Stop listening to thrift clients
ERROR [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:41,957 StorageService.java:463 - Stopping native transport
INFO [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:41,961 Server.java:225 - Stop listening for CQL clients
ERROR [COMMIT-LOG-ALLOCATOR] 2021-12-13 16:35:41,961 CommitLog.java:409 - Failed managing commit log segments. Commit disk failure policy is stop; terminating thread
java.lang.AssertionError: attempted to delete non-existing file CommitLog-4-1617774795287.log
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125) ~[apache-cassandra-2.1.22.jar:2.1.22]
at org.apache.cassandra.db.commitlog.CommitLogSegment.delete(CommitLogSegment.java:357) ~[apache-cassandra-2.1.22.jar:2.1.22]
at org.apache.cassandra.db.commitlog.CommitLogSegmentManager$5.call(CommitLogSegmentManager.java:424) ~[apache-cassandra-2.1.22.jar:2.1.22]
at org.apache.cassandra.db.commitlog.CommitLogSegmentManager$5.call(CommitLogSegmentManager.java:419) ~[apache-cassandra-2.1.22.jar:2.1.22]
at org.apache.cassandra.db.commitlog.CommitLogSegmentManager$1.runMayThrow(CommitLogSegmentManager.java:153) ~[apache-cassandra-2.1.22.jar:2.1.22]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-2.1.22.jar:2.1.22]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_151]

I've double-checked the permissions on the commitlog directory and they seem fine. This Apigee installation has been running without problems for over a year. One thing to note: around this time DC-2 was down because one of its Cassandra nodes had disk space issues. The commitlog directory on that DC-2 node was full, so the node could not be started.

A similar error appears to have been fixed in the past, and I believe the Cassandra version that Apigee is running (2.1.22, per the stack trace above) should already include that fix:
https://issues.apache.org/jira/browse/CASSANDRA-10377

Any recommendations for other steps I can take to diagnose a possible cause?
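
One diagnostic that might narrow this down is to pull the segment name out of the AssertionError and compare it with what is actually in the commit log directory. The following is only a minimal sketch: the directory path is an assumed OPDK default (it does not appear in the log above), and the function names are hypothetical helpers, not part of Cassandra or Apigee.

```python
import re
from pathlib import Path

# ASSUMPTION: default OPDK commit log location; check your own install.
COMMITLOG_DIR = Path("/opt/apigee/data/apigee-cassandra/commitlog")

# Matches the message format seen in the AssertionError in the log.
SEGMENT_RE = re.compile(
    r"attempted to delete non-existing file (CommitLog-\d+-\d+\.log)"
)


def missing_segment(log_text):
    """Return the segment file name named in the AssertionError, or None."""
    match = SEGMENT_RE.search(log_text)
    return match.group(1) if match else None


def segment_report(commitlog_dir, segment_name):
    """Report whether the segment exists and list the segments that do."""
    directory = Path(commitlog_dir)
    segments = sorted(p.name for p in directory.glob("CommitLog-*.log"))
    return {
        "segment": segment_name,
        "exists": segment_name in segments,
        "segments": segments,
    }
```

Calling `segment_report(COMMITLOG_DIR, missing_segment(log_text))` on the affected node would show whether the file named in the error is really gone and which segments remain, which is useful detail to attach to a support ticket.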

ACCEPTED SOLUTION

This looks like a consistency problem across the two data centers.

/opt/apigee/data/apigee-cassandra/data/cache/cache_sequence_id_r24-2708b0f0308711e9b3b3cddb55735b1c/cache-cache_sequence_id_r24-tmp-ka-47-Data.db

This path points to actual Cassandra data on disk. Maybe some updates happened in DC-1 while the DC-2 nodes were down. Try running a repair on the nodes, and open a support ticket on the subject.
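
A rough sketch of the repair step, in case it helps. It assumes nodetool lives at the usual OPDK location (that path is an assumption, so verify it on your nodes) and that a primary-range repair (`-pr`) run on each node in turn is what you want. The helper names are made up for illustration. Repair generally needs the replica nodes to be reachable, so it is probably best run once DC-2 is back up.

```python
import subprocess

# ASSUMPTION: usual OPDK location of nodetool; verify on your nodes.
NODETOOL = "/opt/apigee/apigee-cassandra/bin/nodetool"


def build_repair_command(host="localhost", primary_range_only=True):
    """Build the nodetool repair command line as an argument list."""
    cmd = [NODETOOL, "-h", host, "repair"]
    if primary_range_only:
        # -pr repairs only the token ranges this node is primary for.
        cmd.append("-pr")
    return cmd


def run_repair(host="localhost", primary_range_only=True):
    """Run the repair on one node and return nodetool's exit code."""
    result = subprocess.run(
        build_repair_command(host, primary_range_only), check=False
    )
    return result.returncode
```

Run `run_repair()` on one Cassandra node at a time and check the exit code and the Cassandra system log before moving to the next node.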
