Error 500 Hosts are either not healthy or not reachable

Hi,

At least once a day, when I call my API, I receive a 500 error:

{  
  "fault": {  
    "faultstring": "Hosts are either not healthy or not reachable",  
    "detail": {  
      "errorcode": "Internal Server Error"  
    }  
  }
}

If I look at the Trace for this call, it seems that the error occurs even before any request to my backend. It looks like the error occurs on Apigee's system.

Do you know how to solve or prevent it?

Thanks


@rmishra

More details on the error, from the Trace:

Recorded May 18, 2018 9:00:38 AM
error: Hosts are either not healthy or not reachable
error.cause: All host(s) tried for query failed (tried: /10.133.82.4:9042 (com.datastax.driver.core.TransportException: [/10.133.82.4:9042] Connection has been closed), ruby03.service.europe-west1.production.consul.apigee.net/10.133.82.10:9042 (com.datastax.driver.core.TransportException: [ruby03.service.europe-west1.production.consul.apigee.net/10.133.82.10:9042] Connection has been closed))
error.class: io.apigee.common.exception.CpsException
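
To see at a glance which Cassandra nodes were tried, the error.cause string can be split into its host entries. The following is only a rough sketch: the function name hosts_tried and the regular expression are my own, and the pattern assumes the message keeps the "name/ip:port (" layout seen in the trace above.

```python
import re

# Hypothetical helper, not part of Apigee or the DataStax driver.
# Each failed host in error.cause starts with "tried: " or ", " followed by
# an optional hostname, a slash, an IPv4 address, a port and " (".
HOST_RE = re.compile(r"(?:tried: |, )([\w.\-]*)/(\d{1,3}(?:\.\d{1,3}){3}):(\d+) \(")


def hosts_tried(cause):
    """Return a list of (hostname or None, ip, port) found in an error.cause string."""
    return [(name or None, ip, int(port)) for name, ip, port in HOST_RE.findall(cause)]
```

Applied to the error.cause above, this yields two entries, both on port 9042, which is the default port for Cassandra's native (CQL) protocol.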

The error occurs right after a Quota policy. These are the Quota policy variables from a call where the error follows:

ratelimit.Quota.allowed.count 0
ratelimit.Quota.available.count 0
ratelimit.Quota.class.allowed.count 0
ratelimit.Quota.class.available.count 0
ratelimit.Quota.class.exceed.count 0
ratelimit.Quota.class.total.exceed.count 0
ratelimit.Quota.class.used.count 0
ratelimit.Quota.datastore.fail.open false
ratelimit.Quota.exceed.count 0
ratelimit.Quota.expiry.time 0
ratelimit.Quota.failed false
ratelimit.Quota.fault.cause
ratelimit.Quota.fault.name
ratelimit.Quota.total.exceed.count 0
ratelimit.Quota.used.count 0

And these are the variables from the same Quota policy on a call with no error afterwards:

ratelimit.Quota.allowed.count 10000
ratelimit.Quota.available.count 9713
ratelimit.Quota.class.allowed.count 0
ratelimit.Quota.class.available.count 0
ratelimit.Quota.class.exceed.count 0
ratelimit.Quota.class.total.exceed.count 0
ratelimit.Quota.class.used.count 0
ratelimit.Quota.datastore.fail.open false
ratelimit.Quota.exceed.count 0
ratelimit.Quota.expiry.time 1527811200000
ratelimit.Quota.failed false
ratelimit.Quota.fault.cause
ratelimit.Quota.fault.name
ratelimit.Quota.identifier (...)
ratelimit.Quota.total.exceed.count 0
ratelimit.Quota.used.count 287

@Alex R.

This is not an issue with your target servers; it is an issue with Cassandra. It seems that when the Message Processor queries Cassandra (probably to evaluate the Quota), it cannot get a connection to any server in the ring. Because the Cassandra access fails, the client gets an error. The error message is highly misleading.

Are you running any background processes (nodetool repair, for example) on Cassandra at the time the error occurs? Check Cassandra's resource utilization (CPU, RAM) and health when this error occurs.
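
Until the Cassandra side is sorted out, a caller could work around the intermittent failure by retrying when it sees this specific fault. The sketch below is only an illustration under assumptions: send is a hypothetical callable that performs the API call and returns (status, body), the function names and the backoff values are arbitrary, and retrying is only safe if the request is idempotent.

```python
import json
import time

# The faultstring quoted in the original question.
FAULT = "Hosts are either not healthy or not reachable"


def is_datastore_fault(status, body):
    """True when a 500 response carries the faultstring from this thread."""
    if status != 500:
        return False
    try:
        return json.loads(body)["fault"]["faultstring"] == FAULT
    except (ValueError, KeyError, TypeError):
        # Body is not JSON or does not have the expected fault structure.
        return False


def call_with_retry(send, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it stops returning the fault or attempts run out.

    send is a hypothetical zero-argument callable returning (status, body).
    The delay doubles after each failed attempt (0.5 s, 1 s, ...).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        last_attempt = attempt == attempts - 1
        if last_attempt or not is_datastore_fault(status, body):
            return status, body
        sleep(base_delay * 2 ** attempt)
```

This only hides the symptom from the client; the underlying connection failures to Cassandra still need to be investigated.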

@rmishra

How do you check Cassandra on Apigee?

Are you using Edge Private Cloud?

  • If yes, reach out to your Ops team and ask them for the OS and JMX statistics and the Cassandra log files from the time this error occurs. Also ask them when anti-entropy maintenance (repair) is scheduled in Cassandra. If the scheduled times on the nodes are too close together, it's possible that the second node starts its repair before the first one has finished. That could cause behavior like what you are experiencing.
  • If no, please reach out to Apigee support and describe your problem.

No, I'm not on Edge Private Cloud.

I will contact support.

Thank you