Hi,
I'm sure this has a very simple answer, but I'm struggling to find it. I want to create an alert that triggers when one metric's value is greater than another metric's value. For example, the equivalent alert in Prometheus would read:
sum by (name) (scheduler_job_successive_failures > scheduler_job_failure_tolerance)
So I'd want the metric to be scheduler_job_successive_failures, but the threshold to be the value of scheduler_job_failure_tolerance.
The threshold always seems to have to be a constant (e.g. 0.5). I've considered making the metric the difference
scheduler_job_successive_failures - scheduler_job_failure_tolerance
and then setting the threshold to be > 0, but I cannot see a way to express that in MQL either.
Hi @CrispinVeall,
Welcome to the Google Cloud Community!
Can you try running the query below?
fetch gce_instance::your_metric_name.scheduler_job_successive_failures
| join fetch gce_instance::your_metric_name.scheduler_job_failure_tolerance
| every 1m
| group_by [resource.name], [sum(scheduler_job_successive_failures), sum(scheduler_job_failure_tolerance)]
| eval diff = sum(scheduler_job_successive_failures) - sum(scheduler_job_failure_tolerance)
| condition diff > 0
You may view this documentation on how to set up the alert.
I hope this helps. Thank you. 😃
Thanks @lawrencenelson - I can't decide if that is more or less difficult to read than the eventual solution I stumbled across:
fetch prometheus_target
| { metric 'prometheus.googleapis.com/scheduler_job_successive_failures/gauge' ; metric 'prometheus.googleapis.com/scheduler_job_failure_tolerance/gauge' }
| outer_join 0
| sub
| group_by 5m, [value_scheduler_job_successive_failures_mean: mean(val())]
| every 5m
| condition val() > 1
They both seem to work - I don't know if my solution contains any bad practice?
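For anyone comparing the two queries, here is a minimal Python sketch of the arithmetic that the join, subtract, and condition steps perform. It is not MQL and does not call any Cloud Monitoring API. The job names and values are made up, and it assumes a series missing on either side is treated as 0, which is a simplification of what outer_join 0 does. It also leaves out the 5m mean alignment.

```python
def outer_join_sub(left, right, fill=0.0):
    """Subtract right from left per label key; a side with no value uses `fill`."""
    keys = sorted(set(left) | set(right))
    return {k: left.get(k, fill) - right.get(k, fill) for k in keys}


def firing(diffs, threshold):
    """Return the keys whose difference is strictly above the threshold."""
    return [k for k, v in diffs.items() if v > threshold]


# Hypothetical sample values, keyed by job name.
failures = {"job-a": 5.0, "job-b": 1.0, "job-c": 2.0, "job-d": 4.0}
tolerance = {"job-a": 3.0, "job-b": 3.0, "job-d": 3.0}  # job-c has no tolerance series

diffs = outer_join_sub(failures, tolerance)
print(diffs)             # {'job-a': 2.0, 'job-b': -2.0, 'job-c': 2.0, 'job-d': 1.0}
print(firing(diffs, 0))  # ['job-a', 'job-c', 'job-d']
print(firing(diffs, 1))  # ['job-a', 'job-c']
```

Two things stand out in this sketch. A condition of > 0 matches the original "failures > tolerance" comparison, whereas > 1 only fires once failures exceed the tolerance by more than one (job-d is skipped). And with a fill value of 0, a job that has no tolerance series (job-c) fires as soon as it has any failures, which may or may not be what you want.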