Alerting using metric as threshold

CrispinVeall · 11-16-2023 07:44 AM

Hi,
I'm sure this has a very simple answer, but I'm struggling to find it. I want to create an alert that triggers when one metric's value is greater than another metric's value. As an example, the equivalent alert in Prometheus would read:

sum by (name) (scheduler_job_successive_failures > scheduler_job_failure_tolerance)

So I'd want the metric to be scheduler_job_successive_failures, but the threshold to be the value from scheduler_job_failure_tolerance.

The threshold always seems to be a constant (e.g. 0.5). I've considered making the metric the difference

scheduler_job_successive_failures - scheduler_job_failure_tolerance

and then setting the threshold to be > 0, but I cannot see a way to express that in MQL either.
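
To make the idea concrete, here is a minimal Python sketch (not MQL) of why comparing one metric against another is the same as comparing their difference against the constant 0. The sample values, the job names, and both helper functions are hypothetical and exist only for illustration; series are matched on the name label, as in the Prometheus expression above.

```python
# Hypothetical samples keyed by the `name` label; the values are made up.
failures = {"job-a": 3, "job-b": 1, "job-c": 0}
tolerance = {"job-a": 2, "job-b": 5, "job-c": 0}


def firing_by_comparison(failures, tolerance):
    """Names whose failure count exceeds their own tolerance."""
    return sorted(
        name
        for name in failures
        if name in tolerance and failures[name] > tolerance[name]
    )


def firing_by_difference(failures, tolerance, threshold=0):
    """Same check, expressed as (failures - tolerance) > constant threshold."""
    diffs = {
        name: failures[name] - tolerance[name]
        for name in failures
        if name in tolerance
    }
    return sorted(name for name, diff in diffs.items() if diff > threshold)


print(firing_by_comparison(failures, tolerance))  # ['job-a']
print(firing_by_difference(failures, tolerance))  # ['job-a']
```

Both functions select the same series, so an alerting system that only accepts a constant threshold can still express the rule once the two metrics are joined and subtracted.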

lawrencenelson

Hi @CrispinVeall,

Welcome to the Google Cloud Community!

Can you try running the query below?

fetch gce_instance::your_metric_name.scheduler_job_successive_failures
| join fetch gce_instance::your_metric_name.scheduler_job_failure_tolerance
| every 1m
| group_by [resource.name], [sum(scheduler_job_successive_failures), sum(scheduler_job_failure_tolerance)]
| eval diff = sum(scheduler_job_successive_failures) - sum(scheduler_job_failure_tolerance)
| condition diff > 0

You may view this documentation on how to set up the alert.

I hope this helps. Thank you. 😃

CrispinVeall

Thanks @lawrencenelson - I can't decide if that is more or less difficult to read than the eventual solution I stumbled across:

fetch prometheus_target
| { metric 'prometheus.googleapis.com/scheduler_job_successive_failures/gauge' ; metric 'prometheus.googleapis.com/scheduler_job_failure_tolerance/gauge' }
| outer_join 0
| sub
| group_by 5m, [value_scheduler_job_successive_failures_mean: mean(val())]
| every 5m
| condition val() > 1

They both seem to work. Does my solution contain any bad practice?
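
One way to reason about the join-and-subtract steps is to simulate them outside MQL. The Python sketch below assumes, for illustration only, that the default of 0 stands in for whichever side of the join has no matching series (worth checking against the MQL reference). The helper outer_join_sub and the sample data are hypothetical.

```python
def outer_join_sub(left, right, default=0):
    """Join two label->value maps on their keys and subtract right from left.

    Assumption: a series missing from either side is replaced by `default`.
    """
    keys = set(left) | set(right)
    return {k: left.get(k, default) - right.get(k, default) for k in keys}


# Hypothetical samples; job-d has no tolerance series at all.
failures = {"job-a": 4, "job-b": 2, "job-d": 1}
tolerance = {"job-a": 2, "job-b": 5}

diff = outer_join_sub(failures, tolerance)

# Threshold 0 corresponds to "failures > tolerance".
firing_gt_0 = sorted(k for k, v in diff.items() if v > 0)
# Threshold 1 only fires once failures exceed tolerance by more than one.
firing_gt_1 = sorted(k for k, v in diff.items() if v > 1)

print(firing_gt_0)  # ['job-a', 'job-d']
print(firing_gt_1)  # ['job-a']
```

Two things stand out under these assumptions. A job with no tolerance series is treated as having a tolerance of 0, so any failure makes it fire. And the condition val() > 1 is stricter than the Prometheus expression, which corresponds to a threshold of 0; if the goal is to match that expression exactly, the threshold may need adjusting.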