Ops Agent - logging loop devices - how to stop?

We recently started installing the new Ops Agent, and were prompted to set up some alerts, one of which was for Disk Usage (which was a great idea, since we filled a boot drive today and didn't know about it 😞 )

Unfortunately, we got spammed about 10 minutes later by the alert as it triggered on all of the "/dev/loop" devices, which are always 100% full.

I read in the docs that tmpfs devices would be ignored, but it seems loop devices were not considered.

So, I wanted to exclude loop devices right away, instead of trying to filter them from the dashboards (which is a pain). (Conversely, the only real disk we use is root, so we could just explicitly include that, I suppose)

Unfortunately, the docs about setting up the /etc/google-cloud-ops-agent/config.yaml are not terribly clear.

Can I even do this?

If so, what would be the format?

Pointers or samples would be appreciated. 

Thanks!

Dion


FYI... have you reviewed the following instructions?

https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/configuration#default

Looks like you may be able to use the following format to exclude certain metrics:

processors:
    metrics_filter:
      type: exclude_metrics
      metrics_pattern: []

I did review that document, yes, but it appears to offer only a way to exclude an entire metric, e.g.:

agent.googleapis.com/processes/*

And this document:

https://cloud.google.com/monitoring/api/metrics_opsagent#agent-disk

Hints at stuff:

bytes_used GA
Disk bytes used
GAUGE, DOUBLE, By
aws_ec2_instance, gce_instance
Current number of disk bytes used by state. Summing the values of all states yields the total available disk space. Linux only. Sampled every 60 seconds.
device: Device name.
state: Type of usage, one of [free, used, reserved].

But there's no clear example of using it (that I could find).

And, in those examples, it seems it's not giving a method to say "exclude the device", just "exclude the metric".

What am I missing?

Thanks,

Dion

I see... "exclude the device" is not supported yet in the current version of Ops Agent. 

Oh, good! I'm not totally losing it.

Any suggestions on a workaround to accomplish this in the interim? These loop devices will *always* be 100% (can I send feedback somewhere to have them included with tmpfs to be ignored by default?)

Thanks,

Dion

This is a known feature request, so no need to send any feedback for now.

As for a workaround, I haven't verified this for your specific use case, but you may want to try "custom metrics": https://cloud.google.com/monitoring/custom-metrics/creating-metrics

 

Thanks

The same thing happened to me. I got Ops Agent installed (finally; long story, it required a lot of trial and error and persistence), and I was interested in monitoring disk usage. I got alerts: "VM disk utilization too high". But the only alerts were for /dev/loop devices that exceeded the 95% threshold, which is expected because /dev/loop devices are apparently mount points attached to snapd services.

I edited the policy named "VM disk utilization too high". There are three fields: filter, comparator, value. For the first, open the drop-down and choose "device"; for the second, open the drop-down and choose "!has_string" (does not have the string); and for value, type in "dev/loop". The policy then only opens incidents for devices that reach 95% utilization AND do not have the string "dev/loop" in their name. All of my incidents were for devices at 95% whose names contained "dev/loop", so once the second condition was saved to the policy, all of them were automatically resolved and closed. I expect I'll now only get an alert if the persistent disk utilization reaches 95%.
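The combined condition described above can be sketched in a few lines of Python. This is only an illustration of the logic (over the threshold AND device name does not contain "dev/loop"), not how Cloud Monitoring evaluates policies internally; the sample readings and the `should_alert` helper are hypothetical.

```python
# Hypothetical sample readings: (device label, percent used).
readings = [
    ("/dev/loop0", 100.0),   # snap loop device, always full
    ("/dev/loop3", 100.0),
    ("/dev/sda1", 42.0),
    ("/dev/sda15", 97.5),
]


def should_alert(device: str, percent_used: float, threshold: float = 95.0) -> bool:
    """Mimic the edited policy: over the threshold AND not a loop device."""
    return percent_used > threshold and "dev/loop" not in device


# Only real disks that are over the threshold remain.
alerts = [device for device, pct in readings if should_alert(device, pct)]
print(alerts)
```

With these made-up readings, the two loop devices are dropped even though they sit at 100%, and only the real disk above the threshold is left.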

Hi,

Have you tried excluding /dev/loop in this way? I don't have any /dev/loop devices myself; however, I exclude devices in exactly this way in our alerting policies.

[Image: DamianS_0-1682062008677.png]

cheers,
DamianS

 

So my policy looks as follows:

{
  "name": "projects/um-monitoring-webapp-wordpress/alertPolicies/17650320439033901987",
  "displayName": "Crit disk usage",
  "documentation": {},
  "userLabels": {},
  "conditions": [
    {
      "name": "projects/um-monitoring-webapp-wordpress/alertPolicies/17650320439033901987/conditions/17650320439033904538",
      "displayName": "CRITICAL VM Instance - Disk utilization",
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "600s",
            "crossSeriesReducer": "REDUCE_MEAN",
            "groupByFields": [
              "metric.label.device"
            ],
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "0s",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND (metric.labels.device != monitoring.regex.full_match(\"/dev/loop\") AND metric.labels.state = \"used\")",
        "thresholdValue": 90,
        "trigger": {
          "count": 1
        }
      }
    }
  ],
  "alertStrategy": {
    "autoClose": "604800s"
  },
  "combiner": "OR",
  "enabled": true,
  "notificationChannels": [
    "projects/um-monitoring-webapp-wordpress/notificationChannels/13716784830169285425"
  ],
  "creationRecord": {
    "mutateTime": "2023-04-21T07:31:40.689443240Z",
    "mutatedBy": "damian."
  },
  "mutationRecord": {
    "mutateTime": "2023-04-21T07:31:40.689443240Z",
    "mutatedBy": "damian."
  }
}
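One thing worth checking before copying the filter above: the function name `monitoring.regex.full_match` suggests the pattern has to match the whole device label, and real loop devices carry a numeric suffix (`/dev/loop0`, `/dev/loop1`, ...). The sketch below uses Python's `re.fullmatch` as a stand-in to show why a bare `/dev/loop` pattern might not exclude them. This is an assumption about the semantics, not something verified against Cloud Monitoring, and the `is_excluded` helper and device names are hypothetical.

```python
import re


def is_excluded(device: str, pattern: str) -> bool:
    """Return True when the pattern matches the ENTIRE device label."""
    return re.fullmatch(pattern, device) is not None


devices = ["/dev/loop0", "/dev/loop12", "/dev/sda1"]

literal = r"/dev/loop"         # pattern as written in the policy above
with_suffix = r"/dev/loop\d+"  # also covers the trailing device number

# Devices that would still be monitored under each pattern.
kept_literal = [d for d in devices if not is_excluded(d, literal)]
kept_suffix = [d for d in devices if not is_excluded(d, with_suffix)]

print(kept_literal)
print(kept_suffix)
```

Under full-match semantics the bare pattern excludes nothing here, while the pattern with a numeric suffix leaves only the real disk. If the alert still fires on loop devices, a pattern along those lines may be worth trying.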