Persistent Disk SSD approaching quota while running the hello world GPU batch container job

I'm trying to reproduce the example GPU batch job. I'm noticing that when the job is submitted and queued (via the CLI example in the link), my Persistent Disk SSD (GB) usage in region us-central1 increases rapidly and reaches my quota limit (500GB). At this point the job hangs in the queue and does not run due to the quota limit. Furthermore, the reported usage remains high for about 8 hours.

Any idea what would cause this runaway disk quota consumption (as I understand it, this job runs with a 30GB SSD attached)? I have tried listing all of my disks and snapshots using gcloud commands, but none show up, I have no instances with attached disks in this region, and the quota usage remains high even after some time has passed, which seems to rule out delayed updates.
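For reference, I checked with commands along these lines (the project ID here is just a placeholder):

# List all persistent disks and snapshots in the project (nothing shows up for me)
gcloud compute disks list --project=my-project-id
gcloud compute snapshots list --project=my-project-id

# Confirm there are no instances (and hence no attached disks) in the region's zones
gcloud compute instances list --filter="zone~us-central1"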

I do have all of the "base" images (e.g. cos-101-17162-210-54), but these shouldn't count against a quota, correct? I think I made a mistake by setting the compute region (as opposed to a zone), locations/us-central1, as an allowed location (I have unlimited quota in zone us-central1-f but not in region us-central1), but as I understand it this doesn't explain the runaway consumption. Furthermore, if the issue really is the zoning, I want to be able to "clear" the usage and fix the problem ASAP.
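To double-check what the region-level quota actually reports, a command along these lines should print the SSD_TOTAL_GB limit and current usage (the exact quota metric names in the output may vary):

# Show quota limits and current usage for the region, including SSD_TOTAL_GB
gcloud compute regions describe us-central1 --format="yaml(quotas)"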

Any help would be greatly appreciated!


If you want to fix the issue ASAP, I think the best way is to delete the job and resubmit it with the desired zone in allowedLocations.
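Roughly like this (the job name, location, and config file are placeholders; depending on your gcloud version these commands may need the beta component):

# Delete the stuck job, then resubmit with "zones/us-central1-f" in allowedLocations
gcloud batch jobs delete my-job --location=us-central1
gcloud batch jobs submit my-job --location=us-central1 --config=job-with-zone.json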

Do you see that the job state is always QUEUED? A QUEUED job should not consume your quota. Also, how many tasks do you have in your job and what machine type did you set? I am trying to understand where the 30GB SSD comes from. If you can post a minimal job spec that reproduces the issue, we can look into it further. In addition, you can also send your job UID to gcp-batch-preview@google.com for us to take a further look.
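If it helps, the job UID can be pulled with something like this (job name and location are placeholders):

# Print the UID of a submitted Batch job
gcloud batch jobs describe my-job --location=us-central1 --format="value(uid)"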

Thanks for your response. I submitted a batch job using a custom container with an appropriate zone and everything seems to run fine.

Reviving this thread, because I am experiencing this issue again.  I sent an email to the link above. To summarize the issues I'm observing:

1. Job state alternates between "Job state is set from SCHEDULED to SCHEDULED_PENDING_QUEUED" and "Job state is set from QUEUED to SCHEDULED" about 4-5 times. 

2. Then, I see the log: "Quota checking process decides to delay scheduling for the job [jobId] due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 4000, usage: 3969...]". The job is not actively running, so I don't understand why quota should be consumed. Is the usage "cached" before the job is actually submitted, and not cleared?

3. I'm running just one task. The runnable is a container with the image saved in GCP Artifact Registry. I've attached the run config here:

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "container": {
                            "imageUri": redacted,
                            "volumes": [
                                "/var/lib/nvidia/lib64:/usr/local/nvidia/lib64",
                                "/var/lib/nvidia/bin:/usr/local/nvidia/bin"
                            ],
                            "options": "--privileged"
                        }
                    }
                ],
                "computeResource": {
                    "cpuMilli": 2000,
                    "memoryMib": 14300,
                    "bootDiskMib": 50000
                },
                "maxRetryCount": 1,
                "environment": {
                    "variables": {
                        "redacted":"redacted"
                    }
                }
            },
            "taskCount": 1,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": true,
                "policy": {
                    "machineType": "n1-standard-4",
                    "accelerators": [
                        {
                            "type": "nvidia-tesla-t4",
                            "count": 1
                        }
                    ]
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "zones/us-central1-f"
            ]
        }
    },
    "logsPolicy": 
    {
        "destination": "CLOUD_LOGGING"
    }
}
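For completeness, I submit the spec above with the gcloud CLI, roughly like this (the job name is a placeholder and the config above is saved as job.json):

# Submit the job spec above; allowedLocations pins it to zones/us-central1-f
gcloud batch jobs submit gpu-test-job --location=us-central1 --config=job.json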


My impression of what is happening: I submit the job to the "scheduler", the "scheduler" tries to allocate a GPU in us-central1-f but finds that none are available, somewhere along the line the requested boot disk MiB gets counted against my quota, and then the job sits in various states between "SCHEDULED" and "SCHEDULED_PENDING_QUEUED", etc. But I really have no idea what is going wrong.
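To watch the state transitions I have been checking the job's recorded status events, roughly like this (job name and location are placeholders):

# Inspect the recorded state transitions for the job
gcloud batch jobs describe gpu-test-job --location=us-central1 --format="yaml(status.statusEvents)"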

Hi @dvoytan ,

There is a zonal stock-out issue (ref: what is zonal stock out) in us-central1-f. It could be caused by a lack of CPU, GPU, or Local SSD capacity. After several retries, the weird error message with usage 3969 showed up; we are looking into the details, but it should not result in any cost. If it does, please let us know.

For a quick workaround, could you please check whether there are enough quotas for these resources in us-central1-f in your project, OR try a different zone with enough quotas?
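For example, to see which zones in the region offer T4 accelerators at all (this lists availability of the accelerator type, not live capacity), something like:

# List zones in us-central1 where nvidia-tesla-t4 accelerators are offered
gcloud compute accelerator-types list --filter="name=nvidia-tesla-t4 AND zone~us-central1"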

Thanks,

Wen

Thanks for the reply Wen!

I appreciate the investigation. I can try other zones and that may be the best way forward in the future. These jobs, however, are not urgent so it's O.K. if they wait in a queue for a while.

What's puzzling to me is why quota is being used before the job actually starts (and, more importantly, not being released, which then blocks the job from being launched for many hours until the resources are eventually freed).

Perhaps my mental model of the batch queue is flawed? My understanding is that the job waits in a queue until resources are available. When they become available and the job launches, resources are consumed (and counted against quotas). The sequence of errors I'm observing suggests to me that quota is consumed before the job runs (it's probably not actually consumed but "marked for consumption"). I don't have a deep enough understanding of how the scheduling system and Compute Engine work under the hood to solve this.

Thanks again for your reply!

It's my pleasure to help. Sorry for the confusion. We run a quota check before scheduling the job, i.e. before updating the job state to "SCHEDULED", to offer a clear picture of which state the job is in after being submitted. (Here are the different states of a job in the context of Batch.) This matches "the job waits in a queue until resources are available". As for "quota is consumed before the job runs", my current guess is that something weird happens during a zonal stock-out. We are checking why the usage of 3969 showed up and will keep you posted.

Thanks,

Wen

Hi @dvoytan ,

I have a couple of questions about the quota usage when this issue happened:

1. Have you checked the GCP quota page by any chance at that time? Did it show unexpected usage?

2. Were there any VMs, jobs, etc. consuming resources during that time besides this job? 

3. You mentioned there is unlimited quota in zone us-central1-f for your project. Does that mean there is unlimited quota for all SSD, GPU, and CPU resources?

Thanks,

Wen

kkt

I'm experiencing a similar issue running an Autopilot GKE cluster with a GPU workload/deployment with just 1 GPU per node.

The first time this happened, when billing info was added and I was able to get quota for 1 GPU, my "Persistent Disk Standard" usage started increasing rapidly towards 4TB while the cluster was trying to scale up to its first GPU node. I didn't find any place that listed where this space was being consumed, and it went away after I deleted the cluster and recreated it.

Now, a little more than a week later, after things had been running fine with 1 GPU node, I see the "Persistent Disk SSD" usage increasing and reaching the 500GB quota, and the node/pod are no longer available and can't be provisioned due to the exceeded quota. Again, I don't see any place where the space is being used, and I'm not being billed for this consumed persistent disk.
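The only places I know to look are the cluster's volume objects and the project's disk list; I've been checking with roughly the following (the gke- name prefix is just how my cluster's disks happen to be named):

# Persistent volumes and claims the cluster knows about
kubectl get pv,pvc --all-namespaces

# Persistent disks in the project; GKE-provisioned ones typically have a gke- name prefix
gcloud compute disks list --filter="name~^gke-"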

I'm hoping I don't have to recreate the cluster to address this issue when it occurs, since it would affect existing ingress to services.
