GCP Batch Jobs - queue/scheduled time

nechtobolshee · 01-11-2024 03:58 AM

A couple of days ago I ran into a problem: the jobs I submitted took a very long time to start (they stayed in the “queued” or “scheduled” state). This had never happened before; the average startup time was about 10 minutes. Now the wait has grown, and some jobs start only after 12+ hours. What could be causing this, and how can I fix it?

When this happened, I checked the quotas; no limit had been reached anywhere.

P.S. I use a T4 GPU at work; the region is northamerica-northeast1.

wenyhu

Hi @nechtobolshee ,

Usually in this situation, if you call Get or List Jobs, you should see error descriptions in the job's status events that point to the potential issue. If not, would you mind sharing the UIDs of a few stuck jobs so that I can take a look in the meantime?

Thanks!

nechtobolshee

Hi! Thanks for your answer.
I fetched the job's description and found that at times I did not have enough Persistent Disk SSD (GB) quota (as seen on the Quotas page). That looks strange, because the other jobs had finished and their SSD should have been freed.
Is there any way to see in detail what the SSD is being used for?

Wen_gcp

Hi @nechtobolshee ,

May I have your job UID so I can take a closer look? If you can post the job description returned by Get or List Jobs, with any confidential data removed, that would also be helpful!

Thanks!

nechtobolshee

Yep!
UID: vafpafskncxsvyhdc-88cfdd7e-bcd5-40a700

Some of the job's status events:
{
"description":"Quota checking process decides to delay scheduling for the job vafpafskncxsvyhdc-88cfdd7e-bcd5-40a700 due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 500, usage: 394, wanted: 118.], next schedule time 2024-01-19 07:42:49.502334805 -0800 PST m=+109928.679934081.",
"eventTime":"2024-01-19T15:37:49.502425166Z",
"type":"SCHEDULING_INFO"
},
{
"description":"Job state is set from QUEUED to SCHEDULED for job X.",
"eventTime":"2024-01-19T15:42:53.184951325Z",
"type":"STATUS_CHANGED"
},
{
"description":"Job state is set from SCHEDULED to SCHEDULED_PENDING_QUEUED for job X",
"eventTime":"2024-01-19T16:03:25.074093726Z",
"type":"STATUS_CHANGED"
},
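
The numbers in the SCHEDULING_INFO event above can be checked mechanically. Here is a minimal sketch, assuming the description always follows the layout quoted in this thread; the regular expression and the helper name `parse_quota_shortfalls` are illustrative and not part of any Batch API or documented message format.

```python
import re

# Pattern inferred from the event text quoted in this thread. It is an
# assumption about the message layout, not a documented or stable format.
QUOTA_RE = re.compile(
    r"Quota:\s+(?P<name>\w+),\s+"
    r"limit:\s+(?P<limit>\d+(?:\.\d+)?),\s+"
    r"usage:\s+(?P<usage>\d+(?:\.\d+)?),\s+"
    r"wanted:\s+(?P<wanted>\d+(?:\.\d+)?)"
)


def parse_quota_shortfalls(description):
    """Return one dict per quota named in a 'delay scheduling' description."""
    rows = []
    for match in QUOTA_RE.finditer(description):
        limit = float(match.group("limit"))
        usage = float(match.group("usage"))
        wanted = float(match.group("wanted"))
        rows.append({
            "quota": match.group("name"),
            "limit": limit,
            "usage": usage,
            "wanted": wanted,
            # How far the request would overshoot the limit (0 if it fits).
            "shortfall": max(0.0, usage + wanted - limit),
        })
    return rows


description = (
    "Quota checking process decides to delay scheduling for the job "
    "vafpafskncxsvyhdc-88cfdd7e-bcd5-40a700 due to inadequate quotas "
    "[Quota: SSD_TOTAL_GB, limit: 500, usage: 394, wanted: 118.], "
    "next schedule time 2024-01-19 07:42:49.502334805 -0800 PST"
)

for row in parse_quota_shortfalls(description):
    print(f"{row['quota']}: over limit by {row['shortfall']:g}")
```

For this event, 394 + 118 = 512 GB, which exceeds the 500 GB limit by 12 GB, so the job is delayed even though usage alone is below the limit.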

Wen_gcp

Hi @nechtobolshee,

Scheduling was attempted for this job multiple times, which could be why it stayed in QUEUED or SCHEDULED for a long time. The reasons, in order, are:

```
1. 2024-01-19 05:35:55 Inadequate quotas [Quota: SSD_TOTAL_GB, limit: 500, usage: 394, wanted: 118. Quota: NVIDIA_T4_GPUS, limit: 3, usage: 3, wanted: 1.]

2. 2024-01-19 06:51:25 GCE_ZONE_RESOURCE_POOL_EXHAUSTED (Reference: what is zonal stock out?)

3. 2024-01-19 07:02:23 Inadequate quotas [Quota: SSD_TOTAL_GB, limit: 500, usage: 394, wanted: 118.]

4. 2024-01-19 07:47:59 GCE_ZONE_RESOURCE_POOL_EXHAUSTED

5. 2024-01-19 08:44:32 CODE_GCE_ZONE_RESOURCE_POOL_EXHAUSTED

6. 2024-01-19 09:55:36 Inadequate quotas [Quota: SSD_TOTAL_GB, limit: 500, usage: 394, wanted: 118.]

7. 2024-01-19 11:59:57 GCE_ZONE_RESOURCE_POOL_EXHAUSTED

8. 2024-01-19 13:22:13 GCE_ZONE_RESOURCE_POOL_EXHAUSTED

9. 2024-01-19 14:43:19 Inadequate quotas [Quota: SSD_TOTAL_GB, limit: 500, usage: 394, wanted: 118.]

```

It now seems that scheduling is being delayed by a lack of SSD quota. Could you please check the disk quotas to see whether there is enough SSD quota? Also, would increasing the quota be an option for you? Thanks!

nechtobolshee

@Wen_gcp Thank you! Yes, the problem is that the quota limit has been reached. Is there any way to see where the SSD is being used? I would like to understand what is using the resources and how much.

Wen_gcp

Anytime 🙂 Local SSD disks are used by the VMs. You can check how many local SSD disks your job's machine type needs here.

nechtobolshee

@Wen_gcp Hi! Could you help me, please? I increased the SSD quota (it should now be 1 TB), and as a result usage has climbed right up to the new limit. Before the increase the limit was 500 GB and usage had jumped to 394 GB; now the limit is 1 TB and usage has increased to 984 GB.
How can I find out where it is being spent? I have 1 VM and it uses 40 GB, but where did the rest go?

nechtobolshee

I could be wrong, but it looks like this: the instance tries to start and allocates resources (SSD, etc.); when it reaches a resource that is not available, it receives an error.
But the SSD remains marked as in use.

I increased the limit to 4 TB, but the errors are exactly the same 🙂

Wen_gcp

Hi @nechtobolshee, may I know your latest job UID? Is it the same one? Thanks!

echap

@Wen_gcp Hi, this is still an ongoing issue with Batch.

If a GPU is unavailable, the job will cycle between queued and scheduled. This is expected and fine, except that during each scheduling cycle additional SSD quota is consumed, equal to the amount in the job definition. When the job can't allocate the GPU, it returns to the queued state but does not release the SSD quota. This cycle continues until all SSD quota is consumed and the job is never able to run. Even after deleting the batch job, the SSD quota is not released for 6-12 hours.

Attention to this bug would be greatly appreciated. Thank you.
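
The usage figures reported earlier in the thread can be checked against this description with a toy model. It is a sketch under stated assumptions and not Batch's actual implementation: (1) the quota check passes whenever usage + wanted <= limit, (2) every attempt that passes then fails on GPU availability and leaves its SSD request counted as usage, (3) the baseline is the 40 GB used by the one running VM, and (4) 1 TB means 1000 GB. The function name is made up for illustration.

```python
def simulate_leaked_holds(limit_gb, baseline_gb, wanted_gb):
    """Toy model: each failed scheduling attempt keeps `wanted_gb` of quota.

    Returns (attempts_that_passed_the_quota_check, final_usage_gb).
    """
    if wanted_gb <= 0:
        raise ValueError("wanted_gb must be positive")
    usage = baseline_gb
    attempts = 0
    # Keep scheduling while the quota check would still pass.
    while usage + wanted_gb <= limit_gb:
        usage += wanted_gb  # assumed leak: never released after the failure
        attempts += 1
    return attempts, usage


# 500 GB limit, 40 GB baseline, 118 GB wanted per attempt
print(simulate_leaked_holds(500, 40, 118))   # (3, 394)
# After the increase to 1 TB (assumed to be 1000 GB)
print(simulate_leaked_holds(1000, 40, 118))  # (8, 984)
```

Under these assumptions the model stops at 394 GB for the 500 GB limit and at 984 GB for the 1000 GB limit, the same usage values nechtobolshee reported. That is consistent with the leak described above, but it does not prove it.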

Wen_gcp

Hi echap@,

1. May I know the UIDs and regions of the Batch jobs that hit the above issue, so we can investigate them?

2. Also, you can try describing the job. Important events are usually logged in the job status events. If you could provide that information, it would also be very helpful to us. Thanks!

echap

Hi @Wen_gcp, thanks for getting back to me on this.

I have put together a minimal reproducible example based on the example job definition in the documentation.

Here is the job definition:

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks."
                        }
                    }
                ],
                "computeResource": {
                    "cpuMilli": 8000,
                    "memoryMib": 30000,
                    "bootDiskMib": 100000
                },
                "maxRetryCount": 2,
                "maxRunDuration": "3600s"
            },
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": true,
                "policy": {
                    "machineType": "n1-standard-8",
                    "provisioningModel": "STANDARD",
                    "accelerators": [
                        {
                            "type": "nvidia-tesla-t4",
                            "count": 1
                        }
                    ],
                    "disks": [
                        {
                            "deviceName": "additional_disk",
                            "newDisk": {
                                "type": "pd-ssd",
                                "sizeGb": 3000
                            }
                        }
                    ]
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "zones/us-central1-b"
            ]
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

Using this test batch job to answer your questions:

1. UUID: minimal-reprod-ssd-822f2967-bf0a-4ccd0
2. Region: us-central1
This is the (partially redacted) output from running:

gcloud batch jobs describe minimal-reprod-ssd-allocation-20240401-141753 --location us-central1

allocationPolicy:
  instances:
  - installGpuDrivers: true
    policy:
      accelerators:
      - count: '1'
        type: nvidia-tesla-t4
      disks:
      - deviceName: additional_disk
        newDisk:
          sizeGb: '3000'
          type: pd-ssd
      machineType: n1-standard-8
      provisioningModel: STANDARD
  labels:
    batch-job-id: minimal-reprod-ssd-allocation-20240401-141753
  location:
    allowedLocations:
    - regions/us-central1
    - zones/us-central1-b
  serviceAccount:
    email: REDACTED
createTime: '2024-04-01T21:17:54.225011145Z'
logsPolicy:
  destination: CLOUD_LOGGING
name: REDACTED
status:
  runDuration: 0s
  state: QUEUED
  statusEvents:
  - description: Job state is set from QUEUED to SCHEDULED for job projects/REDACTED/locations/us-central1/jobs/minimal-reprod-ssd-allocation-20240401-141753.
    eventTime: '2024-04-01T21:17:57.766324857Z'
    type: STATUS_CHANGED
  - description: "VM in Managed Instance Group meets error: Batch Error: code - CODE_GCE_ZONE_RESOURCE_POOL_EXHAUSTED,\
      \ description - error count is 7, latest message example: Instance 'minimal-reprod-ssd-822f2967-bf0a-4ccd0-group0-0-plpl'\
      \ creation failed: The zone 'projects/REDACTED/zones/us-central1-b'\
      \ does not have enough resources available to fulfill the request.  Try a different\
      \ zone, or try again later."
    eventTime: '2024-04-01T21:39:35.121Z'
    type: OPERATIONAL_INFO
  - description: VMs not functioning within the time window 1080 seconds.
    eventTime: '2024-04-01T21:39:59.921550373Z'
    type: OPERATIONAL_INFO
  - description: Job state is set from SCHEDULED to SCHEDULED_PENDING_QUEUED for job
      projects/REDACTED/locations/us-central1/jobs/minimal-reprod-ssd-allocation-20240401-141753.
    eventTime: '2024-04-01T21:39:59.938074721Z'
    type: STATUS_CHANGED
  - description: Job state is set from SCHEDULED_PENDING_QUEUED to QUEUED for job
      projects/REDACTED/locations/us-central1/jobs/minimal-reprod-ssd-allocation-20240401-141753.
    eventTime: '2024-04-01T21:40:47.164395660Z'
    type: STATUS_CHANGED
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 14:50:51.649865745 -0700 PDT m=+214356.414855742.'
    eventTime: '2024-04-01T21:45:51.649940305Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 14:55:56.114827842 -0700 PDT m=+215098.907538262.'
    eventTime: '2024-04-01T21:50:56.114945351Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:01:00.575580174 -0700 PDT m=+215430.389426365.'
    eventTime: '2024-04-01T21:56:00.575668296Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:06:05.414859704 -0700 PDT m=+215160.110862867.'
    eventTime: '2024-04-01T22:01:05.414956904Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:11:10.885322528 -0700 PDT m=+215575.643588359.'
    eventTime: '2024-04-01T22:06:10.885438378Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:16:14.918018459 -0700 PDT m=+216403.603340216.'
    eventTime: '2024-04-01T22:11:14.918098459Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:21:19.732150066 -0700 PDT m=+216708.237442730.'
    eventTime: '2024-04-01T22:16:19.732303156Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:26:23.83905514 -0700 PDT m=+217115.264477295.'
    eventTime: '2024-04-01T22:21:23.839243250Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:31:29.303001932 -0700 PDT m=+216698.739494837.'
    eventTime: '2024-04-01T22:26:29.303094510Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:36:33.755483807 -0700 PDT m=+44759.027281986.'
    eventTime: '2024-04-01T22:31:33.755572287Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:41:38.743365183 -0700 PDT m=+217307.279607038.'
    eventTime: '2024-04-01T22:36:38.743491068Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:46:43.311081982 -0700 PDT m=+217683.695794508.'
    eventTime: '2024-04-01T22:41:43.311168602Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:51:49.216058823 -0700 PDT m=+218013.981048820.'
    eventTime: '2024-04-01T22:46:49.216128103Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 15:56:54.486013647 -0700 PDT m=+218319.251003634.'
    eventTime: '2024-04-01T22:51:54.486088217Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:02:01.596816769 -0700 PDT m=+219150.282138726.'
    eventTime: '2024-04-01T22:57:01.596897559Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:07:05.686012498 -0700 PDT m=+218818.176830465.'
    eventTime: '2024-04-01T23:02:05.686090058Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:12:09.859799653 -0700 PDT m=+219039.598787647.'
    eventTime: '2024-04-01T23:07:09.859892183Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:17:14.646791875 -0700 PDT m=+219539.411781872.'
    eventTime: '2024-04-01T23:12:14.646883345Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:22:19.267329037 -0700 PDT m=+219819.652041563.'
    eventTime: '2024-04-01T23:17:19.267388247Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:27:23.808267934 -0700 PDT m=+220148.573257921.'
    eventTime: '2024-04-01T23:22:23.808338374Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:32:27.823364324 -0700 PDT m=+220441.290208258.'
    eventTime: '2024-04-01T23:27:27.823450156Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:37:32.029049816 -0700 PDT m=+221306.354588823.'
    eventTime: '2024-04-01T23:32:32.029174329Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:42:36.44816026 -0700 PDT m=+220866.187148214.'
    eventTime: '2024-04-01T23:37:36.448252111Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:47:40.465074124 -0700 PDT m=+221269.901567030.'
    eventTime: '2024-04-01T23:42:40.465167520Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:52:44.688509722 -0700 PDT m=+222296.113931647.'
    eventTime: '2024-04-01T23:47:44.688586832Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 16:57:48.2128431 -0700 PDT m=+49633.484641278.'
    eventTime: '2024-04-01T23:52:48.212909480Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:02:54.694841039 -0700 PDT m=+222167.185659016.'
    eventTime: '2024-04-01T23:57:54.694928349Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:08:01.361443832 -0700 PDT m=+222476.057446895.'
    eventTime: '2024-04-02T00:03:01.361523221Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:13:08.196972282 -0700 PDT m=+222868.769351091.'
    eventTime: '2024-04-02T00:08:08.197066712Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:18:15.491008349 -0700 PDT m=+223743.701457466.'
    eventTime: '2024-04-02T00:13:15.491115359Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:23:22.362773795 -0700 PDT m=+224050.573223032.'
    eventTime: '2024-04-02T00:18:22.362878735Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:28:29.367989063 -0700 PDT m=+223619.106977007.'
    eventTime: '2024-04-02T00:23:29.368097353Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:33:35.984233329 -0700 PDT m=+224035.071707124.'
    eventTime: '2024-04-02T00:28:35.984316179Z'
    type: SCHEDULING_INFO
  - description: 'Quota checking process decided to delay scheduling for the job minimal-reprod-ssd-822f2967-bf0a-4ccd0
      due to inadequate quotas [Quota: SSD_TOTAL_GB, limit: 16000, usage: 15975, wanted:
      3115.], next schedule time 2024-04-01 17:38:37.429751088 -0700 PDT m=+224972.256447834.'
    eventTime: '2024-04-02T00:33:37.429861913Z'
    type: SCHEDULING_INFO
taskGroups:
- name: REDACTED
  parallelism: '1'
  taskCount: '1'
  taskSpec:
    computeResource:
      bootDiskMib: '100000'
      cpuMilli: '8000'
      memoryMib: '30000'
    maxRetryCount: 2
    maxRunDuration: 3600s
    runnables:
    - script:
        text: echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total
          of ${BATCH_TASK_COUNT} tasks.
uid: minimal-reprod-ssd-822f2967-bf0a-4ccd0
updateTime: '2024-04-02T00:33:37.429861913Z'
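
A long event list like the one above (34 SCHEDULING_INFO entries at roughly five-minute intervals) is easier to read as a summary. Here is a minimal sketch, assuming the job was exported with `gcloud batch jobs describe JOB --location REGION --format=json` and loaded into a dict. The helper name `summarize_events` and the abbreviated sample data are illustrative only; the substring checks rely on the message wording seen in this thread.

```python
from collections import Counter


def summarize_events(job):
    """Count status events by type and by the two failure causes seen here."""
    events = job.get("status", {}).get("statusEvents", [])
    by_type = Counter(event.get("type", "UNKNOWN") for event in events)
    descriptions = [event.get("description", "") for event in events]
    return {
        "by_type": dict(by_type),
        # Substrings taken from the event text in this thread (assumed stable).
        "quota_delays": sum("inadequate quotas" in d for d in descriptions),
        "stockouts": sum("ZONE_RESOURCE_POOL_EXHAUSTED" in d for d in descriptions),
    }


# Abbreviated sample shaped like the describe output above.
# For real data: job = json.load(open("job.json")) after exporting with --format=json.
sample_job = {
    "status": {
        "state": "QUEUED",
        "statusEvents": [
            {"type": "STATUS_CHANGED",
             "description": "Job state is set from QUEUED to SCHEDULED for job X."},
            {"type": "OPERATIONAL_INFO",
             "description": "VM in Managed Instance Group meets error: Batch Error: "
                            "code - CODE_GCE_ZONE_RESOURCE_POOL_EXHAUSTED"},
            {"type": "SCHEDULING_INFO",
             "description": "Quota checking process decided to delay scheduling "
                            "due to inadequate quotas [Quota: SSD_TOTAL_GB, "
                            "limit: 16000, usage: 15975, wanted: 3115.]"},
        ],
    }
}

print(summarize_events(sample_job))
```

Comparing the stockout count with the quota-delay count shows at a glance whether a job is mostly waiting on capacity or on quota.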

Wen_gcp

Hi echap@,

It is true that when the job can't allocate the GPU, it returns to the queued state. However, the quota check only queries the project's current quota usage; it does not actually consume quota. In particular, regarding "after deleting the batch job, the SSD quota is not released for 6-12 hours": since the VM was never created, this job should not have caused any real SSD quota consumption.

Could you please confirm that no other jobs or resources are consuming the SSD quota at the same time? May I have a quota usage screenshot like https://www.googlecloudcommunity.com/gc/Infrastructure-Compute-Storage/GCP-Batch-Jobs-queue-schedule... for further investigation? Thanks a lot!

echap

Hi @Wen_gcp,

I have attached three images that show:

1. The entire SSD quota (16 TB) being consumed quickly after the job is submitted.
2. The Disks page of Compute Engine, showing that no significant SSD usage exists.
3. The job page, showing that this job was in the queued state when the other two screenshots were taken.

I'm thinking the quota check must be putting some type of hold on the SSD quota (which appears as real consumption). I am guessing this was done to avoid a race condition where two different jobs could pass the quota check at the same time and result in double quota use. The job UUID for this new job can be seen in the third image if you require it again.

Thank you

Wen_gcp

Thanks for the information @echap! The quota check is purely a read of the project's quota status and won't hold quota. I am not sure whether this could be caused by a zonal stockout; let me ask around and get back to you. Thanks!

echap

Hi @Wen_gcp,

I hope you are having a good week.

I wanted to follow up on whether there is any new information about this issue. As of today, the bug still exists and is wreaking havoc on our project quotas. If this can't be addressed we will unfortunately need to migrate away from using Batch completely.

Shamel

Hi @echap, the quota is showing unintended behavior in some cases. We are investigating a fix and will follow up shortly on when it will be addressed.

echap

Hi @Shamel, thank you for the update.

Shamel

@echap - The dependent team is looking into a solution for this rare PD quota issue, and we are awaiting more details on the timeline. In the meantime, a potential short-term mitigation that may help, but will not solve the issue, is to add the job label "goog-batch-skip-quota-check": "true" to your config. The job will then be scheduled without verifying quota upfront; if a stockout happens, the job will be put back into the QUEUED state and will retry provisioning the requested capacity. Below is a snippet showing the label.

"allocationPolicy": {
    "location": {
        "allowed_locations": ["zones/us-central1-a"]
    },
    "instances": [
        {
            "installGpuDrivers": true,
            "policy": {
                "machineType": "a3-highgpu-8g"
            }
        }
    ]
},
"labels": {
    "goog-batch-skip-quota-check": "true"
},
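
For jobs whose config is generated by a script, the label can be added programmatically. Here is a minimal sketch, assuming the job config is held as a plain dict before being written to the JSON file passed to `gcloud batch jobs submit ... --config`. The helper name `with_skip_quota_check` and the abbreviated sample config are illustrative; only the label key and value come from the post above.

```python
import json

# Label key and value as given in this thread.
SKIP_QUOTA_CHECK_LABEL = "goog-batch-skip-quota-check"


def with_skip_quota_check(job_config):
    """Return a copy of the job config with the skip-quota-check label set."""
    updated = dict(job_config)
    labels = dict(updated.get("labels", {}))  # keep any existing labels
    labels[SKIP_QUOTA_CHECK_LABEL] = "true"
    updated["labels"] = labels
    return updated


# Abbreviated job config shaped like the examples in this thread.
job_config = {
    "allocationPolicy": {
        "location": {"allowedLocations": ["zones/us-central1-b"]},
        "instances": [
            {
                "installGpuDrivers": True,
                "policy": {"machineType": "n1-standard-8"},
            }
        ],
    },
    "labels": {"team": "example"},  # hypothetical pre-existing label
}

print(json.dumps(with_skip_quota_check(job_config), indent=4))
```

The helper copies the top-level dict and the labels dict, so the original config object is left unchanged and other labels are preserved.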