Re: CloudRun Job replace failing / CloudScheduler failing to execute Job

robinboening · 11-17-2022 08:49 AM

I have 5 CloudRun Jobs and CloudScheduler configured to execute them.

Since today, 17 November 2022, at 3:19 am UTC, the CloudScheduler has been repeatedly failing to execute one specific Job. All other Jobs are triggered just fine. I was asleep at that time, so no changes were made to the Job or Scheduler configuration.

This is the CloudScheduler log entry (it repeats every minute, because the Job is scheduled to run every minute):

{
  "@type": "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished",
  "jobName": "projects/xxx/locations/europe-west6/jobs/xxx",
  "status": "INVALID_ARGUMENT",
  "targetType": "HTTP",
  "url": "https://europe-north1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/xxx/jobs/xxx:run"
}
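
For anyone who wants to spot these failures programmatically, here is a minimal Python sketch. It assumes the log payloads have already been exported as JSON objects shaped like the entry above; the helper name `attempts_with_status` is made up for illustration and is not part of any Google library.

```python
import json

# The "@type" value is copied from the log entry above.
ATTEMPT_FINISHED = "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"


def attempts_with_status(payloads, status="INVALID_ARGUMENT"):
    """Return the AttemptFinished payloads whose "status" equals `status`.

    `payloads` is an iterable of dicts parsed from exported log entries
    (assumption: same shape as the entry shown in this post).
    """
    return [
        p for p in payloads
        if p.get("@type") == ATTEMPT_FINISHED and p.get("status") == status
    ]


# The entry from this post, parsed from its JSON text.
sample = json.loads("""
{
  "@type": "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished",
  "jobName": "projects/xxx/locations/europe-west6/jobs/xxx",
  "status": "INVALID_ARGUMENT",
  "targetType": "HTTP",
  "url": "https://europe-north1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/xxx/jobs/xxx:run"
}
""")

for payload in attempts_with_status([sample]):
    # Print which scheduler job failed and with which status.
    print(payload["jobName"], payload["status"])
```

Grouping the matches by `jobName` would show whether only one scheduler job is affected, as was the case here.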

I checked the Job's details page and saw an error in a banner at the top of the page. I forgot to copy it, but it read something like this: "Resource 'xxx-6w6dd' of kind 'EXECUTION' does not exist. Resource readiness deadline exceeded."

I tried to redeploy (replacing the Job) using the gcloud CLI, and it failed with the same error:

ERROR: (gcloud.alpha.run.jobs.replace) Resource 'xxx-6w6dd'
of kind 'EXECUTION' in region 'europe-north1' in project 'xxx' does not exist.
Resource readiness deadline exceeded.

I can only guess, but to me it looks like it wants to replace a specific instance that does not exist.

And indeed, the last successful execution of that Job, at 03:18 am, had a different instance ID, "xxx-4hjvn". So my best guess is that "xxx-6w6dd" never existed.

To me it sounds like a bug, but I'd be happy to hear from anyone who can tell me what I am doing wrong.

Thank you!

robinboening

As a workaround, I was able to delete the Job in the browser console and redeploy. It is working again, but it leaves a bad taste in my mouth, as it could happen again at any time.

gautier-gdx

Hello, we had the exact same problem yesterday. Redeploying our jobs fixed it, but now it's happening again. Do you still have the problem?

robinboening

After I deleted the entire Job and redeployed it, the issue was gone and has not reappeared for me. In another project with the exact same Jobs and configuration, this issue did not happen.

robinboening

I am now experiencing the same problem again.

The CloudScheduler is failing to execute one specific CloudRun Job and is logging INVALID_ARGUMENT. The Job's history (run/jobs/details) shows a banner saying "The service has encountered an internal error. Please try again later. Resource readiness deadline exceeded."

This time I got no error when I redeployed, so I didn't need to delete the Job.

knet

Hi, is this for the new service you created when the first one stopped working?

We're looking into this issue, thank you for raising it.

robinboening

Thanks for your response!

This time it was a different Job that couldn't be triggered.

knet

The issue seems to mostly impact jobs with many executions in their history (10K+, or even 100K+). If you have mission-critical applications running in Cloud Run jobs with a very long execution history, you could consider proactively creating a new job and running that one instead to avoid this issue.

I'm sorry for the inconvenience here. Cloud Run jobs are in Preview; we try to keep our Preview features as stable as possible but sometimes we do get issues like this during Previews.
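
The suggestion above to proactively create a new job could be turned into a simple check, sketched here in Python. This is only an illustration under stated assumptions: the threshold is taken from the "10K+" figure mentioned above and is not a documented limit, the date-suffix naming scheme is hypothetical, and fetching the real execution count from Cloud Run is left out.

```python
from datetime import date

# Assumption: threshold based on the "10K+" figure in this thread,
# not an official Cloud Run limit.
HISTORY_THRESHOLD = 10_000


def should_rotate(execution_count, threshold=HISTORY_THRESHOLD):
    """Return True when a job's execution history is long enough to
    consider replacing it with a freshly created job."""
    return execution_count >= threshold


def next_job_name(base_name, today):
    """Hypothetical naming scheme: append the date as a YYYYMMDD suffix."""
    return f"{base_name}-{today:%Y%m%d}"


# Example: a job named "xxx" with 12,500 executions in its history.
if should_rotate(12_500):
    print(next_job_name("xxx", date(2022, 11, 17)))
```

The CloudScheduler target URL would also need to point at the new job name after such a switch.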

robinboening

Thank you, that's good to know. I'll try that!
