Compute Engine VM has lost its storage suddenly

My Compute Engine VM hosting a container image has suddenly stopped working, despite appearing to still be running fine. It hasn't been working for about 24 hours now despite everything I've tried. It had been stable since I provisioned it a few years back, apart from the occasional restart when the hosted app needed to be turned off and on again.

I've restarted the VM multiple times and even stopped and started it, which resulted in new public IP addresses being assigned. Nothing has managed to get it working again.

I can ping the public IP but can't reach the hosted application on port 8080, which is allowed in the firewall. After turning on firewall logging, I can see that my allow-http-8080 rule is still being hit without issue.
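
For reference, this is roughly how I enabled logging and re-checked the rule from the CLI (allow-http-8080 is my rule name):

# turn on logging for the existing allow rule
gcloud compute firewall-rules update allow-http-8080 --enable-logging
# confirm it still allows tcp:8080 from the expected source ranges
gcloud compute firewall-rules describe allow-http-8080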

My container application is not stateless, so it requires storage. The VM instance size is e2-medium, so it should have 10 GB of disk. It's deployed to us-west1-b.
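
If it helps, this is roughly how I confirmed the boot disk from my workstation (the instance name below is a placeholder for mine):

# show the disks attached to the instance, including diskSizeGb
gcloud compute instances describe my-container-vm --zone us-west1-b --format="yaml(disks)"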

Looking through the logs I see a few worrying messages that make me suspect storage is an issue:

Warning: Failed to run "google_optimize_local_ssd" script: exec: "google_optimize_local_ssd": executable file not found in $PATH

Warning: Failed to run "google_set_multiqueue" script: exec: "google_set_multiqueue": executable file not found in $PATH

If I try to connect to the VM using SSH, it fails to connect. If I then troubleshoot the connection, it stalls indefinitely at the "User permissions" stage of troubleshooting:

[Screenshot: the SSH troubleshooting dialog stuck at the "User permissions" check]

In the logs I then see the following error:

Error: Error creating user: useradd: failed to reset the lastlog entry of UID 20162: No space left on device
useradd: cannot create directory /home/mail

So it sounds like the VM is no longer able to access the storage it needs.

Nothing has changed on my end; the VM simply stopped working correctly about 24 hours ago.

Any suggestions about how I might be able to resolve this issue?

Solved

5 REPLIES

To my eyes, the core issue is "No space left on device".  I heard you say this VM has been running for years.  I'm imagining that it has been writing log records to its local file system and those logs may not have ever been cleaned up.  This may result in the Linux file system becoming 100% full.  When the file system becomes full, all bets are off.  It means the kernel and other apps can't write to /tmp or other locations.  It could easily also mean that core services like SSH can't function.  What I would do is reboot the machine after enabling serial console login.  See here:

https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-using-serial-console
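
If you prefer the CLI to the console, something along these lines should do it (substitute your instance name; the zone is the one you mentioned):

# allow interactive serial console connections for this instance
gcloud compute instances add-metadata my-container-vm --zone us-west1-b --metadata serial-port-enable=TRUE
# reboot, then attach to serial port 1
gcloud compute instances reset my-container-vm --zone us-west1-b
gcloud compute connect-to-serial-port my-container-vm --zone us-west1-b

Depending on the image, you may also need a local user with a password to get past the serial login prompt; the doc above covers that.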

You should then be able to log in from the Google Cloud console in a simple/basic serial console.  Now we can run Linux commands such as "df" to determine free disk space.  I'd be looking for file systems that are 100% full.  If they are, then find some logs that you don't need, delete them and THEN reboot the machine again.  Hopefully it will now have enough free space to work and you can SSH in properly.  At this point, do a thorough examination of disk usage across the file system, see what has been growing, clean it up and ensure it doesn't fill up your file system again.
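
Once you are in, this is the sort of thing I would run first (the journalctl line is only relevant if journald logs turn out to be the culprit, and assumes a systemd-based image):

# how full is each mounted filesystem?
df -h
# biggest directories once you know which mount is full (adjust the path)
du -x / 2>/dev/null | sort -n | tail -20
# quick way to trim journald logs down to ~200 MB
sudo journalctl --vacuum-size=200M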

Thanks kolban, I'll give that a try. Though I may not have provided enough detail... this VM is tied to a Google Artifact Registry container, so I was under the impression that every time I reset the VM it fetched the latest container image and wiped the whole VM slate clean. I assumed this meant removing all logs, cache, etc. that had built up. I now wonder if the VM itself retained all its logging while only the hosted image was reset. I must admit I have a shallow understanding of this space, so I'll investigate in the direction you've pointed me.

When you run a Compute Engine instance (a VM), it runs an OS (normally Linux or Windows).  When you ask it to run a Docker image, the OS running in the VM is an instance of Linux that ALSO has Docker installed/configured.  We can read about this OS at:

https://cloud.google.com/container-optimized-os/docs

The net of this is that if you use the same Compute Engine instance over and over again (which is fine), then that instance of the OS might need to have its logs cleaned.  When you stop/restart the VM, the Docker image run inside the VM is indeed pulled from Artifact Registry, but the instance of the OS on the VM (and its boot disk) remains between stops/starts.  Let's be clear that it was only my guess that logs filled up the filesystem.  We will learn more when you start it with serial console login enabled.  Once we have a shell prompt and run "df", my hope is that we will find a full disk that can be cleaned ... but the problem MAY be elsewhere.  Also, we will want to find out the "distribution" of files on the local VM filesystem.  We may find that it is something other than logs being written ... to be continued 🙂
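
To get a feel for that distribution, I would compare the two places that usually grow on a long-lived container VM, roughly like this (paths assume Container-Optimized OS, where the writable data lives under /var):

# rough split between system/container logs and Docker's own data
sudo du -sh /var/log /var/lib/docker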

Thank you for your insights kolban, that makes a lot of sense. Though I must admit I ended up spinning up a replacement VM as a quick fix. When time allows I will experiment with the serial console to make myself a little more familiar with this space for next time! Your help has been much appreciated.

Try cleaning up unused Docker images; it reclaimed 53 of 55 GB of space on my instance 🙂

# remove all images not used by at least one container
docker image prune -a
# remove stopped containers, unused networks, dangling images and build cache
docker system prune
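
If you want to see how much space Docker is holding before and after pruning:

# summary of space used by images, containers, local volumes and build cache
docker system df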