PVE Guest crash detection
The Problem
Sometimes VMs/LXCs crash & there's no way to have Proxmox detect & reset them...
Possible Solution(s)
Have the VM/LXC itself run a watchdog & send a message regularly to Proxmox.
Then...
Have Proxmox watch for stale messages & reset any VMs/LXCs that are out of date by some specific time limit.
Questions
What messaging tools are built-in to Proxmox AND suitable?
One option is MQTT. But I'd rather keep this entirely contained to the PVE server itself.
Something using ncat might be best.
(That's ncat, not netcat... netcat doesn't work...)
sudo apt install ncat
Thoughts
On the guests
So far, working with a Debian LXC & an LMDE VM...
Finding relevant info about guests
VMID of a container can be found in /etc/mtab via script:
VMID=`grep '/dev/mapper/' /etc/mtab | head -n1 | cut -d'-' -f4`
PVEHOST=Schizo310.TinkerNet.ca
TYPE=CT
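To see why the fourth dash-separated field is the VMID, here is a minimal sketch. It assumes the container's root volume lives on LVM in a volume group called pve, so the mtab entry starts with /dev/mapper/pve-vm--11002--disk--0 (LVM doubles the dashes inside the volume name). The sample line and the function name vmid_from_mtab are made up for illustration; other storage types (ZFS, directory) will not match this pattern.

```shell
#!/bin/sh
# Hypothetical helper: extract the VMID from an mtab-style file.
# Assumes the first /dev/mapper/ entry looks like
#   /dev/mapper/pve-vm--11002--disk--0 / ext4 rw,relatime 0 0
vmid_from_mtab() {
    # Splitting on '-' gives: /dev/mapper/pve | vm | (empty) | 11002 | ...
    grep '/dev/mapper/' "${1:-/etc/mtab}" | head -n1 | cut -d'-' -f4
}
```

Called without an argument it reads /etc/mtab, which is what the one-liner above does.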
For VMs, you'll need to put these values into the SMBIOS Settings under Options (maybe there's a way to automate this?) & then read them in via script using dmidecode:
VMID=`/usr/sbin/dmidecode -s system-serial-number`
PVEHOST=`/usr/sbin/dmidecode -s system-product-name`
TYPE=`/usr/sbin/dmidecode -s system-version`
(Has to be done as root...)
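On the automation question: one possible route, sketched here and not tested on this setup, is the smbios1 option of qm set on the PVE host. The sketch assumes its serial, product and version fields map to the three dmidecode strings read above, and that setting smbios1 replaces the whole option (so the existing uuid, visible via qm config, should be passed back in). The function names are hypothetical, and the command is only printed, not run; check man qm before trusting any of it.

```shell
#!/bin/sh
# Hypothetical helper, run on the PVE host: build the value for the
# smbios1 option so the guest can read it back with dmidecode.
#   serial -> VMID, product -> PVE host name, version -> guest type
smbios1_value() {
    vmid=$1 pvehost=$2 uuid=$3
    value="serial=${vmid},product=${pvehost},version=VM"
    # Assumption: smbios1 is replaced as a whole, so carry over the VM's
    # existing UUID when one is supplied.
    [ -n "$uuid" ] && value="uuid=${uuid},${value}"
    printf '%s\n' "$value"
}

# Print the command instead of running it, so it can be reviewed first.
print_qm_command() {
    printf 'qm set %s --smbios1 %s\n' "$1" "$(smbios1_value "$@")"
}
```

For example, print_qm_command 11003 Schizo310.TinkerNet.ca prints a qm set line for VM 11003 that can be pasted into a root shell on the host.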
Sending the info to the host
A shell script (/root/bin/watchdog) with the above variable settings and:
/usr/bin/echo -e "$TYPE\t$VMID\t\t`/usr/bin/date +%s`" | /usr/bin/ncat "$PVEHOST" 9999
A cron job running it every minute, something like:
- * * * * * /root/bin/watchdog
This cron line (in root's crontab) sends the guest type, the VMID, and the time expressed as seconds since the Epoch (1970-01-01 00:00 UTC) to port 9999 on the host.
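Putting the container pieces together, the whole script might look roughly like the sketch below. It is a variation, not the script actually in use: the function names, the PORT variable, the "send" argument (which keeps the script from touching the network unless asked) and the -w 5 connect timeout are all my own additions.

```shell
#!/bin/sh
# Sketch of /root/bin/watchdog for a container (values from the notes above).
PVEHOST=Schizo310.TinkerNet.ca
TYPE=CT
PORT=9999

# Build one message: TYPE <tab> VMID <tab><tab> seconds-since-epoch.
heartbeat_line() {
    printf '%s\t%s\t\t%s\n' "$1" "$2" "$3"
}

# Send a single heartbeat; -w 5 makes ncat give up if the connection
# cannot be made within 5 seconds.
send_heartbeat() {
    vmid=$(grep '/dev/mapper/' /etc/mtab | head -n1 | cut -d'-' -f4)
    heartbeat_line "$TYPE" "$vmid" "$(date +%s)" | ncat -w 5 "$PVEHOST" "$PORT"
}

# Only send when asked to, e.g. from cron:  * * * * * /root/bin/watchdog send
if [ "${1:-}" = "send" ]; then
    send_heartbeat
fi
```

Using printf instead of echo -e avoids depending on which echo the shell picks, and quoting the variables keeps an empty VMID from silently shifting the columns.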
On the PVE server
A similar cron task to that on the guests, but sending the server name instead of the guest VMID.
VMID=Schizo310
TYPE=PVE
A continuously running service based on:
- ncat -kl schizo310 9999
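One way the listener output could be kept for later checking is to pipe it into a small loop that remembers the newest timestamp per guest. This is only a sketch: the state directory /run/pve-guest-watchdog and the function name record_heartbeats are assumptions of mine, and since the lines arrive from the network the loop throws away anything that does not look like a heartbeat.

```shell
#!/bin/sh
# Hypothetical state directory: one file per guest holding its last timestamp.
STATE_DIR=${STATE_DIR:-/run/pve-guest-watchdog}

# Read heartbeat lines (TYPE VMID TIMESTAMP, whitespace separated) from
# stdin and store the newest timestamp in $STATE_DIR/TYPE-VMID.
record_heartbeats() {
    mkdir -p "$STATE_DIR"
    while read -r type id stamp; do
        # Skip anything that is not a known type, a plain ID and a
        # numeric timestamp.
        case $type in CT|VM|PVE) ;; *) continue ;; esac
        case $id in ''|*[!A-Za-z0-9.-]*) continue ;; esac
        case $stamp in ''|*[!0-9]*) continue ;; esac
        printf '%s\n' "$stamp" > "$STATE_DIR/$type-$id"
    done
}

# Intended use on the PVE host:
#   ncat -kl schizo310 9999 | record_heartbeats
```

Each file then holds a single epoch timestamp, which makes the later staleness check a matter of comparing numbers.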
An example of the messages
root@schizo310:~# ncat -kl 192.168.1.2 9999
VM 11003 1715882341
PVE Schizo310 1715882341
CT 11002 1715882341
CT 11002 1715882401
VM 11003 1715882401
PVE Schizo310 1715882401
Now to just figure out how to check the guests' times against the server's time...
Then, if a guest's time differs from the host's time by more than a minute, it's likely crashed. Reset it.
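A rough sketch of that check, assuming the heartbeat lines have been saved to a file or are piped in: keep the newest timestamp per guest, compare it with the host's current time, and print the reset command for anything too old. The function names and the limit parameter are hypothetical. qm reset and pct stop / pct start are the host commands I would expect to use, but the sketch only prints them.

```shell
#!/bin/sh
# Print "TYPE VMID AGE" for every guest whose newest heartbeat is older
# than LIMIT seconds. Input: heartbeat lines on stdin.
find_stale() {
    now=$1 limit=$2
    awk -v now="$now" -v limit="$limit" '
        NF == 3 && $3 ~ /^[0-9]+$/ {
            key = $1 " " $2
            if (!(key in last) || $3 + 0 > last[key]) last[key] = $3 + 0
        }
        END {
            for (key in last)
                if (now - last[key] > limit) print key, now - last[key]
        }' | sort
}

# Map a stale entry to the command that would reset it (printed, not run).
reset_command() {
    case $1 in
        VM) printf 'qm reset %s\n' "$2" ;;
        CT) printf 'pct stop %s && pct start %s\n' "$2" "$2" ;;
        *)  return 1 ;;   # e.g. PVE: the host itself is not reset here
    esac
}
```

Usage might look like find_stale "$(date +%s)" 90 < heartbeats.log, where heartbeats.log is a hypothetical capture of the listener output. With heartbeats sent once a minute, a limit a bit above 60 seconds leaves room for cron jitter; the exact value is a judgment call.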
Ncat Caveats
Ncat, by default, is a complete moron about DNS...
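This appears to come from the GNU C Library (glibc) NSS lookup order rather than from Ncat itself: with files ahead of dns on the hosts line, the hostname is resolved from /etc/hosts, and Debian-based systems commonly map the hostname to 127.0.1.1 there.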
If you start Ncat as a listener using the machine's hostname, it has a good chance of binding to 127.0.1.1 instead of the machine's actual IP address. This means it will never see incoming messages over the network. You can check if this is what it's doing by adding a -v option to the command line.
The solution when this happens is to edit /etc/nsswitch.conf and change the hosts line to remove files as an option (or, at least, move dns to the first position).
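Before editing anything, it may help to see what the system resolver returns for the hostname. The sketch below uses getent hosts, which goes through the same NSS hosts configuration. The function names are hypothetical. Listening on the IP address directly, as in the example session above, also sidesteps the lookup altogether.

```shell
#!/bin/sh
# Succeeds if the given IPv4 address is in the loopback range 127.0.0.0/8.
is_loopback() {
    case $1 in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Report what the NSS "hosts" lookup returns for this machine's hostname.
check_hostname_binding() {
    name=${1:-$(hostname)}
    addr=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
    if is_loopback "$addr"; then
        printf '%s resolves to %s (loopback): a listener bound to it will not be reachable from guests\n' "$name" "$addr"
    else
        printf '%s resolves to %s\n' "$name" "${addr:-nothing}"
    fi
}
```

Running check_hostname_binding before and after changing /etc/nsswitch.conf shows whether the change had the intended effect.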