PVE Guest crash detection

From Da Nerd Mage Wiki
Revision as of 19:37, 19 May 2024

The Problem

Sometimes VMs/LXCs crash & there's no way to have Proxmox detect & reset them...

Possible Solution(s)

Have the VM/LXC itself run a watchdog & send a message regularly to Proxmox.

Then...

Have Proxmox watch for stale messages & reset any VMs/LXCs that are out of date by some specific time limit.

Questions

What messaging tools are built-in to Proxmox AND suitable?

One option is MQTT. But I'd rather keep this entirely contained to the PVE server itself.

Something using ncat might be best.

(That's ncat, not netcat... netcat doesn't work...)

  • sudo apt install ncat

Thoughts

On the guests

So far, working with a Debian LXC and LMDE & Mint-21 VMs...

Finding relevant info about guests

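On a container, the VMID can be found in /etc/mtab via script (this relies on the root disk showing up as /dev/mapper/...-vm--VMID--disk--N, which is what LVM-backed storage gives you):

  • VMID=$(grep '/dev/mapper/' /etc/mtab | head -n1 | cut -d'-' -f4)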
  • PVEHOST=Schizo310.TinkerNet.ca
  • TYPE=CT
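
To see what that pipeline is actually picking out, here is a minimal sketch run against a made-up mtab line. The sample line and the function name vmid_from_mtab are illustrative only, and it assumes the pve-vm--VMID--disk--N naming described above; other storage types (ZFS, directory) would likely need a different pattern.

```bash
#!/bin/bash
# Extract the VMID from mtab-style input (hypothetical helper name).
# Reads mount lines on stdin, takes the first /dev/mapper/ entry and
# returns the 4th dash-separated field of that line.
vmid_from_mtab() {
  grep '/dev/mapper/' | head -n1 | cut -d'-' -f4
}

# Made-up sample line in the shape LVM-backed storage typically produces.
sample='/dev/mapper/pve-vm--11002--disk--0 / ext4 rw,relatime 0 0'

vmid_from_mtab <<< "$sample"      # prints 11002

# On a real container: VMID=$(vmid_from_mtab < /etc/mtab)
```

The double dashes in the device name are why the VMID lands in field 4 rather than field 3: cut counts the empty field between the two dashes.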
[Screenshot: SMBIOS Settings.png]

For VMs, you'll need to enter the values by hand in the SMBIOS Settings under Options (VMID as the serial number, PVE host as the product name, guest type as the version; maybe there's a way to automate this?) & then read them in via script using dmidecode:

  • VMID=$(/usr/sbin/dmidecode -s system-serial-number)
  • PVEHOST=$(/usr/sbin/dmidecode -s system-product-name)
  • TYPE=$(/usr/sbin/dmidecode -s system-version)

(Has to be done as root...)
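
On the "maybe there's a way to automate this?" question: qm set on the PVE host accepts an --smbios1 option, so something like the following might do it. This is an untested sketch, not a confirmed procedure. The function name smbios1_value is made up, and since setting smbios1 replaces the whole property it is probably wise to carry over the VM's existing uuid (shown by qm config for that VM). Values containing commas or other odd characters would need more care.

```bash
#!/bin/bash
# Build the value for qm's --smbios1 option (hypothetical helper).
# Args: VMID, PVE host name, guest type, optional existing uuid.
smbios1_value() {
  local vmid=$1 pvehost=$2 type=$3 uuid=${4:-}
  local value="serial=${vmid},product=${pvehost},version=${type}"
  if [ -n "$uuid" ]; then
    value="uuid=${uuid},${value}"
  fi
  printf '%s\n' "$value"
}

smbios1_value 11003 Schizo310.TinkerNet.ca VM
# prints serial=11003,product=Schizo310.TinkerNet.ca,version=VM

# On the PVE host this would be applied with something like:
#   qm set 11003 --smbios1 "$(smbios1_value 11003 Schizo310.TinkerNet.ca VM)"
```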

Sending the info to the host

A shell script (/root/bin/watchdog) with the above variable settings and:

  • /usr/bin/echo "$TYPE;$VMID;$(/usr/bin/hostname -s);$(/usr/bin/date +%s)" | /usr/bin/ncat "$PVEHOST" 9999

A cron job to run it every minute:

  • * * * * * /root/bin/watchdog

This cron line (in root's crontab) sends the guest type, the VMID, the short hostname, and the time expressed as seconds since the Epoch (1970-01-01 00:00 UTC) to port 9999 on the host, once a minute.
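
Putting those pieces together, the container version of the watchdog script might look roughly like this. It's a sketch: the heartbeat function name is made up for illustration, and the actual send is left commented out so the snippet can be run anywhere without a listener.

```bash
#!/bin/bash
# Sketch of a guest-side watchdog for a container (CT).
PVEHOST=Schizo310.TinkerNet.ca
TYPE=CT

# Build one message in the form TYPE;VMID;HOSTNAME;EPOCH-SECONDS
heartbeat() {
  local type=$1 vmid=$2 host=$3 now=$4
  printf '%s;%s;%s;%s\n' "$type" "$vmid" "$host" "$now"
}

# Example with fixed values, to show the wire format:
heartbeat CT 11002 NetCat-Tst-CLI 1716161761
# prints CT;11002;NetCat-Tst-CLI;1716161761

# Real use (needs ncat and a listener on the host):
#   VMID=$(grep '/dev/mapper/' /etc/mtab | head -n1 | cut -d'-' -f4)
#   heartbeat "$TYPE" "$VMID" "$(hostname -s)" "$(date +%s)" | ncat "$PVEHOST" 9999
```

Keeping the message building separate from the sending makes it easy to check the format by eye before anything goes over the network.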

On the PVE server

A similar cron task to that on the guests, but sending the server name instead of the guest VMID.

  • VMID=Schizo310
  • TYPE=PVE

A continuously running service based on:

  • ncat -kl schizo310 9999

A bit of code so far...

#!/bin/bash

# Each incoming line looks like TYPE;VMID;HOSTNAME;TIME
# datum[0] = TYPE, datum[1] = VMID, datum[2] = HOSTNAME, datum[3] = TIME
while IFS= read -r line
 do
  IFS=';' read -ra datum <<< "$line"
  # Print TYPE, VMID, TIME, HOSTNAME separated by tabs
  echo -e "${datum[0]} \t ${datum[1]}   \t ${datum[3]} \t ${datum[2]}"
done < <(ncat -kl Schizo310.TinkerNet.ca 9999)

An example of the messages

root@schizo310:~/bin# watcher
VM 	 1001   	 1716161760 	 pfSense310
CT 	 11004   	 1716161761 	 pubkey-test
CT 	 11001   	 1716161761 	 LibreSpeed310
VM 	 1201   	 1716161761 	 manager310
CT 	 11002   	 1716161761 	 NetCat-Tst-CLI
PVE  Schizo310   1716161761 	 schizo310
CT 	 1022   	 1716161761 	 Grafana
VM 	 11003   	 1716161761 	 LMDE-6
CT 	 1050   	 1716161761 	 FTPd
CT 	 1010   	 1716161761 	 MQTT
VM 	 11005   	 1716161761 	 Mint-21-Tst
CT 	 1040   	 1716161761 	 FileStore
CT 	 10022   	 1716161761 	 BlogNabbit
CT 	 1030   	 1716161761 	 Apache
CT 	 1032   	 1716161761 	 Apache-again
CT 	 1015   	 1716161761 	 Node-Red
CT 	 9060   	 1716161761 	 KittyCam
CT 	 1021   	 1716161761 	 InfluxDB
CT 	 1020   	 1716161761 	 MariaDB

Next Steps

Now to just figure out how to check the guests' times against the server's time...

I suspect I'll be writing some code to watch the incoming data & put it in a file for handling.

Then, if a guest's most recent time differs from the host's time by more than a minute, it's likely crashed. Reset it.
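
One way this could go, as a rough sketch rather than finished code: keep the newest timestamp per guest in an associative array and flag anything older than a limit. The names record_message and stale_guests and the 120-second limit are assumptions for illustration (a limit of exactly one minute looks tight when the guests only report once a minute). It only prints which guests look stale; the actual reset would presumably be qm reset for VMs and pct stop followed by pct start for containers.

```bash
#!/bin/bash
# Needs bash 4+ for associative arrays.
declare -A last_seen   # key "TYPE;VMID" -> newest epoch time received
LIMIT=120              # seconds of silence before a guest counts as stale (assumed value)

# Store the timestamp from one "TYPE;VMID;HOSTNAME;TIME" message.
record_message() {
  local type vmid host stamp
  IFS=';' read -r type vmid host stamp <<< "$1"
  last_seen["$type;$vmid"]=$stamp
}

# Print every guest whose last message is older than LIMIT at time $1.
stale_guests() {
  local now=$1 key stamp
  for key in "${!last_seen[@]}"; do
    stamp=${last_seen[$key]}
    if (( now - stamp > LIMIT )); then
      echo "$key"
    fi
  done | sort
}

# Example with made-up data: 11002 stopped reporting five minutes ago.
record_message 'CT;11002;NetCat-Tst-CLI;1716161461'
record_message 'VM;11003;LMDE-6;1716161761'
stale_guests 1716161770     # prints CT;11002
```

In the real watcher, record_message would be called from inside the read loop on each incoming line, with stale_guests run against the host's own date +%s every so often.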

Ncat Caveats

Ncat, by default, is a complete moron about DNS...

This appears to be some sort of bug in the GNU C Library (glibc) NSS implementation.

If you start Ncat as a listener using the machine's hostname, it has a good chance of binding to 127.0.1.1 instead of the machine's actual IP address. This means it will never see incoming messages over the network. You can check if this is what it's doing by adding a -v option to the command line.

The solution when this happens is to edit /etc/nsswitch.conf and change the hosts line to remove files as an option (or, at least, move dns to the first position).
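
A quick way to see what the hostname resolves to before pointing Ncat at it is getent, which should go through the same NSS hosts lookup. The sketch below wraps that in a small check; is_loopback is a made-up helper name, and it only looks at IPv4 addresses.

```bash
#!/bin/bash
# Return success if the given IPv4 address is in 127.0.0.0/8 (hypothetical helper).
is_loopback() {
  case $1 in
    127.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Resolve this machine's hostname via NSS and take the first IPv4 address.
addr=$(getent ahostsv4 "$(hostname)" | awk '{ print $1; exit }')

if is_loopback "$addr"; then
  echo "$(hostname) resolves to $addr; an ncat listener on that name would only hear loopback"
else
  echo "$(hostname) resolves to ${addr:-nothing}"
fi
```

If this reports a 127.x address, either the nsswitch.conf change above or simply giving Ncat the machine's IP address instead of its name should get around it.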

pfSense VMs

This seems to work under pfSense too.

BUT, it requires installing the nmap package (for ncat), and the watchdog script needs its command locations edited:

#!/bin/sh

PVEHOST=$(/usr/local/sbin/dmidecode -s system-product-name)
VMID=$(/usr/local/sbin/dmidecode -s system-serial-number)
TYPE=$(/usr/local/sbin/dmidecode -s system-version)

echo "$TYPE;$VMID;$(/bin/hostname -s);$(/bin/date +%s)" | /usr/local/bin/ncat "$PVEHOST" 9999

Windows VMs

¯\_(ツ)_/¯