PVE Guest crash detection
Latest revision as of 13:16, 22 June 2024
The Problem
Sometimes VMs/LXCs crash & there's no way to have Proxmox detect & reset them...
(Current theory: the problem that started this project is that the journalling service occasionally runs into a disk-write problem while the nightly backups are running. This seems to make systemd very unhappy & it locks the VM up tight...)
Possible Solution(s)
Have the VM/LXC itself run a watchdog & send a message regularly to Proxmox.
Then...
Have Proxmox watch for stale messages & reset any VMs/LXCs that are out of date by some specific time limit.
Questions
What messaging tools are built-in to Proxmox AND suitable?
One option is MQTT. But I'd rather keep this entirely contained to the PVE server itself.
Something using ncat might be best.
(That's ncat, not netcat... netcat doesn't work...)
sudo apt install ncat
Thoughts
On the guests
So far, working with a Debian LXC and LMDE & Mint-21 VMs...
Finding relevant info about guests
VMID of a container can be found in /etc/mtab via script:
VMID=`cat /etc/mtab | grep '/dev/mapper/' | head -n1 | cut -d'-' -f4`
PVEHOST=Schizo310.TinkerNet.ca
TYPE=CT
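To show why field 4 of the cut is the VMID, here is a small sketch run against a made-up mtab line. The sample line assumes the container's root disk lives on the default LVM-thin storage in a volume group called pve; the function name vmid_from_mtab_line is only for illustration.

```shell
#!/bin/bash
# Sketch: pull the VMID out of one /etc/mtab line, using the same cut as above.
vmid_from_mtab_line() {
    printf '%s\n' "$1" | cut -d'-' -f4
}

# Hypothetical root-filesystem line for container 1050 (assumed storage layout).
sample='/dev/mapper/pve-vm--1050--disk--0 / ext4 rw,relatime 0 0'

# Splitting on '-' gives: /dev/mapper/pve | vm | (empty) | 1050 | ...
vmid_from_mtab_line "$sample"   # prints 1050
```

Device-mapper doubles any hyphen inside a volume group or volume name, so a volume group with a hyphen in its name would shift the fields and the cut would need adjusting.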
For VMs, you'll need to put it into the SMBIOS Settings under Options (maybe there's a way to automate this?) & then read it in via script using dmidecode:
VMID=`/usr/sbin/dmidecode -s system-serial-number`
PVEHOST=`/usr/sbin/dmidecode -s system-product-name`
TYPE=`/usr/sbin/dmidecode -s system-version`
(Has to be done as root...)
Sending the info to the host
A shell script (/root/bin/watchdog) with the above variable settings and:
/usr/bin/echo -e $TYPE";"$VMID";"`/usr/bin/hostname -s`";"`/usr/bin/date +%s` | /usr/bin/ncat $PVEHOST 9999
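and a cron job that runs it every minute, something like:
- * * * * * /home/tinker/bin/watchdog
(Adjust the path to wherever the script actually lives, e.g. /root/bin/watchdog as above.)
This cron line (in root's crontab) sends the guest type, the VMID, the short hostname, and the time expressed as seconds since the Epoch (1970-01-01 00:00 UTC) to port 9999 on the host.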
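Putting the pieces together, a guest-side script might look roughly like the sketch below. The function names build_message and send_heartbeat are made up for illustration, and the values in the example call are placeholders taken from the notes above. The message format (TYPE;VMID;HOSTNAME;TIME) is the one described here.

```shell
#!/bin/bash
# Sketch of a guest-side watchdog. Function names are hypothetical.

# Build one heartbeat line: TYPE;VMID;HOSTNAME;EPOCH
build_message() {
    printf '%s;%s;%s;%s\n' "$1" "$2" "$3" "$4"
}

# Send one heartbeat to port 9999 on the PVE host.
# Arguments: PVE host name, guest type, VMID.
send_heartbeat() {
    build_message "$2" "$3" "$(hostname -s)" "$(date +%s)" | ncat "$1" 9999
}

# Example call (placeholder values):
# send_heartbeat Schizo310.TinkerNet.ca CT 1050
```

Keeping the message building separate from the sending makes it easy to check the format without a listener running.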
On the PVE server
A similar cron task to that on the guests, but sending the server name instead of the guest VMID.
VMID=Schizo310
TYPE=PVE
A continuously running service based on:
- ncat -kl schizo310 9999
A bit of code so far...
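#!/bin/bash
# Print each incoming message as tab-separated columns: TYPE, VMID, TIME, HOSTNAME.
while IFS= read -r line
do
    # Split the message on ';' into the datum array.
    IFS=';' read -ra datum <<< "$line"
    printf '%s\t%s\t%s\t%s\n' "${datum[0]}" "${datum[1]}" "${datum[3]}" "${datum[2]}"
    # echo ${datum[0]} # TYPE
    # echo ${datum[1]} # VMID
    # echo ${datum[2]} # HOSTNAME
    # echo ${datum[3]} # TIME
done < <(ncat -kl Schizo310.TinkerNet.ca 9999)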
An example of the messages
root@schizo310:~/bin# watcher
VM 1001 1716172680 pfSense310
CT 1050 1716172681 FTPd
PVE Schizo310 1716172681 schizo310
CT 1022 1716172681 Grafana
VM 1201 1716172681 manager310
CT 1010 1716172681 MQTT
CT 1021 1716172681 InfluxDB
CT 1020 1716172681 MariaDB
CT 1030 1716172681 Apache
CT 1015 1716172681 Node-Red
CT 1040 1716172681 FileStore
CT 11001 1716172681 LibreSpeed310
VM 1001 1716172740 pfSense310
CT 1040 1716172741 FileStore
CT 11001 1716172741 LibreSpeed310
CT 1050 1716172741 FTPd
PVE Schizo310 1716172741 schizo310
CT 1022 1716172741 Grafana
CT 1010 1716172741 MQTT
CT 1015 1716172741 Node-Red
CT 1020 1716172741 MariaDB
CT 1021 1716172741 InfluxDB
VM 1201 1716172741 manager310
CT 1030 1716172741 Apache
^C
This is a couple of minutes of output. Notice that everything shows up twice, once per minute (the timestamps are 60 seconds apart).
Running it as a service
vi /etc/systemd/system/ncat-watch.service
[Unit]
Description=Watching for incoming NCAT messages
After=multi-user.target
[Service]
ExecStart=/usr/bin/bash /root/bin/watcher
Type=simple
[Install]
WantedBy=multi-user.target
systemctl enable ncat-watch.service
systemctl start ncat-watch.service
systemctl status ncat-watch.service
Next Steps
Now to just figure out how to check the guests' times against the server's time...
I suspect I'll be writing some code to take the incoming data & put it in a file (or maybe multiple files) for handling.
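One way that could go, as a rough sketch only: keep one small file per sender holding its latest timestamp. The function name record_heartbeat, the file naming (TYPE-VMID), and the directory in the usage comment are all assumptions, not something that exists yet.

```shell
#!/bin/bash
# Sketch: remember the latest heartbeat of each sender in its own file.
# Assumed layout: <dir>/<TYPE>-<VMID> contains "<epoch> <hostname>".

record_heartbeat() {
    local dir="$1" line="$2" type vmid host stamp
    # Split "TYPE;VMID;HOSTNAME;TIME" on semicolons.
    IFS=';' read -r type vmid host stamp <<EOF
$line
EOF
    # Ignore anything that does not look like a heartbeat.
    case $type in VM|CT|PVE) ;; *) return 1 ;; esac
    case $vmid in ''|*[!A-Za-z0-9_.-]*) return 1 ;; esac
    case $stamp in ''|*[!0-9]*) return 1 ;; esac
    printf '%s %s\n' "$stamp" "$host" > "$dir/$type-$vmid"
}

# Possible use inside the listener loop (directory name is hypothetical):
# record_heartbeat /run/pve-watch "$line"
```

Checking the fields before writing keeps stray connections to port 9999 from creating odd file names.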
Then, if a guest's time differs from the host's time by more than a minute, it's likely crashed. Reset it.
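A sketch of what that check might look like. It assumes the last heartbeat of each guest is kept in a file named TYPE-VMID whose first word is the epoch time (a hypothetical layout), and it only prints the command it would run. qm reset and pct reboot are my guess at suitable commands; a container that is truly hung may need pct stop followed by pct start instead.

```shell
#!/bin/bash
# Sketch: find guests whose last heartbeat is too old (dry run, prints commands).

# Succeeds when (now - last) is greater than limit seconds.
is_stale() {
    [ $(($1 - $2)) -gt "$3" ]
}

# Print the command that would reset a guest of the given type.
reset_command() {
    case $1 in
        VM) echo "qm reset $2" ;;
        CT) echo "pct reboot $2" ;;
        *)  return 1 ;;
    esac
}

# Arguments: directory of heartbeat files, current epoch time, limit in seconds.
check_guests() {
    local dir="$1" now="$2" limit="$3" file name last rest
    for file in "$dir"/VM-* "$dir"/CT-*; do
        [ -f "$file" ] || continue
        read -r last rest < "$file"
        # Skip files whose first word is not a plain number.
        case $last in ''|*[!0-9]*) continue ;; esac
        name=${file##*/}
        if is_stale "$now" "$last" "$limit"; then
            reset_command "${name%%-*}" "${name#*-}"
        fi
    done
}

# Example (hypothetical directory): check_guests /run/pve-watch "$(date +%s)" 120
```

Since the guests report once a minute, a limit somewhat above 60 seconds leaves room for a heartbeat that arrives a little late.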
Ncat Caveats
Ncat, by default, is a complete moron about DNS...
This appears to be some sort of bug in the GNU C Library (glibc) NSS implementation.
If you start Ncat as a listener using the machine's hostname, it has a good chance of binding to 127.0.1.1 instead of the machine's actual IP address. This means it will never see incoming messages over the network. You can check whether this is what it's doing by adding the -v option to the command line.
The solution when this happens is to edit /etc/nsswitch.conf and change the hosts line to remove files as an option (or, at least, move dns to the first position).
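A quick way to see what the hostname resolves to before starting the listener, as a sketch. The function names are made up, and I'm assuming getent ahosts follows the same lookup order Ncat ends up using, which seems likely but is not verified here.

```shell
#!/bin/bash
# Sketch: warn when a name resolves to a loopback address.

# Succeeds for loopback addresses such as 127.0.1.1 or ::1.
is_loopback() {
    case $1 in
        127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the first address the system resolver returns for a name.
check_listen_name() {
    local addr
    addr=$(getent ahosts "$1" | awk 'NR == 1 { print $1 }')
    if is_loopback "$addr"; then
        echo "$1 resolves to $addr: a listener bound to it will not see network traffic"
    else
        echo "$1 resolves to ${addr:-nothing}"
    fi
}

# Example: check_listen_name schizo310
```

After changing /etc/nsswitch.conf, the hosts line would read something like "hosts: dns files", and the check should then report the machine's real address.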
pfSense VMs
This seems to work under pfSense too.
BUT, it requires installing the nmap package (which provides ncat), and the command paths in the watchdog script need editing.
#! /bin/sh
PVEHOST=`/usr/local/sbin/dmidecode -s system-product-name`
VMID=`/usr/local/sbin/dmidecode -s system-serial-number`
TYPE=`/usr/local/sbin/dmidecode -s system-version`
echo -e $TYPE";"$VMID";"`/bin/hostname -s`";"`/bin/date +%s` | /usr/local/bin/ncat $PVEHOST 9999
Windows VMs
¯\_(ツ)_/¯