Latest revision as of 14:16, 22 June 2024

The Problem

Sometimes VMs/LXCs crash & there's no way to have Proxmox detect & reset them...

(Current theory: the problem that started this project is that the journalling service occasionally runs into a disk-write problem while the nightly backups are happening. This seems to make systemd very unhappy & it locks the VM up tight...)

Possible Solution(s)

Have the VM/LXC itself run a watchdog & send a message regularly to Proxmox.

Then...

Have Proxmox watch for stale messages & reset any VMs/LXCs that are out of date by some specific time limit.

Questions

What messaging tools are built into Proxmox AND suitable?

One option is MQTT. But I'd rather keep this entirely contained to the PVE server itself.

Something using ncat might be best.

(That's ncat, not netcat... netcat doesn't work...)

sudo apt install ncat

Thoughts

On the guests

So far, working with a Debian LXC and LMDE & Mint-21 VMs...

Finding relevant info about guests

The VMID of a container can be found in /etc/mtab via script (this assumes the container's root disk shows up under /dev/mapper/):

VMID=$(grep -m1 '/dev/mapper/' /etc/mtab | cut -d'-' -f4)
PVEHOST=Schizo310.TinkerNet.ca
TYPE=CT
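
To see what the cut is actually picking out, here's a small sketch run against a sample mtab line. The sample line (and its pve-vm--1050--disk--0 volume naming) is an assumption about what an LVM-backed container typically sees; other storage types may not have a /dev/mapper/ entry at all and would need a different approach. The helper name vmid_from_mtab is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: pull the VMID out of mtab-style text on stdin.
# Splitting on '-' turns "pve-vm--1050--disk--0" into fields where
# the 4th one is the VMID (the doubled dashes produce empty fields).
vmid_from_mtab() {
  grep -m1 '/dev/mapper/' | cut -d'-' -f4
}

# Sample line, ASSUMED to resemble what an LVM-backed container sees.
sample='/dev/mapper/pve-vm--1050--disk--0 / ext4 rw,relatime 0 0'
VMID=$(printf '%s\n' "$sample" | vmid_from_mtab)
echo "$VMID"
```

If the volume group or disk naming on your storage differs, the field number passed to cut will likely need adjusting.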

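For VMs, you'll need to put these values into the SMBIOS Settings under Options (the VMID as the serial number, the PVE host name as the product name & the type as the version; maybe there's a way to automate this?) & then read them in via script using dmidecode: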

VMID=`/usr/sbin/dmidecode -s system-serial-number`
PVEHOST=`/usr/sbin/dmidecode -s system-product-name`
TYPE=`/usr/sbin/dmidecode -s system-version`

(Has to be done as root...)
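
On the question of automating the SMBIOS settings: qm set has an --smbios1 option that takes serial=, product= and version= keys, so something like the sketch below might do it from the PVE host. This is untested against this setup. The helper name smbios_cmd is made up, it only prints the command rather than running it, and as far as I can tell setting smbios1 replaces the whole property, so an existing uuid= would need to be carried over (hence the optional third argument).

```shell
#!/bin/bash
# Hypothetical helper: build (but do NOT run) the qm command for one VM.
# Arguments: VMID, PVE host name, optional existing UUID to preserve.
smbios_cmd() {
  local vmid=$1 pvehost=$2 uuid=${3:-}
  local value="serial=${vmid},product=${pvehost},version=VM"
  if [ -n "$uuid" ]; then
    value="uuid=${uuid},${value}"
  fi
  echo "qm set ${vmid} --smbios1 ${value}"
}

# Print the command for VM 1001 so it can be reviewed before running it.
smbios_cmd 1001 Schizo310.TinkerNet.ca
```

Printing first & pasting by hand seems the safer way to try this on a single test VM before looping over all of them.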

Sending the info to the host

A shell script (/root/bin/watchdog) with the above variable settings and:

/usr/bin/echo "$TYPE;$VMID;$(/usr/bin/hostname -s);$(/usr/bin/date +%s)" | /usr/bin/ncat "$PVEHOST" 9999

A cron job running it every minute, something like:

* * * * * /home/tinker/bin/watchdog

This cron line (in root's crontab) sends the guest type, the VMID, the short hostname, and the time expressed as seconds since the Epoch (1970-01-01 00:00 UTC) to port 9999 on the host.

On the PVE server

A similar cron task to that on the guests, but sending the server name instead of the guest VMID.

VMID=Schizo310
TYPE=PVE

A continuously running service based on:

ncat -kl schizo310 9999

A bit of code so far...

#! /bin/bash

while read -r line
 do
  IFS=';' read -ra datum <<< "$line"
  echo -e "${datum[0]} \t ${datum[1]}   \t ${datum[3]} \t ${datum[2]}"

# echo ${datum[0]} # TYPE
# echo ${datum[1]} # VMID
# echo ${datum[2]} # HOSTNAME
# echo ${datum[3]} # TIME

done < <(ncat -kl Schizo310.TinkerNet.ca 9999)
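
Since anything on the network can connect to port 9999, it's probably worth checking each line before trusting it. A minimal sketch, assuming the TYPE;VMID;HOSTNAME;TIME layout used above (the function name valid_heartbeat is made up):

```shell
#!/bin/bash
# Hypothetical validator: succeeds only for TYPE;VMID;HOSTNAME;TIME
# where TYPE is VM, CT or PVE and TIME is all digits.
valid_heartbeat() {
  local datum
  IFS=';' read -ra datum <<< "$1"
  # Exactly four fields are expected.
  [ "${#datum[@]}" -eq 4 ] || return 1
  case "${datum[0]}" in
    VM|CT|PVE) ;;
    *) return 1 ;;
  esac
  # The timestamp must be a plain non-negative integer.
  [[ "${datum[3]}" =~ ^[0-9]+$ ]]
}

valid_heartbeat 'CT;1050;FTPd;1716172681' && echo "ok"
valid_heartbeat 'GET / HTTP/1.1' || echo "rejected"
```

Inside the watcher loop this could be used as `valid_heartbeat "$line" || continue` before the line is split & printed.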

An example of the messages

root@schizo310:~/bin# watcher
VM 	 1001   	 1716172680 	 pfSense310
CT 	 1050   	 1716172681 	 FTPd
PVE  Schizo310   1716172681 	 schizo310
CT 	 1022   	 1716172681 	 Grafana
VM 	 1201   	 1716172681 	 manager310
CT 	 1010   	 1716172681 	 MQTT
CT 	 1021   	 1716172681 	 InfluxDB
CT 	 1020   	 1716172681 	 MariaDB
CT 	 1030   	 1716172681 	 Apache
CT 	 1015   	 1716172681 	 Node-Red
CT 	 1040   	 1716172681 	 FileStore
CT 	 11001   	 1716172681 	 LibreSpeed310
VM 	 1001   	 1716172740 	 pfSense310
CT 	 1040   	 1716172741 	 FileStore
CT 	 11001   	 1716172741 	 LibreSpeed310
CT 	 1050   	 1716172741 	 FTPd
PVE  Schizo310   1716172741 	 schizo310
CT 	 1022   	 1716172741 	 Grafana
CT 	 1010   	 1716172741 	 MQTT
CT 	 1015   	 1716172741 	 Node-Red
CT 	 1020   	 1716172741 	 MariaDB
CT 	 1021   	 1716172741 	 InfluxDB
VM 	 1201   	 1716172741 	 manager310
CT 	 1030   	 1716172741 	 Apache
^C

This is a couple of minutes of output. Notice that everything shows up twice (once per minute)...

Running it as a service

vi /etc/systemd/system/ncat-watch.service

[Unit]
Description=Watching for incoming NCAT messages
After=multi-user.target

[Service]
ExecStart=/usr/bin/bash /root/bin/watcher
Type=simple

[Install]
WantedBy=multi-user.target

systemctl enable ncat-watch.service
systemctl start ncat-watch.service
systemctl status ncat-watch.service

Next Steps

Now to just figure out how to check the guests' times against the server's time...

I suspect I'll be writing some code to take the incoming data & put it in a file (or maybe multiple files) for handling.
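
One way the "multiple files" idea might look: one small file per guest, overwritten on every message, so each file always holds that guest's latest heartbeat. This is only a sketch. The function name record_heartbeat, the TYPE-VMID file naming and the temporary state directory are all assumptions, and lines should be validated before they get anywhere near a filename.

```shell
#!/bin/bash
# Hypothetical helper: remember the most recent heartbeat per guest.
# One file per guest, named TYPE-VMID, holding "HOSTNAME TIME".
record_heartbeat() {
  local dir=$1 line=$2 datum
  IFS=';' read -ra datum <<< "$line"
  printf '%s %s\n' "${datum[2]}" "${datum[3]}" > "${dir}/${datum[0]}-${datum[1]}"
}

# ASSUMED location: a throwaway directory for the demo.
STATE_DIR=$(mktemp -d)
record_heartbeat "$STATE_DIR" 'CT;1050;FTPd;1716172681'
record_heartbeat "$STATE_DIR" 'CT;1050;FTPd;1716172741'
cat "$STATE_DIR/CT-1050"
```

The second call overwrites the first, so the file ends up holding only the newer timestamp.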

Then, if a guest's time differs from the host's time by more than a minute, it's likely crashed. Reset it.
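
A rough sketch of that check, fed with the latest message per guest. It only prints the command it would run (qm reset for VMs, pct reboot for containers) & executes nothing. The function name stale_actions is made up, and the 120 second limit is an assumption: with cron firing once a minute, a limit of exactly 60 seconds would probably trip on ordinary jitter. It also assumes guest & host clocks agree; recording the host's own arrival time for each message would avoid depending on that.

```shell
#!/bin/bash
# Hypothetical check: print the command that WOULD be run for each guest
# whose last heartbeat is older than the limit. Nothing is executed.
# Input on stdin: TYPE;VMID;HOSTNAME;TIME (latest message per guest).
# Arguments: current epoch time, limit in seconds.
stale_actions() {
  local now=$1 limit=$2 kind vmid host stamp
  while IFS=';' read -r kind vmid host stamp; do
    # Skip guests that have reported recently enough.
    [ $(( now - stamp )) -gt "$limit" ] || continue
    case "$kind" in
      VM) echo "qm reset $vmid" ;;
      CT) echo "pct reboot $vmid" ;;
      *)  echo "# no action for $kind $vmid" ;;   # e.g. the PVE host itself
    esac
  done
}

# 1716172741 stands in for "now"; the 120 second limit is an ASSUMPTION.
printf '%s\n' 'CT;1050;FTPd;1716172741' 'VM;1201;manager310;1716172500' |
  stale_actions 1716172741 120
```

A container that's truly wedged may not respond to a reboot request, in which case a stop followed by a start might be needed instead.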

Ncat Caveats

Ncat, by default, is a complete moron about DNS...

This appears to be some sort of bug in the GNU C Library (glibc) NSS implementation (though it may simply be that /etc/hosts has a 127.0.1.1 entry for the hostname, which the files source returns before DNS is ever asked).

If you start Ncat as a listener using the machine's hostname, it has a good chance of binding to 127.0.1.1 instead of the machine's actual IP address. This means it will never see incoming messages over the network. You can check if this is what it's doing by adding a -v option to the command line.

The solution when this happens is to edit /etc/nsswitch.conf and change the hosts line to remove files as an option (or, at least, move dns to the first position).
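
A quick way to check what the hostname resolves to before (and after) touching nsswitch.conf. getent does its lookup through NSS, so it should see the same answer Ncat gets. This is only a sketch, and the helper name is_loopback is made up.

```shell
#!/bin/bash
# Hypothetical check: succeeds if the given address is in 127.0.0.0/8.
is_loopback() {
  case "$1" in
    127.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# getent follows the lookup order configured in /etc/nsswitch.conf.
addr=$(getent hosts "$(hostname)" | awk '{print $1; exit}')
if is_loopback "$addr"; then
  echo "hostname resolves to loopback ($addr); an ncat listener on it would miss network traffic"
else
  echo "hostname resolves to ${addr:-nothing}"
fi
```

Listening on the machine's IP address instead of its name should sidestep the lookup entirely, at the cost of hard-coding the address.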

pfSense VMs

This seems to work under pfSense too.

BUT it requires installing the nmap package (which provides ncat), and the command paths in the watchdog script need editing:

#! /bin/sh

PVEHOST=`/usr/local/sbin/dmidecode -s system-product-name`

VMID=`/usr/local/sbin/dmidecode -s system-serial-number`

TYPE=`/usr/local/sbin/dmidecode -s system-version`

echo -e $TYPE";"$VMID";"`/bin/hostname -s`";"`/bin/date +%s` | /usr/local/bin/ncat $PVEHOST 9999

Windows VMs

¯\_(ツ)_/¯

Difference between revisions of "PVE Guest crash detection"
