The idea for the big-little Proxmox servers comes from the big.LITTLE processor design ARM has been developing for the last several years (see the ARM big.LITTLE article on Wikipedia). The basic idea is that you have two heterogeneous types of cores (or server nodes, in my case): slow, power-efficient cores run when there is very little load, and fast, power-hungry cores turn on when the load increases.
I don’t think this is a use case the Proxmox developers had in mind when working on Proxmox. However, Proxmox has a couple of things working in its favor:
Proxmox deals very well with a node unexpectedly going offline. If a node dies unexpectedly, the other nodes continue running their VMs and containers as if nothing happened.
Proxmox can gracefully handle manual node shutdowns. If you shut a node off via the GUI or CLI, Proxmox sends the poweroff command to the individual VMs and containers on that node. Once they are all powered off, the node shuts itself down. This works pretty well. The only questionable behavior I have seen is a GitLab container that sometimes takes 5-10 minutes to shut down; I need to read the logs to make sure those containers are going down gracefully.
The challenge is that a Proxmox cluster is not designed to intentionally run without all of its nodes:
- The first issue is quorum. When making a change to the Proxmox configuration, every node in the cluster gets a vote, and a majority is required. In a two-node cluster, that means it takes two votes to do anything… and with one node powered off, there is only one node available to vote. This is a very common design pattern in clustering systems. Some of the early spacecraft used a similar scheme: they had three navigational computers, and if all three didn’t agree, they would all retry the calculation.
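There is an escape hatch for exactly this situation, though it should be used with care. A sketch, run on the node that is still powered on:

```shell
# Check the current quorum state (expected votes, total votes, quorate flag)
pvecm status

# Tell the cluster to consider a single vote quorate.
# Only do this when you are certain the other node is really powered off,
# otherwise you risk a split-brain configuration.
pvecm expected 1
```

This lowers the expected vote count for the running cluster so the surviving node becomes quorate again and configuration changes are allowed.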
- The second issue is implementation. The Proxmox GUI does not do anything directly. Instead, it writes to configuration files that are mostly stored in /etc/pve. When you hit apply, Proxmox triggers lower-level QEMU CLI commands that read the configuration files and make the actual changes to the system.
There doesn’t seem to be anything that can queue configuration changes and apply them when a node comes back online.
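For reference, you can see this split between configuration and execution yourself. A sketch, where VMID 100 and the values shown are just examples:

```shell
# Each VM is described by a flat key/value file under /etc/pve/qemu-server/
cat /etc/pve/qemu-server/100.conf

# A typical file contains lines along these lines (values are examples):
#   cores: 2
#   memory: 2048
#   net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0

# The GUI's start button ultimately runs something equivalent to:
qm start 100
```

The `qm` tool reads the config file at start time, which is why edits made while a node is offline have nowhere to land.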
I need to do more experimentation to see what happens when a node is offline. The PowerEdge currently takes about 10-15 minutes to power back up, so waking it up just to make changes is not really an option.
Physically, I have found a couple of ways to power off the server:
- Log in to the server and manually power it off.
- Hit the shutdown button in the GUI.
- Use a cron job to send the poweroff command
- On PowerEdge systems one can:
a. Log into iDRAC and shut it down
b. Send IPMI commands (see the Intelligent Platform Management Interface article on Wikipedia)
- I do it the simple way and set up an alias in my ssh config so I can ssh to the device and power it off with one short command.
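The cron and ssh-alias approaches above can be sketched like this; the hostname, address, and schedule are placeholders:

```shell
# ~/.ssh/config -- a Host alias so a short name resolves to the node
# Host pve-big
#     HostName 192.168.1.50
#     User root

# A one-off manual shutdown via the alias:
ssh pve-big poweroff

# Or, as a root crontab entry on the node itself,
# shut the node down every night at 23:30:
# 30 23 * * * /sbin/poweroff
```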
I went with the manual method because my life is very unstructured. I don’t have set times to work in my homelab.
Interestingly, powering the system back on is a bit harder. My favorite technique is WoL, or Wake-on-LAN. There are several GUI and CLI tools that can send WoL packets manually or on a schedule. The challenge is that not all combinations of network adaptors and motherboards respond correctly to WoL packets.
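As a sketch, using the `wakeonlan` package found in most distro repositories (the MAC address below is a placeholder for the target NIC's address):

```shell
# Send a WoL magic packet to the target machine's network card
wakeonlan 00:11:22:33:44:55

# etherwake is an alternative; it sends on a specific interface
# and needs root:
# etherwake -i eth0 00:11:22:33:44:55
```

Whether the machine actually wakes depends on WoL being enabled in both the BIOS/UEFI and the NIC's settings.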
On the PowerEdge server, I used IPMI commands (see above) sent through the iDRAC network adaptor, and it works a treat.
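A sketch with `ipmitool`, where the iDRAC address, user, and password are placeholders for your own:

```shell
# Power the server on over the iDRAC's dedicated network port
ipmitool -I lanplus -H 192.168.1.60 -U root -P 'secret' chassis power on

# Check the result:
ipmitool -I lanplus -H 192.168.1.60 -U root -P 'secret' chassis power status
```

Because the iDRAC stays powered even when the server is off, this works regardless of the state of the main network adaptors.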
I treat my NUC the way @jay appears to use his RPi: it is the power-sipping ‘head’ of the homelab, the device that controls everything else.
If anyone else is interested in working on this, it might be worth looking at the HA (High Availability) code in Proxmox. HA has to prepare the cluster so that there are no service interruptions if a node goes offline. It might lend some insight into how to handle LA (Low Availability) situations.
Yep, in 5 years, when LA is a thing in Proxmox, you can say you learned about it first on the Learn Linux TV forum. Now I just have to convince people smarter than me to implement it.