diff options
| author | Samuel Dionne-Riel <samuel@dionne-riel.com> | 2024-12-31 20:06:56 -0500 |
|---|---|---|
| committer | Samuel Dionne-Riel <samuel@dionne-riel.com> | 2025-01-01 00:32:55 -0500 |
| commit | 40fb9e7c5a45af3f63fc821067f044d3d8a4befc (patch) | |
| tree | 1b4bd36cc438b71f5da67449396fa1ff9ea58812 /pkgs/development/python-modules/termplotlib/gnuplot-subprocess.patch | |
| parent | 88195a94f390381c6afcdaa933c2f6ff93959cb4 (diff) | |
switch-to-configuration-ng: Fix exit status on bootloader install error
The problem
-----------
When rebuilding a system, if `switch-to-configuration-ng` fails to install
bootloader files, it will (most likely) `exit(0)`.
```
/etc/nixos $ sudo nixos-rebuild --fast boot && reboot
[sudo] password for samuel:
building the system configuration...
updating GRUB 2 menu...
cannot copy /nix/store/.../initrd to /boot/kernels/...-initrd.tmp: No space left on device
Failed to install bootloader
Broadcast message from samuel@... on pts/1 (Tue 2024-12-31 16:48:26 EST):
The system will reboot now!
```
This is a quite awkward breaking change with the expected behaviour.
* * *
The investigation
-----------------
Compare:
- https://github.com/NixOS/nixpkgs/blob/85b5f3e959327a3fa46f843848ebb8799069bb95/pkgs/by-name/sw/switch-to-configuration-ng/src/src/main.rs#L171-L179
- https://github.com/NixOS/nixpkgs/blob/85b5f3e959327a3fa46f843848ebb8799069bb95/nixos/modules/system/activation/switch-to-configuration.pl#L115-L117
Let's see what `die()` is all about:
- https://github.com/NixOS/nixpkgs/blob/85b5f3e959327a3fa46f843848ebb8799069bb95/pkgs/by-name/sw/switch-to-configuration-ng/src/src/main.rs#L121-L125
***sus.***
There are multiple issues converging here.
**Incorrect port**
The original implementation did not use `die`, but `exit 1`.
So porting from `perl` following the script's idiosyncrasies was not
done appropriately.
**Incorrect `die` fac-simile**
The `die` method is incomplete with regard to the semantics of perl.
- https://perldoc.perl.org/5.40.0/functions/die
Of importance to us:
> If [die is called], the exit code is determined from the values of
> `$!` and `$?` with this pseudocode:
>
> ```
> exit $! if $!; # errno
> exit $? >> 8 if $? >> 8; # child exit status
> exit 255; # last resort
> ```
The `die()` method in `switch-to-configuration-ng` *only* checks
`errno`, using its value directly to `exit()`.
It does not handle some form of implicit child process exit status.
And, due to incorrect assumptions, it will not fall back to anything.
**Incorrect implementation**
(Note that from this point on, I'm not a Rust expert, so bear with me if
some nuances are lost or incorrectly represented.)
The `die()` function implementation, as a port, might not even work
correctly.
Already, the `spawn` method does not mention it would be setting
`errno`, so any `die()` following a `status.success()` is *sus* and
should be investigated. Since it's not attempting to do anything "smart"
with child processes.
- https://doc.rust-lang.org/1.83.0/std/process/struct.Command.html#method.spawn
And I'd argue that using *errno* in this manner in Rust is probably a
mistake, and should not be done.
> This should be called immediately after a call to a platform function,
> otherwise the state of the error value is indeterminate.
- https://doc.rust-lang.org/1.83.0/std/io/struct.Error.html#method.last_os_error
Considering *platform function* is largely left undefined, I would
(probably wrongly) intuit that it should be considered undefined
behaviour to rely on it.
Note that `raw_os_error` might have a surprising interface.
> If this `Error` was constructed via `last_os_error` [...],
> then this function will return `Some`, otherwise it will return `None`.
- https://doc.rust-lang.org/1.83.0/std/io/struct.Error.html#method.raw_os_error
Since it's used as `std::io::Error::last_os_error().raw_os_error()`,
AFAIUI it will always return `Some`.
Since this is exposing `errno`, the libc concept, it will behave the
same, and may be set to `0` by default, just like here:
```
$ printf '#include <stdlib.h>\n#include <errno.h>\nint main() { exit(errno); }' \
| cc -x c - && ./a.out; echo $?
0
```
Which means that, since no *platform function*[sic] changed its value,
it will be zero, the `die()` function will be equivalent to `exit(errno)`,
and the program will have failed “successfully” wrongly.
* * *
The fix
-------
I've fixed the `do_pre_switch_check` and `do_install_bootloader` methods,
both of which share the same defects (the original script uses `exit 1`
for both).
They were the only `status.success()` checks using `die()`.
* * *
Reproducing the issue
---------------------
Remember how I said:
> Considering *platform function* is largely left undefined, I would
> (probably wrongly) intuit that it should be considered undefined
> behaviour to rely on it.
Here's why it's not some vague FUD.
First, make sure a `nixos-rebuild boot` would need to write new files to
the boot partition. Removing an older (but still alive) generation's
initrd can do that.
Fill the `/boot` partition to force an error.
```
$ sudo dd if=/dev/zero of=/boot/BOGUS.FILLINGS
```
Then, and here's the fun part, observe:
```
~ $ sudo rm -r /run/nixos
~ $ sudo nixos-rebuild --fast boot ; echo $?
building the system configuration...
updating GRUB 2 menu...
cannot copy /nix/store/x91w4p91l7iclkdp38chvdxcw6nr5113-mobile-nixos-initrd-generic/initrd to /boot/kernels/x91w4p91l7iclkdp38chvdxcw6nr5113-mobile-nixos-initrd-generic-initrd.tmp: No space left on device
Failed to install bootloader
0
~ $ sudo nixos-rebuild --fast boot ; echo $?
building the system configuration...
updating GRUB 2 menu...
cannot copy /nix/store/x91w4p91l7iclkdp38chvdxcw6nr5113-mobile-nixos-initrd-generic/initrd to /boot/kernels/x91w4p91l7iclkdp38chvdxcw6nr5113-mobile-nixos-initrd-generic-initrd.tmp: No space left on device
Failed to install bootloader
warning: error(s) occurred while switching to the new configuration
1
```
So... What's the deal with /run/nixos? It's where the lock file will
reside. (And other transient files.)
- https://github.com/NixOS/nixpkgs/blob/85b5f3e959327a3fa46f843848ebb8799069bb95/pkgs/by-name/sw/switch-to-configuration-ng/src/src/main.rs#L1018-L1027
But why does that matter here?
```
~ $ errno 17
EEXIST 17 File exists
```
This error is produced by some *platform functions*[sic] that create
either the directory, or the lockfile. The file already exists.
So the script would end-up failing this way *only for the first
invocation*. Which is why it's possible any of you all reviewing this
~~novel~~ PR haven't faced that issue.
* * *
Future work
-----------
I believe `die()` *probably* should be switched to check the value, and
`exit 255` if it's 0.
Though I also believe `die()` shouldn't try to port perl semantics into
Rust. I don't think it's working out.
Additionally, a NixOS test should be authored to ensure that errors in
these phases actually are handled appropriately.
Signed-off-by: Samuel Dionne-Riel <samuel@dionne-riel.com>
Diffstat (limited to 'pkgs/development/python-modules/termplotlib/gnuplot-subprocess.patch')
0 files changed, 0 insertions, 0 deletions
