Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/azure: move image-specific configs from azure-common to azure-image, fix console output #359365

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

codgician
Copy link
Member

@codgician codgician commented Nov 26, 2024

[azure-common.nix]

  • Moved image specific configs (namely file system configs and grub configs) out of azure-common.nix, because these settings are only intended for images generated using azure-image.nix. This will make azure-common.nix more generic so folks generating azure image using other methods (e.g. disko) can leverage it.
  • Removed reference of headless.nix to make the console work (which was left out by mistake from nixos/azure: add Gen 2 VM, aarch64 and accelerated networking support #333508). Thanks @nh2 for the great finding.
  • Enabled cloud-init by default. Otherwise, the current waagent would try ifupdown for networking management, which is not available in NixOS ecosystem.

[azure-image.nix]

  • Included image specific configs (mentioned as above).
  • Fixed grub console output for Gen 2 VM by forcing grub to run in text mode.
  • Added option for customizing root partition label.
  • Updated the default configuration.nix copied into the image (azure-config-user.nix), prompting user to set the correct vmGeneration before rebuilding. Doing this instead of dynamically generating to remain compatibility, as I found some open-source configurations on GitHub are referencing this file directly.

[azure-agent.nix]

  • Removed assertion of Network Manager, because the current version of waagent works well with Network Manager.
  • Removed reference of python39 because the current version of waagent is depending on a newer Python version which already exist in the environment given how it is packaged.

Changes in this PR is testable via flake codgician/azure-aarch64-nixos (instructions for generating images provided in README.md). Tested for both Gen 2 grub and Gen 2 systemd-boot on aarch64.

image

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 25.05 Release Notes (or backporting 24.11 and 25.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels Nov 26, 2024
@codgician codgician requested a review from nh2 November 26, 2024 18:42
@codgician codgician self-assigned this Nov 26, 2024
@codgician codgician force-pushed the fix-azure-modules branch 2 times, most recently from 9e744ad to 3ede656 Compare November 27, 2024 13:23
@codgician codgician changed the title nixos/azure: move image-specific configs from azure-common to azure-image, and code cleanups nixos/azure: move image-specific configs from azure-common to azure-image, fix console output Nov 27, 2024
@codgician codgician marked this pull request as ready for review November 27, 2024 17:14
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Nov 28, 2024
Copy link
Contributor

@nh2 nh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, but the release notes is a blocker.

Note I have not tested it, only read the code.

@github-actions github-actions bot added 8.has: documentation This PR adds or changes documentation 8.has: changelog labels Nov 29, 2024
@codgician
Copy link
Member Author

Generally LGTM, but the release notes is a blocker.

Note I have not tested it, only read the code.

Thanks @nh2 for the insights, I have made improvements according to comments and updated the release note.

@AkechiShiro
Copy link
Contributor

AkechiShiro commented Nov 29, 2024

I've been trying to test it but I'm running in an issue (server side from Microsoft, I've been asked by Microsoft to open a support ticket for their team to further investigate their server side errors) azcopy is failling with 500 when I try to copy the vhd to a managed disk.

If you have any recommended way of uploading the vhd @codgician feel free to let me know I'll try without azcopy

@codgician
Copy link
Member Author

I've been trying to test it but I'm running in an issue (server side from Microsoft, I've been asked by Microsoft to open a support ticket for their team to further investigate their server side errors) azcopy is failling with 500 when I try to copy the vhd to a managed disk.

If you have any recommended way of uploading the vhd @codgician feel free to let me know I'll try without azcopy

I also use azcopy for copying, the detailed command could be found here (you may need to change account name / container name / subscription id) accordingly.

Uploading can also be done on Azure Portal. In storage account page -> Storage browser -> Blob containers, select your container and click "Upload" for uploading blob, just make sure the type is selected as "Page Blob".

image

@codgician
Copy link
Member Author

Rebased with latest master branch to resolve merge conflicts.

@nh2
Copy link
Contributor

nh2 commented Nov 30, 2024

Added fileSystems.<name>.label in the latest iteration.

@codgician I cannot see that in your latest push yet, it still shows device = "/dev/disk/by-label/... for me in multiple places.

@codgician
Copy link
Member Author

codgician commented Dec 1, 2024

Added fileSystems.<name>.label in the latest iteration.

@codgician I cannot see that in your latest push yet, it still shows device = "/dev/disk/by-label/... for me in multiple places.

The device definition has to be preserved because it is directly referenced if boot.growPartition is enabled:

rootDevice="${config.fileSystems."/".device}"
. If removed it leads to build failure.

A line of inherit (cfg) label; is added beneath for users needing to access the label.

@codgician
Copy link
Member Author

codgician commented Dec 12, 2024

Rebased with latest master branch (where services.waagent option module is introduced and replacing azure-agent.nix). Test flake also updated.

@AkechiShiro
Copy link
Contributor

AkechiShiro commented Dec 15, 2024

I've been able to do a quick test (systemd-boot x86) but I'm running into a few issues, I've used an old VHD (built before your last comment) and a freshly built one, I've noted the following :

  • Channel nixos and stateVersion set to 24.11 but uname -a shows I'm under 25.05 (waagent seems to run fine).
  • systemd-boot menu at boot is indeed usable in the serial console
  • Result generated is no longer a disk.vhd for the test flake but more something like nixos-image-azure-25.05.20241212.ec68a18-x86_64-linux.vhd (building a configuration with stateVersion set to 24.11 or 25.05 doesn't change the output format, it still build a nixos-image-azure-25.05.date.<hash>-x86_64-linux.vhd.

sudo nixos-rebuild switch yields this error (note nix-channel --update does work and nix-shell -p do work) :

while calling the 'findFile' builtin
         at /nix/store/pc1d5jpff9zqnkks7nkkk3snms7h57m6-source/nixos/default.nix:1:60:
            1| { configuration ? import ./lib/from-env.nix "NIXOS_CONFIG" <nixos-config>
             |                                                            ^
            2| , system ? builtins.currentSystem

       error: file 'nixos-config' was not found in the Nix search path (add it using $NIX_PATH or -I)

NIX_PATH is set to nixpkgs=flake:nixpkgs:/nix/var/nix/profiles/per-user/root/channels, I tried setting it to nixpkgs=flake:nixpkgs:/nix/var/nix/profiles/per-user/root/channels:nixos-config=/etc/nixos/configuration.nix but this still did not work.

configuration.nix contains exactly what is in nixos/modules/virtualisation/azure-config-user.nix, I'm not sure how you were able to nixos-rebuild switch successfully.

@codgician did you export specific NIX_PATH variable or am I testing in the wrong way ? (I'm using a bit of a modified version for this script : https://github.com/rudesome/nixos-on-azure/blob/master/upload-image.sh)

I don't see where the original configuration.nix that I had in the test repository is on the machine when I boot it, but it does seem to affect the image.

If I switch the stateVersion to 25.05 (and then build a new ISO with that configuration) :

  • waagent crashes at the moment => making the VM unable to join the Internet (logging into the serial console is still possible),
  • nixos channel is still set to 24.11
  • I believe one more patch/substitution is necessary to point it towards the proper path to openssl, there might be a way to configure this in the new waagent service you have introduced.
INFO Daemon Daemon Clean protocol and wireserver endpoint
INFO Daemon Daemon Wire server endpoint:168.63.129.16
INFO Daemon Daemon Fabric preferred wire protocol version:2015-04-05
INFO Daemon Daemon Wire protocol version:2012-11-30
INFO Daemon Daemon Server preferred version:2015-04-05
ERROR Daemon Daemon Event: name=WALinuxAgent, op=UnhandledError, message=[Errno 2] No such file or directory: '/usr/bin/openssl'
Traceback (most recent call last):
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/daemon/main.py", line 83, in run
    self.daemon(child_args)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/daemon/main.py", line 144, in daemon
    self.provision_handler.run()
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/pa/provision/cloudinit.py", line 53, in run
    self.protocol_util.get_protocol()  # Trigger protocol detection
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/protocol/util.py", line 299, in get_protocol
    protocol = self._detect_protocol(save_to_history=save_to_history, init_goal_state=init_goal_state)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/protocol/util.py", line 220, in _detect_protocol
    protocol.detect(init_goal_state=init_goal_state, save_to_history=save_to_history)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/protocol/wire.py", line 84, in detect
    cryptutil.gen_transport_cert(trans_prv_file, trans_cert_file)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/cryptutil.py", line 47, in gen_transport_cert
    shellutil.run_command(cmd)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 288, in run_command
    return __run_command(command_action=command_action, command=command, log_error=log_error, encode_output=encode_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 190, in __run_command
    return_code, stdout, stderr = command_action()
                                  ^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 255, in command_action
    process = _popen(command, stdin=popen_stdin, stdout=stdout, stderr=stderr, shell=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 398, in _popen
    process = subprocess.Popen(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/zv1kaq7f1q20x62kbjv6pfjygw5jmwl6-python3-3.12.7/lib/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/nix/store/zv1kaq7f1q20x62kbjv6pfjygw5jmwl6-python3-3.12.7/lib/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/openssl'
, duration=0
WARNING Daemon Daemon Daemon ended with exception -- Sleep 15 seconds and restart daemon

I believe adding this line into the postPatch of waagent would fix the issue :

   substituteInPlace azurelinuxagent/common/conf.py \
      --replace-fail "/usr/bin/openssl" ${openssl}/bin/openssl

@codgician
Copy link
Member Author

codgician commented Dec 16, 2024

@AkechiShiro Thank you for taking the time for testing. ❤️

Channel nixos and stateVersion set to 24.11 but uname -a shows I'm under 25.05 (waagent seems to run fine).

The reason you are seeing 24.11 is because the default system channel is still set to 24.11 as of today in master branch, see: https://github.com/NixOS/nixpkgs/blob/a9a3e341c6bc326e5a3de9324c33ba283f7ce242/nixos/modules/config/nix-channel.nix#L73C1-L74C1

Result generated is no longer a disk.vhd for the test flake but more something like nixos-image-azure-25.05.20241212.ec68a18-x86_64-linux.vhd (building a configuration with stateVersion set to 24.11 or 25.05 doesn't change the output format, it still build a nixos-image-azure-25.05.date.-x86_64-linux.vhd.

This is introduced recently by PR #359339. Since then the image name would follow image.fileName configuration. I have updated my flake and specified image.fileName = "disk.vhd" to avoid confusion.

sudo nixos-rebuild switch yields this error (note nix-channel --update does work and nix-shell -p do work) :

The /etc/nixos/configuration.nix inside the image is copied from azure-config-user.nix and does not reflect the actual configuration of the generated image. Thus, in my testing I have been rebuilding from flake instead of directly running sudo nixos-rebuild switch. I am open to any suggestions on improving this experience.

did you export specific NIX_PATH variable or am I testing in the wrong way ? (I'm using a bit of a modified version for this script : https://github.com/rudesome/nixos-on-azure/blob/master/upload-image.sh)

I did not export specific NIX_PATH (at least my modifications does not introduce any new exports). I think this is irrelevant with the script you are using and you are testing the right way.

waagent crashes at the moment => making the VM unable to join the Internet (logging into the serial console is still possible),

I added a default OS.OpensslPath configuration in the latest iteration to fix this.

image

@AkechiShiro
Copy link
Contributor

Thanks a lot @codgician for the amazing work on bringing more Azure support for NixOS, I think it would be great to not rely on flakes until they're stable, I'm not sure if there is a way to expose some kind of flag to let me use configuration.nix in order to use channels + npins for instance for dependencies.

I'll attempt to do a retest the week-end that is coming (on x86, arm, and also run GRUB).

Would you have also any idea on the process in Azure/MS side to sign the Lanzaboote shim ? If we could get SecureBoot + TPM on NixOS image in Azure, as well it would be very nice.

@codgician
Copy link
Member Author

Thanks a lot @codgician for the amazing work on bringing more Azure support for NixOS, I think it would be great to not rely on flakes until they're stable, I'm not sure if there is a way to expose some kind of flag to let me use configuration.nix in order to use channels + npins for instance for dependencies.

I'll attempt to do a retest the week-end that is coming (on x86, arm, and also run GRUB).

Would you have also any idea on the process in Azure/MS side to sign the Lanzaboote shim ? If we could get SecureBoot + TPM on NixOS image in Azure, as well it would be very nice.

It is possible to add custom Secure Boot key for Azure image via ARM template, referring: https://learn.microsoft.com/en-us/azure/virtual-machines/trusted-launch-secure-boot-custom-uefi. The custom key binds to an Azure Compute Gallery Image, requiring users to do some manual setups.

@codgician
Copy link
Member Author

codgician commented Dec 20, 2024

Do you have any recommandation to cross compile the Arm Image I haven't been able to figure it out for now.

In the GitHub Actions of test flake (later switched to Garnix for native aarch64 building), I leveraged QEMU to emulate ARM to build the image so the nix store hashes could be consistent with building natively on ARM machines. To cross-compile you may need to override stdenv in the input nixpkgs to use cross-compilation toolchain (learn more at https://nix.dev/tutorials/cross-compilation.html), but all built packages will result in different nix store hashes. Meaning, when you apply the cross-built image to a VM, you will have to rebuild everything on the first local nixos rebuild.

I also believe an improvement (could be in a follow up PR) may be to have nixos-generate-config script, add them (azure virtualisation module) in the hardware-configuration.nix generated file if it's easy to detect that we're running on Azure from the inside of the VM.

I don't know any offline way of easily detecting whether VM is running on Azure yet. Azure runs a special version of Windows on their bare-metal and uses the same Hyper-V hypervisor found on any Windows client (with some additional special features). But I like this idea, I will try to do some research on whether such detection is possible in coming weeks.

@AkechiShiro
Copy link
Contributor

AkechiShiro commented Dec 21, 2024

In order to crossCompile, I've added the following to my local configuration

boot.binfmt.emulatedSystems = [ "aarch64-linux" ];

This uses qemu to emulate arm in order to crossCompile the flake I believe, I was thus able to build the flake for aarch64, however after uploading it on Azure (the built VHD), it states that

the VM Image is not supported for Arm64
image

I wasn't able to test aarch64, trying to still boot the image as x64, it cannot find the OS :
image

Were you able to test arm64 on Azure @codgician ?

@codgician
Copy link
Member Author

codgician commented Dec 21, 2024

In order to crossCompile, I've added the following to my local configuration

boot.binfmt.emulatedSystems = [ "aarch64-linux" ];

This uses qemu to emulate arm in order to crossCompile the flake I believe, I was thus able to build the flake for aarch64, however after uploading it on Azure (the built VHD), it states that

the VM Image is not supported for Arm64

image

I wasn't able to test aarch64, trying to still boot the image as x64, it cannot find the OS :

image

Were you able to test arm64 on Azure @codgician ?

The VM image also needs to be created as Arm64, which require you to create an Azure Compute Gallery. I personally use Terraform (Terranix) to deploy my Azure resources, you can refer here to get the overall idea: https://github.com/codgician/serenitea-pot/blob/main/packages/terraform-config/celestia/vms/lumine/image.nix

Also this stackoverflow post could be helpful: https://stackoverflow.com/questions/75574115/create-azure-arm64-image-from-vhd

@AkechiShiro
Copy link
Contributor

AkechiShiro commented Dec 21, 2024

Trying to build the latest image after a pull for the last commits in the test flake, I'm encountering these errors at the moment (not sure what is the root cause, as I have enough memory and space) :

error while reading directory /build/root/etc: Cannot allocate memory
error while reading directory /build/root: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/gcroots/per-user: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/gcroots/auto: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/gcroots: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/temproots: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/db: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/profiles/per-user/root: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/profiles/per-user: Cannot allocate memory
error while reading directory /build/root/nix/var/nix/profiles: Cannot allocate memory
error while reading directory /build/root/nix/var/nix: Cannot allocate memory
error while reading directory /build/root/nix/var: Cannot allocate memory
...
error: builder for '/nix/store/r7rny8wplk7sla4jhm9j2kl1cv6f8rhn-azure-image.drv' failed with exit code 1;
       last 25 log lines:
       > error while reading directory /build/root/nix/store/px6akw851gkq4101y5m9vv0x7zd2f2v4-unit-container-.service: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages/msgpack-1.1.0.dist-info: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages/msgpack/__pycache__: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages/msgpack: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/nix-support: Cannot allocate memory
       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0: Cannot allocate memory
       > error while reading directory /build/root/nix/store/c0xl18ds7cf12aywlhfbm7j013k38xmk-unit-script-nixos-activation-start/bin: Cannot allocate memory
       > error while reading directory /build/root/nix/store/c0xl18ds7cf12aywlhfbm7j013k38xmk-unit-script-nixos-activation-start: Cannot allocate memory
       > error while reading directory /build/root/nix/store: Cannot allocate memory
       > error while reading directory /build/root/nix: Cannot allocate memory
       > error while reading directory /build/root: Cannot allocate memory
       > error while reading directory /build/root/root/.nix-defexpr: Cannot allocate memory
       > error while reading directory /build/root/root: Cannot allocate memory
       > error while reading directory /build/root: Cannot allocate memory
       > [   83.174364] reboot: Restarting system
       > thread 'main' panicked at src/main.rs:829:54:
       > called `Result::unwrap()` on an `Err` value: InitSeccompContext
       > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
       > thread 'main' panicked at src/main.rs:829:54:
       > called `Result::unwrap()` on an `Err` value: InitSeccompContext
       > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
       > qemu-system-aarch64: -chardev socket,id=store,path=virtio-store.sock: Failed to connect to 'virtio-store.sock': No such file or directory

x64 still works, I see that on GitHub Actions in the test flake, you're encountering a similar failure at the moment

@codgician
Copy link
Member Author

codgician commented Dec 21, 2024

Trying to build the latest image after a pull for the last commits in the test flake, I'm encountering these errors at the moment (not sure what is the root cause, as I have enough memory and space) :

error while reading directory /build/root/etc: Cannot allocate memory

error while reading directory /build/root: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/gcroots/per-user: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/gcroots/auto: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/gcroots: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/temproots: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/db: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/profiles/per-user/root: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/profiles/per-user: Cannot allocate memory

error while reading directory /build/root/nix/var/nix/profiles: Cannot allocate memory

error while reading directory /build/root/nix/var/nix: Cannot allocate memory

error while reading directory /build/root/nix/var: Cannot allocate memory

...

error: builder for '/nix/store/r7rny8wplk7sla4jhm9j2kl1cv6f8rhn-azure-image.drv' failed with exit code 1;

       last 25 log lines:

       > error while reading directory /build/root/nix/store/px6akw851gkq4101y5m9vv0x7zd2f2v4-unit-container-.service: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages/msgpack-1.1.0.dist-info: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages/msgpack/__pycache__: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages/msgpack: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12/site-packages: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib/python3.12: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/lib: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0/nix-support: Cannot allocate memory

       > error while reading directory /build/root/nix/store/g9sk34viln219wbj9imrpbywsq5md0cw-python3.12-msgpack-1.1.0: Cannot allocate memory

       > error while reading directory /build/root/nix/store/c0xl18ds7cf12aywlhfbm7j013k38xmk-unit-script-nixos-activation-start/bin: Cannot allocate memory

       > error while reading directory /build/root/nix/store/c0xl18ds7cf12aywlhfbm7j013k38xmk-unit-script-nixos-activation-start: Cannot allocate memory

       > error while reading directory /build/root/nix/store: Cannot allocate memory

       > error while reading directory /build/root/nix: Cannot allocate memory

       > error while reading directory /build/root: Cannot allocate memory

       > error while reading directory /build/root/root/.nix-defexpr: Cannot allocate memory

       > error while reading directory /build/root/root: Cannot allocate memory

       > error while reading directory /build/root: Cannot allocate memory

       > [   83.174364] reboot: Restarting system

       > thread 'main' panicked at src/main.rs:829:54:

       > called `Result::unwrap()` on an `Err` value: InitSeccompContext

       > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

       > thread 'main' panicked at src/main.rs:829:54:

       > called `Result::unwrap()` on an `Err` value: InitSeccompContext

       > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

       > qemu-system-aarch64: -chardev socket,id=store,path=virtio-store.sock: Failed to connect to 'virtio-store.sock': No such file or directory

x64 still works, I see that on GitHub Actions in the test flake, you're encountering a similar failure at the moment

This start to happen after rebasing with latest nixpkgs master, and seems to only happen when generating image with QEMU emulation. It does not happen when building on native aarch64 hardware (tested with NixOS in Parallels VM on my M2 Max MBP, and on Garnix.io: https://garnix.io/build/Q9PKxmbB which has native arm64 support referring garnix-io/issues#12).

I haven't figured out the root cause yet but the exception seem to happen in virtio-fsd (which is unrelated to this PR itself).

But to check whether it is an OOM issue, maybe try increasing memSize passed to make-disk-image.nix in azure-image.nix:

memSize ? 1024
. Currently this is kept as default (1024MiB). I will try this myself later today and will consider exposing this option if it turns out to be the root cause. (UPDATE: Doesn't look like OOM issue on my side)

@codgician codgician force-pushed the fix-azure-modules branch 2 times, most recently from eac97da to efa73b6 Compare December 21, 2024 10:05
@codgician
Copy link
Member Author

Did some further testing and found that this is not related to OOM. Removed the option added for testing purpose. Considering this issue only happen when building aarch64 with QEMU emulation under x86_64, I tend to think this issue is not caused by this PR.

I added a nixConfig block to the test flake with Garnix public binary cache in case you want to try out aarch64 image.

Copy link
Contributor

@AkechiShiro AkechiShiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've tested this PR enough, I think it's alright to merge as it is currently :

  • Tested x64_86 (Systemd-boot)
  • Tested ARM (Systemd-boot)

They boot all fine on my side, I haven't seen any issues on them and the serial console allows to select NixOS generations.

Thanks again for the work done @codgician

@AkechiShiro AkechiShiro added the 12.approvals: 2 This PR was reviewed and approved by two reputable people label Dec 21, 2024
Copy link
Contributor

@usertam usertam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nitpicking :D Thank you for your awesome work! Some maybe out-of-scope suggestions:

I think networking.useNetworkd needs lib.mkDefault true, because not setting one throws a warning:

evaluation warning: The combination of `systemd.network.enable = true`, `networking.useDHCP = true` and `networking.useNetworkd = false` can cause both networkd and dhcpcd to manage the same interfaces. This can lead to loss of networking. It is recommended you choose only one of networkd (by also enabling `networking.useNetworkd`) or scripting (by disabling `systemd.network.enable`)

I also consider making virtualisation.azure.acceleratedNetworking true by default, it doesn't seem to break things: https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview?tabs=redhat#limitations-and-constraints

Comment on lines 82 to 88
services.udev.extraRules =
with builtins;
concatStringsSep "\n" (
map (i: ''
ENV{DEVTYPE}=="disk", KERNEL!="sda" SUBSYSTEM=="block", SUBSYSTEMS=="scsi", KERNELS=="?:0:0:${toString i}", ATTR{removable}=="0", SYMLINK+="disk/by-lun/${toString i}"
'') (lib.range 1 15)
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
services.udev.extraRules =
with builtins;
concatStringsSep "\n" (
map (i: ''
ENV{DEVTYPE}=="disk", KERNEL!="sda" SUBSYSTEM=="block", SUBSYSTEMS=="scsi", KERNELS=="?:0:0:${toString i}", ATTR{removable}=="0", SYMLINK+="disk/by-lun/${toString i}"
'') (lib.range 1 15)
);
services.udev.extraRules = lib.concatMapStrings (i: ''
ENV{DEVTYPE}=="disk", KERNEL!="sda" SUBSYSTEM=="block", SUBSYSTEMS=="scsi", KERNELS=="?:0:0:${toString i}", ATTR{removable}=="0", SYMLINK+="disk/by-lun/${toString i}"
'') (lib.range 1 15);

Comment on lines 15 to 17

# Please update the VM Generation to the actual value
virtualisation.azureImage.vmGeneration = "v1";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Please update the VM Generation to the actual value
virtualisation.azureImage.vmGeneration = "v1";

Overwriting an option with its default value seems redundant. Maybe put this to the comment above?

@usertam
Copy link
Contributor

usertam commented Dec 29, 2024

Screenshot 2024-12-29 at 7 12 46 AM

It boots and works on my own config. I had some hiccups because make-disk-image.nix copied one nixpkgs over and exhausted all the inodes. 1 Running nix store optimise helps.

Screenshot 2024-12-29 at 8 19 18 AM

But then I can't get Accelerated Networking working even after modprobe and messing with the NIC. I am on a F1s instance which should support that, I think.

@codgician
Copy link
Member Author

codgician commented Dec 29, 2024

Screenshot 2024-12-29 at 7 12 46 AM

It boots and works on my own config. I had some hiccups because make-disk-image.nix copied one nixpkgs over and exhausted all the inodes. 1 Running nix store optimise helps.

Screenshot 2024-12-29 at 8 19 18 AM

But then I can't get Accelerated Networking working even after modprobe and messing with the NIC. I am on a F1s instance which should support that, I think.

Accelerated networking needs to be enabled in network interface configuration on Azure portal:

image

Thank you for taking your time testing this change!

@codgician
Copy link
Member Author

codgician commented Dec 29, 2024

Just some nitpicking :D Thank you for your awesome work! Some maybe out-of-scope suggestions:

I think networking.useNetworkd needs lib.mkDefault true, because not setting one throws a warning:

evaluation warning: The combination of `systemd.network.enable = true`, `networking.useDHCP = true` and `networking.useNetworkd = false` can cause both networkd and dhcpcd to manage the same interfaces. This can lead to loss of networking. It is recommended you choose only one of networkd (by also enabling `networking.useNetworkd`) or scripting (by disabling `systemd.network.enable`)

I also consider making virtualisation.azure.acceleratedNetworking true by default, it doesn't seem to break things: https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview?tabs=redhat#limitations-and-constraints

For networking.useNetworkd, the module did not force it to true because I found waagent also works well with NetworkManager, so I'd prefer to leave this option to end-user. I did a quick search and did not see any azure modules setting systemd.network.enable = true; by default, but please let me know if I've missed anything.

And for making virtualisation.azure.acceleratedNetworking true by default, this is a good point, but I think we can do that in future PRs (after this change reaches more audience for testing).

And thank you for your awesome suggestions on code improvement, I have applied them in the latest commit.

@wegank wegank removed the 12.approvals: 2 This PR was reviewed and approved by two reputable people label Dec 30, 2024
@usertam usertam added the 12.approvals: 1 This PR was reviewed and approved by one reputable person label Jan 3, 2025
@wegank wegank added 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people and removed 12.approvals: 1 This PR was reviewed and approved by one reputable person labels Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: changelog 8.has: documentation This PR adds or changes documentation 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants