Boot failures: md1.uzip: UZIP(zlib) inflate() failed, kernel panics and errno EFAULT #102

Closed
grahamperrin opened this issue Jan 16, 2021 · 29 comments

@grahamperrin
Contributor

grahamperrin commented Jan 16, 2021

cb71800

ISO written to physical media

hello-0.4.0_0D8-FreeBSD-12.1-amd64.iso written more than once to a Kingston DataTraveler 3.0, which was stress-tested without error a few days ago, repeatedly failed to boot on an Ergo Vista 631.

First attempt: a sudden restart.

A subsequent attempt:

image

I tried a different USB port. Still no boot.

I destroyed the GPT, created a new GPT with an ms-basic-data partition, added msdosfs then ran one round of stressdisk, without error: stressdisk.txt
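The repartitioning described above can be sketched roughly as follows. This is only an illustration, not the exact commands that were run: the device name `da0` is an assumption (check `gpart show` first), and the stressdisk run is left out because its invocation is not shown in this thread. Nothing is executed unless `RUN=1` is set, since these commands destroy data.

```shell
#!/bin/sh
# Sketch: re-create a GPT with one ms-basic-data partition plus msdosfs.
# DISK is a hypothetical device name; verify it before setting RUN=1.
DISK="${DISK:-da0}"

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

prepare_stick() {
    run gpart destroy -F "$DISK"            # remove the existing partition table
    run gpart create -s gpt "$DISK"         # new, empty GPT
    run gpart add -t ms-basic-data "$DISK"  # single partition, appears as ${DISK}p1
    run newfs_msdos "/dev/${DISK}p1"        # FAT file system on that partition
}

prepare_stick
```

Run it once as-is to review the printed commands, then again with `RUN=1` as root if they look right for the device.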

ISO alone

VirtualBox, given all four processors of an HP EliteBook 8570p:

2021-01-16 09:57:41

  • screenshot taken 2021-01-16 09:57:41
  • I chose to stop the machine after around one minute (2021-01-16 09:56 VBox.log) because the dd routine appeared to make no progress beyond what's seen in the screenshot.

Again, with four processors:

2021-01-16 10:07:17

A few days ago I began suspecting that with some builds, boot failures are more likely with more than one processor given to a (virtual) machine.

Today, limited to one processor: success, 2021-01-16 10:11 VBox.log

Shut down, started with one processor: another success, 2021-01-16 10:34 VBox.log

Shut down, added three processors, started: the dd + ZFS receive routine averaged around 11 MB/s, then came a sudden restart, after which boot proceeded:

– muted because three seconds was not long enough for me to reach the screen 👎
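For anyone wanting to repeat the one-processor versus four-processor comparison, a minimal sketch using VBoxManage follows. The guest name `helloSystem-test` is hypothetical, and the guest has to be powered off before its processor count can be changed. Commands are only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: start the same VirtualBox guest with a chosen processor count.
# VM is a hypothetical guest name.
VM="${VM:-helloSystem-test}"

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

boot_with_cpus() {
    run VBoxManage modifyvm "$VM" --cpus "$1"  # only works while the guest is powered off
    run VBoxManage startvm "$VM"
}

boot_with_cpus 1
```

After powering the guest off again, `boot_with_cpus 4` repeats the boot with four processors.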

@grahamperrin
Contributor Author

With normal mode failing on the Ergo Vista 621, my next step would have been safe mode, however:

@probonopd
Member

~1/10 experimental builds have a failure that makes them unable to boot. I do not know how to fix this yet.

@grahamperrin
Contributor Author

Understood, thanks. If I can get safe mode, we might be closer to understanding and maybe avoiding the issue.

@probonopd
Member

Yes, I'd just conclude that 0D8 is one of those broken builds. 0D11 is out now :)

@grahamperrin changed the title from "0.4.0 (0D8) sometimes failing to boot on multiple machines" to "0.4.0 (0D8) and (0D11) sometimes failing to boot on multiple machines" on Jan 17, 2021
@grahamperrin changed the title from "0.4.0 (0D8) and (0D11) sometimes failing to boot on multiple machines" to "0.4.0 (0D8), (0D11) … sometimes failing to boot on multiple machines" on Jan 17, 2021
@grahamperrin
Contributor Author

0D11 above #102 (comment) and below, with an Ergo Vista 631.

Happily, now we have a panic with a visible backtrace:

image

@probonopd
Member

Yay, indeed, another broken ISO. Triggering rebuild... this is really getting annoying but I don't know how to resolve it.

@probonopd
Member

I suspect that the ZFS image is simply broken, e.g., because it was not written correctly due to some caching race condition at creation time. Something along those lines. Pure suspicion though.

@grahamperrin
Contributor Author

It would be nice to have OpenZFS, but IIRC it's not packaged where you want it.

@grahamperrin changed the title from "0.4.0 (0D8), (0D11) … sometimes failing to boot on multiple machines" to "0.4.0 (0D8), (0D11), (0D12) … sometimes failing to boot on multiple machines" on Jan 21, 2021

@probonopd
Member

What does

sudo dmesg | grep rtc

give?

I get:

Users-RevoOne-RL85% sudo dmesg | grep rtc 
efirtc0: <EFI Realtime Clock> on motherboard
efirtc0: registered as a time-of-day clock, resolution 1.000000s
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
atrtc0: registered as a time-of-day clock, resolution 1.000000s

According to https://lists.freebsd.org/pipermail/freebsd-current/2020-May/076128.html it seems like only one of the two should be attached, but I have both attached without any apparent issues.
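To compare machines quickly, a small filter like the one below lists which drivers registered as a time-of-day clock. It is only a sketch: the function name `tod_clocks` is made up here, and it assumes the dmesg wording shown in the output above.

```shell
#!/bin/sh
# Sketch: list drivers that registered as a time-of-day clock.
# Reads dmesg-style text on stdin, so it also works on saved logs.
tod_clocks() {
    sed -n 's/^\([a-z]*[0-9]*\): registered as a time-of-day clock.*/\1/p' | sort -u
}
```

Usage: `sudo dmesg | tod_clocks`. Each driver is printed once, even if dmesg contains the same lines from more than one boot.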

Who knows more about this topic?

@grahamperrin
Contributor Author

grahamperrin commented Jan 24, 2021

0.4.0 (0D13) and 0.4.0 (0D18)

Booted more than once without error in a VirtualBox guest (0D18).

Mostly without error with an Ergo Vista 631 (0D13 on a USB flash drive).

On the Ergo:

FreeBSD% sudo dmesg | grep rtc
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
atrtc0: registered as a time-of-day clock, resolution 1.000000s
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
atrtc0: registered as a time-of-day clock, resolution 1.000000s
FreeBSD% date ; uname -v ; uptime ; pkg query '%o %v %R' hello 
Sun Jan 24 02:54:43 EST 2021
FreeBSD 12.1-RELEASE r354233 GENERIC 
 2:54AM  up 14 mins, 0 users, load averages: 0.45, 0.50, 0.36
helloSystem 0.4.0_0D13 unknown-repository
FreeBSD% 

@grahamperrin changed the title from "0.4.0 (0D8), (0D11), (0D12) … sometimes failing to boot on multiple machines" to "0.3.0 (0C164) … 0.4.0 (0D8), (0D11), (0D12) … sometimes failing to boot on multiple machines" on Feb 3, 2021
@grahamperrin changed the title from "0.3.0 (0C164) … 0.4.0 (0D8), (0D11), (0D12) … sometimes failing to boot on multiple machines" to "0.3.0 (0C164) … 0.4.0 (0D8), (0D11), (0D12) … sometimes fails to boot" on Feb 3, 2021

@grahamperrin
Contributor Author

Advice received via e-mail, quoted without attribution with permission, with added emphasis:

… a controlled environment where you can easily run experiments and will be an experiment in itself:

Install pristine FreeBSD 12.1 on a VM, then after normal multi-user boot try executing the commands that helloSystem executes in its init.sh script.

The most important command is the mdconfig that makes data/system.uzip into a read-only md1 device.

You can read from this device which will force the decompression that fails to happen.

You can try without the 'zfs recv' command and even without 'zfs send', reading with dd from the md1 device instead.

If you manage to reproduce the issue on pristine FreeBSD, you will get more interest from FreeBSD developers.

You can use the same pristine FreeBSD 12.1 system to try decompressing the .uzip file of many different helloSystem ISO's which would allow you to see if the problem is data-dependent. …

https://www.freebsd.org/cgi/man.cgi?query=mdconfig&sektion=8&manpath=FreeBSD+12.1-RELEASE+and+Ports
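A rough sketch of the suggested experiment on a pristine FreeBSD 12.1 system, reading with dd instead of using zfs send and zfs recv. This is not copied from helloSystem's init.sh: the image path and the md unit number are assumptions, chosen so that the device appears as md1.uzip. Commands are only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: attach a .uzip image read-only and read it from start to end.
# UZIP (path to the image) and UNIT (md unit number) are assumptions.
UZIP="${UZIP:-./data/system.uzip}"
UNIT="${UNIT:-1}"

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

read_uzip() {
    run kldload -n geom_uzip                                     # skip loading if already loaded
    run mdconfig -a -t vnode -o readonly -f "$UZIP" -u "$UNIT"   # creates /dev/md$UNIT
    run dd if="/dev/md${UNIT}.uzip" of=/dev/null bs=1m           # reading forces decompression
    run mdconfig -d -u "$UNIT"                                   # detach the md device afterwards
}

read_uzip
```

Pointing `UZIP` at the system.uzip of several different ISOs would be one way to check whether the failure is data-dependent, as the advice suggests.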

@grahamperrin changed the title from "0.3.0 (0C164) … 0.4.0 (0D8), (0D11), (0D12) … sometimes fails to boot" to "0.3.0 (0C164) … 0.4.0 (0D8), (0D11), (0D12) … 0.5.0 (0E17) … sometimes fails to boot" on Mar 6, 2021

@grahamperrin
Contributor Author

grahamperrin commented Mar 18, 2021

Boot failures: md1.uzip: UZIP(zlib) inflate() failed, kernel panics and errno EFAULT

From opening post #102 (comment)

image

– and from #102 (comment)

… a panic with a visible backtrace: …

12.1-RELEASE and 12.2-RELEASE as bases for helloSystem

FreeBSD bug 254318 – [panic] when a specific sequence of read requests is issued to a geom_uzip device the kernel panics

  • covers helloSystem boot time system crash and md1.uzip: UZIP(zlib) inflate() failed scenarios.

I might treat helloSystem boot time sudden restarts as symptomatic of system crashes, with possibly the same cause.

13.⋯

Bug 254318 observes:


On FreeBSD/amd64 13.0-RC2 a different symptom is observed:

  • There is no kernel panic, but some of the read requests fail with errno EFAULT even though they should succeed.

On FreeBSD/amd64 14.0-CURRENT (from FreeBSD-14.0-CURRENT-amd64-20210311-15565e0a217-257277-disc1.iso) the behaviour is the same as on 13.0-RC2.

@grahamperrin changed the title from "0.3.0 (0C164) … 0.4.0 (0D8), (0D11), (0D12) … 0.5.0 (0E17) … sometimes fails to boot" to "Boot failures: md1.uzip: UZIP(zlib) inflate() failed, kernel panics and errno EFAULT" on Mar 19, 2021
@grahamperrin
Contributor Author

Boot apparently stalled at/around providing initial system time

From #102 (comment)

efirtc0: providing initial system time

– and:

atrtc0: providing initial system time

If this symptom recurs, raise a separate issue.

@grahamperrin
Contributor Author

Out of memory

If there's an OOM boot failure that's not covered by #130, raise a separate issue. Essentially:

  • configure a dump device
  • set vm.panic_on_oom to 1

Incidentally, from https://cgit.freebsd.org/src/commit/?id=3c200db9d2a831e17307710bfd1b581aa325cee2 (2020-02-10):

Modify the vm.panic_on_oom sysctl to take a count of events.

… This change is helpful in capturing cores when the system is in a perpetual cycle of out-of-memory events (as opposed to just hitting one or two sporadic out-of-memory events). …

Whilst vm.panic_on_oom can be greater than 1, I don't imagine this being useful in OOM boot failure situations.
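On an installed system, the two steps could look roughly like the sketch below. It assumes a swap device large enough to hold a dump, and it may not carry over directly to a read-only Live ISO. Commands are only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: make an out-of-memory event panic the kernel and leave a dump.
# Needs root on FreeBSD when actually run.

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

enable_oom_dump() {
    run sysrc dumpdev=AUTO        # rc.conf: choose a dump device at the next boot
    run sysctl vm.panic_on_oom=1  # panic on the first out-of-memory event
}

enable_oom_dump
```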

@probonopd
Member

Apparently the root cause of md1.uzip: UZIP(zlib) inflate() failed is FreeBSD Bug 254318 - [panic] when a specific sequence of read requests is issued to a geom_uzip device the kernel panics. There is even a small program for reproduction.

Hence reopening this issue until https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254318 is resolved and integrated, and/or we are no longer using geom_uzip.

@probonopd reopened this on Apr 8, 2021
@probonopd
Member

Using a completely new Live system architecture in 0.7.0, so closing.
