
Rustic backup to B2 dies after error instead of retrying #1326

Open
ociaw opened this issue Oct 16, 2024 · 11 comments
Labels
A-backends (Area: Related to backends in `rustic-core`) · A-cli (Area: `rustic` command line interface) · C-bug (Category: Something isn't working as expected)

Comments


ociaw commented Oct 16, 2024

Rustic keeps dying during my backups to B2 using opendal:b2. As a result, I have to repeatedly re-run rustic to complete the backup. During the backup, sometimes when B2 returns an error (usually "no tomes available"), rustic abruptly stops with no error message and an exit code of 141. However, this doesn't always happen - other times it correctly retries and continues with the backup.

I'm unsure why there's no error message, or what the significance of the 141 exit code is. I've run rustic with the log level set to DEBUG, but that hasn't seemed to reveal anything more. Is there something more I can do to debug this better?

Tested Versions: 0.8.0, 0.9.1, 0.9.3
OS: TrueNas 13.1-RELEASE-p9

Truncated rustic log with Log Level = DEBUG
2024-10-11T06:22:49.795402039Z [DEBUG] (240) hyper_util::client::legacy::pool: reuse idle connection for ("https", api000.backblazeb2.com)
2024-10-11T06:22:49.846311897Z [DEBUG] (240) reqwest::connect: starting new connection: https://pod-000-1075-04.backblaze.com/
2024-10-11T06:22:49.846506576Z [DEBUG] (240) hyper_util::client::legacy::connect::dns: resolve; host=pod-000-1075-04.backblaze.com
2024-10-11T06:22:49.846686687Z [DEBUG] (3) hyper_util::client::legacy::pool: pooling idle connection for ("https", api000.backblazeb2.com)
2024-10-11T06:22:49.846892617Z [DEBUG] (477) hyper_util::client::legacy::connect::dns: resolving host="pod-000-1075-04.backblaze.com"
2024-10-11T06:22:49.858056306Z [DEBUG] (1) rustic_core::archiver::tree_archiver: new       file: "/file/name/is/redacted.flac"
2024-10-11T06:22:49.939037195Z [DEBUG] (240) hyper_util::client::legacy::connect::http: connecting to 149.137.132.124:443
2024-10-11T06:22:49.979073034Z [DEBUG] (240) hyper_util::client::legacy::connect::http: connected to 149.137.132.124:443
2024-10-11T06:22:50.138864052Z [WARN] will retry after 1.914994597s because: Unexpected (temporary) at Writer::close, context: { uri: https://pod-000-1075-04.backblaze.com/b2api/v2/b2_upload_file/192976a3bf05eebe6a08d3ac/c000_v0001075_t0004, response: Parts { status: 503, version: HTTP/1.1, headers: {"server": "nginx", "date": "Fri, 11 Oct 2024 06:22:50 GMT", "content-type": "application/json;charset=utf-8", "content-length": "87", "connection": "keep-alive", "cache-control": "max-age=0, no-cache, no-store", "strict-transport-security": "max-age=63072000"} }, service: b2, path: data/13/13a1ac8ec8f274cf7ac57483255dd7e465a636d03f24a0af8a39900ae7ecaea8, written: 72223669 } => B2Error { status: 503, code: "service_unavailable", message: "no tomes available" }
My anonymized config
[backup]
globs = [ "!*/Thumbs.db", "!*/.recycle/*" ]
exclude-if-present = [".nobackup"]

[[backup.snapshots]]
label = "home/user1"
sources = [ "/mnt/home/user1" ]
[[backup.snapshots]]
label = "home/user2"
sources = [ "/mnt/home/user2" ]
[[backup.snapshots]]
label = "home/user3"
sources = [ "/mnt/home/user3" ]
[[backup.snapshots]]
label = "home/user4"
sources = [ "/mnt/home/user4" ]
[[backup.snapshots]]
label = "shared/user1"
sources = [ "/mnt/shared/user1/documents", "/mnt/shared/user1/music", "/mnt/shared/user1/pictures" ]
[[backup.snapshots]]
label = "shared/user4"
sources = [ "/mnt/shared/user4/Documents", "/mnt/shared/user4/Music" ]
[[backup.snapshots]]
label = "shared/user2"
sources = [ "/mnt/shared/user2" ]
[[backup.snapshots]]
label = "shared/user3"
sources = [ "/mnt/shared/user3" ]

[repository]
repository = "opendal:b2"
password = "password"

[repository.options] 
application_key_id = "id"
application_key = "key"
bucket = "bucket"
bucket_id = "192976a3bf05eebe6a08d3ac"
github-actions bot added the S-triage (Status: Waiting for a maintainer to triage this issue/PR) label on Oct 16, 2024

aawsome commented Oct 16, 2024

Thanks for opening this issue @ociaw

In fact, yes, rustic should retry here, as this is classified as a temporary error - and the log even says it is retrying.

But instead of retrying, it is simply stopping. The return code 141 indicates that rustic was terminated by a SIGPIPE signal. Can you show how you are calling the rustic binary? Could it be that you pipe the output to some other program?
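For reference: Unix shells report "killed by signal N" as exit code 128 + N, and SIGPIPE is signal 13, which is where the 141 comes from. A minimal, Unix-only sketch of how a wrapper could distinguish a signal death from a normal exit - the `rustic backup` invocation here is purely illustrative:

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Spawn the backup and inspect how it terminated.
    let status = Command::new("rustic").arg("backup").status()?;

    if let Some(sig) = status.signal() {
        // SIGPIPE is signal 13; a shell reports this as 128 + 13 = 141.
        eprintln!("rustic was terminated by signal {sig}");
    } else if let Some(code) = status.code() {
        eprintln!("rustic exited with code {code}");
    }
    Ok(())
}
```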


ociaw commented Oct 17, 2024

Hmm, no piping, but I'm running it inside screen, with the fish shell, if that matters.
The exact command is `rustic backup --log-level Debug --log-file ~/rustic-log.log`, though I only added the log-level and log-file options while trying to track down this issue.

nardoor added the C-bug (Category: Something isn't working as expected), A-cli (Area: `rustic` command line interface), and A-backends (Area: Related to backends in `rustic-core`) labels and removed the S-triage label on Nov 17, 2024

aawsome commented Dec 13, 2024

Hi @ociaw, in rustic 0.9.5 we improved the error output - can you please check whether you still see the same behavior and information with the newest version? Thanks a lot!


ociaw commented Dec 15, 2024

Updated to 0.9.5, but I'm still encountering the same issue, and I don't see any additional information in the log. Same exit code (141) and an abrupt stop. I ran it with the TRACE log level this time:

Truncated log with Log Level = Trace
2024-12-14T18:35:47.075298703Z [DEBUG] (278) opendal::services: service=b2 name= : metadata started
2024-12-14T18:35:47.075413681Z [DEBUG] (278) opendal::services: service=b2 name= : metadata finished
2024-12-14T18:35:47.075605243Z [DEBUG] (278) opendal::services: service=b2 name= path=data/21/21ab9e2b413aa1eb0060a0624c138f957e7f6156e2a7ca4af95b52d10f866b93: write started
2024-12-14T18:35:47.075819238Z [DEBUG] (278) opendal::services: service=b2 name= path=data/21/21ab9e2b413aa1eb0060a0624c138f957e7f6156e2a7ca4af95b52d10f866b93: write created writer
2024-12-14T18:35:47.076056447Z [TRACE] (278) opendal::services: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/opendal-0.50.2/src/layers/logging.rs:220] service=b2 name= path=data/21/21ab9e2b413aa1eb0060a0624c138f957e7f6156e2a7ca4af95b52d10f866b93 written=0 size=78056970: Writer::write started
2024-12-14T18:35:47.076379618Z [TRACE] (278) opendal::services: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/opendal-0.50.2/src/layers/logging.rs:220] service=b2 name= path=data/21/21ab9e2b413aa1eb0060a0624c138f957e7f6156e2a7ca4af95b52d10f866b93 written=78056970 size=78056970: Writer::write succeeded
2024-12-14T18:35:47.07665525Z [DEBUG] (278) opendal::services: service=b2 name= path=data/21/21ab9e2b413aa1eb0060a0624c138f957e7f6156e2a7ca4af95b52d10f866b93 written=78056970: Writer::close started
2024-12-14T18:35:47.076952603Z [TRACE] (278) hyper_util::client::legacy::pool: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/pool.rs:645] take? ("https", api000.backblazeb2.com): expiration = Some(90s)
2024-12-14T18:35:47.077209448Z [TRACE] (278) hyper_util::client::legacy::pool: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/pool.rs:307] removing closed connection for ("https", api000.backblazeb2.com)
2024-12-14T18:35:47.077404297Z [TRACE] (278) hyper_util::client::legacy::pool: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/pool.rs:671] checkout waiting for idle connection: ("https", api000.backblazeb2.com)
2024-12-14T18:35:47.077613863Z [DEBUG] (278) reqwest::connect: starting new connection: https://api000.backblazeb2.com/
2024-12-14T18:35:47.077766392Z [TRACE] (278) hyper_util::client::legacy::connect::http: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/http.rs:383] Http::connect; scheme=Some("https"), host=Some("api000.backblazeb2.com"), port=None
2024-12-14T18:35:47.07799247Z [DEBUG] (278) hyper_util::client::legacy::connect::dns: resolve; host=api000.backblazeb2.com
2024-12-14T18:35:47.078541117Z [TRACE] (772) tracing::span::active: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/dns.rs:121] -> resolve;
2024-12-14T18:35:47.093352109Z [TRACE] (772) tracing::span::active: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/dns.rs:121] <- resolve;
2024-12-14T18:35:47.093571423Z [TRACE] (772) tracing::span: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/dns.rs:121] -- resolve;
2024-12-14T18:35:47.093744391Z [DEBUG] (278) hyper_util::client::legacy::connect::http: connecting to 104.153.233.176:443
2024-12-14T18:35:47.093982291Z [TRACE] (278) mio::poll: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mio-1.0.3/src/poll.rs:571] registering event source with poller: token=Token(34827476864), interests=READABLE | WRITABLE
2024-12-14T18:35:47.134524504Z [DEBUG] (278) hyper_util::client::legacy::connect::http: connected to 104.153.233.176:443
2024-12-14T18:35:47.177162962Z [TRACE] (278) hyper_util::client::legacy::client: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/client.rs:592] http1 handshake complete, spawning background dispatcher task
2024-12-14T18:35:47.177395853Z [TRACE] (278) hyper_util::client::legacy::pool: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/pool.rs:713] checkout dropped for ("https", api000.backblazeb2.com)
2024-12-14T18:35:47.223064294Z [TRACE] (278) hyper_util::client::legacy::pool: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/pool.rs:349] put; add idle connection for ("https", api000.backblazeb2.com)
2024-12-14T18:35:47.223244655Z [DEBUG] (278) hyper_util::client::legacy::pool: pooling idle connection for ("https", api000.backblazeb2.com)
2024-12-14T18:35:47.22342171Z [TRACE] (278) hyper_util::client::legacy::pool: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/pool.rs:671] checkout waiting for idle connection: ("https", pod-000-1057-10.backblaze.com)
2024-12-14T18:35:47.223541276Z [DEBUG] (278) reqwest::connect: starting new connection: https://pod-000-1057-10.backblaze.com/
2024-12-14T18:35:47.223645052Z [TRACE] (278) hyper_util::client::legacy::connect::http: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/http.rs:383] Http::connect; scheme=Some("https"), host=Some("pod-000-1057-10.backblaze.com"), port=None
2024-12-14T18:35:47.223788334Z [DEBUG] (278) hyper_util::client::legacy::connect::dns: resolve; host=pod-000-1057-10.backblaze.com
2024-12-14T18:35:47.22391826Z [TRACE] (772) tracing::span::active: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/dns.rs:121] -> resolve;
2024-12-14T18:35:47.287921438Z [TRACE] (772) tracing::span::active: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/dns.rs:121] <- resolve;
2024-12-14T18:35:47.288098203Z [TRACE] (772) tracing::span: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/connect/dns.rs:121] -- resolve;
2024-12-14T18:35:47.288281338Z [DEBUG] (278) hyper_util::client::legacy::connect::http: connecting to 149.137.131.15:443
2024-12-14T18:35:47.288474505Z [TRACE] (278) mio::poll: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mio-1.0.3/src/poll.rs:571] registering event source with poller: token=Token(34827492480), interests=READABLE | WRITABLE
2024-12-14T18:35:47.323200473Z [DEBUG] (278) hyper_util::client::legacy::connect::http: connected to 149.137.131.15:443
2024-12-14T18:35:47.359271576Z [TRACE] (278) hyper_util::client::legacy::client: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/client.rs:592] http1 handshake complete, spawning background dispatcher task
2024-12-14T18:35:47.359494228Z [TRACE] (278) hyper_util::client::legacy::pool: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.10/src/client/legacy/pool.rs:713] checkout dropped for ("https", pod-000-1057-10.backblaze.com)
2024-12-14T18:35:52.069600774Z [TRACE] (4) mio::poll: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mio-1.0.3/src/poll.rs:702] deregistering event source from poller
2024-12-14T18:35:52.22041375Z [TRACE] (4) mio::poll: [/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mio-1.0.3/src/poll.rs:702] deregistering event source from poller
2024-12-14T18:35:59.183423094Z [WARN] will retry after 1.820987225s because: Unexpected (temporary) at Writer::close, context: { uri: https://pod-000-1057-10.backblaze.com/b2api/v2/b2_upload_file/a2ea42da80b38dca9111011a/c000_v0001057_t0048, response: Parts { status: 500, version: HTTP/1.1, headers: {"server": "nginx", "date": "Sat, 14 Dec 2024 18:35:59 GMT", "content-type": "application/json;charset=utf-8", "content-length": "105", "connection": "keep-alive", "cache-control": "max-age=0, no-cache, no-store", "strict-transport-security": "max-age=63072000"} }, service: b2, path: data/21/21ab9e2b413aa1eb0060a0624c138f957e7f6156e2a7ca4af95b52d10f866b93, written: 78056970 } => B2Error { status: 500, code: "internal_error", message: "incident id bdd9015b52f7-1e7a1f386fd0802e" }


ociaw commented Dec 21, 2024

Ran a debug build under rust-gdb and got more information:

Thread 8 "tokio-runtime-worke" received signal SIGPIPE, Broken pipe.
Sent by kernel.

Running `info stack` in gdb yields this for thread 8:

#8  tokio::runtime::io::registration::Registration::poll_io<usize, tokio::io::poll_evented::{impl#6}::poll_write_vectored::{closure_env#0}<mio::net::tcp::stream::TcpStream>> (self=0x81d15ad00, cx=0x7fffdf3f6a80, 
    direction=<error reading variable: Cannot access memory at address 0x1>, f=...)
    at src/runtime/io/registration.rs:171

Unfortunately, I'm not very familiar with GDB so I'm not sure how to get more useful information. I guess the TCP stream is closed on the other end, and rustic/opendal tries to write to it, causing the kernel to send SIGPIPE?
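That hypothesis is straightforward to exercise in isolation. A minimal sketch, assuming the `libc` crate and a Unix target, that resets SIGPIPE to SIG_DFL (plain Rust binaries normally start with it set to SIG_IGN) and then writes to a TCP socket whose peer has already closed:

```rust
use std::io::Write;
use std::net::{TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    // Restore the default (fatal) SIGPIPE action for this experiment.
    unsafe {
        libc::signal(libc::SIGPIPE, libc::SIG_DFL);
    }

    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    let (server, _) = listener.accept()?;
    drop(server); // the peer closes the connection

    // Once the peer's close is seen, a write raises SIGPIPE; with SIG_DFL the
    // process is killed (shell exit code 141) before any error is returned.
    // With SIG_IGN the same write would instead fail with an EPIPE io::Error.
    loop {
        client.write_all(&[0u8; 4096])?;
    }
}
```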


ociaw commented Dec 22, 2024

I ended up doing the obvious thing and changing the SIGPIPE handler from SIG_DFL to SIG_IGN. It's been working great for me so far - I was able to back up 300 GB or so in one shot.

Is there a reason SIG_DFL is being used instead of SIG_IGN? IMO explicitly ignoring the signal is better than letting the kernel decide what to do and getting inconsistent behavior across operating systems or distributions.
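For anyone who wants to try the same workaround, a rough sketch of what such a change can look like, assuming the `libc` crate and a Unix target - this is not rustic's actual code:

```rust
/// Ignore SIGPIPE for the whole process, ideally before doing anything else.
/// Writes to a closed pipe or socket then fail with an EPIPE io::Error that
/// callers (e.g. retry logic) can handle, instead of killing the process.
fn ignore_sigpipe() {
    unsafe {
        libc::signal(libc::SIGPIPE, libc::SIG_IGN);
    }
}

fn main() {
    ignore_sigpipe();
    // ... run the CLI as usual
}
```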


aawsome commented Dec 22, 2024

@ociaw The idea behind the SIGPIPE handler is to get correct error codes in failing-pipe situations.
We especially had in mind `command | rustic backup -`, where `command` fails for some reason. Without handling SIGPIPE, this would not back up the expected output of `command` but might save truncated or empty data, with rustic returning a zero exit code...

I still wonder what leads to the SIGPIPE signal that rustic is obviously receiving in your case...


aawsome commented Dec 22, 2024

Of course, we can either add an option to opt out of the SIGPIPE handling, or maybe only apply it in very specific situations (like the `backup -` case). However, I would like to understand what's going on before we propose a change in rustic here.
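A hypothetical shape of that opt-out, assuming the `libc` crate and a Unix target - the function and parameter names are illustrative, not rustic's real API:

```rust
/// Keep the default (fatal) SIGPIPE action only when a backup source is read
/// from stdin (the `rustic backup -` case); otherwise ignore the signal so a
/// dropped backend connection surfaces as an EPIPE error the retry logic can
/// handle.
fn setup_sigpipe(sources: &[String]) {
    let reads_stdin = sources.iter().any(|s| s == "-");
    if !reads_stdin {
        unsafe {
            libc::signal(libc::SIGPIPE, libc::SIG_IGN);
        }
    }
}
```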


Ekleog commented Jan 4, 2025

FWIW, I hit the same 141 exit code issue with a REST server backend; the cause was that my nginx reverse proxy had not set a large enough `client_max_body_size`.

In my case I'd say the error message could have been improved; but otherwise I wouldn't have wanted rustic to silently make it look like it worked while it didn't :)

@beeequeue

Is there a workaround for this issue? Compiling a custom binary with SIG_IGN? It's been completely stopping me from backing up my data for half a year as it always encounters this error eventually. 😞


christophershirk commented Jan 12, 2025

I have this error as well with the opendal:b2 backend, using rustic commit d1b11f9. I can reproduce it without running in screen/tmux, and I'm invoking rustic manually - no script.

log:
[INFO] starting to backup TempForMerging ...
[00:12:37] backing up... ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 11.45 GiB/70.38 GiB 18.21 MiB/s (ETA 65m)
[WARN] will retry after 1.191994429s because: Unexpected (temporary) at Writer::close, context: { uri: https://pod-000-1080-05.backblaze.com/b2api/v2/b2_upload_file/{redacted}, response: Parts { status: 503, version: HTTP/1.1, headers: {"server": "nginx", "date": "Sun, 12 Jan 2025 09:19:01 GMT", "content-type": "application/json;charset=utf-8", "content-length": "87", "connection": "keep-alive", "cache-control": "max-age=0, no-cache, no-store", "strict-transport-security": "max-age=63072000"} }, service: b2, path: {redacted}, written: 41558942 } => B2Error { status: 503, code: "service_unavailable", message: "no tomes available" }
[00:14:51] backing up... ███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 13.67 GiB/70.38 GiB 13.91 MiB/s (ETA 62m)
[WARN] will retry after 1.437754512s because: Unexpected (temporary) at Writer::close, context: { uri: https://pod-000-1084-12.backblaze.com/b2api/v2/b2_upload_file/{redacted}, response: Parts { status: 503, version: HTTP/1.1, headers: {"server": "nginx", "date": "Sun, 12 Jan 2025 09:21:15 GMT", "content-type": "application/json;charset=utf-8", "content-length": "81", "connection": "keep-alive", "cache-control": "max-age=0, no-cache, no-store", "strict-transport-security": "max-age=63072000"} }, service: b2, path: {redacted}, written: 40846659 } => B2Error { status: 503, code: "service_unavailable", message: "CPU too busy" }
[00:20:09] backing up... █████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 16.57 GiB/70.38 GiB 7.66 MiB/s (ETA 65m)
[WARN] will retry after 1.2702193259999999s because: Unexpected (temporary) at Writer::close, context: { uri: https://pod-000-1158-03.backblaze.com/b2api/v2/b2_upload_file/{redacted}, response: Parts { status: 503, version: HTTP/1.1, headers: {"server": "nginx", "date": "Sun, 12 Jan 2025 09:26:33 GMT", "content-type": "application/json;charset=utf-8", "content-length": "81", "connection": "keep-alive", "cache-control": "max-age=0, no-cache, no-store", "strict-transport-security": "max-age=63072000"} }, service: b2, path: data/{redacted}, written: 41083244 } => B2Error { status: 503, code: "service_unavailable", message: "CPU too busy" }
[00:20:09] backing up... █████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 16.57 GiB/70.38 GiB 7.63 MiB/s (ETA 65m)

After that, rustic exits with code 141.

Thank you for Rustic.
