
Retry or repair failed put of chunks? #257

Open
AOOSG opened this issue May 12, 2024 · 9 comments
Comments

AOOSG commented May 12, 2024

Hi, I've recently started getting quite frequent failures while uploading a UE5.4 build to an AWS S3 bucket. It started happening without any changes to my system. The errors look like this:

ERRO putStoredBlock: failed to put stored block at `chunks/9f85/0x9f854a10bfd90a9d.lsb` in `s3://<my-aws-bucket>/store/`: s3BlobObject.Write(): operation error S3: PutObject, https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put "https://<my-aws-bucket>.s3.eu-west-2.amazonaws.com/store/chunks/9f85/0x9f854a10bfd90a9d.lsb?x-id=PutObject": dial tcp: lookup <my-aws-bucket>.s3.eu-west-2.amazonaws.com: no such host  blobClient="s3://<my-aws-bucket>/store/" blockHash=11494675059731663517 fname=putStoredBlock key=chunks/9f85/0x9f854a10bfd90a9d.lsb s="s3://<my-aws-bucket>/store/"

I almost always get two failures during an upload, for different chunks; once I had six chunks fail to upload.

Is there a known fix for "no such host" errors, or can I retry failed blocks manually somehow? I'd like to just retry much more persistently to see if it succeeds eventually.
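For the manual-retry angle, one brute-force approach that often gets past transient "no such host" DNS failures is to wrap the whole put in a retry loop with backoff. This is a generic sketch, not a golongtail feature; the `retry` helper and the commented-out example command line are hypothetical:

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, up to
# ATTEMPTS times, doubling DELAY (seconds) between attempts.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    i=$((i + 1))
  done
}

# Hypothetical usage: keep re-running the whole put. Chunks that are
# already uploaded are skipped, so retries mostly re-attempt failures.
# retry 5 2 longtail put --source-path "UE5-build/" \
#   --target-path "s3://my-aws-bucket/store/index/5.4.lvi"
```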

The worst part is that the bucket seems corrupted once this error occurs and I have to delete all objects in the bucket before trying to upload a new engine version again.

A couple of things I've tried:

  1. Searched the web for a fix: no obvious solutions (e.g. link and link).
  2. Used a Google Cloud bucket instead: similar errors occur, usually a couple per upload.
  3. Used a different DNS server (e.g. 8.8.8.8), but it didn't seem to help.
  4. Updated to the latest golongtail preview release.

Any other suggestions, or ways to repair failed uploads of chunks? Thanks!

@AOOSG AOOSG changed the title Repair or retry failed put of blocks? Retry or repair failed put of chunks? May 12, 2024
AOOSG commented May 13, 2024

Some more things I tried last night:

I can keep retrying the put of the engine build. Each retry uploads the chunks that weren't already uploaded, though I still get errors on new chunks. Eventually the upload succeeds.

This sounds promising, but doing a get on the same engine build shows lots of warnings and errors about missing chunks. It seems the store is corrupted, and the only option is to delete all the files in the bucket.

Is there a way to repair the store? I fear that gets of future uploaded builds will also fail, because chunks may be shared between builds (and those chunks may be missing).

AOOSG commented May 13, 2024

Further testing, I think I know what's going on. I managed to put an engine build successfully.

It looks like golongtail is so efficient at maximizing the upload bandwidth that sometimes the DNS resolution requests from the golongtail app time out!

On the router I limited the machine's upload speed to 90 Mbit/s (it's a ~120 Mbit/s upload connection), and I managed to upload an engine build without errors.

So it seems solved for now, two questions/comments:

  • Feature request: Allow me to throttle the put upload bandwidth usage (for example by allowing me to limit the number of threads used?)
  • If errors do happen again during a put of a new engine build, it seems there's no guarantee that future engine builds won't be corrupted, due to possibly missing chunks that are shared. Is this true, and how can I fix or mitigate it?
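Until something like a built-in throttle exists, an OS-level limiter can serve as a stopgap for the first point. The sketch below uses trickle to cap upload bandwidth for a single process; the longtail arguments are illustrative, and 11250 KB/s corresponds roughly to the 90 Mbit/s that worked above:

```shell
# Cap this process's upload rate at ~90 Mbit/s (trickle -u takes KB/s).
# The longtail command line is a hypothetical example, not exact syntax.
trickle -s -u 11250 longtail put \
  --source-path "UE5-build/" \
  --target-path "s3://my-aws-bucket/store/index/5.4.lvi"
```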

@DanEngelbrecht (Owner)

You can indirectly reduce the number of threads doing network jobs via the --worker-count option. It defaults to the number of CPU cores you have, so if your machine has lots of cores it might be helpful to reduce the number.

As for the corrupted upsync: it is designed so that it should not write the version index if it fails to upload blocks, but there might be something wrong there.

You have the option to add --validate, which will check that the uploaded index is correct, but that won't help if the blocks get corrupted on the way up to your storage medium...
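Put together, a more conservative put along those lines might look like the sketch below; --worker-count and --validate are the options mentioned above, while the paths and the rest of the command line are illustrative assumptions:

```shell
# Fewer workers means fewer concurrent connections and DNS lookups;
# --validate re-checks the uploaded index after the put completes.
longtail put \
  --source-path "UE5-build/" \
  --target-path "s3://my-aws-bucket/store/index/5.4.lvi" \
  --worker-count 8 \
  --validate
```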

@DanEngelbrecht (Owner)

You might also be able to do some form of repair with either --clone-store or --prune-store, but there is no outright repair command.

AOOSG commented May 17, 2024

OK thanks, feel free to close this issue.

I'll give these commands a try next time. The upload issue should be fixed by reducing the worker count.

I should be able to run --prune-store-blocks if a put fails, to remove all blocks that aren't in the index (which should happen if chunks fail to upload).

Two other things I'm thinking of as well:

  1. Have golongtail put write only to a local store on the server, and use rsync to keep the bucket up to date. Downside: keeps a local copy of the whole store on the server.
  2. If there's an error uploading a chunk, change the store location in the bucket (i.e. gs://bucket/store<index>/), bumping index each time an engine build fails to upload one or more chunks. Downside: loses all block re-use when the index is bumped.
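Option 1 can be sketched as a two-step pipeline: put against a local store, then mirror it to the bucket only when the put succeeds, so the remote store never sees a half-finished upload. Here `aws s3 sync` stands in for rsync since the target is S3; all paths, the bucket name, and the longtail syntax are hypothetical:

```shell
# Step 1: put into a store on local disk (no DNS involved here).
longtail put \
  --source-path "UE5-build/" \
  --target-path "/srv/longtail-store/index/5.4.lvi" \
&& \
# Step 2: only if the put succeeded, mirror the store to the bucket.
aws s3 sync /srv/longtail-store "s3://my-aws-bucket/store/"
```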

@DanEngelbrecht (Owner)

@AOOSG Out of curiosity - what kind of computer and NIC do you have?

AOOSG commented Jun 11, 2024

From the two machines I've tried (on the same network, 1 Gbps down & 120 Mbps up):

  • Desktop: 5950X (32 threads), on a dodgy Wi-Fi that still maxes the upload speed.
  • Server: i9-9900K (16 threads), gigabit Ethernet to the router.

Both had DNS resolution timing out before I bandwidth-limited them to 90 Mbps upload.


DanEngelbrecht commented Jul 8, 2024

Hi, could you try out https://github.com/DanEngelbrecht/golongtail/releases/tag/v0.4.4-pre1 without limiting the number of workers?

  1. Does it create corrupted stores?
  2. Does performance still look ok?

AOOSG commented Aug 3, 2024

Thanks, I've given it a try now and it's worked so far.

  1. Not that I've seen, across a couple of runs.
  2. Yep, I used the default 8 threads and it maxed out 120 Mbps.
