
Implement bulk_delete_request for Azure #5681

Merged (14 commits) on Nov 25, 2024

Conversation

andrebsguedes (Contributor)

Which issue does this PR close?

Closes #5680.

@github-actions github-actions bot added the object-store Object Store Interface label Apr 22, 2024
tustvold (Contributor)

Sorry for the delay, I'll try to get to this next week, it's a spicy one 😅

Azure's APIs are truly something else 🙃

andrebsguedes (Contributor, Author)

@tustvold Don't tell me! It's sad to see the amount of gymnastics their official clients have to do, like encoding and decoding HTTP manually just because someone thought it would be a good idea to send quasi HTTP requests within a multipart request... 🫠

tustvold (Contributor) left a comment:

So I've gone through this, and whilst I don't see anything obviously incorrect, I am quite nervous about it. It is implementing a fairly complex specification, https://www.ietf.org/rfc/rfc2046, and I worry there may be incorrect assumptions, implementation oversights, etc... If this were being used for a read-only request, that's one thing, but the potential impact of a malformed bulk delete is high.

I don't know if there are mature Rust implementations of multipart MIME messages, but if there are, one idea might be to add one as an optional dependency, potentially gated behind its own feature (e.g. azure_batch), and use that to implement this functionality. That would delegate the maintenance and correctness burden to a crate focused on implementing that particular specification correctly.
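For context, the shape being implemented is roughly the following: a multipart/mixed body whose parts each carry a serialized quasi-HTTP DELETE subrequest, framed by RFC 2046 boundary lines. A std-only sketch (the function name, boundary, paths, and exact headers are illustrative, not the PR's actual code):

```rust
// Illustrative sketch, NOT the PR's implementation: RFC 2046
// multipart/mixed framing around quasi-HTTP DELETE subrequests,
// as used by Azure blob batch operations.
fn encode_batch_body(boundary: &str, paths: &[&str]) -> Vec<u8> {
    let mut body = Vec::new();
    for (idx, path) in paths.iter().enumerate() {
        // Each part starts with a dash-boundary line (RFC 2046 section 5.1.1).
        body.extend_from_slice(format!("--{boundary}\r\n").as_bytes());
        // Part headers describe the embedded subrequest.
        body.extend_from_slice(b"Content-Type: application/http\r\n");
        body.extend_from_slice(b"Content-Transfer-Encoding: binary\r\n");
        body.extend_from_slice(format!("Content-ID: {idx}\r\n\r\n").as_bytes());
        // The part body is itself a serialized HTTP request.
        body.extend_from_slice(format!("DELETE /{path} HTTP/1.1\r\n").as_bytes());
        body.extend_from_slice(b"Content-Length: 0\r\n\r\n");
    }
    // Close the multipart body with the end boundary: --boundary--
    body.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
    body
}
```

This is exactly the kind of hand-rolled framing the comment above is nervous about: a stray boundary in a part body, or a missing CRLF, silently corrupts the whole batch.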

@@ -240,6 +268,157 @@ impl<'a> PutRequest<'a> {
}
}

#[inline]
fn extend(dst: &mut Vec<u8>, data: &[u8]) {
tustvold (Contributor):

Is this needed?

andrebsguedes (Contributor, Author):

It is a shorthand to avoid typing extend_from_slice too much; I saw it in hyper's source code and I think it makes the code a little more readable.
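The helper under discussion, whose signature appears in the diff above, is presumably just:

```rust
// A thin wrapper so call sites read `extend(&mut buf, b"...")`
// instead of repeating `buf.extend_from_slice(...)` throughout
// the encoder. Matches the signature shown in the diff.
#[inline]
fn extend(dst: &mut Vec<u8>, data: &[u8]) {
    dst.extend_from_slice(data);
}
```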

tustvold (Contributor):

I personally think it obfuscates what the code is doing, but I don't feel strongly

}

// Write header names as title case. The header name is assumed to be ASCII.
fn title_case(dst: &mut Vec<u8>, name: &[u8]) {
tustvold (Contributor):

FWIW this may not be necessary, as headers are case-insensitive

andrebsguedes (Contributor, Author):

Azure rejects requests if some headers aren't using a very specific casing

tustvold (Contributor):

🤦 Wow...
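One plausible implementation of the `title_case` helper whose signature appears in the diff (the PR's actual body isn't shown here): uppercase the first byte and any byte following a hyphen, so `content-length` is written as `Content-Length`, mirroring hyper's title-casing approach.

```rust
// Hypothetical sketch of the helper from the diff above; the real
// PR body may differ. Write header names as Title-Case. The header
// name is assumed to be ASCII.
fn title_case(dst: &mut Vec<u8>, name: &[u8]) {
    let mut at_start = true; // start of the name, or just after a '-'
    for &b in name {
        if at_start {
            dst.push(b.to_ascii_uppercase());
        } else {
            dst.push(b);
        }
        at_start = b == b'-';
    }
}
```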

};

// Parse part response headers
let mut headers = [httparse::EMPTY_HEADER; 10];
tustvold (Contributor):

Similar comment to above

andrebsguedes (Contributor, Author):

httparse expects a slice of headers and does not attempt to grow it, so we would need more complex code to support an arbitrary number of headers. I chose to increase the count to a more conservative one and explain it in a comment.
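The constraint being discussed is the fixed-slice pattern: like httparse, the parser fills a caller-provided fixed-size buffer and fails once it runs out of room, so the caller must size the array for the worst case up front. A std-only illustration of that pattern (the names and error handling here are hypothetical, not httparse's actual API):

```rust
// std-only illustration of the fixed-capacity header-slice pattern
// discussed above; httparse itself works the same way but with its
// own types. No growth, just a hard capacity limit.
#[derive(Default, Clone, Copy)]
struct Header<'a> {
    name: &'a str,
    value: &'a str,
}

fn parse_headers<'a>(raw: &'a str, dst: &mut [Header<'a>]) -> Result<usize, &'static str> {
    let mut n = 0;
    for line in raw.split("\r\n").filter(|l| !l.is_empty()) {
        let (name, value) = line.split_once(": ").ok_or("malformed header")?;
        // Fail rather than grow when the caller's buffer is full.
        *dst.get_mut(n).ok_or("too many headers for buffer")? = Header { name, value };
        n += 1;
    }
    Ok(n)
}
```

Sizing the array conservatively (and documenting why) trades a small amount of stack space for much simpler parsing code.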

alamb (Contributor)

alamb commented May 20, 2024

> So I've gone through this, and whilst I don't see anything obviously incorrect, I am quite nervous about it. It is implementing a fairly complex specification, https://www.ietf.org/rfc/rfc2046, and I worry there may be incorrect assumptions, implementation oversights, etc... If this were being used for a read-only request, that's one thing, but the potential impact of a malformed bulk delete is high.
>
> I don't know if there are mature Rust implementations of multipart mime messages, but if there are, one idea might be to add this as an optional dependency, potentially gated on its own feature (e.g. azure_batch), and use that to implement this functionality. That would delegate the maintenance, correctness burden onto a crate focused on implementing that particular specification correctly

My opinion is that having this feature as a part of object store (with a caveat in the doc and gated behind a non default feature flag) would be a good idea.

My rationale is that, while I agree with the technical concerns, unless there is some alternate strategy for implementing this feature today, this seems like the best way to get experience and testing for it.

Xuanwo (Member) left a comment:

I suggest adding dedicated tests for the multipart parse and generate logic, similar to: https://github.com/apache/opendal/blob/be6f359d15b965147d2c0ecc4de93a95863167aa/core/src/raw/http_util/multipart.rs#L748

andrebsguedes (Contributor, Author)

@tustvold @konjac I still have to take the time to go through all the comments here, but thanks in advance for taking the time to review. I can understand the hesitation to approve something that does manual encoding/decoding like this, and I would personally prefer to rely on something else to handle the trickiest bits for us, but my search for a crate that could handle this was unfruitful:

  • multipart: old, synchronous and only multipart/form-data
  • multer: parsing only, and seems to be hard-coded to multipart/form-data mime
  • multipart-stream: also old, supports multipart/x-mixed-replace and maybe multipart/mixed but requires the content-length header in parts
  • ...

But what really ended my search was looking at how the official clients do this and realizing that they also do it manually:

andrebsguedes (Contributor, Author)

@tustvold After going through the comments, I think this is ready for another review pass.

tustvold (Contributor)

I'm afraid I do not have capacity in the foreseeable future to review this, as I no longer work on arrow-rs. One of the other maintainers may be able to help out.

andrebsguedes (Contributor, Author)

Maybe @alamb could help me find a reviewer for this?

Xuanwo (Member)

Xuanwo commented Jul 31, 2024

Hi @andrebsguedes, I'm willing to help review it later this week, although I'm not a maintainer of arrow-rs.

Xuanwo (Member) left a comment:

Hi, this looks good overall. The only thing missing from my side is unit tests for the multipart parse logic.

The bulk delete request builder and the response parsing logic are much more complex than those of other requests. I suggest we have dedicated unit tests specifically for them.

andrebsguedes (Contributor, Author)

Hi @Xuanwo, thanks for the review. I finally got some time to add the tests for this; I would appreciate another pass.

Xuanwo (Member) left a comment:

Thank you @andrebsguedes for all your hard work!

andrebsguedes (Contributor, Author)

@tustvold Do you know anyone who can help with merging this, now that we have approval?

tustvold (Contributor) left a comment:

Sorry for the delay. I think this is fine as is, but I left some comments on how we could reduce the dependency footprint. This could also be done as a follow-on.


let credential = self.get_credential().await?;

// https://www.ietf.org/rfc/rfc2046
let boundary = format!("batch_{}", uuid::Uuid::new_v4());
tustvold (Contributor):

I wonder if this actually needs to be a UUID, or if we could just generate 128 bits of random data and avoid the additional dependency
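The suggestion sketched out: a boundary only needs to be unique and unlikely to collide with body content, so any 128 bits of randomness formatted as hex would serve, without the uuid crate. Where the bits come from (getrandom, rand, /dev/urandom) is left to the caller here, keeping this dependency-free (the function name is hypothetical):

```rust
// Sketch of the suggestion above: format 128 random bits as hex
// instead of depending on the uuid crate. The entropy source is
// deliberately left to the caller.
fn make_boundary(random_bits: u128) -> String {
    format!("batch_{random_bits:032x}")
}
```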


// Encode end marker
extend(&mut body_bytes, b"--");
extend(&mut body_bytes, boundary.as_bytes());
tustvold (Contributor):

Should we validate that boundary doesn't appear in the encoded body?

andrebsguedes (Contributor, Author):

We can't, because the first thing we add to body_bytes is the boundary itself (within serialize_part_delete_request).


let stream = batch_response.bytes_stream();

let mut multipart = multer::Multipart::new(stream, boundary);
tustvold (Contributor):

Sorry to go back and forth on this, but looking at multer, it has a fairly monstrous dependency footprint, including https://crates.io/crates/encoding_rs/0.8.35, which is 1 MB on its own...

Perhaps we could just revert to the custom parsing logic you had before.


andrebsguedes (Contributor, Author)

@tustvold No problem! Ready for another pass

tustvold merged commit 7fc0e87 into apache:main on Nov 25, 2024
15 checks passed
tustvold (Contributor)

Thank you for sticking with this
