
Implement bulk_delete_request for Azure #5681

Merged (14 commits) on Nov 25, 2024

Conversation

andrebsguedes (Contributor)

Which issue does this PR close?

Closes #5680.

@github-actions github-actions bot added the object-store Object Store Interface label Apr 22, 2024
tustvold (Contributor)

Sorry for the delay, I'll try to get to this next week, it's a spicy one 😅

Azure's APIs are truly something else 🙃

andrebsguedes (Contributor, Author)

@tustvold Don't tell me! It's sad to see the amount of gymnastics their official clients have to do, like encoding and decoding HTTP manually just because someone thought it would be a good idea to send quasi HTTP requests within a multipart request... 🫠

tustvold (Contributor) left a comment:

So I've gone through this, and whilst I don't see anything obviously incorrect, I am quite nervous about it. It is implementing a fairly complex specification, https://www.ietf.org/rfc/rfc2046, and I worry there may be incorrect assumptions, implementation oversights, etc... If this were being used for a read-only request, that's one thing, but the potential impact of a malformed bulk delete is high.

I don't know if there are mature Rust implementations of multipart MIME messages, but if there are, one idea might be to add one as an optional dependency, potentially gated behind its own feature (e.g. azure_batch), and use that to implement this functionality. That would delegate the maintenance and correctness burden to a crate focused on implementing that particular specification correctly.
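For context, the shape being implemented is roughly the following: a multipart/mixed body whose parts each carry a serialized quasi-HTTP DELETE subrequest, framed by RFC 2046 boundary lines. A std-only sketch (the function name, boundary, paths, and exact headers are illustrative, not the PR's actual code):

```rust
// Illustrative sketch, NOT the PR's implementation: RFC 2046
// multipart/mixed framing around quasi-HTTP DELETE subrequests,
// as used by Azure blob batch operations.
fn encode_batch_body(boundary: &str, paths: &[&str]) -> Vec<u8> {
    let mut body = Vec::new();
    for (idx, path) in paths.iter().enumerate() {
        // Each part starts with a dash-boundary line (RFC 2046 section 5.1.1).
        body.extend_from_slice(format!("--{boundary}\r\n").as_bytes());
        // Part headers describe the embedded subrequest.
        body.extend_from_slice(b"Content-Type: application/http\r\n");
        body.extend_from_slice(b"Content-Transfer-Encoding: binary\r\n");
        body.extend_from_slice(format!("Content-ID: {idx}\r\n\r\n").as_bytes());
        // The part body is itself a serialized HTTP request.
        body.extend_from_slice(format!("DELETE /{path} HTTP/1.1\r\n").as_bytes());
        body.extend_from_slice(b"Content-Length: 0\r\n\r\n");
    }
    // Close the multipart body with the end boundary: --boundary--
    body.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
    body
}
```

This is exactly the kind of hand-rolled framing the comment above is nervous about: a stray boundary in a part body, or a missing CRLF, silently corrupts the whole batch.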

@@ -240,6 +268,157 @@ impl<'a> PutRequest<'a> {
}
}

#[inline]
fn extend(dst: &mut Vec<u8>, data: &[u8]) {
tustvold (Contributor):

Is this needed?

andrebsguedes (Contributor, Author):

It is a shorthand to avoid typing extend_from_slice too much; I saw it in hyper's source code and I think it makes the code a little more readable.
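The helper under discussion, whose signature appears in the diff above, is presumably just:

```rust
// A thin wrapper so call sites read `extend(&mut buf, b"...")`
// instead of repeating `buf.extend_from_slice(...)` throughout
// the encoder. Matches the signature shown in the diff.
#[inline]
fn extend(dst: &mut Vec<u8>, data: &[u8]) {
    dst.extend_from_slice(data);
}
```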

tustvold (Contributor):

I personally think it obfuscates what the code is doing, but I don't feel strongly

}

// Write header names as title case. The header name is assumed to be ASCII.
fn title_case(dst: &mut Vec<u8>, name: &[u8]) {
tustvold (Contributor):

FWIW this may not be necessary, as headers are case-insensitive

andrebsguedes (Contributor, Author):

Azure rejects requests if some headers aren't using a very specific casing

tustvold (Contributor):

🤦 Wow...
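One plausible implementation of the `title_case` helper whose signature appears in the diff (the PR's actual body isn't shown here): uppercase the first byte and any byte following a hyphen, so `content-length` is written as `Content-Length`, mirroring hyper's title-casing approach.

```rust
// Hypothetical sketch of the helper from the diff above; the real
// PR body may differ. Write header names as Title-Case. The header
// name is assumed to be ASCII.
fn title_case(dst: &mut Vec<u8>, name: &[u8]) {
    let mut at_start = true; // start of the name, or just after a '-'
    for &b in name {
        if at_start {
            dst.push(b.to_ascii_uppercase());
        } else {
            dst.push(b);
        }
        at_start = b == b'-';
    }
}
```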

};

// Parse part response headers
let mut headers = [httparse::EMPTY_HEADER; 10];
tustvold (Contributor):

Similar comment to above

andrebsguedes (Contributor, Author):

httparse expects a slice of headers and does not attempt to grow it, so we would need more complex code to support an arbitrary number of headers. I chose to increase the count to a more conservative one and explain it in a comment.
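The constraint being discussed is the fixed-slice pattern: like httparse, the parser fills a caller-provided fixed-size buffer and fails once it runs out of room, so the caller must size the array for the worst case up front. A std-only illustration of that pattern (the names and error handling here are hypothetical, not httparse's actual API):

```rust
// std-only illustration of the fixed-capacity header-slice pattern
// discussed above; httparse itself works the same way but with its
// own types. No growth, just a hard capacity limit.
#[derive(Default, Clone, Copy)]
struct Header<'a> {
    name: &'a str,
    value: &'a str,
}

fn parse_headers<'a>(raw: &'a str, dst: &mut [Header<'a>]) -> Result<usize, &'static str> {
    let mut n = 0;
    for line in raw.split("\r\n").filter(|l| !l.is_empty()) {
        let (name, value) = line.split_once(": ").ok_or("malformed header")?;
        // Fail rather than grow when the caller's buffer is full.
        *dst.get_mut(n).ok_or("too many headers for buffer")? = Header { name, value };
        n += 1;
    }
    Ok(n)
}
```

Sizing the array conservatively (and documenting why) trades a small amount of stack space for much simpler parsing code.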

alamb (Contributor)

alamb commented May 20, 2024

> So I've gone through this, and whilst I don't see anything obviously incorrect, I am quite nervous about it. It is implementing a fairly complex specification, https://www.ietf.org/rfc/rfc2046, and I worry there may be incorrect assumptions, implementation oversights, etc... If this were being used for a read-only request, that's one thing, but the potential impact of a malformed bulk delete is high.
>
> I don't know if there are mature Rust implementations of multipart mime messages, but if there are, one idea might be to add this as an optional dependency, potentially gated on its own feature (e.g. azure_batch), and use that to implement this functionality. That would delegate the maintenance, correctness burden onto a crate focused on implementing that particular specification correctly

My opinion is that having this feature as a part of object store (with a caveat in the doc and gated behind a non default feature flag) would be a good idea.

My rationale is that, while I agree with the technical concerns, unless there is some alternate strategy for implementing this feature today, this seems like the best way to get experience and testing for it.

Xuanwo (Member) left a comment:

I suggest adding dedicated tests for the multipart parse and generate logic, similar to: https://github.com/apache/opendal/blob/be6f359d15b965147d2c0ecc4de93a95863167aa/core/src/raw/http_util/multipart.rs#L748

andrebsguedes (Contributor, Author)

@tustvold @konjac I still have to take the time to go through all the comments here, but thanks in advance for taking the time to review. I can understand the hesitation to approve something that does manual encoding/decoding like this, and I would personally prefer to rely on something else to handle the trickiest bits for us, but my search for a crate that could handle this was unfruitful:

  • multipart: old, synchronous and only multipart/form-data
  • multer: parsing only, and seems to be hard-coded to multipart/form-data mime
  • multipart-stream: also old, supports multipart/x-mixed-replace and maybe multipart/mixed but requires the content-length header in parts
  • ...

But what really ended my search was looking at how the official clients do this and realizing that they also do it manually:

andrebsguedes (Contributor, Author)

@tustvold After going through the comments, I think this is ready for another review pass.

tustvold (Contributor)

I'm afraid I do not have capacity in the foreseeable future to review this, as I no longer work on arrow-rs. One of the other maintainers may be able to help out.

andrebsguedes (Contributor, Author)

Maybe @alamb could help me find a reviewer for this?

Xuanwo (Member)

Xuanwo commented Jul 31, 2024

Hi @andrebsguedes, I'm willing to help review it later this week, although I'm not a maintainer of arrow-rs.

Xuanwo (Member) left a comment:

Hi, this looks good overall. The only thing missing from my side is unit tests for the multipart parse logic.

The bulk delete request builder and the response parsing logic are much more complex than those of other requests. I suggest we have dedicated unit tests specifically for them.

andrebsguedes (Contributor, Author)

Hi @Xuanwo, thanks for the review. I finally got some time to add the tests for this; I would appreciate another pass.

Xuanwo (Member) left a comment:

Thank you @andrebsguedes for all your hard work!

andrebsguedes (Contributor, Author)

@tustvold Do you know anyone who can help with merging this, now that we have approval?

tustvold (Contributor) left a comment:

Sorry for the delay. I think this is fine as is, but I left some comments on how we could reduce the dependency footprint. This could also be done as a follow-on.


let credential = self.get_credential().await?;

// https://www.ietf.org/rfc/rfc2046
let boundary = format!("batch_{}", uuid::Uuid::new_v4());
tustvold (Contributor):

I wonder if this actually needs to be a UUID, or if we could just generate 128 bits of random data and avoid the additional dependency
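The suggestion sketched out: a boundary only needs to be unique and unlikely to collide with body content, so any 128 bits of randomness formatted as hex would serve, without the uuid crate. Where the bits come from (getrandom, rand, /dev/urandom) is left to the caller here, keeping this dependency-free (the function name is hypothetical):

```rust
// Sketch of the suggestion above: format 128 random bits as hex
// instead of depending on the uuid crate. The entropy source is
// deliberately left to the caller.
fn make_boundary(random_bits: u128) -> String {
    format!("batch_{random_bits:032x}")
}
```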


// Encode end marker
extend(&mut body_bytes, b"--");
extend(&mut body_bytes, boundary.as_bytes());
tustvold (Contributor):

Should we validate that boundary doesn't appear in the encoded body?

andrebsguedes (Contributor, Author):

We can't, because the first thing we add to body_bytes is the boundary itself (within serialize_part_delete_request).


let stream = batch_response.bytes_stream();

let mut multipart = multer::Multipart::new(stream, boundary);
tustvold (Contributor):

Sorry to go back and forth on this, but looking at multer, it has a fairly monstrous dependency footprint, including https://crates.io/crates/encoding_rs/0.8.35, which is 1 MB on its own...

Perhaps we could just revert to the custom parsing logic you had before.


andrebsguedes (Contributor, Author)

@tustvold No problem! Ready for another pass

tustvold merged commit 7fc0e87 into apache:main on Nov 25, 2024
15 checks passed
tustvold (Contributor)

Thank you for sticking with this
