Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ObjectStore] Add SpawnService for running requests on different tokio runtime/Handle #7253

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Mar 8, 2025

Which issue does this PR close?

Rationale for this change

This allows spawning IO onto a separate tokio runtime

What changes are included in this PR?

Are there any user-facing changes?

@@ -195,6 +195,18 @@ impl HttpResponseBody {
}
}

impl Body for HttpResponseBody {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a drive-by improvement, and allows using HttpResponseBody with http_body_util::BodyExt.

It technically isn't necessary, as object_store ignores any non-data frames and therefore HttpResponseBody::bytes is sufficient, but this is cleaner

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is probably good context to add as an inline comment for future readers

// specific language governing permissions and limitations
// under the License.

//! Generic HTTP client abstraction
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a little confusing having the public HTTP client abstraction mixed in with the client utilities for implementing object_store - so split it out into a separate module

}

#[async_trait]
impl<T: HttpService + Clone> HttpService for SpawnService<T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit ouroborosy but I wonder if we should implement HttpService for HttpClient so that you can also wrap an HttpClient with this

@@ -195,6 +195,18 @@ impl HttpResponseBody {
}
}

impl Body for HttpResponseBody {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is probably good context to add as an inline comment for future readers


/// Wraps a provided [`HttpService`] and runs it on a separate tokio runtime
#[derive(Debug)]
pub struct SpawnService<T: HttpService + Clone> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always find templated code like this harder to use and understand than just dyn ...

When it is important to make type specialized code for performance I understand the need

But when the usecase is to make a function call for an http request I think the mental overhead is unecessary

I realize this is a style opinion / point of view and that others may not feel the same way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I originally actually had this wrapping HttpClient, but it ended up a bit weird as you ended up doing

let client = HttpClient::new(SpawnService::new(HttpClient::new(reqwest::Client::new())))

As opposed to composing the services to yield the final client

let client = HttpClient::new(SpawnService::new(reqwest::Client::new()))

@ion-elgreco
Copy link

How would one configure a store with a custom HttpClient?

I don't see anything that resembles with_http_client() on the stores

@tustvold
Copy link
Contributor Author

tustvold commented Mar 8, 2025

I don't see anything that resembles with_http_client() on the stores

0.12 (when it is released) adds with_http_connector

@alamb
Copy link
Contributor

alamb commented Mar 8, 2025

Is the idea that this would be able poll the response stream on a different runtime without all the high level wrapping needed if we did it in ObjectStore? If so that sounds like a good ideae

I am less clear if other layers (like chunked requests) for example make sense to do at this layer (as I think they also need to take into account things like verifing that the etag is the same for all responses, etc)

@tustvold
Copy link
Contributor Author

tustvold commented Mar 8, 2025

Is the idea that this would be able poll the response stream on a different runtime without all the high level wrapping needed if we did it in ObjectStore?

Yes, similar story for metrics, tracing, etc...

Caching could also be implemented at this layer quite easily, either by caching anything with an ETag header or by respecting the Cache-Control header, or some other mechanism.

chunks

Chunking can be done at this layer, however, it has the exact same restrictions as implementing it at the ObjectStore level:

  • Request needs to specify an ETag precondition to ensure same object
  • Requests must specify range(s)
  • Cannot return chunks out of order

It is unclear if these restrictions are too onerous for the feature to be useful.

@ion-elgreco
Copy link

ion-elgreco commented Mar 12, 2025

@tustvold do you think this would still land in a 0.12.x?

I would like to do a deltalake 1.0 soon but this is kind of a must to have included since it's a better implementation then the DeltaStorageBackend we have now.

@tustvold
Copy link
Contributor Author

@tustvold do you think this would still land in a 0.12.x?

I suspect the biggest potential blocker to that would be getting DataFusion updated to the 0.12 line, which in turn is waiting on the next breaking arrow release scheduled for the end of April.

would like to do a deltalake 1.0 soon

FWIW given that arrow and DataFusion are not "stable", i.e. they make periodic breaking releases, I would have thought releasing deltalake 1.0 would quickly end up needing a deltalake 2.0, etc... IMO neither arrow nor datafusion should be 1.0, but there is a historical reason for it.

@ion-elgreco
Copy link

@tustvold do you think this would still land in a 0.12.x?

I suspect the biggest potential blocker to that would be getting DataFusion updated to the 0.12 line, which in turn is waiting on the next breaking arrow release scheduled for the end of April.

That's fine! I will probably start with some RCs first, I was mainly curious that it gets published in the 0.12.x versioning.

would like to do a deltalake 1.0 soon

FWIW given that arrow and DataFusion are not "stable", i.e. they make periodic breaking releases, I would have thought releasing deltalake 1.0 would quickly end up needing a deltalake 2.0, etc... IMO neither arrow nor datafusion should be 1.0, but there is a historical reason for it.

I am mainly talking about the python API, which has been quite stable. Breaking changes generally effect our internal api's, and DF44 has resolved most of our issues.

There are some minor issues still in datafusion but tbh these are edge cases so that's fine for now

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

I will try and review this a bit more carefully tomorrow

I think it could do with some more documentation / examples (which I can help with)

I also think the name SpawnService while it makes sense given the underlying implementation (to spawn on another runtime) is somewhat confusing. I recommend a more explicit name like CrossRTService or CrossRTHttpService -- I realize I may prefer longer names than @tustvold , but given this one would likely be used only a few times (when setting up the object store) I think being more explicit is better

@alamb alamb changed the title Add SpawnService [ObjectStore] Add SpawnService Mar 13, 2025
@alamb
Copy link
Contributor

alamb commented Mar 18, 2025

I filed apache/arrow-rs-object-store#13 and linked it to this ticket's close

@alamb alamb changed the title [ObjectStore] Add SpawnService [ObjectStore] Add SpawnService for running requests on different tokio Runtimes Mar 18, 2025
@alamb alamb changed the title [ObjectStore] Add SpawnService for running requests on different tokio Runtimes [ObjectStore] Add SpawnService for running requests on different tokio runtime/Handle Mar 18, 2025
@alamb
Copy link
Contributor

alamb commented Mar 20, 2025

Thank you for this PR. We are in the process of moving the object_store code to its own repository. Would it be possible for you to create a PR in that repository instead?

(we will handle moving all existing issues to the new repository)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[object_store] Run requests on a different tokio runtime
3 participants