Add support for multipart uploads #26

Open
evanfuller opened this issue Jul 22, 2020 · 6 comments

@evanfuller

Tika appears to support multipart file uploads via the endpoint POST /tika/form, as documented here.

It would be great to add this support so that we could use the client for uploading large files without having to hold the buffer with all bytes in memory.

@tbpg
Member

tbpg commented Aug 12, 2020

This seems reasonable to me (another endpoint to support). But, I'm not sure I understand this part:

hold the buffer with all bytes in memory.

An io.Reader doesn't need to be stored in memory, right? Or is this an issue under the hood with how the request is handled?

@evanfuller
Author

@tbpg I'm basing my intuition off of this Medium post.

The gist is that multipart support needs some io.Writer to pass to multipart.NewWriter() when building the request. In the author's first experiment, that writer was an in-memory buffer, so Go naively ended up holding the entire file in memory. Later in the post, the author improves on this by using an io.Pipe instead, so that the entire file is not held in memory just for the request.

Admittedly, I did not attempt to replicate the findings of this post, so it could be possible that Go has improved buffering for large requests like this, but I'm not sure.

@tbpg
Member

tbpg commented Aug 12, 2020

Gotcha. That seems reasonable to me. I think we'd have to play with what the exact API is for the tika package. In general, I'd like to leave as much of the creation of the io.Reader to the caller. But, it might be too cumbersome to expect someone to use the multipart package?

@evanfuller
Author

Yeah, point definitely taken on trying to be implementation/caller-agnostic. That said, the docs for this particular endpoint suggest that it is specifically for use with multipart uploads.

@k7en

k7en commented Dec 10, 2020

Hello
I'm trying to extract the meta-information and body text of a large file, and it is consuming a lot of memory, so I came here in search of a solution. I would like to communicate with Tika's API using the multipart method.
Do you know what the current status of this issue is?
If you know how to solve this problem, I'd appreciate some advice.
BRGDS.

@tbpg
Member

tbpg commented Dec 16, 2020

I'm very open to a PR here. One warning: we might need to modify the interface a little bit, keeping it as minimal as possible while still enabling you to do what you need to do.
