WIP Python streaming #29

Open
wants to merge 22 commits into
base: master

alexlemann

No description provided.

@alexlemann alexlemann changed the title Python streaming WIP Python streaming Mar 6, 2017
@alexlemann
Author

@alvacouch Let me know what you think. I think we're on the same page with respect to simplifying what's in django_irods. My general principles are: 1) this should be usable by other projects and should not have any HS-specific code in it; 2) python-irodsclient is likely a better basis for a package than icommands (or fuse); 3) this should act like a regular Django storage and not include extra stuff (or at least not too much).

I realize that these tenets might be overly optimistic at this stage in the game given the impact they may have on the overall HS code base, but I'm interested to hear your thoughts on the matter. I also realize that I have likely removed a bit too much, and I am interested in reimplementing any significant functionality that is needed for HS and doesn't go against these principles.

@alexlemann
Author

TODOs:

  • Add a configurable base path per iRODS backend
  • Reimplement previously existing functionality (which parts?)
  • Consider using WSGI/Django FileWrapper for uploads
  • Fix timeout issues or document them
  • Resolve issues around configurable buffer sizes in underlying python-irodsclient
  • Implement upload functionality
  • Write tests
  • Improve documentation
  • Turn into a proper Python module (setup.py, PyPI, etc.)
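
For the configurable base path item, one possible shape is sketched below. This is only an illustration, not code from this PR: the function name `resolve_path` and the example collection path are hypothetical, and it assumes iRODS logical paths can be treated as POSIX-style paths.

```python
import posixpath


def resolve_path(base_path, name):
    """Join a storage-relative name onto base_path, rejecting names that escape it.

    Hypothetical helper: base_path would come from the backend's settings.
    """
    base = posixpath.normpath(base_path)
    # Strip leading slashes so an absolute-looking name cannot replace the base.
    full = posixpath.normpath(posixpath.join(base, name.lstrip("/")))
    if full != base and not full.startswith(base.rstrip("/") + "/"):
        raise ValueError("path escapes the configured base: %r" % name)
    return full


# "/tempZone/home/hs" is a made-up collection used only for illustration.
example = resolve_path("/tempZone/home/hs", "resources/data.csv")
```

Normalizing before the prefix check means a name such as `../other` is rejected rather than silently resolved outside the configured collection.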

@pkdash pkdash requested a review from alvacouch March 7, 2017 03:53
@hyi

hyi commented Mar 8, 2017

@alexlemann I am not 100% sure about the purpose of this work or where it is aimed, but I want to make sure the intent is not to rely solely on python irods client for file transfer between the hydroshare django server and iRODS. The reason is that python irods client does not support parallel file transfer, so you can use it for file listing, adding metadata, etc., but not for file transfer, especially of big files. The current implementation uses icommands underneath for file transfer, which can leverage iRODS parallel file transfers for performance.

@alexlemann
Author

alexlemann commented Mar 13, 2017

@hyi Here's the reference to STDOUT and threading in irods I was talking about:
https://github.com/irods/irods/blob/master/lib/api/src/rcDataObjGet.cpp#L99

On Linux, you can verify this by looking at the Threads line in /proc/<pid>/status while running iget <path> versus iget <path> -
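
A minimal sketch of that check in Python, assuming the Linux /proc layout. The helper names are hypothetical, and the pid of the running iget process has to be found separately (for example with ps).

```python
import re
from pathlib import Path


def thread_count_from_status(status_text):
    """Return the integer on the 'Threads:' line of /proc/<pid>/status text, or None."""
    match = re.search(r"^Threads:\s*(\d+)\s*$", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def thread_count(pid):
    """Read /proc/<pid>/status (Linux only) and return that process's thread count."""
    return thread_count_from_status(Path(f"/proc/{pid}/status").read_text())
```

Calling `thread_count(pid)` once while `iget <path>` is running and once while `iget <path> -` is running gives the two numbers to compare.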

@alexlemann
Author

Thanks for the feedback here, @hyi, and to those who joined the conversation on the call.

As we discussed on the call, my question was whether the lack of threading would be the bottleneck, or whether the end user's internet connection is more likely to be the bottleneck. With the streaming approach here, a complete temporary copy does not have to be made before starting to write out to the end HS client, which could be a performance win for the end user.
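
To make the streaming idea concrete, here is a rough sketch of a chunked generator. It is not the code in this PR: `iter_chunks` is a hypothetical name, and an in-memory buffer stands in for an open iRODS data object (anything with read() and close() would do).

```python
import io


def iter_chunks(fileobj, chunk_size=1024 * 1024):
    """Yield successive chunks from a file-like object until it is exhausted."""
    try:
        while True:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        # Close the source once it is drained or the consumer stops iterating.
        fileobj.close()


# Stand-in for an open iRODS data object; 10 bytes read 4 at a time.
source = io.BytesIO(b"x" * 10)
sizes = [len(chunk) for chunk in iter_chunks(source, chunk_size=4)]
```

In a Django view, a generator like this could be passed to django.http.StreamingHttpResponse, so bytes start flowing to the client as soon as the first chunk arrives instead of after a full temporary copy is written.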

I also offered that the only place the number of threads is limited in python-irodsclient is where the client tells the irods server how many threads it would like to use (e.g. https://github.com/irods/python-irodsclient/blob/master/irods/manager/data_object_manager.py#L65). With a small modification to this, a threaded client could be built on top of python-irodsclient, creating a separate web microservice for handling interactions with irods. But, given the above questions about performance, this is likely excessive.

@alvacouch had other concerns about multi-threading and overall server resource usage; he will need to elaborate on them and explain how they relate to the changes proposed here, whether in the code or in this conversation.

@hyi

hyi commented Mar 13, 2017

@alexlemann Thanks for the pointer on iget piping to stdout not supporting multiple threads. I tested it and confirmed that this is indeed the case. When I transferred a big file using iget, it used 17 threads, but when I used iget <path> - (the pipe-to-stdout option), only 1 thread was used. So you are correct that no parallel transfer is enabled when a file is transferred from the irods server to the hydroshare server for downloading. That said, since there is a fast network connection between the irods server and the hydroshare server, I think the bottleneck is still the transfer from the hydroshare server to the web client. As for your suggestion of a small modification to python-irodsclient to enable multi-threading, I suggest you first email the irods chat user list at [email protected] to ask this question and get an answer from iRODS experts.

7 participants