This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Add conversion CLI #7

Open
jeromekelleher opened this issue Jul 17, 2020 · 3 comments

@jeromekelleher
Contributor

A very natural way for sgkit to be used is to convert existing datasets into sgkit format. Having a CLI to do this will be very useful and will be a familiar pattern to many users. Given that it's going to take on the order of hours to convert large files, I think we need to make this a user-friendly process, with process monitoring etc.

I suggest we add a CLI to this repo called plink2sg (just a suggestion - open to other names), which would work like:

$ plink2sg [plink dataset prefix] [sgkit zarr path]

This would have options to:

  • Fiddle with the plink input, like the separator used, etc.
  • Control verbosity. We should provide functionality to tap into Dask's logging and so make it easy to debug what's happening. I find daiquiri gives some really nice colour output for logs. It would be great if plink2sg -vv printed out lots of debug logs.
  • Control parallelism. In the simplest case we want to control how many processes Dask will use. We might think about integrating with more advanced Dask functionality as well, I guess.
  • Show a progress monitor. This is essential. It must be possible to hook up a tqdm progress monitor with Dask, right?
  • Control the zarr output. What compressors do we use, etc.? I think we should also have options that allow us to change the zarr Store in use. For very large files, I would much prefer a ZipStore or an LmdbStore. We could provide CLI options that make it easy to specify these.
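A minimal sketch of what that option surface could look like, using argparse (all flag names here are hypothetical suggestions, not an existing sgkit API):

```python
import argparse

def build_parser():
    # Hypothetical option surface for the proposed plink2sg tool.
    parser = argparse.ArgumentParser(
        prog="plink2sg",
        description="Convert a plink dataset to sgkit's zarr format.",
    )
    parser.add_argument("plink_prefix", help="Prefix of the plink dataset (.bed/.bim/.fam)")
    parser.add_argument("zarr_path", help="Output path for the sgkit zarr store")
    parser.add_argument(
        "-v", "--verbose", action="count", default=0,
        help="Increase logging verbosity (-v, -vv)",
    )
    parser.add_argument(
        "--workers", type=int, default=None,
        help="Number of Dask worker processes",
    )
    parser.add_argument(
        "--store", choices=["directory", "zip", "lmdb"], default="directory",
        help="Type of zarr store to write to",
    )
    parser.add_argument(
        "--progress", action="store_true",
        help="Show a progress bar during conversion",
    )
    return parser
```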

As I write this, it seems that most of these options will be shared by other converters, so some of the code could be brought into the sgkit repo. Initially though, I think we should just write a good version of this utility and worry about pulling common bits out afterwards.
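For the store option specifically, a small helper could map a store name onto one of zarr's store classes. A sketch under the assumption of zarr v2's DirectoryStore/ZipStore/LMDBStore classes; `open_store` is a hypothetical name:

```python
# Map a hypothetical CLI --store option onto a zarr storage class name.
STORE_CLASSES = {
    "directory": "DirectoryStore",
    "zip": "ZipStore",
    "lmdb": "LMDBStore",
}

def open_store(path, kind="directory"):
    """Open a zarr store of the requested kind at `path`."""
    if kind not in STORE_CLASSES:
        raise ValueError(f"unknown store kind: {kind!r}")
    import zarr  # imported lazily so the name mapping is testable without zarr
    return getattr(zarr, STORE_CLASSES[kind])(path)
```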

Thoughts?

@jeromekelleher
Contributor Author

I'm happy to make a first pass at this. Any guidance on how to write CLI unit tests for pytest? I'm only vaguely familiar with it, as I'm stuck in the world of nose because of thousands of lines of existing tests.

@eric-czech
Collaborator

eric-czech commented Jul 17, 2020

Any guidance on how to write CLI unit tests for pytest?

Hm, I haven't, but maybe @ravwojdyla has done that before?

It must be possible to hook up a tqdm progress monitor with dask, right?

If you're using the default scheduler, then all you should have to do is something like this:

from dask.diagnostics import ProgressBar

with ProgressBar():
    ds.to_zarr(...)
# [########################################] | 100% Completed
# * it might show multiple bars though -- I'm not sure how many
#   different `.compute()` calls `Dataset.to_zarr` results in

For a CLI, that would probably make the most sense. But if the CLI supported taking a Dask scheduler address (to use a cluster), it would probably be better to simply print out a URL for the dashboard, so users can see the progress there in addition to all the profiling tools.

@ravwojdyla
Contributor

ravwojdyla commented Jul 17, 2020

Any guidance on how to write CLI unit tests for pytest?

Hm, I haven't, but maybe @ravwojdyla has done that before?

Yep - just separate the parser logic from the invocation and from the main logic/lib, and test each separately as needed, plus an integration test with test data that invokes the CLI tool. Let me know if I can help in any way.

Edit: and when it comes specifically to pytest, I don't think there is much difference between nose and pytest for CLI testing.
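Concretely, that separation might look like the following sketch (function names are hypothetical), where the parser and the conversion logic can each be unit tested without actually invoking the CLI process:

```python
import argparse

def build_parser():
    # Parser construction kept separate from invocation so it can be unit tested.
    parser = argparse.ArgumentParser(prog="plink2sg")
    parser.add_argument("plink_prefix")
    parser.add_argument("zarr_path")
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser

def convert(plink_prefix, zarr_path, verbose=0):
    # Placeholder for the actual conversion logic, testable directly.
    return f"converted {plink_prefix} -> {zarr_path}"

def main(argv=None):
    # Thin entry point: parse, then delegate to the library function.
    args = build_parser().parse_args(argv)
    return convert(args.plink_prefix, args.zarr_path, verbose=args.verbose)

# A pytest-style test just calls main() with an argv list:
def test_main():
    assert main(["data/chr22", "out.zarr", "-v"]) == "converted data/chr22 -> out.zarr"
```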
