-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get DS file structure with serviceX tool #4
base: main
Are you sure you want to change the base?
Conversation
This is my first attempt at building this feature, and initially, I didn't expect to reconstruct For eg, should I use Should I write a simpler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - nice! I like this and this is going to be very useful. I agree with your comment about simplifying things. Here is what I think should be done:
- Use
json
(with the built injson
module) to generate the output on ServiceX - Use the
json
module to parse it up on the client.
This should significantly simplify the code - the json
builtin parser is basically bullet proof. Once that is done, then how the downstream things work can probably be significantly simplified.
import os | ||
from .file_peeking import get_structure | ||
|
||
def run_from_command(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you have not run black
on the code. Could you be sure to do that? It might be worth doing that on a new PR because it will make "everything" looked changed. But uniform formatting done by black is pretty nice. I'll add this as a issue. :-)
parser.add_argument("--save-to-txt", action="store_true", help="Save output to a text file instead of printing.") | ||
|
||
args = parser.parse_args() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might look into using the typer
library for a command line. It handles much of this automatically and also gives you a bunch of extra power (like shell completions) automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And add an example json
file or something so people know what the json file format looks like? Or that goes in documentation proper? It should go somewhere.
dataset_file = args.dataset[0] | ||
|
||
if not os.path.isfile(dataset_file): | ||
print(f"\033[91mError: JSON file '{dataset_file}' not found.\033[0m", file=sys.stderr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use print
, instead us logging
- in this case logging.error
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, having unicode characters like this isn't great. I get what you are trying to do, but...
args = parser.parse_args() | ||
|
||
if len(args.dataset) == 1 and args.dataset[0].endswith(".json"): | ||
dataset_file = args.dataset[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole if statement might be a separate function that returns the dataset
list. You can make it recursive too, so one may specify multiple json files, etc.
print(f"\033[91mError: The JSON file must contain a dictionary.\033[0m", file=sys.stderr) | ||
sys.exit(1) | ||
|
||
except json.JSONDecodeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't swallow this error - make sure it gets printed out - it will contains the "reason" the json is bad, which will help the user figure out what they have to fix in their file.
if not args.save_to_txt: | ||
print(result) | ||
else: | ||
print("Saved to samples_structure.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a logging.info
message. :-)
|
||
parser.add_argument("dataset", nargs='+', help="Input datasets (Rucio DID) or a JSON file containing datasets in a dict.") | ||
parser.add_argument("--filter-branch", default="", help="Only display branches containing this string.") | ||
parser.add_argument("--save-to-txt", action="store_true", help="Save output to a text file instead of printing.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence about this - you can just pipe this to your output file (with >
or |
). Do we need this?
import awkward as ak | ||
|
||
def run_query(input_filenames=None): | ||
import uproot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the next line seem like they are repeats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean for the imports?
Since the run_query is sent to the transformer, it must call the import also.
{ | ||
"NFiles": 1, | ||
"Name": name, | ||
"Dataset": servicex.dataset.Rucio(did), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ok for this - but we will want to support the other types of inputs as well, I think. One way to do this is to pass in the resolved object (e.g. here servicex.dataset.Rucio(did)
) in place of did
in the dataset_dict
. Then someone using the library can use which ever input source they want to.
Might be a version 1.1 change?
New function:
get_structure
ak.array
an str encoding the file structureservicex.deliver
callsamples-structure.txt
ak.array
from the encoded str and returns the type constructorThe function can be called from the terminal:
servicex-get-structure
Options are added to save to .txt, load a single or multiple DS, write all DS in a .json to be loaded by the command.
Many helpers were added for this feature,
run_query
,build_deliver_spec
,print_structure_from_str
,parse_jagged_depth_and_dtype
,str_to_array
,run_from_command