
Get DS file structure with serviceX tool #4

Open · wants to merge 17 commits into main
Conversation

@ArturU043 (Collaborator)

New function: get_structure

  1. Creates a ServiceX Python-function query that reads one file of the requested DS and returns, inside an ak.Array, a string encoding the file structure
  2. Builds a deliver spec with support for multiple samples and user-defined names
  3. Retrieves the encoded string from the servicex.deliver call
  4. (opt) Prints the encoded string in a user-friendly format
  5. (opt) Returns the reformatted string
  6. (opt) Saves the reformatted string to samples-structure.txt
  7. (opt) Reconstructs a dummy ak.Array from the encoded string and returns its type constructor
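Steps 1–3 above can be sketched on the client side. The function name `build_deliver_spec` comes from the PR description, but the body below is illustrative only: the real code wraps each DID in `servicex.dataset.Rucio(...)`, while plain strings are used here so the sketch stays self-contained.

```python
# Hypothetical sketch of build_deliver_spec (step 2).
def build_deliver_spec(datasets, query="get_structure_query"):
    # Accept either a single DID string or a dict of {sample_name: did}
    if isinstance(datasets, str):
        datasets = {"Sample": datasets}
    return {
        "Sample": [
            {"NFiles": 1, "Name": name, "Dataset": did, "Query": query}
            for name, did in datasets.items()
        ]
    }

# Two samples with user-defined names (DIDs are invented)
spec = build_deliver_spec({"ttbar": "scope:ttbar.DAOD", "zjets": "scope:zjets.DAOD"})
```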

The function can also be invoked from the terminal as servicex-get-structure.
Options are provided to save the output to a .txt file, to load one or more DS, and to supply a .json file listing all DS to be loaded by the command.

Several helpers were added for this feature: run_query, build_deliver_spec, print_structure_from_str, parse_jagged_depth_and_dtype, str_to_array, and run_from_command.

@ArturU043 ArturU043 self-assigned this Mar 25, 2025
@ArturU043 (Collaborator, Author)

This is my first attempt at building this feature. Initially I didn't expect to reconstruct ak.Arrays from the encoded string, but after stepping back, I wonder whether the approach is over-complex.

For example, should I use regex matching instead of positional methods to extract information from the encoded string?

Should I write a simpler encoded string using the awkward type constructor directly?

@gordonwatts left a comment

Ok - nice! I like this and this is going to be very useful. I agree with your comment about simplifying things. Here is what I think should be done:

  1. Use json (with the built-in json module) to generate the output on ServiceX
  2. Use the json module to parse it on the client.

This should significantly simplify the code - the built-in json parser is basically bulletproof. Once that is done, the downstream handling can probably be simplified significantly as well.
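The suggestion amounts to a plain json round trip. The structure dict below is invented for illustration; in the real code it would be built by run_query on the transformer.

```python
import json

# Transformer side: describe the file structure with ordinary dicts and
# serialize it using the built-in json module.
structure = {
    "CollectionTree": {
        "el_pt": {"dtype": "float32", "depth": 1},
        "runNumber": {"dtype": "int32", "depth": 0},
    }
}
encoded = json.dumps(structure)

# Client side: json.loads replaces the positional string parsing.
decoded = json.loads(encoded)
assert decoded == structure
```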

import os
from .file_peeking import get_structure

def run_from_command():


It looks like you have not run black on the code. Could you be sure to do that? It might be worth doing it in a new PR, because it will make "everything" look changed. But uniform formatting done by black is pretty nice. I'll add this as an issue. :-)

parser.add_argument("--save-to-txt", action="store_true", help="Save output to a text file instead of printing.")

args = parser.parse_args()


You might look into using the typer library for a command line. It handles much of this automatically and also gives you a bunch of extra power (like shell completions) automatically.
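A hedged sketch of what the same interface could look like with typer (this is a hypothetical port of the PR's argparse flags, not code from the PR):

```python
from typing import List

import typer

app = typer.Typer()

@app.command()
def get_structure(
    dataset: List[str] = typer.Argument(..., help="Rucio DIDs or a JSON file"),
    filter_branch: str = typer.Option("", help="Only display branches containing this string"),
    save_to_txt: bool = typer.Option(False, help="Save output to a text file"),
) -> None:
    ...  # call into the library here

# Calling app() would run the CLI, with shell completion available for free.
```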


And add an example json file or something so people know what the json file format looks like? Or that goes in documentation proper? It should go somewhere.
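For reference, such an example file might look like the following (sample names and DIDs are invented; the file is simply a dict mapping user-chosen names to DIDs):

```json
{
  "ttbar": "user.example:user.example.ttbar_PHYSLITE",
  "zjets": "user.example:user.example.zjets_PHYSLITE"
}
```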

dataset_file = args.dataset[0]

if not os.path.isfile(dataset_file):
print(f"\033[91mError: JSON file '{dataset_file}' not found.\033[0m", file=sys.stderr)


Don't use print; use logging instead - in this case logging.error.
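A sketch of that error path using logging.error in place of the colored print (the helper name is hypothetical; here the caller decides whether to sys.exit):

```python
import logging
import os

def dataset_file_exists(dataset_file: str) -> bool:
    # Report via the logging framework rather than printing to stderr
    if not os.path.isfile(dataset_file):
        logging.error("JSON file '%s' not found.", dataset_file)
        return False
    return True
```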


Also, hard-coding escape sequences like this isn't great. I get what you are trying to do, but...

args = parser.parse_args()

if len(args.dataset) == 1 and args.dataset[0].endswith(".json"):
dataset_file = args.dataset[0]


This whole if statement might be a separate function that returns the dataset list. You can make it recursive too, so one may specify multiple json files, etc.
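One way that separate function could look (a sketch with a hypothetical name; recursing into further nested file lists, as suggested, is left out):

```python
import json
import os

def expand_datasets(entries):
    """Turn a mixed list of DIDs and .json files into one {name: did} dict."""
    datasets = {}
    for entry in entries:
        if entry.endswith(".json") and os.path.isfile(entry):
            with open(entry) as f:
                datasets.update(json.load(f))  # file maps names to DIDs
        else:
            datasets[entry] = entry  # bare DID: name it after itself
    return datasets
```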

print(f"\033[91mError: The JSON file must contain a dictionary.\033[0m", file=sys.stderr)
sys.exit(1)

except json.JSONDecodeError:


Don't swallow this error - make sure it gets printed out - it will contain the "reason" the json is bad, which will help the user figure out what they have to fix in their file.
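A sketch of surfacing the error rather than swallowing it (helper name hypothetical; the JSONDecodeError message carries the line and column of the problem):

```python
import json
import logging

def load_dataset_dict(path):
    try:
        with open(path) as f:
            content = json.load(f)
    except json.JSONDecodeError as err:
        # Pass the parser's own diagnosis on to the user
        logging.error("Invalid JSON in '%s': %s", path, err)
        raise SystemExit(1)
    if not isinstance(content, dict):
        logging.error("The JSON file must contain a dictionary.")
        raise SystemExit(1)
    return content
```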

if not args.save_to_txt:
print(result)
else:
print("Saved to samples_structure.txt")


This should be a logging.info message. :-)


parser.add_argument("dataset", nargs='+', help="Input datasets (Rucio DID) or a JSON file containing datasets in a dict.")
parser.add_argument("--filter-branch", default="", help="Only display branches containing this string.")
parser.add_argument("--save-to-txt", action="store_true", help="Save output to a text file instead of printing.")


I'm on the fence about this - you can just pipe this to your output file (with > or |). Do we need this?

import awkward as ak

def run_query(input_filenames=None):
import uproot


This and the next line seem like they are repeats?

@ArturU043 (Collaborator, Author)


You mean the imports?

Since run_query is sent to the transformer, it must perform the imports itself.
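That constraint can be made explicit in the code. A sketch (the comment is the point; uproot and awkward are assumed to be available in the transformer image):

```python
def run_query(input_filenames=None):
    # Only this function's source is shipped to and executed on the
    # ServiceX transformer, so module-level imports from the client file
    # do not exist there - the imports must happen inside the body.
    import uproot  # available in the transformer image
    import awkward as ak

    ...  # open the file with uproot and encode its structure
```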

{
"NFiles": 1,
"Name": name,
"Dataset": servicex.dataset.Rucio(did),


This is ok for this - but we will want to support the other types of inputs as well, I think. One way to do this is to pass in the resolved object (e.g. here servicex.dataset.Rucio(did)) in place of did in the dataset_dict. Then someone using the library can use whichever input source they want.

Might be a version 1.1 change?
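The suggested pass-through could look like this (a hypothetical v1.1 sketch: plain strings keep today's Rucio behavior, anything else is assumed to be an already-resolved dataset object):

```python
def resolve_dataset(entry):
    # Strings are treated as Rucio DIDs, matching the current behavior;
    # resolved dataset objects (FileList, etc.) are passed through untouched.
    if isinstance(entry, str):
        import servicex
        return servicex.dataset.Rucio(entry)
    return entry
```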
