
Get DS file structure with serviceX tool #4

Open · wants to merge 17 commits into main
Conversation

@ArturU043 (Collaborator)

New function: get_structure

  1. Creates a ServiceX Python-function query that reads one file of the requested DS and returns, inside an ak.Array, a string encoding the file structure
  2. Builds a deliver spec with support for multiple samples and user-defined names
  3. Retrieves the encoded string from the servicex.deliver call
  4. (opt) Prints the encoded string in a user-friendly format
  5. (opt) Returns the reformatted string
  6. (opt) Saves the reformatted string to samples-structure.txt
  7. (opt) Reconstructs a dummy ak.Array from the encoded string and returns its type constructor
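Steps 1–3 above can be sketched on the client side. The function name `build_deliver_spec` comes from the PR description, but the body below is illustrative only: the real code wraps each DID in `servicex.dataset.Rucio(...)`, while plain strings are used here so the sketch stays self-contained.

```python
# Hypothetical sketch of build_deliver_spec (step 2).
def build_deliver_spec(datasets, query="get_structure_query"):
    # Accept either a single DID string or a dict of {sample_name: did}
    if isinstance(datasets, str):
        datasets = {"Sample": datasets}
    return {
        "Sample": [
            {"NFiles": 1, "Name": name, "Dataset": did, "Query": query}
            for name, did in datasets.items()
        ]
    }

# Two samples with user-defined names (DIDs are invented)
spec = build_deliver_spec({"ttbar": "scope:ttbar.DAOD", "zjets": "scope:zjets.DAOD"})
```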

The function can also be invoked from the terminal as servicex-get-structure.
Options are provided to save the output to a .txt file, to load one or more DS, and to supply a .json file listing all DS to be loaded by the command.

Several helpers were added for this feature: run_query, build_deliver_spec, print_structure_from_str, parse_jagged_depth_and_dtype, str_to_array, and run_from_command.

@ArturU043 ArturU043 self-assigned this Mar 25, 2025
@ArturU043 (Collaborator, Author)

This is my first attempt at building this feature. Initially I didn't expect to reconstruct ak.Arrays from the encoded string, but after stepping back, I wonder whether the approach is over-complex.

For example, should I use regex matching instead of positional methods to extract information from the encoded string?

Should I write a simpler encoded string using the awkward type constructor directly?

@gordonwatts left a comment

Ok - nice! I like this and this is going to be very useful. I agree with your comment about simplifying things. Here is what I think should be done:

  1. Use json (with the built-in json module) to generate the output on ServiceX
  2. Use the json module to parse it on the client.

This should significantly simplify the code - the built-in json parser is basically bulletproof. Once that is done, the downstream handling can probably be simplified significantly as well.
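The suggestion amounts to a plain json round trip. The structure dict below is invented for illustration; in the real code it would be built by run_query on the transformer.

```python
import json

# Transformer side: describe the file structure with ordinary dicts and
# serialize it using the built-in json module.
structure = {
    "CollectionTree": {
        "el_pt": {"dtype": "float32", "depth": 1},
        "runNumber": {"dtype": "int32", "depth": 0},
    }
}
encoded = json.dumps(structure)

# Client side: json.loads replaces the positional string parsing.
decoded = json.loads(encoded)
assert decoded == structure
```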

import os
from .file_peeking import get_structure

def run_from_command():


It looks like you have not run black on the code. Could you be sure to do that? It might be worth doing it in a new PR, because it will make "everything" look changed. But uniform formatting done by black is pretty nice. I'll add this as an issue. :-)

parser.add_argument("--save-to-txt", action="store_true", help="Save output to a text file instead of printing.")

args = parser.parse_args()


You might look into using the typer library for a command line. It handles much of this automatically and also gives you a bunch of extra power (like shell completions) automatically.
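A hedged sketch of what the same interface could look like with typer (this is a hypothetical port of the PR's argparse flags, not code from the PR):

```python
from typing import List

import typer

app = typer.Typer()

@app.command()
def get_structure(
    dataset: List[str] = typer.Argument(..., help="Rucio DIDs or a JSON file"),
    filter_branch: str = typer.Option("", help="Only display branches containing this string"),
    save_to_txt: bool = typer.Option(False, help="Save output to a text file"),
) -> None:
    ...  # call into the library here

# Calling app() would run the CLI, with shell completion available for free.
```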


And add an example json file or something so people know what the json file format looks like? Or that goes in documentation proper? It should go somewhere.
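For reference, such an example file might look like the following (sample names and DIDs are invented; the file is simply a dict mapping user-chosen names to DIDs):

```json
{
  "ttbar": "user.example:user.example.ttbar_PHYSLITE",
  "zjets": "user.example:user.example.zjets_PHYSLITE"
}
```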

dataset_file = args.dataset[0]

if not os.path.isfile(dataset_file):
print(f"\033[91mError: JSON file '{dataset_file}' not found.\033[0m", file=sys.stderr)


Don't use print; use logging instead - in this case logging.error.
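A sketch of that error path using logging.error in place of the colored print (the helper name is hypothetical; here the caller decides whether to sys.exit):

```python
import logging
import os

def dataset_file_exists(dataset_file: str) -> bool:
    # Report via the logging framework rather than printing to stderr
    if not os.path.isfile(dataset_file):
        logging.error("JSON file '%s' not found.", dataset_file)
        return False
    return True
```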


Also, hard-coding escape sequences like this isn't great. I get what you are trying to do, but...

args = parser.parse_args()

if len(args.dataset) == 1 and args.dataset[0].endswith(".json"):
dataset_file = args.dataset[0]


This whole if statement might be a separate function that returns the dataset list. You can make it recursive too, so one may specify multiple json files, etc.
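One way that separate function could look (a sketch with a hypothetical name; recursing into further nested file lists, as suggested, is left out):

```python
import json
import os

def expand_datasets(entries):
    """Turn a mixed list of DIDs and .json files into one {name: did} dict."""
    datasets = {}
    for entry in entries:
        if entry.endswith(".json") and os.path.isfile(entry):
            with open(entry) as f:
                datasets.update(json.load(f))  # file maps names to DIDs
        else:
            datasets[entry] = entry  # bare DID: name it after itself
    return datasets
```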

print(f"\033[91mError: The JSON file must contain a dictionary.\033[0m", file=sys.stderr)
sys.exit(1)

except json.JSONDecodeError:


Don't swallow this error - make sure it gets printed out - it will contain the "reason" the json is bad, which will help the user figure out what they have to fix in their file.
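A sketch of surfacing the error rather than swallowing it (helper name hypothetical; the JSONDecodeError message carries the line and column of the problem):

```python
import json
import logging

def load_dataset_dict(path):
    try:
        with open(path) as f:
            content = json.load(f)
    except json.JSONDecodeError as err:
        # Pass the parser's own diagnosis on to the user
        logging.error("Invalid JSON in '%s': %s", path, err)
        raise SystemExit(1)
    if not isinstance(content, dict):
        logging.error("The JSON file must contain a dictionary.")
        raise SystemExit(1)
    return content
```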

if not args.save_to_txt:
print(result)
else:
print("Saved to samples_structure.txt")


This should be a logging.info message. :-)


parser.add_argument("dataset", nargs='+', help="Input datasets (Rucio DID) or a JSON file containing datasets in a dict.")
parser.add_argument("--filter-branch", default="", help="Only display branches containing this string.")
parser.add_argument("--save-to-txt", action="store_true", help="Save output to a text file instead of printing.")


I'm on the fence about this - you can just pipe this to your output file (with > or |). Do we need this?

import awkward as ak

def run_query(input_filenames=None):
import uproot


This and the next line seem like they are repeats?

@ArturU043 (Collaborator, Author)


You mean the imports?

Since run_query is sent to the transformer, it must perform the imports itself.
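That constraint can be made explicit in the code. A sketch (the comment is the point; uproot and awkward are assumed to be available in the transformer image):

```python
def run_query(input_filenames=None):
    # Only this function's source is shipped to and executed on the
    # ServiceX transformer, so module-level imports from the client file
    # do not exist there - the imports must happen inside the body.
    import uproot  # available in the transformer image
    import awkward as ak

    ...  # open the file with uproot and encode its structure
```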

{
"NFiles": 1,
"Name": name,
"Dataset": servicex.dataset.Rucio(did),


This is ok for this - but we will want to support the other types of inputs as well, I think. One way to do this is to pass in the resolved object (e.g. here servicex.dataset.Rucio(did)) in place of did in the dataset_dict. Then someone using the library can use whichever input source they want.

Might be a version 1.1 change?
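The suggested pass-through could look like this (a hypothetical v1.1 sketch: plain strings keep today's Rucio behavior, anything else is assumed to be an already-resolved dataset object):

```python
def resolve_dataset(entry):
    # Strings are treated as Rucio DIDs, matching the current behavior;
    # resolved dataset objects (FileList, etc.) are passed through untouched.
    if isinstance(entry, str):
        import servicex
        return servicex.dataset.Rucio(entry)
    return entry
```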
