neurostore-text-extraction
The meta-runner scripts handle data inputs from mega-ni-dataset.
Since the input data is split into distinct slices (hashes), the pipelines operate on specific slices and produce a separate output for each slice (keyed by its reference hash), plus the hash of the pipeline arguments.
The meta-runner decides whether a pipeline needs to be run on an {input_data_hash} by checking whether an output already exists for the given pipeline/arghash combination.
Optionally, a pipeline may be force-rerun, generating a new timestamped output folder.
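As a minimal sketch of that skip/re-run decision (the helper names, the `outputs/` location, and the hash length below are assumptions, not the actual meta-runner code):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

OUTPUTS_DIR = Path("outputs")  # assumed location of the outputs/ tree


def arg_hash(args: dict) -> str:
    """Stable hash of the pipeline arguments (sorted keys -> same hash)."""
    payload = json.dumps(args, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def needs_run(input_data_hash: str, pipeline_name: str, args: dict,
              force: bool = False) -> Path | None:
    """Return a fresh output dir to write into, or None if output exists."""
    base = OUTPUTS_DIR / input_data_hash / pipeline_name
    ahash = arg_hash(args)
    # Skip if any output for this pipeline/arghash combination already exists.
    if not force and any(base.glob(f"{ahash}-*")):
        return None
    # Force re-runs (and first runs) get a new timestamped folder.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    out_dir = base / f"{ahash}-{stamp}"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```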
outputs/
    {input_data_hash}/
        {pipeline_name}/
            {arghash-timestamp}/
                features.csv
                descriptions.csv
                args.json
                info.json
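For illustration, a pipeline run might populate one of these `{arghash-timestamp}` folders like this (the function and the exact contents of info.json are assumptions, not the actual pipeline code):

```python
import json
from pathlib import Path


def write_outputs(out_dir: Path, features_csv: str, descriptions_csv: str,
                  args: dict, info: dict) -> None:
    (out_dir / "features.csv").write_text(features_csv)
    (out_dir / "descriptions.csv").write_text(descriptions_csv)
    # args.json records the exact arguments behind the arghash;
    # info.json holds run metadata (e.g. pipeline version, timestamps).
    (out_dir / "args.json").write_text(json.dumps(args, sort_keys=True, indent=2))
    (out_dir / "info.json").write_text(json.dumps(info, indent=2))
```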
pipelines/
    {pipeline_name}/
        run.py
scripts/
    run_all.py
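A hypothetical sketch of what `scripts/run_all.py` could look like: iterate every pipeline over every input slice, skipping pipeline/arghash combinations that already have outputs. The `metarunner` module, the `DEFAULT_ARGS` convention, the `run(...)` entry point, and the data path are all assumptions:

```python
import importlib
from pathlib import Path

from metarunner import needs_run  # hypothetical module holding the helper sketched above

PROCESSED_DATA_DIR = Path("mega-ni-dataset/processed_data")  # assumed checkout path
PIPELINES_DIR = Path("pipelines")


def main() -> None:
    for slice_dir in sorted(p for p in PROCESSED_DATA_DIR.iterdir() if p.is_dir()):
        for pipeline_dir in sorted(p for p in PIPELINES_DIR.iterdir() if p.is_dir()):
            # assumed convention: each pipeline exposes run.py with run(slice_dir, out_dir, **args)
            run_mod = importlib.import_module(f"pipelines.{pipeline_dir.name}.run")
            args = getattr(run_mod, "DEFAULT_ARGS", {})
            out_dir = needs_run(slice_dir.name, pipeline_dir.name, args)
            if out_dir is not None:  # None means output already exists; skip
                run_mod.run(slice_dir, out_dir, **args)


if __name__ == "__main__":
    main()
```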
mega-ni-dataset
Organization of mega-ni-dataset (separate repo)
/input_data
    searches from ACE, pubget, and other data from the neurostore dataset
/processed_data
    previously called combined_data
    each folder is a hash_id of the contents
    when new data is acquired, a new hash_id folder is created from input_data
/combined_data
    for human consumption; combines all the hashes of processed_data into a single output
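A minimal sketch of how a content-derived hash_id for a new processed_data folder could be computed from an input_data snapshot; the hashing scheme here is an assumption, not the actual mega-ni-dataset tooling:

```python
import hashlib
from pathlib import Path


def content_hash_id(folder: Path) -> str:
    """Hash every file (relative path + bytes) under folder into one stable id."""
    digest = hashlib.sha256()
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(folder)).encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]


# e.g. create processed_data/{hash_id}/ from a new input_data snapshot:
# hash_id = content_hash_id(Path("input_data"))
# Path("processed_data", hash_id).mkdir(parents=True, exist_ok=True)
```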