9.1 Useful Pachyderm Commands
Option 1: Enter the following command in the terminal (change pachd2 to whatever is appropriate)
pachctl config set active-context pachd2
Option 2: Manually edit the config file:
- In the terminal:
vi ~/.pachyderm/config.json
- Type i to enter insert mode
- Change the active_context field to pachd2 (or whatever is appropriate)
- Hit ESC
- Type :wq and press Enter to save and quit
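After either option it can be worth confirming that the switch took effect. The following is a minimal sketch, not an official procedure: it assumes your pachctl version provides pachctl config get active-context (the read counterpart of the set command above), and the function name switch_context is hypothetical.

```shell
# Hypothetical helper: switch the active context, then confirm the change.
switch_context() {
    target="$1"
    pachctl config set active-context "$target" || return 1
    # Assumption: `pachctl config get active-context` prints the context name.
    current="$(pachctl config get active-context)"
    if [ "$current" = "$target" ]; then
        echo "Active context is now $target"
    else
        echo "Expected context $target but found $current" >&2
        return 1
    fi
}

# Usage: switch_context pachd2
```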
Using pachctl inspect pipeline <your_pipeline_name> lets you see the status of your pipeline, how it was configured, and sometimes why it failed.
Failure reasons are only printed if the Docker container for your pipeline doesn't initialize. This is usually the case if you specified an image in your pipeline spec that doesn't exist, so check the spelling and image version. If everything looks fine, or the reason for failure is not apparent, check the job logs for the pipeline.
The following command lists the recent jobs for all pipelines:
pachctl list job --expand
If you see a failure for your pipeline, first look to see if an upstream pipeline failed. This will automatically cause failure for all downstream pipelines. Let's assume that your pipeline is where the problem started, or maybe it was successful but you still want to see what happened in the code when the job ran. Take note of the job ID for your pipeline and see the section below to look at the logs for that job.
You can pipe the output of pachctl list job, pachctl list pipeline, or pachctl list repo through grep to show only the lines that match a given string, like so:
pachctl list pipeline | grep <SENSOR>
Logs will only be written out for pipelines that have the standby option set to false in the pipeline's JSON file ("standby":false). Change this value from true as needed, then update your pipeline with:
pachctl update pipeline -f ~/PATH/TO/PIPELINE/FILE/PIPELINE_FILE.json
Once your pipeline has started running (or has finished/failed), list the jobs for the pipeline:
pachctl list job --expand --pipeline=<insert_pipeline_name>
Example: pachctl list job --expand --pipeline=par-surfacewater_context_filter
Which will produce something like:
ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE
2efb42c31b444f338991b14be0874ad4 pipeline_name 9 minutes ago 4 seconds 0 0 + 5 / 5 0B 0B failure
Copy the job ID and use it in the following command:
pachctl logs --job=<YOUR-JOB-ID-FROM-ABOVE>
If nothing prints, nothing was logged. This could mean things are fine, or that you didn't set "standby":false.
REMEMBER!: Set standby to true after you are done!
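Copying job IDs by hand gets tedious when several jobs have failed. The following is a rough sketch, assuming the pachctl list job table keeps the layout shown above (job ID in the first column, STATE in the last column). The helper name failed_job_ids is hypothetical.

```shell
# Hypothetical helper: read `pachctl list job` table output on stdin and
# print the ID (first column) of every row whose last column is "failure".
failed_job_ids() {
    awk '$NF == "failure" { print $1 }'
}

# Usage (prints the logs of each failed job for one pipeline):
#   pachctl list job --expand --pipeline=<insert_pipeline_name> | failed_job_ids |
#       while read -r job; do pachctl logs --job="$job"; done
```

If the column layout differs in your pachctl version, adjust the awk condition accordingly.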
pachctl update pipeline --reprocess -f ~/[PATH TO FILE]
Example:
pachctl update pipeline --reprocess -f ~/R/NEON-IS-data-processing/pipe/pqs1/pqs1_merge_data_by_location.json
For pipeline specs in json format:
pachctl inspect pipeline [pipeline_name] --raw -o json | pachctl update pipeline --reprocess
For pipeline specs in yaml format:
pachctl inspect pipeline [pipeline_name] --raw -o yaml | pachctl update pipeline --reprocess
To save the current pipeline to a file (instead of reloading it):
pachctl inspect pipeline [pipeline_name] --raw -o json > [/path/to/new/file.json]
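To save the specs of several pipelines at once, the command above can be wrapped in a loop. This is a minimal sketch that uses only the inspect command shown above. The function name save_pipeline_specs and the one-file-per-pipeline naming are assumptions made for illustration.

```shell
# Hypothetical helper: save the current spec of each named pipeline to
# <out_dir>/<pipeline_name>.json using the inspect command shown above.
save_pipeline_specs() {
    out_dir="$1"
    shift
    mkdir -p "$out_dir" || return 1
    for name in "$@"; do
        pachctl inspect pipeline "$name" --raw -o json > "$out_dir/$name.json" || return 1
        echo "Saved $out_dir/$name.json"
    done
}

# Usage: save_pipeline_specs /path/to/backup/folder pipeline_a pipeline_b
```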
If you just have one file to upload into a Pachyderm repo, you can use the standalone pachctl put file command:
pachctl put file <repo>@<branch>:</path/to/file> -f </path/to/local/file>
If you have a whole folder of files to put into a pachyderm repo, you can put the whole folder in using:
pachctl put file -r <repo>@<branch>:</path/to/folder> -f </path/to/local/folder>
Note the -r flag, which means recursive.
If you have a bunch of files or folders that cannot be put into the Pachyderm repo with a single pachctl put file command, it's important to start a commit, put the files, then finish the commit. Why? Every time you use pachctl put file as a standalone command, it creates a commit in the repo you are placing your files in. Each commit will result in a processing job for the pipelines downstream of the repo. This can create a lot of processing overhead, especially if the chain of downstream pipelines is long and you run the command multiple times. Putting all the files in under a single commit is as simple as this:
pachctl start commit <repo>@<branch>
pachctl put file ...    # repeat as many times as you need to
pachctl finish commit <repo>@<branch>
If you want to be really savvy, use the commit ID that is generated by the pachctl start commit <repo>@<branch> command when you put the files in:
pachctl start commit <repo>@<branch>
3jsnv095mkd0mdjvghasklw305612
pachctl put file <repo>@3jsnv095mkd0mdjvghasklw305612:</path/to/file1> -f </path/to/local/file1>
pachctl put file <repo>@3jsnv095mkd0mdjvghasklw305612:</path/to/file2> -f </path/to/local/file2>
pachctl put file <repo>@3jsnv095mkd0mdjvghasklw305612:</path/to/file3> -f </path/to/local/file3>
Of course, replace that unique ID with the one that is output to the screen after you start the commit. What this allows you to do is view what you've done to make sure all is well before finishing the commit and kicking off a job. First, view what you've done:
pachctl list file <repo>@3jsnv095mkd0mdjvghasklw305612:/**
If you see a problem, you can always start over by deleting the commit before you finish it:
pachctl delete commit <repo>@3jsnv095mkd0mdjvghasklw305612
If everything looks good, then finish the commit:
pachctl finish commit <repo>@3jsnv095mkd0mdjvghasklw305612
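The start, put, review, and finish steps above can be sketched as a single function. This is a minimal sketch under two assumptions: that pachctl start commit prints only the new commit ID on standard output (as the example above suggests), and that each local file should land at the repo root under its own file name. The function name put_files_in_one_commit is hypothetical.

```shell
# Hypothetical helper: put several local files into a repo under ONE commit.
put_files_in_one_commit() {
    repo="$1"
    branch="$2"
    shift 2
    # Assumption: the commit ID is the only thing printed on stdout.
    commit="$(pachctl start commit "${repo}@${branch}")" || return 1
    for f in "$@"; do
        if ! pachctl put file "${repo}@${commit}:/$(basename "$f")" -f "$f"; then
            # Start over rather than finishing a partial commit.
            pachctl delete commit "${repo}@${commit}"
            return 1
        fi
    done
    # Show what was added, then finish the commit to kick off downstream jobs.
    pachctl list file "${repo}@${commit}:/**"
    pachctl finish commit "${repo}@${commit}"
}

# Usage: put_files_in_one_commit <repo> <branch> /path/to/file1 /path/to/file2
```

Because this sketch finishes the commit automatically, it skips the manual review step. Remove the last line of the function if you would rather inspect the listing and finish the commit yourself.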
Rob Markel wrote a Python script to stand up large sections of a DAG, assuming you've created all the pipeline specs and the data_source pipelines have been set up (data_source_<SENSOR>_site, data_source_<SENSOR>_linkmerge, data_source_<SENSOR>_list_years). In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:
cd ~/R/NEON-IS-data-processing/utilities
From there, run the following:
python3 -B -m dag.create_dag --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG>
For example:
python3 -B -m dag.create_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json
Note that the paths you put into the arguments must be absolute paths (don't use e.g. ~/R/...). If a DAG is complicated you may get some “Pipeline not found” messages in the output when the script runs. You can run the script repeatedly until these disappear. It does not cause any issue to run the script more than once. Note that the script above will only stand up the pipeline specs within a single folder, so if your DAG is spread across multiple folders, you'll need to run the script for each folder.
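Since the script needs absolute paths, a small shell helper can convert a relative path before it is passed in. This is a minimal sketch using only standard shell tools. The name abs_path is hypothetical, and the usage example assumes the specs live in ../pipe/exo relative to the utilities folder.

```shell
# Hypothetical helper: print the absolute form of a path to an existing
# file or folder, so relative paths are not passed to the DAG scripts.
abs_path() {
    if [ -d "$1" ]; then
        (cd "$1" && pwd -P)
    else
        (cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd -P)" "$(basename "$1")")
    fi
}

# Usage, run from the utilities folder:
#   python3 -B -m dag.create_dag \
#       --spec_dir="$(abs_path ../pipe/exo)" \
#       --end_node_spec="$(abs_path ../pipe/exo/exo2_named_location_filter.json)"
```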
If you are working on the som development server, all the python packages needed in order to run the script are already installed. No need to read further. If not, you'll need python3 installed. Once you've done that you'll need to install dependency packages. To do so, navigate to the utilities/dag folder of your local NEON-IS-data-processing Git repo. Then:
sudo pip3 install -r requirements.txt
sudo python3 -m pip install graphviz
sudo yum install graphviz
Similar to standing up a whole DAG (above), you can delete a whole DAG. In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:
cd ~/R/NEON-IS-data-processing/utilities
From there, run the following:
python3 -B -m dag.delete_dag --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG>
For example:
python3 -B -m dag.delete_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json
You'll probably get a bunch of warnings for each pipeline you delete, like:
WARNING: If using the --split-txn flag, this command must run until complete. If a failure or incomplete run occurs, then Pachyderm will be left in an inconsistent state. To resolve an inconsistent state, rerun this command.
Accept the warning with a y each time and rerun the whole command until all the related pipelines are deleted. This may take a long time and several passes.
The same notes in the section above about using absolute paths and installing dependencies also apply here. See the "Standing up a whole DAG" section above.
You may encounter provenance errors while deleting the pipelines in the DAG. These typically result from someone having force-deleted one or more pipelines in the past that they shouldn't have. As a reminder, never force-delete a pipeline that is an input to another pipeline. These errors look something like this:
error fixing commit subvenance tempSoil_calibrated_data/1433908606244064999599c74485acb5: /pachyderm_pfs/commits/tempSoil_calibrated_data/ 1433908606244064999599c74485acb5 not found
These errors are a problem because they won't allow you to delete the pipeline or any pipelines upstream from it. If you get a provenance error when deleting a pipeline that has no other downstream pipelines attached, try this:
pachctl fsck --fix
This command may take a few hours, and hopefully after it completes you will be able to delete the full DAG.
Similar to standing up a whole DAG (above), you can update a whole DAG, with or without reprocessing. In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:
cd ~/R/NEON-IS-data-processing/utilities
From there, run the following if you want to update all the pipelines in a DAG without reprocessing:
python3 -B -m dag.update_dag --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG>
For example:
python3 -B -m dag.update_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json
If you want to update with reprocessing, replace dag.update_dag in the code above with dag.update_reprocess_dag. For example:
python3 -B -m dag.update_reprocess_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json
The same notes in the "Standing up a whole DAG" section above about using absolute paths and installing dependencies also apply here.
NOTE: The update script above does all the updating in a single Pachyderm transaction. If you cancel execution of the script before it completes, or it errors somewhere in the middle, you MUST complete or delete the transaction manually using pachctl finish transaction or pachctl delete transaction <transaction-id>, respectively. Otherwise, nothing you do afterward will take effect until the transaction is completed or deleted.
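One way to make a leftover transaction harder to miss is to wrap the update in a function that reports on failure. This is only a sketch: the name update_dag_guarded is hypothetical, it assumes pachctl list transaction is available in your pachctl version to show open transactions, and it will not run if you cancel with Ctrl-C, so the manual check above still applies.

```shell
# Hypothetical wrapper: run the DAG update and, if it fails, list the
# transactions so a leftover one can be finished or deleted by hand.
update_dag_guarded() {
    spec_dir="$1"
    end_node_spec="$2"
    if ! python3 -B -m dag.update_dag --spec_dir="$spec_dir" --end_node_spec="$end_node_spec"; then
        echo "Update failed. Check for an open transaction:" >&2
        # Assumption: `pachctl list transaction` lists existing transactions.
        pachctl list transaction >&2
        return 1
    fi
}

# Usage, run from the utilities folder with absolute paths:
#   update_dag_guarded <path to the spec folder> <path to the last pipeline spec>
```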
There is also a handy utility to graph the DAG based on the pipeline specs in a given folder. In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:
cd ~/R/NEON-IS-data-processing/utilities
From there, run the following:
python3 -B -m dag.graph_dag --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG>
For example:
python3 -B -m dag.graph_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json
This will create a PDF graphic showing how all of the pipelines are connected. You might get an error on Linux saying xdg-open: no method available for opening 'pipeline-graph.pdf'. Just Ctrl-C out of that and open the file from RStudio. The file will be saved in the utilities folder of the Git project, so make sure to delete it or move it somewhere else before committing your work. Note that the visual layout is generated automatically, so it may not match the linear flow you might expect, but it is very useful for showing how the pipelines are connected.
The same notes in the sections above about using absolute paths and installing dependencies also apply here. See the "Standing up a whole DAG" section above.
# Put every running pipeline on standby
for pipe in $(pachctl list pipeline --raw | jq -r '. | select(.state=="PIPELINE_RUNNING") | .pipeline.name'); do
  echo "Putting pipeline $pipe on standby"
  pachctl extract pipeline "$pipe" -o json | jq -r '.standby = true' | pachctl update pipeline
done
Note: This can only be done on a repo without any upstream pipelines. Earliest commit is deleted first.
# Be sure to edit the repo name and the date string below.
export repo=<your repo name>
# tac reverses the listing so that the earliest commit is deleted first
for commit in $(pachctl list commit "$repo@master" --raw | jq -r '. | select(.finished <= "2021-05-01T00:00:00") | .commit.id' | tac); do
  echo "Deleting commit $repo@$commit"
  pachctl delete commit "$repo@$commit"
done
Because this is a bigger issue, the notes have been moved to their own wiki page
Pachyderm, and common sense, really want you to delete repos 'backwards' from the end of the pipeline to the start. However, you may be tempted to circumvent this and wholesale delete a middle repo using the --force switch.
This will immediately delete the pipeline, even if there are other pipelines dependent on it. DON'T DO THIS. Not only will this break downstream pipelines/repos, but it will also create a bunch of provenance errors that may not be fixable without wiping away the entire pipeline. I promise, you'll regret using the --force option.