diff --git a/docs/source/technical_tutorials/dps_tutorial/DPS_runner_template.ipynb b/docs/source/technical_tutorials/dps_tutorial/DPS_runner_template.ipynb new file mode 100644 index 00000000..f3064f34 --- /dev/null +++ b/docs/source/technical_tutorials/dps_tutorial/DPS_runner_template.ipynb @@ -0,0 +1,353 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "330e3ace", + "metadata": {}, + "source": [ + "# Prepare and launch a DPS batch of jobs for a particular algorithm\n", + "\n", + "**Goal** \n", + "Provide a template for DPS job submission that can be adapted to the specific algorithm being run in DPS.\n", + "\n", + "**Motivation** \n", + "It's easier to learn how to run many jobs of your script (where for each job there is some input that changes) if you can first see an example.\n", + "\n", + "Paul Montesano, PhD \n", + "paul.m.montesano@nasa.gov \n", + "June 2024" + ] + }, + { + "cell_type": "code", + "execution_count": 126, + "id": "ea7bcf9f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from maap.maap import MAAP\n", + "maap = MAAP()" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "id": "be655aaf-644c-4041-8d04-e1237a50a7f4", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'api.maap-project.org'" + ] + }, + "execution_count": 127, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "maap._MAAP_HOST" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c541eee", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "import glob\n", + "import datetime\n", + "import sys" + ] + }, + { + "cell_type": "markdown", + "id": "3a058f23-c0a1-4445-9656-70eb7489441b", + "metadata": {}, + "source": [ + "### Use the MAAP registration call in a notebook cell to register the DPS algorithm\n", + " - You need to register the DPS algorithm first, before you loop over the jobs that will use it.\n", + " - If you register your algorithm using the Register Algorithm UI in Jupyter, a configuration file (in yml format) will be placed in your workspace home folder, which can then be used as a template for reuse" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7810c9e6-5dc8-4969-b1f4-beb3d06e9d96", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "maap.register_algorithm_from_yaml_file(\"/projects/.../.../.yml\").text" + ] + }, + { + "cell_type": "markdown", + "id": "836409b4", + "metadata": {}, + "source": [ + "### Build a dictionary of the argument names and values needed to run the algorithm in the way you want\n", + "\n", + "This can be called a `parameters dictionary` \n", + "\n", + " - These will be arguments that the `.sh` wrapper (which calls your `.py` or `.R` code) is hard-coded to accept. \n", + " - The `.yml` file that you use to Register your algorithm is what connects this `parameters dictionary` to your `.sh` wrapper. \n", + " - This combo of the `parameters dictionary`, the `.yml`, and the `.sh` provides a specific (and repeatable) way of running your `.py` or `.R` code.\n",
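+ "\n", + "For example, the `algorithm_config_template.yml` that accompanies this tutorial declares a file input named `input_file` and positional inputs named `output_file` and `percent_reduction`; a matching `parameters dictionary` (a sketch, with illustrative values) would be:\n", + "```python\n", + "in_params_dict = {\n", + "    'input_file': 's3://some-bucket/path/to/input.tif',  # placeholder path\n", + "    'output_file': 'my_output.tif',\n", + "    'percent_reduction': '30'\n", + "}\n", + "```"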
+ ] + }, + { + "cell_type": "markdown", + "id": "c0fea3b7", + "metadata": {}, + "source": [ + "#### Note: make sure the `in_params_dict` coincides with the args of your underlying Python/R code" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "65681b96", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "in_params_dict = {\n", + " 'arg_name_1': 'some_value',\n", + " 'arg_name_2': 'another_value',\n", + " 'in_tile_num': 1\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "46e6ffc9-cc7d-4b56-a310-811774054d7e", + "metadata": {}, + "source": [ + "### Set up a list of items you want to run across - an example of algorithm input that will vary by job\n", + "\n", + "In this example, we are using geographic `tiles` to break up our processing. These tiles are defined by vector polygons and have ids that our `.sh`, `.py`, and `.yml` files are set up to take in as arguments. We use these ids as the basis for a loop that will sequentially submit our jobs to DPS. \n", + "\n", + "There are many ways one could decide to split up their DPS jobs - so this use of tiles here is just for the purposes of this example." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "4fd13e32-77c8-4641-82e9-85c0ad0e8cde", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Just an example of a list of some input parameter to your script that needs to vary for each job, thus creating multiple jobs\n", + "DPS_INPUT_TILE_NUM_LIST = [1,3,5,7,13,17,19]" + ] + }, + { + "cell_type": "markdown", + "id": "d72590cf-d9c4-438c-9a2d-684ab5d08549", + "metadata": {}, + "source": [ + "### Set up the general submission variables that will be applied to all runs of this DPS batch\n", + "\n", + "These will also determine the path of the DPS output under `/projects/my-private-bucket/dps_output`: \n", + "`/projects/my-private-bucket/dps_output/<algorithm_name>/<job_tag>/<date and time folders>`" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "e6c61e32-3550-43ff-aa3a-cbbfa97efb2d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# MAAP DPS submission variables\n", + "IDENTIFIER='BIOMASS_2020'\n", + "ALGO_VERSION = 'my_biomass_algorithm_v2024_1'\n", + "ALGO_ID = \"run_my_biomass_algorithm\"\n", + "USER = 'montesano'\n", + "WORKER_TYPE = 'maap-dps-worker-8gb'" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "01c52cde-1d06-4007-a637-34988938b099", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'BIOMASS_2020'" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "RUN_NAME = IDENTIFIER\n", + "RUN_NAME" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "6490e474-3f44-4634-b198-6c03eaccc171", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 3]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "DPS_INPUT_TILE_NUM_LIST[0:2]" + ] + }, + { + "cell_type": "markdown", + "id": "80232c11-dd65-43b4-9c50-40c9f2dc87a4", + "metadata": {}, + "source": [ + "### Set up a dir to hold the metadata output table from the DPS submission" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f45994e-10f6-405d-aeb9-f2263b4e7662", + "metadata": {}, + "outputs": [], + "source": [ + "DPS_SUBMISSION_RESULTS_DIR = '/projects/my-public-bucket/dps_submission_results'\n", + "!mkdir -p $DPS_SUBMISSION_RESULTS_DIR" + ] + }, + {
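"cell_type": "markdown", + "id": "0f1e2d3c", + "metadata": {}, + "source": [ + "### Optional: preview the parameters each job will receive\n", + "\n", + "Before submitting, it can help to sanity-check the per-job parameters. This sketch builds one dictionary per tile, exactly as the submission loop in the next section will, using only the variables defined above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a2b3c4d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# One parameters dictionary per tile; the submission loop below applies the same update per job\n", + "preview_df = pd.DataFrame([{**in_params_dict, 'in_tile_num': t} for t in DPS_INPUT_TILE_NUM_LIST])\n", + "preview_df" + ] + }, + {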
"cell_type": "markdown", + "id": "86193dd5", + "metadata": {}, + "source": [ + "## Run a DPS job across the list\n", + "\n", + "The submission is done as a loop. \n", + "\n", + "Since submission is fast, this doesn't need to be parallelized. The jobs will start soon after submission and will be processed in parallel depending on administrator settings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4abfe38b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "import json\n", + "\n", + "submit_results_df_list = []\n", + "len_input_list = len(DPS_INPUT_TILE_NUM_LIST)\n", + "print(f\"# of input tiles for DPS: {len_input_list}\")\n", + "\n", + "for i, INPUT_TILE_NUM in enumerate(DPS_INPUT_TILE_NUM_LIST):\n", + " \n", + " # Just a way to keep track of the job number associated with this submission's loop\n", + " DPS_num = i+1\n", + " \n", + " # Update the in_params_dict with the current INPUT_TILE_NUM from this loop\n", + " in_params_dict['in_tile_num'] = INPUT_TILE_NUM\n", + " \n", + " submit_result = maap.submitJob(\n", + " identifier=IDENTIFIER,\n", + " algo_id=ALGO_ID,\n", + " version=ALGO_VERSION,\n", + " username=USER, # username needs to be the same as whoever created the workspace\n", + " queue=WORKER_TYPE,\n", + " **in_params_dict\n", + " )\n", + " \n", + " # Build a dataframe of submission details - this holds metadata about your DPS job\n", + " submit_result_df = pd.DataFrame( \n", + " {\n", + " 'dps_num':[DPS_num],\n", + " 'tile_num':[INPUT_TILE_NUM],\n", + " 'submit_time':[datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%s')],\n", + " 'dbs_job_hour': [datetime.datetime.now().hour],\n", + " 'algo_id': [ALGO_ID],\n", + " 'user': [USER],\n", + " 'worker_type': [WORKER_TYPE],\n", + " 'job_id': [submit_result.id],\n", + " 'submit_status': [submit_result.status],\n", + " \n", + " } \n", + " )\n", + " \n", + " # Append to a list of data frames of DPS submission results\n", + " submit_results_df_list.append(submit_result_df)\n", + " \n", + " if DPS_num in [1, 5, 10, 50, 100, 250, 500, 750, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000, 21000, 24000, len_input_list]:\n", + " print(f\"DPS run #: {DPS_num}\\t| tile num: {INPUT_TILE_NUM}\\t| submit status: {submit_result.status}\\t| job id: {submit_result.id}\") \n", + " \n", + "# Build a final submission results data frame and save\n", + "submit_results_df = pd.concat(submit_results_df_list)\n", + "submit_results_df['run_name'] = RUN_NAME\n", + "nowtime = pd.Timestamp.now().strftime('%Y%m%d%H%M')\n", + "print(f\"Current time:\\t{nowtime}\")\n", + "\n", + "# This creates a CSV of the metadata associated with the DPS jobs you have just submitted\n", + "submit_results_df.to_csv(f'{DPS_SUBMISSION_RESULTS_DIR}/DPS_{ALGO_ID}_{RUN_NAME}_submission_results_{len_input_list}_{nowtime}.csv')\n", + "submit_results_df.info()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/source/technical_tutorials/dps_tutorial/_static/python_env_default.png b/docs/source/technical_tutorials/dps_tutorial/_static/python_env_default.png new file mode 
100644 index 00000000..5e451ad1 Binary files /dev/null and b/docs/source/technical_tutorials/dps_tutorial/_static/python_env_default.png differ diff --git a/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_overview.png b/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_overview.png new file mode 100644 index 00000000..d151534c Binary files /dev/null and b/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_overview.png differ diff --git a/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_register_api_1.png b/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_register_api_1.png new file mode 100644 index 00000000..2e34f23b Binary files /dev/null and b/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_register_api_1.png differ diff --git a/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_view_2.png b/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_view_2.png index 09a24f6e..5c6c4105 100644 Binary files a/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_view_2.png and b/docs/source/technical_tutorials/dps_tutorial/_static/tutorial_view_2.png differ diff --git a/docs/source/technical_tutorials/dps_tutorial/algorithm_config_template.yml b/docs/source/technical_tutorials/dps_tutorial/algorithm_config_template.yml new file mode 100644 index 00000000..d0c69181 --- /dev/null +++ b/docs/source/technical_tutorials/dps_tutorial/algorithm_config_template.yml @@ -0,0 +1,25 @@ +algorithm_description: This is a free-form description of your algorithm +algorithm_name: dps-tutorial-name +algorithm_version: main +build_command: dps_tutorial/gdal_wrapper/build-env.sh +disk_space: 1GB +docker_container_url: mas.maap-project.org/root/maap-workspaces/base_images/vanilla:v3.1.5 +inputs: + config: [] + file: + - default: '' + description: The name of the input file + name: input_file + required: true + positional: + - default: '' + description: output file name + name: output_file + required: true + - default: '30' + description: the percent reduction of your output file vs the input file + name: percent_reduction + required: true +queue: maap-dps-worker-8gb +repository_url: https://github.com/MAAP-Project/dps_tutorial.git +run_command: dps_tutorial/gdal_wrapper/run_gdal.sh diff --git a/docs/source/technical_tutorials/dps_tutorial/dps_tutorial_demo.ipynb b/docs/source/technical_tutorials/dps_tutorial/dps_tutorial_demo.ipynb index 1b72d081..1df5a60e 100644 --- a/docs/source/technical_tutorials/dps_tutorial/dps_tutorial_demo.ipynb +++ b/docs/source/technical_tutorials/dps_tutorial/dps_tutorial_demo.ipynb @@ -41,7 +41,7 @@ "## Before Starting\n", "\n", "- This tutorial assumes that you have at least run through the [Getting Started Guide](../../getting_started/getting_started.ipynb) and have set up your MAAP account.\n", - "- This tutorial is made for the Application Development Environment (ADE) \"Basic Stable\" workspace v3.1.4 or later (February 2024 or later).\n", + "- This tutorial is made for the Application Development Environment (ADE) \"Python (default)\" workspace v4.0.0 or later (July 2024 or later).\n", "- This also assumes that you are familiar with using [Github with MAAP](../../system_reference_guide/work_with_git.ipynb)." 
] }, @@ -76,9 +76,12 @@ "- Clone the demo Algorithm\n", "- Edit and test your Algorithm code to make sure that it is working in its original form\n", "- Prepare the Algorithm for DPS by setting up the runtime arguments and pre-run environment set-up\n", - "- Register the Algorithm with the Algorithm UI\n", - "- Run and Monitor the Algorithm using the Jobs UI\n", - "- View the outputs and errors from your run" + "- Register the Algorithm with the Register Algorithm UI (or with maap.py)\n", + "- Run and Monitor the Algorithm using the Jobs UI (or with maap.py)\n", + "- View the outputs and errors from your run\n", + "\n", + "This graphical overview may also help orient you to the general flow:\n", + "![DPS Tutorial overview](_static/tutorial_overview.png)" ] }, @@ -94,7 +97,25 @@ "source": [ "If you are not familiar with running jobs in the DPS, please try running through the [Jobs UI guide](../../system_reference_guide/jobsui.ipynb) and the [Getting Started Guide](../../getting_started/running_at_scale.ipynb).\n", "\n", - "This can be helpful because the process of Registering an Algorithm for DPS helps to build the user-interface to Run a Job. By familiarizing yourself with the process of running a Job, the Registration process may become more intuitive." + "By familiarizing yourself with the process of running a Job, the Registration process may become more intuitive." ] }, { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Context within a Typical Workflow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In most cases, before deciding to run an analysis at scale in the DPS, scientists will have a Jupyter notebook that has been developed to analyze data in an interactive fashion. Once the basic template of an analysis process has been tested in the Jupyter notebook format, the code needs to be restructured so that it can run in the DPS.\n", + "\n", + "This tutorial uses a demo algorithm that is represented as a Python script. It is already formatted in a way that makes it easy to register as a DPS algorithm.\n", + "\n", + "When you are migrating a Jupyter notebook to run in the DPS, the first step will be to migrate the core analysis code into a script. The tutorial demonstrates some key features of a standalone script that make it easy to run in the DPS, such as managing command-line arguments in one place and encapsulating any custom environment needs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use an example Python-based algorithm for this tutorial. First we need to get the demo code into a Jupyter workspace.\n", "\n", - "1. For this tutorial, please use a Basic Stable workspace (v3.1.4 or later). \n", + "1. For this tutorial, please use a Python (default) workspace (v4.0.0 or later). \n", "2. Clone the Github repository at https://github.com/MAAP-Project/dps_tutorial . 
For the sake of this tutorial, we will assume that the clone is placed into a folder called algorithms in our home folder (i.e., ~/algorithms).\n", "```\n", "mkdir ~/algorithms\n", @@ -128,10 +149,10 @@ "Anatomy of the `gdal_wrapper` algorithm folder in the `dps_tutorial` repo:\n", "\n", "- `README.md` to describe the algorithm in Github\n", - "- `build-env.sh`: a shell script that is executed before the algorithm is run; it is used to set up any custom programming libraries used in the algorithm (i.e., a custom conda environment)\n", - "- `environment.yml`: a configuration file used by conda to add any custom libraries; this is used by build-env.sh\n", + "- `build-env.sh` (generically, the **build script**): a shell script that is executed before the algorithm is run; it is used to set up any custom programming libraries used in the algorithm (i.e., a custom conda environment)\n", + "- `environment.yml`: a configuration file used by conda to add any custom libraries; this is used by your build script (build-env.sh)\n", "- `gdal_wrapper.py`: a python script that contains the logic of the algorithm\n", - "- `run_gdal.sh`: a shell script that DPS will execute when a run is requested. It calls any relevant python files with the required inputs\n", + "- `run_gdal.sh` (generically, the **run script**): a shell script that DPS will execute when a run is requested. It calls any relevant python files with the required inputs. The run script executes your underlying algorithm script(s) within the runtime environment set up by your build script\n", "\n", "![DPS Tutorial Git repository overview](_static/dps_tutorial_git_repo.png)\n" ] @@ -149,7 +170,7 @@ "source": [ "Once you have an algorithm such as the `gdal_wrapper` test it to make sure that it is running properly. If it runs properly in a Jupyter Terminal window, you are one step closer to registering and running your algorithm in the DPS.\n", "\n", - "Typically a Jupyter Notebook is run interacively. A DPS algorithm will take all inputs up-front, do the processing, and produce output files. The `gdal_wrapper` script is already set up like a DPS algorithm. Some aspects to note:\n", + "Typically a Jupyter Notebook is run interactively. A DPS algorithm will take all inputs up-front, do the processing, and produce output files--it is non-interactive while it runs. The `gdal_wrapper` script is already set up like a DPS algorithm. Some aspects to note:\n", "\n", "- **Python argparse**: Using a library like [argparse](https://docs.python.org/3/library/argparse.html) to accept input parameters helps to make the code more readable and easier to debug when working locally. It provides easy to write user-friendly command-line interface. \n", "\n", @@ -159,11 +180,13 @@ "\n", "Before registering your algorithm you can test it locally to catch common errors related to input parsing and storing output. To test your algorithm locally before registration follow the below steps:\n", "\n", - "- Deactivate the current python virtual environment and activate the pre-installed conda environment (for the Basic Stable workspace, it is vanilla)\n", + "- Open a fresh Terminal window and go to the DPS Tutorial folder that you just cloned. 
The prompt should indicate that you are in the `python` conda environment because that is the default in a Python (default) type of workspace.\n", "```\n", - "conda deactivate\n", - "conda activate vanilla\n", + "cd ~/algorithms/dps_tutorial\n", "```\n", + "\n", + "![Python environment is the default](_static/python_env_default.png)\n", + "\n", "- Make sure that your runtime conda environment is set up. To do this, run `build-env.sh` in the `gdal_wrapper` folder.\n", "```\n", "cd ~/algorithms\n", @@ -184,7 +207,7 @@ "# ls -F\n", "input/ output/\n", "```\n", - "- You will need a test GeoTIF file as input. If you do not have one, go to the folder where you'd like to download the example file (assuming you're in the `dps_test_run` folder as above, `cd input`) and use the following aws command (NOTE: if this step fails, it is likely that you are either in a Basic Stable workspace version prior to v3.1.4, or you do not have the vanilla conda environment activated):\n", + "- You will need a test GeoTIF file as input. If you do not have one, go to the folder where you'd like to download the example file (assuming you're in the `dps_test_run` folder as above, `cd input`) and use the following aws command (NOTE: if this step fails, it is likely that you are either in a Python (default) workspace version prior to v4.0.0, or you do not have the python conda environment activated):\n", "```\n", "cd input\n", "```\n", @@ -258,7 +281,7 @@ "-rw-r--r-- 1 root root 15834792 Feb 28 15:54 output_from_shell.tif\n", "```\n", "\n", - "Some important things to note:\n", + "### Some important things to note\n", "\n", "File: `build-env.sh`\n", "\n", @@ -272,36 +295,52 @@ "- sets the correct python environment for your code to run\n", "- the best way to execute your algorithm with a custom environment is to use `conda run`, as shown in this script (`conda run --live-stream --name dps_tutorial python ${basedir}/gdal_wrapper.py --input_file ${INPUT_FILENAME} --output_file output/${OUTPUT_FILENAME} --outsize ${REDUCTION_SIZE}`)\n", "\n", - "Run your scripts as if DPS is executing them:\n", + "### Run your scripts as if DPS is executing them\n", "\n", - "- activate the default conda environment, in this case `conda activate vanilla`\n", + "- activate the default conda environment, in this case `conda activate python`\n", "- run `build-env.sh` to create or update your custom environment\n", "- run `run_gdal.sh` to execute your algorithm using the custom environment\n", "\n", - "Future topics:\n", + "### Output folder\n", + "\n", + "The DPS treats a folder named `output` specially. Any files stored in this folder will be preserved and uploaded to S3 after the algorithm run is complete. The location of this output will depend on factors like algorithm name, time of run, tags, etc. This output folder can be viewed within your workspace under the `my-private-bucket/dps_output` directory.\n", + "\n", + "The output directory is created relative to your script specified in `run_command` at the time of registration. So to access the directory, simply do something like this in your run script. \n", + "```\n", + "mkdir -p output\n", + "``` \n", + " \n", + "### Stderr & Stdout \n", + "\n", + "By default, anything written to the stderr and stdout pipes will be stored in files called `_stderr` and `_stdout` and placed in your output directory. \n", + "\n", + "### Logfiles\n", + "\n", + "DPS does not automatically store any logfiles written by your algorithm; if you would like them to be preserved, make sure to write them to the output directory.\n",
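+ "\n", + "For example, a minimal sketch (using Python's standard logging module, with an illustrative filename) of how your algorithm script could send its log to the preserved output folder:\n", + "```python\n", + "import logging\n", + "import os\n", + "\n", + "os.makedirs('output', exist_ok=True)  # the folder DPS preserves\n", + "logging.basicConfig(filename=os.path.join('output', 'run.log'), level=logging.INFO)\n", + "logging.info('this message will be preserved under dps_output')\n", + "```\n",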
\n", + "\n", + "### Future topics\n", "\n", "- What happens with input and output in DPS\n", "- How does file management happen?\n", "- Relative paths vs. absolute for input/output\n", "- Mimic what’s happening on DPS (basedir)\n", - "- This wrapper `run_gdal.sh` script needs to manage the input files the way that your python script requires them (e.g. pass single file at a time vs. multiple files at once, etc.)\n", - "\n" + "- This wrapper `run_gdal.sh` script needs to manage the input files the way that your python script requires them (e.g. pass single file at a time vs. multiple files at once, etc.)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Register the Algorithm with DPS using the Algorithm UI" + "## Register the Algorithm with DPS using the Register Algorithm UI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "0. When you are registering your own algorithm, make sure that all your changes are commited and pushed into Github. The registration process will pull the code from Github as part of registration. In this case, we will simply use the existing demonstration repo.\n", - "1. Open up [Launcher: Register Algorithm](../../getting_started/running_at_scale.ipynb#Register-an-Algorithm) (the same as the Register Algorithm option from the Command Palette)\n", - "2. Fill in the fields as described below.\n", + "- When you are registering your own algorithm, make sure that all your changes are commited and pushed into Github. The registration process will pull the code from Github as part of registration. In this case, we will simply use the existing demonstration repo.\n", + "- Open up [Launcher: Register Algorithm](../../getting_started/running_at_scale.ipynb#Register-an-Algorithm) (the same as the Register Algorithm option from the Command Palette)\n", + "- Fill in the fields as described below.\n", "\n", "#### First you fill in the public code-repository information:\n", "![Code repo information](_static/tutorial_register_1.png)\n", @@ -311,7 +350,7 @@ "https://github.com/MAAP-Project/dps_tutorial.git\n", "```\n", "- Repository Branch is used as a version when this algorithm is registered. For your test it is likely `main`\n", - "- The Run and Build Commands must be the full path of the scripts that will be used by the DPS to build and execute the algorithm. Typically these will be the `repository_name/script_name.sh`. In this case we have a run command:\n", + "- The Run and Build Commands must be the relative paths of the scripts, starting from the repository root. This will be used by the DPS to build and execute the algorithm. Typically these will be the `repository_name/script_name.sh`. In this case we have a run command:\n", "```\n", "dps_tutorial/gdal_wrapper/run_gdal.sh\n", "```\n", @@ -323,15 +362,15 @@ "#### Then fill in the rest of the algorithm information:\n", "![Algorithm information](_static/tutorial_register_2.png)\n", "\n", - "- The Algorithm Name will be the unique identifier for the algorithm in the MAAP system. It can be whatever you want. \n", + "- The Algorithm Name will be the unique identifier for the algorithm in the MAAP system. It can be whatever you want. **Note**: If you use the same name as an existing algorithm, your new algorithm will replace the old one in the system (the old one will be gone). This is also how you would \"update\" an existing algorithm with a new version of the same name. 
If you want to have two versions of an algorithm available in the system, you must make the name unique (e.g., \"alg_v1\" and \"alg_v2\" instead of just \"alg\")\n", "- Algorithm Description is additional free-form text to describe what this algorithm does.\n", "- Disk Space is the minimum amount of space you expect—including all inputs, scratch, and outputs—it gives the DPS an approximation to help optimize the run.\n", "- Resource Allocation is a dropdown-selection with some options for memory and CPU of the cloud compute you expect to need.\n", - "- The Container URL is a URL of the Stack (workspace image environment) you are using as a base for the algorithm. The user-interface will pre-fill this with the Container of your current workspace; if this is the correct base workspace for the Algorithm (i.e., you successfully ran the Algorithm in a Terminal without requiring a custom base-Container), then you can leave it as is. In this example we use: `mas.maap-project.org/root/maap-workspaces/base_images/vanilla:main`\n", + "- The Container URL is a URL of the Stack (workspace image environment) you are using as a base for the algorithm. The user-interface will pre-fill this with the Container of your current workspace; if this is the correct base workspace for the Algorithm (i.e., you successfully ran the Algorithm in a Terminal without requiring a custom base-Container), then you can leave it as is. In this example we use: `mas.maap-project.org/root/maap-workspaces/base_images/python:main`\n", "See [the Getting Started guide](../../getting_started/running_at_scale.ipynb#Container-URLs) for more information on Containers.\n", "\n", "#### Finally you fill in the input section:\n", - "- There are File Inputs and Positional Inputs (command-line parameters to adjust how the algorithm runs). In our example we have a File Input called `input_file` and two Positional Inputs: an output file called `output_file` and a parameter called `outsize` describing how much file-size reduction we want to get. For each input you can add a Description, a Default Value, and mark whether it’s required or optional.\n", + "- There are **File Inputs** and **Positional Inputs** (command-line parameters to adjust how the algorithm runs). In our example we have a File Input called `input_file` and two Positional Inputs: an output file called `output_file` and a parameter called `outsize` describing how much file-size reduction we want to get. For each input you can add a Description, a Default Value, and mark whether it’s required or optional.\n", "\n", "![Algorithm-Inputs information](_static/tutorial_register_3.png)\n", @@ -360,15 +399,15 @@ "\n", "1. Open the Launcher and select the [Submit Jobs](../../getting_started/running_at_scale.ipynb#Run-the-Algorithm-as-a-Job-and-Monitor-it) icon\n", "2. Run the job. \n", - "- Choose the Algorithm you just registered using the dropdown menu.\n", - "- The Job Tag can be empty or any list of short terms that you would like to associate with your job. This will help you sort and filter the job list later. It is a comma-separated list of tags.\n", - "- The Resource is likely to be the same as the one you chose when registering the Algorithm. For the tutorial it can be the smallest one (8 GB).\n", - "- The input file can be any GeoTIF file that is accessible by the system. For example, you can browse the [MAAP STAC](https://stac-browser.maap-project.org/collections/ESACCI_Biomass_L4_AGB_V4_100m?.language=en) and find a GeoTIF. 
For example\n", + "- Choose the **Algorithm name** you just registered using the dropdown menu.\n", + "- The **Job Tag** can be empty or any list of short terms that you would like to associate with your job. This will help you sort and filter the job list later. It is a comma-separated list of tags. **Note** that the Job Tag is also used to organize the output files. If you would like to have a set of Jobs run with the same Algorithm to be organized together, use the Job Tag to do so.\n", + "- The **Resource** is likely to be the same as the one you chose when registering the Algorithm. For the tutorial it can be the smallest one (8 GB).\n", + "- The **input file** can be any GeoTIF file that is accessible by the system. For example, you can browse the [MAAP STAC](https://stac-browser.maap-project.org/collections/ESACCI_Biomass_L4_AGB_V4_100m?.language=en) and find a GeoTIF. For example\n", "```\n", "s3://nasa-maap-data-store/file-staging/nasa-map/ESACCI_Biomass_L4_AGB_V4_100m_2020/S40E160_ESACCI-BIOMASS-L4-AGB-MERGED-100m-2020-fv4.0.tif\n", "```\n", - "- The output file can have any name. It should end with .tif because it will be a GeoTIF also.\n", - "- Outsize is a number from 1 to 100.\n", + "- The **output file** can have any name. It should end with .tif because it will be a GeoTIF also.\n", + "- **Outsize** is a number from 1 to 100.\n", "![Submit Job Page](_static/tutorial_submit_1.png)\n", "\n", "3. Submit the job and go back to the View tab\n", @@ -380,15 +419,7 @@ "![View Jobs Page](_static/tutorial_view_1.png)\n", "- By selecting a row from the table (top panel) it will show Job Details (in the bottom panel)\n", "- The status should go from queued to running, to completed or failed\n", - "- Check the Inputs and Outputs sections of the Job Details\n", - "\n", - "5. From the Outputs section, you can copy the path of your output file starting with `dps_outputs` and find it by going to your `~/my-private-bucket` folder and then following the remainder of the path. \n", - "![Copy the Path information](_static/tutorial_view_2.png)\n", - "\n", - "In that folder you will see some JSON files with metadata about the job and the data, as well as the output file (your .tif file).\n", - "![cd to path using Terminal](_static/tutorial_view_3.png)\n", - "\n", - "You can download the output files by browsing to them in the Jupyter file panel and selecting Download from the contextual menu (right-click)." + "- Check the Inputs and Outputs sections of the Job Details\n" ] }, { @@ -416,7 +447,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This will be described in a future update. Often larger batch-jobs are run from Python Notebooks rather than the GUI." + "Use the [DPS Runner notebook template](DPS_runner_template.ipynb) as a starting point for batch-job execution via a Jupyter notebook. Some supplemental background information is below.\n", + "\n", + "#### Registering Algorithms via the UI and algorithm_config.yml\n", + "It is possible to register algorithms via the maap.py API and a configuration file (in [YAML format](https://yaml.org/spec/1.2.2/#chapter-2-language-overview)), using:\n", + "```\n", + "maap.register_algorithm_from_yaml_file(\"/projects/.yml\").text\n", + "```\n", + "\n", + "- It is automatically generated by the Register Algorithm UI when you first register an algorithm. You will see yml files in your home directory after registering an algorithm via the UI. Open one of these yml files to see what it looks like. 
If you simply re-reference this file with the maap.py registration function, you can quickly re-register an algorithm with the same parameters that you first typed into the Algorithm Registration UI.\n", + "- This can be a hand-written yml file: [get a template yml here](algorithm_config_template.yml). Compare it to the [Register Algorithm UI fields](#Register-the-Algorithm-with-DPS-using-the-Register-Algorithm-UI).\n", + "\n", + "![Example algorithm_configuration.yml](_static/tutorial_register_api_1.png)\n" ] }, { @@ -430,22 +472,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Output folder\n", - "\n", - "The DPS treats a folder named `output` specially. Any files stored in this folder will be preserved and uploaded to S3 after the algorithm run is complete. The location of this output will depend on factors like algorithm name, time of run, tags, etc. This output folder can be viewed within your workspace under the `my-private-bucket/dps_output` directory.\n", + "### Finding the Output folder for a Job using the Jobs UI\n", + "In the Outputs section of your Job Details, there are two buttons for finding the output of your job. One will navigate the Jupyter file-browser to the output folder. If you then create a new Terminal it will also be at that location. The other is the Copy Folder Path to Clipboard button: open a Terminal, type `cd `, and paste in the path.\n", + "![Outputs of a Job](_static/tutorial_view_2.png)\n", "\n", - "The output directory is created relative to your script specified in `run_command` at the time of registration. So to access the directory, simply do something like this in your run script. \n", - "```\n", - "mkdir -p output\n", - "``` \n", - " \n", - "#### Stderr & Stdout \n", + "In that folder you will see some JSON files with metadata about the job and the data, as well as the output file (your .tif file).\n", + "![cd to path using Terminal](_static/tutorial_view_3.png)\n", "\n", - "By default, anything written to the stderr and stdout pipes will be stored in files call _stderr and _stdout and placed in your output directory. \n", + "Once you have browsed to the output folder in the Jupyter file-browser, you may select Download from the contextual menu (right-click) to download the file(s) of interest.\n", "\n", - "#### Logfiles\n", + "### Browsing the File Tree or using the Terminal\n", "\n", - "DPS does not automatically store any logfiles written by your algorithm, if you would like them to be preserved make sure to write them in the output directory. " + "You can also browse the file structure for output files. All the jobs that you run will put output files into `~/my-private-bucket/dps_output`. Files are organized in this area by the algorithm name, job tag, and a set of folders organized by date and time." ] }, {
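"cell_type": "markdown", + "metadata": {}, + "source": [ + "For example, a quick sketch (the algorithm name and file pattern here are illustrative) for listing recent output files from a notebook or Python session:\n", + "```python\n", + "import glob\n", + "\n", + "# .tif outputs from runs of a given algorithm; dated folder paths generally sort chronologically\n", + "paths = sorted(glob.glob('/projects/my-private-bucket/dps_output/run_gdal/**/*.tif', recursive=True))\n", + "print(paths[-5:])  # the five most recent\n", + "```" + ] + }, + {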