Merge pull request #408 from MAAP-Project/dps_tut_v2
DPS Tutorial v2 (for workspace-release v4.0.0)
Showing 7 changed files with 472 additions and 56 deletions.
docs/source/technical_tutorials/dps_tutorial/DPS_runner_template.ipynb
353 changes: 353 additions & 0 deletions
@@ -0,0 +1,353 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "330e3ace",
   "metadata": {},
   "source": [
    "# Prepare and launch a DPS batch of jobs for a particular algorithm\n",
    "\n",
    "**Goal**  \n",
    "Provide a template for DPS job submission that can be adapted to the specific algorithm being run in DPS.\n",
    "\n",
    "**Motivation**  \n",
    "It's easier to learn how to run many jobs of your script (where some input changes for each job) if you can first see an example.\n",
    "\n",
    "Paul Montesano, PhD  \n",
    "[email protected]  \n",
    "June 2024"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "id": "ea7bcf9f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from maap.maap import MAAP\n",
    "maap = MAAP()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "id": "be655aaf-644c-4041-8d04-e1237a50a7f4",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'api.maap-project.org'"
      ]
     },
     "execution_count": 127,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "maap._MAAP_HOST"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c541eee",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "import glob\n",
    "import datetime\n",
    "import sys"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a058f23-c0a1-4445-9656-70eb7489441b",
   "metadata": {},
   "source": [
"### Use MAAP Registration call in notebook chunk to register DPS algorithm\n", | ||
" - You need to register the DPS algorithm before first before you loop over jobs that will use it.\n", | ||
" - If you register your algorithm using the Register Algorithm UI in Jupyter, a configuration file (in yml format) will be placed in your workspace home folder, which can then be used as a template for reuse" | ||
] | ||
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7810c9e6-5dc8-4969-b1f4-beb3d06e9d96",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "maap.register_algorithm_from_yaml_file(\"/projects/.../.../<my_algorithms_yaml_file>.yml\").text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "836409b4",
   "metadata": {},
   "source": [
    "### Build a dictionary of the argument names and values needed to run the algorithm in the way you want\n",
    "\n",
    "This can be called a `parameters dictionary`. \n",
    "\n",
    " - These will be arguments that the `.sh` wrapper (which calls your `.py` or `.R` code) is hard-coded to accept. \n",
    " - The `.yml` file that you use to register your algorithm is what connects this `parameters dictionary` to your `.sh` wrapper. \n",
    " - This combo of the `parameters dictionary`, the `.yml`, and the `.sh` provides a specific (and repeatable) way of running your `.py` or `.R` code."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0fea3b7",
   "metadata": {},
   "source": [
    "#### Note: make sure the `in_params_dict` matches the arguments of your underlying Python/R code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "65681b96",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "in_params_dict = {\n",
    "    'arg_name_1': 'some_value',\n",
    "    'arg_name_2': 'another_value',\n",
    "    'in_tile_num': 1\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46e6ffc9-cc7d-4b56-a310-811774054d7e",
   "metadata": {},
   "source": [
    "### Set up a list of items to run across - an example of an algorithm input that varies from job to job\n",
    "\n",
    "In this example, we are using geographic `tiles` to break up our processing. These tiles are defined by vector polygons and have ids that our `.sh`, `.py`, and `.yml` files are set up to take in as arguments. We use these ids as the basis for a loop that will sequentially submit our jobs to DPS. \n",
    "\n",
    "There are many ways one could split up DPS jobs, so the use of tiles here is just for the purposes of this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "4fd13e32-77c8-4641-82e9-85c0ad0e8cde",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# An example list of values for an input parameter that varies by job, creating multiple jobs\n",
    "DPS_INPUT_TILE_NUM_LIST = [1, 3, 5, 7, 13, 17, 19]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d72590cf-d9c4-438c-9a2d-684ab5d08549",
   "metadata": {},
   "source": [
    "### Set up the general submission variables that will be applied to all runs of this DPS batch\n",
    "\n",
    "These will also determine the path of the DPS output (under `/projects/my-private-bucket/dps_output`): \n",
    "`/projects/my-private-bucket/dps_output/<ALGO_ID>/<ALGO_VERSION>/<IDENTIFIER>`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e6c61e32-3550-43ff-aa3a-cbbfa97efb2d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# MAAP job identifier, algorithm name and version, username, and worker type\n",
    "IDENTIFIER = 'BIOMASS_2020'\n",
    "ALGO_VERSION = 'my_biomass_algorithm_v2024_1'\n",
    "ALGO_ID = \"run_my_biomass_algorithm\"\n",
    "USER = 'montesano'\n",
    "WORKER_TYPE = 'maap-dps-worker-8gb'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "01c52cde-1d06-4007-a637-34988938b099",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'BIOMASS_2020'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RUN_NAME = IDENTIFIER\n",
    "RUN_NAME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6490e474-3f44-4634-b198-6c03eaccc171",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 3]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "DPS_INPUT_TILE_NUM_LIST[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80232c11-dd65-43b4-9c50-40c9f2dc87a4",
   "metadata": {},
   "source": [
    "### Set up a dir to hold the metadata output table from the DPS submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f45994e-10f6-405d-aeb9-f2263b4e7662",
   "metadata": {},
   "outputs": [],
   "source": [
    "DPS_SUBMISSION_RESULTS_DIR = '/projects/my-public-bucket/dps_submission_results'\n",
    "!mkdir -p $DPS_SUBMISSION_RESULTS_DIR"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86193dd5",
   "metadata": {},
   "source": [
    "## Run a DPS job across the list\n",
    "\n",
    "The submission is done as a loop. \n",
    "\n",
    "Since submission is fast, this doesn't need to be parallelized. The jobs will start soon after submission and will be processed in parallel depending on administrator settings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4abfe38b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
"%%time\n", | ||
"\n", | ||
"import json\n", | ||
"\n", | ||
"submit_results_df_list = []\n", | ||
"len_input_list = len(DPS_INPUT_TILE_NUM_LIST)\n", | ||
"print(f\"# of input tiles for DPS: {len_input_list}\")\n", | ||
"\n", | ||
"for i, INPUT_TILE_NUM in enumerate(DPS_INPUT_TILE_NUM_LIST):\n", | ||
" \n", | ||
" # Just a way to keep track of the job number associated with this submission's loop\n", | ||
" DPS_num = i+1\n", | ||
" \n", | ||
" # Update the in_params_dict with the current INPUT_TILE_NUM from this loop\n", | ||
" in_params_dict['in_tile_num'] = INPUT_TILE_NUM\n", | ||
" \n", | ||
" submit_result = maap.submitJob(\n", | ||
" identifier=IDENTIFIER,\n", | ||
" algo_id=ALGO_ID,\n", | ||
" version=ALGO_VERSION,\n", | ||
" username=USER, # username needs to be the same as whoever created the workspace\n", | ||
" queue=WORKER_TYPE,\n", | ||
" **in_params_dict\n", | ||
" )\n", | ||
" \n", | ||
" # Build a dataframe of submission details - this holds metadata about your DPS job\n", | ||
" submit_result_df = pd.DataFrame( \n", | ||
" {\n", | ||
" 'dps_num':[DPS_num],\n", | ||
" 'tile_num':[INPUT_TILE_NUM],\n", | ||
" 'submit_time':[datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%s')],\n", | ||
" 'dbs_job_hour': [datetime.datetime.now().hour],\n", | ||
" 'algo_id': [ALGO_ID],\n", | ||
" 'user': [USER],\n", | ||
" 'worker_type': [WORKER_TYPE],\n", | ||
" 'job_id': [submit_result.id],\n", | ||
" 'submit_status': [submit_result.status],\n", | ||
" \n", | ||
" } \n", | ||
" )\n", | ||
" \n", | ||
" # Append to a list of data frames of DPS submission results\n", | ||
" submit_results_df_list.append(submit_result_df)\n", | ||
" \n", | ||
" if DPS_num in [1, 5, 10, 50, 100, 250, 500, 750, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000, 21000, 24000, len_input_list]:\n", | ||
" print(f\"DPS run #: {DPS_num}\\t| tile num: {INPUT_TILE_NUM}\\t| submit status: {submit_result.status}\\t| job id: {submit_result.id}\") \n", | ||
" \n", | ||
"# Build a final submission results data frame and save\n", | ||
"submit_results_df = pd.concat(submit_results_df_list)\n", | ||
"submit_results_df['run_name'] = RUN_NAME\n", | ||
"nowtime = pd.Timestamp.now().strftime('%Y%m%d%H%M')\n", | ||
"print(f\"Current time:\\t{nowtime}\")\n", | ||
"\n", | ||
"# This creates a CSV of the metadata associated with the DPS jobs you have just submitted\n", | ||
"submit_results_df.to_csv(f'{DPS_SUBMISSION_RESULTS_DIR}/DPS_{ALGO_ID}_{RUN_NAME}_submission_results_{len_input_list}_{nowtime}.csv')\n", | ||
"submit_results_df.info()" | ||
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
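After the batch is submitted, the CSV written by the notebook above can be used to check on the jobs. A minimal sketch, assuming your maap-py version exposes `maap.getJobStatus()` (verify against the MAAP docs for your workspace), with a placeholder CSV filename to be replaced by one the notebook actually produced:

import pandas as pd
from maap.maap import MAAP

maap = MAAP()

# Placeholder filename: substitute the submission-results CSV the notebook wrote
csv_path = '/projects/my-public-bucket/dps_submission_results/<your_submission_results>.csv'
submit_results_df = pd.read_csv(csv_path)

# Poll the status of each submitted job recorded in the CSV
for job_id in submit_results_df['job_id']:
    status = maap.getJobStatus(job_id)  # assumed maap-py call; check your version
    print(f"{job_id}: {status}")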
Binary file added (+48.9 KB): docs/source/technical_tutorials/dps_tutorial/_static/python_env_default.png
Binary file added (+165 KB): docs/source/technical_tutorials/dps_tutorial/_static/tutorial_overview.png
Binary file added (+426 KB): docs/source/technical_tutorials/dps_tutorial/_static/tutorial_register_api_1.png
Binary file modified (+9.16 KB, 100%): docs/source/technical_tutorials/dps_tutorial/_static/tutorial_view_2.png
docs/source/technical_tutorials/dps_tutorial/algorithm_config_template.yml
25 changes: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
algorithm_description: This is a free-form description of your algorithm
algorithm_name: dps-tutorial-name
algorithm_version: main
build_command: dps_tutorial/gdal_wrapper/build-env.sh
disk_space: 1GB
docker_container_url: mas.maap-project.org/root/maap-workspaces/base_images/vanilla:v3.1.5
inputs:
  config: []
  file:
  - default: ''
    description: The name of the input file
    name: input_file
    required: true
  positional:
  - default: ''
    description: output file name
    name: output_file
    required: true
  - default: '30'
    description: the percent reduction of your output file vs the input file
    name: percent_reduction
    required: true
queue: maap-dps-worker-8gb
repository_url: https://github.com/MAAP-Project/dps_tutorial.git
run_command: dps_tutorial/gdal_wrapper/run_gdal.sh
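Once the placeholders in this template are filled in, it can be registered and exercised with maap-py in the same way the notebook above does. A minimal sketch under stated assumptions: the YAML path, identifier, and input URL below are placeholders, while the argument names mirror the template's declared `input_file`, `output_file`, and `percent_reduction` inputs:

from maap.maap import MAAP

maap = MAAP()

# Register the algorithm described by the (filled-in) template; the path here is a placeholder
maap.register_algorithm_from_yaml_file("/projects/<path_to>/algorithm_config_template.yml").text

# Submit one job whose arguments mirror the template's declared inputs
submit_result = maap.submitJob(
    identifier="dps-tutorial-test",      # placeholder run name
    algo_id="dps-tutorial-name",         # matches algorithm_name above
    version="main",                      # matches algorithm_version above
    queue="maap-dps-worker-8gb",         # matches queue above
    input_file="https://<host>/<input>.tif",  # 'file' input: staged for the job before run_gdal.sh executes
    output_file="output.tif",            # positional input
    percent_reduction="30"               # positional input
)
print(submit_result.id, submit_result.status)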