Skip to content

Commit

Permalink
Merge pull request #408 from MAAP-Project/dps_tut_v2
Browse files Browse the repository at this point in the history
DPS Tutorial v2 (for workspace-release v4.0.0)
  • Loading branch information
rtapella authored Jul 5, 2024
2 parents 162544b + 7145a16 commit f9ab145
Show file tree
Hide file tree
Showing 7 changed files with 472 additions and 56 deletions.
353 changes: 353 additions & 0 deletions docs/source/technical_tutorials/dps_tutorial/DPS_runner_template.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,353 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "330e3ace",
"metadata": {},
"source": [
"# Prepare and launch a DPS batch of jobs for a particular algorithm\n",
"\n",
"**Goal**\n",
"Provide a template for DPS job submission which will be changed/adapted according to specific algorithms being run in DPS.\n",
"\n",
"**Motivation** \n",
"It's easier to learn how to run many jobs of your script (where for each job there is some input that changes) if you can first see an example.\n",
"\n",
"Paul Montesano, PhD \n",
"[email protected] \n",
"June 2024"
]
},
{
"cell_type": "code",
"execution_count": 126,
"id": "ea7bcf9f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from maap.maap import MAAP\n",
"maap = MAAP()"
]
},
{
"cell_type": "code",
"execution_count": 127,
"id": "be655aaf-644c-4041-8d04-e1237a50a7f4",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'api.maap-project.org'"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maap._MAAP_HOST"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c541eee",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import glob\n",
"import datetime\n",
"import sys"
]
},
{
"cell_type": "markdown",
"id": "3a058f23-c0a1-4445-9656-70eb7489441b",
"metadata": {},
"source": [
"### Use MAAP Registration call in notebook chunk to register DPS algorithm\n",
" - You need to register the DPS algorithm before first before you loop over jobs that will use it.\n",
" - If you register your algorithm using the Register Algorithm UI in Jupyter, a configuration file (in yml format) will be placed in your workspace home folder, which can then be used as a template for reuse"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7810c9e6-5dc8-4969-b1f4-beb3d06e9d96",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"maap.register_algorithm_from_yaml_file(\"/projects/.../.../<my_algorithms_yaml_file>.yml\").text"
]
},
{
"cell_type": "markdown",
"id": "836409b4",
"metadata": {},
"source": [
"### Build a dictionary of the argument names and values needed to run the algorithm in the way you want\n",
"\n",
"This can be called a `parameters dictionary` \n",
"\n",
" - These will be arguments that the `.sh` wrapper (which calls your `.py` or `.R` code) is hard-coded to accept. \n",
" - The `.yml` file that you use to Register your algorithm is what connects this `parameters dictionary` to your `.sh` wrapper. \n",
" - This combo of the `parameters dictionary`, the `.yml`, and the `.sh` provides a specific (and repeatable) way of running your `.py` or `.R` code."
]
},
{
"cell_type": "markdown",
"id": "c0fea3b7",
"metadata": {},
"source": [
"#### Note: make sure the `in_params_dict` coincides with the args of your underlying Python/R code"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "65681b96",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"in_params_dict = {\n",
" 'arg name_1': 'some_value',\n",
" 'arg_name_2': 'another_value',\n",
" 'in_tile_num': 1\n",
" }"
]
},
{
"cell_type": "markdown",
"id": "46e6ffc9-cc7d-4b56-a310-811774054d7e",
"metadata": {},
"source": [
"### Set up a list of items you want to run across - this is an example of some algorithm input that will vary according to job\n",
"\n",
"In this example, we are using geographic `tiles` to break up our processing. These tiles are defined by vector polygons and have ids that our `.sh`, `.py`, and `.yml` files are set up to take in as arguments. We use these ids as the basis for a loop that will sequentially submit our jobs to DPS. \n",
"\n",
"There are many ways one could decide to split up their DPS jobs - so this use of tiles here is just for the purposes of this example."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "4fd13e32-77c8-4641-82e9-85c0ad0e8cde",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Just an example of a list of some input parameter to your script that needs to vary for each job, thus creating multiple jobs\n",
"DPS_INPUT_TILE_NUM_LIST = [1,3,5,7,13,17,19]"
]
},
{
"cell_type": "markdown",
"id": "d72590cf-d9c4-438c-9a2d-684ab5d08549",
"metadata": {},
"source": [
"### Set up the general submission variables that will be applied to all runs of this DPS batch\n",
"\n",
"These will also determine the look of path of the DPS output (`/projects/my-private-bucket/dps_output`): \n",
"`/projects/my-private-bucket/dps_output/<ALGO_ID>/<ALGO_VERSION>/<IDENTIFIER>`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e6c61e32-3550-43ff-aa3a-cbbfa97efb2d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# MAAP algorithm version name\n",
"IDENTIFIER='BIOMASS_2020'\n",
"ALGO_VERSION = 'my_biomass_algorithm_v2024_1'\n",
"ALGO_ID = \"run_my_biomass_algorithm\"\n",
"USER = 'montesano'\n",
"WORKER_TYPE = 'maap-dps-worker-8gb'"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "01c52cde-1d06-4007-a637-34988938b099",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'BIOMASS_2020'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"RUN_NAME = IDENTIFIER\n",
"RUN_NAME"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6490e474-3f44-4634-b198-6c03eaccc171",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[1, 3]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DPS_INPUT_TILE_NUM_LIST[0:2]"
]
},
{
"cell_type": "markdown",
"id": "80232c11-dd65-43b4-9c50-40c9f2dc87a4",
"metadata": {},
"source": [
"### Set up a dir to hold the metadata output table from the DPS submission"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f45994e-10f6-405d-aeb9-f2263b4e7662",
"metadata": {},
"outputs": [],
"source": [
"DPS_SUBMISSION_RESULTS_DIR = '/projects/my-public-bucket/dps_submission_results'\n",
"!mkdir -p $DPS_SUBMISSION_RESULTS_DIR"
]
},
{
"cell_type": "markdown",
"id": "86193dd5",
"metadata": {},
"source": [
"## Run a DPS job across the list\n",
"\n",
"The submission is done as a loop. \n",
"\n",
"Since submission is fast, this doesn't need to be parallelized. The jobs will start soon after submission and will be processed in parallel depending on administrator settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4abfe38b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"\n",
"import json\n",
"\n",
"submit_results_df_list = []\n",
"len_input_list = len(DPS_INPUT_TILE_NUM_LIST)\n",
"print(f\"# of input tiles for DPS: {len_input_list}\")\n",
"\n",
"for i, INPUT_TILE_NUM in enumerate(DPS_INPUT_TILE_NUM_LIST):\n",
" \n",
" # Just a way to keep track of the job number associated with this submission's loop\n",
" DPS_num = i+1\n",
" \n",
" # Update the in_params_dict with the current INPUT_TILE_NUM from this loop\n",
" in_params_dict['in_tile_num'] = INPUT_TILE_NUM\n",
" \n",
" submit_result = maap.submitJob(\n",
" identifier=IDENTIFIER,\n",
" algo_id=ALGO_ID,\n",
" version=ALGO_VERSION,\n",
" username=USER, # username needs to be the same as whoever created the workspace\n",
" queue=WORKER_TYPE,\n",
" **in_params_dict\n",
" )\n",
" \n",
" # Build a dataframe of submission details - this holds metadata about your DPS job\n",
" submit_result_df = pd.DataFrame( \n",
" {\n",
" 'dps_num':[DPS_num],\n",
" 'tile_num':[INPUT_TILE_NUM],\n",
" 'submit_time':[datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%s')],\n",
" 'dbs_job_hour': [datetime.datetime.now().hour],\n",
" 'algo_id': [ALGO_ID],\n",
" 'user': [USER],\n",
" 'worker_type': [WORKER_TYPE],\n",
" 'job_id': [submit_result.id],\n",
" 'submit_status': [submit_result.status],\n",
" \n",
" } \n",
" )\n",
" \n",
" # Append to a list of data frames of DPS submission results\n",
" submit_results_df_list.append(submit_result_df)\n",
" \n",
" if DPS_num in [1, 5, 10, 50, 100, 250, 500, 750, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000, 21000, 24000, len_input_list]:\n",
" print(f\"DPS run #: {DPS_num}\\t| tile num: {INPUT_TILE_NUM}\\t| submit status: {submit_result.status}\\t| job id: {submit_result.id}\") \n",
" \n",
"# Build a final submission results data frame and save\n",
"submit_results_df = pd.concat(submit_results_df_list)\n",
"submit_results_df['run_name'] = RUN_NAME\n",
"nowtime = pd.Timestamp.now().strftime('%Y%m%d%H%M')\n",
"print(f\"Current time:\\t{nowtime}\")\n",
"\n",
"# This creates a CSV of the metadata associated with the DPS jobs you have just submitted\n",
"submit_results_df.to_csv(f'{DPS_SUBMISSION_RESULTS_DIR}/DPS_{ALGO_ID}_{RUN_NAME}_submission_results_{len_input_list}_{nowtime}.csv')\n",
"submit_results_df.info()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
algorithm_description: This is a free-form description of your algorithm
algorithm_name: dps-tutorial-name
algorithm_version: main
build_command: dps_tutorial/gdal_wrapper/build-env.sh
disk_space: 1GB
docker_container_url: mas.maap-project.org/root/maap-workspaces/base_images/vanilla:v3.1.5
inputs:
config: []
file:
- default: ''
description: The name of the input file
name: input_file
required: true
positional:
- default: ''
description: output file name
name: output_file
required: true
- default: '30'
description: the percent reduction of your output file vs the input file
name: percent_reduction
required: true
queue: maap-dps-worker-8gb
repository_url: https://github.com/MAAP-Project/dps_tutorial.git
run_command: dps_tutorial/gdal_wrapper/run_gdal.sh
Loading

0 comments on commit f9ab145

Please sign in to comment.