Make the cluster/gpu deployment work from buildartifacts
Give up on running it locally for now.
elibarzilay committed Nov 15, 2017
1 parent 7ef335c commit 501da9d
Showing 9 changed files with 133 additions and 122 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -91,7 +91,7 @@ notebooks. See the [documentation](docs/docker.md) for more on Docker use.
> To read the EULA for using the docker image, run \
> `docker run -it -p 8888:8888 microsoft/mmlspark eula`
#### GPU VM Setup
### GPU VM Setup

MMLSpark can be used to train deep learning models on a GPU node from a Spark
application. See the instructions for [setting up an Azure GPU
110 changes: 46 additions & 64 deletions docs/gpu-setup.md
@@ -2,20 +2,23 @@

## Requirements

CNTK training using MMLSpark in Azure requires an HDInsight Spark cluster and a
GPU virtual machine (VM). The GPU VM should be reachable via SSH from the
cluster, but no public SSH access (or even a public IP address) is required.
As an example, it can be on a private Azure virtual network (VNet), and within
this VNet, it can be addressed directly by its name and access the Spark
cluster nodes (e.g., use the active NameNode RPC endpoint).

See the original [copyright and license notices](third-party-notices.txt) of
third party software used by MMLSpark.
CNTK training using MMLSpark in Azure requires an HDInsight Spark
cluster and a GPU virtual machine (VM). The GPU VM should be reachable
via SSH from the cluster, but no public SSH access (or even a public IP
address) is required, and the cluster's NameNode should be accessible
from the GPU machine via the HDFS RPC. As an example, it can be on a
private Azure virtual network (VNet), and within this VNet, it can be
addressed directly by its name and access the Spark cluster nodes (e.g.,
use the active NameNode RPC endpoint).

(See the original [copyright and license
notices](third-party-notices.txt) of third party software used by
MMLSpark.)

### Data Center Compatibility

Currently, not all data centers have GPU VMs available. See [the Linux
VMs page](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
Currently, not all data centers have GPU VMs available. See [the Linux VMs
page](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
to check availability in your data center.

## Connect an HDI cluster and a GPU VM via the ARM template
@@ -44,21 +47,7 @@ the associated GPU VM:
- `gpuVirtualMachineName`: The name of the GPU virtual machine to create
- `gpuVirtualMachineSize`: The size of the GPU virtual machine to create

If you need to further configure the environment (for example, to change [the
class of VM
sizes](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
for HDI cluster nodes), modify the template directly before deployment. See
also [the guide for best ARM template
practices](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-template-best-practices).
For the naming rules and restrictions for Azure resources please refer to the
[Naming conventions
article](https://docs.microsoft.com/en-us/azure/architecture/best-practices/naming-conventions).

There are actually three templates that are used for deployment:
- [`deploy-main-template.json`](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-main-template.json):
This is the main template. It references the following two child
templates — these are relative references so they are expected to be
found in the same base URL.
There are actually two additional templates that are used from this main template:
- [`spark-cluster-template.json`](https://mmlspark.azureedge.net/buildartifacts/0.9/spark-cluster-template.json):
A template for creating an HDI Spark cluster within a VNet, including
MMLSpark and its dependencies. (This template installs MMLSpark using
@@ -69,46 +58,40 @@ There are actually three templates that are used for deployment:
CNTK and other dependencies that MMLSpark needs for GPU training.
(This is done via a script action that runs
[`gpu-setup.sh`](https://mmlspark.azureedge.net/buildartifacts/0.9/gpu-setup.sh).)

Note that the last two child templates can also be deployed independently, if
Note that these child templates can also be deployed independently, if
you don't need both parts of the installation.

## Deploying an ARM template

### 1. Deploy an ARM template within the [Azure Portal](https://ms.portal.azure.com/)

An ARM template can be opened within the Azure Portal via the following REST
API:
[Click here to open the above
template](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fmmlspark.azureedge.net%2Fbuildartifacts%2F0.9%2Fdeploy-main-template.json)
in the Azure portal.

https://portal.azure.com/#create/Microsoft.Template/uri/<ARM-template-URI>
(If needed, click the **Edit template** button to view and edit the
template.)

The URI can be one for either an *Azure Blob* or a *GitHub file*. For example,
This link is using the Azure Portal API:

https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fmystorage.blob.core.windows.net%2Fdeploy-main-template.json
https://portal.azure.com/#create/Microsoft.Template/uri/〈ARM-template-URI〉

(Note that the URL is percent-encoded.) Clicking on the above link will
open the template in the Portal. If needed, click the **Edit template** button
(see screenshot below) to view and edit the template.
where the template URI is percent-encoded.
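
The percent-encoding step can be sketched in bash as follows. This is only an illustration, not part of the commit: the helper names `urlencode` and `portal_link` are hypothetical, and the encoder is assumed to receive a plain ASCII URL. It escapes every character outside the RFC 3986 unreserved set and appends the result to the Portal prefix shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of MMLSpark): build a Portal deployment
# link from a template URL. Assumes the URL contains only ASCII characters.

# Percent-encode everything except RFC 3986 unreserved characters.
urlencode() {
  local LC_ALL=C s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Prefix the encoded template URI with the Portal "create from template" path.
portal_link() {
  printf 'https://portal.azure.com/#create/Microsoft.Template/uri/%s\n' \
    "$(urlencode "$1")"
}

portal_link "https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-main-template.json"
```

Running it on the main template URL should reproduce the encoded link used in the "Click here" link above; any other template URL can be passed to `portal_link` in the same way.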

![ARM template in Portal](http://image.ibb.co/gZ6iiF/arm_Template_In_Portal.png)
### 2. Deploy an ARM template with MMLSpark Azure CLI 2.0

### 2. Deploy an ARM template with [MMLSpark Azure CLI 2.0](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.sh)
We also provide a convenient shell script to create a deployment on the
command line:

MMLSpark provides an Azure CLI 2.0 script
([`deploy-arm.sh`](../tools/deployment/deploy-arm.sh)) to deploy an ARM
template (such as
[`deploy-main-template.json`](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-main-template.json))
along with a parameter file (see
[deploy-parameters.template](../tools/deployment/deploy-parameters.template)
for a template of such a file).
* Download the [shell
script](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.sh)
and make a local copy of it

> Note that you cannot use the
> [template file](../tools/deployment/deploy-main-template.json) from
> the source tree, since it requires additional resources that are
> created by the build (specifically, a working version of
> [`install-mmlspark.sh`](../tools/hdi/install-mmlspark.sh)).
* Create a JSON parameter file by downloading [this template
file](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-parameters.template)
and modifying it according to your specification.

The script takes the following arguments:
You can now run the script — it takes the following arguments:
- `subscriptionId`: The GUID that identifies your subscription (e.g.,
`01234567-89ab-cdef-0123-456789abcdef`), defaults to setting in your
`az` environment.
@@ -118,29 +101,28 @@ The script take the following arguments:
`East US`), note that this is required if creating a new resource
group.
- `deploymentName`: The name for this deployment.
- `templateLocation`: The URL of an ARM template file, or the path to
one. By default, it is set to `deploy-main-template.json` in the same
directory, but note that this will normally not work without the rest
of the required resources.
- `parametersFilePath`: The path to the parameter file, which you need
to create. Use `deploy-parameters.template` as a template for
creating a parameters file.
- `templateLocation`: The URL of an ARM template file. By default, it
is set to the above main template.
- `parametersFilePath`: The path to the parameter file, which you have
created.

Run the script with a `-h` or `--help` to see the flags that are used to
set these arguments:

./deploy-arm.sh -h

If no flags are specified on the command line, the script will prompt
you for all values. If needed, install the Azure CLI 2.0 using the
instruction found in the [Azure CLI Installation
Guide](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli).
you for all needed values.

> Note that the script uses the Azure CLI 2.0, see the
> [Azure CLI Installation Guide](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli)
> if you need to install it.
### 3. Deploy an ARM template with the [MMLSpark Azure PowerShell](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.ps1)
### 3. Deploy an ARM template with the MMLSpark Azure PowerShell

MMLSpark also provides a [PowerShell
script](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.ps1)
to deploy ARM templates, similar to the above bash script, run it with
to deploy ARM templates, similar to the above bash script. Run it with
`-?` to see the usage instructions (or use `get-help`). If needed,
install the Azure PowerShell cmdlets using the instructions in the
[Azure PowerShell
@@ -164,7 +146,7 @@ Azure will stop billing if a VM is in a "Stopped (**Deallocated**)" state,
which is different from the "Stopped" state. So make sure it is *Deallocated*
to avoid billing. In the Azure Portal, clicking the "Stop" button will put the
VM into a "Stopped (Deallocated)" state and clicking the "Start" button brings
it VM. See "[Properly Shutdown Azure VM to Save
it back up. See "[Properly Shutdown Azure VM to Save
Money](https://buildazure.com/2017/03/16/properly-shutdown-azure-vm-to-save-money/)"
for further details.

22 changes: 3 additions & 19 deletions notebooks/gpu/401 - CNTK train on HDFS.ipynb
@@ -96,12 +96,7 @@
"brainscriptText = \"\"\"\n",
" # ConvNet applied on CIFAR-10 dataset, with no data augmentation.\n",
"\n",
" command = TrainNetwork\n",
"\n",
" precision = \"double\"; traceLevel = 1 ; deviceId = \"auto\"\n",
"\n",
" rootDir = \"../../..\" ; dataDir = \"$$rootDir$$/DataSets/CIFAR-10\" ;\n",
" outputDir = \"./Output\" ;\n",
" parallelTrain = true\n",
"\n",
" TrainNetwork = {\n",
" action = \"train\"\n",
@@ -148,7 +143,7 @@
"\n",
" SGD = {\n",
" epochSize = 0\n",
" minibatchSize = 256\n",
" minibatchSize = 32\n",
"\n",
" learningRatesPerSample = 0.0015625*10:0.00046875*10:0.00015625\n",
" momentumAsTimeConstant = 0*20:607.44\n",
@@ -164,18 +159,7 @@
" dataParallelSGD = { gradientBits = 1 }\n",
" }\n",
" }\n",
"\n",
" reader = {\n",
" readerType = \"CNTKTextFormatReader\"\n",
" file = \"$$DataDir$$/Train_cntk_text.txt\"\n",
" randomize = true\n",
" keepDataInMemory = true # cache all data in memory\n",
" input = {\n",
" features = { dim = 3072 ; format = \"dense\" }\n",
" labels = { dim = 10 ; format = \"dense\" }\n",
" }\n",
" }\n",
"}\n",
" }\n",
"\"\"\""
]
},
55 changes: 33 additions & 22 deletions tools/deployment/deploy-arm.ps1
@@ -25,9 +25,9 @@
.PARAMETER deploymentName
The deployment name.
.PARAMETER templateFilePath
Path of the template file to deploy.
Optional, defaults to deploy-main-template.json in this directory.
.PARAMETER templateLocation
URL of the template to deploy.
Optional, defaults to the one corresponding to this script.
.PARAMETER parametersFilePath
Path of the parameters file to use for the template, use
@@ -57,37 +57,48 @@ param(
[string]
$resourceGroupName,

[Parameter(Mandatory=$False)]
[string]
$resourceGroupLocation,

[Parameter(Mandatory=$False)]
[string]
$deploymentName,

[Parameter(Mandatory=$False)]
[string]
$templateFilePath = "deploy-main-template.json",
$templateLocation,

[Parameter(Mandatory=$True)]
[string]
$parametersFilePath
)

# <=<= this line is replaced with variables defined with `defvar -X` =>=>
$DOWNLOAD_URL = "$STORAGE_URL/$MML_VERSION"
# TODO: throw an error if $MML_VERSION is not defined

<#
.SYNOPSIS
Registers RPs
#>
Function RegisterRP {
Param(
[string]$ResourceProviderNamespace
)
Write-Host "Registering resource provider '$ResourceProviderNamespace'";
Register-AzureRmResourceProvider -ProviderNamespace $ResourceProviderNamespace;
Param(
[string]$ResourceProviderNamespace
)
Write-Host "Registering resource provider '$ResourceProviderNamespace'";
Register-AzureRmResourceProvider -ProviderNamespace $ResourceProviderNamespace;
}

#******************************************************************************
# Script body
# Execution begins here
#******************************************************************************

if (!$templateLocation) {
$templateLocation = $DOWNLOAD_URL + "/deploy-main-template.json";
}

$ErrorActionPreference = "Stop"

# sign in
@@ -101,29 +112,29 @@ Select-AzureRmSubscription -SubscriptionID $subscriptionId;
# Register RPs
$resourceProviders = @("microsoft.hdinsight");
if ($resourceProviders.length) {
Write-Host "Registering resource providers"
foreach ($resourceProvider in $resourceProviders) {
RegisterRP($resourceProvider);
}
Write-Host "Registering resource providers"
foreach ($resourceProvider in $resourceProviders) {
RegisterRP($resourceProvider);
}
}

#Create or check for existing resource group
$resourceGroup = Get-AzureRmResourceGroup -Name $resourceGroupName -ErrorAction SilentlyContinue
if (!$resourceGroup) {
Write-Host "Resource group '$resourceGroupName' does not exist. To create a new resource group, please enter a location.";
if (!$resourceGroupLocation) {
$resourceGroupLocation = Read-Host "resourceGroupLocation";
}
Write-Host "Creating resource group '$resourceGroupName' in location '$resourceGroupLocation'";
New-AzureRmResourceGroup -Name $resourceGroupName -Location $resourceGroupLocation
Write-Host "Resource group '$resourceGroupName' does not exist. To create a new resource group, please enter a location.";
if (!$resourceGroupLocation) {
$resourceGroupLocation = Read-Host "resourceGroupLocation";
}
Write-Host "Creating resource group '$resourceGroupName' in location '$resourceGroupLocation'";
New-AzureRmResourceGroup -Name $resourceGroupName -Location $resourceGroupLocation
} else {
Write-Host "Using existing resource group '$resourceGroupName'";
Write-Host "Using existing resource group '$resourceGroupName'";
}

# Start the deployment
Write-Host "Starting deployment...";
if (Test-Path $parametersFilePath) {
New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateFile $templateFilePath -TemplateParameterFile $parametersFilePath;
New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateUri $templateLocation -TemplateParameterFile $parametersFilePath;
} else {
New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateFile $templateFilePath;
New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateUri $templateLocation;
}
19 changes: 12 additions & 7 deletions tools/deployment/deploy-arm.sh
@@ -2,6 +2,15 @@
# Copyright (C) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See LICENSE in project root for information.

# This script deploys a Spark Cluster and a GPU, see docs/gpu-setup.md
# for details.

# <=<= this line is replaced with variables defined with `defvar -X` =>=>
DOWNLOAD_URL="$STORAGE_URL/$MML_VERSION"
if [[ -z "$MML_VERSION" ]]; then
echo "Error: this script cannot be executed as-is" 1>&2; exit 1
fi

set -euo pipefail
# -e: exit if any command has a non-zero exit status
# -u: unset variables are an error
@@ -71,7 +80,8 @@ readarg subscriptionId "Subscription ID" "$cursub"
readarg -r resourceGroupName "Resource Group Name"
readarg deploymentName "Deployment Name"
readarg resourceGroupLocation "Resource Group Location"
readarg templateLocation "Template Location (Path/URL)" "$here/deploy-main-template.json"
readarg templateLocation "Template Location URL" \
"$DOWNLOAD_URL/deploy-main-template.json"
readarg -rf parametersFilePath "Parameters File"

if [[ "$subscriptionId" != "$cursub" ]]; then
@@ -99,12 +109,7 @@ echo "Starting deployment..."
args=()
if [[ -n "$deploymentName" ]]; then args+=(--name "$deploymentName"); fi
args+=(--resource-group "$resourceGroupName")
if [[ "$templateLocation" = "http://"* ]]; then args+=(--template-uri)
elif [[ "$templateLocation" = "https://"* ]]; then args+=(--template-uri)
elif [[ -r "$templateLocation" ]]; then args+=(--template-file)
else failwith "templateLocation is neither a URL, nor does it point at a file"
fi
args+=("$templateLocation")
args+=(--template-uri "$templateLocation")
args+=(--parameters "@$parametersFilePath")

az group deployment create "${args[@]}" || failwith "Deployment failed"
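
For readers who want to see what command line results from this change, the argument assembly in the hunk above can be tried in isolation. The following is a sketch only: the function `build_deploy_args` and the sample values (`my-rg`, `params.json`) are hypothetical, and the command is printed rather than executed, so no Azure login is assumed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper around the argument-building lines shown in the diff.
# It fills the global array `args` in the same order as the script.
build_deploy_args() {
  local deploymentName="$1" resourceGroupName="$2"
  local templateLocation="$3" parametersFilePath="$4"
  args=()
  # The deployment name is optional, so --name is only added when it is set.
  if [[ -n "$deploymentName" ]]; then args+=(--name "$deploymentName"); fi
  args+=(--resource-group "$resourceGroupName")
  # After this commit the template is always passed as a URI.
  args+=(--template-uri "$templateLocation")
  args+=(--parameters "@$parametersFilePath")
}

# Dry run: print the command instead of calling `az`.
build_deploy_args "" "my-rg" \
  "https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-main-template.json" \
  "params.json"
printf 'az group deployment create'
printf ' %q' "${args[@]}"
printf '\n'
```

Because the file-path branch was removed, a local path given as the template location would now be handed to `--template-uri` unchanged, which is consistent with the updated documentation describing `templateLocation` as a URL.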