This repository was archived by the owner on Nov 23, 2017. It is now read-only.

Specify type of EBS root volume #45

Open · wants to merge 2 commits into base: branch-1.6
16 changes: 9 additions & 7 deletions README.md
@@ -6,8 +6,8 @@ to launch, manage and shut down
on Amazon EC2. It automatically sets up Apache Spark and
[HDFS](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html)
on the cluster for you. This guide describes
how to use `spark-ec2` to launch clusters, how to run jobs on them, and how
to shut them down. It assumes you've already signed up for an EC2 account
on the [Amazon Web Services site](http://aws.amazon.com/).

`spark-ec2` is designed to manage multiple named clusters. You can
@@ -69,13 +69,15 @@ types, and the default type is `m3.large` (which has 2 cores and 7.5 GB
RAM). Refer to the Amazon pages about [EC2 instance
types](http://aws.amazon.com/ec2/instance-types) and [EC2
pricing](http://aws.amazon.com/ec2/#pricing) for information about other
instance types.
- `--region=<ec2-region>` specifies an EC2 region in which to launch
instances. The default region is `us-east-1`.
- `--zone=<ec2-zone>` can be used to specify an EC2 availability zone
to launch instances in. Sometimes you will get an error because there
is not enough capacity in one zone; if that happens, try launching in
another.
- `--ebs-root-vol-type=<ebs-type>` can be used to specify the type of the
root EBS volume (e.g. `gp2`, `io1`, `st1`, `sc1`, `standard`). The default
is `gp2`; see the example after this list.
- `--ebs-vol-size=<GB>` will attach an EBS volume with a given amount
of space to each node so that you can have a persistent HDFS cluster
on your nodes across cluster restarts (see below).
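
For example, a launch command that overrides the root volume type and attaches a 100 GB EBS volume to each node might look like the following (the key pair name, identity file, and cluster name are placeholders):

```
./spark-ec2 --key-pair=my-key-pair --identity-file=my-key.pem \
  --region=us-east-1 --zone=us-east-1a \
  --ebs-root-vol-type=standard --ebs-vol-size=100 \
  launch my-cluster
```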
@@ -145,7 +147,7 @@ export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123

You can edit `/root/spark/conf/spark-env.sh` on each machine to set Spark configuration options, such
as JVM options. This file needs to be copied to **every machine** to reflect the change. The easiest way to
do this is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master,
then run `~/spark-ec2/copy-dir /root/spark/conf` to RSYNC it to all the workers.
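
For example (the `SPARK_WORKER_CORES` setting here is just an illustration):

```
# On the master: add a setting to spark-env.sh, then sync the conf directory
echo 'export SPARK_WORKER_CORES=4' >> /root/spark/conf/spark-env.sh
~/spark-ec2/copy-dir /root/spark/conf
```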

The [configuration guide](configuration.html) describes the available configuration options.
@@ -195,20 +197,20 @@ In addition to using a single input file, you can also use a directory of files
This repository contains the set of scripts used to set up a Spark cluster on
EC2. These scripts are intended to be used with the default Spark AMI and are *not*
expected to work on other AMIs. If you wish to start a cluster using Spark,
please refer to http://spark-project.org/docs/latest/ec2-scripts.html

## spark-ec2 Internals

The Spark cluster setup is guided by the values set in `ec2-variables.sh`. `setup.sh`
first performs basic operations like enabling ssh across machines and mounting ephemeral
drives, and also creates the files `/root/spark-ec2/masters` and `/root/spark-ec2/slaves`.
Following that, every module listed in `MODULES` is initialized.
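
As a rough sketch of that last step (illustrative only; this is not the literal contents of `setup.sh`):

```
# Hypothetical sketch: run each module's init script, if present
for module in $MODULES; do
  if [[ -e "$module/init.sh" ]]; then
    source "$module/init.sh"   # set up prerequisites before templates are filled in
  fi
done
```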

To add a new module, you will need to do the following:

1. Create a directory with the module's name.

2. Optionally add a file named `init.sh`. This is called before templates are configured
and can be used to install any prerequisites.

3. Add any files that need to be configured based on the cluster setup to `templates/`.
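
For instance, a minimal module might consist of just an `init.sh` (the module name and package below are hypothetical, and this assumes the AMI's package manager is yum):

```
# mymodule/init.sh: called before templates are configured;
# use it to install the module's prerequisites (hypothetical example)
yum install -y some-package
```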
18 changes: 14 additions & 4 deletions spark_ec2.py
@@ -249,12 +249,15 @@ def parse_args():
"--resume", action="store_true", default=False,
help="Resume installation on a previously launched cluster " +
"(for debugging)")
parser.add_option(
"--ebs-root-vol-type", default="gp2",
help="Root EBS volume type (e.g. 'gp2', 'io1', 'st1', 'sc1', 'standard') (default: 'gp2')")
parser.add_option(
"--ebs-vol-size", metavar="SIZE", type="int", default=0,
help="Size (in GB) of each EBS volume.")
parser.add_option(
"--ebs-vol-type", default="standard",
help="EBS volume type (e.g. 'gp2', 'standard').")
"--ebs-vol-type", default="gp2",
help="EBS volume type (e.g. 'gp2', 'io1', 'st1', 'sc1', 'standard') (default: 'gp2')")
parser.add_option(
"--ebs-vol-num", type="int", default=1,
help="Number of EBS volumes to attach to each node as /vol[x]. " +
@@ -588,9 +591,16 @@ def launch_cluster(conn, opts, cluster_name):
print("Could not find AMI " + opts.ami, file=stderr)
sys.exit(1)

# Create block device mapping so that we can add EBS volumes if asked to.
# The first drive is attached as /dev/sds, 2nd as /dev/sdt, ... /dev/sdz
# Create block device mapping so that we can configure and add EBS volumes if asked to.
block_map = BlockDeviceMapping()
# add root ebs volume type
root_device = EBSBlockDeviceType()
root_device.volume_type = opts.ebs_root_vol_type
root_device.delete_on_termination = True
block_map['/dev/sda1'] = root_device

# add additional EBS volumes if asked to
# The first drive is attached as /dev/sds, 2nd as /dev/sdt, ... /dev/sdz
if opts.ebs_vol_size > 0:
for i in range(opts.ebs_vol_num):
device = EBSBlockDeviceType()