Crail-spark-terasort is a modified version of the original TeraSort program from https://github.com/ehiggs/spark-terasort. The original program generates, sorts, and validates 100-byte key-value records (10-byte key + 90-byte value) as described by http://sortbenchmark.org/.
This version is optimized to work with the Crail file system (http://crail.incubator.apache.org/) and performs only the sorting step (plus other controlled experiments; see the -n option below).
News:
- February 19th, 2018: Code is updated to use Apache Crail (Incubating).
- October 6th, 2017: The dependency on crail-spark-io is updated and the pom artifact is renamed from crail-terasort to crail-spark-terasort. The Spark version is bumped from 2.1 to 2.2 (however, there is no API change between these two).
- October 4th, 2017: The crail-terasort repository is renamed to crail-spark-terasort.
The crail-terasort program takes the following options:
-n,--testname <string>        Name of the test valid tests are :
                              1. loadOnly: load and counts the input dataset
                              2. loadStore: load the input dataset and stores it
                              3. loadCount: load, shuffle, and then count the
                                 resulting dataset
                              4. loadCountStore: load, shuffle, count, and then
                                 store the resulting dataset
                              5. loadSort: load, shuffle, and then sort on key
                                 the resulting dataset
                              6. loadSortStore: load, shuffle, sort on key, then
                                 store the resulting dataset
                              the default is : loadSortStore
-i,--inputDir <string>        Name of the input directory
-o,--outputDir <string>       Name of the output directory
-S,--sync <int>               Takes 0 or 1 to pass to the sync call to the output
                              FS while writing (default: 0)
-p,--partitionSize <long>     Partition size, takes k,m,g,t suffixes
                              (default: input partition size, HDFS has 128MB)
-s,--useSerializer <string>   You can use following serializers:
                              none : uses the Spark default serializer
                              kryo : optimized Kryo for TeraSort
                              byte : a simple byte[] serializer
                              f22  : an optimized crail-specific byte[] serializer
                                     f22 requires CrailShuffleNativeRadixSorter
                                     (see below for details)
-b,--bufferSize <int>         Buffer size for Kryo (only valid for kryo)
-O --options <string,string>  Sets properties on the Spark context. The first
                              string is the key, and the second is the value
-h,--help                     Show this help
Apart from the dependencies of the original program, crail-terasort depends on:
- crail-client (1.0) : https://github.com/apache/incubator-crail
- crail-spark-io (1.0) : https://github.com/zrlio/crail-spark-io
- spark (2.0.0+, current build is for 2.2.0)
The crail-spark-io extensions enable Spark to run on the Crail file system.
Hence, you must build and install the zrlio artifacts in your local Maven
repository. For details, see their respective GitHub repositories.
After that, compilation of crail-terasort is simple:
mvn -DskipTests -T 1C install
After a successful compilation you should have crail-terasort-2.0.jar
in your ./target folder. This is the crail-terasort app jar.
Note: These steps are optional and only needed if you plan to use the
f22 serializer. However, please note that the best performance is
only delivered by the combination of f22 with CrailShuffleNativeRadixSorter.
The f22 serializer requires the use of CrailShuffleNativeRadixSorter
as the shuffle sorter. This sorter depends on a native library,
libjsort (which builds on libboost). The source of libjsort is in
your cloned repository; build it with:
cmake .
make
These steps should give you libjsort.so in your directory. In case
your system does not satisfy the boost dependency, you can explicitly
copy the include/boost/sort directory from https://github.com/boostorg/sort
to your libjsort src directory as boost/sort.
Then copy the file into your Java library directory, as defined
in $HADOOP_HOME/etc/hadoop/yarn-env.sh. As an example, we have:
export JAVA_LIBRARY_PATH=/path/to/libjsort/:$JAVA_LIBRARY_PATH
and set yarn-site.xml accordingly, for example:
<property>
  <name>yarn.nodemanager.admin-env</name>
  <value>LD_LIBRARY_PATH=/path/to/libjsort/:</value>
</property>
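Before submitting a job, it can help to confirm that libjsort.so is actually visible on the configured library path. The following is a minimal sketch and not part of crail-terasort: the function name find_libjsort is hypothetical, and it only inspects the node it runs on, so it assumes you repeat the check on every node that hosts executors.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of crail-terasort): print the first directory
# on a colon-separated path list that contains libjsort.so.
find_libjsort() {
  local path_list="$1" dir
  local IFS=':'
  for dir in $path_list; do
    # Skip empty entries, such as the one produced by a trailing ':'
    [ -n "$dir" ] || continue
    if [ -f "$dir/libjsort.so" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Usage: check the path exported in yarn-env.sh
if dir="$(find_libjsort "${JAVA_LIBRARY_PATH:-}")"; then
  echo "libjsort.so found in $dir"
else
  echo "libjsort.so not found on JAVA_LIBRARY_PATH" >&2
fi
```

The same function can be pointed at the LD_LIBRARY_PATH value used in yarn-site.xml by passing that string as the argument.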
We assume that you have a working Spark installation with the crail-spark-io
extensions. Here we give detailed instructions on how to generate data,
sort it with the crail-terasort program, and validate the output.
We recommend using the original TeraSort program from Ewan Higgs
to generate data: https://github.com/ehiggs/spark-terasort. Clone it,
then build it by simply executing mvn -DskipTests -T 1C install.
However, the original program is written for an older version of Spark.
Hence, we recommend that you modify the pom.xml file to update the
versions as follows:
diff --git a/pom.xml b/pom.xml
index e15c3e1..996973d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -218,9 +218,9 @@
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
- <scala.version>2.10.4</scala.version>
- <scala.binary.version>2.10</scala.binary.version>
- <spark.version>1.2.1</spark.version>
+ <scala.version>2.11.8</scala.version>
+ <scala.binary.version>2.11</scala.binary.version>
+ <spark.version>2.0.0</spark.version>
</properties>
<dependencies>
Alternatively, simply clone the updated repository from https://github.com/animeshtrivedi/spark-terasort.
To generate data, use the following command (adjust according to your config):
./bin/spark-submit -v \
--num-executors 10 --executor-cores 1 --executor-memory 4G \
--master yarn --class com.github.ehiggs.spark.terasort.TeraGen \
path/to/your/spark-terasort/target/spark-terasort-1.0.jar \
1g /terasort-input-1g
Recommendation: We recommend using the same number of executors
and partitions as you have machines. For example, the above command
generates a 1g dataset with 10 executors, each with 1 core, and
10 partitions. The number of partitions comes from
spark.default.parallelism in the $SPARK_HOME/conf/spark-defaults.conf
file. We have it set to 10 for 10 machines. The output directory
name is specified as /terasort-input-1g.
If you experience issues with missing dependencies, try using
spark-terasort-1.0-jar-with-dependencies.jar from the target folder.
After generating data, you can use the following command to run crail-terasort app (adjust according to your config):
./bin/spark-submit -v \
--num-executors 10 --executor-cores 8 --executor-memory 8G --driver-memory 8G \
--master yarn --class com.ibm.crail.terasort.TeraSort \
path/to/your/crail-terasort/target/crail-terasort-2.0.jar \
-i /terasort-input-10g -o /terasort-output-10g
The default serializer is none, which means that whatever you set
as spark.serializer in $SPARK_HOME/conf/spark-defaults.conf will be
used. Once you have the basic setup working, you can try using
kryo or byte.
Using the f22 serializer requires you to set the crail-terasort specific
sorter as well. You must set
spark.crail.shuffle.sorter com.ibm.crail.terasort.sorter.CrailShuffleNativeRadixSorter
spark.crail.serializer com.ibm.crail.terasort.serializer.F22Serializer
in your $SPARK_HOME/conf/spark-defaults.conf.
Note: The f22 serializer is only compatible with CrailShuffleNativeRadixSorter!
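If you would rather not edit spark-defaults.conf for a single run, the same two properties can, as far as we can tell, be passed per job with spark-submit's --conf flag, since Spark generally treats --conf properties like entries in that file. The sketch below is an illustration under that assumption: the function name terasort_cmd is hypothetical, the jar path is the placeholder used above, and the resource flags (--num-executors and so on) are left out for brevity. It only prints the command so that you can inspect it first.

```shell
#!/usr/bin/env bash
# Sketch only: build a crail-terasort spark-submit command line for a given
# serializer. The function name and the use of --conf instead of
# spark-defaults.conf are assumptions made for illustration.
terasort_cmd() {
  local serializer="$1" input="$2" output="$3"
  local conf=""
  case "$serializer" in
    none|kryo|byte) ;;
    f22)
      # f22 only works together with CrailShuffleNativeRadixSorter
      conf="--conf spark.crail.shuffle.sorter=com.ibm.crail.terasort.sorter.CrailShuffleNativeRadixSorter"
      conf="$conf --conf spark.crail.serializer=com.ibm.crail.terasort.serializer.F22Serializer"
      ;;
    *)
      echo "unknown serializer: $serializer" >&2
      return 1
      ;;
  esac
  echo "./bin/spark-submit -v --master yarn ${conf:+$conf }--class com.ibm.crail.terasort.TeraSort" \
       "path/to/your/crail-terasort/target/crail-terasort-2.0.jar" \
       "-s $serializer -i $input -o $output"
}

# Print the command for inspection before running it
terasort_cmd f22 /terasort-input-10g /terasort-output-10g
```

For none, kryo, and byte the function adds no --conf flags, which mirrors the rule that only f22 needs the native radix sorter.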
As with input generation, we recommend using the original program for output validation (adjust according to your config):
./bin/spark-submit -v \
--num-executors 10 --executor-cores 8 --executor-memory 4G \
--master yarn --class com.github.ehiggs.spark.terasort.TeraValidate \
apps/jars/spark-terasort/target/spark-terasort-1.0.jar \
/terasort-output-10g
In the output you should see the following lines:
num records: 1000000000
checksum: 1dcd3a102148ad22
partitions are properly sorted
The checksum should be the same as the one you get if you run the Terasort program on the input directory.
PRs are always welcome. Please fork, make the modifications you propose, and let us know.
If you have questions or suggestions, feel free to post at:
https://groups.google.com/forum/#!forum/zrlio-users
or email: [email protected]