Spark/h2o integration - sparkling water #30

Open
szilard opened this issue May 22, 2019 · 3 comments

szilard commented May 22, 2019

m5.2xlarge (8 cores, 30GB RAM)
1M rows of data

for comparison (1M rows):

tool       time [s]   AUC
h2o        28.938     0.7623596
xgboost    12.685     0.7494959
lightgbm    6.965     0.7636987

szilard commented May 22, 2019

original API:

https://github.com/szilard/GBM-perf/blob/master/wip-testing/sparkling_water/sw-h2o.scala

scala> val dx_train = asH2OFrame(d_train.select("Month","DayofMonth","DayOfWeek","DepTime","UniqueCarrier",
     |       "Origin","Dest","Distance","dep_delayed_15min"))
dx_train: org.apache.spark.h2o.H2OFrame =
Frame key: frame_rdd_34_b0642c9b519a5b66158f83e818084ae1
   cols: 9
   rows: 1000000
 chunks: 6
   size: 57955931

scala> H2OFrameSupport.allStringVecToCategorical(dx_train)
res1: org.apache.spark.h2o.H2OFrame =
Frame key: frame_rdd_34_b0642c9b519a5b66158f83e818084ae1
   cols: 9
   rows: 1000000
 chunks: 6
   size: 12188619

scala> val elapsed = ( System.nanoTime - now )/1e9
elapsed: Double = 3.172993114
scala> val gbm_md = gbm.trainModel.get
gbm_md: hex.tree.gbm.GBMModel =
...

scala> elapsed
res3: Double = 28.644599689
scala> evaluator.evaluate(predictions)
res4: Double = 0.7623568809741097
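
The transcript above measures wall-clock time by hand with `System.nanoTime`. A minimal sketch of the same pattern wrapped in a helper follows; the name `timed` is hypothetical and not part of the linked script, and the summation is only a stand-in for the training call.

```scala
// Hypothetical helper (not from the linked script): runs a block once
// and returns its result together with the elapsed wall-clock seconds.
def timed[A](block: => A): (A, Double) = {
  val start = System.nanoTime
  val result = block
  val elapsedSec = (System.nanoTime - start) / 1e9
  (result, elapsedSec)
}

// Stand-in workload; in the session above this would be gbm.trainModel.get
val (total, elapsedSec) = timed { (1 to 1000000).map(_.toLong).sum }
println(f"sum=$total elapsed=$elapsedSec%.3f s")
```

A single run like this includes JVM warm-up, so it is only a rough number, which is also true of the timings in this thread.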


szilard commented May 22, 2019

Pipeline API:

with OHE:

https://github.com/szilard/GBM-perf/blob/master/wip-testing/sparkling_water/sw-mllib-ohe.scala

doing 10 trees as this is slow:

scala> val gbm = new H2OGBM().setLabelCol("label").setFeaturesCol("features").
     |   setNtrees(10).setMaxDepth(10).setLearnRate(0.1)    //.setMaxBins(100)   not implemented??

scala> val model = pipeline.fit(d_train)

scala> val elapsed = ( System.nanoTime - now )/1e9
elapsed: Double = 132.769667071

slow with OHE: 10 trees take ~133 sec vs ~28 sec for 100 trees with the original API (m5.2xlarge, 8 cores), i.e. roughly 50x slower per tree
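
The per-tree slowdown can be worked out from the two `elapsed` values reported in this thread. This is a rough sketch that assumes training time scales linearly with the number of trees, which is not verified here; the variable names are made up for illustration.

```scala
// Elapsed times copied from the REPL output in this thread.
val oheElapsedSec = 132.769667071   // Pipeline API with OHE, 10 trees
val origElapsedSec = 28.644599689   // original API, 100 trees

// Assumption: training time is roughly linear in the number of trees.
val oheSecPerTree = oheElapsedSec / 10
val origSecPerTree = origElapsedSec / 100
val slowdown = oheSecPerTree / origSecPerTree

println(f"OHE: $oheSecPerTree%.2f s/tree, original: $origSecPerTree%.3f s/tree, ratio: $slowdown%.1fx")
```

With these two measurements the ratio comes out a little under the quoted 50x, which is consistent with it being a round estimate.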

TODO: fix this (needs a type cast):


val predictions = model.transform(d_test)

val evaluator = new BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("prediction_output").setMetricName("areaUnderROC")
evaluator.evaluate(predictions)

// TODO:
//evaluator.evaluate(predictions)
//java.lang.IllegalArgumentException: requirement failed: Column prediction_output must be of type equal to one of the following types: [double, struct<type:tinyint,size:int,indices:array<int>,values:array<double>>] but was actually of type struct<value:double>.
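
One possible way around this error is to pull the `value` field out of the struct into a plain double column before calling the evaluator, since the message says a `double` column is accepted. The sketch below is untested against Sparkling Water; the function name `aucFromStructPrediction` and the column name `prediction_value` are hypothetical, and it assumes `value` holds a score where higher means more likely positive and that the label column is numeric.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Hypothetical helper: flattens struct<value:double> into a double column
// and computes AUC on it. Assumes "prediction_output.value" is a score
// for the positive class (not confirmed by this thread).
def aucFromStructPrediction(predictions: DataFrame, labelCol: String = "label"): Double = {
  val flat = predictions.withColumn("prediction_value", col("prediction_output.value"))
  new BinaryClassificationEvaluator()
    .setLabelCol(labelCol)
    .setRawPredictionCol("prediction_value")
    .setMetricName("areaUnderROC")
    .evaluate(flat)
}
```

In the sessions here this would be called as `aucFromStructPrediction(model.transform(d_test))`, passing a different `labelCol` where the pipeline uses one.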


szilard commented May 22, 2019

directly with cats:

scala> val gbm = new H2OGBM().setLabelCol("dep_delayed_15min").
     |   setNtrees(100).setMaxDepth(10).setLearnRate(0.1)          // .setMaxBins(100)   not implemented??

scala> val model = pipeline.fit(d_train)
model: org.apache.spark.ml.PipelineModel = pipeline_679f2c3cfbeb

scala> val elapsed = ( System.nanoTime - now )/1e9
elapsed: Double = 31.183876731

TODO: fix this (needs a type cast):

val predictions = model.transform(d_test)

val evaluator = new BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("prediction_output").setMetricName("areaUnderROC")
evaluator.evaluate(predictions)

// TODO:
//evaluator.evaluate(predictions)
//java.lang.IllegalArgumentException: requirement failed: Column prediction_output must be of type equal to one of the following types: [double, struct<type:tinyint,size:int,indices:array<int>,values:array<double>>] but was actually of type struct<value:double>.

szilard changed the title from "h2o/spark sparkling water" to "Spark/h2o - sparkling water" May 25, 2019
szilard changed the title from "Spark/h2o - sparkling water" to "Spark/h2o integration - sparkling water" May 25, 2019