v0.6
New functionality:
- Similar to Spark's StringIndexer, we have a ValueIndexer that can be
  used for indexing values of any type instead of only strings. In
  addition to indexing these values, we provide a reverse mapping via
  IndexToValue, similar to Spark's IndexToString transform.
- A new "clean missing" data estimator, CleanMissingData. Example:

      // Replace missing values in "some-column" with a custom value
      val cmd = new CleanMissingData()
        .setInputCols(Array("some-column"))
        .setOutputCols(Array("some-column"))
        .setCleaningMode(CleanMissingData.customOpt)
        .setCustomValue(someCustomValue)
      val cmdModel = cmd.fit(dataset)
      val result = cmdModel.transform(dataset)

- New default featurization for the date and timestamp Spark types and
  our internal image type:
  - Date columns are converted to double features: year, day of week,
    month, day of month.
  - Timestamp columns get the same features as date columns, plus hour
    of day, minute of hour, second of minute.
  - Image columns use the image data converted to doubles, together
    with width and height info.
- Starting the Docker image without an ACCEPT_EULA variable setting
  used to throw an error. Instead, we now start a tiny web server that
  shows the EULA and replaces itself with the Jupyter interface when you
  click the AGREE button.
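To illustrate the idea behind ValueIndexer and IndexToValue described above, here is a minimal plain-Scala sketch of indexing values of an arbitrary ordered type and mapping indices back to values. It does not use the library's API: the `ValueIndexSketch` object, the `Indexer` class, and the choice to assign indices in sorted order are all assumptions made for illustration.

```scala
// Plain-Scala sketch of value indexing with a reverse mapping.
// Not the library's implementation; all names here are hypothetical.
object ValueIndexSketch {

  // Holds the distinct values ("levels"); a value's index is its position.
  final case class Indexer[T](levels: Vector[T]) {
    private val toIndex: Map[T, Int] = levels.zipWithIndex.toMap

    // Forward mapping: value -> index (None if the value was never seen).
    def index(value: T): Option[Int] = toIndex.get(value)

    // Reverse mapping: index -> value (None if the index is out of range).
    def value(i: Int): Option[T] = levels.lift(i)
  }

  // Assumption: indices follow the sorted order of the distinct values.
  def fit[T: Ordering](values: Seq[T]): Indexer[T] =
    Indexer(values.distinct.sorted.toVector)
}
```

For example, `ValueIndexSketch.fit(Seq(30, 10, 20, 10))` produces an indexer in which `index(10)` is `Some(0)` and `value(1)` is `Some(20)`.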
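The date and timestamp featurization described above can be sketched in plain Scala as follows. This is only an illustration under stated assumptions, not the library's code: the `DateFeaturizationSketch` object and its method names are hypothetical, it works on `java.time` values rather than Spark columns, and the day-of-week numbering (ISO, 1 = Monday through 7 = Sunday) is an assumption since the notes do not specify one.

```scala
import java.time.{LocalDate, LocalDateTime}

// Sketch of turning dates and timestamps into double features.
// Hypothetical names; not the library's implementation.
object DateFeaturizationSketch {

  // Date -> year, day of week, month, day of month.
  // Assumption: ISO day-of-week numbering (1 = Monday ... 7 = Sunday).
  def dateFeatures(d: LocalDate): Array[Double] =
    Array(d.getYear, d.getDayOfWeek.getValue, d.getMonthValue, d.getDayOfMonth)
      .map(_.toDouble)

  // Timestamp -> the date features, plus hour of day, minute of hour,
  // and second of minute.
  def timestampFeatures(t: LocalDateTime): Array[Double] =
    dateFeatures(t.toLocalDate) ++
      Array(t.getHour, t.getMinute, t.getSecond).map(_.toDouble)
}
```

With these assumptions, `dateFeatures(LocalDate.of(2017, 11, 15))` (a Wednesday) yields `2017.0, 3.0, 11.0, 15.0`, and the timestamp variant appends three more values for hour, minute, and second.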
Breaking changes:
- Renamed ImageTransform to ImageTransformer.
Notable bug fixes and other changes:
- Improved sample notebooks, and added a new one: "303 - Transfer
  Learning by DNN Featurization - Airplane or Automobile".
- Fixed serialization bugs in generated Python PipelineStages.
Acknowledgments
Thanks to Ali Zaidi for some notebook beautifications.