  • Enabled optimization of tuple values via Categorical
    • This can be used with Keras to search over different kernel_size values for Conv2D or pool_size values for MaxPooling2D, for example:
    Conv2D(64, kernel_size=Categorical([(2, 2), (3, 3), (4, 4)]), activation="relu")
    MaxPooling2D(pool_size=Categorical([(1, 1), (3, 3)]))


  • Removed the "Validated Environment ..." log messages made when initializing an Experiment/OptPro

3.0.0 (2019-08-06) Artemis

This changelog entry combines the contents of all 3.0.0 pre-release entries

Artemis: Greek goddess of hunting

This is the most significant release since the birth of HyperparameterHunter, adding not only feature engineering, but also feature optimization. The goal of feature engineering in HyperparameterHunter is to enable you to manipulate your data however you need to, without imposing restrictions on what's allowed - all while seamlessly keeping track of your feature engineering steps so they can be learned from and optimized. In that spirit, feature engineering steps are defined by your very own functions. That may sound a bit silly at first, but it affords maximal freedom and customization, with only the minimal requirement that you tell your function what data you want from HyperparameterHunter, and you give it back when you're done playing with it.

The best way to really understand feature engineering in HyperparameterHunter is to dive into some code and check out the "examples/feature_engineering_examples" directory. In no time at all, you'll be ready to spread your wings by experimenting with the creative feature engineering steps only you can build. Let your faithful assistant, HyperparameterHunter, meticulously and lovingly record them for you, so you can optimize your custom feature functions just like normal hyperparameters.

You're a glorious peacock, and we just wanna let you fly.


  • Feature engineering via FeatureEngineer and EngineerStep

    • This will be a "brief" summary of the new features. For more detail, see the aforementioned "examples/feature_engineering_examples" directory or the extensively documented FeatureEngineer and EngineerStep classes
    • FeatureEngineer can be passed as the feature_engineer kwarg to either:
      1. Instantiate a CVExperiment, or
      2. Call the forge_experiment method of any Optimization Protocol
    • FeatureEngineer is just a container for EngineerSteps
      • Instantiate it with a simple list of EngineerSteps, or functions to construct EngineerSteps
    • The most important EngineerStep parameter is a function you define to perform your data transformation (whatever that is)
      • This function is often creatively referred to as a "step function"
    • Step function definitions have only two requirements:
      1. Name the data you want to transform in the signature's input parameters
      2. Return the data when you're done with it
    • Step functions may be given directly to FeatureEngineer, or wrapped in an EngineerStep for greater customization
    • Here are just a few step functions you might want to make:
    from hyperparameter_hunter import CVExperiment, FeatureEngineer, EngineerStep
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import QuantileTransformer, StandardScaler
    from sklearn.impute import SimpleImputer
    def standard_scale(train_inputs, non_train_inputs):
      s = StandardScaler()
      train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
      non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
      return train_inputs, non_train_inputs
    def quantile_transform(train_targets, non_train_targets):
      t = QuantileTransformer(output_distribution="normal")
      train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
      non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
      return train_targets, non_train_targets, t
    def set_nan(all_inputs):
      cols = [1, 2, 3, 4, 5]
      all_inputs.iloc[:, cols] = all_inputs.iloc[:, cols].replace(0, np.nan)
      return all_inputs
    def impute_negative_one(all_inputs):
      all_inputs.fillna(-1, inplace=True)
      return all_inputs
    def impute_mean(train_inputs, non_train_inputs):
      imputer = SimpleImputer()
      train_inputs[train_inputs.columns] = imputer.fit_transform(train_inputs.values)
      non_train_inputs[train_inputs.columns] = imputer.transform(non_train_inputs.values)
      return train_inputs, non_train_inputs
    def sqr_sum_feature(all_inputs):
      all_inputs["my_sqr_sum_feature"] = all_inputs.agg(
          lambda row: np.sqrt(np.sum([np.square(_) for _ in row])),
          axis="columns",
      )
      return all_inputs
    def upsample_train_data(train_inputs, train_targets):
      pos = pd.Series(train_targets["target"] == 1)
      train_inputs = pd.concat([train_inputs, train_inputs.loc[pos]], axis=0)
      train_targets = pd.concat([train_targets, train_targets.loc[pos]], axis=0)
      return train_inputs, train_targets
    # Any of the above can be wrapped by `EngineerStep`, or added directly to a `FeatureEngineer`'s `steps`
    # Below, assume we have already activated an `Environment`
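    exp_0 = CVExperiment(
      model_initializer=...,  # Placeholder: supply your own model class here
      feature_engineer=FeatureEngineer([
          EngineerStep(upsample_train_data, stage="intra_cv"),
      ]),
    )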
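The two step function requirements listed above (name the data you want, return it when done) can be pictured with a small stand-alone sketch. This is not HyperparameterHunter's implementation: `run_step` and the plain-dict `datasets` are hypothetical names, used only to show how a framework could read a step function's parameter names with `inspect.signature`, hand over the matching data, and store whatever comes back.

```python
import inspect


def run_step(step_function, datasets):
    """Call `step_function` with the datasets named in its signature (hypothetical helper)."""
    requested = list(inspect.signature(step_function).parameters)
    results = step_function(*(datasets[name] for name in requested))
    if not isinstance(results, tuple):
        results = (results,)
    # Store the returned data under the names that were requested.
    # Extra return values (such as a fitted transformer) are ignored by this sketch.
    for name, new_value in zip(requested, results):
        datasets[name] = new_value
    return datasets


def double_train_inputs(train_inputs):
    return [value * 2 for value in train_inputs]


datasets = {"train_inputs": [1, 2, 3], "train_targets": [0, 1, 0]}
datasets = run_step(double_train_inputs, datasets)
print(datasets["train_inputs"])  # [2, 4, 6]
```

Only the dataset named in the signature is touched; `train_targets` is left exactly as it was.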
  • Feature optimization

    • Categorical can be used to optimize feature engineering steps, either as EngineerStep instances or raw functions of the form expected by EngineerStep
    • Just throw your Categorical in with the rest of your FeatureEngineer.steps
    • Features can, of course, be optimized alongside standard model hyperparameters
    from hyperparameter_hunter import GBRT, Real, Integer, Categorical, FeatureEngineer, EngineerStep
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler
    def standard_scale(train_inputs, non_train_inputs):
      s = StandardScaler()
      train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
      non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
      return train_inputs, non_train_inputs
    def min_max_scale(train_inputs, non_train_inputs):
      s = MinMaxScaler()
      train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
      non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
      return train_inputs, non_train_inputs
    # Pretend we already set up our `Environment` and we want to optimize our scaler
    # We'll also throw in some standard hyperparameter optimization - This is HyperparameterHunter, after all
    optimizer_0 = GBRT()
    optimizer_0.forge_experiment(
      Ridge,
      dict(alpha=Real(0.5, 1.0), max_iter=Integer(500, 2000), solver="svd"),
      feature_engineer=FeatureEngineer([Categorical([standard_scale, min_max_scale])]),
    )
    # Then we remembered we should probably transform our target, too
    # OH NO! After transforming our targets, we'll need to `inverse_transform` the predictions!
    # OH YES! HyperparameterHunter will gladly accept a fitted transformer as an extra return value, 
    #   and save it to call `inverse_transform` on predictions 
    def quantile_transform(train_targets, non_train_targets):
      t = QuantileTransformer(output_distribution="normal")
      train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
      non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
      return train_targets, non_train_targets, t
    # We can also tell HyperparameterHunter to invert predictions using a callable, rather than a fitted transformer
    def log_transform(all_targets):
      all_targets = np.log1p(all_targets)
      return all_targets, np.expm1
    optimizer_1 = GBRT()
    optimizer_1.forge_experiment(
      Ridge, {}, feature_engineer=FeatureEngineer([
          Categorical([standard_scale, min_max_scale]),
          Categorical([quantile_transform, log_transform]),
      ])
    )
  • Categorical.optional

    • As Categorical is the means of optimizing EngineerSteps in simple lists, it became necessary to answer the question of whether that crazy new feature you've been cooking up in the lab should even be included at all
    • So the optional kwarg was added to Categorical to appease the mad scientist in us all
    • If True (default=False), the search space will include not only the categories you explicitly provide, but also the omission of the current EngineerStep entirely
    • optional is only intended for use in optimizing EngineerSteps. Don't expect it to work elsewhere
    • Brief example:
    from hyperparameter_hunter import DummySearch, Categorical, FeatureEngineer, EngineerStep
    from sklearn.linear_model import Ridge
    def standard_scale(train_inputs, non_train_inputs):
      """Pretend this function scales data using SKLearn's `StandardScaler`"""
      return train_inputs, non_train_inputs
    def min_max_scale(train_inputs, non_train_inputs):
      """Pretend this function scales data using SKLearn's `MinMaxScaler`"""
      return train_inputs, non_train_inputs
    # Pretend we already set up our `Environment` and we want to optimize our scaler
    optimizer_0 = DummySearch()
    optimizer_0.forge_experiment(
      Ridge, {}, feature_engineer=FeatureEngineer([
          Categorical([standard_scale, min_max_scale])
      ])
    )
    # `optimizer_0` above will try each of our scaler functions, but what if we shouldn't use either?
    optimizer_1 = DummySearch()
    optimizer_1.forge_experiment(
      Ridge, {}, feature_engineer=FeatureEngineer([
          Categorical([standard_scale, min_max_scale], optional=True)
      ])
    )
    # `optimizer_1`, using `Categorical.optional`, will search the same points as `optimizer_0`, plus
    #   a `FeatureEngineer` where the step is skipped completely, which would be the equivalent of
    #   no `FeatureEngineer` at all in this example
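To make the effect of `optional` on the search space concrete, here is a minimal stand-alone sketch. It does not use HyperparameterHunter: `enumerate_steps` is a hypothetical helper, strings stand in for the step functions, and `None` is simply this sketch's way of marking a skipped step.

```python
from itertools import product


def enumerate_steps(dimensions):
    """Yield every combination of steps; `None` marks a skipped (optional) step."""
    choices = []
    for categories, optional in dimensions:
        options = list(categories)
        if optional:
            options.append(None)  # The extra "omit this step" choice
        choices.append(options)
    for combination in product(*choices):
        yield [step for step in combination if step is not None]


scalers = ["standard_scale", "min_max_scale"]
without_optional = list(enumerate_steps([(scalers, False)]))
with_optional = list(enumerate_steps([(scalers, True)]))
print(without_optional)  # [['standard_scale'], ['min_max_scale']]
print(with_optional)     # [['standard_scale'], ['min_max_scale'], []]
```

The empty list in the second result corresponds to the case where the step is skipped completely.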
  • Enable OptPros to identify similar_experiments when using a search space whose dimensions include Categorical.optional EngineerSteps at indexes that may differ from those of the candidate Experiments

  • Add callbacks kwarg to CVExperiment, which enables providing LambdaCallbacks for an Experiment right when you're initializing it

    • Works the same way as the existing experiment_callbacks kwarg of Environment
  • Improve Metric.direction inference to check the name of the metric_function for "error"/"loss" strings, after checking the name itself

    • This means that an Environment.metrics value of {"mae": "median_absolute_error"} will be correctly inferred to have direction="min", making it easier to use short aliases for those extra-long error metric names
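The inference order described above (metric name first, then the name of the metric function) can be sketched as follows. This is an illustration rather than the library's code: `infer_direction` is a hypothetical function, and the `"max"` fallback is an assumption made for this example.

```python
def infer_direction(metric_name, metric_function_name, default="max"):
    """Guess whether a metric should be minimized or maximized (illustrative only)."""
    # Check the metric's own name first, then the name of its function
    for candidate in (metric_name, metric_function_name):
        if any(marker in candidate.lower() for marker in ("error", "loss")):
            return "min"
    return default  # Assumed fallback for this sketch


print(infer_direction("mae", "median_absolute_error"))  # min
print(infer_direction("roc_auc", "roc_auc_score"))      # max
```

The alias "mae" contains neither marker string, so the second check on the function name is what produces "min".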


  • Fix bug causing descendants of SKOptimizationProtocol to break when given non-string base_estimators
  • Fix bug causing ScoringMixIn to incorrectly keep track of the metrics to record for different dataset types
  • Fix bug preventing get_clean_predictions from working with multi-output datasets
  • Fix incorrect leaderboard sorting when evaluations are tied (again)
  • Fix bug causing metrics to be evaluated using the transformed targets/predictions, rather than the inverted (original space) predictions, after performing target transformation via EngineerStep
    • Adds new Environment kwarg: save_transformed_metrics, which dictates whether metrics are calculated using transformed targets/predictions (True), or inverted data (False)
    • Default value of save_transformed_metrics is chosen based on dtype of targets. See #169
  • Fix bug causing BayesianOptPro to break, or fail experiment matching, when using an exclusively-Categorical search space. For details, see #154
  • Fix bug causing Informed Optimization Protocols to break after the tenth optimization round when attempting to fit optimizer with EngineerStep dicts, rather than proper instances
    • This was caused by the EngineerSteps stored in saved experiment descriptions not being reinitialized in order to be compatible with the current search space
    • See PR #139 or "tests/integration_tests/feature_engineering/" for details
  • Fix broken inverse target transformation of LightGBM predictions
  • Fix incorrect "source_script" recorded in CVExperiment description files when executed within an Optimization Protocol
  • Fix bug causing :mod:data.data_chunks to be excluded from installation
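The target transformation fix above matters because a metric can look very different depending on the space in which it is computed. The following stand-alone sketch uses only the standard library and made-up numbers; it does not call HyperparameterHunter, and `mean_absolute_error` here is a local helper rather than any library function.

```python
import math


def mean_absolute_error(targets, predictions):
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


targets = [0.0, 9.0, 99.0]
transformed_targets = [math.log1p(t) for t in targets]
# Pretend a model trained on the transformed targets made these predictions
transformed_predictions = [t + 0.1 for t in transformed_targets]
# Invert the predictions back to the original target space
inverted_predictions = [math.expm1(p) for p in transformed_predictions]

transformed_score = mean_absolute_error(transformed_targets, transformed_predictions)
original_score = mean_absolute_error(targets, inverted_predictions)
print(round(transformed_score, 3))         # 0.1
print(original_score > transformed_score)  # True
```

A constant offset of 0.1 in log space becomes a much larger error once the predictions are inverted, which is the kind of difference save_transformed_metrics lets you choose between.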


  • Metrics are now always invoked with NumPy arrays
    • Note that this is unlike EngineerStep functions, which always receive Pandas DataFrames as input and should always return DataFrames
  • DatasetSentinel functions in Environment now retrieve data transformed by feature engineering
  • data
    • Rather than being haphazardly stored in an assortment of experiment attributes, datasets are now managed by both :mod:data and the overhauled :mod:callbacks.wranglers module
    • Affects custom user callbacks that used experiment datasets. See the next section for details
  • Add warn_on_re_ask kwarg to all OptPro initializers. If True (default=False), a warning will be logged whenever the internal optimizer suggests a point that has already been evaluated, before a new, random point is returned to evaluate instead
  • model_init_params kwarg of both CVExperiment and all OptPros is now optional. If not given, the default initialization parameters of model_initializer are used
  • Convert file module to space directory module, containing space_core and dimensions
    • space.dimensions is the new home of the dimension classes used to define hyperparameter search spaces via :meth:optimization.protocol_core.BaseOptPro.forge_experiment: Real, Integer, and Categorical
    • space.space_core houses :class:Space, which is only used internally
  • Convert and file modules to optimization directory module, containing protocol_core and the backends directory
    • has been moved to
    • optimization.backends contains skopt.engine and skopt.protocols, the latter of which is the new location of the original file
    • optimization.backends.skopt.engine is a partial vendorization of Scikit-Optimize's Optimizer class, which acts as the backend for :class:optimization.protocol_core.SKOptPro
      • For additional information on the partial vendorization of key Scikit-Optimize components, see the optimization.backends.skopt README. A copy of Scikit-Optimize's original LICENSE can also be found in optimization.backends.skopt
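The warn_on_re_ask behavior listed above can be pictured with a short stand-alone sketch. Everything in it is hypothetical: `ask_without_repeats`, `suggest`, and `sample_random` are names invented for this example, and re-sampling until an unseen point is found is an assumption of the sketch, not a statement about the vendored optimizer.

```python
import random
import warnings


def ask_without_repeats(suggest, evaluated, sample_random, warn_on_re_ask=False):
    """Return `suggest()` unless it was already evaluated; then fall back to a random point."""
    point = suggest()
    if point in evaluated:
        if warn_on_re_ask:
            warnings.warn(f"Re-asked for already evaluated point: {point}")
        point = sample_random()
        while point in evaluated:  # Assumption: keep sampling until the point is new
            point = sample_random()
    return point


rng = random.Random(0)
evaluated = {(0.5,), (0.7,)}
new_point = ask_without_repeats(
    suggest=lambda: (0.5,),  # Always suggests a point we have already seen
    evaluated=evaluated,
    sample_random=lambda: (round(rng.uniform(0.0, 1.0), 3),),
    warn_on_re_ask=True,
)
print(new_point not in evaluated)  # True
```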


  • OptPros' set_experiment_guidelines method renamed to forge_experiment
    • set_experiment_guidelines will be removed in v3.2.0
  • Optimization Protocols in :mod:hyperparameter_hunter.optimization renamed to use "OptPro"
    • This change affects the following optimization protocol classes:
      • BayesianOptimization -> BayesianOptPro
      • GradientBoostedRegressionTreeOptimization -> GradientBoostedRegressionTreeOptPro
        • GBRT alias unchanged
      • RandomForestOptimization -> RandomForestOptPro
        • RF alias unchanged
      • ExtraTreesOptimization -> ExtraTreesOptPro
        • ET alias unchanged
      • DummySearch -> DummyOptPro
    • This change also affects the base classes for optimization protocols defined in :mod:hyperparameter_hunter.optimization.protocol_core that are not available in the package namespace
    • The original names will continue to be available until their removal in v3.2.0
  • lambda_callback kwargs dealing with "experiment" and "repetition" time steps have been shortened
    • These four kwargs have been changed to the following values:
      • on_experiment_start -> on_exp_start
      • on_experiment_end -> on_exp_end
      • on_repetition_start -> on_rep_start
      • on_repetition_end -> on_rep_end
    • In summary, "experiment" is shortened to "exp", and "repetition" is shortened to "rep"
    • The originals will continue to be available until their removal in v3.2.0
    • This deprecation will break any custom callbacks created by subclassing BaseCallback (which is not the officially supported method), rather than using lambda_callback
      • To fix such callbacks, simply rename the above methods
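If you keep callback kwargs in a dict, the four renames above can be applied mechanically. The sketch below is a plain-Python convenience and not part of HyperparameterHunter; `shorten_callback_kwargs` and the `unrelated_kwarg` key are hypothetical.

```python
RENAMED_KWARGS = {
    "on_experiment_start": "on_exp_start",
    "on_experiment_end": "on_exp_end",
    "on_repetition_start": "on_rep_start",
    "on_repetition_end": "on_rep_end",
}


def shorten_callback_kwargs(kwargs):
    """Return a copy of `kwargs` with the deprecated names replaced by the new ones."""
    return {RENAMED_KWARGS.get(name, name): value for name, value in kwargs.items()}


old_kwargs = {"on_experiment_start": print, "unrelated_kwarg": print}
new_kwargs = shorten_callback_kwargs(old_kwargs)
print(sorted(new_kwargs))  # ['on_exp_start', 'unrelated_kwarg']
```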

Breaking Changes

  • Any custom callbacks (lambda_callback or otherwise) that accessed the experiment’s datasets will need to be updated to access their new locations. The new syntax is described in detail in :mod:data.data_core, but the general idea is as follows:
    1. Experiments have four dataset attributes: data_train, data_oof, data_holdout, data_test
    2. Each dataset has three data_chunks: input, target, prediction
    3. Each data_chunk has six attributes. The first five pertain to the experiment division for which the data is collected: d (initial data), run, fold, rep, and final
    4. The sixth attribute of each data_chunk is T, which contains the transformed states of the five attributes described in step 3
      • Transformations are applied by feature engineering
      • Inversions of those transformations (if applicable) are stored in the five normal data_chunk attributes from step 3
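The layout in steps 1 through 4 can be mocked up to show what the access paths look like. This sketch builds stand-in objects with `SimpleNamespace` and made-up values; it is not the real :mod:data.data_core classes, and `make_chunk` is a hypothetical helper.

```python
from types import SimpleNamespace


def make_chunk(**states):
    """Build a stand-in data_chunk with `d`, `run`, `fold`, `rep`, `final`, and `T` (mock only)."""
    names = ("d", "run", "fold", "rep", "final")
    chunk = SimpleNamespace(**{name: states.get(name) for name in names})
    # `T` mirrors the same five attributes, holding their transformed states
    chunk.T = SimpleNamespace(**{name: states.get(f"T_{name}") for name in names})
    return chunk


experiment = SimpleNamespace(
    data_train=SimpleNamespace(
        input=make_chunk(fold=[[1.0, 200.0]], T_fold=[[-1.0, 1.0]]),
        target=make_chunk(fold=[10.0], T_fold=[2.398]),
        prediction=make_chunk(),
    )
)

print(experiment.data_train.target.fold)    # [10.0]
print(experiment.data_train.target.T.fold)  # [2.398]
```

A callback that previously read a dataset from an experiment attribute would instead follow a path of this shape: dataset, then data_chunk, then division, with `T` inserted when the transformed state is wanted.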
  • Ignore Pandas version during dataset hashing for more consistent Environment keys. See #166

0.0.1 (2018-06-14)


  • Initial release