Martin Popel edited this page Oct 28, 2015 · 2 revisions

Plans for a new vw-hyperopt script

We have vw-hypersearch, but it can handle only one hyperparameter, and its golden-section search works only for unimodal (e.g. convex) loss functions.
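For illustration, here is a minimal Python sketch of golden-section search, the strategy vw-hypersearch uses for a single hyperparameter. The function name and the example loss below are illustrative, not vw-hypersearch's actual code.

```python
import math

def golden_section_min(f, a, b, tol=1e-6):
    """Find the minimum of a unimodal function f on [a, b]."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi, ~0.618
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            # minimum lies in [a, d]; reuse c as the new d
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            # minimum lies in [c, b]; reuse d as the new c
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# Works for a unimodal (here convex) loss:
print(round(golden_section_min(lambda x: (x - 2.0) ** 2, 0, 10), 3))  # 2.0
```

On a multimodal loss the same procedure may converge to a local minimum, which is why vw-hyperopt needs other search strategies.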

vw-experiment

vw-experiment is a simple script that computes test and train loss. It will be used by vw-hyperopt, but it is also useful on its own.

vw-experiment \
  --train=train.dat \
  --test=test.dat \
  --vw=../vw \
  --train_loss_examples=1e5

vw-hyperopt

Example usage:

vw-hyperopt --train=train.dat --test=test.dat \
  vw --loss_function=[hinge,logistic,squared] \
  --l1=[1e-10..0.005]L -q=[ff]O -b=[18..23]IO --passes=[2,4,8]O

Semantics:

  • [a,b,c] ... try the listed values (numbers or strings) for a given parameter
  • [a,b,c]O ... try also omitting the parameter
  • [min..max] ... range of real values
  • [min..max]I ... range of integer values
  • [min..max]L ... range of real values with logarithmic scale
  • [min..max]O ... try also omitting the parameter
  • modifiers I, L and O can be combined
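The bracket syntax above could be parsed roughly as follows; the regex and the returned dict layout are my assumptions, not a fixed specification.

```python
import re

def parse_spec(spec):
    """Parse e.g. '[1e-10..0.005]L', '[2,4,8]O' or '[18..23]IO'."""
    m = re.fullmatch(r'\[([^\]]+)\]([ILO]*)', spec)
    if not m:
        raise ValueError('bad spec: ' + spec)
    body, mods = m.groups()
    result = {
        'integer': 'I' in mods,    # [min..max]I -> integer range
        'log_scale': 'L' in mods,  # [min..max]L -> logarithmic scale
        'omittable': 'O' in mods,  # O -> try also omitting the parameter
    }
    if '..' in body:               # [min..max] range
        lo, hi = body.split('..')
        result['range'] = (float(lo), float(hi))
    else:                          # [a,b,c] explicit values (numbers or strings)
        result['values'] = body.split(',')
    return result

print(parse_spec('[18..23]IO'))
# {'integer': True, 'log_scale': False, 'omittable': True, 'range': (18.0, 23.0)}
```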

VW parameters with special handling:

ALWAYS:

  • -c --cache is always added for speedup

FORBIDDEN:

  • -k --kill_cache is not forwarded to vw (but the cache file is deleted)
  • -d --data is overridden by --train and --test
  • -t --testonly
  • -f --final_regressor
  • -a --audit
  • --readable_model arg
  • --invert_hash arg

QUESTIONABLE:

  • -i --initial_regressor
  • --holdout_off
  • --save_resume
  • --cache_file

vw-hyperopt parameters:

  • --train training data [required]
  • --test development test data [recommended]
  • --train_loss_examples=N number of examples used to compute the train loss (via vw --examples N -t -d train.dat). 0 means do not compute the train loss; "all" means use the whole train.dat. Default is 100,000.
  • --save_models all/only the best
  • --save_logs
  • --jobs=N ... N parallel jobs; default is autodetected from the number of cores
  • --noise also compute the irreducible error (loss) via vw-overfit
  • --plot tikz,png
  • --search exhaustive, random, ... We could also have --randseed, --timeout and --rounds (of hill-climbing)
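A --search=random mode could look roughly like the sketch below. The random_search function, the toy parameter space and the toy_loss stub are hypothetical stand-ins; a real implementation would invoke vw-experiment for each sampled setting.

```python
import random

def random_search(space, evaluate, rounds=20, seed=42):
    """Sample `rounds` random settings from `space` and keep the best."""
    rng = random.Random(seed)          # would come from --randseed
    best_loss, best_setting = float('inf'), None
    for _ in range(rounds):            # would come from --rounds
        setting = {name: rng.choice(values) for name, values in space.items()}
        loss = evaluate(setting)       # a real version would run vw-experiment here
        if loss < best_loss:
            best_loss, best_setting = loss, setting
    return best_loss, best_setting

# Toy example: the "loss" is lowest at b=23, passes=8.
space = {'b': [18, 20, 23], 'passes': [2, 4, 8]}
toy_loss = lambda s: (23 - s['b']) * 0.01 + (8 - s['passes']) * 0.005
print(random_search(space, toy_loss))
```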

Related links:

Drawing plots

It would be nice if vw-hyperopt could produce (e.g. png) plots with:

  • test loss if --test
  • train loss if --train_loss_examples
  • irreducible error (loss) if --noise
  • progressive validation error if --pve
  • time_train if --time_train
  • time_test if --time_test

My understanding of Variance-Bias Tradeoff

(Note that "corresponds to" here means "is an estimate of".)

  • Train loss corresponds to Bias^2 + noise.
  • Test loss corresponds to Bias^2 + noise + Variance.
  • The difference between test loss and train loss thus corresponds to the Variance.
  • The amount of Variance corresponds to the amount of over-training.

(http://scott.fortmann-roe.com/docs/BiasVariance.html)
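A toy numeric illustration of the decomposition above, with made-up loss values; only the arithmetic relations come from the text.

```python
noise      = 0.25  # irreducible error, as estimated e.g. by vw-overfit
train_loss = 0.30  # corresponds to Bias^2 + noise
test_loss  = 0.38  # corresponds to Bias^2 + noise + Variance

bias_sq  = train_loss - noise       # estimate of Bias^2
variance = test_loss - train_loss   # estimate of Variance (over-training)

print(round(bias_sq, 2), round(variance, 2))  # 0.05 0.08
```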

Rationale:

If the test loss curve is close to the noise curve, no further hyperparameter tuning can help; you must add new features to the training data.

If over-training is the problem, there are several things you can do about it:

  • get more training data
  • apply (higher) regularization (--l1 or --l2)
  • try bagging with -B
  • hold back on the options below for fighting high Bias (except the first one or two), since they tend to increase Variance

If high Bias is the problem (i.e. underfitting):

  • make sure the training data is shuffled
  • higher -b (--bit_precision)
  • lower/no regularization
  • more --passes or higher --learning_rate
  • get more features, either truly new features or nonlinear combinations via --quadratic, --cubic, --stage_poly, --lrq, --ngram, --nn etc.

TODO: use https://metacpan.org/pod/Parallel::ForkManager