
Implement Parameter validation and normalization #7

Merged
merged 10 commits into main from normalize_and_flatten_params on May 26, 2023

Conversation

Owner

@acalejos acalejos commented May 19, 2023

Closes #3

Need to add documentation about parameters and many more test cases, but the main implementation is done

default: false,
doc: """
Whether to use RAPIDS Memory Manager for memory allocation.
This This option is only applicable when XGBoost is built (compiled)
Collaborator

Suggested change
This This option is only applicable when XGBoost is built (compiled)
This option is only applicable when XGBoost is built (compiled)

type: :boolean,
default: false,
doc: """
Whether to use RAPIDS Memory Manager for memory allocation.
Collaborator

If this is only available in certain instances it should probably be a library config set globally and then grabbed from the application env.

Users would do:

config :exgboost, use_rmm: true
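A minimal sketch of the read side, assuming the flag is resolved from the application env at compile time with `false` as the fallback (the `:use_rmm` key mirrors the suggestion above; the lookup itself is illustrative, not this PR's implementation):

```elixir
# Hypothetical: resolve the global :use_rmm flag from the application env,
# defaulting to false when unset.
use_rmm = Application.compile_env(:exgboost, :use_rmm, false)
```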

Comment on lines 27 to 29
* `:gbtree`: tree-based models
* `:gblinear`: linear models
* `:dart`: tree-based models with dropouts
Collaborator

Suggested change
* `:gbtree`: tree-based models
* `:gblinear`: linear models
* `:dart`: tree-based models with dropouts
* `:gbtree` - tree-based models
* `:gblinear` - linear models
* `:dart` - tree-based models with dropouts

I think this looks better

* `:dart`: tree-based models with dropouts
"""
],
verbosity: [
Collaborator

Why is verbosity in both?

@@ -0,0 +1,968 @@
defmodule EXGBoost.Parameters do
@global_params [
Collaborator

Oh if these are set globally perhaps they should be config values

that all parameters are valid strings which is what XGBoost is expecting.
"""
],
nthread: [
Collaborator

I think this should also be a configuration value. We have it as a config for EXLA. I can't think of a scenario where you'd want to set this per call to predict/train

Owner Author

Changing to

nthread: [
      type: :non_neg_integer,
      default: Application.compile_env(:exgboost, :nthread, 0),
      doc: """
      Number of threads to use for training and prediction. If `0`, then the
      number of threads is set to the number of cores.  This can be set globally
      using the `:exgboost` application environment variable `:nthread`
      or on a per booster basis.  If set globally, the value will be used for
      all boosters unless overridden by a specific booster.

      To set the number of threads globally, add the following to your `config.exs`:

      ```elixir
      config :exgboost, nthread: 4
      ```
      """
    ]

The reason is to allow a global config while also allowing per-booster config, which XGBoost supports.
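For example (a hypothetical call shape, assuming options are passed as a keyword list to `EXGBoost.train/2` and that `dmatrix` is already-built training data):

```elixir
# The global default comes from config :exgboost, nthread: 4;
# this call overrides it for a single booster.
EXGBoost.train(dmatrix, nthread: 8)
```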

`:silent`, `:warning`, `:info`, `:debug`
"""
],
validate_parameters: [
Collaborator

I wouldn't make this an option. You should choose one behavior or the other. I prefer that you perform validation by default

Owner Author

So the main reason I'm currently allowing this is to let "power users" use params that perhaps aren't perfectly captured by the validators. I guess it's kind of a fail-safe in case a validation is broken; it would at least allow people to pass the exact string params they want.

I'm thinking in particular of parameters like `ndcg@n` and `map@n`, where `n` can be assigned as an integer to cut off the top positions in the lists for evaluation.
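A hypothetical sketch of that escape hatch (assuming `validate_parameters` and the raw metric string are both accepted as training options):

```elixir
# Hypothetical: skip validation to pass a raw XGBoost parameter string
# the validators don't model, such as the top-n cutoff in ndcg@10.
EXGBoost.train(dmatrix, validate_parameters: false, eval_metric: "ndcg@10")
```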

Owner Author

Plus, the default is `true`.

doc: ~S'''
Minimum loss reduction required to make a further partition on a leaf node
of the tree. The larger `gamma` is, the more conservative the algorithm will
be. Valid range is [0, $\\infty$].
Collaborator

Make sure you set up the docs to render LaTeX if you do this. Check the Axon `mix.exs` for how.
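A minimal sketch of that setup using ExDoc's `before_closing_body_tag` hook (the CDN URLs and version pin are illustrative, not Axon's exact config):

```elixir
# In mix.exs: inject KaTeX into the generated HTML docs so math renders.
defp docs do
  [
    before_closing_body_tag: fn
      :html ->
        """
        <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.8/dist/katex.min.css">
        <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.8/dist/katex.min.js"></script>
        <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.8/dist/contrib/auto-render.min.js"
                onload="renderMathInElement(document.body)"></script>
        """

      _ ->
        ""
    end
  ]
end
```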

then the building process will give up further partitioning. In linear regression task,
this simply corresponds to minimum number of instances needed to be in each node.
The larger `min_child_weight` is, the more conservative the algorithm will be.
Valid range is `[0, Nx.Constants.infinity()]`.
Collaborator

LaTeX or Elixir code?

Comment on lines 143 to 148
* `:uniform`: each training instance has an equal probability of being selected.
Typically set subsample >= 0.5 for good results.
* `:gradient_based`: the selection probability for each training instance is proportional
to the regularized absolute value of gradients. subsample may be set to as low as 0.1
without loss of model accuracy. Note that this sampling method is only supported when
`tree_method` is set to `gpu_hist`; other tree methods only support `:uniform` sampling.
Collaborator

Suggested change
* `:uniform`: each training instance has an equal probability of being selected.
Typically set subsample >= 0.5 for good results.
* `:gradient_based`: the selection probability for each training instance is proportional
to the regularized absolute value of gradients. subsample may be set to as low as 0.1
without loss of model accuracy. Note that this sampling method is only supported when
`tree_method` is set to `gpu_hist`; other tree methods only support `:uniform` sampling.
* `:uniform` - each training instance has an equal probability of being selected.
Typically set subsample >= 0.5 for good results.
* `:gradient_based` - the selection probability for each training instance is proportional
to the regularized absolute value of gradients. subsample may be set to as low as 0.1
without loss of model accuracy. Note that this sampling method is only supported when
`tree_method` is set to `gpu_hist`; other tree methods only support `:uniform` sampling.

type: :keyword_list,
doc: """
This is a family of parameters for subsampling of columns.
All `colsample_by*` parameters have a range of `(0, 1]`, the default value of `1`, and specify the fraction of columns to be subsampled.
Collaborator

Suggested change
All `colsample_by*` parameters have a range of `(0, 1]`, the default value of `1`, and specify the fraction of columns to be subsampled.
All `colsample_by*` parameters have a range of `(0, 1]`, a default value of `1`, and each specifies the fraction of columns to be subsampled.

doc: """
This is a family of parameters for subsampling of columns.
All `colsample_by*` parameters have a range of `(0, 1]`, the default value of `1`, and specify the fraction of columns to be subsampled.
`colsample_by*` parameters work cumulatively. For instance, the combination
Collaborator

Suggested change
`colsample_by*` parameters work cumulatively. For instance, the combination
`colsample_by` parameters work cumulatively. For instance, the combination
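As a concrete illustration of the cumulative behavior described above: `colsample_bytree: 0.5`, `colsample_bylevel: 0.5`, and `colsample_bynode: 0.5` on a dataset with 64 columns leaves 64 × 0.5 × 0.5 × 0.5 = 8 columns to choose from at each split.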

default: 1,
doc: """
L2 regularization term on weights. Increasing this value will make model more conservative.
Valid range is `[0, :infinity]`.
Collaborator

Just a reminder to choose between LaTeX, Elixir code, and atoms.

Owner Author

I think I'll do LaTeX; I definitely need to normalize that notation.

Valid range is `[0, :infinity]`.
"""
],
reg_lambda: [
Collaborator

Is this required if it's an alias? Can't we just settle on one?

Valid range is `[0, Nx.Constants.infinity()]`.
"""
],
reg_alpha: [
Collaborator

Same as above

Owner Author

The aliases are supported by XGBoost itself, so I thought we could support them too, but we could easily take them out.
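A minimal sketch of one way alias normalization could work (hypothetical names and mapping, not necessarily this PR's implementation):

```elixir
# Hypothetical: fold XGBoost's parameter aliases onto one canonical key.
aliases = %{lambda: :reg_lambda, alpha: :reg_alpha, eta: :learning_rate}

normalize = fn params ->
  Enum.map(params, fn {key, value} -> {Map.get(aliases, key, key), value} end)
end

normalize.(lambda: 1.0, eta: 0.3)
#=> [reg_lambda: 1.0, learning_rate: 0.3]
```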

* `:exact`: Exact greedy algorithm. Enumerates all split candidates.
* `:approx`: Approximate greedy algorithm using sketching and histogram.
* `:hist`: Faster histogram optimized approximate greedy algorithm.
* `:gpu_hist`: GPU implementation of hist algorithm.
Collaborator

Will this work out of the box? Do we need to build a GPU version?

Owner Author

This is not currently working, as I haven't added NIFs for CUDA support yet. I can take it out until then.

Collaborator

Yeah, I would remove it until then. Make an issue for GPU support and we can track adding all of this back in there.

Owner Author

#9

modular way to construct and to modify the trees. This is an advanced parameter that
is usually set automatically, depending on some other parameters. However, it could be
also set explicitly by a user. The following updaters exist:
* `:grow_colmaker`: non-distributed column-based construction of trees.
Collaborator

Format these list items with a hyphen between the atom and the description, as in the earlier suggestion.
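A hypothetical usage sketch for the `updater` parameter described above (assuming EXGBoost normalizes a list of updater atoms into the comma-separated string XGBoost expects, e.g. "grow_colmaker,prune"):

```elixir
# Hypothetical: run the column-based grower followed by the pruner.
EXGBoost.train(dmatrix, updater: [:grow_colmaker, :prune])
```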

lib/exgboost.ex Outdated
accepts a keyword list of options that can be used to configure the training process. See the
[XGBoost documentation](https://xgboost.readthedocs.io/en/latest/parameter.html) for the full list of options.

`Exgbost.train/2` uses the `Exgboost.Training.train/1` function to perform the actual training. `Exgboost.Training.train/1`
`Exgbost.train/2` uses the `EXGBoost.Training.train/1` function to perform the actual training. `EXGBoost.Training.train/1`
Collaborator

Suggested change
`Exgbost.train/2` uses the `EXGBoost.Training.train/1` function to perform the actual training. `EXGBoost.Training.train/1`
`EXGBoost.train/2` uses the `EXGBoost.Training.train/1` function to perform the actual training. `EXGBoost.Training.train/1`

@@ -1,13 +1,13 @@
defmodule Exgboost.ArrayInterface do
defmodule EXGBoost.ArrayInterface do
@moduledoc false
@typedoc """
Collaborator

You should probably have this floating closer to the @type
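A sketch of the suggested placement (the type shape shown here is hypothetical):

```elixir
defmodule EXGBoost.ArrayInterface do
  @moduledoc false

  # Keep the @typedoc directly above the @type it documents.
  @typedoc "Hypothetical: a map describing a tensor in array-interface form."
  @type array_interface :: %{typestr: String.t(), shape: tuple(), data: {integer(), boolean()}}
end
```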

Collaborator

@seanmor5 seanmor5 left a comment

I just left 2 comments, but after that it should be good


@acalejos acalejos merged commit 19cb7d4 into main May 26, 2023
@acalejos acalejos deleted the normalize_and_flatten_params branch May 26, 2023 15:31
Development

Successfully merging this pull request may close these issues.

Normalize Booster Objectives