
WeightDecay for L1 norm #159

Merged: 10 commits into FluxML:master on Feb 7, 2024
Conversation

@mcabbott (Member) commented Sep 6, 2023

As I learned in FluxML/MLJFlux.jl#221 (comment), the gradient of the L1 norm is even simpler than the gradient of the L2 norm, so it can, obviously, be implemented as an optimisation rule too.

This quick PR adds it to the same WeightDecay struct. Below is a check that this does what you expect.

using Flux: Flux, Dense, gradient, state
using Optimisers
using Optimisers: setup, update

input = [1,2]
model = Dense([1 -2; 3 -4.0])

grads = Flux.gradient(model) do m
  result = m(input)
  sum(result)
end

# Check L2 norm via WeightDecay (nothing new!)

pen_l2(x::AbstractArray) = sum(abs2, x)/2

grads_L2 = Flux.gradient(model) do m
  result = m(input)
  penalty = sum(pen_l2, Flux.params(m))
  sum(result) + 0.42 * penalty
end

update(
  setup(Descent(0.1), model),
  model, grads_L2[1])[2] |> Flux.state

update(
  setup(OptimiserChain(WeightDecay(0.42), Descent(0.1)), model),
  model, grads[1])[2] |> Flux.state

# Do exactly the same thing for L1 (needs this PR)

pen_l1(x::AbstractArray) = sum(abs, x)

grads_L1 = Flux.gradient(model) do m
  result = m(input)
  penalty = sum(pen_l1, Flux.params(m))
  sum(result) + 0.42 * penalty
end

update(
  setup(Descent(0.1), model),
  model, grads_L1[1])[2] |> Flux.state

update(
  setup(OptimiserChain(WeightDecay(0.0, 0.42), Descent(0.1)), model),
  model, grads[1])[2] |> Flux.state
  
# Both give (weight = [0.858 -2.158; 2.858 -4.158], bias = [-0.1, -0.1], σ = ())

PR Checklist

  • Tests are added
  • Documentation, if applicable

@darsnack (Member) left a comment:

Looks good. It might be worth adding examples to the docstring now that the rule is sufficiently complex.

src/rules.jl (outdated):

# Parameters
- Weight decay (`γ`): Decay applied to weights during optimisation.
- Sign decay (`ζ`): umm
Member:

Suggested change:
  before: - Sign decay (`ζ`): umm
  after:  - Sign decay (`ζ`): Signed decay applied to weights during optimization.

@mcabbott (Member, Author):

Yea I meant to write some words! I think we can do better than "Weight decay (γ): Decay applied to weights" too, as this is pretty circular.

Member:

Though not 100% accurate, I think even "L1/L2 regularization coefficient" would be more informative.

@darsnack (Member) commented Sep 6, 2023

An alternative API is to add SignedDecay (or something) if we find WeightDecay(0.0, 0.004) too weird.

@ToucheSir (Member):

I thought about that too, but this seems more straightforward if one wants to combine L1 and L2. We don't currently have a Parallel-esque rule which feeds the same gradient into two different rules, though now that I say it, such a composite rule could be a nice addition.

@mcabbott (Member, Author) commented Sep 6, 2023

Yes, I wondered about an independent rule, but then thought precisely that you may want a bit of L1 and a bit of L2. And also, perhaps, that if you know about this trick for L2, this proximity may help you discover the similar trick for L1.

I gave it the next unused Greek letter. It's sort-of neat that each different rule you may wish to chain uses a different field name, as adjust!(..., zeta=0.1) etc. never modifies two unrelated things.

@darsnack (Member) commented Sep 6, 2023

For what it's worth, I'm okay with a single rule. But just to push the other side a bit more: you don't need a Parallel-esque construct for these rules to compose. OptimiserChain(WeightDecay(0.004), SignedDecay(0.004), Descent(0.1)) works just fine (since each depends on x, not dx).
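
A sketch of that composition (not from the PR), written with the SignDecay rule this PR eventually merged in place of the hypothetical SignedDecay; the coefficients are arbitrary placeholders:

using Optimisers

rule = OptimiserChain(WeightDecay(0.004), SignDecay(0.004), Descent(0.1))

x = [1.0, -2.0, 3.0]
st = Optimisers.setup(rule, x)
dx = ones(3)

# Each decay rule reads the parameter x, not the incoming gradient, so chaining them
# just sums the two penalty gradients before Descent scales the step; the update below
# computes x .- 0.1 .* (dx .+ 0.004 .* x .+ 0.004 .* sign.(x))
st, x = Optimisers.update(st, x, dx)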

@ToucheSir (Member):

Ah you're right, I got my wires crossed there.

FWIW, the AdamW paper uses λ for the weight decay term, which PyTorch borrows for its optimizer documentation but does not use in any API.

@darsnack (Member) commented Sep 6, 2023

Another option is to have SignedDecay(zeta) = WeightDecay(0, zeta). I'm okay with all options, just throwing things out for consideration.

@ablaom commented Sep 7, 2023

Thanks for considering this contribution @mcabbott.

Another convention, adopted in elastic net and elsewhere in statistics, is to have an overall lambda parameter and an L1/L2 mixture parameter alpha. This is what we do in MLJFlux.

https://github.com/FluxML/MLJFlux.jl/blob/b449d80d1d5606298bae0ded1992ee35c5c099c0/src/penalizers.jl#L11

But I don't have a strong opinion.
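
A rough sketch of that elastic-net parameterisation (illustrative only; the names and exact form are not MLJFlux's code): one overall strength lambda and a mixture alpha, with alpha = 1 giving pure L1 and alpha = 0 pure L2.

# Illustrative elastic-net convention. The bare (1 - alpha) * sum(abs2, x) term, and hence
# the 2x in its gradient, is exactly the factor-of-two convention debated further down.
elastic_penalty(x; lambda, alpha) = lambda * (alpha * sum(abs, x) + (1 - alpha) * sum(abs2, x))

# Per-element term that an optimisation rule would add to the gradient dx:
elastic_grad(xi; lambda, alpha) = lambda * (alpha * sign(xi) + (1 - alpha) * 2xi)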

@mcabbott (Member, Author) commented Sep 7, 2023

Ah that is a nice idea.

It sounds like lambda is more standard. I don't know where we got gamma; possibly I just invented something other than Flux's .wd:

https://github.com/FluxML/Flux.jl/blob/95737ffc9aa989f31d5fecd9a887a9c25f4fd865/src/optimise/optimisers.jl#L690-L692

It only matters because of adjust!, but I guess we can add a deprecation.
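
A hedged illustration of the adjust! point, using the field names as they ended up after this PR (lambda for WeightDecay, eta for Descent); the numbers are arbitrary:

using Optimisers

st = Optimisers.setup(OptimiserChain(WeightDecay(0.42), Descent(0.1)), [1.0, -2.0])

Optimisers.adjust!(st, lambda = 0.1)  # touches only rules with a `lambda` field (WeightDecay here)
Optimisers.adjust!(st, eta = 0.01)    # touches only rules with an `eta` field (Descent here)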

@ablaom commented Sep 7, 2023

Yes, but I have also seen the roles of lambda and alpha reversed :-(

@mcabbott (Member, Author) commented Sep 8, 2023

I wish I was surprised...

Now changed to lambda and alpha. This seems fairly natural to have as one struct, not two.

Not easily accessible from Flux, but shouldn't break anything:

julia> Flux.setup(Flux.WeightDecay(0.1), [1,2.0]) |> dump
Optimisers.Leaf{WeightDecay, Nothing}
  rule: WeightDecay
    lambda: Float64 0.1
    alpha: Float64 0.0
  state: Nothing nothing
  frozen: Bool false

mcabbott marked this pull request as ready for review on September 8, 2023 at 03:30.
src/rules.jl (outdated):
λ, α = T(o.lambda), T(o.alpha)
ℓ1 = λ * α
ℓ2 = λ * (1 - α)
dx′ = @lazy dx + ℓ2 * x + ℓ1 * sign(x)
@ablaom commented Sep 8, 2023:

I wonder if there is a factor of two missing here. Consider the ordinary scalar case: the derivative of x^2 (the L2 penalty) is 2x, while the derivative of |x| (the L1 penalty) is sign(x). So either

dx′ = @lazy dx + ℓ2 * 2x + ℓ1 * sign(x)

or

dx′ = @lazy dx + ℓ2 * x + ℓ1 * sign(x) / 2

would be more correct. I think the first is better, but it is also breaking, I guess.
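
A quick numerical check of those derivatives (using ForwardDiff purely for illustration; it is not part of this PR):

using ForwardDiff  # any AD package would do for this check

x = [1.5, -0.5]
ForwardDiff.gradient(v -> sum(abs2, v), x)      # [3.0, -1.0] == 2 .* x
ForwardDiff.gradient(v -> sum(abs2, v) / 2, x)  # [1.5, -0.5] == x, the /2 convention
ForwardDiff.gradient(v -> sum(abs, v), x)       # [1.0, -1.0] == sign.(x)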

Member:

I'm not aware of any other implementations which add the factor of two. It's likely assumed that it will be folded into λ. Of course, none of them try to use L1 and L2 together either!

@mcabbott (Member, Author):

> x^2 (l2 penalty)

I'm sure all conventions exist, but the most common one seems to be to take norm(x)^2/2 as the L2 penalty. I think the present code & docs should agree on this choice.

For L1 it surely has to be just norm(x,1).

@ablaom:

Yes, all conventions exist, but I'm going to push back one last time on what is "most common". The first place I looked just now, the Wikipedia page on regularisation, has no 1/2 in front of the L2 penalty when it is mixed with L1. In fact, the formula and notation there correspond exactly to my first case and to what we implement in MLJFlux presently.

For me, the "Lp penalty" is the Lp norm to the power of p, which is always the sum of |x_i|^p over i.

I understand how the 1/2 started to appear in isolation (i.e. when ignoring L1 regularisation), because it simplifies the derivative. But in the context of using both, we should compare apples to apples, so the 1/2 makes no sense, unless you say the Lp penalty is \frac{1}{p} |x|^p, which I have not seen.

Member:

I don't think we can have anything other than lambda * x as the L2 penalty. That's so commonly accepted as the standard weight decay implementation that anything else would be unnecessarily surprising. So it's either /2 for the L1 portion or no change. I'm inclined to consider the /2 only because we are using the elastic net coefficient convention. If we had separate coefficients for each term, I would prefer to avoid any extra factors and lump it all into the coefficients.

@ablaom commented Sep 11, 2023:

Agreed. Add 1/2 to the L1 part or separate into two optimisers.

@ablaom commented Jan 16, 2024

@mcabbott Do you have some time to push this along? The project to update MLJFlux to use explicit parameters is waiting on this.

@mcabbott (Member, Author):

I had a go locally & will try to find the branch

@mcabbott (Member, Author) commented Feb 2, 2024

OK, dbcea29 pushes what I had locally, from way back when. It leaves WeightDecay alone and makes a new struct for the combined L1 and L2 story. I called this NormReg, although perhaps there's a better name.

Is this a good design? We could instead have a new struct which does only L1. And then (if we want to support a mixture) have some function which returns a chain of L1 and L2, using existing structs. Maybe that would be better.

Edit: And f70aa9c changes to a separate SignDecay struct for L1 alone. No function for a combination. Maybe that's the minimal thing.

Maybe they should not have the same field name lambda; ideas for what might be better?
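
A minimal sketch of what an L1-only rule looks like in the Optimisers.jl rule interface; the name MySignDecay and its details are illustrative and may differ from the SignDecay committed here:

using Optimisers

struct MySignDecay{T} <: Optimisers.AbstractRule
  lambda::T
end

Optimisers.init(o::MySignDecay, x::AbstractArray) = nothing  # the rule keeps no state

function Optimisers.apply!(o::MySignDecay, state, x::AbstractArray{T}, dx) where T
  κ = T(o.lambda)
  dx′ = @. dx + κ * sign(x)  # gradient of κ * norm(x, 1); the PR's own code uses a lazy broadcast
  return state, dx′
end

# Composes with other rules in the usual way, e.g.
# Optimisers.setup(OptimiserChain(MySignDecay(0.01), Descent(0.1)), model)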

@ToucheSir (Member) commented Feb 3, 2024

PyTorch wasn't very helpful as inspiration, but optax uses the term "decay rate" in their implementation of weight decay. A little verbose but pretty clear.

Alternatively, the sklearn regression models call this L1/L2 coefficient alpha. The ElasticNet page specifically refers to it as a "penalty (term)", which is another idea for a plain English word.

@ablaom commented Feb 4, 2024

The separate SignDecay option, as currently implemented here, would suit me fine. This way, I can confidently use the two decays without looking up documentation to sort out the notation and the convention about 1 versus 1/2. (In elastic net I have seen the roles of alpha and lambda reversed in some implementations.)

@mcabbott (Member, Author) commented Feb 5, 2024

Another argument against having a mixture parameter: in most practical use, knowing that λ = 1e-3 is a useful amount of L2 for your problem does not imply that this is the right amount of L1... you are going to have to search, in which case just changing the mixture/angle parameter isn't really better than changing some other κ instead.

The last commits change the name of the L1 penalty coefficient to "kappa", because it's next door and not used in this package yet (hence adjust(st, kappa=0.1) will hit exactly one thing).

@ToucheSir (Member):

If you'll allow me to bikeshed names one more time: I feel like we should not be pulling out Greek characters that have not been used in the literature before, even if they are represented as English words instead of the original symbols. Are there no descriptive terms we can use instead of kappa (and maybe lambda too)?

Otherwise LGTM.

@ablaom commented Feb 6, 2024

I think they could have the same name. They're both regularization parameters in separate structs. I think lambda (or its Unicode equivalent) is pretty standard for a generic regularization parameter.

In the same vein, eta is used for the learning rate in all the variations of Optimisers' gradient descent.

@mcabbott (Member, Author) commented Feb 6, 2024

Good point about eta being used everywhere; maybe just re-using lambda is best.

Maybe this is done? CI on Julia > 1.6 might be fixed later by #166.

@mcabbott (Member, Author) commented Feb 7, 2024

If either of you clicks approve, I can merge this and then rebase #160.

mcabbott merged commit e60b71e into FluxML:master on Feb 7, 2024 (3 of 4 checks passed).
mcabbott deleted the l1norm branch on February 7, 2024.
mashu pushed a commit to mashu/Optimisers.jl that referenced this pull request on Nov 14, 2024:
* WeightDecay for L1 norm

* better words

* change to lambda alpha, add tests

* change to lambda, add tests

* tweaks

* shashed in October - makes two structs instead

* version with simple SignDecay instead

* change SignDecay penalty to be called kappa

* restore depwarn for WeightDecay, was called gamma

* change kappa back to lambda