
Deprecate active field, and auto keyword #1301

Closed
wants to merge 6 commits into from

Conversation

DhairyaLGandhi
Member

@DhairyaLGandhi DhairyaLGandhi commented Aug 3, 2020

The auto functionality can generally be implemented without the extra noise of defining extra methods, since training and inference behaviour (including forced dropout) can be disambiguated without them. This PR implements Dropout the layer and dropout the kernel so that they behave consistently with how they are expected to be called.

The current system has been difficult to scale and is prone to bugs unless special handlers are defined, which may themselves be misinterpreted. We can already support the same features with less potential for confusion.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@DhairyaLGandhi DhairyaLGandhi marked this pull request as draft August 3, 2020 06:27
@DhairyaLGandhi DhairyaLGandhi changed the title Deprecate auto field Deprecate auto field, and active Aug 3, 2020
@DhairyaLGandhi DhairyaLGandhi changed the title Deprecate auto field, and active Deprecate active field, and auto keyword Aug 3, 2020
@DhairyaLGandhi DhairyaLGandhi reopened this Dec 1, 2020
@DhairyaLGandhi DhairyaLGandhi added this to the v0.12 milestone Feb 5, 2021
@CarloLucibello
Member

I don't understand what this PR is trying to do; the OP is not very clear. Also, the PR contains some unrelated code. This is a draft PR that hasn't been worked on since August, so it shouldn't be blocking the v0.12 release.

@CarloLucibello
Member

By the way, from what I gather this goes back in the direction things were before the introduction of testmode! and the layers' activity state. It's been discussed already: we want to be able to set a layer to testmode or trainmode independently of whether we are in a gradient context. So current master is fine in my opinion, but maybe I didn't understand what this PR is trying to do.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Feb 8, 2021

That is the job of the functional layers this PR was working towards. It's not blocking so much as it needs to be prioritised, to remove a lot of the code smell accumulated over the course of changing the norm layers manually. We would still have the same behaviours as master, but should remove a lot of the keyword handling.

It aims to remove helpers like isactive and testmode!, and make that functionality available separately. You can see how it works in the dropout case. The unrelated work is my bad; I've removed it.

@darsnack
Member

darsnack commented Feb 8, 2021

AFAICT this PR just removes the active field in favor of istraining(). But the reason that active was introduced was because the automated behavior of istraining() is not desired in all cases. Is there a plan here to address that in a different way? I couldn't see how it was supposed to work in the dropout case.

@darsnack
Member

darsnack commented Feb 8, 2021

I'm all for a more functional design btw, but I guess I just keep seeing it mentioned, and I have no clue what's actually being proposed.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Feb 8, 2021

There are a few ways. One is to tell users to use the function directly, which should address ambiguities; another is to make active part of the type signature so its value is known at compile time for CSE. It would also reduce our reliance on small unions, which would be nice. Ultimately, we can retain the current default and have overrides in place for other cases.
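For concreteness, here is a minimal sketch (hypothetical names, not the PR's actual code) of what lifting active into the type could look like for dropout, so that dispatch picks the training or inference branch at compile time:

```julia
# Hypothetical sketch: `active` as a type parameter instead of a field.
struct Dropout{Active,F}
    p::F
end

# Outer constructor lifts the Bool into the type domain.
Dropout(p; active::Bool = true) = Dropout{active,typeof(p)}(p)

# Training variant: sample a mask and rescale the survivors.
(d::Dropout{true})(x) = x .* (rand(size(x)...) .> d.p) ./ (1 - d.p)

# Inference variant: a compile-time no-op.
(d::Dropout{false})(x) = x
```

Since Dropout{false} is a distinct type, the identity branch can be eliminated entirely; the trade-off raised elsewhere in this thread is that switching modes then means constructing a new struct.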

@darsnack
Member

darsnack commented Feb 8, 2021

I don't think telling the user to use the function is a good approach. IIRC the place I first encountered this issue, I had a model where I wanted to perform an adversarial attack and the batch norm behavior was messing up my results. Swapping out a layer for a function in my model is not something I want to do.

Lifting it into the type parameters seems like a nice idea though. But we'd still have auto as a keyword?

@ToucheSir
Member

How would in-place updates work if active becomes a type param? e.g. fine-tuning a model where you still want gradients but no batchnorm updates. Wouldn't that require constructing a new layer struct (which we've already determined is a no-go)? I think it would help to have a lot more detail on the proposed functional layer approach before we pull the trigger on anything like this.

@darsnack
Member

darsnack commented Feb 8, 2021

Well, I think the functional approach should be part of this PR? If there's a desire to deprecate without the replacement, then I'd strongly push against that.

Wouldn't that require constructing a new layer struct

I thought about that too. It does depend on how frequently you construct a copy with a different active field. There would be an allocation penalty, but it should only be for the outermost wrapper, right? Unless we want to deep copy the subfields, which, yeah, is a no-go AFAICT.

@darsnack
Member

darsnack commented Feb 8, 2021

But +1 for more details

@DhairyaLGandhi
Member Author

Not deep copy, and yes, the allocation would be for the wrapper.

I was planning on adding the functional part as well, FWIW.

@CarloLucibello
Member

its not blocking as much as needs to be prioritised

I'll remove the v0.12 label then, the milestones should contain only release-blocking issues/PR at this point

@CarloLucibello CarloLucibello removed this from the v0.12 milestone Feb 8, 2021
@DhairyaLGandhi
Member Author

How often does one switch out the active field while training? And is the allocation an issue where it would be tested?

@DhairyaLGandhi
Member Author

I wouldn't want to ship the norm layers as they are.

@darsnack
Member

darsnack commented Feb 8, 2021

Is that relevant to this PR or to #1495?

@ToucheSir
Member

How often does one switch out the active field while training?

For fine-tuning, probably only once. For something like gradual unfreezing though, you'd need a mechanism for only switching out a few layers at a time. To me this implies some sort of lens interface, which doesn't currently exist in Flux.

@darsnack darsnack mentioned this pull request Feb 17, 2021
8 tasks
@DhairyaLGandhi
Member Author

Could you say more about the "switching out a few layers at a time"?

@ToucheSir
Member

ToucheSir commented Feb 20, 2021

Sure, here's some pseudo-Julia of gradual unfreezing:

model = Chain(block1, block2, ...blockN, mlp_head) # pretrained

testmode!(model[1:end-1])

unfrozen = 1
for i in 1:epochs
  train!(model, loss, opt, data)

  if i % 2 == 0 # unfreeze one layer group (block) every other epoch
    trainmode!(model[end-unfrozen])
    unfrozen += 1
  end
end

To my knowledge, there is no way to do this ergonomically (let alone efficiently) in Flux right now without train/testmode! (writing one's own wrapper over functor doesn't count). I also don't see how active is a code smell but something like BatchNorm mutating its parameters on the forward pass isn't. Both make sense to remove if/when we move to a more functional layer model, but in the meantime both are a necessary evil for a feature-complete ML framework.

@CarloLucibello
Member

To my knowledge, there is no way to do this ergonomically (let alone efficiently) in Flux right now

The purpose of testmode! and trainmode! is to switch normalization and dropout layers to inference or training mode. This is a control we have that is currently decoupled from which parameters we decide to train.
The current way to unfreeze layers is something like this:

model = Chain(block1, block2, ...blockN, mlp_head) # pretrained
ps = Flux.params(model[end])
unfrozen = 1
for i in 1:epochs
  train!(ps, loss, opt, data)

  if i % 2 == 0 # unfreeze one layer group (block) every other epoch
    # ps = Flux.params(ps..., Flux.params(model[end-unfrozen])...)
    # or
    ps = Flux.params(model[end-unfrozen:end]) 
    unfrozen += 1
  end
end

We could think about extending trainmode! like this

trainmode!(ps, submodel)

Where now trainmode! sets the state of the normalization layers and also adds the parameters to ps

@ToucheSir
Member

Sorry, I meant to write "in Flux right now without train/testmode!" (i.e. immutably without mutating layer params). I also forgot to include param gathering in the code snippet, so we should be on the same page with that.

We could think about extending trainmode! like this

trainmode!(ps, submodel)

Is that required if no gradients are being generated when the layer is in testmode! anyhow? Personally I find the explicit params management in your code snippet far clearer.

@CarloLucibello
Member

Actually, I think the correct management of activation mode and training parameters would look like this

model = Chain(block1, block2, ...blockN, mlp_head) # pretrained
testmode!(model)

ps = Flux.params(model[end])
trainmode!(model[end], :auto)

unfrozen = 1
for i in 1:epochs
  train!(ps, loss, opt, data)
  if i % 2 == 0 # unfreeze one layer group (block) every other epoch
     push!(ps, model[end-unfrozen])
     trainmode!(model[end-unfrozen], :auto)
     unfrozen += 1
  end
end

We can provide better documentation for the use case, and also implement the push! if we haven't already.
I agree that trainmode!(ps, block) is obscure and doesn't add much convenience

@DhairyaLGandhi
Member Author

I find the explicit approach clearer as well, so I feel we are in agreement over the intention of the PR. I do find the notion of pushing into params a bit odd, though, and would err against it.

The correct implementation would essentially generate ps (in the loop, not mutate) based on the layers intended to be trained.

@darsnack
Member

Bringing the discussion back to active: it is not about trainable parameters, it is about accumulating normalization statistics. I'm fine with lifting the field into the type (though I'd be curious to see how big the performance benefit is). But I want to be clear that users should be able to make statistic accumulation (or dropout, for the dropout layers) inactive regardless of whether there's a gradient path or not.

IMO the code smell around active comes from :auto which only exists to provide a cute Zygote trick that I don't think is that useful in practice. I almost always use trainmode!/testmode! no matter what.

@ToucheSir
Member

ToucheSir commented Feb 22, 2021

Interesting, I'm ambivalent on the implementation of :auto, but very much appreciate what it + istraining enable and sorely miss them when using PyTorch**.

I'm fine with lifting the field into the type

My point was that doing this would make testmode! and trainmode! impossible to implement because you can't change the type of a value in-place like you can a field. You would have to find every reference to the layer and change that, which with the current API would require shenanigans like chain.layers[i] = testmode(chain[i]) because most containers are effectively immutable.

** IMO layers should not be mutating their internal state when called outside of a gradient context unless the user has specifically opted into it with something like trainmode!. This includes RNNs, but that's a design discussion for another day.
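To make that trade-off concrete, here is a toy illustration (not Flux's actual API; Chain and swaplayer here are stand-ins) of what an immutable mode switch entails: rebuilding the container with one entry replaced.

```julia
# Toy sketch: with immutable layers, "setting test mode" on layer i means
# reconstructing the chain with that entry swapped, not flipping a field.
struct Chain{T<:Tuple}
    layers::T
end
Chain(ls...) = Chain(ls)

# Apply the layers left to right.
(c::Chain)(x) = foldl((y, l) -> l(y), c.layers; init = x)

# A lens-like helper: return a new Chain with layer `i` replaced by `new`.
swaplayer(c::Chain, i::Int, new) =
    Chain(ntuple(j -> j == i ? new : c.layers[j], length(c.layers)))
```

Every container above the swapped layer has to be rebuilt the same way, which is exactly where a lens interface would help.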

@darsnack
Member

Haha, I guess everyone has different preferences.

Yeah, I agree that a lens interface would make an immutable trainmode/testmode easier. I'm also ambivalent to the implementation (the current one is fine IMO), just wanted to make sure we still allow a hard mode as well.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Feb 22, 2021

In #1509 I have added a NormConfig which holds these configurations and is (hopefully) compiled away; it could be used for layer activeness.

For the active field: layers are, and remain, opinionated implementations of certain operations. They are expected to work a certain way in and out of training contexts. The functions are supposed to be less opinionated and run the operations directly. So it's odd that there's pushback against telling users to use the functional forms when they want to opt out of the layers' behavior.

@darsnack
Member

darsnack commented Feb 22, 2021

I can't really tell from the PR you linked: if I do testmode!(m) and then use m in a gradient call, are the statistics accumulated or not? I'm trying to figure out what behavior is proposed for the layers vs. the functions (I don't care whether the implementation is a field, a config struct, or a type parameter).

@ToucheSir
Member

Being opinionated is fine, but removing features and not providing viable alternatives not so much. For example, the whole point of batchnorm is that it's a container that encapsulates some global statistics. Replacing that with a batchnorm function doesn't make any sense! There's also the matter of saving and loading params. If I read in an arbitrary model and want to freeze a subset of normalization layers, how do I do that with these "less opinionated functions" without tearing apart and rebuilding the rest of the network manually via a bunch of functor calls that will break the moment the topology changes slightly?

To stop us from arguing this in circles, do you mind transcribing @CarloLucibello's snippet above to show how it would be done without active? Otherwise I'm not convinced active can be removed.

@DhairyaLGandhi
Member Author

Thanks, but if you look at the PR I linked to, I am capturing all the "configuration" in a single place and dispatching on it. It's a bit crude, but for small models it gave a decent speedup (about 30%), and for others it was faster but not by as much. It also helps inference, which is a nice byproduct.

It mirrors how we do things with ConvDims. I am not suggesting that we can do it for every layer anyway, but saying we do have a path forward for viable alternatives.

The first snippet from #1301 (comment) is much clearer than the one with modes.

IMO the code smell around active comes from :auto

To me, the repeated testmode! code is the code smell.

@darsnack
Member

darsnack commented Feb 22, 2021

I don't think anyone has a problem with the approach of NormConfig. It's just that the title of this PR and the discussion seem to imply some kind of deprecation of the active concept (not just the field implementation).

@DhairyaLGandhi
Member Author

I definitely do think we need a better way forward for active than using the fields with the modes - encouraging those seems wrong. I want code to be meaningful, and to know exactly what the layers are doing by looking at the forward pass, instead of going down streams of helpers, utilities, and special cases. With conv, I agree that the first pass doesn't say much, but with the nnlib machinery, it is clear where the functions are defined and what is happening.

@ToucheSir
Member

I mean yes +:100:, but until we have a way to do the active + modes thing without the field + mutation I think it has to stay. That's why I proposed looking into a functional lens interface for updating parts of a network ergonomically upthread.

with the nnlib machinery, it is clear where the functions are defined and what is happening

I'll open a GH discussion so we can continue this thread, but I don't see how using NNLib directly makes sense for layers like Conv or BatchNorm that carry state in the form of parameters. Even if said state is immutable, it still needs to exist somewhere and few users are going to be happy with manually keeping track of weight + bias arrays to pass into the NNLib function when they can just create a Conv and let Flux track those for them.

@CarloLucibello
Member

Closing as stale.
