
Deprecate active field, and auto keyword #1301

Closed
wants to merge 6 commits into from

Conversation

DhairyaLGandhi
Member

@DhairyaLGandhi DhairyaLGandhi commented Aug 3, 2020

The auto functionality can generally be implemented without the extra noise of defining extra methods, since training and inference behaviour (including forced dropout) can be disambiguated without them. This PR implements Dropout the layer and dropout the kernel so that they behave consistently with how they are expected to be called.

The current system has been difficult to scale and is prone to bugs unless special handlers are defined, which may themselves be misinterpreted. We can already support the same features with less potential for confusion.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@DhairyaLGandhi DhairyaLGandhi marked this pull request as draft August 3, 2020 06:27
@DhairyaLGandhi DhairyaLGandhi changed the title Deprecate auto field Deprecate auto field, and active Aug 3, 2020
@DhairyaLGandhi DhairyaLGandhi changed the title Deprecate auto field, and active Deprecate active field, and auto keyword Aug 3, 2020
@DhairyaLGandhi DhairyaLGandhi reopened this Dec 1, 2020
@DhairyaLGandhi DhairyaLGandhi added this to the v0.12 milestone Feb 5, 2021
@CarloLucibello
Member

I don't understand what this PR is trying to do; the OP is not very clear. Also, the PR contains some unrelated code. This is a draft PR that hasn't been worked on since August, so it shouldn't be blocking the v0.12 release.

@CarloLucibello
Member

By the way, from what I gather this goes back in the direction things were before the introduction of testmode! and the layers' activity state. It's been discussed already: we want to be able to set a layer to testmode or trainmode independently of whether we are in a gradient context. So current master is fine in my opinion, but maybe I didn't understand what this PR is trying to do.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Feb 8, 2021

That is the job of the functional layers this PR was working towards. It's not blocking so much as it needs to be prioritised, to remove a lot of the code smell accumulated over the course of changing the norm layers manually. We would still have the same behaviours as master, but should remove a lot of the keyword handling.

It aims to remove helpers like isactive and testmode!, and make that functionality available separately. You can see how it works in the dropout case. The unrelated work is my bad; I've removed it.

@darsnack
Member

darsnack commented Feb 8, 2021

AFAICT this PR just removes the active field in favor of istraining(). But the reason that active was introduced was because the automated behavior of istraining() is not desired in all cases. Is there a plan here to address that in a different way? I couldn't see how it was supposed to work in the dropout case.

@darsnack
Member

darsnack commented Feb 8, 2021

I'm all for a more functional design btw, but I guess I just keep seeing it mentioned, and I have no clue what's actually being proposed.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Feb 8, 2021

There are a few ways. One is to tell users to use the function directly, which should address ambiguities; another is to make active part of the type signature so its value is known at compile time for CSE. It would also reduce our reliance on small unions, which would be nice. Ultimately, we can retain the current default and have overrides in place for other cases.
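For concreteness, here is a minimal sketch (hypothetical names, not the PR's actual code) of what lifting active into the type could look like for dropout, so that dispatch picks the training or inference branch at compile time:

```julia
# Hypothetical sketch: `active` as a type parameter instead of a field.
struct Dropout{Active,F}
    p::F
end

# Outer constructor lifts the Bool into the type domain.
Dropout(p; active::Bool = true) = Dropout{active,typeof(p)}(p)

# Training variant: sample a mask and rescale the survivors.
(d::Dropout{true})(x) = x .* (rand(size(x)...) .> d.p) ./ (1 - d.p)

# Inference variant: a compile-time no-op.
(d::Dropout{false})(x) = x
```

Since Dropout{false} is a distinct type, the identity branch can be eliminated entirely; the trade-off raised elsewhere in this thread is that switching modes then means constructing a new struct.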

@darsnack
Member

darsnack commented Feb 8, 2021

I don't think telling the user to use the function is a good approach. IIRC the place I first encountered this issue, I had a model where I wanted to perform an adversarial attack and the batch norm behavior was messing up my results. Swapping out a layer for a function in my model is not something I want to do.

Lifting it into the type parameters seems like a nice idea though. But we'd still have auto as a keyword?

@ToucheSir
Member

How would in-place updates work if active becomes a type param? e.g. fine-tuning a model where you still want gradients but no batchnorm updates. Wouldn't that require constructing a new layer struct (which we've already determined is a no-go)? I think it would help to have a lot more detail on the proposed functional layer approach before we pull the trigger on anything like this.

@darsnack
Member

darsnack commented Feb 8, 2021

Well, I think the functional approach should be part of this PR? If there's a desire to deprecate without the replacement, then I'd strongly push against that.

Wouldn't that require constructing a new layer struct

I thought about that too. It does depend on how frequently you construct a copy with a different active field. There would be an allocation penalty, but it should only be for the outermost wrapper, right? Unless we want to deep copy the subfields, which, yeah, is a no-go AFAICT.

@darsnack
Member

darsnack commented Feb 8, 2021

But +1 for more details

@DhairyaLGandhi
Member Author

Not deep copy, and yes, the allocation would be for the wrapper.

I was planning on adding the functional part as well, FWIW.

@CarloLucibello
Member

its not blocking as much as needs to be prioritised

I'll remove the v0.12 label then, the milestones should contain only release-blocking issues/PR at this point

@CarloLucibello CarloLucibello removed this from the v0.12 milestone Feb 8, 2021
@DhairyaLGandhi
Member Author

How often does one switch out the active field while training? And is the allocation an issue where it would be tested?

@DhairyaLGandhi
Member Author

I wouldn't want to ship the norm layers as they are.

@darsnack
Member

darsnack commented Feb 8, 2021

Is that relevant to this PR or to #1495?

@ToucheSir
Member

How often does one switch out the active field while training?

For fine-tuning, probably only once. For something like gradual unfreezing though, you'd need a mechanism for only switching out a few layers at a time. To me this implies some sort of lens interface, which doesn't currently exist in Flux.

@darsnack darsnack mentioned this pull request Feb 17, 2021
8 tasks
@DhairyaLGandhi
Member Author

Could you say more about the "switching out a few layers at a time"?

@ToucheSir
Member

ToucheSir commented Feb 20, 2021

Sure, here's some pseudo-Julia of gradual unfreezing:

model = Chain(block1, block2, ...blockN, mlp_head) # pretrained

testmode!(model[1:end-1])

unfrozen = 1
for i in 1:epochs
  train!(model, loss, opt, data)

  if i % 2 == 0 # unfreeze one layer group (block) every other epoch
    trainmode!(model[end-unfrozen])
    unfrozen += 1
  end
end

To my knowledge, there is no way to do this ergonomically (let alone efficiently) in Flux right now without train/testmode! (writing one's own wrapper over functor doesn't count). I also don't see how active is a code smell but something like BatchNorm mutating its parameters on the forward pass isn't. Both make sense to remove if/when we move to a more functional layer model, but in the meantime both are a necessary evil for a feature-complete ML framework.

@CarloLucibello
Member

To my knowledge, there is no way to do this ergonomically (let alone efficiently) in Flux right now

The purpose of testmode! and trainmode! is to switch normalization and dropout layers to inference or training mode. This is a control we have that is currently decoupled from which parameters we decide to train.
The current way to unfreeze layers is something like this:

model = Chain(block1, block2, ...blockN, mlp_head) # pretrained
ps = Flux.params(model[end])
unfrozen = 1
for i in 1:epochs
  train!(ps, loss, opt, data)

  if i % 2 == 0 # unfreeze one layer group (block) every other epoch
    # ps = Flux.params(ps..., Flux.params(model[end-unfrozen])...)
    # or
    ps = Flux.params(model[end-unfrozen:end]) 
    unfrozen += 1
  end
end

We could think about extending trainmode! like this

trainmode!(ps, submodel)

Where now trainmode! sets the state of the normalization layers and also adds the parameters to ps

@ToucheSir
Member

Sorry, I meant to write "in Flux right now without train/testmode!" (i.e. immutably without mutating layer params). I also forgot to include param gathering in the code snippet, so we should be on the same page with that.

We could think about extending trainmode! like this

trainmode!(ps, submodel)

Is that required if no gradients are being generated when the layer is in testmode! anyhow? Personally I find the explicit params management in your code snippet far clearer.

@CarloLucibello
Member

Actually, I think the correct management of activation mode and training parameters would look like this

model = Chain(block1, block2, ...blockN, mlp_head) # pretrained
testmode!(model)

ps = Flux.params(model[end])
trainmode!(model[end], :auto)

unfrozen = 1
for i in 1:epochs
  train!(ps, loss, opt, data)
  if i % 2 == 0 # unfreeze one layer group (block) every other epoch
     push!(ps, model[end-unfrozen])
     trainmode!(model[end-unfrozen], :auto)
     unfrozen += 1
  end
end

We can provide better documentation for the use case, and also implement the push! if we haven't already.
I agree that trainmode!(ps, block) is obscure and doesn't add much convenience

@DhairyaLGandhi
Member Author

I find the explicit approach clearer as well, so I feel we are in agreement over the intention of the PR. I do find the notion of pushing into params a bit odd, though, and would err against it.

The correct implementation would essentially generate ps (in the loop, not mutate) based on the layers intended to be trained.

@darsnack
Member

Bringing the discussion back to active: it is not about trainable parameters, it is about accumulating normalization statistics. I'm fine with lifting the field into the type (though I'd be curious to see how big the performance benefit is). But I want to be clear that users should be able to make statistic accumulation (or dropout, for the dropout layers) inactive regardless of whether there's a gradient path or not.

IMO the code smell around active comes from :auto which only exists to provide a cute Zygote trick that I don't think is that useful in practice. I almost always use trainmode!/testmode! no matter what.

@ToucheSir
Member

ToucheSir commented Feb 22, 2021

Interesting, I'm ambivalent on the implementation of :auto, but very much appreciate what it + istraining enable and sorely miss them when using PyTorch**.

I'm fine with lifting the field into the type

My point was that doing this would make testmode! and trainmode! impossible to implement because you can't change the type of a value in-place like you can a field. You would have to find every reference to the layer and change that, which with the current API would require shenanigans like chain.layers[i] = testmode(chain[i]) because most containers are effectively immutable.

** IMO layers should not be mutating their internal state when called outside of a gradient context unless the user has specifically opted into it with something like trainmode!. This includes RNNs, but that's a design discussion for another day.
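To make that trade-off concrete, here is a toy illustration (not Flux's actual API; Chain and swaplayer here are stand-ins) of what an immutable mode switch entails: rebuilding the container with one entry replaced.

```julia
# Toy sketch: with immutable layers, "setting test mode" on layer i means
# reconstructing the chain with that entry swapped, not flipping a field.
struct Chain{T<:Tuple}
    layers::T
end
Chain(ls...) = Chain(ls)

# Apply the layers left to right.
(c::Chain)(x) = foldl((y, l) -> l(y), c.layers; init = x)

# A lens-like helper: return a new Chain with layer `i` replaced by `new`.
swaplayer(c::Chain, i::Int, new) =
    Chain(ntuple(j -> j == i ? new : c.layers[j], length(c.layers)))
```

Every container above the swapped layer has to be rebuilt the same way, which is exactly where a lens interface would help.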

@darsnack
Member

Haha, I guess everyone has different preferences.

Yeah, I agree that a lens interface would make an immutable trainmode/testmode easier. I'm also ambivalent to the implementation (the current one is fine IMO), just wanted to make sure we still allow a hard mode as well.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Feb 22, 2021

In #1509 I have added a NormConfig which holds these configurations and is (hopefully) compiled away; it could be used for layer activeness.

For the active field: layers are, and remain, opinionated implementations of certain operations. They are expected to work a certain way in and out of training contexts. The functions are supposed to be less opinionated and run the operations directly. So it's odd that there's pushback against telling users to use the functional forms when they want to opt out of the layers' behavior.

@darsnack
Member

darsnack commented Feb 22, 2021

I can't really tell from the PR you linked: if I do testmode!(m) and then use m in a gradient call, are the statistics accumulated or not? I'm trying to figure out what behavior is proposed for the layers vs. the functions (I don't care whether the implementation is a field, a config struct, or a type parameter).

@ToucheSir
Member

Being opinionated is fine, but removing features and not providing viable alternatives not so much. For example, the whole point of batchnorm is that it's a container that encapsulates some global statistics. Replacing that with a batchnorm function doesn't make any sense! There's also the matter of saving and loading params. If I read in an arbitrary model and want to freeze a subset of normalization layers, how do I do that with these "less opinionated functions" without tearing apart and rebuilding the rest of the network manually via a bunch of functor calls that will break the moment the topology changes slightly?

To stop us from arguing this in circles, do you mind transcribing @CarloLucibello's snippet above to show how it would be done without active? Otherwise I'm not convinced active can be removed.

@DhairyaLGandhi
Member Author

Thanks, but if you look at the PR I linked to, I am capturing all the "configuration" in a single place and dispatching on it. It's a bit crude, but for small models it gave a decent speedup (about 30%), and for others it was faster but not by as much. It also helps inference, which is a nice byproduct.

It mirrors how we do things with ConvDims. I am not suggesting that we can do it for every layer anyway, but saying we do have a path forward for viable alternatives.

The first snippet from #1301 (comment) is much clearer than the one with modes.

IMO the code smell around active comes from :auto

To me, the repeated testmode! code is the code smell.

@darsnack
Member

darsnack commented Feb 22, 2021

I don't think anyone has a problem with the approach of NormConfig. It's just that the title of this PR and the discussion seem to imply some kind of deprecation of the active concept (not just the field implementation).

@DhairyaLGandhi
Member Author

I definitely do think we need a better way forward for active than using the fields with the modes - encouraging those seems wrong. I want code to be meaningful, and to know exactly what the layers are doing by looking at the forward pass, instead of going down streams of helpers, utilities, and special cases. With conv, I agree that the first pass doesn't say much, but with the nnlib machinery, it is clear where the functions are defined and what is happening.

@ToucheSir
Member

I mean yes +:100:, but until we have a way to do the active + modes thing without the field + mutation I think it has to stay. That's why I proposed looking into a functional lens interface for updating parts of a network ergonomically upthread.

with the nnlib machinery, it is clear where the functions are defined and what is happening

I'll open a GH discussion so we can continue this thread, but I don't see how using NNLib directly makes sense for layers like Conv or BatchNorm that carry state in the form of parameters. Even if said state is immutable, it still needs to exist somewhere and few users are going to be happy with manually keeping track of weight + bias arrays to pass into the NNLib function when they can just create a Conv and let Flux track those for them.

@CarloLucibello
Member

Closing as stale.
