Deprecate active field, and auto keyword #1301
Conversation
I don't understand what this PR is trying to do; the OP is not very clear. Also, the PR contains some unrelated code. This is a draft PR that hasn't been worked on since August, so it shouldn't be blocking the v0.12 release.
btw, from what I gather this goes in the direction things were in before the introduction of testmode! and the layers' activity state. It's been discussed already: we want to be able to set a layer to testmode or trainmode independently of being in a gradient context or not. So current master is fine in my opinion, but maybe I didn't understand what this PR is trying to do.
That is the job of the functional layers which this PR was working towards. It's not blocking so much as it needs to be prioritised, to remove a lot of the code smell accumulated over the course of changing the norm layers manually. We would still have the same behaviours as master, and should remove a lot of the keyword handling. It aims to remove the isactive and testmode! kinds of helpers, and make them available separately. You can see how it works in the dropout case. The unrelated work is my bad; removed it.
AFAICT this PR just removes the active field.
I'm all for a more functional design btw, but I guess I just keep seeing it mentioned, and I have no clue what's actually being proposed.
There are a few ways: one is to tell users to use the function directly, which should address ambiguities, and another is to make the active state part of the layer's type parameters.
I don't think telling the user to use the function is a good approach. IIRC, in the place I first encountered this issue, I had a model where I wanted to perform an adversarial attack and the batch norm behavior was messing up my results. Swapping out a layer for a function in my model is not something I want to do. Lifting it into the type parameters seems like a nice idea though. But we'd still have …
How would in-place updates work if active were a type parameter?
Well, I think the functional approach should be part of this PR? If there's a desire to deprecate without the replacement, then I'd strongly push against that.
I thought about that too. It does depend on the frequency at which you are constructing a copy with a different active state.
But +1 for more details
Not deep copy, and yes the allocation would be for the wrapper. I was planning on adding the functional part as well fwiw
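To make the type-parameter idea above concrete, here is a minimal sketch (invented names, not this PR's implementation): the mode lives in a type parameter, and "switching" it rebuilds only the thin wrapper while the parameter arrays stay shared, which is the allocation being discussed.

```julia
# Sketch only: the activity mode is baked into the type rather than stored in a
# mutable `active` field. Names (MyNorm, setactive, isactive) are illustrative.
struct MyNorm{Active, T}
    γ::T
    β::T
end

MyNorm(γ, β; active::Bool = true) = MyNorm{active, typeof(γ)}(γ, β)

# "changing" the mode allocates a new wrapper, but γ and β are the same arrays
setactive(m::MyNorm{A, T}, active::Bool) where {A, T} = MyNorm{active, T}(m.γ, m.β)

isactive(::MyNorm{A}) where {A} = A

m      = MyNorm(ones(Float32, 4), zeros(Float32, 4))  # train-mode layer
m_test = setactive(m, false)                          # test-mode view, no array copies
```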
I'll remove the v0.12 label then, the milestones should contain only release-blocking issues/PRs at this point
How often does one switch out the active field while training? And is the allocation an issue where it would be tested?
I wouldn't want to ship the norm layers as they are.
Is that relevant to this PR or to #1495?
For fine-tuning, probably only once. For something like gradual unfreezing though, you'd need a mechanism for only switching out a few layers at a time. To me this implies some sort of lens interface, which doesn't currently exist in Flux.
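For what a "lens interface" could look like: something in the spirit of Setfield.jl (shown here on toy types, not Flux's actual layers) rebuilds a nested immutable structure with a single field changed and everything else shared.

```julia
using Setfield  # illustration only; Flux itself has no lens interface here

struct ToyNorm{T}   # stand-in for a norm layer, not Flux's BatchNorm
    γ::T
    active::Bool
end

layers = (ToyNorm(ones(Float32, 4), true), ToyNorm(ones(Float32, 4), true))

# flip only the second layer's mode; all other fields (and arrays) are reused as-is
layers = @set layers[2].active = false
```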
Could you say more about the "switching out a few layers at a time"?
Sure, here's some pseudo-Julia of gradual unfreezing:

```julia
model = Chain(block1, block2, ...blockN, mlp_head) # pretrained
testmode!(model[1:end-1])
unfrozen = 1
for i in 1:epochs
    train!(model, loss, opt, data)
    if i % 2 == 0 # unfreeze one layer group (block) every other epoch
        trainmode!(model[end-unfrozen])
        unfrozen += 1
    end
end
```

To my knowledge, there is no way to do this ergonomically (let alone efficiently) in Flux right now without …
The purpose of …

```julia
model = Chain(block1, block2, ...blockN, mlp_head) # pretrained
ps = Flux.params(model[end])
unfrozen = 1
for i in 1:epochs
    train!(ps, loss, opt, data)
    if i % 2 == 0 # unfreeze one layer group (block) every other epoch
        # ps = Flux.params(ps..., Flux.params(model[end-unfrozen])...)
        # or
        ps = Flux.params(model[end-unfrozen:end])
        unfrozen += 1
    end
end
```

We could think about extending …
Where now …
Sorry, I meant to write "in Flux right now without train/testmode!" (i.e. immutably without mutating layer params). I also forgot to include param gathering in the code snippet, so we should be on the same page with that.
Is that required if no gradients are being generated when the layer is in testmode?
Actually, I think the correct management of activation mode and training parameters would look like this:

```julia
model = Chain(block1, block2, ...blockN, mlp_head) # pretrained
testmode!(model)
ps = Flux.params(model[end])
trainmode!(model[end], :auto)
unfrozen = 1
for i in 1:epochs
    train!(ps, loss, opt, data)
    if i % 2 == 0 # unfreeze one layer group (block) every other epoch
        push!(ps, model[end-unfrozen])
        trainmode!(model[end-unfrozen], :auto)
        unfrozen += 1
    end
end
```

We can provide better documentation for the use case, and also implement the …
I find the explicit approach clearer as well, so I feel we are in agreement over the intention of the PR. I find the notion of pushing into params a bit odd, though, and would err away from it. The correct implementation would essentially generate ps inside the loop (rather than mutate it) based on the layers intended to be trained.
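A minimal sketch of that non-mutating variant, reusing the names from the snippets above (model, loss, opt, data, and epochs are assumed to already exist) and Flux's actual train! argument order:

```julia
unfrozen = 0  # number of blocks unfrozen in addition to the head
for i in 1:epochs
    # regenerate the parameter collection from the layers we intend to train,
    # instead of push!-ing into an existing Params object
    ps = Flux.params(model[end-unfrozen:end])
    Flux.train!(loss, ps, data, opt)
    if i % 2 == 0  # unfreeze one more layer group (block) every other epoch
        unfrozen += 1
    end
end
```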
Bringing the discussion back to …: IMO the code smell around …
Interesting, I'm ambivalent on the implementation of …

My point was that doing this would make … **

** IMO layers should not be mutating their internal state when called outside of a gradient context unless the user has specifically opted into it with something like …
Haha I guess everyone has different preferences. Yeah, I agree that a lens interface would make an immutable …
In #1509 I have added a …

For the active field - layers are and remain opinionated implementations of certain operations. They are expected to work a certain way in and out of training contexts. The functions are supposed to be less opinionated and run the operations directly. So it's odd that there's pushback against telling users to use the functional forms when wanting to opt out of the layers' behavior.
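For concreteness, a rough sketch of the layer form versus the functional form for dropout (the exact keyword arguments are approximate for the Flux version under discussion; treat this as illustrative rather than the PR's API):

```julia
using Flux

x = randn(Float32, 10, 32)

# layer form: the layer owns its probability and an activity state that
# testmode!/trainmode! (or the gradient context) toggles behind the scenes
d = Dropout(0.5)
y_layer = d(x)

# functional form: the caller decides explicitly whether the operation runs,
# so there is no hidden active state to opt out of
p = 0.5
y_train = Flux.dropout(x, p)   # apply dropout
y_test  = x                    # at inference, simply don't call it
```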
I can't really tell from the PR you linked... if I do …
Being opinionated is fine, but removing features and not providing viable alternatives, not so much. For example, the whole point of batchnorm is that it's a container that encapsulates some global statistics. Replacing that with a …

To stop us from arguing this in circles, do you mind transcribing @CarloLucibello's snippet above to show how it would be done without …?
Thanks, but if you look at the PR I linked to, I am capturing all the "configuration" in a single place and dispatching. It's a bit crude, but for small models it gave a decent speedup (about 30%), and for others it was faster but not by as much. It also helps inference, which is a nice byproduct. It mirrors how we do things with …

The first snippet from #1301 (comment) is much more clear than the one with modes. To me, the repeated testmode code is it.
I don't think anyone has a problem with the approach of …
I definitely do think we need a better way forward for active than using the fields with the modes - encouraging those seems wrong. I want code to be meaningful, and to know exactly what the layers are doing by looking at the forward pass, instead of going down streams of helpers and utilities and special cases. With conv, I agree that the first pass doesn't say much, but with the nnlib machinery, it is clear where the functions are defined and what is happening.
I mean yes +:100:, but until we have a way to do the …
I'll open a GH discussion so we can continue this thread, but I don't see how using NNLib directly makes sense for layers like Conv or BatchNorm that carry state in the form of parameters. Even if said state is immutable, it still needs to exist somewhere, and few users are going to be happy with manually keeping track of weight + bias arrays to pass into the NNLib function when they can just create a …
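To illustrate that point, here is roughly what a purely functional batchnorm call would ask of the user. The function below is hypothetical (written out by hand, not an NNlib API); the layer call after it is the usual Flux form that keeps the bookkeeping inside the struct.

```julia
using Flux

x = randn(Float32, 8, 16)

# hypothetical functional form: all the state the layer would normally own
# (scale, shift, running statistics) must be threaded through by hand
function manual_batchnorm(x, γ, β, μ, σ²; ϵ = 1f-5)
    γ .* (x .- μ) ./ sqrt.(σ² .+ ϵ) .+ β
end

γ, β  = ones(Float32, 8), zeros(Float32, 8)
μ, σ² = zeros(Float32, 8), ones(Float32, 8)
y_manual = manual_batchnorm(x, γ, β, μ, σ²)

# the layer form carries that state for you
bn = BatchNorm(8)
y_layer = bn(x)
```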
closing as stale |
The `auto` functionality generally can be implemented without the extra noise that we see with having to define extra methods that can be easily disambiguated during training and during inference, including forcing `dropout`. This currently implements `Dropout` the layer and `dropout` the kernel to behave consistently with how they are expected to be called.

The current system has been a bit difficult to scale and is open to bugs without defining special handlers which may be misinterpreted. We already can support the same features with less potential to cause confusion.
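As a rough sketch of the design the description points at (illustrative names, not the PR's actual diff): the layer becomes a thin wrapper whose forward pass is just the kernel, so the layer and the function cannot drift apart.

```julia
using Random

# kernel: a plain function where the caller states explicitly whether it is active
function mydropout(x, p; active::Bool = true)
    active || return x
    mask = rand!(similar(x)) .> p          # keep each entry with probability 1 - p
    return x .* mask ./ (1 - p)            # inverted-dropout scaling
end

# layer: stores the probability and simply forwards to the kernel
struct MyDropout{F}
    p::F
end
(d::MyDropout)(x; active::Bool = true) = mydropout(x, d.p; active = active)
```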
PR Checklist
@dhairyagandhi96 (for API changes).