Hey. Please excuse the flurry of bug reports.
It seems that gradients with respect to shared parameters are not working correctly. This is most evident when working with invertible architectures.
Take the following example. We apply a simple coupling layer F and then apply its inverse B. If F and B share the same parameters, they cancel out and the result is just the input vector (to machine precision). This happens regardless of the specific parameters ps, which means that any gradient should be zero.
What happens instead is that the AD engine cannot tell that the parameters are tied and returns something else. For this simple example, the gradient it returns is in fact identical to the one for non-tied parameters.
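Concretely, writing the coupling layer as F(x1, x2) = (f(x1) + x2, x1) and its inverse as B(y1, y2) = (y2, y1 - f(y2)), we get B(F(x1, x2)) = (x1, f(x1) + x2 - f(x1)) = (x1, x2) for any parameters of f, so the derivative of any loss of the output with respect to those parameters is identically zero.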
# LuxHelpers
using Lux, ComponentArrays, Random, Zygote
rng = Random.default_rng()
# Define a coupling layer and its inverse:
struct LeapFrog{T} <: Lux.AbstractExplicitLayer
    sub_net::T
end
(frog::LeapFrog)(x, ps, st) = (frog.sub_net(x[1], ps, st)[1] + x[2], x[1]), st
Lux.initialparameters(rng::AbstractRNG, frog::LeapFrog) = Lux.initialparameters(rng, frog.sub_net)
Lux.initialstates(rng::AbstractRNG, frog::LeapFrog) = Lux.initialstates(rng, frog.sub_net)
struct BackFrog{T} <: Lux.AbstractExplicitLayer
    sub_net::T
end
(frog::BackFrog)(x, ps, st) = (x[2], x[1] - frog.sub_net(x[2], ps, st)[1]), st
Lux.initialparameters(rng::AbstractRNG, frog::BackFrog) = Lux.initialparameters(rng, frog.sub_net)
Lux.initialstates(rng::AbstractRNG, frog::BackFrog) = Lux.initialstates(rng, frog.sub_net)
#Setup a Chain that applies the layer and the inverse in sequence:
D = Dense(1=>1)
F = LeapFrog(D)
B = BackFrog(D)
C = Chain(; f1=F, b1=B)
ps, st = Lux.setup(rng, C)
ps_share = Lux.share_parameters(ps, (("f1", "b1"),))
# For shared parameters, the Chain is just the identity:
v = ([1.0], [1.0])
C(v, ps, st)        # not v
C(v, ps_share, st)  # v
# Toy loss: sum of both components of the Chain output
toy_loss(P) = C(v, P, st)[1] |> sum |> sum
toy_loss(ps)        # some value depending on the random initialization
toy_loss(ps_share)  # 2.0, since the Chain is the identity here
# Take the gradient; it should be zero for the shared parameters
grad(p) = Zygote.gradient(toy_loss, p)
grad(ps) == grad(ps_share)  # true: the tied gradient equals the untied one instead of being zero
This does not seem to be an issue with Zygote itself:
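For instance, a rough pure-Zygote version of the same round trip, with a hand-written dense layer standing in for Lux's Dense (illustrative only), returns a zero gradient when the same W and b are reused in both passes:

using Zygote
dense(x, W, b) = W * x .+ b
function roundtrip(W, b)
    x = ([1.0], [1.0])
    # forward (LeapFrog) followed by its inverse (BackFrog), same W and b in both
    y = (dense(x[1], W, b) .+ x[2], x[1])
    z = (y[2], y[1] .- dense(y[2], W, b))
    return sum(z[1]) + sum(z[2])
end
W, b = randn(1, 1), randn(1)
Zygote.gradient(roundtrip, W, b)  # both gradients are zero: the two contributions cancel exactly

So when the tie is visible to the AD engine as the same arrays being used twice, the accumulated contributions cancel as expected; the problem seems to lie in how share_parameters represents the tie in the ps tree.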
I'm aware that this is tagged as experimental. Still, this is a neural-network library; if something cannot be used with gradients, perhaps it shouldn't be exposed to users?
What are the plans with regard to share_parameters? It would be of immense importance to my work; please let me know if there's something I could do to help.
Closing this since the behavior here is expected; one is supposed to use an Optimisers.jl / Functors-based approach, which accumulates the gradients for tied parameters exactly once.
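Roughly, for the reproduction above (reusing C, v, st, ps, and toy_loss, and assuming Optimisers.jl's detection of arrays aliased via ===; this is a sketch of the pattern, not a drop-in replacement for share_parameters):

using Optimisers

# Tie the parameters structurally: both branches of the ps tree hold the *same* arrays.
ps_tied = (f1 = ps.f1, b1 = ps.f1)
toy_loss(ps_tied)  # 2.0, the Chain is again the identity

# Zygote still reports separate gradients for f1 and b1 ...
grads = Zygote.gradient(toy_loss, ps_tied)[1]

# ... but Optimisers.jl detects the aliased arrays when building the optimiser state,
# accumulates both contributions into a single shared leaf, and updates it only once.
opt_state = Optimisers.setup(Optimisers.Descent(0.1), ps_tied)
opt_state, ps_tied = Optimisers.update(opt_state, ps_tied, grads)

Since the two per-branch gradients are summed before the update, the net step for this identity Chain should come out to zero, which matches the gradient one expects for the tied parameters.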