Hey. Please excuse the flurry of bug reports.
It seems that gradients with respect to shared parameters are not working correctly. This is most evident when working with invertible architectures.
Take the following example. We apply a simple coupling layer F and then apply its inverse B. If F and B share the same parameters, they cancel out and the result is just the input vector (to machine precision). This happens regardless of the specific parameters ps, which means that any gradient should be zero.
What happens instead is that the AD engine cannot tell that the parameters are tied and returns something else. For this simple example, the gradient it returns is in fact identical to the one for non-tied parameters.
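Concretely, writing the coupling layer as F(x1, x2) = (f(x1) + x2, x1) and its inverse as B(y1, y2) = (y2, y1 - f(y2)), we get B(F(x1, x2)) = (x1, f(x1) + x2 - f(x1)) = (x1, x2) for any parameters of f, so the derivative of any loss of the output with respect to those parameters is identically zero.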
# LuxHelpers
using Lux, ComponentArrays, Random, Zygote
rng = Random.default_rng()
# Define a coupling layer and its inverse:
struct LeapFrog{T} <: Lux.AbstractExplicitLayer
    sub_net::T
end
(frog::LeapFrog)(x, ps, st) = (frog.sub_net(x[1], ps, st)[1] + x[2], x[1]), st
Lux.initialparameters(rng::AbstractRNG, frog::LeapFrog) = Lux.initialparameters(rng, frog.sub_net)
Lux.initialstates(rng::AbstractRNG, frog::LeapFrog) = Lux.initialstates(rng, frog.sub_net)
struct BackFrog{T} <: Lux.AbstractExplicitLayer
    sub_net::T
end
(frog::BackFrog)(x, ps, st) = (x[2], x[1] - frog.sub_net(x[2], ps, st)[1]), st
Lux.initialparameters(rng::AbstractRNG, frog::BackFrog) = Lux.initialparameters(rng, frog.sub_net)
Lux.initialstates(rng::AbstractRNG, frog::BackFrog) = Lux.initialstates(rng, frog.sub_net)
#Setup a Chain that applies the layer and the inverse in sequence:
D = Dense(1=>1)
F = LeapFrog(D)
B = BackFrog(D)
C = Chain(; f1=F, b1=B)
ps, st = Lux.setup(rng, C)
ps_share = Lux.share_parameters(ps, (("f1", "b1"),))
# For shared parameters, the Chain is just the identity:
v = ([1.0], [1.0])
C(v, ps, st)        # not v
C(v, ps_share, st)  # v
# Toy loss: sum of both components of the Chain output
toy_loss(P) = C(v, P, st)[1] |> sum |> sum
toy_loss(ps)        # some value depending on the random initialization
toy_loss(ps_share)  # 2.0, since the Chain is the identity here
# Take the gradient; it should be zero for the shared parameters
grad(p) = Zygote.gradient(toy_loss, p)
grad(ps) == grad(ps_share)  # true: the tied gradient equals the untied one instead of being zero
This does not seem to be an issue with Zygote itself:
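For instance, a rough pure-Zygote version of the same round trip, with a hand-written dense layer standing in for Lux's Dense (illustrative only), returns a zero gradient when the same W and b are reused in both passes:

using Zygote
dense(x, W, b) = W * x .+ b
function roundtrip(W, b)
    x = ([1.0], [1.0])
    # forward (LeapFrog) followed by its inverse (BackFrog), same W and b in both
    y = (dense(x[1], W, b) .+ x[2], x[1])
    z = (y[2], y[1] .- dense(y[2], W, b))
    return sum(z[1]) + sum(z[2])
end
W, b = randn(1, 1), randn(1)
Zygote.gradient(roundtrip, W, b)  # both gradients are zero: the two contributions cancel exactly

So when the tie is visible to the AD engine as the same arrays being used twice, the accumulated contributions cancel as expected; the problem seems to lie in how share_parameters represents the tie in the ps tree.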
I'm aware that this is tagged as experimental. Still, this is a neural-network library; if something cannot be used with gradients, perhaps it shouldn't be exposed to users?
What are the plans with regard to share_parameters? It would be of immense importance to my work; please let me know if there's something I could do to help.
Closing this since the behavior here is expected; one is supposed to use an Optimisers.jl / Functors-based approach, which accumulates the gradients for tied parameters exactly once.
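Roughly, for the reproduction above (reusing C, v, st, ps, and toy_loss, and assuming Optimisers.jl's detection of arrays aliased via ===; this is a sketch of the pattern, not a drop-in replacement for share_parameters):

using Optimisers

# Tie the parameters structurally: both branches of the ps tree hold the *same* arrays.
ps_tied = (f1 = ps.f1, b1 = ps.f1)
toy_loss(ps_tied)  # 2.0, the Chain is again the identity

# Zygote still reports separate gradients for f1 and b1 ...
grads = Zygote.gradient(toy_loss, ps_tied)[1]

# ... but Optimisers.jl detects the aliased arrays when building the optimiser state,
# accumulates both contributions into a single shared leaf, and updates it only once.
opt_state = Optimisers.setup(Optimisers.Descent(0.1), ps_tied)
opt_state, ps_tied = Optimisers.update(opt_state, ps_tied, grads)

Since the two per-branch gradients are summed before the update, the net step for this identity Chain should come out to zero, which matches the gradient one expects for the tied parameters.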