add trainables
#171
Conversation
Implemented custom rule for trainables1.

trainables1
  9.833 μs (54 allocations: 2.34 KiB)
trainables2
  11.625 μs (94 allocations: 4.30 KiB)
trainables3
  22.625 μs (189 allocations: 5.70 KiB)

gradient trainables1
  29.584 μs (213 allocations: 268.50 KiB)
gradient trainables2
  1.825 ms (8419 allocations: 601.59 KiB)
gradient trainables3
  307.000 μs (2636 allocations: 377.53 KiB)
I'll focus on

using BenchmarkTools
using Optimisers
using Functors
using Zygote, Flux
using ChainRulesCore
function trainables1(x)
    arrays = AbstractArray[]
    exclude(x) = Optimisers.isnumeric(x)
    # walk only the trainable part of the struct and collect every numeric leaf array
    fmap(x; exclude, walk = Optimisers._TrainableStructWalk()) do y
        push!(arrays, y)
        return y
    end
    return arrays
end
function ∇trainables1(x, Δ)
    exclude(x) = Optimisers.isnumeric(x)
    i = 0
    # rebuild a structural tangent for x, filling each numeric leaf with the
    # corresponding cotangent from Δ, in the order trainables1 collected them
    return fmapstructure(x; exclude, walk = Optimisers._TrainableStructWalk()) do _
        return Δ[i+=1]
    end
end
function ChainRulesCore.rrule(::typeof(trainables1), x)
    y = trainables1(x)
    trainables_back(Δ) = (NoTangent(), ∇trainables1(x, unthunk(Δ)))
    return y, trainables_back
end
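A minimal usage sketch (mine, not part of the diff) of how trainables1 and its custom rrule can be exercised; it assumes the definitions above have been evaluated, and the small Dense layer is purely illustrative:

using Flux, Zygote

m = Dense(3 => 2)
ps = trainables1(m)                                     # collects m.weight and m.bias
loss(m) = sum([sum(abs2, p) for p in trainables1(m)])
g = gradient(loss, m)[1]
# g should mirror m's trainable structure, with g.weight ≈ 2 .* m.weight
# and g.bias ≈ 2 .* m.bias, produced through the rrule above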
############
using Functors: AbstractWalk, _map, _values, execute, ExcludeWalk

struct TrainableWalk2 <: AbstractWalk end

function (walk::TrainableWalk2)(recurse, x, ys...)
    x_children = Optimisers.trainable(x)
    ys_children = map(Optimisers.trainable, ys)
    res = map(recurse, x_children, ys_children...)
    return reduce(vcat, values(res), init=[])
end

function trainables2(x)
    exclude(x) = Optimisers.isnumeric(x) && Functors.isleaf(x)
    return execute(ExcludeWalk(TrainableWalk2(), x -> [x], exclude), x)
end
struct TrainableWalk3 <: AbstractWalk end

function (walk::TrainableWalk3)(recurse, x, ys...)
    x_children = Optimisers.trainable(x)
    ys_children = map(Optimisers.trainable, ys)
    res = map(recurse, x_children, ys_children...)
    return vcat(values(res)...)
end

function trainables3(x)
    exclude(x) = Optimisers.isnumeric(x)
    return execute(ExcludeWalk(TrainableWalk3(), x -> [x], exclude), x)
end
function floss(ps)
    sum([sum(abs2, p) for p in ps])
end
using Flux

function perf()
    m = Chain(Dense(128 => 128, relu),
              Dense(128 => 128, relu),
              BatchNorm(128),
              x -> x^2,
              Dense(128 => 128, relu),
              Dense(128 => 128, relu))

    println("trainables1")
    @btime floss(trainables1($m))
    println("trainables2")
    @btime floss(trainables2($m))
    println("trainables3")
    @btime floss(trainables3($m))
    println()

    println("gradient trainables1")
    @btime gradient(m -> floss(trainables1(m)), $m)
    println("gradient trainables2")
    @btime gradient(m -> floss(trainables2(m)), $m)
    println("gradient trainables3")
    @btime gradient(m -> floss(trainables3(m)), $m)
    nothing
end
Zygote.refresh()
perf()
Could anyone review?
My only long-term suggestion is to address FluxML/Functors.jl#81 and adjust fleaves to match the performance here. But these are implementation details that don't affect the API, and this seems ready as is.
One API consideration before merging and releasing is whether this needs to be separate from #173. It's trivial to ignore the path if it's not relevant. Also, while okay for Functors, I'm not a fan of duplicated
Co-authored-by: Kyle Daruwalla <[email protected]>
I kept #173 separate to simplify the review of this one. Let's continue the discussion there.
* trainables
* trainables
* cl/trainables
* trainables
* test second order derivatives
* add doc section
* fix test
* Update src/trainables.jl
An alternative to #57, adding a trainables method that returns a vector of arrays. I'm playing with different implementations at the moment; the output of the perf() function is given above.

trainables1 is the fastest, but since it is mutating it needs a custom rrule for differentiation. Probably the rrule for destructure can be adapted for this case: https://github.com/FluxML/Optimisers.jl/blob/master/src/destructure.jl

The gradients of the other two implementations are very slow, so in those cases we would also need a custom rule.
TODO
* fmap
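For readers arriving after the merge, a hypothetical usage sketch of the trainables function this PR adds (the name follows the PR; keyword options, e.g. returning paths as discussed in #173, may differ from what was eventually released):

using Flux, Optimisers, Zygote

model = Chain(Dense(28^2 => 32, relu), Dense(32 => 10))
ps = trainables(model)          # Vector holding the model's trainable parameter arrays
sum(length, ps)                 # total number of trainable scalars

# differentiating through it, as the benchmarks above do with floss
g = gradient(m -> sum([sum(abs2, p) for p in trainables(m)]), model)[1]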