
Use destructure from Optimisers.jl #1901

Merged (8 commits) Mar 21, 2022

Conversation

@mcabbott (Member) commented Mar 5, 2022

This deletes destructure here to use the function from Optimisers.jl, of the same name and almost the same behaviour, except that:

But, of course, there are bugs; Flux's tests fail locally.

Should also add more tests... Now includes tests from these issues (a short destructure usage sketch follows this list):

Fixes #1826
Fixes #1767
Fixes #1601
Fixes #1727

Fixes #1733, closes #1742 (replaces)
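
For context, both the deleted Flux method and its Optimisers.jl replacement follow the same basic calling convention: they return a flat parameter vector plus a function that rebuilds a model of the same structure from such a vector. A minimal sketch of that shared usage (the layer sizes here are arbitrary, chosen only for illustration):

julia> using Flux

julia> m = Chain(Dense(2, 3, tanh), Dense(3, 1));

julia> v, re = Flux.destructure(m);   # flat parameter vector, plus a rebuilding closure

julia> length(v)                      # 3*2 + 3 from the first Dense, 1*3 + 1 from the second
13

julia> m2 = re(v);                    # reconstructed model; m2(x) should match m(x) for the same input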

@mcabbott mcabbott marked this pull request as draft March 5, 2022 22:45
@mcabbott mcabbott added this to the v0.13 milestone Mar 5, 2022
@mcabbott (Member, Author) commented Mar 7, 2022

Failures in Metalhead look like this; I don't see them on its master, and I'm not sure whether they are caused by Flux:

Inception3: Error During Test at /home/runner/work/Flux.jl/Flux.jl/downstream/test/convnets.jl:74
  Test threw exception
  Expression: size(m(rand(Float32, 299, 299, 3, 2))) == (1000, 2)
  MethodError: no method matching cat_channels(::Array{Float32, 4}, ::Array{Float32, 4}, ::Array{Float32, 4}, ::Array{Float32, 4})
  Closest candidates are:
    cat_channels(::Any, ::Any) at ~/work/Flux.jl/Flux.jl/downstream/src/utilities.jl:37
  Stacktrace:
    [1] (::Parallel{typeof(Metalhead.cat_channels), Tuple{Chain{Tuple{Conv{2, 4, typeof(identity),

Failures in FastAI and OperatorLearning are about the params export. I haven't checked whether they run the rest of their tests after a failure.

@darsnack (Member) commented Mar 7, 2022

Metalhead failures are unrelated to this PR: FluxML/Metalhead.jl#127

Project.toml (Outdated)
-version = "0.13.0-DEV"
+version = "0.12.99"

mcabbott (Member, Author): Shouldn't forget to change this back! Just trying to ensure the downstream tests do run.

@@ -4,8 +4,6 @@ using Zygote: IdSet
import Functors: Functors, @functor, functor, fmap, isleaf
using SparseArrays: AbstractSparseArray

-trainable(m) = functor(m)[1]

mcabbott (Member, Author): Keeping Flux's own generic trainable definition caused:

julia> using Flux
[ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c]
WARNING: Method definition trainable(Any) in module Optimisers at /Users/me/.julia/dev/Optimisers/src/interface.jl:69 overwritten in module Flux at /Users/me/.julia/dev/Flux/src/functor.jl:7.
  ** incremental compilation may be fatally broken for this module **
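
The warning arises because Optimisers.jl already owns a generic function called trainable, and a second generic definition of trainable(::Any) inside Flux overwrites it. A sketch of the usual way to avoid this, extending the existing function rather than redefining it (MyLayer and its fields are hypothetical, for illustration only):

import Optimisers: trainable          # extend Optimisers' function rather than defining a new one

struct MyLayer                        # hypothetical layer: one trainable field, one fixed buffer
    weight::Matrix{Float32}
    cache::Vector{Float32}
end

trainable(m::MyLayer) = (; weight = m.weight)   # adds a method; does not overwrite trainable(::Any)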

@mcabbott (Member, Author) commented Mar 8, 2022

@gabrevaya this might solve the example here: #1353 (comment). But what is opt?

julia> res = GalacticOptim.solve(optprob, opt, data, cb = cb)
ERROR: UndefVarError: opt not defined

@mcabbott mcabbott marked this pull request as ready for review March 8, 2022 05:10
@codecov-commenter commented Mar 8, 2022

Codecov Report

Merging #1901 (91d88fe) into master (ed78e8a) will decrease coverage by 0.14%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1901      +/-   ##
==========================================
- Coverage   86.57%   86.42%   -0.15%     
==========================================
  Files          18       18              
  Lines        1445     1422      -23     
==========================================
- Hits         1251     1229      -22     
+ Misses        194      193       -1     
Impacted Files Coverage Δ
src/functor.jl 88.13% <ø> (-0.20%) ⬇️
src/utils.jl 95.71% <ø> (+0.03%) ⬆️


@@ -27,7 +28,7 @@ jobs:
- {user: SciML, repo: DiffEqFlux.jl, group: Layers}
- {user: SciML, repo: NeuralPDE.jl, group: NNPDE}
- {user: SciML, repo: OperatorLearning.jl, group: All}
-if: contains(github.event.pull_request.labels.*.name, 'run downstream test')
+# if: contains(github.event.pull_request.labels.*.name, 'run downstream test')

Reviewer (Member): This change should be removed before merging.

mcabbott (Member, Author): Are you thinking of saving CI time?

Only the NNPDE one is long, 45 minutes. So I think downstream tests are quicker in total than the main tests. I was wondering whether it would be simplest to always run them, but not sure.

Maybe the ideal would be to set things up so that the basic tests on Julia 1 must pass before any others run. I think CUDA.jl has something like that, but I'm not sure how easy it is to set up.

Reviewer (Member): Even if they don't take more time, it's still wasted energy in most cases. Adding a label to sensitive PRs seems an effortless compromise.

@CarloLucibello (Member): As an extra layer of caution, we could rerun the downstream tests for NeuralPDE and DiffEqFlux with group All.

@mcabbott (Member, Author) commented Mar 12, 2022

Tests for DiffEqFlux.jl don't run; it sounds like there is a manifest checked in somewhere which wants NNlib 0.7, but I'm not sure where:

┌ Warning: Could not use exact versions of packages in manifest, re-resolving
└ @ Pkg.Operations /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1488
┌ Info: Not compatible with this release. No problem.
│   exception =
│    Unsatisfiable requirements detected for package NNlib [872c559c]:

Tests for NeuralPDE.jl fail; I'm not sure why:

Neural adapter Poisson equation, strategy: QuasiRandomTraining
NeuralAdapter: Test Failed at /home/runner/work/Flux.jl/Flux.jl/downstream/test/neural_adapter_tests.jl:117
  Expression: ≈(u_predict, u_real, rtol = 0.1)
   Evaluated: [-0.0854166448535627 -0.08352062584364406 … 0.07674157080594485 0.07800274129936807; -0.08466057494953605 -0.08276370383344656 … 0.07734593215866198 0.0786038386001856; … ; -0.010857086294706109 -0.009036067653524105 … 0.1257043933646262 0.12661947080281216; -0.010144510500895154 -0.008325800824682428 … 0.1260858784271197 0.12699759752385437] ≈ [0.0 0.0 … 0.0 0.0; 0.0 4.998355282382539e-5 … 4.9983552823825304e-5 1.9487653202985864e-19; … ; 0.0 4.9983552823825304e-5 … 4.99835528238252e-5 1.9487653202985828e-19; 0.0 1.9487653202985864e-19 … 1.9487653202985828e-19 7.597871817923733e-34] (rtol=0.1)
Stacktrace:
 [1] top-level scope
   @ /opt/hostedtoolcache/julia/1.7.2/x64/share/julia/stdlib/v1.7/Test/src/Test.jl:445

Maybe @ChrisRackauckas knows?

Both packages use destructure a lot.

@ChrisRackauckas (Member): We don't intentionally restrict NNlib to v0.7 anywhere that I know of. I'm rather surprised that it's not using v0.8, given the compats I see on the repos.

@ChrisRackauckas (Member): Test deps were done incorrectly. A new patch fixes that.

@mcabbott (Member, Author): Great. Any idea why the NeuralPDE tests fail, or how to isolate one?

@ChrisRackauckas (Member): Re-run the NeuralPDE one and see if you get the same result, or if it was just a bad initial condition? It's not failing other downstream tests.

@mcabbott mcabbott closed this Mar 13, 2022
@mcabbott mcabbott reopened this Mar 13, 2022
@ChrisRackauckas (Member): @yuehhua and @avik-pal was there another breaking change in GeometricFlux?

@mcabbott the NeuralPDE failure looks real, as in, that's one of the few cases that uses Chain instead of FastChain, and it just so happens to be failing now, but I cannot reproduce it in the repo's CI. @zoemcc can maybe help isolate this?

@yuehhua (Member) commented Mar 13, 2022

@ChrisRackauckas If there are breaking changes, I will tag a minor release for GeometricFlux. Currently, the breaking changes in GeometricFlux are in the message-passing network part. Did something happen here?

@mcabbott (Member, Author) commented Mar 14, 2022

For this test:
https://github.com/SciML/NeuralPDE.jl/blob/master/test/neural_adapter_tests.jl#L1-L121
I can't seem to load the necessary packages. Can anyone isolate this? Is it a bug in the new implementation, just a behavioural change such as flattening only trainable parameters, or reliance on a bug in the old one?
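
One of the candidate behaviour changes mentioned above, flattening only the trainable parameters, can be checked directly on a layer that carries non-trainable state. A minimal sketch, assuming (as in Flux) that BatchNorm's running statistics are not marked trainable:

julia> using Flux, Optimisers

julia> bn = BatchNorm(2);             # β and γ are trainable; the running μ and σ² are not

julia> v, _ = Optimisers.destructure(bn);

julia> length(v)                      # 4 if only the trainable fields (β, γ) are flattened;
                                      # 8 would mean the running statistics were included too

Comparing this count against what the old Flux.destructure returns for the same model is one quick way to tell whether a downstream discrepancy comes from this change.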

For DiffEqFlux, the current error looks unrelated:

┌ Warning: Assignment to `gradsc` in soft scope is ambiguous because a global variable by the same name exists: `gradsc` will be treated as a new local. Disambiguate by using `local gradsc` to suppress this warning or `global gradsc` to assign to the existing global variable.
└ @ ~/work/Flux.jl/Flux.jl/downstream/test/neural_de.jl:447
Test Summary:   | Pass  Broken  Total
Neural DE Tests |  149      50    199
Test Summary:             | Pass  Total
Augmented Neural DE Tests |    8      8
Neural Graph DE: Error During Test at /home/runner/.julia/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:25
  Got exception outside of a @test
  LoadError: MethodError: no method matching GeometricFlux.GCNConv(::GraphSignals.FeaturedGraph{GraphSignals.SparseGraph{false, SparseArrays.SparseMatrixCSC{Float32, Int64}, Vector{Int64}, Int64}, FillArrays.Fill{Float32, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, FillArrays.Fill{Float32, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, FillArrays.Fill{Float32, 1, Tuple{Base.OneTo{Int64}}}}, ::Pair{Int64, Int64})
  Closest candidates are:
    GeometricFlux.GCNConv(::A, ::B, ::F) where {A<:(AbstractMatrix), B, F} at ~/.julia/packages/GeometricFlux/wJWol/src/layers/conv.jl:27
    GeometricFlux.GCNConv(::Pair{Int64, Int64}, ::Any; init, bias) at ~/.julia/packages/GeometricFlux/wJWol/src/layers/conv.jl:32

but it's unclear to me whether https://github.com/SciML/DiffEqFlux.jl/blob/master/test/runtests.jl runs all tests after one failure, or gives up.

@mcabbott (Member, Author): Any luck isolating the NeuralPDE failure? (I think Chris meant to tag @zoemcc above, but the version with a link won't notify.)

@ChrisRackauckas (Member): Looks like it was a fluke 🎉

@mcabbott (Member, Author): Note that 91d88fe reverts the version and downstream-CI changes; the downstream tests now pass in about 2 minutes for packages which don't declare compatibility with Flux 0.13.

It's still easy to investigate locally via using Optimisers: destructure. But perhaps 10 days is enough, and we should merge this and revisit if there are in fact bugs?

@ChrisRackauckas (Member): Ahh, I thought that was a rebase. Yeah, sorry, it seems like I messed up the ping earlier. Feel free to merge, and if NeuralPDE has a failure then at least we know why.
