Differentiating with a NN #2284

Open
swilliamson7 opened this issue Jan 28, 2025 · 0 comments

Comments

@swilliamson7 (Collaborator)

I don't really know what to title this issue, but basically I'm having trouble differentiating my model when I use Lux.jl. The error output I'm getting is incredibly sparse and doesn't say much about what's going on.

Basically, I defined a one-layer NN, added it to the RHS of my model, and now want to use Enzyme to compute derivatives. Roughly, the setup looks like the following minimal sketch (placeholder names, sizes, and cost function; the real code is in the private repo):
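
using Lux, Enzyme, Random

# Placeholder stand-in for the real model: a one-layer NN whose output is
# added to the RHS of an explicit time step. Names and sizes are made up
# for illustration.
rng = Random.default_rng()
nn = Dense(22 => 22, tanh)                  # the one-layer NN
ps, st = Lux.setup(rng, nn)

function integrate(nn, ps, st, u0, nsteps)
    u = copy(u0)
    for _ in 1:nsteps
        rhs, _ = nn(u, ps, st)              # NN contribution to the RHS
        u = u .+ 0.01 .* rhs                # forward Euler step
    end
    return sum(abs2, u)                     # scalar cost for reverse mode
end

u0 = randn(rng, 22)
dps = Enzyme.make_zero(ps)                  # shadow holding the parameter gradient
autodiff(Reverse, integrate, Active,
         Const(nn), Duplicated(ps, dps), Const(st), Const(u0), Const(10))

I'm seeing the following behaviors when testing everything: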

  1. When I run a one-day integration with Enzyme.jl, without checkpointing, everything seems to be okay and the model finishes running.

  2. When I run a ten-day integration with Enzyme.jl, still without checkpointing, I get the following output in my terminal:

swilliamson@CRIOS-A66253 ~/D/G/S/eddy-stresses> julia --project=. eddy_paper.jl &>flux_nn_output_nocp_10days.txt
[1]    68956 killed     julia --project=. eddy_paper.jl &> flux_nn_output_nocp_10days.txt

with no actual error output. This looks like what I've seen in the past when running out of memory, but a ten-day integration shouldn't be running out of memory.

  3. In response to (2), I instead tried running the ten-day integration with Enzyme.jl and Checkpointing.jl (my driver follows the pattern sketched after the error output below); the run now aborts:
swilliamson@CRIOS-A66253 ~/D/G/S/eddy-stresses> julia --project=. eddy_paper.jl &>flux_nn_output_withcp_10days.txt
[1]    69589 abort      julia --project=. eddy_paper.jl &> flux_nn_output_withcp_10days.txt

with the specific error message:

GC error (probable corruption)
Allocations: 6376696578 (Pool: 6376685678; Big: 10900); GC: 1352

!!! ERROR in jl_ -- ABORTING !!!

thread 0 ptr queue:
~~~~~~~~~~ ptr queue top ~~~~~~~~~~
Memory{Float64}(22, 0x3d611a240)[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
==========
Memory{Float64}(22, 0x3d611a310)[-0.132311, -0.127292, -0.122289, -0.135518, -0.130366, -0.125266, -0.138521, -0.133286, -0.128096, 0.122237, 0.119279, 0.116337, 0.125068, 0.12206, 0.119058, 0.127711, 0.124677, 0.12163, 0.110448, 0.111002, 0.10225, 0.10284]
==========
~~~~~~~~~~ ptr queue bottom ~~~~~~~~~~

[69589] signal 6: Abort trap: 6
in expression starting at /Users/swilliamson/Documents/GitHub/ShallowWaters_work/eddy-stresses/eddy_paper_run_experiments.jl:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 6376696578 (Pool: 6376685678; Big: 10900); GC: 1352
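
For reference, my checkpointed driver follows the usual Checkpointing.jl pattern, roughly like this (placeholder model, step counts, and snapshot count, not the actual ShallowWaters code; the Revolve constructor arguments are from my reading of the docs and may differ by version):

using Checkpointing, Enzyme

# Placeholder stand-ins for the real integration.
mutable struct Model
    u::Vector{Float64}
end

advance!(m::Model) = (m.u .= m.u .+ 0.01 .* tanh.(m.u); nothing)

function integrate!(m::Model, scheme, nsteps)
    # Checkpointing.jl recomputes segments of this loop during the reverse
    # pass instead of taping every step.
    @checkpoint_struct scheme m for i in 1:nsteps
        advance!(m)
    end
    return sum(abs2, m.u)
end

nsteps, snaps = 1000, 20
scheme = Revolve{Model}(nsteps, snaps; verbose=0)
m, dm = Model(randn(22)), Model(zeros(22))   # primal and shadow
autodiff(Reverse, integrate!, Active, Duplicated(m, dm), Const(scheme), Const(nsteps))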

I'm running:

  • Julia v1.11.3,
  • Enzyme v0.13.28,
  • Checkpointing v0.9.7,
  • and Lux v1.6.0.

I'm also seeing substantial slowdowns when I use Enzyme on the Lux NN versus a handwritten single-layer NN, so getting to these errors takes a long time. By handwritten I mean something like this minimal sketch (made-up sizes):
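
using Enzyme

# Hand-rolled single layer used in the timing comparison (sizes made up).
struct TinyNN
    W::Matrix{Float64}
    b::Vector{Float64}
end

(nn::TinyNN)(x) = tanh.(nn.W * x .+ nn.b)

loss(nn, x) = sum(abs2, nn(x))

nn  = TinyNN(randn(22, 22), randn(22))
dnn = TinyNN(zeros(22, 22), zeros(22))   # shadow accumulating the gradient
x   = randn(22)
autodiff(Reverse, loss, Active, Duplicated(nn, dnn), Const(x))

All my code is in a private repo, but @wsmoses should have access. I'm happy to elaborate if anything is unclear. Any and all advice and assistance here is greatly appreciated!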
