Differentiating with a NN #2284

Open
swilliamson7 opened this issue Jan 28, 2025 · 0 comments

Comments

@swilliamson7 (Collaborator)

I don't really know what to title this issue, but basically I'm having trouble differentiating my model when I use Lux.jl. The error output I'm getting is incredibly sparse and doesn't say much about what's going on.

Basically, I defined a one-layer NN, added it to the RHS of my model, and now want to use Enzyme to compute derivatives. Roughly, the setup looks like the following minimal sketch (placeholder names, sizes, and cost function; the real code is in the private repo):
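
using Lux, Enzyme, Random

# Placeholder stand-in for the real model: a one-layer NN whose output is
# added to the RHS of an explicit time step. Names and sizes are made up
# for illustration.
rng = Random.default_rng()
nn = Dense(22 => 22, tanh)                  # the one-layer NN
ps, st = Lux.setup(rng, nn)

function integrate(nn, ps, st, u0, nsteps)
    u = copy(u0)
    for _ in 1:nsteps
        rhs, _ = nn(u, ps, st)              # NN contribution to the RHS
        u = u .+ 0.01 .* rhs                # forward Euler step
    end
    return sum(abs2, u)                     # scalar cost for reverse mode
end

u0 = randn(rng, 22)
dps = Enzyme.make_zero(ps)                  # shadow holding the parameter gradient
autodiff(Reverse, integrate, Active,
         Const(nn), Duplicated(ps, dps), Const(st), Const(u0), Const(10))

I'm seeing the following behaviors when testing everything: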

  1. When I run a one-day integration with Enzyme.jl, without checkpointing, everything seems to be okay and the model finishes running.

  2. When I run a ten-day integration with Enzyme.jl, still without checkpointing, I get the following output in my terminal:

swilliamson@CRIOS-A66253 ~/D/G/S/eddy-stresses> julia --project=. eddy_paper.jl &>flux_nn_output_nocp_10days.txt
[1]    68956 killed     julia --project=. eddy_paper.jl &> flux_nn_output_nocp_10days.txt

with no actual error output. This looks like what I've seen in the past when running out of memory, but a ten-day integration shouldn't be running out of memory.

  3. In response to (2), I instead tried running the ten-day integration with Enzyme.jl and Checkpointing.jl (my driver follows the pattern sketched after the error output below); the run now aborts:
swilliamson@CRIOS-A66253 ~/D/G/S/eddy-stresses> julia --project=. eddy_paper.jl &>flux_nn_output_withcp_10days.txt
[1]    69589 abort      julia --project=. eddy_paper.jl &> flux_nn_output_withcp_10days.txt

with the specific error message:

GC error (probable corruption)
Allocations: 6376696578 (Pool: 6376685678; Big: 10900); GC: 1352

!!! ERROR in jl_ -- ABORTING !!!

thread 0 ptr queue:
~~~~~~~~~~ ptr queue top ~~~~~~~~~~
Memory{Float64}(22, 0x3d611a240)[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
==========
Memory{Float64}(22, 0x3d611a310)[-0.132311, -0.127292, -0.122289, -0.135518, -0.130366, -0.125266, -0.138521, -0.133286, -0.128096, 0.122237, 0.119279, 0.116337, 0.125068, 0.12206, 0.119058, 0.127711, 0.124677, 0.12163, 0.110448, 0.111002, 0.10225, 0.10284]
==========
~~~~~~~~~~ ptr queue bottom ~~~~~~~~~~

[69589] signal 6: Abort trap: 6
in expression starting at /Users/swilliamson/Documents/GitHub/ShallowWaters_work/eddy-stresses/eddy_paper_run_experiments.jl:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 6376696578 (Pool: 6376685678; Big: 10900); GC: 1352
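
For reference, my checkpointed driver follows the usual Checkpointing.jl pattern, roughly like this (placeholder model, step counts, and snapshot count, not the actual ShallowWaters code; the Revolve constructor arguments are from my reading of the docs and may differ by version):

using Checkpointing, Enzyme

# Placeholder stand-ins for the real integration.
mutable struct Model
    u::Vector{Float64}
end

advance!(m::Model) = (m.u .= m.u .+ 0.01 .* tanh.(m.u); nothing)

function integrate!(m::Model, scheme, nsteps)
    # Checkpointing.jl recomputes segments of this loop during the reverse
    # pass instead of taping every step.
    @checkpoint_struct scheme m for i in 1:nsteps
        advance!(m)
    end
    return sum(abs2, m.u)
end

nsteps, snaps = 1000, 20
scheme = Revolve{Model}(nsteps, snaps; verbose=0)
m, dm = Model(randn(22)), Model(zeros(22))   # primal and shadow
autodiff(Reverse, integrate!, Active, Duplicated(m, dm), Const(scheme), Const(nsteps))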

I'm running:

  • Julia v1.11.3,
  • Enzyme v0.13.28,
  • Checkpointing v0.9.7,
  • and Lux v1.6.0.

I'm also seeing substantial slowdowns when I use Enzyme on the Lux NN versus a handwritten single-layer NN, so getting to these errors takes a long time. By handwritten I mean something like this minimal sketch (made-up sizes):
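
using Enzyme

# Hand-rolled single layer used in the timing comparison (sizes made up).
struct TinyNN
    W::Matrix{Float64}
    b::Vector{Float64}
end

(nn::TinyNN)(x) = tanh.(nn.W * x .+ nn.b)

loss(nn, x) = sum(abs2, nn(x))

nn  = TinyNN(randn(22, 22), randn(22))
dnn = TinyNN(zeros(22, 22), zeros(22))   # shadow accumulating the gradient
x   = randn(22)
autodiff(Reverse, loss, Active, Duplicated(nn, dnn), Const(x))

All my code is in a private repo, but @wsmoses should have access. I'm happy to elaborate if anything is unclear. Any and all advice and assistance here is greatly appreciated!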
