TapeGlobal and thoughts on the tape variants #843
Comments
I don't see a way to do global/shared tapes without using some form of Arc/Mutex. FWIW the optimizers.update returns an error that indicates whether some tensors that were supposed to have gradients did not, so that was intended to capture this error.
Are these variants the only way to reuse a taped tensor more than once during a forward pass? E.g. when x is used twice in:
    TapeGlobal::init();
    let x = dev
        .tensor(1.0)
        .put_tape(TapeGlobal);
    let y = x.clone() * x;
    dbg!(TapeGlobalTensor(y).backward());
What's the difference between this and retaping x (instead of cloning)? I'm trying to continue work on #437, but am suspicious that some of the retaping is incorrect (particularly lines 105 and 115 in rl-ppo-continuous.rs).
TapeGlobal
Here's another tape-tracking API to consider, as implemented in client code:
TapeGlobal does what it says: it maintains a global, thread-local tape, thus avoiding the partitioned-gradients problem*.
Here's how you use it:
Let me know if you'd like a PR with something like this.
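For readers without the client code at hand, here is a minimal sketch of the general idea of a global, thread-local tape. It is not the TapeGlobal implementation from this issue and it does not use the dfdx API: `Var`, `var`, `mul`, `backward`, `TAPE`, and `NEXT_ID` are hypothetical names, and values are plain `f64` scalars rather than tensors. It only illustrates how a single thread-local tape lets a value be used twice (as in `x.clone() * x`) while all gradients land in one place.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Toy gradient store: tensor id -> accumulated scalar gradient (hypothetical).
type Grads = HashMap<usize, f64>;

/// One recorded backward step; it reads and updates the gradient store.
type BackwardOp = Box<dyn FnOnce(&mut Grads)>;

thread_local! {
    // A single tape per thread: every op on this thread records here.
    static TAPE: RefCell<Vec<BackwardOp>> = RefCell::new(Vec::new());
    // Counter used to hand out unique ids.
    static NEXT_ID: RefCell<usize> = RefCell::new(0);
}

/// A scalar "tensor" that carries no tape of its own.
#[derive(Clone, Copy)]
struct Var {
    id: usize,
    value: f64,
}

fn fresh_id() -> usize {
    NEXT_ID.with(|n| {
        let mut n = n.borrow_mut();
        let id = *n;
        *n += 1;
        id
    })
}

/// Create a leaf value.
fn var(value: f64) -> Var {
    Var { id: fresh_id(), value }
}

/// Multiply two values and record the backward step on the global tape.
fn mul(a: Var, b: Var) -> Var {
    let out = Var { id: fresh_id(), value: a.value * b.value };
    TAPE.with(|t| {
        t.borrow_mut().push(Box::new(move |g: &mut Grads| {
            let grad_out = g.get(&out.id).copied().unwrap_or(0.0);
            // Accumulate, so using the same value twice sums both paths.
            *g.entry(a.id).or_insert(0.0) += grad_out * b.value;
            *g.entry(b.id).or_insert(0.0) += grad_out * a.value;
        }));
    });
    out
}

/// Drain the thread-local tape and replay it in reverse from `out`.
fn backward(out: Var) -> Grads {
    let ops = TAPE.with(|t| std::mem::take(&mut *t.borrow_mut()));
    let mut grads = Grads::new();
    grads.insert(out.id, 1.0);
    for op in ops.into_iter().rev() {
        op(&mut grads);
    }
    grads
}
```

With this sketch, `let x = var(3.0); let y = mul(x, x); let grads = backward(y);` leaves the gradient for `x` under `grads[&x.id]`, with both uses of `x` accumulated on the same tape. The trade-off is the one described in this issue: the tape is hidden global state and is no longer a type parameter.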
Thoughts on the tape variants
As you know, I have very non-standard model inputs and outputs, and I've been playing with different tape-tracking APIs in the hope of finding one that's easy to use and not error-prone, specifically: OwnedTape<_, _>, Arc<Mutex<OwnedTape<_, _>>>, Arc<Mutex<Arc<Mutex<OwnedTape<_, _>>>>>, and TapeGlobal.
Here are my thoughts:
1. OwnedTape<_, _>: Simple and avoids Arc, but suffers from the partitioned-gradients problem, so the programmer must either 1) mentally track which tape has which gradients, or 2) do a final accumulate across all the model outputs to merge all the tapes into one.
2. Arc<Mutex<OwnedTape<_, _>>>: Uses Arc (sorta bad) and only partially solves the partitioned-gradients problem.
3. Arc<Mutex<Arc<Mutex<OwnedTape<_, _>>>>>: Uses nested Arc<Mutex<_>> (bad code smell) and still does not fully solve the partitioned-gradients problem, e.g. when dependency paths originate from model parameters and never interact with the main tape (this can occur in a conditional computation setting). This behavior isn't terrible, because these gradients will be zero, but you will get an unwrap error if you expect them to be defined.
4. TapeGlobal: This does fully solve the partitioned-gradients problem, at the expense of maintaining an ugly global variable and losing type parameterization.
I'm personally still not sure which API I prefer between 1, 3, and 4 (I don't love any of them).
But I think 2 is essentially useless and rather dangerous, as you can accidentally omit gradients from your tape, and the shared state makes the tape accumulation difficult to reason about.
*Partitioned-gradients problem: When the gradients for your network are partitioned across several tapes and you as a programmer have to worry about which tape has which gradients.
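To make the footnote concrete, here is a toy sketch of the problem and of the "final accumulate" workaround mentioned for option 1. It does not use dfdx types: `Grads`, `backward_product`, `partitioned_grads`, and `merge` are hypothetical names, and each "tape" is reduced to a map from parameter name to scalar gradient. The assumed model has two outputs, `out_a = w1 * shared` and `out_b = w2 * shared`, each recorded on its own tape.

```rust
use std::collections::HashMap;

/// Toy gradient store keyed by parameter name (a stand-in for a real tape).
type Grads = HashMap<&'static str, f64>;

/// Backward pass of `out = p * shared`, recorded on its own tape.
fn backward_product(p_name: &'static str, p: f64, shared: f64) -> Grads {
    let mut g = Grads::new();
    g.insert(p_name, shared); // d(out)/dp = shared
    g.insert("shared", p); // d(out)/dshared = p
    g
}

/// Two model outputs, each with its own tape, so the gradients are partitioned.
fn partitioned_grads() -> (Grads, Grads) {
    let (w1, w2, shared) = (2.0, 5.0, 3.0);
    (
        backward_product("w1", w1, shared),
        backward_product("w2", w2, shared),
    )
}

/// Sum gradients from `other` into `into`, so one store covers every parameter.
fn merge(mut into: Grads, other: Grads) -> Grads {
    for (name, grad) in other {
        *into.entry(name).or_insert(0.0) += grad;
    }
    into
}
```

Here the first tape has no entry for `w2` at all, which is the situation where code that expects every parameter gradient to be defined would hit an unwrap error. Only after `merge` does a single store hold gradients for `w1`, `w2`, and the summed contribution to `shared`.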