Currently, after training the rep_reader, the coeff variable used in the control pipeline has to be tuned purely by experiment, and the value varies a lot. Taking primary_emotions as an example, here are the values I found:
This makes it challenging for RepControl to adapt to new models.
My finding is that introducing the pca_model's explained_variance_ratio_ into the control process makes the manipulation "gentler" / more "accurate".
Here are the key modifications.
In rep_readers.py:
def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
    """Get PCA components for each layer"""
    directions = {}
    # like directions, save the explained variance ratio for each layer
    variance_ratio = {}
    for layer in hidden_layers:
        ........
        self.n_components = pca_model.n_components_
        variance_ratio[layer] = pca_model.explained_variance_ratio_
    self.variance_ratio = variance_ratio
    return directions
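With this change, the ratio stays attached to the reader, so it can be read back later as rep_reader.variance_ratio[layer] alongside rep_reader.directions[layer] when the control activations are built.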
Each layer's variance_ratio reflects how much of the variance in the hidden states the extracted direction actually explains, which can be interpreted as a per-layer 'confidence' score for that direction in the control stage.
So, when manipulating the output, the activation variable is calculated with this ratio folded in, roughly as sketched below.
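A minimal sketch of what I mean, assuming the usual way activations are built in the RepE control examples (coeff times the signed reading direction); names like layer_ids, coeff, rep_reader, and model follow those examples, and variance_ratio[layer][0] takes the ratio of the first PCA component:

import torch

# Hypothetical sketch: scale each layer's steering vector by that layer's
# explained variance ratio before applying the shared coeff.
activations = {}
for layer in layer_ids:
    direction = rep_reader.directions[layer] * rep_reader.direction_signs[layer]
    confidence = rep_reader.variance_ratio[layer][0]  # first (and usually only) PCA component
    activations[layer] = torch.tensor(coeff * confidence * direction).to(model.device).half()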
Applying this method seems to let all the 7B models I've tested share a common coeff value, somewhere around 2.0.
I came up with this idea when I saw that WrappedBlock uses the controller (the activations) to manipulate the hidden-state tensor in a simple linear way.
So I just folded the variance_ratio into that linear step in the simplest possible way, as sketched below.
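If I read the linear operator correctly, the wrapped block essentially adds the controller to the hidden states, so the variance-weighted activations just act as a smaller additive nudge per layer (an illustration of the idea, not the repo's exact code):

# Hypothetical illustration of the linear control step.
# `controller` is the per-layer activation vector built above.
def apply_linear_control(hidden_states, controller):
    # A variance-weighted steering vector pushes the representation more
    # gently in layers where the reading direction explains less variance.
    return hidden_states + controller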
Perhaps extracting the PCA model's underlying singular vectors could give even finer control over this.
Thanks for sharing this great work!
Adding this ratio to the activations doesn't mean the control is completely stable, though.
The coeff still has to be revised to fit the prompt. Taking 'anger' emotion control as an example, a happy scenario may need a larger coeff than a neutral one for the response to read as angry, so it seems the activations need to be adjusted accordingly.
I have noticed the newly added piecewise_linear operator there, and I am trying to add some code in parallel to implement this feature.