
Enhancing RepControl by introducing the pca_model's explained_variance_ratio_ #22

Open
semicircle opened this issue Nov 21, 2023 · 2 comments


@semicircle

Currently, after training the rep_reader, the coeff variable used in the control pipeline needs to be tuned solely by experiment, and its value varies a lot across models. Taking primary_emotions as an example, here are the values I found:

# LLaMA-2-Chat-13B coeff=3.0-3.5
# mistralai/Mistral-7B-Instruct-v0.1 coeff=0.5
# HuggingFaceH4/zephyr-7b-beta coeff=0.3
# openchat/openchat_3.5 coeff=0.2

This makes it challenging for RepControl to adapt to new models.

My finding is that incorporating the pca_model's explained_variance_ratio_ into the control process makes the manipulation more "gentle" / "accurate".

Here are the key modifications. In rep_readers.py:

def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
    """Get PCA components for each layer"""
    directions = {}

    # like directions, save the variance ratio for each layer
    variance_ratio = {}

    for layer in hidden_layers:
        ...  # existing PCA-fitting code unchanged

        self.n_components = pca_model.n_components_
        variance_ratio[layer] = pca_model.explained_variance_ratio_

    self.variance_ratio = variance_ratio
    return directions

Each layer's variance_ratio represents how much of the variance is concentrated along the extracted direction, which can be interpreted as a 'confidence' score for that layer in the control step.
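For intuition, here is a self-contained sketch of that per-layer "confidence" (toy data and plain sklearn only, nothing from this library): when the hidden-state differences mostly vary along one direction, the first component's explained_variance_ratio_ is close to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for one layer's hidden-state differences: points that
# vary mostly along a single direction, plus a little isotropic noise.
rng = np.random.default_rng(0)
line = rng.normal(size=(100, 1)) @ rng.normal(size=(1, 16))
data = line + 0.1 * rng.normal(size=(100, 16))

pca_model = PCA(n_components=1).fit(data)
confidence = pca_model.explained_variance_ratio_[0]

# A ratio near 1.0 means the first component captures most of the
# variance, so the extracted direction is more trustworthy.
assert 0.0 < confidence <= 1.0
```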

So, when manipulating the output, the activations are calculated as:

coeff = 0.2
coeff_with_variance = 2.0

activations = {}
activations_with_variance = {}

for layer in layer_id:
    activations[layer] = torch.tensor(
        coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]
    ).to(model.device).half()

    # scale the control direction by the layer's explained-variance ratio
    variance_ratio = rep_reader.variance_ratio[layer][0]
    activations_with_variance[layer] = torch.tensor(
        coeff_with_variance
        * rep_reader.directions[layer]
        * rep_reader.direction_signs[layer]
        * variance_ratio
    ).to(model.device).half()

Applying this method seems to let all the 7B models I've tested share a common coeff value of approximately 2.0.

I came up with this idea when I saw that WrappedBlock uses the controller (activations) to manipulate the tensor in a simple linear way, so I took the variance_ratio into account in the simplest possible way. Extracting the PCA model's underlying singular values might give even better control.
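As a rough sketch of that direction (toy data; `components_` and `singular_values_` are sklearn PCA attributes, not part of this library), one could scale the control direction by its singular value instead of the explained-variance ratio:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 16))

pca_model = PCA(n_components=1).fit(data)

# components_ holds unit-norm principal directions; singular_values_
# carries the matching magnitudes. Scaling the control direction by its
# singular value folds the component's "strength" into the controller.
direction = pca_model.components_[0]
scaled_direction = pca_model.singular_values_[0] * direction

assert np.isclose(np.linalg.norm(direction), 1.0)
```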

Thanks for sharing this great work!

@andyzoujm
Owner

Hi,

Thanks for putting this together. Seems practically very useful. Feel free to open a PR if you'd like to integrate this into the library.

Best,
Andy

@semicircle
Author

Hi,

Some updates on this:

Adding this ratio to the activations does not mean totally stable control.
The coeff still has to be revised to suit the prompt: taking 'anger' emotion control as an example, a happy scenario may need a larger coeff than a neutral one to make the response look angry. It seems the activation needs to be adjusted accordingly.

I have noticed the newly added piecewise_linear operator there, and I am trying to add some code in parallel to implement this feature.

Thanks~
