
Enhancing RepControl by introducing the pca_model's explained_variance_ratio_ #22

Open
semicircle opened this issue Nov 21, 2023 · 2 comments


@semicircle

Currently, after training the rep_reader, the coeff variable used in the control pipeline needs to be tuned solely by experiment, and its value varies a lot across models. Taking primary_emotions as an example, here are the values I found:

# LLaMA-2-Chat-13B coeff=3.0-3.5
# mistralai/Mistral-7B-Instruct-v0.1 coeff=0.5
# HuggingFaceH4/zephyr-7b-beta coeff=0.3
# openchat/openchat_3.5 coeff=0.2

This makes it challenging for RepControl to adapt to new models.

My finding is that incorporating the pca_model's explained_variance_ratio_ into the control process makes the manipulation more "gentle" / "accurate".

Here are the key modifications. In rep_readers.py:

def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
    """Get PCA components for each layer"""
    directions = {}

    # like directions, save the variance ratio for each layer
    variance_ratio = {}

    for layer in hidden_layers:
        ...  # existing PCA-fitting code unchanged

        self.n_components = pca_model.n_components_
        variance_ratio[layer] = pca_model.explained_variance_ratio_

    self.variance_ratio = variance_ratio
    return directions

Each layer's variance_ratio represents how much of the variance is concentrated along the extracted direction, which can be interpreted as a 'confidence' score for that layer in the control step.
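For intuition, here is a self-contained sketch of that per-layer "confidence" (toy data and plain sklearn only, nothing from this library): when the hidden-state differences mostly vary along one direction, the first component's explained_variance_ratio_ is close to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for one layer's hidden-state differences: points that
# vary mostly along a single direction, plus a little isotropic noise.
rng = np.random.default_rng(0)
line = rng.normal(size=(100, 1)) @ rng.normal(size=(1, 16))
data = line + 0.1 * rng.normal(size=(100, 16))

pca_model = PCA(n_components=1).fit(data)
confidence = pca_model.explained_variance_ratio_[0]

# A ratio near 1.0 means the first component captures most of the
# variance, so the extracted direction is more trustworthy.
assert 0.0 < confidence <= 1.0
```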

So, when manipulating the output, the activations are calculated as:

coeff = 0.2
coeff_with_variance = 2.0

activations = {}
activations_with_variance = {}

for layer in layer_id:
    activations[layer] = torch.tensor(
        coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]
    ).to(model.device).half()

    # scale the control direction by the layer's explained-variance ratio
    variance_ratio = rep_reader.variance_ratio[layer][0]
    activations_with_variance[layer] = torch.tensor(
        coeff_with_variance
        * rep_reader.directions[layer]
        * rep_reader.direction_signs[layer]
        * variance_ratio
    ).to(model.device).half()

Applying this method seems to let all the 7B models I've tested share a common coeff value of approximately 2.0.

I came up with this idea when I saw that WrappedBlock uses the controller (activations) to manipulate the tensor in a simple linear way, so I took the variance_ratio into account in the simplest possible way. Extracting the PCA model's underlying singular values might give even better control.
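As a rough sketch of that direction (toy data; `components_` and `singular_values_` are sklearn PCA attributes, not part of this library), one could scale the control direction by its singular value instead of the explained-variance ratio:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 16))

pca_model = PCA(n_components=1).fit(data)

# components_ holds unit-norm principal directions; singular_values_
# carries the matching magnitudes. Scaling the control direction by its
# singular value folds the component's "strength" into the controller.
direction = pca_model.components_[0]
scaled_direction = pca_model.singular_values_[0] * direction

assert np.isclose(np.linalg.norm(direction), 1.0)
```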

Thanks for sharing this great work!

@andyzoujm
Owner

Hi,

Thanks for putting this together. Seems practically very useful. Feel free to open a PR if you'd like to integrate this into the library.

Best,
Andy

@semicircle
Author

Hi,

Some updates on this:

Adding this ratio to the activations does not mean totally stable control.
The coeff still has to be revised to suit the prompt: taking 'anger' emotion control as an example, a happy scenario may need a larger coeff than a neutral one to make the response look angry. It seems the activation needs to be adjusted accordingly.

I have noticed the newly added piecewise_linear operator there, and I am trying to add some code in parallel to implement this feature.

Thanks~
