Dataset in example honest notebook #16

Open · Zijian007 opened this issue Nov 7, 2023 · 13 comments
@Zijian007

I notice that in the paper and in the example Jupyter code, the output of ASSISTANT (the response), i.e. the statement, is truncated. I would like to know the reason. Thank you so much!

@andyzoujm
Owner

To extract functions (such as being honest), we're collecting neural activity at every token position in the response, as described in step 2 of the LAT scan in the paper.
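A minimal sketch of what collecting activity at every response-token position can look like with Hugging Face transformers (the model, layer choice, and token alignment here are illustrative, not the repo's exact pipeline):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small model; the paper uses larger chat models.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

prompt = "USER: Pretend you're an honest person making statements about the world. ASSISTANT:"
response = " The blue whale is the largest animal on Earth."

inputs = tok(prompt + response, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors of shape [1, seq_len, hidden_dim].
layer = -1                                   # example: read from the last layer
hidden = out.hidden_states[layer][0]         # [seq_len, hidden_dim]

# Keep only the positions that belong to the response
# (approximate alignment, for the sake of the sketch).
n_resp = len(tok(response, add_special_tokens=False)["input_ids"])
response_acts = hidden[-n_resp:]             # one activation vector per response token
print(response_acts.shape)
```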

@Zijian007
Author

Thank you for your reply!

I notice that in the data processing loop

    for idx in range(1, len(tokens) - 5):

the input tokens are truncated. I'd like to know why all the tokens aren't fed into the model. Thank you!

@andyzoujm
Owner

It was a design choice, since unfinished sentences don't give a strong indication of honesty/dishonesty. But it might not matter that much.
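To make the truncation concrete, here is a rough sketch of what a loop like `for idx in range(1, len(tokens) - 5)` produces: each iteration keeps one more token of the statement, yielding progressively longer unfinished prefixes (the tokenizer and the `- 5` cutoff here are just illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

statement = "The Eiffel Tower is located in Paris, France."
tokens = tok.tokenize(statement)

# Each iteration yields a truncated (unfinished) version of the statement,
# which is then placed after the honest/untruthful persona prompt.
truncated_statements = [
    tok.convert_tokens_to_string(tokens[:idx])
    for idx in range(1, len(tokens) - 5)
]

for s in truncated_statements:
    print(repr(s))
```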

@Jeffwang87

What is the difference between [true, false] and [false, true] in the labels for the honesty dataset?

justinphan3110cais changed the title from "why the output of ASSISTANT(response) need to be truncated?" to "Dataset in example honest notebook" on Nov 17, 2023
@justinphan3110cais
Collaborator

It corresponds to the pairs in the training set. In the training set we randomly shuffle each pair, so some of them have the honest statement at index [0] and some have it at index [1].
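In other words, each training pair is a two-element list whose order is randomized, and the label records which position holds the honest statement. A toy version (variable names are hypothetical):

```python
import random

honest_statements = ["[truthful persona] The sky is blue.", "[truthful persona] Water is wet."]
untruthful_statements = ["[untruthful persona] The sky is blue.", "[untruthful persona] Water is wet."]

train_data, train_labels = [], []
for honest, untruthful in zip(honest_statements, untruthful_statements):
    honest_first = random.random() < 0.5     # randomly pick which slot is honest
    pair = [honest, untruthful] if honest_first else [untruthful, honest]
    train_data.append(pair)
    # The label marks the honest slot: [True, False] or [False, True].
    train_labels.append([honest_first, not honest_first])

print(train_data[0], train_labels[0])
```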

@Dakingrai

  1. Why are only true_statements considered for both training and testing? Why weren't false_statements used to generate the untruthful_statements?
  2. Why are the train labels randomly shuffled? Shouldn't they be [1] for honest_statements and [0] for untruthful_statements?

@shamikbosefj

@andyzoujm, I'm trying to understand the dataset built in honesty_function_dataset(). Only the true_statements are used to create the train_set, and I'm not sure why. I believe @Dakingrai asked a similar question above, but it was never answered.

@joshlevy89

joshlevy89 commented May 28, 2024

@shamikbosefj I'm not a contributor to this repo, but my guess is that since the true_statements are truncated under the functional stimulation paradigm and prefixed with "imagine you are a truthful..." or "imagine you are an untruthful...", it doesn't really matter. Because the statement is never completed, the door is left open to whatever completion follows, depending on whether the LLM is asked to be truthful or untruthful. This is probably better than creating each pair from one true and one false statement, because it reduces the amount of variability, so the activations are likely to vary only along the honesty/dishonesty axis. They surely could also have used false statements to create separate pairs (again, truncating the statement and prefixing as described), but they don't need that many pairs, so it probably just wasn't necessary.

As to your other question, I believe they need to shuffle in order to create variability along the axis of interest (honesty/dishonesty). Otherwise, when PCA is done on the differences, there is no variability across the pairs in the direction of that axis.
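A toy numpy sketch of that last point (purely illustrative, not the repo's pipeline): if every pair has the honest activation first, the differences all point roughly the same way, so mean-centered PCA finds almost no variance along that direction; with random ordering the differences flip sign from pair to pair and the top principal component recovers the honesty direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 8, 200
v = np.zeros(d)
v[0] = 1.0                                        # hypothetical "honesty" direction

honest = rng.normal(size=(n_pairs, d)) + v        # honest activations shifted by +v
untruth = rng.normal(size=(n_pairs, d)) - v       # untruthful activations shifted by -v

def top_pc(diffs):
    diffs = diffs - diffs.mean(axis=0)            # mean-center, as PCA does
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]

fixed = honest - untruth                          # fixed order: every diff is ~ +2v
signs = rng.choice([-1.0, 1.0], size=(n_pairs, 1))
shuffled = signs * (honest - untruth)             # shuffled order: diffs are randomly +/- 2v

print("fixed order, |cos| with v:   ", abs(top_pc(fixed) @ v))
print("shuffled order, |cos| with v:", abs(top_pc(shuffled) @ v))
```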

See also: #23 (comment)

@shamikbosefj

Thanks for the explanation, @joshlevy89
In the lines following the shuffled test data, there's a strange operation:

reshaped_data = np.array([[honest, untruthful] for honest, untruthful in 
                          zip(honest_statements[:-1], untruthful_statements[1:])]).flatten()
test_data = reshaped_data[ntrain:ntrain*2].tolist()

Why does it skip the first and last of the honest and untruthful sets respectively? Is this just a way to ensure that the same text isn't picked as honest and untruthful?
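For what it's worth, here is what that slicing does on toy data: zipping honest_statements[:-1] against untruthful_statements[1:] pairs each honest text with the untruthful rendering of the next statement, so the two halves of a test pair never come from the same underlying sentence (the strings below are made up):

```python
import numpy as np

honest_statements = ["H0", "H1", "H2", "H3"]
untruthful_statements = ["U0", "U1", "U2", "U3"]

reshaped_data = np.array([[honest, untruthful] for honest, untruthful in
                          zip(honest_statements[:-1], untruthful_statements[1:])]).flatten()
print(reshaped_data.tolist())
# ['H0', 'U1', 'H1', 'U2', 'H2', 'U3'] -- honest i is paired with untruthful i+1
```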

@joshlevy89

@shamikbosefj Hmm, I'm not sure about that line either; I'm not sure what it's trying to accomplish. I think it could probably be replaced by something simpler:
test_data = np.array(combined_data[ntrain:ntrain+ntrain//2]).flatten().tolist()

@shamikbosefj

@joshlevy89 I'm wondering if it's due to the random.shuffle() call earlier. That edits the array in place, so maybe they wanted to use the original values again?

@shamikbosefj

shamikbosefj commented Jun 13, 2024

@justinphan3110 @joshlevy89 @andyzoujm I think there's a bug in the dataset creation in honesty_function_dataset() in utils.py. The size of train_data is 1024, but the size of train_labels is 512 (see the attached screenshot). This differs from test_data, where the data and the labels have the same size (512). If this is not a bug, can someone please explain the discrepancy?

Updated utils.py: (screenshot)

Result: (screenshot)
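One possible reading of those shapes (not confirmed by the maintainers): train_data is the shuffled pairs flattened to one statement per element (2 × 512 = 1024), while train_labels keeps one two-element label per pair (512), so each label indexes a pair rather than a single statement. A toy sketch of that relationship:

```python
# Hypothetical shapes mirroring the screenshot above.
n_pairs = 512
train_data = ["<statement>"] * (2 * n_pairs)   # one entry per statement -> 1024
train_labels = [[True, False]] * n_pairs       # one entry per PAIR      -> 512

# If that reading is right, pair i is train_data[2*i : 2*i + 2]
# and its label is train_labels[i].
i = 0
print(train_data[2 * i : 2 * i + 2], train_labels[i])
```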

