Dataset in example honest notebook #16
Comments
To extract functions (such as being honest), we're collecting neural activity at every token position in the response, as described in step 2 of the LAT scan in the paper.
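For illustration, collecting the activations at every response token can be sketched as below. This is a minimal toy sketch, not the repo's actual code: the function name, the `prompt_len` parameter, and the plain NumPy array standing in for a layer's hidden states are all assumptions.

```python
import numpy as np

def collect_response_activations(hidden_states, prompt_len):
    """Keep the hidden state at every token position in the response.

    hidden_states: (seq_len, hidden_dim) array for one layer.
    prompt_len: number of prompt tokens to skip, so only the
    response positions are collected.
    """
    return hidden_states[prompt_len:]

# toy example: 10-token sequence, 4-dim hidden states, 6 prompt tokens
acts = collect_response_activations(np.arange(40.0).reshape(10, 4), prompt_len=6)
print(acts.shape)  # (4, 4): one vector per response token
```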
Thank you for your reply! I notice in the data process
It was a design choice, since unfinished sentences don't have a strong indication of honesty/dishonesty. But it might not matter that much.
What is the difference between [true, false] and [false, true] in the labels for the honesty dataset?
It corresponds to the pairs in the training set: we randomly shuffle, so some pairs have the honest statement at index [0] and some have it at index [1].
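The shuffling described above can be sketched like this. The function name and the exact persona wording are approximations for illustration, not the notebook's actual code:

```python
import random

def make_shuffled_pairs(statements, seed=0):
    """Build an (honest, dishonest) prompt pair per statement, then randomly
    swap each pair's order; the label list records which index is honest."""
    rng = random.Random(seed)
    data, labels = [], []
    for s in statements:
        honest = f"Pretend you're an honest person making statements about the world. {s}"
        dishonest = f"Pretend you're a dishonest person making statements about the world. {s}"
        pair, label = [honest, dishonest], [True, False]
        if rng.random() < 0.5:  # randomly swap, so honesty lands at index 0 or 1
            pair, label = pair[::-1], label[::-1]
        data.append(pair)
        labels.append(label)
    return data, labels

pairs, labels = make_shuffled_pairs(["The sky is blue.", "Fire is cold."])
```

So a label of [true, false] just means the honest prompt happens to sit at index 0 for that pair, and [false, true] means it sits at index 1.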
@andyzoujm, I'm trying to understand the dataset in
@shamikbosefj i am not a contributor to this repo, but my guess is that since the true_statements are getting truncated with the functional stimulation paradigm and prefixed with "imagine you are a truthful..." or "imagine you are an untruthful...", it doesn't really matter. In the end, the statement is not completed, so it leaves the door open to whatever completion, depending on whether the LLM is asked to be truthful or untruthful. This is probably better than creating each pair out of a true and a false statement, because it reduces the amount of variability, so the activations are likely to vary only along the honesty/dishonesty axis. Now, they could also have used false statements to create separate pairs (again, truncating the statement and prefixing as described), but they don't need that many pairs, so it probably just wasn't necessary.

As to your other question, I believe they need to shuffle in order to create variability along the axis of interest (honesty/dishonesty). Otherwise, when PCA is done on the differences, there's no variability over the pairs in the direction of that axis. See also: #23 (comment)
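That point about shuffling can be checked numerically: with random order-swapping, each pairwise difference points along +axis or -axis at random, so the honesty direction shows up as *variance* that the top principal component can recover. A minimal sketch on synthetic activations (not the repo's pipeline):

```python
import numpy as np

def top_pc(diffs):
    """First principal component of the pairwise activation differences."""
    centered = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
axis = rng.standard_normal(8)
axis /= np.linalg.norm(axis)  # the hidden "honesty" direction

# shuffled pairs: each difference points along +axis or -axis at random,
# plus a little noise, so PCA of the differences recovers the direction
signs = rng.choice([-1.0, 1.0], size=(32, 1))
diffs = signs * axis + 0.1 * rng.standard_normal((32, 8))
print(abs(top_pc(diffs) @ axis))  # close to 1
```

Without the sign flips, the differences would all share the same mean direction and centering would remove it, leaving PCA nothing but noise to explain.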
Thanks for the explanation, @joshlevy89
Why does it skip the first and last of the honest and untruthful sets respectively? Is this just a way to ensure that the same text isn't picked as both honest and untruthful?
@shamikbosefj hm, i'm not sure about that line either. i'm not sure what it's trying to accomplish. i think it could probably be replaced by something simpler...
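One hypothetical reading of the line being discussed (this is a guess at the pattern, not the repo's actual code): skipping the first honest item and the last untruthful one offsets the two lists by one, so no pair contains the same underlying statement rendered both ways.

```python
honest = ["h0", "h1", "h2", "h3"]
untruthful = ["u0", "u1", "u2", "u3"]

# offset the two lists by one before zipping, so a pair never holds
# the honest and untruthful renderings of the same statement
pairs = list(zip(honest[1:], untruthful[:-1]))
print(pairs)  # [('h1', 'u0'), ('h2', 'u1'), ('h3', 'u2')]
```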
@joshlevy89 I'm wondering if it's due to the
@justinphan3110 @joshlevy89 @andyzoujm I think there's a bug in the dataset creation of the
I noticed that in the paper and in the example Jupyter code, the output of ASSISTANT (the response), i.e. the statement, is truncated. I would like to know the reason. Thank you so much!