This is a code and dataset repository for the paper "Working Memory Capacity of ChatGPT: An Empirical Study", which has been accepted by AAAI 2024 Conference on Artificial Intelligence.
Here we created a dataset to test the working memory capacity of language models. We choose the N-back task because it is widely used in cognitive science as a measure of working memory capacity. To create the N-back task dataset, we generated 30 blocks of trials for
Prompt Example. Here we only focus on the base version of verbal N-back tasks. We use the following format of prompts for
User:
Instruction: as a language model, you are asked to perform a 1-back task. A letter will be presented on every trial. Your task is to respond with 'm' whenever the letter presented is the same as the previous letter, and '-' whenever the letter presented is different from the previous letter. A strict rule is that you must not output anything other than 'm' or '-'. Now begins the task.
User:
{letter}
Model:
{-}(because this is the first letter)
User:
{letter}
Model:
{m/-}
...
User:
Instruction: as a language model, you are asked to perform a 2-back task. A letter will be presented on every trial. Your task is to respond with 'm' whenever the letter presented is the same as the letter two trials ago, and '-' whenever the letter presented is different from the letter two trials ago. A strict rule is that you must not output anything other than 'm' or '-'. Now begins the task.
User:
{letter}
Model:
{-}(because this is the first letter)
User:
{letter}
Model:
{m/-}
...
User:
Instruction: as a language model, you are asked to perform a 3-back task. A letter will be presented on every trial. Your task is to respond with 'm' whenever the letter presented is the same as the letter three trials ago, and '-' whenever the letter presented is different from the letter three trials ago. A strict rule is that you must not output anything other than 'm' or '-'. Now begins the task.
User:
{letter}
Model:
{-}(because this is the first letter)
User:
{letter}
Model:
{m/-}
...
Metrics. We use exact match of the extraction results to calculate the hit rate, false alarm rate, and accuracy.